Open timchen0618 opened 5 months ago
Thanks for your interest in this project! For Table 6, I do not use the "fact" field. Instead, I use the retrieval chunks, which include the "fact."
So you performed retrieval even for the "ground-truth evidence" in Table 6? What is the difference between that and the data in toy_data/? Could you provide the "ground-truth evidence"?
I didn't actually perform retrieval. Instead, I found the corresponding "fact" sentence in the original documents and extracted the surrounding context, making the entire paragraph either 512 or 256 tokens in size (the same size as the retrieval chunks). The data in toy_data are the chunks that were actually retrieved.
Yes, now I understand. Is there any chance you could provide the ground-truth chunks you extracted from the relevant context? Or could you point me to the data if it's already there?
Sorry, I did not keep the data. But you can get a similar chunk using the following code:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# This demo uses NLTK, but in the paper we use the tokenizer from BAAI/bge-large-en-v1.5.
# text: the target document
# substring: the "fact" sentence
def find_chunk_around_substring(text, substring, chunk_size=512):
    # Tokenize the entire text and the substring
    tokens = word_tokenize(text)
    substring_tokens = word_tokenize(substring)

    # Find the start index of the substring tokens in the main token list
    start_index = -1
    for i in range(len(tokens) - len(substring_tokens) + 1):
        if tokens[i:i + len(substring_tokens)] == substring_tokens:
            start_index = i
            break
    if start_index == -1:
        return None  # Substring not found

    # Calculate the token indices to include in the chunk
    end_index = start_index + len(substring_tokens) - 1
    pre_tokens = max(0, start_index - ((chunk_size - len(substring_tokens)) // 2))
    post_tokens = end_index + 1 + ((chunk_size - len(substring_tokens)) // 2)

    # Adjust to ensure the total number of tokens approximates chunk_size
    if post_tokens - pre_tokens > chunk_size and pre_tokens > 0:
        pre_tokens += post_tokens - pre_tokens - chunk_size
    if pre_tokens < 0:
        pre_tokens = 0
    if post_tokens > len(tokens):
        post_tokens = len(tokens)

    # Reconstruct the chunk by joining the selected range of tokens
    chunk_tokens = tokens[pre_tokens:post_tokens]
    chunk_text = ' '.join(chunk_tokens)  # Simple space join; may not preserve the original whitespace exactly
    return chunk_text
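As a quick sanity check of the windowing logic, here is a simplified, self-contained variant (my own sketch, not from the repo) that uses plain whitespace splitting instead of NLTK, so it runs without downloading punkt; the windowing behaves the same way:

```python
def find_chunk_around_tokens(text, substring, chunk_size=512):
    # Simplified whitespace-token variant of the function above,
    # for illustration only (the original uses NLTK tokenization).
    tokens = text.split()
    sub = substring.split()
    start = next(
        (i for i in range(len(tokens) - len(sub) + 1)
         if tokens[i:i + len(sub)] == sub),
        -1,
    )
    if start == -1:
        return None  # Substring not found
    half = (chunk_size - len(sub)) // 2
    lo = max(0, start - half)
    hi = min(len(tokens), start + len(sub) + half)
    return " ".join(tokens[lo:hi])

# Made-up document of 20 tokens; extract a 9-token window around the "fact".
doc = " ".join(f"w{i}" for i in range(20))
print(find_chunk_around_tokens(doc, "w10 w11", chunk_size=9))
# → w7 w8 w9 w10 w11 w12 w13 w14
```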
Hi, thank you for the good work! I wonder whether the "ground-truth evidence" used is the "fact" field in the "evidence_list" of each instance in the file
dataset/MultiHopRAG.json
. I am asking because no prompt is provided for the generation, and these evidence snippets seem quite short, which would make the task relatively easy.
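For reference, the structure being asked about can be illustrated with a made-up instance (the field names "evidence_list" and "fact" come from the question; the values are invented, not taken from MultiHopRAG.json):

```python
# A made-up instance mirroring the structure described above;
# field names are from the question, the values are invented.
instance = {
    "query": "Which company acquired Company B?",
    "evidence_list": [
        {"fact": "Company A acquired Company B in 2021."},
        {"fact": "Company B was founded in Berlin."},
    ],
}

# Pull out the "fact" strings and check their rough length in whitespace tokens.
facts = [ev["fact"] for ev in instance["evidence_list"]]
avg_tokens = sum(len(f.split()) for f in facts) / len(facts)
print(avg_tokens)  # → 6.5
```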