yixuantt / MultiHop-RAG

Repository for "MultiHop-RAG: A Dataset for Evaluating Retrieval-Augmented Generation Across Documents" (COLM 2024)

What is the ground-truth evidence used for "ground-truth evidence" results in Table 6? #6

Open timchen0618 opened 5 months ago

timchen0618 commented 5 months ago

Hi, thank you for the good work! I wonder whether the "ground-truth evidence" is the "fact" field in the "evidence_list" of each instance in the file dataset/MultiHopRAG.json. I am asking because no prompt is provided for the generation step, and these evidence snippets seem quite short, which would make the task relatively easy.
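
For reference, the fields in question can be inspected with a short sketch like the one below (it assumes dataset/MultiHopRAG.json is a JSON array of examples, each carrying an "evidence_list" of dicts with a "fact" string, as described above):

import json

# Sketch: check how long the "fact" evidence strings are (in whitespace words).
# Assumes dataset/MultiHopRAG.json is a JSON array of examples, each with an
# "evidence_list" of dicts containing a "fact" field.
with open("dataset/MultiHopRAG.json") as f:
    data = json.load(f)

fact_lengths = [
    len(evidence["fact"].split())
    for example in data
    for evidence in example["evidence_list"]
]
print(f"facts: {len(fact_lengths)}, avg length: {sum(fact_lengths) / len(fact_lengths):.1f} words")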

yixuantt commented 5 months ago

Thanks for your interest in this project! For Table 6, I do not use the "fact" field. Instead, I use the retrieval chunks, which include the "fact."

timchen0618 commented 5 months ago

So you still perform retrieval for the "ground-truth evidence" setting in Table 6? What is the difference between that and the data in toy_data/? Could you provide the "ground-truth evidence"?

yixuantt commented 5 months ago

I didn't actually perform retrieval. Instead, I located the corresponding "fact" sentence in the original documents and extracted the surrounding context, so that the resulting paragraph is either 512 or 256 tokens long (the same size as the retrieval chunks). The data in toy_data/ are the chunks that were actually retrieved.
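
For concreteness, chunk size here is counted in tokens of the embedding model's tokenizer (BAAI/bge-large-en-v1.5, as noted in the snippet below). A minimal sketch of that token counting with Hugging Face transformers, not the exact script used for the paper:

from transformers import AutoTokenizer

# Sketch: measure text length in tokens of the retriever's tokenizer.
# The model name follows the comment in the code below; the exact
# token-counting setup used for the paper may differ.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

def token_length(text: str) -> int:
    # Exclude special tokens so the count reflects only the text itself
    return len(tokenizer.encode(text, add_special_tokens=False))

print(token_length("OpenAI released a new model today."))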

timchen0618 commented 5 months ago

I see, now I understand. Is there any chance you could provide the ground-truth chunks you extracted from the relevant context? Or could you point me to the data if it is already in the repository?

yixuantt commented 5 months ago

Sorry, I did not keep the data. But you can get a similar chunk using the following code:

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# This demo uses NLTK, but in the paper we use the tokenizer from BAAI/bge-large-en-v1.5.
# text: the target document 
# substring: the "fact" sentence
def find_chunk_around_substring(text, substring, chunk_size=512):
    # Tokenize the entire text and the substring
    tokens = word_tokenize(text)
    substring_tokens = word_tokenize(substring)

    # Find the start and end indices of the substring tokens in the main token list
    start_index = -1
    for i in range(len(tokens) - len(substring_tokens) + 1):
        if tokens[i:i+len(substring_tokens)] == substring_tokens:
            start_index = i
            break

    if start_index == -1:
        return None  # Substring not found

    # Calculate the token window around the fact, splitting the remaining
    # budget evenly before and after it
    end_index = start_index + len(substring_tokens)  # exclusive end of the fact
    budget = max(0, chunk_size - len(substring_tokens))
    pre_tokens = start_index - budget // 2
    post_tokens = end_index + (budget - budget // 2)

    # If the window is clipped at one document boundary, extend it on the
    # other side so the total length still approximates chunk_size
    if pre_tokens < 0:
        post_tokens = min(len(tokens), post_tokens - pre_tokens)
        pre_tokens = 0
    if post_tokens > len(tokens):
        pre_tokens = max(0, pre_tokens - (post_tokens - len(tokens)))
        post_tokens = len(tokens)

    # Reconstruct the chunk by joining the selected range of tokens
    # (a simple space join does not preserve the original whitespace exactly)
    chunk_tokens = tokens[pre_tokens:post_tokens]
    chunk_text = ' '.join(chunk_tokens)

    return chunk_text
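
For example, one could rebuild approximate ground-truth chunks from the dataset as sketched below. This is only a hedged illustration: the corpus file name and its "title"/"body" fields are assumptions about the repository layout, not something confirmed in this thread.

import json

# Sketch of rebuilding approximate ground-truth chunks.
# Assumptions (not confirmed in this thread): the source articles live in
# dataset/corpus.json as a list of dicts with "title" and "body" fields,
# and each evidence entry in MultiHopRAG.json carries the article "title".
with open("dataset/corpus.json") as f:
    corpus = {doc["title"]: doc["body"] for doc in json.load(f)}

with open("dataset/MultiHopRAG.json") as f:
    dataset = json.load(f)

for example in dataset:
    for evidence in example["evidence_list"]:
        document = corpus.get(evidence["title"])
        if document is None:
            continue
        chunk = find_chunk_around_substring(document, evidence["fact"], chunk_size=256)
        if chunk is not None:
            print(chunk[:200])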