weizhepei / InstructRAG

InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales
https://weizhepei.com/instruct-rag-page

wikipedia corpus #5

hljjjmssyh opened this issue 1 week ago

hljjjmssyh commented 1 week ago

Great job! Could you share the Wikipedia corpus you used for retrieval? I'm also curious about the size of the corpus and how the top-n recall metrics are calculated.

CY-SCUT commented 1 week ago

me too

weizhepei commented 1 week ago

Sure! The retrieval corpus (Wikipedia) can be downloaded with the following command: wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz

The dataset used in our work is also available here. The recall metric measures the fraction of samples where a correct answer to the query is mentioned in the top-k retrieved documents.
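
In case it helps, here is a minimal sketch of how such a top-k answer recall can be computed. This is not the exact code from our evaluation scripts, and the field names (answers, retrieved_passages) are just placeholders:

```python
# Minimal sketch of top-k answer recall (not the exact evaluation code in this
# repo; the field names 'answers' and 'retrieved_passages' are illustrative).
from typing import Dict, List


def answer_in_top_k(answers: List[str], passages: List[str], k: int) -> bool:
    """A sample counts as a hit if any gold answer string appears in its top-k passages."""
    context = " ".join(passages[:k]).lower()
    return any(ans.lower() in context for ans in answers)


def recall_at_k(samples: List[Dict], k: int) -> float:
    """Fraction of samples whose top-k retrieved passages mention a correct answer."""
    hits = sum(answer_in_top_k(s["answers"], s["retrieved_passages"], k) for s in samples)
    return hits / max(len(samples), 1)
```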

Please let us know if you have any further questions!

hljjjmssyh commented 1 week ago

Thanks a lot. Another point of interest is why different datasets use different retrieval methods. In my opinion, BM25 and DPR already represent sparse retrieval and dense retrieval, respectively. What is the purpose of using other retrievers, such as Contriever and GTR?

weizhepei commented 4 days ago

Thanks for bringing this up! We actually follow prior work to set up the retrieval process for each benchmark. For example, Self-RAG used Contriever for PopQA and TriviaQA, In-Context RALM used DPR for NQ, ALCE used GTR for ASQA, and FLARE used BM25 for 2WikiMultiHopQA. This provides diverse retrieval environments that help validate the flexibility and generalizability of our method.

Though our InstructRAG is agnostic to the choice of retrievers, I think it’s possible to further improve the RAG performance by enhancing the retrieval process with more advanced retrievers, which could help reduce noise in the retrieved documents.

hljjjmssyh commented 3 days ago

Thanks for your reply. I have another question regarding vanilla SFT. When I try to reproduce the vanilla SFT results on the PopQA dataset, I get an accuracy of 44.3, which differs significantly from what is reported in the paper. Could you clarify whether there are any specific settings related to vanilla SFT that I might be missing? Additionally, I noticed that a response is counted as correct if the answer appears anywhere in the LLM's reply, which means a longer response seems more likely to achieve a higher score.

weizhepei commented 3 days ago

That’s a bit unusual, and I’d suggest checking if there’s any misalignment in your training or evaluation process. For your reference, the training details for vanilla SFT are provided in Appendix B, and we did not apply any specific tricks during its training. The training script in our repo is configured for training InstructRAG-FT but can be straightforwardly adapted to vanilla SFT. The only caveat is to ensure that your environment aligns with the configurations specified in our repository, as differences in library versions (e.g., transformers, PyTorch) can lead to non-trivial discrepancies. If you still encounter difficulties reproducing vanilla SFT, feel free to reach out, and we’ll be happy to assist!

Yes, your understanding of the evaluation metric is correct. We actually discussed these limitations in both Section 3.4 and Section 5. While such pattern-matching metrics are standard for question-answering tasks, they rely solely on lexical similarity and fail to capture semantic meaning. Moreover, they can suffer from length bias, as longer responses tend to achieve higher accuracy. To address these shortcomings, we recommend validating the model with LLM-as-a-judge evaluation rather than pattern-matching evaluation, which allows the judge to consider semantic equivalence and is expected to yield a fairer assessment (see our Table 5).
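
To make the pattern-matching metric and its length bias concrete, here is a minimal sketch (the field names are assumptions, not our exact evaluation code):

```python
# Minimal sketch of the pattern-matching accuracy metric (assumed field names,
# not the exact evaluation code): a response is correct if any gold answer
# string occurs in it, which is why longer responses are more likely to score.
from typing import List


def is_correct(response: str, gold_answers: List[str]) -> bool:
    response = response.lower()
    return any(ans.lower() in response for ans in gold_answers)


def accuracy(responses: List[str], gold: List[List[str]]) -> float:
    return sum(is_correct(r, g) for r, g in zip(responses, gold)) / max(len(responses), 1)


# Length bias in action: a verbose response that enumerates candidates still counts.
print(is_correct("Paris.", ["Paris"]))                                  # True
print(is_correct("It might be Lyon, Marseille, or Paris.", ["Paris"]))  # also True
```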

hljjjmssyh commented 3 days ago

Thank you for your prompt feedback. I followed the training details for vanilla SFT provided in Appendix B. However, I couldn't find how the LLM's output (i.e., the training target) is constructed. Since the PopQA dataset provides multiple answers for a single question, I would like to know the output format of the training data.

kfchenhn commented 3 days ago

I downloaded your model from Hugging Face and ran eval.sh unchanged, but only achieved an accuracy of 47.1%. What could be the reason?

weizhepei commented 2 days ago

@hljjjmssyh I think you can reuse our data preparation script and simply replace sample['rationale'] with the answer in https://github.com/weizhepei/InstructRAG/blob/main/src/data_utils.py#L143. For samples with multiple answers, you can randomly choose one to format the data.
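
As a rough illustration, the adaptation could look like the sketch below; the field names are assumptions rather than the exact schema in src/data_utils.py:

```python
# Rough sketch of the adaptation described above (field names are assumptions,
# not the exact schema in src/data_utils.py): for vanilla SFT, the training
# target is a gold answer instead of sample['rationale']; if a PopQA sample
# has multiple gold answers, one is picked at random.
import random


def build_vanilla_sft_example(sample: dict, rng: random.Random) -> dict:
    answers = sample["answers"]     # e.g. ["Kyoto", "Kyōto"] (multiple gold answers)
    target = rng.choice(answers)    # used in place of sample['rationale']
    return {"question": sample["question"], "output": target}


# Usage: build_vanilla_sft_example({"question": "q", "answers": ["a1", "a2"]}, random.Random(0))
```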

weizhepei commented 2 days ago

@kfchenhn I just tested the model hosted on our HF repo, and it works well for me.

[screenshot of evaluation results]

You can follow setup.sh to configure the environment. Feel free to let us know or open a new issue if you need further assistance!

kfchenhn commented 2 days ago

The code, model, and environment I used are exactly the same as those in your repo, but I still cannot reproduce your results. I suggest checking whether the model hosted online is consistent with your offline version.