zhudotexe / fanoutqa

Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models (ACL 2024)
https://fanoutqa.com/
MIT License

About The Retrieval Source #12

Open yunfan42 opened 1 week ago

yunfan42 commented 1 week ago

First of all, I would like to express my sincere gratitude to all the authors for their outstanding work and for open-sourcing such an excellent dataset. šŸ‘ šŸ‘ šŸ‘

I am currently running some RAG (Retrieval-Augmented Generation) experiments on this dataset, and I have a few questions about the retrieval source that I hope the authors can help clarify.

In RAG, it is necessary to first chunk and index the Wikipedia pages that may be used for retrieval.

Section 3.1 of the paper mentions that the questions involve a total of 4,121 Wikipedia articles. Is this set the complete retrieval source?

Or should I use the author-provided wikicache.tar.gz file (~9.43 GB)? (Of course, this would consume a massive number of embedding tokens and take a significant amount of time.)

My understanding is that this cached Wikipedia was pre-filtered by running BM25 over all Wikipedia page titles with the questions as queries, but I am not sure if this is correct.

Additionally, where can I directly download the 4,121 Wikipedia articles that are actually used?

zhudotexe commented 1 week ago

The Open Book setting is conducted over all of English Wikipedia as a knowledge base. If you have the ability to self-host Wikipedia, the SQL dump is provided at https://datasets.mechanus.zhu.codes/fanoutqa/enwiki-20231120-pages-articles-multistream.xml.bz2, though this will take a couple hundred GB of disk space!

We recommend using the pip library provided by this repo to download Wikipedia articles as needed. The library ensures that the text it downloads is the revision as of November 2023. The wikicache.tar.gz file can be used to prepopulate the cache for this library, but it is optional: it is simply all of the files that were downloaded onto my machine while we were running experiments, not filtered in any particular way.
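
For reference, a minimal sketch of that workflow is below. The helper names (`load_dev`, `wiki_content`) and the data model attributes follow the library's docs; double-check them against the current README, since this is only an illustration:

```python
import fanoutqa

# Load the dev split; each question carries its decomposition and gold evidence.
questions = fanoutqa.load_dev()
q = questions[0]
print(q.question)

# Fetch the text of a gold evidence page. The library pins the revision to the
# November 2023 dataset epoch and caches the download locally, so extracting
# wikicache.tar.gz into the cache directory just pre-populates that cache.
evidence = q.decomposition[0].evidence
page_text = fanoutqa.wiki_content(evidence)
print(page_text[:500])
```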

We don't recommend embedding all of Wikipedia - that would be prohibitively expensive for all but large organizations! Instead, your model should use a search tool to find relevant article titles, then retrieve from the text of individual articles returned by the search tool (i.e., index individual pages instead of the entire knowledge base).
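
Something like the following sketch is what we have in mind. `fanoutqa.wiki_search` / `fanoutqa.wiki_content` are the library's search and download helpers (per its docs); the chunking and scoring here are illustrative placeholders, not anything the benchmark prescribes, so swap in BM25 or an embedding model as you prefer:

```python
import fanoutqa

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split page text into overlapping character windows."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

def retrieve(question: str, k_pages: int = 5, k_chunks: int = 8) -> list[str]:
    # 1. Use the search tool to find candidate article titles.
    candidates = fanoutqa.wiki_search(question)[:k_pages]

    # 2. Index only the returned pages, chunk by chunk.
    chunks = []
    for evidence in candidates:
        chunks.extend(chunk(fanoutqa.wiki_content(evidence)))

    # 3. Rank chunks with any scorer you like; naive word overlap is used
    #    here only to keep the sketch self-contained.
    q_words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
    return scored[:k_chunks]

# The top chunks are then placed in the prompt for the generator model.
print("\n---\n".join(retrieve("example multi-hop question")[:2]))
```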