bernaljg closed this issue 3 days ago
Thank you for your attention to our work.
CLQA-Wiki is constructed from a subset of Wikidata, so we cannot guarantee that the answer to every question can be found in Wikipedia. The correctness of the answers is guaranteed by Wikidata. If you need external retrieval, the dataset places no restriction on the retrieval source.
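For readers who want to estimate how often the answers do appear in a chosen retrieval corpus, here is a minimal sketch. It is not part of the CLQA-Wiki release or the paper's pipeline. The function names, the `"question"`/`"answers"` keys, and the toy data are all hypothetical, and loose substring matching only approximates whether a passage really supports an answer.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def answer_coverage(examples, passages):
    """Fraction of questions for which every gold answer string occurs in some passage.

    examples: list of dicts with hypothetical keys "question" and "answers" (list of str).
    passages: list of passage strings standing in for the retrieval corpus.
    """
    normalized_passages = [normalize(p) for p in passages]
    covered = 0
    for example in examples:
        # Drop answers that become empty after normalization (e.g. pure punctuation).
        answers = [a for a in (normalize(a) for a in example["answers"]) if a]
        # Answers may be spread over different passages; each one must occur somewhere.
        if answers and all(any(a in p for p in normalized_passages) for a in answers):
            covered += 1
    return covered / len(examples) if examples else 0.0


# Toy data, made up for illustration (not taken from CLQA-Wiki or Wikipedia).
examples = [
    {"question": "Q1", "answers": ["Paris"]},
    {"question": "Q2", "answers": ["Berlin", "Munich"]},
]
passages = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
coverage = answer_coverage(examples, passages)
```

On the toy data only the first question is fully covered, so `coverage` is 0.5. On a real corpus, a low value would indicate that many answers are not recoverable from that source by string matching alone.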
Does that mean that Table 6 in your paper is using Wikidata somehow or are you still using Wikipedia as a retrieval corpus even though the answers are not guaranteed to be there?
Thank you for your attention.
In Table 6, we use Wikipedia as the retrieval corpus, as stated under "Implementation Details" in Section 5.1. To clarify, this dataset is intended to evaluate QA capabilities rather than retrieval capabilities. Retrieval is only an auxiliary tool for enhancing QA, so the retrieval source and the source used to construct the questions do not need to be strictly consistent. Some other QA datasets also use different data sources for construction and retrieval. Here are some examples:
In addition, Wikidata is extracted from Wikipedia, so in theory the data in Wikidata can be traced back to Wikipedia, which is why we use Wikipedia as the retrieval corpus. We believe this approach does not affect the soundness or fairness of the comparison.
We hope this resolves your doubts.
Thank you for the thorough answer!
Hi, thanks so much for making the dataset available!
I wanted to clarify something. Is there a guarantee that there are files in the Wikipedia dump that provide answers to all the questions in CLQA-Wiki?