CLQA Wikipedia Supporting Documents

bernaljg commented 3 days ago

Hi, thanks so much for making the dataset available!

I wanted to clarify something. Is there a guarantee that there are files in the Wikipedia dump that provide answers to all the questions in CLQA-Wiki?

wjj0122 commented 3 days ago

Thank you for your attention to our work.

CLQA-Wiki is constructed from a subset of Wikidata, so we cannot guarantee that the answer to every question can be found in Wikipedia. The correctness of the answers is guaranteed by Wikidata. If you need external retrieval, the dataset places no restriction on the retrieval source.
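For anyone who wants to measure how often the answers actually appear in a chosen corpus, a rough coverage check could look like the sketch below. It is only an illustration: the field names `question` and `answers` and the `passages_for` retrieval callback are hypothetical and are not the actual CLQA-Wiki schema or the retrieval pipeline used in the paper. The match is a loose normalized substring test, so it can over-count short answers.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def answer_coverage(examples, passages_for):
    """Fraction of examples with at least one gold answer in a retrieved passage.

    examples:      list of dicts with hypothetical keys "question" and "answers"
    passages_for:  hypothetical callback mapping a question to a list of passages
    """
    hits = 0
    for ex in examples:
        passages = [normalize(p) for p in passages_for(ex["question"])]
        golds = [normalize(a) for a in ex["answers"]]
        # Count the example once if any non-empty gold answer occurs in any passage.
        if any(g and g in p for g in golds for p in passages):
            hits += 1
    return hits / len(examples) if examples else 0.0


# Toy data standing in for a real dataset and retriever.
examples = [
    {"question": "q1", "answers": ["Paris"]},
    {"question": "q2", "answers": ["Mount Doom"]},
]
corpus = {
    "q1": ["Paris is the capital of France."],
    "q2": ["No relevant passage."],
}
coverage = answer_coverage(examples, lambda q: corpus.get(q, []))
print(coverage)
```

A low value would indicate that the retrieval corpus often lacks the answer, which is the situation described above for Wikipedia.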

bernaljg commented 3 days ago

Does that mean that Table 6 in your paper uses Wikidata somehow, or are you still using Wikipedia as the retrieval corpus even though the answers are not guaranteed to be there?

wjj0122 commented 3 days ago

Thank you for your attention.

In Table 6, we use Wikipedia as the retrieval corpus, as mentioned under "Implementation Details" in Section 5.1. To clarify, this dataset is intended to evaluate QA capabilities rather than retrieval capabilities. Retrieval is only an auxiliary tool for improving QA, so the retrieval source does not need to match the source used to construct the questions. Several other QA datasets also use different sources for construction and retrieval. Here are some examples:

Mintaka dataset (https://aclanthology.org/2022.coling-1.138.pdf). The original paper states that it does not restrict the data source used for dataset construction, but in practice Wikipedia can also be used as the retrieval source.
Bamboogle dataset (http://arxiv.org/abs/2210.03350). It uses Wikipedia as the source for dataset construction, but the authors chose the Google search engine as the external retrieval tool when using it.

In addition, Wikidata is extracted from Wikipedia, so in theory the data in Wikidata can be traced back to Wikipedia. That is why we use Wikipedia as the retrieval corpus. We believe this choice does not affect the soundness or fairness of the comparison.

We hope this answers your question.

bernaljg commented 3 days ago

Thank you for the thorough answer!

zjukg / LPKG

CLQA Wikipedia Supporting Documents #2