nbalepur opened 4 weeks ago

All of the examples shown in the paper imply that the questions have a single answer, but most of the questions are actually several QA pairs combined into one (i.e., several answers must be provided to be correct). Is this right? I don't find many examples like "What is the total number of employees in the five largest banks in the world?" that require synthesizing multiple sub-answers into a final answer.

Yes, that's right: not all of the questions are single-answer (in fact, I think most of them are not). In Appendix A of our paper, most of the provided examples are questions whose answers are key-value pairs.

I wish this were clearer in the paper. Figure 1, Sec 4.2, and the use of "top-level answer" led me to believe the goal was to synthesize subanswers into a single answer (rather than just string concatenation). If you're curious, only 24/334 of the devset questions have a single answer. Maybe this could be made clearer in the repo?

Thanks anyway; the work is interesting, and it would be cool to have a version that does require combining and reasoning over these subanswers into a single answer.

Gotcha. The main thing we wanted to focus on was the fan-out operation itself and how it requires multi-turn retrieval (in the Open Book setting, at least); in other ongoing work we're using this dataset to compare single-agent vs. multi-agent systems, which influenced some of the wording there. So there is indeed a stronger focus on the fan-out part rather than the fan-in/aggregation task (IMO, in a multi-agent system where each subquery is in theory handled by an individual agent, the aggregation reduces to a fairly standard reading comprehension/reasoning task a la GSM8K). This makes it easier to isolate the challenges of the task, but I agree that it would be more challenging if each question also involved an aggregation step.
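To make the distinction concrete, here is a small sketch of the two answer styles being discussed. All field names and numbers are invented for illustration; this is not the dataset's actual schema or evaluation code.

```python
# Fan-out style: the answer is a set of key-value sub-answers, one per
# entity, reported as-is (no aggregation step). Values are made up.
fan_out = {
    "question": "How many employees does each of banks A, B, and C have?",
    "answer": {"Bank A": 100, "Bank B": 200, "Bank C": 300},
}

# Aggregation (fan-in) style: the sub-answers must be synthesized into a
# single top-level answer.
aggregated = {
    "question": "What is the total number of employees across banks A, B, and C?",
    "answer": 600,  # = 100 + 200 + 300
}

def score_fan_out(pred: dict, gold: dict) -> float:
    """Exact-match accuracy over the gold key-value sub-answers."""
    if not gold:
        return 0.0
    correct = sum(1 for k, v in gold.items() if pred.get(k) == v)
    return correct / len(gold)

# Two of three sub-answers correct -> 2/3.
print(score_fan_out({"Bank A": 100, "Bank B": 200, "Bank C": 999},
                    fan_out["answer"]))
```

Under this framing, grading a fan-out question is just per-key matching, whereas the aggregated variant would additionally require checking the combined value.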