
NeedleInAHaystack-PLUS

To assess long-text capabilities more comprehensively, we propose Needle-in-a-Haystack PLUS, which shifts the focus from simple fact retrieval to more challenging single-document and multi-document question answering tasks.

How to evaluate on NeedleInAHaystack-PLUS

Load Data

Our test data can be downloaded from NeedleInAHaystack-PLUS.

Data Format

All data in NeedleInAHaystack-PLUS is standardized to the following format:

Single-document QA

{
    "id": "The unique identifier for each test data.",
    "context": "The long context of the single-document question answering task.",
    "context_length": "The length of haystack ranges from 1,000 to 128,000 tokens with equal intervals, totaling 15 different lengths.",
    "depth_percent": "The position of the needle in the haystack.",
    "input": "The questions of the question single-document answering task.",
    "dataset": "needle_squad",
    "answers": "A List of all true answers.",
}

Multi-document QA

{
    "id": "The unique identifier for each test data.",
    "context": "The long context of the single-document question answering task.",
    "context_length": "The length of haystack ranges from 1,000 to 128,000 tokens with equal intervals, totaling 15 different lengths.",
    "depth_percent1": "The position of the first needle in the haystack.",
    "depth_percent2": "The position of the second needle in the haystack.",
    "input": "The questions of the question single-document answering task.",
    "dataset": "needle_hotpotqa",
    "answers": "A List of all true answers.",
}
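As a sketch of how these records might be consumed, the snippet below assumes the data ships as JSON Lines (one record per line) and uses a simple containment-based match against the `answers` list for scoring; the file layout and the scoring rule are illustrative assumptions, not part of the official release:

```python
import json


def load_needle_data(path):
    """Load NeedleInAHaystack-PLUS records from a JSON Lines file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def exact_match(prediction, answers):
    """Count a model output as correct if any gold answer appears in it (case-insensitive)."""
    pred = prediction.lower()
    return any(ans.lower() in pred for ans in answers)
```

Each loaded record can then be turned into a prompt from its `context` and `input` fields, with the model's reply scored against `answers`.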

Results Visualization

The invocation time of the APIs:
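Needle-in-a-haystack results are conventionally visualized as a heatmap of accuracy over context length × needle depth. As a hedged sketch of the aggregation step behind such a plot (field names follow the data format above; the per-example `correct` flag is an assumed intermediate, e.g. the output of a scoring function):

```python
from collections import defaultdict


def accuracy_grid(results):
    """Aggregate per-example correctness into a (context_length, depth_percent) grid.

    `results` is a list of dicts with keys 'context_length', 'depth_percent',
    and a boolean 'correct'. Returns {(length, depth): accuracy} — one cell
    per heatmap position.
    """
    buckets = defaultdict(list)
    for r in results:
        buckets[(r["context_length"], r["depth_percent"])].append(r["correct"])
    return {cell: sum(flags) / len(flags) for cell, flags in buckets.items()}
```

The resulting grid maps directly onto a heatmap (e.g. via `matplotlib`'s `imshow`), with context length on one axis and needle depth on the other.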

Acknowledgement

NeedleInAHaystack-PLUS is based on datasets proposed by previous researchers, including NeedleInAHaystack, SQuAD, and HotpotQA.

Citation

@misc{zhao2024longagent,
      title={LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration}, 
      author={Jun Zhao and Can Zu and Hao Xu and Yi Lu and Wei He and Yiwen Ding and Tao Gui and Qi Zhang and Xuanjing Huang},
      year={2024},
      eprint={2402.11550},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}