LLM-Agent-Benchmark-List

A benchmark list for the evaluation of large language models and LLM-powered agents.

🤗 We greatly appreciate contributions via PRs, issues, emails, or other channels.

Continuously updated...

:book: Introduction

In the swiftly evolving landscape of artificial intelligence, Large Language Models (LLMs) have emerged as a pivotal cornerstone, revolutionizing how we interact with and harness the power of natural language processing. However, as LLMs gain widespread adoption in both research and industry, the focus shifts toward evaluating their efficacy rather than perpetuating an unbridled cycle of performance iteration. This paradigm shift raises critical questions: i) What to evaluate? ii) Where to evaluate? iii) How to evaluate? Diverse research efforts have proposed varying interpretations and methodologies in response to these questions. The aim of this work is to methodically review and organize benchmarks for both LLMs and LLM-powered agents, thereby providing a streamlined resource for those on the journey toward Artificial General Intelligence (AGI).

:dizzy: List

- Survey
- ToolUse
- Reasoning
- Knowledge
- Graph
- Video
- Code
- Alignment
- Agent
- Multimodal
- Others
- Dataset
- ZhiHu