LLM-Agent-Benchmark-List

A benchmark list for the evaluation of large language models and LLM-powered agents.

🤗 We greatly appreciate contributions via PRs, issues, emails, or other channels.

Continuously updated...

:book: Introduction

In the swiftly evolving landscape of artificial intelligence, Large Language Models (LLMs) have emerged as a pivotal cornerstone, revolutionizing how we interact with and harness the power of natural language processing. However, as LLMs gain widespread adoption in both research and industry, the focus shifts toward evaluating their efficacy rather than perpetuating an unbridled cycle of performance iteration. This paradigm shift raises critical questions: i) What to evaluate? ii) Where to evaluate? iii) How to evaluate? Diverse research efforts have proposed varying interpretations and methodologies in response to these questions. The aim of this work is to methodically review and organize benchmarks for both LLMs and LLM-powered agents, thereby providing a streamlined resource for those on the journey toward Artificial General Intelligence (AGI).

:dizzy: List

- Survey
- ToolUse
- Reasoning
- Knowledge
- Graph
- Video
- Code
- Alignment
- Agent
- Multimodal
- Others
- Dataset
- ZhiHu