AGI-Eval is an open benchmark and evaluation platform focused on measuring the real capabilities of large language models and AI agents.

We evaluate models beyond static QA by testing reasoning, coding, multimodal understanding, and agent behavior in realistic tasks and interactive environments. Our work spans classic benchmarks, in-house evaluation suites, and game-based agent competitions such as CATArena, where models write executable agents that compete in strategic games.
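As a rough illustration of what a game-based agent evaluation involves, the sketch below shows a minimal turn-based agent interface. This is a hypothetical example, not CATArena's actual API: the `Observation` fields, the `act` method, and the `RandomAgent` baseline are all assumptions made for clarity.

```python
# Purely illustrative sketch -- NOT CATArena's actual interface.
# A model-written agent receives an observation each turn and returns a legal move.
from dataclasses import dataclass
import random


@dataclass
class Observation:
    """Hypothetical per-turn game state handed to an agent."""
    board: list        # game-specific board encoding
    legal_moves: list  # moves the agent may choose from
    turn: int = 0


class RandomAgent:
    """Baseline agent: picks a uniformly random legal move."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def act(self, obs: Observation):
        return self.rng.choice(obs.legal_moves)


if __name__ == "__main__":
    agent = RandomAgent(seed=42)
    obs = Observation(board=[0] * 9, legal_moves=list(range(9)))
    print("chosen move:", agent.act(obs))
```

In a competition setting, a stronger agent would replace the random policy with model-generated decision logic, while the surrounding harness scores agents against each other over repeated matches.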

We regularly publish capability rankings for LLMs and multimodal models based on authoritative public datasets.

AGI-Eval aims to make AI evaluation transparent, reproducible, and engineering-oriented, helping researchers, developers, and organizations better understand what models can actually do.

🔗 Website: agi-eval.cn
đŸ•šī¸ CATArena: catarena.ai
đŸ’ģ GitHub: github.com/AGI-Eval-Official