CocoaBench

An Evaluation Framework for General Agents with Compositional Cognitive Abilities

Overview

Examples

Here are some example tasks from CocoaBench, showcasing the diverse reasoning challenges our benchmark presents.

Evaluation

We have evaluated several leading commercial agent systems on CocoaBench-v0.1 (25 tasks, not including the examples). The leaderboard gives a more detailed breakdown.

Performance Summary

Case Studies

We present the model solutions for the four example tasks shown above, so you can explore how different models approached each one. Click on a result block to view the analysis and the raw response.

Contributors

Shibo Hao*, Zhining Zhang*, Zhiqi Liang*, Tianyang Liu*, Zilong Wang*, Kun Zhou, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zhoujun Cheng, Yu Wang, Feng Yao, Licheng Liu, Ziqiao Ma, Hector Liu, Rupesh Srivastava, Julian McAuley, Jingbo Shang, Lianhui Qin, Zhiting Hu

(* core contributor)

We are continuously building and improving CocoaBench. Collaboration is welcome! Feel free to reach out to us or join our Discord community.

Citation

@misc{cocoabench2025,
  title={CocoaBench: An Evaluation Framework for General Agents with Compositional Cognitive Abilities},
  author={Shibo Hao and Zhining Zhang and Zhiqi Liang and Tianyang Liu and Zilong Wang and others},
  howpublished={Blog post},
  month={December},
  year={2025},
  url={https://cocoabench.github.io/}
}