An Evaluation Framework for General Agents with Compositional Cognitive Abilities
Here are some example tasks from CocoaBench, showcasing the diverse reasoning challenges our benchmark presents.
We currently evaluate several leading commercial agent systems on CocoaBench-v0.1 (25 tasks, not including the examples). A more detailed breakdown is shown in the leaderboard.
We present the model solutions for the four example tasks shown above. Explore how different models approached each task. Click on a result block to view the analysis and the raw response.
Shibo Hao*, Zhining Zhang*, Zhiqi Liang*, Tianyang Liu*, Zilong Wang*, Kun Zhou, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zhoujun Cheng, Yu Wang, Feng Yao, Licheng Liu, Ziqiao Ma, Hector Liu, Rupesh Srivastava, Julian McAuley, Jingbo Shang, Lianhui Qin, Zhiting Hu
(* core contributor)
We are continuously building and improving CocoaBench. Collaboration is welcome! Feel free to reach out to us or join our Discord community.
@misc{cocoabench2025,
title={CocoaBench: An Evaluation Framework for General Agents with Compositional Cognitive Abilities},
author={Shibo Hao and Zhining Zhang and Zhiqi Liang and Tianyang Liu and Zilong Wang and others},
howpublished={Blog post},
month={December},
year={2025},
url={https://cocoabench.github.io/}
}