CocoaBench is a benchmark for unified digital agents built from 153 human-designed, long-horizon tasks that require flexible composition of vision, search, and coding.
Compositional by design. Most tasks require the combination of vision (GUI interaction, visual understanding), search (information synthesis), and coding (terminal use, algorithms).
Minimal specification, automatic evaluation. Each task is defined solely by an instruction and an evaluation script over the agent's response. No fixed infrastructure is assumed, and no LLM is used as a judge.
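
As a purely hypothetical illustration of this minimal task format, a task might pair an instruction with a standalone checker over the agent's final response. The field names and checking logic below are our own assumptions, not CocoaBench's actual schema:

# Hypothetical CocoaBench-style task: an instruction plus a standalone
# evaluation script over the agent's final response. The schema and
# checker here are illustrative assumptions, not the real format.

TASK = {
    "instruction": "Find the release year of Python 3.0 and report it as a single integer.",
}

def evaluate(response: str) -> bool:
    # Deterministic string check -- no fixed infrastructure, no LLM judge.
    return "2008" in response.split()

if __name__ == "__main__":
    print(evaluate("Python 3.0 was released in 2008"))  # True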
Cocoa-Agent scaffold. We build Cocoa-Agent, a lightweight framework integrated with Sandbox-AIO, enabling controlled comparison across backbones and reproducible evaluation.
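
To make the controlled-comparison design concrete, a backbone-agnostic agent loop might look like the sketch below. The Sandbox interface, llm_step callable, and action format are placeholders we invented for illustration; they are not Cocoa-Agent's actual API:

# Minimal backbone-agnostic agent loop (a sketch under assumptions):
# `llm_step` wraps whichever backbone model is being compared, and
# `sandbox.run` executes tool calls in an isolated environment,
# standing in for something like Sandbox-AIO. All names are
# hypothetical placeholders, not Cocoa-Agent's real interface.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

def run_agent(instruction, llm_step, sandbox, max_steps=30):
    traj = Trajectory()
    observation = instruction
    for _ in range(max_steps):
        action = llm_step(traj.steps, observation)  # backbone picks the next action
        if action["type"] == "final_answer":
            return action["content"], traj
        observation = sandbox.run(action)           # execute the tool call in the sandbox
        traj.steps.append((action, observation))
    return None, traj  # step budget exhausted without an answer

if __name__ == "__main__":
    class EchoSandbox:
        def run(self, action):
            return f"ran {action['type']}"

    def scripted_step(steps, observation):
        # Dummy backbone that answers immediately; a real one would call an LLM.
        return {"type": "final_answer", "content": observation}

    answer, _ = run_agent("demo instruction", scripted_step, EchoSandbox())
    print(answer)  # "demo instruction"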
Experiments show that current agents remain far from reliable on CocoaBench. The best evaluated system achieves only a 45.1% success rate, leaving substantial room for improvement in reasoning & planning, tool use & execution, and visual grounding.
Examples
Here are some example tasks (excluded from CocoaBench to avoid contamination) that showcase the diverse reasoning challenges the benchmark presents.
Evaluation
We evaluate representative agent systems on CocoaBench v1.0 (153 tasks).
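
Because every task ships its own evaluation script, the benchmark-level metric reduces to a simple pass rate over per-task checks. A minimal sketch, assuming each task exposes an evaluate function like the hypothetical one above (tasks and agent are assumed interfaces, not the real harness):

# Aggregate success rate over per-task evaluation scripts (a sketch).
def success_rate(tasks, agent) -> float:
    passed = sum(1 for task in tasks if task.evaluate(agent(task.instruction)))
    return passed / len(tasks)  # e.g., 69/153 ~= 45.1% for the best system above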
Existing Agents
Cocoa-Agent
Case Studies
We present each model's solution to the four example tasks shown above. Explore how different models approached each task: click on a result block to view the analysis and the raw response.
Citation
@misc{cocoabench2025,
  title={CocoaBench: An Evaluation Framework for General Agents with Compositional Cognitive Abilities},
  author={Shibo Hao and Zhining Zhang and Zhiqi Liang and Tianyang Liu and Zilong Wang and others},
  howpublished={Blog post},
  month={December},
  year={2025},
  url={https://cocoabench.github.io/}
}