CocoaBench

Evaluating Unified Digital Agents in the Wild

Overview

CocoaBench is a benchmark for unified digital agents built from 153 human-designed, long-horizon tasks that require flexible composition of vision, search, and coding.

Experiments show that current agents remain far from reliable on CocoaBench. The best evaluated system achieves only a 45.1% success rate, leaving substantial room for improvement in reasoning & planning, tool use & execution, and visual grounding.

Examples

Here are some example tasks (excluded from CocoaBench to avoid contamination), showcasing the diverse reasoning challenges our benchmark presents.

Evaluation

We evaluate representative agent systems on CocoaBench v1.0 (153 tasks).
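The primary metric is task success rate: the fraction of the 153 tasks an agent completes. The sketch below illustrates how such a score might be computed; the Task format and the run_agent and check_success functions are hypothetical placeholders, not CocoaBench's actual evaluation harness.

```python
# Hypothetical sketch of success-rate scoring. Task, run_agent, and
# check_success are illustrative assumptions, not the real harness.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    instruction: str  # long-horizon, human-designed task description

def run_agent(task: Task) -> str:
    """Placeholder: run the agent on one task, return its final answer."""
    raise NotImplementedError

def check_success(task: Task, answer: str) -> bool:
    """Placeholder: judge whether the answer completes the task."""
    raise NotImplementedError

def success_rate(tasks: list[Task]) -> float:
    """Fraction of tasks completed; e.g., 69 of 153 solved gives ~45.1%."""
    solved = sum(check_success(t, run_agent(t)) for t in tasks)
    return solved / len(tasks)
```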

Existing Agents

Cocoa-Agent

Case Studies

We present the model solutions for the four example tasks shown above and examine how different models approached each one. Each result includes an analysis and the raw model response.

Citation