April 7, 2026

CocoaBench v1.0

By The CocoaBench Team


We are releasing CocoaBench v1.0, a benchmark for evaluating unified digital agents on complex, long-horizon tasks that require flexible composition of vision, search, and coding. Our paper is available here.

CocoaBench teaser: a shopping task requiring GUI interaction, vision, search, and coding
Figure 1. CocoaBench evaluates agents on tasks that require flexible composition of core capabilities. The shopping example shown here illustrates the multi-step, compositional nature of the benchmark.

What is CocoaBench?

CocoaBench consists of 153 human-authored tasks spanning 9 domains — Business, Culture, Education, Life, Logic & Puzzles, Science, Sports, Technology, and Travel. Each task is defined minimally: a natural language instruction and an automatic evaluation function over the agent's final output. No fixed runtime, tool ecosystem, or interaction mode is assumed, making the benchmark compatible with any agent infrastructure.
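
Concretely, each task reduces to a pair of an instruction and a checker. The sketch below shows one plausible shape for such a record in Python; the field names and the toy task are our own illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str                      # hypothetical identifier, for illustration
    domain: str                       # one of the 9 domains, e.g. "Travel"
    instruction: str                  # natural language instruction given to the agent
    evaluate: Callable[[str], bool]   # automatic check over the agent's final output

# A toy example in this shape (the real tasks are human-authored):
example = Task(
    task_id="travel-001",
    domain="Travel",
    instruction="Report the capital of France as 'ANSWER: <city>'.",
    evaluate=lambda output: output.strip() == "ANSWER: Paris",
)
```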

Unlike benchmarks that test a single capability in isolation, CocoaBench is compositional by design — 98% of tasks require agents to combine multiple capabilities within a single run. For instance, an agent might need to visually inspect a webpage, search for supplementary information online, and write code to synthesize the results into a structured answer. Evaluation is fully automatic: every task ships with a verifiable evaluation script, so correctness does not rely on LLM judges or human review.
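
Because every checker is an executable function, scoring a run is just a loop with no judge in it. A minimal harness over the hypothetical Task shape above might look like the following; `score` and `run_agent` are illustrative names, not part of the released code.

```python
def score(tasks, run_agent):
    """Score an agent on a list of tasks using each task's own evaluation
    function; no LLM judge or human review is involved."""
    passed = 0
    for task in tasks:
        final_output = run_agent(task.instruction)  # the agent's final answer
        if task.evaluate(final_output):
            passed += 1
    return passed / len(tasks)  # success rate, as reported for CocoaBench
```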

CocoaBench task statistics: domain distribution, resource types, and capability requirements
Figure 2. Statistics of CocoaBench. (a) Distribution across 9 task domains. (b) Distribution of resource types required by the tasks. (c) Human-annotated key capabilities required per task — Vision, Search, and Coding are not mutually exclusive.

Key results

We evaluated representative existing agent systems as well as a range of model backbones under our shared Cocoa-Agent scaffold. The best-performing system, Codex and OpenClaw with GPT-5.4 as the backbone, achieves only a 45.1% success rate, underscoring substantial room for improvement.

Overall performance chart: existing agent systems (top) and model backbones under Cocoa-Agent (bottom)
Figure 3. Overall performance on CocoaBench for representative existing agent systems (top) and model backbones under the shared Cocoa-Agent scaffold (bottom).

Error analysis across 712 failure trajectories reveals three recurring top-level failure modes, broken down into fine-grained subcategories in Figure 4.

Error distribution across Cocoa-Agent evaluated models
Figure 4. Aggregate failure distribution across all models evaluated under Cocoa-Agent. The inner ring shows the three top-level failure categories; the outer ring details fine-grained subcategories.

Notably, stronger models allocate a higher share of their actions to code execution, suggesting that programmatic processing is a key strategy for the structured reasoning and output formatting that CocoaBench tasks demand.

Cocoa-Agent

Alongside the benchmark, we release Cocoa-Agent, a lightweight shared scaffold built on top of AIO Sandbox — an all-in-one Docker runtime that integrates browser, shell, and file system in a single container. Cocoa-Agent adopts a ReAct-based loop with both DOM-level and screenshot-based GUI control, terminal execution, file manipulation, and sandboxed code execution. Its modular design enables controlled backbone comparisons and reproducible parallel evaluation.
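
To make the scaffold concrete, here is a minimal sketch of a ReAct-style loop of the kind Cocoa-Agent adopts. The message format, the step budget, and the structure returned by `llm` are assumptions for illustration; the actual tool set (GUI control, terminal, files, code execution) lives inside the AIO Sandbox container.

```python
import json

def react_loop(llm, tools, instruction, max_steps=50):
    """Minimal ReAct-style loop: the model alternates reasoning and tool
    calls until it emits a final answer or exhausts the step budget."""
    history = [{"role": "user", "content": instruction}]
    for _ in range(max_steps):
        step = llm(history)  # assumed to return {"thought", "action", "args"}
        if step["action"] == "final_answer":
            return step["args"]["answer"]  # final output, sent to the task's evaluator
        result = tools[step["action"]](**step.get("args", {}))  # e.g. a shell or browser call
        history.append({"role": "assistant", "content": json.dumps(step)})
        history.append({"role": "tool", "content": str(result)})
    return ""  # exceeding the step budget counts as a failure
```

In this shape, swapping the `llm` callable is all a backbone comparison requires, which is the point of the modular design.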

Join the discussion on our Discord.

