Benchmarks for the agent-to-agent web.
Soon, agents will negotiate, bid, and trade on your behalf. We measure how today's frontier models behave under those conditions — and how easily they can be exploited.
No model wins everywhere.
Composite scores (0–100) across four behavioral clusters. Each frontier model has a different vulnerability profile.
Claude Sonnet 4 holds value best in negotiations but is most exposed to decoy effects.
GPT-4o bids closest to equilibrium and runs the most efficient baseline markets.
Gemini 2.5 Flash is the most anchored and overbids in first-price auctions — but is the only model immune to information overload.
A $10 anchor moves price by $2.31 on Claude, $7.38 on Gemini.
The same anchor moves Gemini's price about three times as far as Claude's (7.38 / 2.31 ≈ 3.2). LLMs are not less anchored than humans; they are more.
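To make the comparison concrete, here is a small sketch of the arithmetic behind those two figures. It assumes the headline means dollars of price movement per $10 of anchor. The name `pass_through` is our own label for illustration, not a term from the benchmark.

```python
# Figures quoted above: price movement (in dollars) caused by a $10 anchor.
ANCHOR_DOLLARS = 10.00
price_shift = {
    "Claude Sonnet 4": 2.31,
    "Gemini 2.5 Flash": 7.38,
}

def pass_through(shift_dollars, anchor_dollars=ANCHOR_DOLLARS):
    """Fraction of the anchor that shows up in the final price (hypothetical metric name)."""
    return shift_dollars / anchor_dollars

# Per-model share of the anchor that reaches the price.
rates = {model: pass_through(shift) for model, shift in price_shift.items()}

# How much further the same anchor moves Gemini than Claude.
ratio = price_shift["Gemini 2.5 Flash"] / price_shift["Claude Sonnet 4"]

for model, rate in rates.items():
    print(f"{model}: {rate:.1%} of the anchor reaches the price")
print(f"Gemini / Claude ratio: {ratio:.2f}")
```

Read this way, roughly a quarter of the anchor reaches Claude's price and roughly three quarters reaches Gemini's.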
How the benchmark is structured.
Each cluster maps to one deployment question. Click through for the full chart set.
Five findings from the first run.
Each finding pairs a chart with a short write-up.
Reproducible. Open. Independent.
Seven experiments, three models, deterministic SHA-256 seeds, 30 replications at two temperatures — 8,280 runs in v1. All code, prompts, and raw logs are public.
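One plausible way to get deterministic SHA-256 seeds is to hash a string that identifies each run and use part of the digest as the seed. The sketch below is an assumption about how such a scheme could look, not the benchmark's actual code. The function `derive_seed`, the key layout, the experiment and model identifiers, and the two temperature values are all hypothetical.

```python
import hashlib

def derive_seed(experiment, model, temperature, replication):
    """Map one run's coordinates to a reproducible 64-bit seed (hypothetical scheme)."""
    # A fixed, human-readable key: the same inputs always produce the same seed.
    key = f"{experiment}|{model}|{temperature:.2f}|{replication}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep the first 8 bytes so the seed fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")

# Hypothetical identifiers and temperatures, used only for illustration.
seeds = [
    derive_seed("anchoring", "example-model", temp, rep)
    for temp in (0.0, 1.0)
    for rep in range(30)
]

print(len(seeds), "seeds,", len(set(seeds)), "distinct")
```

Because the seed depends only on the run's coordinates, anyone re-running a given cell would get the same seed without needing a shared seed file.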
Pick agents that won't get exploited on your behalf.
A standing public benchmark. The next wave adds reasoning models and an adversarial-prompting suite.