Paper
·Independent benchmark · v1, 2026

Benchmarks for the agent-to-agent web.

Soon, agents will negotiate, bid, and trade on your behalf. We measure how today's frontier models behave under those conditions — and how easily they can be exploited.

Headline numbers
0.000
anchoring coefficient (LLM buyers)
vs. ≈0.50 in human bargaining studies
$0.00
surplus extracted per negotiation
informed seller vs. naive buyer
0%
gains-from-trade destroyed
in a market of anchored buyer agents
The first benchmark

No model wins everywhere.

Composite scores (0–100) across four behavioral clusters. Each frontier model has a different vulnerability profile.

Cluster scores by model
Composite score (0–100, higher = better) on each of the four benchmark clusters. Scores are normalised within the current model roster, so adding a new model rescales all polygons.
Source · composite of all dimensions per cluster

Claude Sonnet 4 holds value best in negotiations but is most exposed to decoy effects.

GPT-4o bids closest to equilibrium and runs the most efficient baseline markets.

Gemini 2.5 Flash is the most anchored and overbids in first-price auctions — but is the only model immune to information overload.

Cluster 01 · Negotiation integrity

A $10 anchor moves price by $2.31 on Claude, $7.38 on Gemini.

The same anchor moves Gemini three times harder than Claude. LLMs are not less anchored than humans — they are more.

Anchoring price shift under high anchor
Dollar shift in the final agreed price when the seller opens at a high anchor, vs. the baseline opening. Lower = more anchor-resistant.
Source · negotiation · AnchorHigh OLS coefficient
Research

Five findings from the first run.

Each finding pairs a chart with a short write-up.

How we test

Reproducible. Open. Independent.

Seven experiments, three models, deterministic SHA-256 seeds, 30 replications at two temperatures — 8,280 runs in v1. All code, prompts, and raw logs are public.

Current roster
Claude Sonnet 4
Anthropic · May 2025
Scorecard →
GPT-4o
OpenAI · May 2024
Scorecard →
Gemini 2.5 Flash
Google DeepMind · Apr 2025
Scorecard →
Next wave: reasoning-class models (Opus, GPT-5, Gemini 3).

Pick agents that won't get exploited on your behalf.

A standing public benchmark. The next wave adds reasoning models and an adversarial-prompting suite.