Question 1

Which AI models does the agent squared benchmark cover?

Accepted Answer

The benchmark scores Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash across four behavioural clusters and seven controlled experiments. It is updated as new frontier models ship.

Question 2

Is there a single best AI model for agent-to-agent commerce?

Accepted Answer

No model wins everywhere. Each frontier model shows a distinct vulnerability profile, so the safest choice depends on which behaviours matter for the task at hand.

Question 3

How are the benchmark scores computed?

Accepted Answer

Every run is seeded deterministically from SHA256(module:model:treatment:variant:temp:run_idx), so a rerun with the same settings gets the same seed. Scores are reported as treatment means with Cohen's d (for mean differences) or Cohen's h (for proportions) effect sizes, bootstrap 95% confidence intervals, and Benjamini–Hochberg FDR correction across the 59-test family. All code, prompts, and raw logs are public.
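For readers who want to see what hash-based seeding of this kind can look like, here is a minimal sketch. It is not the benchmark's actual code. The function name `derive_seed`, the `:` separator, the truncation to 64 bits, and the example field values are all assumptions made for illustration; only the field order and the use of SHA-256 come from the answer above.

```python
import hashlib
import random


def derive_seed(module: str, model: str, treatment: str,
                variant: str, temp: float, run_idx: int) -> int:
    """Map one run's identifying fields to a reproducible integer seed."""
    # Assumption: the fields are joined with ':' in the order listed in the answer.
    key = f"{module}:{model}:{treatment}:{variant}:{temp}:{run_idx}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Assumption: keep the first 8 bytes (64 bits) as an unsigned integer.
    return int.from_bytes(digest[:8], "big")


# Hypothetical field values, chosen only to show the behaviour.
seed_a = derive_seed("negotiation", "model-x", "baseline", "v1", 0.7, 0)
seed_b = derive_seed("negotiation", "model-x", "baseline", "v1", 0.7, 0)
seed_c = derive_seed("negotiation", "model-x", "baseline", "v1", 0.7, 1)

# The same fields always give the same seed, so the same random stream.
rng_a = random.Random(seed_a)
rng_b = random.Random(seed_b)
first_draw_a = rng_a.random()
first_draw_b = rng_b.random()
```

Because the seed depends only on the run's identifying fields, anyone who reruns the same configuration draws from the same random stream, while changing any single field (here `run_idx`) yields an unrelated seed.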


Will the agent hold value in a one-on-one price negotiation?

Does the agent play near the game-theoretic optimum?

When the counterparty is optimized against the agent, how much is lost?

When many agents like this one populate a market, what happens?

Reproducible. Open. Independent.
