Scorecard · OpenAI · May 2024

GPT-4o

Behavior across the four agent squared benchmark clusters. The composite is the mean of the dimension scores (each 0–100), which are normalised within the current model roster.
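As a rough check on the aggregation described above, the sketch below averages the twelve dimension scores listed further down this scorecard. It assumes an unweighted mean and rounding to the nearest integer for cluster scores; neither detail is stated on the page, and the variable names are illustrative.

```python
from statistics import mean

# Dimension scores (0-100) copied from the "Every dimension" section of this page.
dimension_scores = {
    "Negotiation integrity": [58, 36, 0],
    "Market rationality": [98, 100, 0],
    "Adversarial robustness": [100, 100, 100],
    "Market stability": [100, 35, 100],
}

# Assumption: a cluster score is the unweighted mean of its dimensions, rounded.
cluster_scores = {
    name: round(mean(values)) for name, values in dimension_scores.items()
}

# Assumption: the composite is the unweighted mean over all twelve dimensions.
all_scores = [s for values in dimension_scores.values() for s in values]
composite = mean(all_scores)

for name, score in cluster_scores.items():
    print(f"{name}: {score} / 100")
print(f"Mean over all dimensions: {composite:.1f} / 100")
```

Under these assumptions the rounded cluster means come out at 31, 66, 100 and 78, which matches the cluster figures shown on the page.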

Composite score: 0 / 100

In context

How GPT-4o compares to the rest of the roster.

Cluster scores by model
Composite score (0–100, higher = better) on each of the four benchmark clusters. Scores are normalised within the current model roster, so adding a new model rescales all polygons.
Source · composite of all dimensions per cluster
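The page does not say how scores are normalised within the roster. A minimal sketch, assuming plain min-max scaling to 0–100 (an assumption, not the benchmark's documented method), shows why adding a model can move every other model's score. The model names and raw values are hypothetical.

```python
def normalise(raw, higher_is_better=True):
    """Min-max scale raw values to 0-100 across the roster (assumed method)."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # Every model ties, so give them all the top score.
        return {model: 100.0 for model in raw}
    scaled = {}
    for model, value in raw.items():
        frac = (value - lo) / (hi - lo)
        scaled[model] = 100.0 * (frac if higher_is_better else 1.0 - frac)
    return scaled


# Hypothetical raw values (for example, an efficiency in percent).
roster = {"model_a": 20, "model_b": 50, "model_c": 80}
before = normalise(roster)

# Adding a stronger hypothetical model widens the range and rescales the rest.
after = normalise({**roster, "model_d": 100})

print(before)
print(after)
```

In this toy roster, model_b and model_c score lower after model_d joins even though their raw values are unchanged, which is the rescaling effect the caption describes.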

Every dimension · raw values

01 · Negotiation integrity · Cluster 31 / 100
Anchor shift · 58 / 100 · $4.44 (anchoring price shift under high anchor)
Loss framing · 36 / 100 · -$3.23 (loss-framing price shift)
Outside option · 0 / 100 · 0.0% (outside-option enforcement rate)
02 · Market rationality · Cluster 66 / 100
1st-price vs BNE · 98 / 100 · 0.671 (first-price bid / value ratio)
2nd-price truth · 100 / 100 · 1.000 (second-price truthful bidding)
Info overload (3) · 0 / 100 · 62.9% (accuracy under adversarial info, 3 attrs)
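For context on the "1st-price vs BNE" dimension: in the textbook first-price sealed-bid auction with n risk-neutral bidders whose private values are independent and uniform, the Bayesian Nash equilibrium (BNE) bid is (n − 1)/n times the bidder's value, while in a second-price auction bidding one's true value (ratio 1.0) is a dominant strategy. The sketch below compares the reported 0.671 ratio with that textbook benchmark for a few bidder counts. The benchmark's actual bidder count and value distribution are not stated on this page, so the values of n are assumptions.

```python
def first_price_bne_ratio(n_bidders: int) -> float:
    """Equilibrium bid/value ratio with i.i.d. uniform values (textbook case)."""
    if n_bidders < 2:
        raise ValueError("an auction needs at least two bidders")
    return (n_bidders - 1) / n_bidders


observed_ratio = 0.671  # first-price bid / value ratio reported on this page

# Assumed bidder counts; the page does not say which one the benchmark uses.
for n in (2, 3, 4, 5):
    bne = first_price_bne_ratio(n)
    print(f"n={n}: BNE ratio {bne:.3f}, observed minus BNE {observed_ratio - bne:+.3f}")
```

The gap is smallest for n = 3 among these counts, but that alone does not establish which setup the benchmark used.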
03 · Adversarial robustness · Cluster 100 / 100
Exploit loss · 100 / 100 · $6.47 (surplus extracted by informed seller)
Decoy lift · 100 / 100 · 0.0 pp (decoy effect choice lift)
Defense recovery · 100 / 100 · $4.16 (specific-warning defense recovery)
04 · Market stability · Cluster 78 / 100
Baseline efficiency · 100 / 100 · 98.3% (double-auction baseline efficiency)
Anchored efficiency · 35 / 100 · 17.0% (efficiency under shared anchors)
All-debiased · 100 / 100 · 96.4% (efficiency when all agents debiased)
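The page does not define "efficiency" for the double-auction dimensions. A common definition is allocative efficiency: realised gains from trade divided by the maximum gains available. The sketch below computes it under that assumption; the function names, buyer values, seller costs and trades are all hypothetical.

```python
def max_surplus(buyer_values, seller_costs):
    """Largest achievable gains from trade: match high values with low costs."""
    values = sorted(buyer_values, reverse=True)
    costs = sorted(seller_costs)
    return sum(v - c for v, c in zip(values, costs) if v > c)


def allocative_efficiency(trades, buyer_values, seller_costs):
    """Realised surplus over maximum surplus (assumed definition).

    `trades` is a list of (buyer_value, seller_cost) pairs that actually traded.
    """
    realised = sum(v - c for v, c in trades)
    best = max_surplus(buyer_values, seller_costs)
    return realised / best if best > 0 else 0.0


# Hypothetical market: four buyers and four sellers with one unit each.
buyers = [10, 8, 6, 4]
sellers = [3, 5, 7, 9]

# Hypothetical outcome: one efficient buyer is displaced by a lower-value one.
trades = [(10, 5), (6, 3)]

print(f"max surplus: {max_surplus(buyers, sellers)}")
print(f"efficiency: {allocative_efficiency(trades, buyers, sellers):.1%}")
```

Trade prices do not appear in the calculation because they only split the surplus between buyer and seller; under this definition, efficiency depends on who trades, not at what price.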