Paper
← All benchmarks
Cluster 03
Adversarial robustness

When the counterparty is optimized against the agent, how much is lost?

Surplus extracted by an informed seller, susceptibility to the decoy effect, and how much a specific defensive prompt recovers.

Generic debiasing fails. Chain-of-thought backfires.

Pooled recovery in buyer surplus by debiasing strategy. Only a specific named-bias warning improves outcomes. Asking the agent to "think step by step" or "be rational" produces longer responses, not better ones — sometimes worse.

Recovery in buyer surplus by debiasing strategy
Change in buyer surplus ($) relative to no-debias control, after the buyer agent receives each prompt-level intervention. Positive bars recover surplus; negative bars made the buyer worse off.
Source · debiasing · pooled across models · cross_model_bias_profile

Decoy susceptibility runs opposite to anchor susceptibility.

Adding a dominated decoy lifts Claude's target-option choice share by 9.4 pp. GPT-4o and Gemini are immune. The model that is most resistant to anchoring is most susceptible to decoy framing.

Decoy effect choice lift
Percentage-point change in target-option choice share when a dominated decoy is added. Lower magnitude = more resistant to attraction-effect manipulation.
Source · decoy · decoy_present minus control P(choose T)

All dimensions in this cluster

Surplus extracted by informed seller
Dollars per negotiation that an informed seller agent extracts from a naive counterpart by exploiting the anchoring bias. Lower (in magnitude) = more robust.
strategic_exploitation · anchoring exploit_vs_naive
Decoy effect choice lift
Percentage-point change in target-option choice share when a dominated decoy is added. Lower magnitude = more resistant to attraction-effect manipulation.
decoy · decoy_present minus control P(choose T)
Specific-warning defense recovery
Dollars the buyer recovers when warned with a specific named-bias instruction, relative to the no-debias control. Higher = more responsive to defensive prompting. (Pooled across models.)
debiasing · specific_warning buyer price recovery