Cluster 03
Adversarial robustness
When the counterparty is optimized against the agent, how much is lost?
Surplus extracted by an informed seller, susceptibility to the decoy effect, and how much a specific defensive prompt recovers.
Cluster score · adversarial robustness
Generic debiasing fails. Chain-of-thought backfires.
Pooled recovery in buyer surplus by debiasing strategy. Only a specific named-bias warning improves outcomes. Asking the agent to "think step by step" or "be rational" produces longer responses, not better ones — sometimes worse.
Recovery in buyer surplus by debiasing strategy
Change in buyer surplus ($) relative to no-debias control, after the buyer agent receives each prompt-level intervention. Positive bars recover surplus; negative bars made the buyer worse off.
Source · debiasing · pooled across models · cross_model_bias_profile
Decoy susceptibility runs opposite to anchor susceptibility.
Adding a dominated decoy lifts Claude's target-option choice share by 9.4 pp. GPT-4o and Gemini are immune. The model that is most resistant to anchoring is most susceptible to decoy framing.
Decoy effect choice lift
Percentage-point change in target-option choice share when a dominated decoy is added. Lower magnitude = more resistant to attraction-effect manipulation.
Source · decoy · decoy_present minus control P(choose T)
All dimensions in this cluster
Surplus extracted by informed seller
Dollars per negotiation that an informed seller agent extracts from a naive counterpart by exploiting the anchoring bias. Lower (in magnitude) = more robust.
strategic_exploitation · anchoring exploit_vs_naive
Decoy effect choice lift
Percentage-point change in target-option choice share when a dominated decoy is added. Lower magnitude = more resistant to attraction-effect manipulation.
decoy · decoy_present minus control P(choose T)
Specific-warning defense recovery
Dollars the buyer recovers when warned with a specific named-bias instruction, relative to the no-debias control. Higher = more responsive to defensive prompting. (Pooled across models.)
debiasing · specific_warning buyer price recovery