
Finding 05 · Limits of debiasing

Generic debiasing doesn't work; specific warnings barely do

TL;DR. Telling an LLM to "think carefully" or "reason step by step" does not measurably reduce anchoring. A specific named-bias warning recovers ~50% of the surplus lost to an exploit-prompted seller. Generic rationality nudges and chain-of-thought (CoT) prompting are ineffective and sometimes counterproductive.

Recovery in buyer surplus by debiasing strategy
Change in buyer surplus ($) relative to the no-debias control, after the buyer agent receives each prompt-level intervention. Positive bars recover surplus; negative bars leave the buyer worse off.
Source · debiasing · pooled across models · cross_model_bias_profile
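
One way to read the metric in the figure is sketched below. This is a minimal illustration, not the study's code: it assumes buyer surplus is the buyer's reservation value minus the agreed price, and that the recovery share is measured against a no-exploit baseline. The function names and all numbers are hypothetical.

```python
from statistics import mean

def buyer_surplus(reservation_value, price):
    """Surplus for one deal: what the buyer was willing to pay minus what it paid."""
    return reservation_value - price

def surplus_change(treated, control):
    """Mean surplus under an intervention minus mean surplus under the no-debias control."""
    return mean(treated) - mean(control)

def recovery_share(treated, control, baseline):
    """Fraction of the surplus lost to the exploit that the intervention wins back.

    'baseline' is an assumed no-exploit condition; the loss is baseline minus control.
    """
    lost = mean(baseline) - mean(control)
    if lost <= 0:
        raise ValueError("no surplus was lost; recovery share is undefined")
    return surplus_change(treated, control) / lost

# Illustrative per-negotiation surpluses in dollars (made up, not study data).
baseline = [30.0, 34.0, 32.0]  # seller without the exploit prompt
control = [10.0, 14.0, 12.0]   # exploit-prompted seller, buyer not debiased
warned = [20.0, 24.0, 22.0]    # exploit-prompted seller, buyer given a named-bias warning

change = surplus_change(warned, control)
share = recovery_share(warned, control, baseline)
```

With these made-up numbers the bar for the warned buyer would sit at +$10, which is half of the $20 lost to the exploit. A negative `change` corresponds to a bar below zero in the figure.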

Why it matters

The standard playbook for "bias problems in LLMs" is a prompt fix. If prompt fixes worked, the cost of safe deployment would be near zero. In our data they do not work reliably, which means the cost is higher than current practice assumes.

What we tested

We re-ran the bilateral anchored-negotiation game under four buyer-side conditions, with 240 negotiations per model.

What we found

Implication

Defensive prompting is dimension-specific. There is no single "be smarter" instruction. Verbosity is not vigilance. Hard constraints — reservation prices, walk-away rules, deterministic policy wrappers — remain the only reliable defense at scale.
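
A deterministic policy wrapper of the kind mentioned above could look roughly like this. It is a sketch under stated assumptions, not the study's implementation: the action format (`accept`, `offer`, `reject`, `walk`), the `BuyerLimits` fields, and the `enforce_limits` name are all hypothetical. The point is only that the check runs outside the model, so no seller prompt can talk the buyer past its limits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BuyerLimits:
    reservation_price: float  # the buyer never pays more than this
    max_rounds: int           # walk away once this many rounds have passed

def enforce_limits(action, seller_ask, round_no, limits):
    """Apply hard constraints to whatever action the LLM buyer proposes.

    'action' is a dict such as {"type": "offer", "price": 80.0}.
    The returned dict is the action actually sent to the seller.
    """
    # Walk-away rule: stop negotiating after the round budget is spent.
    if round_no > limits.max_rounds:
        return {"type": "walk"}

    # Reservation price on acceptance: refuse any ask above the limit.
    if action["type"] == "accept":
        if seller_ask <= limits.reservation_price:
            return action
        return {"type": "reject"}

    # Reservation price on offers: cap the bid at the limit.
    if action["type"] == "offer":
        capped = min(action["price"], limits.reservation_price)
        return {"type": "offer", "price": capped}

    # Anything else (including an explicit walk or an unknown type) ends the deal.
    return {"type": "walk"}

limits = BuyerLimits(reservation_price=100.0, max_rounds=5)
safe = enforce_limits({"type": "accept"}, seller_ask=120.0, round_no=2, limits=limits)
```

Here the model tried to accept a $120 ask, and the wrapper turns that into a rejection because the ask exceeds the $100 reservation price. The wrapper does not make the buyer negotiate well; it only bounds the worst case.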

Reproduce

python -m agent_bias_study --module debiasing