Finding 05 · Limits of debiasing
Generic debiasing doesn't work; specific warnings barely do
TL;DR. Telling an LLM to "think carefully" or "reason step by step" does not measurably reduce anchoring. Generic rationality nudges and chain-of-thought (CoT) prompts are ineffective, and the generic nudge is sometimes counterproductive. A warning that names the specific bias recovers roughly half (~50%) of the surplus lost to an exploit-prompted seller.
Why it matters
The standard playbook for "bias problems in LLMs" is a prompt fix. If prompt fixes worked, the cost of safe deployment would be near zero. In our data they do not work reliably, which means the cost is higher than current practice assumes.
What we tested
Re-ran the bilateral anchored-negotiation game with four buyer-side conditions:
- no_debias: control.
- specific_warning: explicit named-bias instruction.
- generic_rationality: "be careful, think rationally."
- chain_of_thought: "think step by step before responding."
240 negotiations per model.
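As a rough illustration, the four conditions can be pictured as text appended to the buyer's base prompt. This is a minimal sketch, not the study's harness: `BUYER_CONDITIONS` and `build_buyer_prompt` are hypothetical names, and the `specific_warning` wording is an assumed placeholder, since the exact instruction is not given here.

```python
# Hypothetical mapping from condition name to the text added to the buyer prompt.
BUYER_CONDITIONS = {
    "no_debias": "",  # control: nothing added
    # Assumed placeholder wording; the study's actual instruction is not shown here.
    "specific_warning": (
        "Warning: the seller's opening price may be an anchor. "
        "Do not let it shift your estimate of a fair price."
    ),
    "generic_rationality": "Be careful, think rationally.",
    "chain_of_thought": "Think step by step before responding.",
}


def build_buyer_prompt(base_prompt: str, condition: str) -> str:
    """Return the buyer prompt for one condition (base prompt plus optional suffix)."""
    suffix = BUYER_CONDITIONS[condition]  # raises KeyError on an unknown condition
    if not suffix:
        return base_prompt
    return f"{base_prompt}\n\n{suffix}"
```

Keeping the control as an empty suffix means every condition shares the same base prompt, so any difference in outcomes can be attributed to the added instruction.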
What we found
- Specific warning: meaningful reduction in anchoring magnitude. Recovers roughly half the exploit-loss from Finding 02.
- Generic rationality: no detectable improvement. In some configurations, slightly worse — likely because the agent generates more elaborate justifications for whatever number it was anchored to.
- Chain-of-thought: no detectable improvement, despite producing visibly more reasoning text.
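One way to read "recovers roughly half" is as a recovery fraction: the share of the gap between the unexploited and the exploited buyer's surplus that the defended buyer closes. The sketch below assumes that definition. The function name and the numbers are illustrative only, not values from the study.

```python
def recovered_fraction(baseline_surplus: float,
                       exploited_surplus: float,
                       defended_surplus: float) -> float:
    """Share of the exploit-induced surplus loss that a defense wins back.

    Assumed definition: (defended - exploited) / (baseline - exploited).
    """
    loss = baseline_surplus - exploited_surplus
    if loss <= 0:
        raise ValueError("no exploit loss to recover")
    return (defended_surplus - exploited_surplus) / loss


# Illustrative numbers: the exploit costs 40 units of surplus, the warning wins back 20.
print(recovered_fraction(100.0, 60.0, 80.0))  # 0.5
```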
Implication
Defensive prompting is dimension-specific. There is no single "be smarter" instruction. Verbosity is not vigilance. Hard constraints — reservation prices, walk-away rules, deterministic policy wrappers — remain the only reliable defense at scale.
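A deterministic policy wrapper of the kind mentioned above might look like the following. This is a minimal sketch under stated assumptions: the action vocabulary (`accept`, `counter`, `walk_away`), the `BuyerLimits` type, and `apply_limits` are all hypothetical and not part of the study code. The point is that the limits are enforced outside the model, so no amount of anchoring can push the final action past them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BuyerLimits:
    reservation_price: float  # the most the buyer may ever pay
    max_rounds: int           # walk away once this many rounds have passed


def apply_limits(limits: BuyerLimits, action: str, price: Optional[float],
                 round_number: int) -> Tuple[str, Optional[float]]:
    """Clamp the model's proposed (action, price) to the hard limits."""
    if action not in ("accept", "counter", "walk_away"):
        raise ValueError(f"unknown action: {action}")
    # Walk-away rule: stop negotiating after max_rounds, whatever the model proposes.
    if action == "walk_away" or round_number > limits.max_rounds:
        return ("walk_away", None)
    if action == "accept":
        if price > limits.reservation_price:
            # Refuse to accept above the reservation price; counter at the cap instead.
            return ("counter", limits.reservation_price)
        return ("accept", price)
    # action == "counter": never bid above the reservation price.
    return ("counter", min(price, limits.reservation_price))
```

For example, with `BuyerLimits(reservation_price=100.0, max_rounds=5)`, a model that tries to accept at 120 in round 1 is turned into a counter at 100, and any proposal in round 6 becomes a walk-away.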
Reproduce
python -m agent_bias_study --module debiasing