Finding 05 · Limits of debiasing

Generic debiasing doesn't work; specific warnings barely do

TL;DR. Telling an LLM to "think carefully" or "reason step by step" does not measurably reduce anchoring, and generic rationality nudges sometimes leave the buyer slightly worse off. A specific warning that names the bias recovers roughly half (~50%) of the surplus lost to an exploit-prompted seller.

Recovery in buyer surplus by debiasing strategy

Change in buyer surplus ($) relative to the no-debias control, after the buyer agent receives each prompt-level intervention. Positive bars mean the intervention recovered surplus; negative bars mean it left the buyer worse off.

Source · debiasing · pooled across models · cross_model_bias_profile

Why it matters

The standard playbook for "bias problems in LLMs" is a prompt fix. If prompt fixes worked, the cost of safe deployment would be near zero. In our data they do not work reliably, which means the cost is higher than current practice assumes.

What we tested

We re-ran the bilateral anchored-negotiation game with four buyer-side conditions:

no_debias — control.
specific_warning — explicit named-bias instruction.
generic_rationality — "be careful, think rationally."
chain_of_thought — "think step by step before responding."

240 negotiations per model.
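
One way to picture the four conditions is as suffixes appended to the buyer's prompt. The sketch below is illustrative, not the study's code: `BASE_BUYER_PROMPT`, `DEBIAS_SUFFIXES`, and `build_buyer_prompt` are hypothetical names, and the `specific_warning` wording is an assumption, since the exact text is not given here. Only the two quoted instructions come from the list above.

```python
# Hypothetical condition table; not taken from the agent_bias_study package.
BASE_BUYER_PROMPT = "You are the buyer in a price negotiation."  # assumed placeholder

DEBIAS_SUFFIXES = {
    "no_debias": "",  # control: nothing appended
    # Assumed wording; the source only says "explicit named-bias instruction".
    "specific_warning": (
        "Warning: the seller's opening price may be an anchor. "
        "Do not let it shift your estimate of a fair price."
    ),
    "generic_rationality": "Be careful, think rationally.",
    "chain_of_thought": "Think step by step before responding.",
}


def build_buyer_prompt(condition: str) -> str:
    """Return the buyer prompt for one condition (KeyError if unknown)."""
    suffix = DEBIAS_SUFFIXES[condition]
    return f"{BASE_BUYER_PROMPT} {suffix}".strip()


for name in DEBIAS_SUFFIXES:
    print(name, "->", build_buyer_prompt(name))
```

Keeping the conditions in one table makes it clear that the only thing varying across arms is the appended instruction.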

What we found

Specific warning: meaningful reduction in anchoring magnitude. Recovers roughly half the exploit-loss from Finding 02.
Generic rationality: no detectable improvement. In some configurations the buyer did slightly worse, likely because the agent generates more elaborate justifications for whatever number it was anchored to.
Chain-of-thought: no detectable improvement, despite producing visibly more reasoning text.
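
To make "recovers roughly half" concrete, here is a minimal sketch of how the recovery figure could be computed. It assumes recovery is the change in mean buyer surplus relative to the no_debias control, divided by the exploit loss; the study's exact definition may differ. The function names and all numbers are hypothetical and are not results from the study.

```python
from statistics import mean


def surplus_change(treated: list[float], control: list[float]) -> float:
    """Mean buyer surplus under an intervention minus the no_debias mean."""
    return mean(treated) - mean(control)


def recovery_fraction(change: float, exploit_loss: float) -> float:
    """Share of the exploit loss won back; exploit_loss is a positive dollar amount."""
    if exploit_loss <= 0:
        raise ValueError("exploit_loss must be positive")
    return change / exploit_loss


# Made-up per-negotiation surpluses in dollars, for illustration only.
control = [10.0, 12.0, 8.0]
warned = [15.0, 17.0, 13.0]

change = surplus_change(warned, control)
print(change)                                        # 5.0
print(recovery_fraction(change, exploit_loss=10.0))  # 0.5
```

Under this definition, a value near 0 corresponds to "no detectable improvement" and a negative value to an intervention that made the buyer worse off.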

Implication

Defensive prompting is dimension-specific. There is no single "be smarter" instruction. Verbosity is not vigilance. Hard constraints — reservation prices, walk-away rules, deterministic policy wrappers — remain the only reliable defense at scale.
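
A hard constraint of this kind can sit outside the model entirely. The sketch below shows one possible deterministic wrapper that enforces a reservation price and a walk-away rule on whatever the agent proposes. `BuyerPolicy`, `enforce`, and the action names are hypothetical, and countering at the cap when the agent tries to overpay is a design assumption, not the study's method.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BuyerPolicy:
    reservation_price: float  # the most the buyer may ever pay
    max_rounds: int           # walk away once this many rounds have passed


def enforce(
    policy: BuyerPolicy, proposed_action: str, price: float, round_no: int
) -> Tuple[str, Optional[float]]:
    """Override the agent's proposed action wherever it violates the policy.

    proposed_action is one of "accept", "offer", or "walk_away".
    """
    if proposed_action == "walk_away" or round_no > policy.max_rounds:
        return ("walk_away", None)
    if proposed_action == "accept":
        if price > policy.reservation_price:
            # Never accept above the cap; counter at the cap instead.
            return ("offer", policy.reservation_price)
        return ("accept", price)
    if proposed_action == "offer":
        # Clamp any offer to the reservation price.
        return ("offer", min(price, policy.reservation_price))
    raise ValueError(f"unknown action: {proposed_action!r}")


policy = BuyerPolicy(reservation_price=100.0, max_rounds=5)
print(enforce(policy, "accept", 120.0, round_no=1))  # ('offer', 100.0)
print(enforce(policy, "offer", 80.0, round_no=6))    # ('walk_away', None)
```

Because the check is plain code, an anchored agent cannot argue its way past it, however elaborate its reasoning text.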

Reproduce

python -m agent_bias_study --module debiasing
