
Findings from the first benchmark.

Three frontier models, seven experiments, 8,280 runs. Each card opens a write-up; the paper and replication code are linked at the bottom.