Research
Findings from the first benchmark.
Three frontier models, seven experiments, 8,280 runs. Each card opens a write-up; the paper and replication code are linked at the bottom.