Paper
Benchmarks · v1

One benchmark, four clusters, three models.

Every dimension we measure rolls into one of four clusters. Every cluster maps to a deployment question.

The cross-cluster view

No model wins everywhere.

Strengths and weaknesses are systematic, not random — and they differ between models in ways that matter for deployment.

Cluster scores by model
Composite score (0–100, higher = better) on each of the four benchmark clusters. Scores are normalised within the current model roster, so they are relative rather than absolute: adding a new model rescales all polygons.
Source · composite of all dimensions per cluster
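
To make the rescaling effect in the caption concrete, here is a minimal sketch of roster-relative normalisation. The source does not say which normalisation is used, so min-max scaling per cluster is an assumption here, and the function name, model names, cluster name, and raw values are all hypothetical placeholders rather than benchmark results.

```python
def normalise_within_roster(raw):
    """Rescale raw cluster scores to 0-100 relative to the current roster.

    raw: {model_name: {cluster_name: raw_score}}
    Assumption: min-max scaling per cluster; the benchmark's real method may differ.
    """
    clusters = sorted({c for scores in raw.values() for c in scores})
    out = {model: {} for model in raw}
    for cluster in clusters:
        values = [raw[model][cluster] for model in raw]
        lo, hi = min(values), max(values)
        span = hi - lo
        for model in raw:
            if span == 0:
                # Every model ties on this cluster; place them all at the midpoint.
                out[model][cluster] = 50.0
            else:
                out[model][cluster] = 100.0 * (raw[model][cluster] - lo) / span
    return out


# Hypothetical raw scores for three models on one placeholder cluster.
roster = {
    "model_a": {"cluster_1": 20},
    "model_b": {"cluster_1": 60},
    "model_c": {"cluster_1": 40},
}
before = normalise_within_roster(roster)

# Adding a fourth, stronger model changes the scale for everyone.
roster_plus = dict(roster, model_d={"cluster_1": 100})
after = normalise_within_roster(roster_plus)

print(before["model_b"]["cluster_1"], after["model_b"]["cluster_1"])
```

Under this assumption, a model's raw result stays the same while its plotted score shifts whenever the roster's minimum or maximum changes, which is why scores from different roster versions should not be compared directly.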