Benchmarks · v1
One benchmark, four clusters, three models.
Every dimension we measure rolls into one of four clusters. Every cluster maps to a deployment question.
Composite score · all clusters
The cross-cluster view
No model wins everywhere.
Strengths and weaknesses are systematic, not random, and they differ between models in ways that matter for deployment.
Cluster scores by model
Composite score (0–100, higher = better) on each of the four benchmark clusters. Scores are normalised within the current model roster, so they are relative rather than absolute: adding a new model rescales all polygons.
Source · composite of all dimensions per cluster
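To see why roster-relative normalisation rescales every model when a new one is added, here is a minimal sketch. It assumes min-max scaling to 0–100 within one cluster; the page does not state which normalisation the benchmark actually uses, and the model names and raw values below are hypothetical.

```python
def normalise_within_roster(raw_scores):
    """Min-max rescale one cluster's raw composites to 0-100 across the roster.

    Assumption: min-max scaling is used here only for illustration; the
    benchmark's actual normalisation method is not specified.
    """
    lowest = min(raw_scores.values())
    highest = max(raw_scores.values())
    span = highest - lowest
    if span == 0:
        # All models tie, so every model gets the top score.
        return {model: 100.0 for model in raw_scores}
    return {
        model: 100.0 * (value - lowest) / span
        for model, value in raw_scores.items()
    }


# Hypothetical raw composites for a single cluster.
roster = {"model_a": 0.50, "model_b": 0.75, "model_c": 1.00}
before = normalise_within_roster(roster)

# Adding a stronger fourth model changes the scale for everyone.
after = normalise_within_roster({**roster, "model_d": 1.50})

print(before)
print(after)
```

In this sketch `model_c` drops from 100 to 50 once `model_d` joins the roster, even though its raw result is unchanged. Under this kind of scheme, scores are only comparable within the same roster snapshot.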
Deep dives