§01 How a model gets scored
Each submission travels through a deterministic six-step pipeline. The runner is hermetic — it pulls the artefact, runs against a fixed test split, and posts results back over a signed webhook.
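The text does not say how the webhook is signed. One common scheme, assumed here purely for illustration, is an HMAC-SHA256 signature over the raw request body with a shared secret. The payload field names and the secret below are hypothetical.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_payload(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)


# Hypothetical result payload; field names are illustrative only.
secret = b"example-shared-secret"
body = json.dumps({"submission_id": "demo", "composite": 0.123}, sort_keys=True).encode()
signature = sign_payload(secret, body)
```

The receiver would call `verify_payload` on the exact bytes it received before trusting any score in the body; a payload altered in transit fails the check.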
§02 Contamination resistance
The private test set rotates every quarter: 50 of the 200 airfoils are replaced with new, procedurally perturbed variants of parent airfoils, generated under a fixed seed schedule. A model cannot have trained on the rotated airfoils, because they did not exist when it was trained.
The 150 stable airfoils give cross-quarter comparability. A model submitted in 2026Q2 and re-run in 2027Q1 will show its drift against the rotated 50 alongside its persistence on the stable 150.
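A fixed seed schedule can be sketched as a deterministic mapping from the quarter label to a random generator. Everything below is an assumption made for illustration: the function name `rotation_plan`, the way the seed is derived, and the placeholder parent names are not taken from the benchmark, and the actual geometric perturbation is left out.

```python
import random

STABLE_COUNT = 150   # airfoils kept across quarters (from the text)
ROTATED_COUNT = 50   # airfoils replaced each quarter (from the text)


def rotation_plan(quarter: str, parent_ids: list[str], base_seed: int = 0) -> list[tuple[str, int]]:
    """Return (parent_id, perturbation_seed) pairs for one quarter's rotated slice."""
    # Seeding with a string is deterministic, so the same quarter always yields the same plan.
    rng = random.Random(f"{base_seed}:{quarter}")
    return [(rng.choice(parent_ids), rng.randrange(2**32)) for _ in range(ROTATED_COUNT)]


parents = [f"parent-{i:03d}" for i in range(20)]  # placeholder parent names
plan_q2 = rotation_plan("2026Q2", parents)
plan_q3 = rotation_plan("2026Q3", parents)
```

Because the plan depends only on the quarter label and the base seed, anyone holding the schedule can regenerate a past quarter's rotated slice exactly, while a future quarter's slice does not exist until it is generated.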
§03 The composite score, derived from first principles
The composite is a weighted sum of six normalised metrics. Each weight is chosen so that a real, useful improvement on one metric moves the score about as much as a comparable improvement on any other; §09 gives the rationale for each weight.
Lower composite is better. The leaderboard's default sort is by composite ascending.
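Written out with the six terms and weights listed in §09, the composite takes roughly the following form. This is a reconstruction from the weight table rather than a formula quoted from the spec; the tilde marks a normalised metric, and the normalisation itself is not described in this section.

```latex
S \;=\; 1.0\,\widetilde{\mathrm{MAE}}_{C_l}
   \;+\; 10\,\widetilde{\mathrm{MAE}}_{C_d}
   \;+\; 0.5\,\widetilde{\mathrm{MAE}}_{C_m}
   \;+\; 0.2\,\bigl(1 - \rho_{L/D}\bigr)
   \;+\; 0.1\,\mathrm{OOD}
   \;+\; 0.001\,\mathrm{latency}_{p50}\,[\mathrm{ms}]
```

Every term is an error, a penalty, or a cost, which is why a lower value of S ranks higher.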
§04 What we archive
Every submission's artefact is sealed in R2 for 24 months. If contamination is later suspected we can replay the same artefact against a fresh test set. The bundle below is what gets archived per submission.
§05 Audit and replay
If a submission is suspected of having seen the test set during training, we can replay it against a freshly-rotated quarter without the submitter touching anything. The score difference is the audit signal.
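The audit signal described above can be sketched as a simple difference between two composite scores. The tolerance is a hypothetical parameter, not a published threshold; in practice it would need to account for the normal quarter-to-quarter variation caused by rotating 50 airfoils.

```python
def audit_delta(original_score: float, replay_score: float) -> float:
    """Replay score minus original score.

    Lower composite is better, so a positive delta means the model did
    worse on the freshly rotated quarter than on the one it was scored on.
    """
    return replay_score - original_score


def needs_review(original_score: float, replay_score: float, tolerance: float) -> bool:
    """Flag a submission whose score degrades by more than the (hypothetical) tolerance."""
    return audit_delta(original_score, replay_score) > tolerance


# Illustrative numbers only, not real results.
delta = audit_delta(0.40, 0.55)
flagged = needs_review(0.40, 0.55, tolerance=0.05)
```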
§06 SU2 CFD configuration
Every ground-truth label in RotorBench-Aero v0.1 is produced by the same deterministic SU2 configuration. The parameters below are frozen for the lifetime of this major version.
Failed cases
If SU2 fails to converge for a test case, the case is excluded from that quarter's evaluation and flagged in the report. Models are scored on the remaining converged cases, with the exclusion count noted in the report footer.
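A minimal sketch of this exclusion rule, using a single mean-absolute-error metric for brevity. The `Case` record and its field names are hypothetical and do not reflect the benchmark's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    airfoil_id: str
    converged: bool               # did SU2 converge for this test case?
    abs_error: Optional[float]    # model error; None when there is no label


def score_converged(cases: list[Case]) -> dict:
    """Score only converged cases and report what was excluded."""
    kept = [c for c in cases if c.converged]
    excluded_ids = [c.airfoil_id for c in cases if not c.converged]
    if not kept:
        raise ValueError("no converged cases to score")
    mae = sum(c.abs_error for c in kept) / len(kept)
    return {
        "mae": mae,
        "n_scored": len(kept),
        "n_excluded": len(excluded_ids),   # the count noted in the report footer
        "excluded_ids": excluded_ids,      # the cases flagged in the report
    }


# Illustrative values only.
report = score_converged([
    Case("a-001", True, 0.02),
    Case("a-002", True, 0.04),
    Case("a-003", False, None),
])
```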
Reference validation
The SU2 configuration was validated against XFOIL for NACA 0012 at Re = 3×10⁶, α = 0°–10°. Cₗ agreement is within 1.2 %; Cₘ agreement is within 8 % in the attached-flow regime.
§07 OOD generalisation subset
The 200-airfoil test set is split into a stable core and a geometrically out-of-distribution (OOD) subset. The OOD score measures how well a model transfers beyond the UIUC training distribution.
§08 Statistical confidence & rank uncertainty
Composite scores are accompanied by 95% confidence intervals. Submissions whose confidence intervals overlap are labelled statistically indistinguishable in every published report, whatever their rank order.
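The section does not state how the intervals are computed. One plausible approach, assumed here for illustration only, is a percentile bootstrap over per-airfoil scores followed by an overlap check. The function names, the resample count, and the sample values are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(per_case_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-case score."""
    rng = random.Random(seed)
    n = len(per_case_scores)
    # Resample the cases with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(per_case_scores, k=n))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True when two (lo, hi) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up per-airfoil scores for one model.
model_a = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.12, 0.11]
ci_a = bootstrap_ci(model_a)
```

Two submissions would then be reported as indistinguishable when `intervals_overlap(ci_a, ci_b)` is true, regardless of which one has the lower point estimate.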
§09 Scoring weight rationale
The six weights in the composite formula are not arbitrary. Each is calibrated so that a meaningful, practically significant improvement on that metric produces comparable scoring impact to its peers.
| Term | Weight | Rationale |
|---|---|---|
| MAE_Cl | 1.0× | Baseline reference weight — lift coefficient error is the primary design-loop signal |
| MAE_Cd | 10× | Cd range is ~10× narrower than Cl; equal absolute error on Cd would otherwise be swamped in the score |
| MAE_Cm | 0.5× | Pitching moment matters for stability analysis but is secondary to lift/drag for most 2D design workflows |
| 1 − ρ_L/D | 0.2× | The L/D rank-correlation term is already bounded (1 − ρ lies in [0, 2] for ρ ∈ [−1, 1]); a small weight is sufficient to reward correct ranking |
| OOD score | 0.1× | Generalisation is a guardrail, not the dominant signal; penalises poor OOD transfer without masking in-distribution accuracy |
| latency_p50 (ms) | 0.001× | Tiebreaker only — significant only above ~100× latency difference; avoids penalising accurate-but-slower models |
The weights were set by the Comparotor team following analysis of the UIUC dataset distribution and SU2 label variance. They are fixed for RotorBench-Aero v0.1 and will be reviewed for v0.2 with design-partner input.
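To see how the weights balance the terms, the sketch below computes each term's contribution to the composite. The weights are the ones in the table; the metric values are invented for illustration, the dictionary keys are hypothetical names, and the metrics are assumed to be already normalised.

```python
# Weights from the table above.
WEIGHTS = {
    "mae_cl": 1.0,
    "mae_cd": 10.0,
    "mae_cm": 0.5,
    "one_minus_rho_ld": 0.2,
    "ood_score": 0.1,
    "latency_p50_ms": 0.001,
}


def contributions(metrics: dict) -> dict:
    """Per-term contribution: weight times metric value."""
    return {name: WEIGHTS[name] * metrics[name] for name in WEIGHTS}


def composite(metrics: dict) -> float:
    """Weighted sum of all six terms; lower is better."""
    return sum(contributions(metrics).values())


# Invented metric values, not real results.
example = {
    "mae_cl": 0.02,
    "mae_cd": 0.002,
    "mae_cm": 0.01,
    "one_minus_rho_ld": 0.05,
    "ood_score": 0.1,
    "latency_p50_ms": 20.0,
}
parts = contributions(example)
score = composite(example)
```

With these invented values, a Cl error of 0.02 and a Cd error of 0.002 contribute the same amount to the score, which is the effect the 10× weight on drag is meant to produce.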
§10 Benchmark versioning policy
Version numbers signal compatibility. Minor versions allow cross-quarter comparison; major versions mark breaking changes where historical scores are not directly comparable.
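The rule can be sketched as a comparison of major version numbers. The tag format `vMAJOR.MINOR` is inferred from names such as v0.1 and v0.2 elsewhere on this page; both helper functions are hypothetical.

```python
import re


def parse_version(tag: str) -> tuple[int, int]:
    """Split a tag such as 'v0.1' into (major, minor)."""
    match = re.fullmatch(r"v(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def scores_comparable(tag_a: str, tag_b: str) -> bool:
    """Scores are directly comparable only within the same major version."""
    return parse_version(tag_a)[0] == parse_version(tag_b)[0]
```

Under this reading, `scores_comparable("v0.1", "v0.2")` is true and `scores_comparable("v0.2", "v1.0")` is false.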
§11 Dispute & replay policy
Submitters can request a verified replay of any run within 90 days. Score disputes are acknowledged within 2 business days and resolved within 14.
Score dispute process
Disputes about test-set composition or scoring weights are treated as benchmark feedback, not individual run disputes, and are considered during the next version review.
§12 Known limitations
Comparotor publishes its limitations explicitly. We believe benchmark transparency requires acknowledging scope boundaries at least as prominently as capabilities.
Want the full spec?
The complete SPEC.md covers the SU2 RANS generation pipeline, the full D1 schema, and the OpenAPI contract.