§02

Methodology

How we score models, why the test set rotates, and the audit trail behind every published number.

rotorbench-aero · v0.1 · 2026Q2

§01 How a model gets scored

Each submission travels through a deterministic six-step pipeline. The runner is hermetic — it pulls the artefact, runs against a fixed test split, and posts results back over a signed webhook.

§02 Contamination resistance

The private test set rotates every quarter: 50 of the 200 airfoils are replaced with freshly generated, procedurally perturbed parent airfoils under a fixed seed schedule. Submissions cannot pre-train on the rotated cases because those cases did not exist when the model was trained.

The 150 stable airfoils give cross-quarter comparability. A model submitted in 2026Q2 and re-run in 2027Q1 will show its drift against the rotated 50 alongside its persistence on the stable 150.
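One way to picture a fixed seed schedule is sketched below in Python. It is only an illustration under stated assumptions: the salt string, the ID formats, the parent pool, and the function names `quarter_seed` and `rotated_test_set` are all hypothetical, and the actual perturbation step is not modelled.

```python
import hashlib
import random


def quarter_seed(quarter: str, schedule_salt: str = "rotorbench-aero") -> int:
    """Derive a reproducible integer seed from a quarter label such as '2026Q2'.

    The salt and the hashing scheme are assumptions made for this sketch.
    """
    digest = hashlib.sha256(f"{schedule_salt}:{quarter}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def rotated_test_set(stable_ids, parent_pool, quarter, n_rotating=50):
    """Return the stable airfoils followed by n_rotating quarter-specific cases."""
    rng = random.Random(quarter_seed(quarter))
    # Sort first so the draw does not depend on the pool's iteration order.
    parents = rng.sample(sorted(parent_pool), n_rotating)
    # Tag each rotating case with the quarter: a perturbed parent is a new case
    # every quarter, even if the same parent happens to be drawn again.
    rotating = [f"{parent}@{quarter}" for parent in parents]
    return list(stable_ids) + rotating
```

Calling `rotated_test_set` twice with the same quarter label returns the same 200 case IDs, while a different quarter label yields a different rotating 50 and leaves the stable 150 untouched.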

§03 The composite score, derived from first principles

The composite is a weighted sum of six normalised metrics. Each weight is chosen so that a real, useful improvement on that metric matches its peers in scoring impact (see the annotations below the formula).

composite = 1·MAE_Cl + 10·MAE_Cd + 0.5·MAE_Cm + 0.2·(1 − ρ_LD) + 0.1·OOD_score + 0.001·latency_p50_ms

1·MAE_Cl

primary lift error — directly used in design loops

10·MAE_Cd

weighted up because Cd ranges ~10× narrower than Cl

0.5·MAE_Cm

moment matters less in 2D scoring; halved

0.2·(1 − ρ_LD)

1 − ρ_LD is already bounded (within [0, 1] for any non-negative rank correlation); small weight is enough

0.1·OOD_score

OOD is a separate guardrail, not the dominant signal

0.001·latency_p50_ms

tiebreaker only — lets a fast model edge a marginally-more-accurate slow one

Lower composite is better. The leaderboard's default sort is by composite ascending.
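The formula above can be written as a short Python sketch. The weights are taken directly from the formula; the `Metrics` container, the function names, and the example values in the usage note are hypothetical and serve only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Per-submission inputs to the composite (field names are illustrative)."""
    mae_cl: float
    mae_cd: float
    mae_cm: float
    rho_ld: float          # L/D rank correlation
    ood_score: float
    latency_p50_ms: float


def composite(m: Metrics) -> float:
    """Weighted sum from the published formula; lower is better."""
    return (
        1.0 * m.mae_cl
        + 10.0 * m.mae_cd
        + 0.5 * m.mae_cm
        + 0.2 * (1.0 - m.rho_ld)
        + 0.1 * m.ood_score
        + 0.001 * m.latency_p50_ms
    )


def leaderboard(entries: dict) -> list:
    """Return submission names sorted by composite, ascending (the default sort)."""
    return sorted(entries, key=lambda name: composite(entries[name]))
```

With made-up inputs MAE_Cl = 0.02, MAE_Cd = 0.002, MAE_Cm = 0.01, ρ_LD = 0.95, OOD_score = 1.1 and latency_p50 = 20 ms, the six terms are 0.02, 0.02, 0.005, 0.01, 0.11 and 0.02, giving a composite of about 0.185.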

§04 What we archive

Every submission's artefact is sealed in R2 for 24 months. If contamination is later suspected we can replay the same artefact against a fresh test set. The bundle below is what gets archived per submission.

r2://comparotor-submissions/<submission_id>/ · retained 24 months

MODEL

model.onnx · ONNX submissions only · 12–500 MB

§05 Audit and replay

If a submission is suspected of having seen the test set during training, we can replay it against a freshly-rotated quarter without the submitter touching anything. The score difference is the audit signal.

§06 SU2 CFD configuration

Every ground-truth label in RotorBench-Aero v0.1 is produced by the same deterministic SU2 configuration. The parameters below are frozen for the lifetime of v0.1.

su2 solver parameters

Solver	SU2 v7.5 RANS (Reynolds-Averaged Navier-Stokes)
Turbulence model	Spalart-Allmaras one-equation model
Mesh topology	C-type structured mesh, ~80,000 cells
Wall resolution	y⁺ ≈ 1 at wall
Boundary conditions	Freestream velocity; angle of attack set via flow direction; far-field boundary at 20 chord lengths
Convergence criterion	Residual reduction of 6 orders of magnitude or 5,000 iterations (whichever comes first)

Failed cases

If SU2 fails to converge for a test case, the case is excluded from that quarter's evaluation and flagged in the report. Models are scored on the remaining converged cases, with the exclusion count noted in the report footer.
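The convergence rule and the exclusion of failed cases could be checked along the following lines. This is a hedged Python sketch, not the runner's code: the residual-history input format and the function names are hypothetical, and it assumes that a run which reaches the 5,000-iteration cap without a 6-order residual drop (or whose residual becomes non-finite) counts as a failure to converge.

```python
import math


def first_converged_iteration(residuals, orders=6.0, max_iters=5000):
    """Return the 1-based iteration at which the residual has dropped by
    `orders` decades relative to the first iteration, or None if that never
    happens within `max_iters` iterations."""
    if not residuals or not (math.isfinite(residuals[0]) and residuals[0] > 0):
        return None
    start = residuals[0]
    for iteration, value in enumerate(residuals[:max_iters], start=1):
        if not math.isfinite(value):
            return None  # diverged (NaN or inf)
        if value <= 0 or math.log10(start / value) >= orders:
            return iteration
    return None  # assumption: hitting the cap without the drop is a failure


def split_cases(histories):
    """Partition case IDs into (converged, excluded) lists."""
    converged, excluded = [], []
    for case_id, residuals in histories.items():
        target = converged if first_converged_iteration(residuals) is not None else excluded
        target.append(case_id)
    return converged, excluded
```

Scoring would then run over the `converged` list only, with `len(excluded)` reported as the exclusion count.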

Reference validation

The SU2 configuration was validated against XFOIL for NACA 0012 at Re = 3×10⁶, α = 0°–10°. Cₗ agreement is within 1.2 %; Cₘ agreement is within 8 % in the attached-flow regime.

§07 OOD generalisation subset

The 200-airfoil test set is split into a stable core and a geometrically out-of-distribution (OOD) subset. The OOD score measures how well a model transfers beyond the UIUC training distribution.

test-set composition · 200 airfoils total

Core set	150 airfoils, stable across quarters, sourced from the UIUC dataset distribution
OOD subset	50 airfoils, geometrically out-of-distribution: supercritical sections, wind-turbine profiles (thick trailing edges), and eVTOL-relevant low-Re sections
OOD score definition	Composite MAE on the 50 OOD airfoils, normalised by the core-set MAE
Interpretation	OOD score < 1.0: model generalises well beyond training distribution. OOD score > 1.0: model degrades on unseen geometries.
Rotation	The OOD subset rotates quarterly while the core 150 stay fixed. New OOD airfoils are selected from outside the current UIUC training distribution each quarter.
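The OOD score definition reduces to a ratio of two means. The Python sketch below assumes that each airfoil already has a single "composite MAE" value; how the per-coefficient errors are combined into that value is not specified here, so the input lists and the function name are hypothetical.

```python
from statistics import fmean


def ood_score(core_errors, ood_errors):
    """Mean error on the OOD airfoils divided by mean error on the core set.

    Both arguments are per-airfoil composite MAE values (an assumed input
    format). A result above 1.0 means the model is worse on OOD geometries.
    """
    core_mae = fmean(core_errors)
    if core_mae <= 0:
        raise ValueError("core-set MAE must be positive to normalise by it")
    return fmean(ood_errors) / core_mae
```

For example, a model with a mean error of 0.02 on the 150 core airfoils and 0.03 on the 50 OOD airfoils would get an OOD score of about 1.5, which reads as degradation on unseen geometries.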

§08 Statistical confidence & rank uncertainty

Composite scores are accompanied by 95% confidence intervals. When the confidence intervals of two models overlap, their rank difference is labelled statistically indistinguishable in every published report.

confidence & rank rules

Confidence interval method	Bootstrap with 1,000 resamples drawn from the 200 test cases; a 95% CI is reported with every composite score
Statistically indistinguishable	When the confidence intervals of two models overlap, a rank change between them is flagged as “statistically indistinguishable” in published reports
Tie threshold	Models within 0.002 composite score of each other are considered tied for ranking purposes
Quarter-to-quarter validity	Score comparisons across quarters are valid only on the stable-150 subset. The 50 rotating OOD airfoils contribute exclusively to the OOD score component and must not be compared across quarters.
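A percentile bootstrap is one common way to obtain such an interval. The Python sketch below makes two simplifying assumptions that are not stated in the rules above: that each test case contributes one per-case score, and that the reported statistic is the mean of those scores. The function names and the fixed default seed are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_ci(per_case_scores, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean per-case score."""
    rng = random.Random(seed)
    n = len(per_case_scores)
    # Each resample draws n cases with replacement and records their mean.
    means = sorted(
        fmean(rng.choices(per_case_scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = min(round((1 + level) / 2 * n_resamples) - 1, n_resamples - 1)
    return means[lo_idx], means[hi_idx]


def intervals_overlap(ci_a, ci_b):
    """True when two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def tied(score_a, score_b, threshold=0.002):
    """True when two composite scores fall within the tie threshold."""
    return abs(score_a - score_b) <= threshold
```

Two models would be flagged as statistically indistinguishable when `intervals_overlap` returns True for their intervals, and treated as tied for ranking when `tied` returns True for their composite scores.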

§09 Scoring weight rationale

The six weights in the composite formula are not arbitrary. Each is calibrated so that a meaningful, practically significant improvement on that metric produces comparable scoring impact to its peers.

Term	Weight	Rationale
MAE_Cl	1.0×	Baseline reference weight — lift coefficient error is the primary design-loop signal
MAE_Cd	10×	Cd range is ~10× narrower than Cl; equal absolute error on Cd would otherwise be swamped in the score
MAE_Cm	0.5×	Pitching moment matters for stability analysis but is secondary to lift/drag for most 2D design workflows
1 − ρ_L/D	0.2×	The term is already bounded (within [0, 1] for any non-negative L/D rank correlation); a small weight is sufficient to reward correct ranking
OOD score	0.1×	Generalisation is a guardrail, not the dominant signal; penalises poor OOD transfer without masking in-distribution accuracy
latency_p50 (ms)	0.001×	Tiebreaker only — significant only above ~100× latency difference; avoids penalising accurate-but-slower models

The weights were set by the Comparotor team following analysis of the UIUC dataset distribution and SU2 label variance. They are fixed for RotorBench-Aero v0.1 and will be reviewed for v0.2 with design-partner input.

§10 Benchmark versioning policy

Version numbers signal compatibility. Minor versions allow cross-quarter comparison; major versions mark breaking changes where historical scores are not directly comparable.

RotorBench-Aero v{major}.{minor}

Minor version (v0.1 → v0.2)	Updated test-set composition, scoring weights, or SU2 configuration. All historical scores remain comparable within the same major version.
Major version (v0 → v1)	Breaking changes to the prediction task or dataset scope. Historical scores are not directly comparable across major versions.
Frozen specification	Each version has a frozen specification stored in the public docs at the time of release.
Quarterly test-set rotation	Not a version change: it is the expected anti-contamination mechanism. Rotation operates within a fixed benchmark version.
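The comparability rule can be expressed as a small check on version tags. This Python sketch assumes tags follow the `v{major}.{minor}` pattern shown above; the function names are hypothetical.

```python
import re


def parse_version(tag: str) -> tuple:
    """Split a tag such as 'v0.1' into (major, minor) integers."""
    match = re.fullmatch(r"v(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"not a v{{major}}.{{minor}} tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def scores_comparable(tag_a: str, tag_b: str) -> bool:
    """Scores are treated as comparable only within the same major version."""
    return parse_version(tag_a)[0] == parse_version(tag_b)[0]
```

Under this rule `scores_comparable("v0.1", "v0.2")` is True and `scores_comparable("v0.2", "v1.0")` is False. Quarterly rotation does not change the tag, so it never affects the result.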

§11 Dispute & replay policy

Submitters can request a verified replay of any run within 90 days. Score disputes are acknowledged within 2 business days and resolved within 14 business days.

replay eligibility

Replay window	Any submitter may request a replay within 90 days of the original evaluation date.
Replay guarantee	The replay re-runs the exact artefact (verified by SHA-256) against the same frozen test set archived in R2. Replay results are authoritative.
Discrepancy handling	If replay results differ from the original by more than floating-point tolerance, the discrepancy is investigated and published in the run history.
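The SHA-256 check and the discrepancy test might look like the Python sketch below. It uses only the standard library; the function names are hypothetical, and the tolerance values are placeholders because the actual floating-point tolerance is not stated.

```python
import hashlib
import hmac
import math


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artefacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artefact(path, expected_sha256):
    """True when the file's SHA-256 matches the digest recorded at submission."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())


def replay_matches(original, replay, rel_tol=1e-9, abs_tol=1e-12):
    """True when two scores agree within tolerance (placeholder values)."""
    return math.isclose(original, replay, rel_tol=rel_tol, abs_tol=abs_tol)
```

A replay would proceed only if `verify_artefact` returns True, and a False result from `replay_matches` would be the trigger for investigating a discrepancy.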

Score dispute process

01 Email [email protected] with your run ID within 30 days of the run report date.

02 Acknowledgement within 2 business days.

03 Resolution within 14 business days.

Disputes about test-set composition or scoring weights are treated as benchmark feedback, not individual run disputes, and are considered during the next version review.

§12 Known limitations

Comparotor publishes its limitations explicitly. We believe benchmark transparency requires acknowledging scope boundaries at least as prominently as capabilities.

v0.1 scope boundaries

2D only	RotorBench-Aero v0.1 tests only 2D airfoil sections. 3D rotor, propeller, and turbomachinery effects are not evaluated. A 3D extension is planned for v1.0.
Attached flow	The SU2 configuration and UIUC dataset are most reliable in the attached-flow regime (α < 12° for most sections). Post-stall and deep-stall predictions are less well-validated.
Incompressible + low-subsonic	Test cases cover Ma 0.05–0.5. Transonic and compressible effects are not the primary focus of v0.1.
Single turbulence model	Labels are generated with Spalart-Allmaras only. Alternative RANS closures (k-ω SST) and LES validation are not included.
No uncertainty propagation	Scores measure point-estimate accuracy. Uncertainty quantification (UQ) output from probabilistic models is not currently scored.
Latency on a single platform	Latency is measured on the Comparotor evaluation runner hardware. Real-world latency will vary with deployment hardware, accelerator availability, and batching strategy.

Want the full spec?

The complete SPEC.md covers the SU2 RANS generation pipeline, the full D1 schema, and the OpenAPI contract.