§02

Methodology

How we score models, why the test set rotates, and the audit trail behind every published number.

rotorbench-aero · v0.1 · 2026Q2

§01 How a model gets scored

Each submission travels through a deterministic six-step pipeline. The runner is hermetic — it pulls the artefact, runs against a fixed test split, and posts results back over a signed webhook.

01 POST /submissions · client → API
02 R2 archive · artefact sealed
03 Cloudflare Queue · eval-runner pulls
04 Inference · 200 airfoils × 240 ops
05 SU2 oracle · high-fidelity labels
06 POST /webhooks · HMAC-signed metrics
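The last step delivers metrics over an HMAC-signed webhook. As a minimal sketch of how a receiver might check such a signature, the snippet below assumes HMAC-SHA256 over the raw request body with a hex-encoded digest. The digest algorithm, the encoding, the function names, and the example payload are all assumptions for illustration, not details taken from the spec.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 digest of the raw body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_signature)


# Hypothetical secret and payload, for illustration only.
secret = b"example-shared-secret"
body = b'{"submission_id": "sub_123", "composite": 0.0412}'
signature = sign_payload(secret, body)
```

Verifying against the raw bytes, before any JSON parsing, avoids mismatches caused by re-serialisation.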

§02 Contamination resistance

The private test set rotates every quarter: 50 of the 200 airfoils are replaced with newly generated, procedurally perturbed variants of parent airfoils under a fixed seed schedule. A submission cannot have pre-trained on the rotated airfoils, because they did not exist when the model was trained.

Rotation schedule, 2026Q2 to 2027Q1: every quarter keeps the stable 150 airfoils and rotates the remaining 50. Legend: stable · rotated this quarter.

The 150 stable airfoils give cross-quarter comparability. A model submitted in 2026Q2 and re-run in 2027Q1 will show its drift against the rotated 50 alongside its persistence on the stable 150.
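One way to picture a fixed seed schedule is sketched below: a per-quarter seed is derived from a base seed and the quarter label, and that seed picks 50 parents plus one perturbation seed each. The hashing scheme, the parent pool, and every name here are hypothetical; the source only states that rotation is procedural and seeded.

```python
import hashlib
import random

STABLE_COUNT = 150   # from the text: stable across quarters
ROTATED_COUNT = 50   # from the text: replaced every quarter


def quarter_seed(base_seed: str, quarter: str) -> int:
    """Derive a reproducible integer seed for a quarter label (assumed scheme)."""
    digest = hashlib.sha256(f"{base_seed}:{quarter}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_test_set(stable_ids, parent_pool, base_seed, quarter):
    """Return the stable airfoils plus 50 (parent, perturbation_seed) pairs."""
    rng = random.Random(quarter_seed(base_seed, quarter))
    parents = rng.sample(parent_pool, ROTATED_COUNT)
    rotated = [(parent, rng.getrandbits(32)) for parent in parents]
    return list(stable_ids), rotated


stable_ids = [f"stable-{i:03d}" for i in range(STABLE_COUNT)]
parent_pool = [f"parent-{i:03d}" for i in range(400)]  # hypothetical pool
q2 = build_test_set(stable_ids, parent_pool, "demo-seed", "2026Q2")
q3 = build_test_set(stable_ids, parent_pool, "demo-seed", "2026Q3")
```

Because the seed depends only on the base seed and the quarter label, the same quarter can be regenerated exactly, while a new quarter yields a different rotated block.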

§03 The composite score, derived from first principles

The composite is a weighted sum of six normalised metrics. Each weight is chosen so that a real, useful improvement on that metric matches its peers in scoring impact; the annotations under the formula give the reasoning for each term.

composite = MAE_Cl + 10·MAE_Cd + 0.5·MAE_Cm + 0.2·(1 − ρ_LD) + 0.1·OOD_score + 0.001·latency_p50_ms
MAE_Cl: primary lift error, directly used in design loops
10·MAE_Cd: weighted up because Cd ranges ~10× narrower than Cl
0.5·MAE_Cm: moment matters less in 2D scoring; halved
0.2·(1 − ρ_LD): the rank correlation ρ_LD lies in [−1, 1], so this term is already bounded in [0, 2]; a small weight is enough
0.1·OOD_score: OOD is a separate guardrail, not the dominant signal
0.001·latency_p50_ms: tiebreaker only; lets a fast model edge a marginally-more-accurate slow one

Lower composite is better. The leaderboard's default sort is by composite ascending.
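The formula translates directly into code. The sketch below mirrors the six terms as written; the `Metrics` container, its field names, and the example numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    mae_cl: float
    mae_cd: float
    mae_cm: float
    rho_ld: float           # L/D rank correlation
    ood_score: float
    latency_p50_ms: float


def composite(m: Metrics) -> float:
    """Weighted sum of the six terms; lower is better."""
    return (
        m.mae_cl
        + 10.0 * m.mae_cd
        + 0.5 * m.mae_cm
        + 0.2 * (1.0 - m.rho_ld)
        + 0.1 * m.ood_score
        + 0.001 * m.latency_p50_ms
    )


# Made-up numbers, for illustration only.
example = Metrics(mae_cl=0.02, mae_cd=0.001, mae_cm=0.01,
                  rho_ld=0.95, ood_score=1.2, latency_p50_ms=5.0)
score = composite(example)


def leaderboard(entries: dict) -> list:
    """Sort submission names by composite, ascending (the default sort)."""
    return sorted(entries, key=lambda name: composite(entries[name]))
```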

§04 What we archive

Every submission's artefact is sealed in R2 for 24 months. If contamination is later suspected we can replay the same artefact against a fresh test set. The bundle below is what gets archived per submission.

r2://comparotor-submissions/<submission_id>/ · retained 24 months
MODEL · model.onnx · 12–500 MB
META · wrapper.json · <2 KB
META · container.txt · <256 B
ATTEST · submitter.txt · <512 B
MANIFEST · SHA-256SUMS · <512 B
META · submitted_at.txt · <32 B
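A bundle like this can be checked against its manifest before any replay. The sketch below assumes SHA-256SUMS uses the common `sha256sum` layout (a hex digest, whitespace, then a file name on each line). That layout is an assumption, and `verify_bundle` is a hypothetical helper rather than part of the Comparotor tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large models are not read into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_bundle(bundle_dir: Path, manifest_name: str = "SHA-256SUMS") -> list[str]:
    """Return the names of files that are missing or whose digest differs."""
    mismatched = []
    for line in (bundle_dir / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        path = bundle_dir / name
        if not path.is_file() or sha256_of(path) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty return value means every listed file matches its recorded digest.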

§05 Audit and replay

If a submission is suspected of having seen the test set during training, we can replay it against a freshly-rotated quarter without the submitter touching anything. The score difference is the audit signal.

01 original score · 2026Q2 test set
02 replay artefact · 2026Q3 test set
03 suspect contamination · if Δ-score > 2σ
04 audit signal · published in run history
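The 2σ rule can be sketched as a small check. The source does not define Δ-score or σ precisely, so the snippet assumes Δ is the replay composite minus the original composite (positive means the model got worse on fresh data, since lower is better) and that σ is a standard deviation of the composite supplied by the caller.

```python
def audit_signal(original_score: float, replay_score: float,
                 sigma: float, threshold_sigmas: float = 2.0) -> dict:
    """Flag a run when its replay score degrades by more than 2 sigma.

    Assumption: delta = replay - original, so a positive delta means
    the model scored worse on the freshly rotated test set.
    """
    delta = replay_score - original_score
    return {"delta": delta, "suspect": delta > threshold_sigmas * sigma}


# Made-up scores, for illustration only.
flagged = audit_signal(original_score=0.10, replay_score=0.16, sigma=0.02)
clean = audit_signal(original_score=0.10, replay_score=0.12, sigma=0.02)
```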

§06 SU2 CFD configuration

Every ground-truth label in RotorBench-Aero v0.1 is produced by the same deterministic SU2 configuration. The parameters below are frozen for the lifetime of this major version.

su2 solver parameters
Solver: SU2 v7.5 RANS (Reynolds-Averaged Navier-Stokes)
Turbulence model: Spalart-Allmaras one-equation model
Mesh topology: C-type structured mesh, ~80,000 cells
Wall resolution: y⁺ ≈ 1 at wall
Boundary conditions: Freestream velocity; angle of attack set via flow direction; far-field boundary at 20 chord lengths
Convergence criterion: Residual reduction of 6 orders of magnitude or 5,000 iterations (whichever first)

Failed cases

If SU2 fails to converge for a test case, the case is excluded from that quarter's evaluation and flagged in the report. Models are scored on the remaining converged cases, with the exclusion count noted in the report footer.
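The exclusion rule can be sketched as follows. The snippet assumes a case counts as converged only when its residual has dropped by six orders of magnitude, so a run that stops at the iteration cap without that drop is treated as failed. That reading, the field names, and the use of a single absolute error per case are assumptions.

```python
import math
from statistics import fmean


def has_converged(initial_residual: float, final_residual: float,
                  orders: float = 6.0) -> bool:
    """True when the residual fell by at least `orders` orders of magnitude."""
    return math.log10(initial_residual / final_residual) >= orders


def score_converged(cases: list[dict]) -> dict:
    """Average the error over converged cases and count the exclusions.

    Each case is a dict with 'converged' (bool) and 'abs_error' (float).
    """
    kept = [c["abs_error"] for c in cases if c["converged"]]
    if not kept:
        raise ValueError("no converged cases to score")
    return {"mae": fmean(kept), "excluded": len(cases) - len(kept)}


# Made-up cases, for illustration only.
cases = [
    {"converged": True, "abs_error": 0.01},
    {"converged": True, "abs_error": 0.03},
    {"converged": False, "abs_error": 0.50},
]
report = score_converged(cases)
```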

Reference validation

The SU2 configuration was validated against XFOIL for NACA 0012 at Re = 3×10⁶, α = 0°–10°. Cₗ agreement is within 1.2 %; Cₘ agreement is within 8 % in the attached-flow regime.

§07 OOD generalisation subset

The 200-airfoil test set is split into a stable core and a geometrically out-of-distribution (OOD) subset. The OOD score measures how well a model transfers beyond the UIUC training distribution.

test-set composition · 200 airfoils total
Core set: 150 airfoils, stable across quarters, sourced from the UIUC dataset distribution
OOD subset: 50 airfoils, geometrically out-of-distribution: supercritical sections, wind-turbine profiles (thick trailing edges), and eVTOL-relevant low-Re sections
OOD score definition: Composite MAE on the 50 OOD airfoils, normalised by the core-set MAE
Interpretation: OOD score < 1.0: model generalises well beyond training distribution. OOD score > 1.0: model degrades on unseen geometries.
Rotation: The OOD subset rotates quarterly while the 150-airfoil core set stays fixed. Each quarter, new OOD airfoils are selected from outside the current UIUC training distribution.
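As a worked illustration of the ratio, the sketch below divides the mean error on the OOD airfoils by the mean error on the core set. Treating "composite MAE" as one absolute error per airfoil is an assumption, and the inputs are made up.

```python
from statistics import fmean


def ood_score(core_errors: list[float], ood_errors: list[float]) -> float:
    """OOD MAE normalised by core-set MAE; values above 1.0 mean degradation."""
    core_mae = fmean(core_errors)
    if core_mae <= 0:
        raise ValueError("core-set MAE must be positive to normalise")
    return fmean(ood_errors) / core_mae


# Made-up per-airfoil errors, for illustration only.
degraded = ood_score(core_errors=[0.01, 0.03], ood_errors=[0.03, 0.05])
```

Here the OOD error is twice the core error, so the score is about 2.0 and the model would be read as degrading on unseen geometries.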

§08 Statistical confidence & rank uncertainty

Composite scores are accompanied by 95% confidence intervals. When the confidence intervals of two models overlap, their rank difference is labelled statistically indistinguishable in every published report.

confidence & rank rules
Confidence interval method: Bootstrap with 1,000 resamples drawn from the 200 test cases; a 95% CI is reported with every composite score
Statistically indistinguishable: When the 95% confidence intervals of two models overlap, the rank difference between them is flagged as “statistically indistinguishable” in published reports
Tie threshold: Models within 0.002 composite score of each other are considered tied for ranking purposes
Quarter-to-quarter validity: Score comparisons across quarters are valid only on the stable-150 subset. The 50 rotating OOD airfoils contribute exclusively to the OOD score component and must not be compared across quarters.
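These rules can be sketched with a percentile bootstrap. The resampled statistic here is a plain mean of per-case errors, which stands in for the full composite; that simplification, the percentile method, and the function names are assumptions.

```python
import random
from statistics import fmean


def bootstrap_ci(values: list[float], n_resamples: int = 1000,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-case values."""
    rng = random.Random(seed)
    n = len(values)
    # Each resample draws n cases with replacement and records their mean.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def tied(score_a: float, score_b: float, threshold: float = 0.002) -> bool:
    """Apply the tie threshold on composite scores."""
    return abs(score_a - score_b) <= threshold
```

Fixing the seed makes the interval reproducible, which matters if the same CI has to be regenerated during a replay.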

§09 Scoring weight rationale

The six weights in the composite formula are not arbitrary. Each is calibrated so that a meaningful, practically significant improvement on that metric produces comparable scoring impact to its peers.

Term · Weight · Rationale
MAE_Cl · 1.0× · Baseline reference weight: lift coefficient error is the primary design-loop signal
MAE_Cd · 10× · Cd range is ~10× narrower than Cl; equal absolute error on Cd would otherwise be swamped in the score
MAE_Cm · 0.5× · Pitching moment matters for stability analysis but is secondary to lift/drag for most 2D design workflows
1 − ρ_L/D · 0.2× · The L/D rank correlation is bounded in [−1, 1], so this term lies in [0, 2]; a small weight is sufficient to reward correct ranking
OOD score · 0.1× · Generalisation is a guardrail, not the dominant signal; penalises poor OOD transfer without masking in-distribution accuracy
latency_p50 (ms) · 0.001× · Tiebreaker only: significant only above ~100× latency difference; avoids penalising accurate-but-slower models

The weights were set by the Comparotor team following analysis of the UIUC dataset distribution and SU2 label variance. They are fixed for RotorBench-Aero v0.1 and will be reviewed for v0.2 with design-partner input.

§10 Benchmark versioning policy

Version numbers signal compatibility. Minor versions allow cross-quarter comparison; major versions mark breaking changes where historical scores are not directly comparable.

RotorBench-Aero v{major}.{minor}
Minor version (v0.1 → v0.2): Updated test-set composition, scoring weights, or SU2 configuration. All historical scores remain comparable within the same major version.
Major version (v0 → v1): Breaking changes to the prediction task or dataset scope. Historical scores are not directly comparable across major versions.
Frozen specification: Each version has a frozen specification stored in the public docs at the time of release.
Quarterly test-set rotation: Not a version change; it is the expected anti-contamination mechanism. Rotation operates within a fixed benchmark version.
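The comparability rule reduces to a check on the major version. The sketch below assumes tags of the form `v{major}.{minor}` as shown above; the helper names are hypothetical.

```python
import re


def parse_version(tag: str) -> tuple[int, int]:
    """Split a tag such as 'v0.1' into (major, minor)."""
    match = re.fullmatch(r"v(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def scores_comparable(tag_a: str, tag_b: str) -> bool:
    """Scores are comparable when both runs share a major version."""
    return parse_version(tag_a)[0] == parse_version(tag_b)[0]
```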

§11 Dispute & replay policy

Submitters can request a verified replay of any run within 90 days of its original evaluation date. Score disputes are acknowledged within 2 business days and resolved within 14 business days.

replay eligibility
Replay window: Any submitter may request a replay within 90 days of the original evaluation date.
Replay guarantee: The replay re-runs the exact artefact (verified by SHA-256) against the same frozen test set archived in R2. Replay results are authoritative.
Discrepancy handling: If replay results differ from the original by more than floating-point tolerance, the discrepancy is investigated and published in the run history.

Score dispute process

01 Email [email protected] with your run ID within 30 days of the run report date.
02 Acknowledgement within 2 business days.
03 Resolution within 14 business days.

Disputes about test-set composition or scoring weights are treated as benchmark feedback, not individual run disputes, and are considered during the next version review.
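The dispute timeline can be sketched as date arithmetic. The snippet assumes business days are Monday to Friday with no holiday calendar, and that both the acknowledgement and the resolution clocks start on the day the dispute is received; neither detail is stated in the policy.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays (Mon-Fri), ignoring public holidays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def dispute_deadlines(report_date: date, received: date) -> dict:
    """Return the acknowledgement and resolution dates for a dispute."""
    if received > report_date + timedelta(days=30):
        raise ValueError("dispute filed outside the 30-day window")
    return {
        "acknowledge_by": add_business_days(received, 2),
        "resolve_by": add_business_days(received, 14),
    }


# Hypothetical dates: report on 20 April 2026, dispute received Friday 1 May 2026.
deadlines = dispute_deadlines(date(2026, 4, 20), date(2026, 5, 1))
```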

§12 Known limitations

Comparotor publishes its limitations explicitly. We believe benchmark transparency requires acknowledging scope boundaries at least as prominently as capabilities.

v0.1 scope boundaries
2D only: RotorBench-Aero v0.1 tests only 2D airfoil sections. 3D rotor, propeller, and turbomachinery effects are not evaluated. A 3D extension is planned for v1.0.
Attached flow: The SU2 configuration and UIUC dataset are most reliable in the attached-flow regime (α < 12° for most sections). Post-stall and deep-stall predictions are less well-validated.
Incompressible + low-subsonic: Test cases cover Ma 0.05–0.5. Transonic and compressible effects are not the primary focus of v0.1.
Single turbulence model: Labels are generated with Spalart-Allmaras only. Alternative RANS closures (k-ω SST) and LES validation are not included.
No uncertainty propagation: Scores measure point-estimate accuracy. Uncertainty quantification (UQ) output from probabilistic models is not currently scored.
Latency on a single platform: Latency is measured on the Comparotor evaluation runner hardware. Real-world latency will vary with deployment hardware, accelerator availability, and batching strategy.

Want the full spec?

The complete SPEC.md covers the SU2 RANS generation pipeline, the full D1 schema, and the OpenAPI contract.