Skill Rating for Olympic and Paralympic Sport

Definition

Plain-language

Olympic-sport skill rating is a different statistical problem from general skill rating. It is defined by three structural constraints that off-the-shelf systems (Elo, Glicko, TrueSkill, OpenSkill) do not address: (1) the benchmark calendar is sparse, with roughly one World Championships per year and one Olympic Games per four-year cycle, and with World Cups attended unevenly across nations for geographic or qualification reasons; (2) the comparison graph is disconnected, because top athletes from different regions may never compete directly, so their ratings are local maxima inside isolated pockets, not points on a shared global scale; and (3) within-event volatility is high and consequential, since a single bad day at a peak event, weighted as if it were ordinary evidence, can crater the rating of the athlete who is actually best. The correct response is not a better Elo K-factor but a different kind of model entirely: a hierarchical Bayesian state-space model with heavy-tailed observation noise, geometric event-importance weighting, partial-identification handling for disconnected graph components, and explicit between-/within-season trajectory decomposition. The rating is no longer a single number per athlete. It is a posterior distribution, and often a set-valued one when the data do not support a point estimate.

Formal sketch. Let $r_{i, t} \in \mathbb{R}$ be the latent skill of athlete $i$ at time $t$. A multi-competitor scored event $e$ at time $t_{e}$ with importance weight $w_{e}$ generates an observed score $y_{i, e}$ for each athlete $i \in A_{e}$. The model is

$$
\begin{aligned}
r_{i,t} &= r_{i,t-1} + \eta_{i} + \epsilon_{i,t}^{\mathrm{state}}, & \epsilon_{i,t}^{\mathrm{state}} &\sim N(0, \tau_{i}^{2}), \\
y_{i,e} &= g(\phi_{i,e} - r_{i,t_{e}}) + \epsilon_{i,e}^{\mathrm{obs}}, & \epsilon_{i,e}^{\mathrm{obs}} &\sim t_{\nu}(0, \sigma_{i}^{2}), \\
\Pr[\pi_{e} \mid \{r_{i,t_{e}}\}, w_{e}] &\propto \prod_{k} \frac{\exp(w_{e} \cdot r_{\pi_{e}(k), t_{e}})}{\sum_{j \in A_{e}^{(k)}} \exp(w_{e} \cdot r_{j, t_{e}})}
\end{aligned}
$$

where $η_{i}$ is athlete-specific drift (career arc), $ϕ_{i, e}$ is a within-season form term, $g$ is a monotone score transform (2205.10746), $t_{ν}$ is a Student-t distribution with $ν$ degrees of freedom, used in place of Gaussian noise for heavy-tail robustness (2502.18206), and the last line is Plackett–Luce over the observed finishing order $π_{e}$ with geometric event weight $w_{e}$ (1907.05082). In that line, $A_{e}^{(k)}$ denotes the athletes still unplaced at position $k$. When the comparison graph induced by $\{A_{e}\}$ is disconnected, the posterior over $r_{i, t}$ is reported as a set rather than a point (2410.18272).
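
The Plackett–Luce term of the formal sketch can be evaluated directly. The following is a minimal sketch, assuming skills are already known for one event; the function name `plackett_luce_loglik` and its argument layout are illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def plackett_luce_loglik(skills, order, weight=1.0):
    """Log-probability of a finishing order under weighted Plackett-Luce.

    skills: latent skills r_j of the athletes in the event.
    order:  indices into `skills`, best finisher first.
    weight: event-importance weight w_e multiplying every skill.
    """
    s = weight * np.asarray(skills, dtype=float)[list(order)]
    loglik = 0.0
    for k in range(len(s)):
        remaining = s[k:]            # athletes not yet placed at stage k
        m = remaining.max()          # log-sum-exp shift for numerical stability
        loglik += s[k] - (m + np.log(np.exp(remaining - m).sum()))
    return loglik

# Hypothetical three-athlete event, finishing in skill order.
example = plackett_luce_loglik([1.0, 0.5, 0.0], [0, 1, 2], weight=2.0)
```

With `weight=0.0` every finishing order is equally likely, and a larger weight concentrates probability on the skill-sorted order, which is the sense in which $w_{e}$ makes an event more informative.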

Intuition. The Olympic problem asks how good an athlete is, given that they race the world’s best only once a year, never meet half the field, and might have food poisoning on the day. That is fundamentally a problem of honest uncertainty quantification. Off-the-shelf rating systems answer it by producing a single number with overconfident precision. The hierarchical Bayesian formulation above answers it by producing a posterior whose width reflects what the data actually support. That includes producing intervals rather than point estimates when the comparison graph is too sparse to identify a global ranking. The honest answer to “is the Belgian or the Australian better?” is sometimes “we don’t have the data to say”; Elo and Glicko cannot give that answer, but this model can.

The three failure modes

The off-the-shelf rating literature (Skill Rating Algorithms) is dominated by problems with dense, well-connected comparison graphs — chess servers, MOBA matchmaking queues, LLM arenas. Olympic sport has none of those properties. The failure modes:

Failure mode	Why off-the-shelf systems break	Fix
Sparse benchmark calendar	Elo/Glicko/TrueSkill assume regular comparison density; their bias under attendance selection is unaddressed	Hierarchical Bayesian state-space with event-importance weights
Disconnected regional pockets	A point estimate across disconnected components is mathematically meaningless	Partial identification + stochastic block model
High volatility / bad-day weighting	Gaussian noise / fixed-K SGD treats a meltdown as skill decline	Student-t observation noise + season-trajectory decomposition

The recommended stack

1. Plackett–Luce + monotone score transform

2205.10746 — Athlete rating in multi-competitor games with scored outcomes via monotone transformations. The closest paper in the literature to the Olympic-discipline problem. Multi-competitor scored events (athletics, swimming, biathlon, cross-country) get a monotone transform applied to observed scores before the Plackett–Luce fit. The transform is what gives robustness to a bad day: a 0.5 s slow 100m is not 10× worse than a 0.05 s slow one, but raw-time Elo would treat it that way.
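
To make the bad-day argument concrete, here is a small sketch of one concave, strictly increasing transform applied to a time deficit. The `log1p` form and the 0.1 s scale are assumptions chosen for illustration; they are not the transform used in 2205.10746.

```python
import numpy as np

def compress_deficit(deficit_s, scale_s=0.1):
    """Concave, strictly increasing map of a non-negative time deficit in seconds.

    Hypothetical transform: log(1 + deficit / scale). Finishing order is
    preserved, but large deficits are compressed relative to small ones.
    """
    d = np.asarray(deficit_s, dtype=float)
    return np.log1p(d / scale_s)

small, large = compress_deficit([0.05, 0.5])
ratio = large / small  # well below the raw-time ratio of 10
```

Because the map is monotone, the ordering fed to the Plackett–Luce fit is unchanged; only the magnitude of an unusually slow result is damped.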

For judged disciplines (gymnastics, diving, figure skating, freestyle skiing): add a judge-bias term per 1807.10055, which separates athlete skill from judge effects in a rigorous way.

For head-to-head combat sports (judo, taekwondo, wrestling): use a hierarchical Bradley–Terry model, but note that the comparison graph is partitioned by weight class (and, in Paralympic events, by impairment class) and is therefore naturally disconnected. Read §4 below first.

2. Geometric event-importance weighting

1907.05082 — How should we score athletes and candidates: geometric scoring rules. Designed for the Olympic-style problem where one Worlds is worth more than five regional opens. Geometric scoring rules down-weight the long tail of low-importance results by construction — a finish at a tertiary event cannot dominate the rating no matter how good. The principled version of every hand-tuned federation point system.
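
One simple way to read "one Worlds is worth more than five regional opens" is to let event weights decay geometrically by tier. The sketch below assumes a hypothetical tier structure and decay ratio; neither is taken from 1907.05082, and the helper name is illustrative.

```python
def geometric_event_weights(n_tiers, ratio):
    """Weight for tier k (0 = most important) is ratio ** -k.

    The ratio is a hypothetical tuning parameter, not a published value.
    """
    if ratio <= 1:
        raise ValueError("ratio must exceed 1 so that weights decay")
    return [ratio ** -k for k in range(n_tiers)]

weights = geometric_event_weights(n_tiers=3, ratio=6.0)
worlds_weight, regional_weight = weights[0], weights[1]
# With ratio > 5, five regional results cannot outweigh one Worlds result.
five_regionals = 5 * regional_weight
```

The design point is that the ratio, rather than a hand-tuned points table, fixes how many lower-tier results are needed to match one top-tier result.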

3. Volatility handling — heavy tails + season decomposition

Three complementary papers:

2502.18206 — Robust Kalman filtering via normal variance mixtures. Student-t observation noise; 3σ days update the skill posterior much less than Gaussian noise would.
2405.17214 — Between- and within-season trajectories in elite athletic performance. Decomposes “peaking for Worlds” from “career arc”; bad finishes at the peak event get attributed to within-season variance, not declining skill.
2101.08175 — Bayesian GARCH for sports data. Models volatility itself as latent, so naturally-volatile athletes get wider posteriors rather than spurious certainty.
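
The effect of heavy-tailed noise on a single skill update can be seen in a scalar example. This is a minimal sketch, assuming a Gaussian prior on skill and one observation; it uses a standard iteratively reweighted update based on the scale-mixture view of the Student-t distribution, and is not the algorithm of 2502.18206.

```python
def gaussian_update(prior_mean, prior_var, y, obs_sd):
    """Posterior mean for a Gaussian prior and Gaussian observation noise."""
    gain = prior_var / (prior_var + obs_sd ** 2)
    return prior_mean + gain * (y - prior_mean)

def student_t_update(prior_mean, prior_var, y, obs_sd, dof, n_iter=100):
    """Posterior mode under Student-t observation noise.

    Each pass treats the observation as Gaussian with an inflated variance;
    the inflation grows with the squared standardized residual.
    """
    mean = prior_mean
    for _ in range(n_iter):
        z2 = ((y - mean) / obs_sd) ** 2
        lam = (dof + 1.0) / (dof + z2)   # precision weight, small for outliers
        eff_var = obs_sd ** 2 / lam      # effective observation variance
        gain = prior_var / (prior_var + eff_var)
        mean = prior_mean + gain * (y - prior_mean)
    return mean

# Hypothetical meltdown: a result five noise standard deviations below prior skill.
gauss = gaussian_update(0.0, 1.0, -5.0, 1.0)
robust = student_t_update(0.0, 1.0, -5.0, 1.0, dof=4.0)
```

In this example the Gaussian update moves the skill estimate halfway to the outlier, while the Student-t update with 4 degrees of freedom moves it much less.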

4. Disconnected-graph handling — partial identification

The most important and least obvious choice. Almost no off-the-shelf system handles it honestly.

2410.18272 — Partially Identified Rankings from Pairwise Interactions. When the comparison graph has multiple components or weak bridges, the right answer is a set of ratings consistent with the data, reported as intervals — not a single number. The honest answer to “Athlete A only races Europeans, Athlete B only races Americans, who is better?” is “between 3rd and 7th globally, with these specific candidates” — not a misleading point.
2511.03467 — The Bradley–Terry Stochastic Block Model. When disconnection is partial (a few cross-region matches per year), models regional pockets as blocks with partial pooling. Information flows between blocks through sparse cross-region matches; each block keeps its own scale.
2304.06821 — Ranking from Pairwise Comparisons in General Graphs and Graphs with Locality. Sample-complexity bounds — how many cross-component comparisons are required to recover a global ranking? Direct input to calendar design.
2207.01455 — Dynamic Ranking and Translation Synchronization. Components that connect intermittently over time (different eras / continents) get aligned via time-overlapping career arcs.
2002.08853 — A General Pairwise Comparison Model for Extremely Sparse Networks. BT-MLE consistency conditions under sparsity; tells you when the data is even adequate before fitting.
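
Before any of these models is fitted, it helps to check whether the comparison graph is connected at all. The sketch below uses a union-find pass over event start lists; the function name and input layout are illustrative assumptions.

```python
def comparison_components(events):
    """Group athletes into connected components of the comparison graph.

    events: iterable of iterables; each inner iterable lists the athletes
    who competed against each other in one event.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for field in events:
        field = list(field)
        for athlete in field:
            find(athlete)                  # register every athlete
        for athlete in field[1:]:
            parent[find(athlete)] = find(field[0])

    groups = {}
    for athlete in parent:
        groups.setdefault(find(athlete), set()).add(athlete)
    return list(groups.values())

# Hypothetical calendar: two regional pockets with no shared event.
pockets = comparison_components([["A", "B"], ["B", "C"], ["X", "Y"]])
```

More than one component means that no global point ranking is identified, and interval or set-valued reporting is the honest output for cross-component comparisons.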

5. Calendar design — the upstream fix

1207.6430 — Optimal Data Collection For Informative Rankings Expose Well-Connected Graphs. The cheapest way to get a good rating is to schedule matches that connect the graph — a handful of well-placed cross-region matches per year dominates clever statistical modelling on a badly-connected calendar. Send to whoever designs the federation calendar.

1109.3701 — Active Ranking using Pairwise Comparisons. Active-learning view: which next matches reduce posterior uncertainty most? Useful for wildcard/invitation decisions.
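
The value of a single bridge match can be quantified with the algebraic connectivity of the comparison graph, the second-smallest eigenvalue of its Laplacian. This is a sketch of one common connectivity measure under an assumed toy calendar; it is not presented as the exact criterion of either paper.

```python
import numpy as np

def algebraic_connectivity(n_athletes, matches):
    """Second-smallest eigenvalue of the comparison-graph Laplacian.

    matches: iterable of (i, j) athlete index pairs; repeated pairs add weight.
    A value of zero means the graph is disconnected.
    """
    lap = np.zeros((n_athletes, n_athletes))
    for i, j in matches:
        lap[i, i] += 1
        lap[j, j] += 1
        lap[i, j] -= 1
        lap[j, i] -= 1
    return np.linalg.eigvalsh(lap)[1]  # eigenvalues are returned in ascending order

# Hypothetical calendar: two regional triangles, then one cross-region match.
regional = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
before = algebraic_connectivity(6, regional)
after = algebraic_connectivity(6, regional + [(2, 3)])
```

Comparing `before` and `after` for candidate matches gives a rough ordering of which invitations do the most to tie regional pockets together.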

Federation case studies

1806.08259 — Dynamic Network 3—0 FIFA Rankings. Forensic critique of the old FIFA ranking + replacement design. The exploitability section is the key read — the old system rewarded avoiding strong opponents, the same pathology as athletes skipping each others’ World Cups.
2201.00691 — FIFA ranking evaluation of the new Elo-based system. Comparable for a federation-scale redesign.
1705.05831 — ATP points system is predictively worse than Elo despite more data, because it isn’t statistically motivated. The direct argument against hand-tuned point systems.
2411.02000 — Bayesian biathlon performance modelling. Worked Olympic-discipline example with the right ingredients.
2409.05714 — Dynamic ranking for the Men’s Ice Hockey World (Junior) Championships. Single-benchmark-event forecasting; the Olympic shape.
2510.14723 — Bayesian Olympic medal table; cross-discipline aggregation for national strength.

Evaluation

The right evaluation set is the thing the system will be used to predict. For Olympic-discipline rating that is one benchmark event per year, with one peak Games per cycle — not a stream of arbitrary pairwise matches. The evaluation design follows from this.

Primary: the last Olympic cycle as a held-out window

Hold out the most recent complete Olympic cycle (≈4 years, 4 Worlds + 1 Games per discipline) as the final test set. Train on everything strictly before the cycle starts; do not retrain inside the cycle. This is closer to deployment than rolling-origin retrain — federations fit once per quad and live with it — and it forces honesty about new-athlete cold-start in a way that rolling retrain papers over.

Three caveats on the cycle holdout

One Olympics ≠ one evaluation. A single Games is n = 1 on the headline metric; bootstrap CIs over it are meaningless. Report metrics across all ~5 benchmark events in the cycle (4 Worlds + 1 Games), with the Games as a highlighted line and the Worlds as the variance-reduction backbone. Bootstrap over events.
Hyperparameters need their own holdout. Prior strength, season-decay half-life, Student-t degrees of freedom, SBM block prior — these silently overfit if tuned on the test cycle. Use the prior Olympic cycle as the development set: train ≤ T−8y, dev T−8y…T−4y (lock hyperparameters), refit ≤ T−4y, test T−4y…T.
Sport mix is non-stationary across cycles. Sport climbing and skateboarding entered at Tokyo 2020 and breaking at Paris 2024; karate appeared at Tokyo 2020 and was dropped for Paris 2024. Cold-start athletes in newly-added sports will dominate the loss without telling you anything about rating quality. Stratify reported metrics by “stable” vs “new” disciplines, or restrict the comparison set.
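
The nested holdout in the second caveat can be written down as explicit time windows. The sketch below assumes windows that exclude their start and include their end, so that "train ≤ T−8y" and the closing Games both land where the text places them; that boundary convention and the helper names are assumptions.

```python
def cycle_splits(test_end, cycle_years=4):
    """Return (start, end] time windows for the nested cycle holdout.

    A start of None means "everything up to and including end".
    """
    test_start = test_end - cycle_years
    dev_start = test_start - cycle_years
    return {
        "train": (None, dev_start),       # fit candidate models here
        "dev": (dev_start, test_start),   # tune, then lock hyperparameters
        "refit": (None, test_start),      # refit once with locked hyperparameters
        "test": (test_start, test_end),   # held-out cycle, never used for tuning
    }

def in_window(t, window):
    """True if time t falls in the half-open window (start, end]."""
    start, end = window
    return (start is None or t > start) and t <= end

splits = cycle_splits(2024)
```

Filtering every event through `in_window` before fitting makes it harder to leak test-cycle results into hyperparameter tuning by accident.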

Metrics — all four; any one alone is gameable

Metric	What it catches
Spearman ρ on full finishing order	Global rank quality
NDCG@10, podium top-3 hit rate	What federations and broadcasters actually care about
Log-loss on induced pairwise win probabilities	Properly scored probabilistic forecast
90% credible-interval coverage on individual placings	Honesty — punishes models that hide uncertainty; the test where partial-identification (2410.18272) earns its keep
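
Three of the four metrics are short enough to sketch for a single event. The code below assumes tie-free integer placings (1 = winner) and omits NDCG and log-loss for brevity; the function names are illustrative.

```python
import numpy as np

def spearman_rho(predicted_rank, actual_rank):
    """Spearman correlation for two tie-free rankings of the same athletes."""
    p = np.asarray(predicted_rank, dtype=float)
    a = np.asarray(actual_rank, dtype=float)
    n = len(p)
    return 1.0 - 6.0 * np.sum((p - a) ** 2) / (n * (n ** 2 - 1))

def podium_hit_rate(predicted_rank, actual_rank, k=3):
    """Share of the actual top-k that the model also placed in its top-k."""
    p = np.asarray(predicted_rank)
    a = np.asarray(actual_rank)
    return float(np.mean(p[a <= k] <= k))

def interval_coverage(lower, upper, actual_rank):
    """Fraction of actual placings inside the model's credible intervals."""
    a = np.asarray(actual_rank)
    return float(np.mean((np.asarray(lower) <= a) & (a <= np.asarray(upper))))

# Hypothetical five-athlete event.
predicted = [1, 2, 3, 4, 5]
actual = [2, 1, 3, 5, 4]
rho = spearman_rho(predicted, actual)
podium = podium_hit_rate(predicted, actual)
coverage = interval_coverage([1, 1, 2, 3, 5], [2, 1, 4, 5, 5], actual)
```

Reported coverage should be compared against the nominal level of the intervals; a 90% interval that covers far fewer placings indicates hidden uncertainty.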

Diagnostics within the held-out cycle

Not replacements for the cycle-level metrics — additional probes that localize where a model is winning or losing.

Bridge-match holdout. Within the cycle, score log-loss on cross-region / cross-block matches separately from within-block matches. A model that scores well within-block but badly on bridges is hiding a non-identified ranking — directly tests the SBM (2511.03467) and partial-ID claims.
Bad-day stress test. Identify “meltdown finishes” — a defending top-5 athlete finishing outside top 20 at a Worlds/Games, conditional on normal season form. Check that the model’s next-event prediction does not crater. Direct empirical test for the Student-t observation noise (2502.18206) and season-trajectory decomposition (2405.17214). Vanilla Elo and Glicko fail this loudly.
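
The bridge-match holdout amounts to scoring the same log-loss on two subsets of matches. The sketch below assumes pairwise predictions are already available and that each athlete has a known region or block label; the tuple layout and names are illustrative.

```python
import math

def split_log_loss(matches, block_of):
    """Mean log-loss on within-block and cross-block (bridge) matches.

    matches:  iterable of (athlete_a, athlete_b, p_a_wins, a_won) tuples.
    block_of: dict mapping each athlete to a region or block label.
    """
    losses = {"within": [], "bridge": []}
    for a, b, p_a_wins, a_won in matches:
        p = min(max(p_a_wins, 1e-12), 1 - 1e-12)  # guard against log(0)
        loss = -math.log(p if a_won else 1.0 - p)
        key = "within" if block_of[a] == block_of[b] else "bridge"
        losses[key].append(loss)
    return {k: (sum(v) / len(v) if v else float("nan")) for k, v in losses.items()}

# Hypothetical held-out matches.
blocks = {"A": "EU", "B": "EU", "X": "US"}
held_out = [("A", "B", 0.8, True), ("A", "X", 0.5, False)]
scores = split_log_loss(held_out, blocks)
```

A large gap between the two numbers, with bridges worse, is the signature of a model that ranks well inside each pocket but has not identified the scale between pockets.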

What not to use

Random pairwise holdout across all events — leaks future state through the time-varying latent skill.
A single “test season” — one season’s noise dominates; ~5 benchmark events is the floor for talking about systematic differences.
Accuracy on binary “winner predicted” — throws away the rank, and benchmark events are routinely decided by tenths of a percent.

What this is not

Not a recommendation to deploy raw Elo, Glicko-2, or TrueSkill on Olympic data. They will produce confident-looking numbers that are quietly wrong: biased toward whichever region the athlete competes in most, brittle to single-event volatility, and silently averaging over disconnected comparison-graph components as if they were the same scale.

Not a single-line library install. Vanilla Glicko-2 is one config line; this stack is a PyMC / Stan / NumPyro programme on the order of 200–400 lines. The cost is real but one-time. The output is calibrated win-probabilities with honest uncertainty intervals — including the answer “we don’t have enough data to rank these two”, which off-the-shelf systems cannot produce.

Not a Paralympic-specific review. The arXiv-indexed literature is thin on Paralympic-specific rating; the recommendations above transfer at the statistical level but the impairment-class structure is a layer that needs its own follow-up search.

Honest caveats

Adversarial dynamics at federation scale. If the ranking determines Olympic qualification, athletes and federations will optimise against it. Read the FIFA exploitability critique (1806.08259) before deploying.
Compute is not the constraint. A full posterior over a 500-athlete population with ~5 years of competition data is a 10-second NUTS run on a MacBook Pro M5 Max. No GPU. Build the right model.
The retrieval that produced this review has corpus-level Recall@100 ≈ 0.76 — ~24% of relevant work may be missed. Re-run this survey once Phase 2 dense + Phase 3 cross-encoder retrieval ships; the Paralympic gap in particular is the kind of semantic-rather-than-lexical miss that dense retrieval is built for. Provenance: Rapid Literature Search for Sports AI.

Daniel ML Evans
