
Elo

Overview

elo is a feedback-driven selection algorithm that ranks candidate models using an Elo-style rating system based on pairwise comparisons.

It is configured via config/algorithm/selection/elo.yaml.

Paper: RouteLLM: Simple and Effective LLM Routing — uses the Bradley-Terry model for pairwise preference learning.

Key Advantages

  • Reuses historical pairwise feedback instead of only current-request heuristics.
  • Ratings improve over time as more comparisons arrive (online learning).
  • Supports category-aware weighting for routes with distinct workloads.
  • Configurable time decay to gradually forget stale comparisons.
  • Optional cost-aware selection to balance quality vs. price.

Algorithm Principle

Elo rating is built on the Bradley-Terry model, which estimates the probability that model A is preferred over model B:

$$P(A \succ B) = \frac{1}{1 + 10^{(R_B - R_A) / 400}}$$

After each pairwise comparison, ratings are updated:

$$R_A' = R_A + K \cdot (S_A - E_A)$$

Where:

  • $K$ is the learning rate (k_factor, default 32)
  • $S_A$ is the actual outcome (1 = win, 0 = loss, 0.5 = tie)
  • $E_A = P(A \succ B)$ is the expected score
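The two formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the router's actual implementation; the function names are invented, and the defaults simply mirror the k_factor and initial_rating configuration keys.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """P(A beats B) under the Bradley-Terry / Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, outcome_a: float,
           k_factor: float = 32.0) -> tuple[float, float]:
    """Return both ratings after one comparison.

    outcome_a: 1.0 = A wins, 0.0 = A loses, 0.5 = tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k_factor * (outcome_a - e_a)
    new_b = r_b + k_factor * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b

# Two new models start at the default initial_rating of 1500; A wins once.
a, b = update(1500.0, 1500.0, outcome_a=1.0)
print(a, b)  # 1516.0 1484.0
```

Note that the update is zero-sum: the winner gains exactly the rating the loser gives up, and a tie between equally rated models changes nothing.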

When category_weighted is enabled, each decision maintains independent per-category ratings, so a model's performance in "math" doesn't affect its "coding" rating.
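A minimal sketch of this category-weighted bookkeeping, assuming ratings are keyed by (category, model) pairs; the actual storage layout is not specified on this page.

```python
from collections import defaultdict

INITIAL_RATING = 1500.0  # mirrors the initial_rating config default

# Ratings keyed by (category, model): updates from "math" comparisons
# never touch a model's "coding" rating.
ratings: dict[tuple[str, str], float] = defaultdict(lambda: INITIAL_RATING)

def get_rating(category: str, model: str) -> float:
    """Look up a per-category rating, falling back to the initial rating."""
    return ratings[(category, model)]

ratings[("math", "gpt-4")] = 1620.0
print(get_rating("math", "gpt-4"))    # 1620.0
print(get_rating("coding", "gpt-4"))  # 1500.0 (independent per-category rating)
```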

Select Flow

Feedback Flow

Feedback API

Submit pairwise feedback to update Elo ratings:

curl -X POST http://localhost:8000/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Solve: 2x + 5 = 15",
    "winner_model": "gpt-4",
    "loser_model": "llama-3.2-3b",
    "decision_name": "math_reasoning"
  }'

Feedback fields:

| Field | Required | Description |
| --- | --- | --- |
| query | Yes | Original query text |
| winner_model | Yes | Preferred model name |
| loser_model | No | Rejected model (for pairwise comparison) |
| tie | No | Both models performed equally |
| decision_name | No | Category context for category-weighted Elo |
| user_id | No | User identifier for per-user tracking |
| confidence | No | Feedback confidence (0.0–1.0) |
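For instance, a draw could be reported by combining the fields above. This payload is a hypothetical sketch; in particular, whether winner_model must still be set when tie is true is an assumption, not something this page confirms.

```python
import json

# Hypothetical tie submission: both models are still named, and the tie
# flag marks the comparison as a draw (scored 0.5 / 0.5 in the Elo update).
payload = {
    "query": "Solve: 2x + 5 = 15",
    "winner_model": "gpt-4",
    "loser_model": "llama-3.2-3b",
    "tie": True,
    "decision_name": "math_reasoning",
    "confidence": 0.8,
}
body = json.dumps(payload)
print(body)  # JSON string suitable for POST /api/v1/feedback
```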

What Problem Does It Solve?

When model quality shifts over time, static priority lists and one-shot heuristics become stale. elo lets the router learn from pairwise feedback so candidate ranking reflects observed wins instead of frozen assumptions.

When to Use

  • You collect route-level feedback or quality comparisons.
  • Ranking should improve over time as more comparisons arrive.
  • One route sees repeatable workloads where a rating system is useful.
  • You want online learning without retraining.

Known Limitations

  • Requires sufficient pairwise comparisons before ratings stabilize (controlled by min_comparisons).
  • Cold start: new models begin at initial_rating regardless of actual capability.
  • No semantic understanding of queries — purely feedback-driven.

Configuration

algorithm:
  type: elo
  elo:
    initial_rating: 1500       # Starting rating for new models
    k_factor: 32               # Learning rate (higher = more volatile)
    category_weighted: true    # Per-category Elo ratings
    decay_factor: 0.0          # Time decay for old comparisons (0 = none)
    min_comparisons: 5         # Minimum comparisons before rating is stable
    cost_scaling_factor: 0.0   # Cost consideration (0 = ignore)
    storage_path: /var/lib/vsr/elo_ratings.json  # Persist ratings
    auto_save_interval: 1m     # Auto-save frequency

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| initial_rating | float | 1500 | Starting Elo rating for new models |
| k_factor | float | 32 | Rating volatility (range 1–100) |
| category_weighted | bool | true | Enable per-category ratings |
| decay_factor | float | 0.0 | Time decay for old comparisons (0–1) |
| min_comparisons | int | 5 | Minimum comparisons for a stable rating |
| cost_scaling_factor | float | 0.0 | Cost penalty per $1M tokens (0 = ignore) |
| storage_path | string | | File path to persist Elo ratings |
| auto_save_interval | string | 1m | Auto-save interval (e.g. 5m, 30s) |
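cost_scaling_factor implies a trade-off between rating and price. One plausible shape for that trade-off, purely as an illustration; the exact penalty formula used by the router is not documented here.

```python
def selection_score(rating: float, cost_per_m_tokens: float,
                    cost_scaling_factor: float = 0.0) -> float:
    """Illustrative cost-aware score: subtract a cost penalty from the Elo
    rating. With the default factor of 0.0, cost is ignored entirely."""
    return rating - cost_scaling_factor * cost_per_m_tokens

# With cost ignored (the default), only the Elo rating matters.
print(selection_score(1600.0, 30.0))       # 1600.0
# With a penalty of 2 rating points per $1/M tokens, cheaper models catch up.
print(selection_score(1600.0, 30.0, 2.0))  # 1540.0
print(selection_score(1550.0, 0.5, 2.0))   # 1549.0
```

Under this sketch, a model rated 50 points lower can still win selection once its price advantage outweighs the rating gap.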