Elo
Overview
elo is a feedback-driven selection algorithm that ranks candidate models using an Elo-style rating system based on pairwise comparisons.
It aligns to config/algorithm/selection/elo.yaml.
Paper: RouteLLM: Simple and Effective LLM Routing — uses the Bradley-Terry model for pairwise preference learning.
Key Advantages
- Reuses historical pairwise feedback instead of only current-request heuristics.
- Ratings improve over time as more comparisons arrive (online learning).
- Supports category-aware weighting for routes with distinct workloads.
- Configurable time decay to gradually forget stale comparisons.
- Optional cost-aware selection to balance quality vs. price.
Algorithm Principle
Elo rating is built on the Bradley-Terry model, which estimates the probability that model A is preferred over model B:
After each pairwise comparison, ratings are updated:
Where:
- is the learning rate (
k_factor, default 32) - is the actual outcome (1 = win, 0 = loss, 0.5 = tie)
- is the expected score
When category_weighted is enabled, each decision maintains independent per-category ratings, so a model's performance in "math" doesn't affect its "coding" rating.
Select Flow
Feedback Flow
Feedback API
Submit pairwise feedback to update Elo ratings:
curl -X POST http://localhost:8000/api/v1/feedback \
-H "Content-Type: application/json" \
-d '{
"query": "Solve: 2x + 5 = 15",
"winner_model": "gpt-4",
"loser_model": "llama-3.2-3b",
"decision_name": "math_reasoning"
}'
Feedback fields:
| Field | Required | Description |
|---|---|---|
query | Yes | Original query text |
winner_model | Yes | Preferred model name |
loser_model | No | Rejected model (for pairwise) |
tie | No | Both models performed equally |
decision_name | No | Category context for category-weighted Elo |
user_id | No | User identifier for per-user tracking |
confidence | No | Feedback confidence (0.0-1.0) |
What Problem Does It Solve?
When model quality shifts over time, static priority lists and one-shot heuristics become stale. elo lets the router learn from pairwise feedback so candidate ranking reflects observed wins instead of frozen assumptions.
When to Use
- You collect route-level feedback or quality comparisons.
- Ranking should improve over time as more comparisons arrive.
- One route sees repeatable workloads where a rating system is useful.
- You want online learning without retraining.
Known Limitations
- Requires sufficient pairwise comparisons before ratings stabilize (controlled by
min_comparisons). - Cold start: new models begin at
initial_ratingregardless of actual capability. - No semantic understanding of queries — purely feedback-driven.
Configuration
algorithm:
type: elo
elo:
initial_rating: 1500 # Starting rating for new models
k_factor: 32 # Learning rate (higher = more volatile)
category_weighted: true # Per-category Elo ratings
decay_factor: 0.0 # Time decay for old comparisons (0 = none)
min_comparisons: 5 # Minimum comparisons before rating is stable
cost_scaling_factor: 0.0 # Cost consideration (0 = ignore)
storage_path: /var/lib/vsr/elo_ratings.json # Persist ratings
auto_save_interval: 1m # Auto-save frequency
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_rating | float | 1500 | Starting Elo rating for new models |
k_factor | float | 32 | Rating volatility (range 1–100) |
category_weighted | bool | true | Enable per-category ratings |
decay_factor | float | 0.0 | Time decay for old comparisons (0–1) |
min_comparisons | int | 5 | Minimum comparisons for stable rating |
cost_scaling_factor | float | 0.0 | Cost penalty per $1M tokens (0 = ignore) |
storage_path | string | — | File path to persist Elo ratings |
auto_save_interval | string | 1m | Auto-save interval (e.g. 5m, 30s) |