
RL Driven

Overview


rl_driven is a selection algorithm for online exploration and personalization. It supports multiple sub-modes: Thompson Sampling, Router-R1 (LLM-as-router), and Concurrent (arena mode).

It corresponds to config/algorithm/selection/rl-driven.yaml.

Key Advantages

  • Supports exploration instead of always exploiting the current best model.
  • Thompson Sampling provides a principled exploration/exploitation balance.
  • Router-R1 mode uses an LLM to reason about routing decisions.
  • Per-user personalization adapts routing over time.
  • Implicit feedback support (auto-detected satisfaction signals).

Sub-Modes

1. Thompson Sampling (default)

Uses Bayesian posterior sampling to balance exploration and exploitation. Each model's success probability is modeled as a Beta distribution Beta(α, β):

θ_m ~ Beta(α_m, β_m)

At each request, a sample is drawn for each candidate model, and the model with the highest sample is selected:

m* = argmax_m θ_m

After feedback, the distribution is updated:

  • Win: α_m ← α_m + 1
  • Loss: β_m ← β_m + 1
  • Tie: α_m ← α_m + 0.5, β_m ← β_m + 0.5

When use_thompson_sampling: false, the algorithm falls back to epsilon-greedy selection, using exploration_rate as ε.
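The sampling and update rules above can be sketched in a few lines. This is an illustrative Python sketch, not the router's actual implementation; the class and method names are made up for the example.

```python
import random


class ThompsonSelector:
    """Minimal Thompson Sampling sketch (illustrative, not the router's code)."""

    def __init__(self, models):
        # Beta(1, 1) is a uniform prior over each model's success probability.
        self.params = {m: [1.0, 1.0] for m in models}

    def select(self):
        # Draw one sample per model from Beta(alpha, beta); pick the highest.
        samples = {m: random.betavariate(a, b) for m, (a, b) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, model, outcome):
        # Win: alpha += 1; loss: beta += 1; tie: both += 0.5.
        a, b = self.params[model]
        if outcome == "win":
            self.params[model] = [a + 1, b]
        elif outcome == "loss":
            self.params[model] = [a, b + 1]
        else:  # tie
            self.params[model] = [a + 0.5, b + 0.5]


selector = ThompsonSelector(["model-a", "model-b"])
choice = selector.select()
selector.update("model-a", "win")
```

Because selection samples from each posterior rather than taking its mean, under-explored models still win occasionally, which is what gives Thompson Sampling its exploration/exploitation balance.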

2. Router-R1 (LLM-as-Router)

When enable_llm_routing: true, an external LLM server analyzes the query and selects the optimal model using "think" and "route" actions from the Router-R1 paper.

Reward structure:

R = R_format + (1 − α) · R_outcome + α · R_cost

| Component | Description |
| --- | --- |
| R_format | -1 for incorrect format, 0 for correct |
| R_outcome | Based on exact match with ground truth |
| R_cost | Inversely proportional to model size × output tokens |
| α | cost_reward_alpha: the performance-cost tradeoff |
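The reward formula above can be written directly as code. A minimal sketch, assuming a binary exact-match outcome and a cost term pre-normalized to [0, 1] (both simplifications; the function name is illustrative):

```python
def router_r1_reward(format_ok, outcome_match, cost_norm, alpha=0.3):
    """Sketch of the Router-R1-style reward: R = R_format + (1 - alpha) * R_outcome + alpha * R_cost.

    cost_norm is an assumed normalized cost in [0, 1] (0 = cheapest).
    """
    r_format = 0.0 if format_ok else -1.0   # -1 for incorrect format, 0 for correct
    r_outcome = 1.0 if outcome_match else 0.0  # exact match with ground truth (simplified)
    r_cost = 1.0 - cost_norm                # cheaper model/output -> higher cost reward
    return r_format + (1 - alpha) * r_outcome + alpha * r_cost
```

With alpha = 0 the reward is purely outcome-driven; with alpha = 1 it is purely cost-driven, matching the cost_reward_alpha semantics in the parameter table below.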

3. Concurrent (Arena Mode)

When used as a looper algorithm (type: "concurrent"), executes all candidate models concurrently and aggregates results — useful for A/B testing and arena evaluation.
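The concurrent fan-out can be sketched with a thread pool. This is an assumed shape for arena-style execution, not the router's code; query_all and call_model are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(models, call_model, prompt):
    """Send one prompt to every candidate model concurrently (arena-style sketch).

    call_model stands in for the real model invocation.
    """
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(call_model, m, prompt) for m in models}
        # Collect every result so the responses can be compared side by side.
        return {m: f.result() for m, f in futures.items()}


results = query_all(["model-a", "model-b"],
                    lambda m, p: f"{m}: {p}", "hello")
```

Aggregating all responses, rather than returning the first, is what makes this mode suitable for A/B testing and pairwise arena comparison.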

Select Flow (Thompson Sampling)

Feedback Flow

What Problem Does It Solve?

Online routing often needs exploration, adaptation, or personalization that static rankings cannot provide. rl_driven keeps that learning policy inside the router so model choice can improve from rewards and user feedback over time.

When to Use

  • The route should keep exploring candidate models online.
  • Personalization should adapt over time per user.
  • Thompson Sampling is preferred when you want principled exploration without external dependencies.
  • Router-R1 mode is preferred when you have a trained router LLM server.

Known Limitations

  • Thompson Sampling requires sufficient samples before exploiting effectively (min_samples).
  • Router-R1 LLM routing adds latency (extra LLM call per request).
  • Router-R1 mode requires a separate trained router LLM server.
  • Exploration incurs short-term cost for long-term gain.

Configuration

```yaml
algorithm:
  type: rl_driven
  rl_driven:
    # Exploration control
    exploration_rate: 0.3        # Initial exploration rate (epsilon)
    exploration_decay: 0.99      # Decay per 100 selections
    min_exploration: 0.05        # Minimum exploration rate
    use_thompson_sampling: true  # Use Thompson Sampling vs epsilon-greedy

    # Personalization
    enable_personalization: true # Per-user preference tracking
    personalization_blend: 0.7   # 1.0=fully personalized, 0.0=fully global
    session_context_weight: 0.5  # Within-session feedback weight
    implicit_feedback_weight: 0.5 # Auto-detected feedback weight

    # Cost awareness
    cost_awareness: true         # Prefer cheaper models for exploration
    cost_weight: 0.2             # Cost influence weight

    # Persistence
    storage_path: /var/lib/vsr/rl_state.json
    auto_save_interval: 30s

    # Router-R1 reward
    use_router_r1_rewards: false # Enable Router-R1 reward computation
    cost_reward_alpha: 0.3       # Performance-cost tradeoff in reward
    format_reward_penalty: -1.0  # Penalty for incorrect format

    # LLM-as-Router
    enable_llm_routing: false    # Enable LLM-based routing
    router_r1_server_url: ""     # Router-R1 server URL
    llm_routing_fallback: thompson # Fallback when LLM routing fails

    # Multi-round
    enable_multi_round_aggregation: false
```
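The personalization_blend comment (1.0 = fully personalized, 0.0 = fully global) suggests a linear mix of per-user and global statistics. A minimal sketch under that assumption; the function name and the exact blending rule are illustrative, not taken from the implementation:

```python
def blended_score(personal_score, global_score, blend=0.7):
    """Assumed linear interpolation between per-user and global model scores.

    blend=1.0 uses only the per-user estimate; blend=0.0 uses only the global one.
    """
    return blend * personal_score + (1 - blend) * global_score
```

With the default blend of 0.7, a user's own feedback history dominates, while the global estimate still anchors the score for users with little history.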

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| exploration_rate | float | 0.3 | Initial exploration rate (0–1) |
| exploration_decay | float | 0.99 | Decay per 100 selections (0–1) |
| min_exploration | float | 0.05 | Minimum exploration rate (0–1) |
| use_thompson_sampling | bool | true | Thompson Sampling vs epsilon-greedy |
| enable_personalization | bool | true | Per-user preference tracking |
| personalization_blend | float | 0.7 | Global vs. personalized blend (0–1) |
| session_context_weight | float | 0.5 | Within-session feedback weight (0–1) |
| implicit_feedback_weight | float | 0.5 | Implicit feedback weight (0–1) |
| cost_awareness | bool | true | Prefer cheaper models for exploration |
| cost_weight | float | 0.2 | Cost influence weight |
| use_router_r1_rewards | bool | false | Enable Router-R1 reward structure |
| cost_reward_alpha | float | 0.3 | Performance-cost tradeoff (0=outcome, 1=cost) |
| enable_llm_routing | bool | false | Enable LLM-as-router mode |
| router_r1_server_url | string | "" | URL of Router-R1 LLM server |
| llm_routing_fallback | string | thompson | Fallback when LLM routing fails |
| storage_path | string | | Persist RL state to file |
| auto_save_interval | string | 30s | Auto-save interval |