# Capacity Planning Scenarios
vllm-sr-sim answers fleet-planning questions that cannot be resolved
from first principles alone: where to set a split threshold, whether a fleet
will actually meet SLO under real queue dynamics, which GPU type is cheapest
for a given workload, and when to pre-provision the next tier.
GPU unit costs used throughout:
| GPU | $/hr | $/yr |
|---|---|---|
| A10G 24 GB | $1.01 | $8.85K |
| A100 80 GB | $2.21 | $19.4K |
| H100 80 GB | $4.02 | $35.2K |
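As a sanity check on the table, the $/yr column is just $/hr scaled by hours per year (the hourly rates are the table's; the 8,760 h/yr conversion is the standard one, before any reserved-instance discount):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

# Hourly rates from the table above
gpu_hourly = {"A10G 24 GB": 1.01, "A100 80 GB": 2.21, "H100 80 GB": 4.02}

for gpu, rate in gpu_hourly.items():
    yearly = rate * HOURS_PER_YEAR
    print(f"{gpu}: ${rate:.2f}/hr -> ${yearly / 1000:.2f}K/yr")
```

The A100 works out to $19.36K exact; the table rounds each figure to three significant digits.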
P99 TTFT = P99(KV-slot queue wait) + mean prefill time. Each KV-cache slot is modelled as a server in an M/G/c queue.
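The M/G/c model above can be sketched as a small discrete-event simulation: `c` KV-cache slots act as servers, arrivals are Poisson, and each arrival is dispatched to the earliest-free slot (equivalent to FIFO). This is an illustrative sketch, not the simulator's implementation; the lognormal service distribution and all parameter values below are hypothetical stand-ins for a heavy-tailed workload.

```python
import heapq
import random


def simulate_p99_ttft(lam, c, service, mean_prefill, n=100_000, seed=0):
    """Estimate P99 TTFT = P99(KV-slot queue wait) + mean prefill time
    for an M/G/c FIFO queue with arrival rate `lam` and `c` slots."""
    rng = random.Random(seed)
    free_at = [0.0] * c          # min-heap: time each KV slot next becomes free
    heapq.heapify(free_at)
    t, waits = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)           # Poisson arrivals
        slot_free = heapq.heappop(free_at)  # earliest-free slot == FIFO service
        start = max(t, slot_free)
        waits.append(start - t)             # time spent queuing for a slot
        heapq.heappush(free_at, start + service(rng))
    waits.sort()
    return waits[int(0.99 * len(waits))] + mean_prefill


# Hypothetical workload: 40 slots, 30 req/s, lognormal service times
# (mean ~1 s, heavy tail => ~75% utilization), 150 ms mean prefill.
p99_ttft = simulate_p99_ttft(lam=30.0, c=40,
                             service=lambda r: r.lognormvariate(-0.5, 1.0),
                             mean_prefill=0.15)
```

Pushing utilization toward 1.0 makes the P99 wait term dominate, which is why the fleet-sizing questions below cannot be answered from mean throughput alone.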
## When to split pools — the short version
Before reaching for the simulator, apply this filter:
- Heavy-tail service times (agent / long-context)? → Split required. A homogeneous pool cannot meet SLO regardless of GPU count.
- Otherwise, compute the context ratio R = long_max_ctx / B_short and the long-request fraction f:
  - R ≤ 2 or f > 30% → homogeneous usually cheaper; split for latency isolation only
  - R ≥ 4 and f < 10% → split cheaper at high traffic (λ > ~100 req/s)
  - R ≥ 16 and f < 5% → split cheaper at any meaningful traffic level
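The filter above can be written down directly (a sketch under the thresholds stated; the function name, return strings, and the rule ordering — most specific condition first — are mine):

```python
def pool_strategy(ctx_ratio, long_frac, arrival_rate, heavy_tail=False):
    """Split-vs-homogeneous pre-filter.

    ctx_ratio    -- R = long_max_ctx / B_short
    long_frac    -- f, fraction of requests that are long (0..1)
    arrival_rate -- lambda in req/s
    heavy_tail   -- heavy-tailed service times (agent / long-context)
    """
    if heavy_tail:
        return "split required: homogeneous cannot meet SLO"
    if ctx_ratio <= 2 or long_frac > 0.30:
        return "homogeneous usually cheaper; split for latency isolation only"
    if ctx_ratio >= 16 and long_frac < 0.05:
        return "split cheaper at any meaningful traffic"
    if ctx_ratio >= 4 and long_frac < 0.10 and arrival_rate > 100:
        return "split cheaper at this traffic level"
    return "ambiguous: simulate"


# e.g. 32x context ratio, 3% long requests -> split regardless of traffic
print(pool_strategy(ctx_ratio=32, long_frac=0.03, arrival_rate=10))
```

Anything that lands in the "ambiguous: simulate" bucket — or needs an actual GPU count or dollar figure rather than a direction — is what the scenarios below use vllm-sr-sim for.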
Each scenario below is a puzzle these rules of thumb cannot solve on their own.