Install with Gateway API Inference Extension
This guide provides step-by-step instructions for integrating the vLLM Semantic Router (vSR) with Istio and the Kubernetes Gateway API Inference Extension (GIE). Together, these components let you manage self-hosted, OpenAI-compatible models through Kubernetes-native APIs with load-aware routing.
Architecture Overview
The deployment consists of three main components:
- vLLM Semantic Router: The brain that classifies incoming requests based on their content.
- Istio & Gateway API: The network mesh and the front door for all traffic entering the cluster.
- Gateway API Inference Extension (GIE): A set of Kubernetes-native APIs (InferencePool, etc.) for managing and scaling self-hosted model backends.
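These pieces meet in a single Gateway API object: an HTTPRoute attached to the Istio-managed Gateway whose backend is an InferencePool rather than a plain Service. The sketch below is illustrative only; the gateway and pool names are placeholders, not resources created by this guide.
# Illustrative sketch only -- gateway and pool names are placeholders
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway        # the Istio-managed Gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool          # GIE pool of model-server replicas
      name: llama3-pool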
Benefits of Integration
Integrating vSR with Istio and GIE provides a robust, Kubernetes-native solution for serving LLMs with several key benefits:
1. Kubernetes-Native LLM Management
Manage your models, routing, and scaling policies directly through kubectl using familiar Custom Resource Definitions (CRDs).
2. Intelligent Model and Replica Routing
Combine vSR's prompt-based model routing with GIE's smart, load-aware replica selection. This ensures requests are sent not only to the right model but also to the healthiest replica, all in a single, efficient hop.
3. Protect Your Models from Overload
The built-in scheduler tracks GPU load and request queues, automatically shedding traffic to prevent your model servers from crashing under high demand.
4. Deep Observability
Gain insights from both high-level Gateway metrics and detailed vSR performance data (like token usage and classification accuracy) to monitor and troubleshoot your entire AI stack.
5. Secure Multi-Tenancy
Isolate tenant workloads using standard Kubernetes namespaces and HTTPRoutes. Apply rate limits and other policies while sharing a common, secure gateway infrastructure.
Supported Backend Models
This architecture is designed to work with any self-hosted model that exposes an OpenAI-compatible API. The demo backends in this guide use vLLM to serve Llama3 and Phi-4 (matching the manifests applied in Step 4), but you can easily swap in your own model servers.
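"OpenAI-compatible" here simply means the server answers the standard /v1/chat/completions (and /v1/models) endpoints. For example, a request like the following, with the host and model name replaced by your own, should work against any backend you plug in:
curl -s http://<your-model-server>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "Hello"}]}'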
Prerequisites
Before starting, ensure you have the following tools installed:
- Docker or another container runtime.
- kind v0.22+ or any Kubernetes 1.29+ cluster.
- kubectl v1.30+.
- Helm v3.14+.
- istioctl v1.28+.
- A Hugging Face token stored in the HF_TOKEN environment variable, required for the sample vLLM deployments to download models.
You can validate your toolchain versions with the following commands:
kind version
kubectl version --client
helm version --short
istioctl version --remote=false
Step 1: Create a Kind Cluster (Optional)
If you don't have a Kubernetes cluster, create a local one for testing:
kind create cluster --name vsr-gie
# Verify the cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
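If you plan to expose the gateway on a fixed host port later (instead of relying on kubectl port-forward), you can optionally create the cluster from a config file with a port mapping. The ports below are placeholders; adjust them to match whatever you expose the gateway on:
# Optional: kind-config.yaml with a host port mapping (ports are placeholders)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30080
    hostPort: 8080
    protocol: TCP
# Then create the cluster with: kind create cluster --name vsr-gie --config kind-config.yaml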
Step 2: Install Istio
Install Istio with support for the Gateway API and external processing:
# Download and install Istio
export ISTIO_VERSION=1.29.0
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=$ISTIO_VERSION sh -
export PATH="$PWD/istio-$ISTIO_VERSION/bin:$PATH"
istioctl install -y --set profile=minimal --set values.pilot.env.ENABLE_GATEWAY_API=true
# Verify Istio is ready
kubectl wait --for=condition=Available deployment/istiod -n istio-system --timeout=300s
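To double-check that the Gateway API flag actually landed on the control plane (the --set values.pilot.env.* option is applied as an environment variable on the istiod container), you can optionally run:
# Should print "true"
kubectl -n istio-system get deploy istiod \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ENABLE_GATEWAY_API")].value}'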
Step 3: Install Gateway API & GIE CRDs
Install the Custom Resource Definitions (CRDs) for the standard Gateway API and the Inference Extension:
# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml
# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml
# Verify CRDs are installed
kubectl get crd | grep 'gateway.networking.k8s.io'
kubectl get crd | grep 'inference.networking.k8s.io'
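With both Istio and the Gateway API CRDs in place, Istio should also register a GatewayClass named istio automatically; it can take a few seconds to appear:
kubectl get gatewayclass istio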
Step 4: Deploy Demo LLM Servers
Deploy two vLLM instances (Llama3 and Phi-4) to act as our backends. The model weights are downloaded automatically from Hugging Face on first start.
# Create namespace and secret for the models
kubectl create namespace llm-backends --dry-run=client -o yaml | kubectl apply -f -
kubectl -n llm-backends create secret generic hf-token --from-literal=token=$HF_TOKEN
# Deploy the model servers
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vLlama3.yaml
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vPhi4.yaml
# Wait for models to be ready (this may take several minutes)
kubectl -n llm-backends wait --for=condition=Ready pods --all --timeout=10m
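Before wiring up the gateway, it is worth confirming the backends answer OpenAI-compatible requests directly. The Service name and port below are placeholders; use whatever the manifests actually create:
# List the Services created by the manifests
kubectl -n llm-backends get svc
# Port-forward one of them and query its model list (replace <service> and <port>)
kubectl -n llm-backends port-forward svc/<service> 8000:<port> &
curl -s http://localhost:8000/v1/models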
Step 5: Deploy vLLM Semantic Router
Deploy the vLLM Semantic Router using its official Helm chart. It runs as an Envoy external processing (ext_proc) service that the Istio gateway consults for routing decisions.
helm upgrade -i semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace vllm-semantic-router-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml
# Wait for the router to be ready
kubectl -n vllm-semantic-router-system wait --for=condition=Available deploy/semantic-router --timeout=10m
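As a final check, confirm the router pod is running and skim its logs for startup errors:
kubectl -n vllm-semantic-router-system get pods
kubectl -n vllm-semantic-router-system logs deploy/semantic-router --tail=20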