Use predicted latency-based routing with GKE Inference Gateway

This document describes how to enable and use predicted latency-based routing provided by llm-d within GKE Inference Gateway. By default, GKE Inference Gateway routes requests using a combination of load signals and prefix-cache affinity heuristics. Predicted latency-based routing replaces the static heuristic weights with an XGBoost model trained continuously on live traffic, making more accurate routing decisions as workload patterns shift.

When to use predicted latency-based routing

This feature is most effective when the following conditions apply to your workload:

How predicted latency-based routing works

This section details the architecture and the scheduling pipeline used by predicted latency-based routing.

Architecture

Predicted latency-based scheduling adds two kinds of sidecar containers to the EPP Pod, alongside the EPP itself:

Component | Description
Training Server | Continuously retrains the XGBoost models for time to first token (TTFT) and time per output token (TPOT) on completed request samples received from the EPP. Uses stratified bucketing over a sliding window so that rare traffic regimes are not forgotten. Writes updated models to a shared volume.
Prediction Servers | Serve TTFT and TPOT predictions to the EPP on the request hot path. Read the latest trained model from the shared volume. Horizontally scalable: each server instance sustains approximately 300 QPS of prediction work. Multiple instances are load-balanced by a Go coalescing proxy in the EPP that batches concurrent prediction requests within a 1 ms window.
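The batching behavior of the coalescing proxy can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the actual proxy is written in Go and its internals are not described in this document. The class name CoalescingPredictor and the batch_predict callback are hypothetical. Only the 1 ms window comes from the table above.

```python
import asyncio


class CoalescingPredictor:
    """Groups predict() calls that arrive within one window into a single batch call."""

    def __init__(self, batch_predict, window_s=0.001):
        # batch_predict is a hypothetical async function: list of features -> list of predictions.
        self._batch_predict = batch_predict
        self._window_s = window_s        # 1 ms window, as stated in the text
        self._pending = []               # (features, future) pairs awaiting the next flush
        self._flush_scheduled = False
        self._tasks = set()              # strong references so flush tasks are not collected

    async def predict(self, features):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._pending.append((features, future))
        if not self._flush_scheduled:
            # The first caller in a window schedules the flush for everyone in that window.
            self._flush_scheduled = True
            task = loop.create_task(self._flush_after_window())
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)
        return await future

    async def _flush_after_window(self):
        await asyncio.sleep(self._window_s)
        batch, self._pending = self._pending, []
        self._flush_scheduled = False    # later arrivals start a new window
        try:
            results = await self._batch_predict([f for f, _ in batch])
        except Exception as exc:
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            return
        for (_, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)
```

In this sketch, three concurrent predict() calls issued within the same window reach the backend as one batch of three, which is the effect that the proxy relies on to keep per-request prediction overhead low.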

llm-d EPP scheduling pipeline

When predicted latency-based scheduling is enabled, the EPP processes each request through the following sequence of composable plugins:

  1. predicted-latency-producer: calls the Prediction Server to obtain TTFT and TPOT estimates for every candidate Pod in the InferencePool, conditioned on each Pod's current KV-cache utilization, queue depth, prefix cache match score, and the incoming request features. After the response is returned to the client, the producer sends the observed TTFT and inter-token latency back to the Training Server as a new training sample.

  2. prefix-cache-affinity-filter: narrows the candidate set to cache-warm Pods when any Pod's prefix cache match score exceeds the affinity threshold (default: 0.80). This threshold separates two populations observed in production: Pods that already have the conversation history cached from prior turns, and Pods that don't. This filter implements an epsilon-greedy exploit and explore strategy:

  3. slo-headroom-tier-filter (SLO requests only): when the request includes SLO headers, splits candidate Pods into a positive tier (predicted to meet the SLO) and a negative tier (predicted to violate it).

  4. latency-scorer: scores candidate Pods. Without SLO headers, the Pod with the lowest predicted latency is selected. With SLO headers, the score is based on headroom (SLO minus predicted latency) using the headroomSelectionStrategy:

  5. latency-slo-admitter (SLO requests only): rejects sheddable requests (requests with a priority less than 0) when no candidate Pod is predicted to meet the SLO, instead of consuming capacity on a request that is predicted to miss its target. This plugin has no effect when SLO headers are absent or when at least one Pod is predicted to meet the SLO.

  6. weighted-random-picker: selects the final Pod using weighted random selection over the scores. This spreads load while still favoring better-scoring Pods.
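The steps above can be sketched end to end in Python. This is an illustration under stated assumptions, not the EPP implementation: it scores on predicted TTFT only, the exploration rate EPSILON is a hypothetical value, and both scoring formulas in latency_score are invented for the example because this document does not specify them. Only the 0.80 threshold, the priority rule, and the order of the plugins come from the text.

```python
import random
from dataclasses import dataclass

AFFINITY_THRESHOLD = 0.80  # default threshold stated in the text
EPSILON = 0.05             # hypothetical exploration rate, not from the source


@dataclass
class Candidate:
    name: str
    prefix_match: float  # prefix cache match score in [0, 1]
    pred_ttft_ms: float  # predicted TTFT for this Pod (assumed > 0)


def prefix_affinity_filter(cands, epsilon, rng):
    """Keep only cache-warm Pods, except on an epsilon-probability explore step."""
    warm = [c for c in cands if c.prefix_match > AFFINITY_THRESHOLD]
    if not warm or rng.random() < epsilon:
        return list(cands)  # explore, or nothing is warm: keep every candidate
    return warm             # exploit: route among cache-warm Pods only


def latency_score(cand, slo_ttft_ms=None):
    """Higher is better. Both formulas are illustrative assumptions."""
    if slo_ttft_ms is None:
        return 1.0 / cand.pred_ttft_ms            # lower predicted latency scores higher
    headroom = slo_ttft_ms - cand.pred_ttft_ms    # SLO minus predicted latency
    return 1.0 + headroom if headroom >= 0 else 1.0 / (1.0 - headroom)


def schedule(cands, slo_ttft_ms=None, priority=0, epsilon=EPSILON, rng=None):
    """Return the chosen Candidate, or None when the request is shed.

    Assumes cands is non-empty.
    """
    rng = rng or random.Random()
    pool = prefix_affinity_filter(cands, epsilon, rng)
    positive = []
    if slo_ttft_ms is not None:
        # Tier split: the positive tier holds Pods predicted to meet the SLO.
        positive = [c for c in pool if c.pred_ttft_ms <= slo_ttft_ms]
        pool_for_scoring = positive or pool
    else:
        pool_for_scoring = pool
    weights = [latency_score(c, slo_ttft_ms) for c in pool_for_scoring]
    # Admission: shed sheddable requests when no Pod is predicted to meet the SLO.
    if slo_ttft_ms is not None and not positive and priority < 0:
        return None
    # Weighted random pick spreads load while favoring better scores.
    return rng.choices(pool_for_scoring, weights=weights, k=1)[0]
```

Because the final pick is weighted rather than a strict argmax, two Pods with similar predictions share traffic instead of the marginally better one receiving every request.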

Streaming mode

The predicted-latency-producer plugin supports two training modes, configured using the streamingMode parameter:

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable predicted latency-based scheduling

The following steps guide you through enabling predicted latency-based scheduling for your GKE Inference Gateway deployment.

Step 1: Install or upgrade the InferencePool with predicted latency enabled

Setting inferenceExtension.latencyPredictor.enabled=true deploys the Training Server and Prediction Server sidecars inside the EPP Pod and wires up the full scheduling plugin pipeline:

helm upgrade --install INFERENCE_POOL_NAME \
  --set inferencePool.modelServers.matchLabels.app=MODEL_SERVER_LABEL \
  --set provider.name=gke \
  --set inferenceExtension.monitoring.gke.enabled=true \
  --set inferenceExtension.latencyPredictor.enabled=true \
  --version LLM_D_VERSION \
  oci://LLM_D_REGISTRY_PATH

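Replace the following:

  • INFERENCE_POOL_NAME: the name of your InferencePool, which is also used as the Helm release name.
  • MODEL_SERVER_LABEL: the value of the app label on your model server Pods.
  • LLM_D_VERSION: the chart version to install.
  • LLM_D_REGISTRY_PATH: the OCI registry path of the chart.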

Step 2: Verify the deployment

Confirm the EPP Pod is running with all sidecar containers ready:

kubectl get pods -l app=INFERENCE_POOL_NAME-epp

The EPP Pod should have a STATUS of Running, and the READY column should show every container as ready: the EPP itself, the Training Server, and one or more Prediction Servers.
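For scripted checks, the same verification can be done on the JSON output of kubectl. The following sketch assumes that the sidecars are reported under status.containerStatuses in the Pod object. The function name unready_containers is hypothetical.

```python
import json
import sys


def unready_containers(pod_list):
    """Return (pod name, container name) pairs for containers that are not ready.

    pod_list is the parsed output of `kubectl get pods ... -o json`.
    """
    missing = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if not statuses:
            # A Pod that has not been scheduled yet reports no container statuses.
            missing.append((pod_name, "(no container status yet)"))
        for status in statuses:
            if not status.get("ready", False):
                missing.append((pod_name, status["name"]))
    return missing
```

One way to use it is to pipe kubectl get pods -l app=INFERENCE_POOL_NAME-epp -o json into a script that calls unready_containers(json.load(sys.stdin)) and exits with a non-zero status when the returned list is not empty.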

Step 3: Send a baseline request

Send a standard inference request to confirm that routing is working before enabling SLO headers:

curl -i -X POST GATEWAY_IP:PORT/v1/completions \
 -H 'Content-Type: application/json' \
 -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 -H 'x-prediction-based-scheduling: true' \
 -d '{
    "model": "MODEL_NAME",
    "prompt": "PROMPT_TEXT",
    "max_tokens": MAX_TOKENS,
    "temperature": 0
 }'

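Replace the following:

  • GATEWAY_IP: the IP address of your Gateway.
  • PORT: the port that the Gateway listens on.
  • MODEL_NAME: the name of the model to send the request to.
  • PROMPT_TEXT: the prompt text.
  • MAX_TOKENS: the maximum number of tokens to generate, as an integer.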

The x-prediction-based-scheduling: true header opts this request into the predicted latency scheduling pipeline. During the predictor warm-up period, the EPP falls back to heuristic routing.
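The same baseline request can be built from Python with the standard library. This is a minimal sketch: the function name build_completion_request is hypothetical, the access token is passed in by the caller, and the base URL is assumed to use plain HTTP as the curl command does by default.

```python
import json
import urllib.request


def build_completion_request(base_url, token, model, prompt, max_tokens):
    """Build (but do not send) a completion request that opts in to predicted latency scheduling."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "x-prediction-based-scheduling": "true",  # opt-in header from this step
    }
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and read the response body.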

Step 4: Send SLO-aware requests (optional)

To enable per-request SLO enforcement, add TTFT and TPOT latency objective headers:

curl -i -X POST GATEWAY_IP:PORT/v1/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H 'x-prediction-based-scheduling: true' \
  -H 'x-slo-ttft-ms: 500' \
  -H 'x-slo-tpot-ms: 50' \
  -d '{
    "model": "MODEL_NAME",
    "prompt": "PROMPT_TEXT",
    "max_tokens": MAX_TOKENS,
    "temperature": 0,
    "stream": true
  }'

Replace GATEWAY_IP, PORT, MODEL_NAME, PROMPT_TEXT, and MAX_TOKENS with the same values that you used in Step 3.

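Request headers:

  • x-prediction-based-scheduling: set to true to opt the request into the predicted latency scheduling pipeline.
  • x-slo-ttft-ms: the TTFT objective for the request, in milliseconds (500 in this example).
  • x-slo-tpot-ms: the TPOT objective for the request, in milliseconds (50 in this example).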
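To check a streamed response against the objectives on the client side, TTFT and TPOT can be derived from chunk arrival times. The sketch below assumes that each streamed chunk carries one token, which is an approximation, and all function names are hypothetical. TTFT is the delay until the first chunk, and TPOT is the average gap between subsequent chunks.

```python
def slo_headers(ttft_ms, tpot_ms):
    """Return the opt-in header plus the two SLO headers used in this step."""
    if ttft_ms <= 0 or tpot_ms <= 0:
        raise ValueError("SLO targets must be positive")
    return {
        "x-prediction-based-scheduling": "true",
        "x-slo-ttft-ms": str(ttft_ms),
        "x-slo-tpot-ms": str(tpot_ms),
    }


def observed_latency_ms(request_start_s, chunk_times_s):
    """Compute (ttft_ms, tpot_ms) from timestamps in seconds.

    tpot_ms is None when only one chunk arrived.
    """
    if not chunk_times_s:
        raise ValueError("no chunks received")
    ttft_ms = (chunk_times_s[0] - request_start_s) * 1000.0
    n = len(chunk_times_s)
    if n == 1:
        return ttft_ms, None
    tpot_ms = (chunk_times_s[-1] - chunk_times_s[0]) * 1000.0 / (n - 1)
    return ttft_ms, tpot_ms


def meets_slo(ttft_ms, tpot_ms, slo_ttft_ms, slo_tpot_ms):
    """True when both observed values are within their objectives."""
    return ttft_ms <= slo_ttft_ms and (tpot_ms is None or tpot_ms <= slo_tpot_ms)
```

For example, a first chunk that arrives 400 ms after the request starts, followed by three more chunks 40 ms apart, gives a TTFT of about 400 ms and a TPOT of about 40 ms, which is within the 500 ms and 50 ms objectives used above.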

Monitor predicted latency scheduling

When the latency predictor is enabled, the EPP exposes additional metrics through Cloud Monitoring.

Metric | Description
inference_objective_request_ttft_seconds | Actual TTFT distribution (or E2E latency if streamingMode=false).
inference_objective_request_predicted_ttft_seconds | Predicted TTFT distribution (or E2E latency if streamingMode=false).
inference_objective_request_tpot_seconds | Actual TPOT distribution.
inference_objective_request_predicted_tpot_seconds | Predicted TPOT distribution.
inference_objective_request_ttft_slo_violation_total | Counter of TTFT SLO violations.

Scale the Prediction Server

The EPP makes one prediction call per candidate Pod per incoming request. Each Prediction Server instance sustains approximately 300 QPS of prediction work.

Approximate guidance for Prediction Server instance count:

Cluster QPS (100 Pods) | Prediction servers required
Up to 1,000 QPS | 1 server
Up to 5,000 QPS | 2 servers
Up to 10,000 QPS | 4 servers

Add Prediction Server instances by updating the latencyPredictor.predictionServerCount Helm value.
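The guidance table can be encoded as a simple lookup, for example in a capacity planning script. The sketch below only reproduces the three rows of the table for a 100-Pod cluster. It does not extrapolate beyond 10,000 QPS, and the function name is hypothetical.

```python
import bisect

# Upper QPS bounds and server counts, copied from the guidance table (100 Pods).
_QPS_LIMITS = [1_000, 5_000, 10_000]
_SERVER_COUNTS = [1, 2, 4]


def prediction_servers_for(cluster_qps):
    """Return the suggested Prediction Server count for a given cluster QPS."""
    if cluster_qps < 0:
        raise ValueError("cluster_qps must be non-negative")
    index = bisect.bisect_left(_QPS_LIMITS, cluster_qps)
    if index == len(_QPS_LIMITS):
        raise ValueError("the guidance table does not cover more than 10,000 QPS")
    return _SERVER_COUNTS[index]
```

Treat the result as a starting point and confirm it against the actual prediction latency observed in your cluster.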

Limitations

What's next

  • Customize GKE Inference Gateway configuration.
  • Perform rollout operations for GKE Inference Gateway.