The default Kubernetes HorizontalPodAutoscaler scales on CPU utilization, a signal that decouples from load during GPU-bound LLM inference.
Standard autoscaling logic assumes CPU usage correlates linearly with request volume. For web servers or batch jobs, this holds true. For Large Language Model serving, the compute bottleneck resides on the GPU, while the CPU manages control logic and data movement. A pod can be fully saturated on its GPU accelerator while the CPU remains idle at 5% utilization. Relying on CPU metrics in this context creates a false sense of security; the autoscaler will not add replicas until the CPU spikes, which often happens only after the GPU is already exhausted and requests are queuing.
The HorizontalPodAutoscaler is a Kubernetes controller that adjusts replica counts based on observed metrics. It queries the metrics API, compares current values against target thresholds, and updates the Deployment scale. A typical configuration targets average CPU utilization, often around 80% of each pod's CPU request. That threshold says little about a stateless inference service, where GPU memory and compute are the scarce resources. When CPU is the scaling signal for vLLM or TGI workloads, the scaling decision lags behind the actual resource constraint.
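The comparison step can be made concrete with a small sketch of the replica calculation described in the Kubernetes HPA documentation: the ratio of current to target metric value, scaled by the current replica count and rounded up, with changes skipped inside a tolerance band. The function name and the sample numbers are illustrative assumptions, and the real controller applies further rules (readiness, missing pods, behavior policies) that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count recommended for a single metric.

    desired = ceil(current_replicas * current_value / target_value),
    unless the ratio is within the tolerance band around 1.0.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


# Hypothetical: 4 GPU-saturated replicas reporting 5% CPU against an 80% target.
print(desired_replicas(4, 5, 80))    # 1
# The same 4 replicas only look overloaded once CPU finally reaches 100%.
print(desired_replicas(4, 100, 80))  # 5
```

With a CPU signal, the first call recommends shrinking to a single replica even though every GPU is busy, which is the mismatch this section describes.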
Accurate scaling requires a metric that reflects the actual work queue. For LLM serving, this is the number of concurrent requests or the token generation rate. These signals are exposed by the inference server itself and must be routed into the Kubernetes control plane via the Custom Metrics API.
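As a rough illustration of what "exposed by the inference server" means, the sketch below reads a gauge out of Prometheus text exposition format, the format Prometheus scrapes. The sample payload, its label values, and the helper name `read_gauge` are assumptions made for the example. The parser is deliberately minimal and assumes sample lines carry no trailing timestamp.

```python
def read_gauge(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` found in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like "<name>{labels} <value>" or "<name> <value>".
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]
        if metric_name == name:
            total += float(value)
    return total


# Illustrative payload; the label values are made up for this example.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 7.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 3.0
"""

print(read_gauge(SAMPLE, "vllm:num_requests_running"))  # 7.0
```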
The signal mismatch
The core failure of CPU-based scaling for inference is that the metric the autoscaler watches is decoupled from the resource that saturates. Both phases of generation run on the GPU: prefill processes the whole prompt in one pass and populates the KV cache, and decode then generates tokens one at a time. The CPU handles tokenization, request scheduling, and kernel launches. In a high-throughput scenario, the GPU remains at 90%+ utilization while the CPU spends most cycles waiting for GPU completion.
If the HorizontalPodAutoscaler observes 10% CPU usage, it assumes the pod is underutilized. It will not scale up, even if the GPU is fully saturated and the request queue is growing. This creates a latency cliff. Users experience increased time-to-first-token because the inference engine cannot accept new requests fast enough, yet the autoscaler sees no reason to provision more capacity.
The solution involves the Custom Metrics API, which allows the HorizontalPodAutoscaler to query metrics beyond the CPU and memory figures served by the resource metrics API. The prometheus-adapter project bridges this gap. It queries a Prometheus server, applies its configured rules, and exposes the results through the Kubernetes metrics APIs. This allows the HPA to target vllm:num_requests_running instead of cpu/usage.
The following table compares the scaling signals for a standard web service versus an LLM inference service.
| Metric Source | Service Type | What the Autoscaler Infers | Result |
|---|---|---|---|
| CPU Utilization | Web Server | High CPU = High Load | Correct for CPU-bound |
| CPU Utilization | LLM Inference | Low CPU = Low Load | Incorrect for GPU-bound |
| Custom Metric (vLLM) | LLM Inference | High Requests = High Load | Correct for GPU-bound |
| Memory Utilization | LLM Inference | High GPU Memory = High Load | Risk of OOM if CPU scales first |
Configuring the adapter and HPA
To enable custom metric scaling, the prometheus-adapter must be configured to expose the inference metrics to the Kubernetes API. This is done via a ConfigMap in the kube-system or monitoring namespace. The ConfigMap defines rules that map Prometheus queries to Kubernetes metric names.
A typical rule for vLLM targets the vllm:num_requests_running metric. The adapter queries Prometheus for this series, maps its namespace and pod labels to the corresponding Kubernetes resources, sums the matching series, and exposes the result as an external metric. The HorizontalPodAutoscaler then references this external metric in its spec.
The following ConfigMap snippet defines the rule for exposing vLLM metrics. This configuration assumes the vLLM pods expose metrics on port 8000 and Prometheus scrapes them.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: kube-system
data:
  config.yaml: |
    # externalRules (rather than rules) publishes the series through the
    # external metrics API, which is what an HPA metric of type External reads.
    externalRules:
      - seriesQuery: '{__name__=~"^vllm:num_requests_running$"}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "vllm:num_requests_running"
          as: "vllm_requests_running"
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```
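Before wiring up the autoscaler, it can help to confirm that the metric is actually being served. The sketch below only builds the request path for the external metrics API (`external.metrics.k8s.io/v1beta1`); the helper name is hypothetical, and the namespace and label values are taken from the manifests in this section.

```python
from typing import Dict, Optional
from urllib.parse import quote


def external_metric_path(namespace: str, metric: str,
                         labels: Optional[Dict[str, str]] = None) -> str:
    """Build the API path the HPA would query for a namespaced external metric."""
    path = f"/apis/external.metrics.k8s.io/v1beta1/namespaces/{namespace}/{metric}"
    if labels:
        # Label selectors use "key=value" pairs joined by commas.
        selector = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        path += "?labelSelector=" + quote(selector, safe="")
    return path


print(external_metric_path("inference", "vllm_requests_running",
                           {"app": "vllm-deployment"}))
```

The resulting path can be passed to `kubectl get --raw "<path>"`. An empty item list or a not-found error usually means the adapter rule did not match any series.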
Once the adapter is running, the HorizontalPodAutoscaler manifest must target this external metric. The HPA spec uses the external metric type instead of resource. It sets a target value, such as keeping 5 requests running per replica.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: vllm_requests_running
          selector:
            matchLabels:
              app: vllm-deployment
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```
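To see what an AverageValue target of 5 implies, the following sketch approximates how the HPA turns a summed external metric into a replica count: the total is divided by the per-replica target and rounded up, then clamped to the min and max bounds. The function name and request totals are illustrative assumptions, and scale-down behavior policies are not modeled.

```python
import math


def replicas_for_average_value(total: float, target_per_replica: float,
                               current_replicas: int, min_replicas: int = 1,
                               max_replicas: int = 10,
                               tolerance: float = 0.1) -> int:
    """Approximate the HPA recommendation for an External AverageValue target."""
    usage_ratio = total / (target_per_replica * current_replicas)
    desired = current_replicas
    if abs(1.0 - usage_ratio) > tolerance:
        desired = math.ceil(total / target_per_replica)
    # The HPA never goes outside minReplicas..maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


print(replicas_for_average_value(42, 5, current_replicas=4))  # 9
print(replicas_for_average_value(80, 5, current_replicas=4))  # 10 (capped)
print(replicas_for_average_value(21, 5, current_replicas=4))  # 4 (within tolerance)
```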
The behavior block is critical for stability. The manifest sets the scale-down stabilization window to 300 seconds, which is also the Kubernetes default. This prevents the HPA from reacting to a momentary dip in traffic by immediately removing replicas. Inference traffic can be bursty, and a 300-second window ensures that a brief lull does not trigger a scale-down that leaves the service unable to handle the next burst. The Percent policy adds a second limit: at most 10% of the current replicas can be removed in any 60-second period.
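The effect of the window can be sketched as a simplified simulation: at each sync the HPA acts on the highest recommendation seen within the trailing window, so a short dip is ignored. This is a model under stated assumptions (a fixed 15-second sync period, no rate-limiting policies, hypothetical recommendation values), not the controller's actual code.

```python
from collections import deque
from typing import List


def apply_scale_down_window(recommendations: List[int],
                            window_seconds: int = 300,
                            sync_period_seconds: int = 15) -> List[int]:
    """Return the replica count applied at each sync.

    The applied value is the maximum recommendation inside the trailing
    window, so scale-ups pass through at once and scale-downs are delayed.
    """
    window_len = max(1, window_seconds // sync_period_seconds)
    recent = deque(maxlen=window_len)
    applied = []
    for rec in recommendations:
        recent.append(rec)
        applied.append(max(recent))
    return applied


# Hypothetical one-minute lull (four syncs at 3 replicas) between two bursts.
lull = [8, 8, 8, 3, 3, 3, 3, 8, 8, 8]
print(apply_scale_down_window(lull))                     # stays at 8 throughout
print(apply_scale_down_window(lull, window_seconds=30))  # dips to 3 during the lull
```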
Failure modes and lag
The primary failure mode when using CPU metrics for LLM serving is the latency cliff. The system scales up only after the CPU spikes, which typically occurs when the GPU is already saturated and the request queue is backing up. By the time new replicas are provisioned, users have already experienced degraded latency.
When using custom metrics, the risk shifts to metric availability. If the prometheus-adapter pod restarts or the connection to Prometheus fails, the HPA cannot read the metric. When its only metric is unavailable, the HPA does not compute a new replica count; it leaves the Deployment at its current size and reports the failure through its ScalingActive condition. The service is therefore frozen rather than scaled down: if traffic climbs during a monitoring outage, no capacity is added, requests queue, and latency degrades until the metric pipeline recovers.
Another failure mode is oscillation. If the target value is set too low, the HPA adds replicas aggressively and then removes them once the per-replica average falls. If the target is set too high, it scales up too slowly. The stabilizationWindowSeconds parameter mitigates the flapping. With a value of 300 seconds, the HPA acts on the highest replica recommendation from the previous 5 minutes, so replicas are removed only after the metric has stayed below the threshold for that long. This can be too long for inference services that need to be nimble. A value of 60 seconds is often more appropriate for inference, balancing responsiveness with stability.
The metrics pipeline itself adds latency. The HPA syncs every 15 seconds by default. The adapter queries Prometheus, which may have a scrape interval of 15 seconds. The total lag from load increase to scale-up decision can therefore reach or exceed 30 seconds, before any time spent starting the new pods. For high-frequency trading or real-time chat, this lag is unacceptable. Tuning the behavior policies does not shorten the pipeline, but it can ensure that scale-up proceeds quickly once the signal arrives while scale-down stays conservative.
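The lag budget is simple arithmetic, sketched below. The scrape interval and sync period come from the figures above; the pod startup time is a hypothetical placeholder, since image pull and model load times vary widely between deployments.

```python
from dataclasses import dataclass


@dataclass
class ScaleUpLag:
    scrape_interval_s: float = 15.0  # Prometheus scrape interval from the text
    hpa_sync_period_s: float = 15.0  # default HPA sync period from the text
    pod_startup_s: float = 0.0       # hypothetical: image pull plus model load

    def detection_s(self) -> float:
        """Worst case: the load change just misses a scrape, then just misses a sync."""
        return self.scrape_interval_s + self.hpa_sync_period_s

    def capacity_s(self) -> float:
        """Time until new capacity is serving, including pod startup."""
        return self.detection_s() + self.pod_startup_s


print(ScaleUpLag().detection_s())                    # 30.0
print(ScaleUpLag(pod_startup_s=120.0).capacity_s())  # 150.0
```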
Decision frame
The choice between CPU-based HPA and custom metric scaling is not about preference but about resource alignment. CPU metrics are valid for CPU-bound services; custom metrics are mandatory for GPU-bound services. The tradeoff lies in the stabilization window. A 300-second window prevents oscillation but delays response to traffic drops. A 60-second window responds quickly but risks scaling down during transient lulls.
The next time the replica count stays flat while the GPU is saturated and requests queue, the question is not “is the autoscaler broken.” It is “is the HPA targeting the correct metric.” If the HorizontalPodAutoscaler spec targets resource.cpu instead of external.vllm_requests_running, the scaling logic is fundamentally misaligned with the workload’s bottleneck. Verify the metric source before tuning the stabilization window.