serve

Inference servers, KV-cache, batching, streaming, model routing — the layer that answers requests.

serve·May 13, 2026

Model weights belong in object storage, not container images

An init container downloads model weights to a shared emptyDir volume before the inference container starts, trading a smaller image for added cold-start latency.

serve·May 13, 2026

What KServe adds over a plain Kubernetes Deployment

KServe introduces an InferenceService CRD that manages model loading and traffic routing, while a Deployment manages only container replicas.

serve·May 13, 2026

When to pick Triton, vLLM, or TGI — three inference servers, three different bets

The choice between Triton, vLLM, and TGI maps to model family and operational complexity, not feature parity.

serve·May 13, 2026

Why CPU-based HPA is wrong for LLM serving

CPU utilization metrics fail to capture load during GPU-bound LLM inference; custom metrics on token throughput or request count are required for accurate scaling.

serve·May 13, 2026

What it takes to stream LLM responses through Kubernetes ingress

Streaming LLM tokens through Kubernetes ingress fails silently unless the ingress controller disables response buffering and extends read timeouts.

serve·May 13, 2026

Sizing KV-cache memory for LLM inference

KV-cache memory allocation is a deterministic arithmetic problem defined by model architecture, sequence length, and batch size, not a heuristic guess.
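
As a rough illustration of that arithmetic, here is a minimal sketch using the common estimate of one key and one value tensor per layer. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128, 4096-token context, batch of 16, 2-byte elements) are hypothetical values chosen for illustration, not taken from any specific model or from the article itself. Real servers may add overhead for paging, fragmentation, or quantization that this does not capture.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a decoder-only transformer.

    Assumes every layer stores one key and one value vector of size
    num_kv_heads * head_dim per token (hence the leading factor of 2),
    and that all sequences in the batch reach seq_len tokens.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                           seq_len=4096, batch_size=16)
    print(f"{total / 2**30:.1f} GiB")  # prints "8.0 GiB"
```

Under these assumptions each token costs 128 KiB, so doubling either the context length or the batch size doubles the cache.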