serve

Inference servers, KV-cache, batching, streaming, model routing — the layer that answers requests.

serve·
Model weights belong in object storage, not container images
An init container downloads model weights to a shared emptyDir volume before the inference container starts, trading a smaller image for added cold-start latency.
serve·
What KServe adds over a plain Kubernetes Deployment
KServe introduces an InferenceService CRD that manages model loading and traffic routing, while a Deployment manages only container replicas.
serve·
When to pick Triton, vLLM, or TGI — three inference servers, three different bets
The choice between Triton, vLLM, and TGI maps to model family and operational complexity, not feature parity.
serve·
Why CPU-based HPA is wrong for LLM serving
CPU utilization metrics fail to capture load during GPU-bound LLM inference; custom metrics on token throughput or request count are required for accurate scaling.
serve·
What it takes to stream LLM responses through Kubernetes ingress
Streaming LLM tokens through Kubernetes ingress fails silently unless the ingress controller disables response buffering and extends read timeouts.
serve·
Sizing KV-cache memory for LLM inference
KV-cache memory allocation is a deterministic arithmetic problem defined by model architecture, sequence length, and batch size, not a heuristic guess.
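
As a rough illustration of that arithmetic, here is a minimal sketch in Python. It assumes a standard transformer decoder where every layer stores one key tensor and one value tensor per token, with identical head shapes across layers. The function name `kv_cache_bytes` and the model dimensions in the usage lines are hypothetical and not taken from any specific model; real servers add overhead (block allocation, fragmentation) that this estimate ignores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes (hypothetical helper).

    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    # Factor of 2: one tensor for keys plus one for values in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Illustrative dimensions only: 32 layers, 32 KV heads, head_dim 128.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
print(per_token / 1024, "KiB per token")
print(total / 1024**3, "GiB for 8 sequences of 4096 tokens")
```

With these illustrative numbers the cache costs 512 KiB per token and 16 GiB for the full batch, which shows why sequence length and batch size, not parameter count alone, drive the memory budget. Models that use grouped-query attention have fewer KV heads than query heads, so `num_kv_heads` is the value that matters here.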