30 mechanism articles across 4 surfaces.

serve·May 13, 2026

Model weights belong in object storage, not container images

An init container downloads model weights to a shared emptyDir volume before the inference container starts, trading a smaller image for added cold-start latency.

train·May 13, 2026

Checkpoint storage patterns for distributed training

Checkpointing writes model state to disk, but the storage tier determines whether a node failure costs minutes or days of training time.

cluster·May 13, 2026

How priority and preemption interact with GPU pods

PriorityClass values determine which pods survive resource contention; preemption can evict long-running training jobs for short inference requests.

operate·May 13, 2026

Why the NVIDIA gpu-operator upgrade window is non-trivial

GPU Operator upgrades force a kernel module reload sequence that blocks pod scheduling until the node is uncordoned and the device plugin restarts.

operate·May 13, 2026

What Velero backs up on an AI cluster

Velero captures Kubernetes API objects and optionally persistent volume data, but leaves stateless runtime artifacts like model weights outside its scope.

operate·May 13, 2026

ArgoCD and Flux reconciliation cost for AI clusters

ArgoCD and Flux offer comparable GitOps features, but their operational cost diverges in reconciliation loop frequency and controller CPU consumption.

operate·May 13, 2026

Why GPU workloads need a custom Pod Security Admission baseline

The Pod Security Admission baseline and restricted profiles both reject privileged pods, but the NVIDIA device plugin requires privileged access to initialize GPU drivers.

operate·May 13, 2026

Network policy isolation for multi-tenant AI workloads

NetworkPolicy can enforce tenant isolation, but only with a default-deny policy in each namespace and a CNI plugin that enforces policies; without both, cross-namespace traffic is allowed.

operate·May 13, 2026

Which DCGM metrics actually matter for GPU monitoring

DCGM exports over 150 metrics but only eight carry operational signal; the rest are noise that creates alert fatigue.

train·May 13, 2026

Why BF16 replaced FP16 for distributed training

BF16 matches FP32's exponent range to prevent gradient overflow during training, while FP16 remains viable for inference where values stay bounded.
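
The exponent-range claim above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `max_finite` is a hypothetical helper that assumes IEEE-style layouts (5 exponent and 10 mantissa bits for FP16, 8 and 7 for BF16, 8 and 23 for FP32) and computes the largest finite value each format can hold.

```python
def max_finite(exp_bits: int, mant_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float format.

    Hypothetical helper for illustration; assumes the all-ones exponent
    is reserved for inf/NaN, as in FP16, BF16, and FP32.
    """
    max_exponent = 2 ** (exp_bits - 1) - 1          # e.g. 15 for FP16, 127 for BF16
    largest_mantissa = 2.0 - 2.0 ** (-mant_bits)    # 1.111...1 in binary
    return largest_mantissa * 2.0 ** max_exponent


fp16_max = max_finite(exp_bits=5, mant_bits=10)
bf16_max = max_finite(exp_bits=8, mant_bits=7)
fp32_max = max_finite(exp_bits=8, mant_bits=23)

print(f"FP16 max: {fp16_max:.6g}")
print(f"BF16 max: {bf16_max:.6g}")
print(f"FP32 max: {fp32_max:.6g}")
```

Under these assumptions FP16 tops out at 65504, while BF16 reaches roughly the same magnitude as FP32, at the cost of fewer mantissa bits.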

operate·May 13, 2026

How to detect GPU-specific failures via Kubernetes events

When a GPU pod fails, the root cause signal lives in three distinct streams: scheduler allocation events, kubelet lifecycle events, and DCGM health metrics.

train·May 13, 2026

What gradient accumulation does to training throughput

Gradient accumulation sums gradients over N micro-batches before an optimizer step, reducing all-reduce frequency while activation memory stays set by the micro-batch size.
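
As a rough illustration of the summing step, the sketch below uses a one-parameter linear model with a mean-squared-error loss in plain Python. The function names and the toy data are assumptions made for this example, and it assumes the batch divides evenly into micro-batches. It shows only the arithmetic: scaling each micro-batch gradient by 1/N and summing reproduces the full-batch gradient.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1/N.

    Assumes len(xs) is a multiple of micro_batch_size.
    """
    n_micro = len(xs) // micro_batch_size
    acc = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Only one micro-batch is "in flight" at a time.
        acc += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return acc


# Toy data (hypothetical): y = 2x, starting from w = 0.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(math.isclose(full, accum))
```

In a real distributed job the gradient synchronization would be deferred until the last micro-batch; that part is framework-specific and is not shown here.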

train·May 13, 2026

DeepSpeed ZeRO stages partition training states across GPUs

ZeRO-1, 2, and 3 trade memory savings for communication overhead by partitioning optimizer states, gradients, and parameters respectively.
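
To make the partitioning concrete, here is a minimal sketch of per-GPU memory for model states at each stage. It assumes mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer states (master weights, momentum, variance). The function name is hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for model states under ZeRO.

    Assumes mixed-precision Adam: 2 B/param weights, 2 B/param gradients,
    12 B/param optimizer states. Stage 0 means no partitioning.
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:          # ZeRO-1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:          # ZeRO-2: also partition gradients
        grads /= num_gpus
    if stage >= 3:          # ZeRO-3: also partition parameters
        params /= num_gpus
    return params + grads + optim


# Hypothetical example: 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Each stage is cumulative: stage 2 includes the stage 1 partitioning, and stage 3 includes both.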

train·May 13, 2026

What PyTorch Elastic actually recovers from

TorchElastic re-rendezvouses survivors when a worker fails, restarting from the last checkpoint. It does not recover from etcd outage or checkpoint corruption.

train·May 13, 2026

Streaming training data from object storage without network saturation

MosaicML's streaming library and WebDataset shard datasets into files (MDS shards and tar archives, respectively), allowing PyTorch DataLoaders to fetch and cache samples on demand.

train·May 13, 2026·stable since pytorch 2.0

DDP vs FSDP — when to switch and what it costs

DDP replicates the full model on each GPU; FSDP shards parameters across GPUs. Switching costs ~15-25% throughput but enables models that exceed single-GPU memory.

train·May 13, 2026

How the Kubeflow Training Operator's PyTorchJob actually launches a job

The PyTorchJob CRD creates a headless Service and injects environment variables into pods; distributed training depends on DNS resolution of that Service name.

serve·May 13, 2026

What KServe adds over a plain Kubernetes Deployment

KServe introduces an InferenceService CRD that manages model loading and traffic routing, while a Deployment manages only container replicas.

serve·May 13, 2026

When to pick Triton, vLLM, or TGI — three inference servers, three different bets

The choice between Triton, vLLM, and TGI maps to model family and operational complexity, not feature parity.

serve·May 13, 2026

Why CPU-based HPA is wrong for LLM serving

CPU utilization metrics fail to capture load during GPU-bound LLM inference; custom metrics on token throughput or request count are required for accurate scaling.

serve·May 13, 2026

What it takes to stream LLM responses through Kubernetes ingress

Streaming LLM tokens through Kubernetes ingress fails silently unless the ingress controller disables response buffering and extends read timeouts.

cluster·May 13, 2026

Why GPU nodes need taints, even on a single-tenant cluster

The Kubernetes scheduler places CPU-only pods on GPU nodes unless taints block them, wasting expensive hardware capacity.

cluster·May 13, 2026

How node-feature-discovery actually labels GPU nodes

Node Feature Discovery labels nodes with the exact GPU model string returned by the driver, and pod selectors must match that string exactly to schedule workloads.

serve·May 13, 2026

Sizing KV-cache memory for LLM inference

KV-cache memory allocation is a deterministic arithmetic problem defined by model architecture, sequence length, and batch size, not a heuristic guess.
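
The arithmetic can be sketched in a few lines. The function below is an illustrative assumption, not taken from any particular inference server: it counts one key tensor and one value tensor per layer, each shaped batch × KV heads × sequence length × head dimension, at 2 bytes per element (FP16 or BF16). The model shape in the example is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed for the KV cache.

    The factor of 2 accounts for storing both K and V in every layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"16 sequences x 4096 tokens: {total / 1024**3:.0f} GiB")
```

For this shape the cache costs 128 KiB per token, so 16 concurrent sequences of 4096 tokens need 8 GiB on top of the model weights.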

cluster·May 13, 2026

Why topology-aware placement matters for NCCL, and how to express it

Distributed training all-reduce throughput depends on whether pods land in the same rack, switch, or NVLink domain. PodAffinity and TopologySpreadConstraints control this, but the topology key must match node labels.

cluster·May 13, 2026

Eviction signals interrupt training checkpoints

Node-pressure eviction sends SIGTERM to training pods, interrupting checkpoint writes before the process receives SIGKILL.

cluster·May 13, 2026

Why GPU resource quotas behave differently than CPU quotas

ResourceQuota enforces GPU limits independently of scheduler requests, causing OutOfQuota events when LimitRange injects default values.

cluster·May 13, 2026

What gang scheduling actually guarantees with Volcano

Volcano's PodGroup CRD gates pod placement until a minimum number of replicas fit, preventing distributed training jobs from deadlocking on partial allocation.

cluster·May 13, 2026

MIG, MPS, and time-slicing — three ways to share a GPU, only one of them is isolation

NVIDIA's GPU sharing modes expose a critical distinction: only Multi-Instance GPU provides hardware memory isolation for Kubernetes workloads.

cluster·May 13, 2026

How the default scheduler scores nodes for GPU pods

The default scheduler assigns GPU pods using a weighted scoring system where NodeResourcesFit and InterPodAffinity outweigh custom pod annotations.