30 mechanism articles across 4 surfaces.

serve·
Model weights belong in object storage, not container images
An init-container downloads model weights to a shared emptyDir volume before the inference container starts, trading image size for cold-start latency.
train·
Checkpoint storage patterns for distributed training
Checkpointing writes model state to disk, but the storage tier determines whether a node failure costs minutes or days of training time.
cluster·
How priority and preemption interact with GPU pods
PriorityClass values determine which pods survive resource contention; preemption can evict long-running training jobs for short inference requests.
operate·
Why the NVIDIA gpu-operator upgrade window is non-trivial
GPU Operator upgrades force a kernel module reload sequence that blocks pod scheduling until the node is uncordoned and the device plugin restarts.
operate·
What Velero backs up on an AI cluster
Velero captures Kubernetes API objects and optionally persistent volume data, but leaves stateless runtime artifacts like model weights outside its scope.
operate·
ArgoCD and Flux reconciliation cost for AI clusters
ArgoCD and Flux provide feature parity for GitOps, but their operational costs diverge in reconciliation loop frequency and controller CPU consumption.
operate·
Why GPU workloads need a custom Pod Security Admission baseline
The built-in PSA baseline and restricted profiles both reject privileged pods, but the NVIDIA device plugin requires privileged access to initialize GPU drivers.
operate·
Network policy isolation for multi-tenant AI workloads
NetworkPolicy can enforce tenant isolation, but only with a default-deny policy in place and a CNI plugin that enforces it; without both, cross-namespace traffic is allowed.
operate·
Which DCGM metrics actually matter for GPU monitoring
DCGM exports over 150 metrics but only eight carry operational signal; the rest are noise that creates alert fatigue.
train·
Why BF16 replaced FP16 for distributed training
BF16 matches FP32's exponent range to prevent gradient overflow during training, while FP16 remains viable for inference where values stay bounded.
operate·
How to detect GPU-specific failures via Kubernetes events
When a GPU pod fails, the root cause signal lives in three distinct streams: scheduler allocation events, kubelet lifecycle events, and DCGM health metrics.
train·
What gradient accumulation does to training throughput
Gradient accumulation sums gradients over N micro-batches before an optimizer step, reducing all-reduce frequency without lowering activation memory.
train·
DeepSpeed ZeRO stages partition training states across GPUs
ZeRO-1, 2, and 3 trade communication overhead for memory savings by partitioning optimizer states, gradients, and parameters respectively, each stage building on the one before it.
train·
What PyTorch Elastic actually recovers from
TorchElastic re-rendezvouses survivors when a worker fails, restarting from the last checkpoint. It does not recover from an etcd outage or checkpoint corruption.
train·
Streaming training data from object storage without network saturation
MosaicML's streaming library and WebDataset shard datasets into tar files, allowing PyTorch DataLoaders to fetch and cache samples on-demand.
train· stable since PyTorch 2.0
DDP vs FSDP — when to switch and what it costs
DDP replicates the full model on each GPU; FSDP shards parameters across GPUs. Switching costs ~15-25% throughput but enables models that exceed single-GPU memory.
train·
How the Kubeflow Training Operator's PyTorchJob actually launches a job
The PyTorchJob CRD creates a Headless Service and injects environment variables into pods; distributed training depends on DNS resolution of that Service name.
serve·
What KServe adds over a plain Kubernetes Deployment
KServe introduces an InferenceService CRD that manages model loading and traffic routing, while a Deployment manages only container replicas.
serve·
When to pick Triton, vLLM, or TGI — three inference servers, three different bets
The choice between Triton, vLLM, and TGI maps to model family and operational complexity, not feature parity.
serve·
Why CPU-based HPA is wrong for LLM serving
CPU utilization metrics fail to capture load during GPU-bound LLM inference; custom metrics on token throughput or request count are required for accurate scaling.
serve·
What it takes to stream LLM responses through Kubernetes ingress
Streaming LLM tokens through Kubernetes ingress fails silently unless the ingress controller disables response buffering and extends read timeouts.
cluster·
Why GPU nodes need taints, even on a single-tenant cluster
The Kubernetes scheduler places CPU-only pods on GPU nodes unless taints block them, wasting expensive hardware capacity.
cluster·
How node-feature-discovery actually labels GPU nodes
Node Feature Discovery labels nodes with the exact GPU model string returned by the driver, and pod selectors must match that string exactly to schedule workloads.
serve·
Sizing KV-cache memory for LLM inference
KV-cache memory allocation is a deterministic arithmetic problem defined by model architecture and batch size, not a heuristic guess.
cluster·
Why topology-aware placement matters for NCCL, and how to express it
Distributed training all-reduce throughput depends on whether pods land in the same rack, switch, or NVLink domain. PodAffinity and TopologySpreadConstraints control this, but the topology key must match node labels.
cluster·
Eviction signals interrupt training checkpoints
Node-pressure eviction sends SIGTERM to training pods, interrupting checkpoint writes before the process receives SIGKILL.
cluster·
Why GPU resource quotas behave differently than CPU quotas
ResourceQuota enforces GPU limits independently of scheduler requests, causing OutOfQuota events when LimitRange injects default values.
cluster·
What gang scheduling actually guarantees with Volcano
Volcano's PodGroup CRD gates pod placement until a minimum number of replicas fit, preventing distributed training jobs from deadlocking on partial allocation.
cluster·
MIG, MPS, and time-slicing — three ways to share a GPU, only one of them is isolation
NVIDIA's GPU sharing modes expose a critical distinction: only Multi-Instance GPU provides hardware memory isolation for Kubernetes workloads.
cluster·
How the default scheduler scores nodes for GPU pods
The default scheduler assigns GPU pods using a weighted scoring system where NodeResourcesFit and InterPodAffinity outweigh custom pod annotations.
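
As a rough illustration of the arithmetic behind the "Sizing KV-cache memory for LLM inference" entry above, the sketch below computes cache size for a standard transformer decoder. It assumes one key tensor and one value tensor per layer, and it adds sequence length as an input alongside the architecture and batch size the entry names. The function name `kv_cache_bytes` and the example configuration are hypothetical, not taken from the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element=2):
    """Bytes needed to hold keys and values for every token of every sequence.

    bytes_per_element defaults to 2, which assumes a 16-bit cache (FP16/BF16).
    """
    # Factor of 2: each layer stores one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size


# Hypothetical configuration, chosen only to make the numbers easy to follow.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Under these assumptions each token costs 512 KiB of cache, so the total scales linearly with both sequence length and batch size. Serving engines that page or quantize the cache will differ from this simple upper bound.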