kubernai

Mechanism reference for running AI workloads on Kubernetes. One article, one specific mechanism, with the real APIs and the real failure modes.
https://kubernai.vercel.app/en-us

Model weights belong in object storage, not container images
https://kubernai.vercel.app/articles/model-registry-sidecar-pattern/
An init-container downloads model weights to a shared emptyDir volume before the inference container starts, trading image size for cold-start latency.
Published: Wed, 13 May 2026 20:31:03 GMT
Tags: serve, model-registry, init-container, huggingface, s3, emptydir

Checkpoint storage patterns for distributed training
https://kubernai.vercel.app/articles/checkpoint-storage-patterns/
Checkpointing writes model state to disk, but the storage tier determines whether a node failure costs minutes or days of training time.
Published: Wed, 13 May 2026 20:30:24 GMT
Tags: train, checkpoints, pvc, object-storage, torch-save

How priority and preemption interact with GPU pods
https://kubernai.vercel.app/articles/priority-preemption-for-ai-pods/
PriorityClass values determine which pods survive resource contention; preemption can evict long-running training jobs for short inference requests.
Published: Wed, 13 May 2026 20:28:23 GMT
Tags: cluster, priorityclass, preemption, gpu, interruption

Why the NVIDIA gpu-operator upgrade window is non-trivial
https://kubernai.vercel.app/articles/gpu-operator-upgrade-mechanics/
GPU Operator upgrades force a kernel module reload sequence that blocks pod scheduling until the node re-uncordons and the device plugin restarts.
Published: Wed, 13 May 2026 17:52:19 GMT
Tags: operate, gpu-operator, upgrades, node-drain, drivers

What Velero backs up on an AI cluster
https://kubernai.vercel.app/articles/velero-backup-for-ai-clusters/
Velero captures Kubernetes API objects and optionally persistent volume data, but leaves stateless runtime artifacts like model weights outside its scope.
Published: Wed, 13 May 2026 17:52:15 GMT
Tags: operate, velero, backup, dr, csi-snapshots

ArgoCD and Flux reconciliation cost for AI clusters
https://kubernai.vercel.app/articles/argocd-vs-flux-for-ai-clusters/
ArgoCD and Flux provide feature parity for GitOps, but the operational cost diverges in the reconciliation loop frequency and controller CPU consumption.
Published: Wed, 13 May 2026 17:51:11 GMT
Tags: operate, argocd, flux, gitops, reconciliation

Why GPU workloads need a custom Pod Security Admission baseline
https://kubernai.vercel.app/articles/pod-security-admission-for-gpu-workloads/
Default PSA restricted profiles reject privileged pods, but the NVIDIA device plugin requires privileged access to initialize GPU drivers.
Published: Wed, 13 May 2026 17:51:10 GMT
Tags: operate, psa, pod-security, privileged, gpu

Network policy isolation for multi-tenant AI workloads
https://kubernai.vercel.app/articles/network-policy-for-multi-tenant-ai/
NetworkPolicy enforces tenant isolation but requires a default-deny policy and CNI enforcement to block cross-namespace traffic by default.
Published: Wed, 13 May 2026 17:50:01 GMT
Tags: operate, network-policy, multi-tenant, cilium, security

Which DCGM metrics actually matter for GPU monitoring
https://kubernai.vercel.app/articles/dcgm-exporter-metrics/
DCGM exports over 150 metrics but only eight carry operational signal; the rest are noise that creates alert fatigue.
Published: Wed, 13 May 2026 17:49:47 GMT
Tags: operate, dcgm, gpu-metrics, prometheus, observability

Why BF16 replaced FP16 for distributed training
https://kubernai.vercel.app/articles/mixed-precision-bf16-vs-fp16/
BF16 matches FP32's exponent range to prevent gradient overflow during training, while FP16 remains viable for inference where values stay bounded.
Published: Wed, 13 May 2026 17:49:13 GMT
Tags: train, mixed-precision, bf16, fp16, training, deepseek

How to detect GPU-specific failures via Kubernetes events
https://kubernai.vercel.app/articles/kubernetes-events-for-gpu-failures/
When a GPU pod fails, the root cause signal lives in three distinct streams: scheduler allocation events, kubelet lifecycle events, and DCGM health metrics.
Published: Wed, 13 May 2026 17:49:04 GMT
Tags: operate, events, gpu, kubelet, monitoring

What gradient accumulation does to training throughput
https://kubernai.vercel.app/articles/gradient-accumulation-mechanics/
Gradient accumulation sums gradients over N micro-batches before an optimizer step, reducing all-reduce frequency without lowering activation memory.
Published: Wed, 13 May 2026 17:47:53 GMT
Tags: train, gradient-accumulation, batch-size, training, throughput

DeepSpeed ZeRO stages partition training states across GPUs
https://kubernai.vercel.app/articles/deepspeed-zero-stages/
ZeRO-1, 2, and 3 trade memory savings for communication overhead by partitioning optimizer states, gradients, and parameters respectively.
Published: Wed, 13 May 2026 17:47:24 GMT
Tags: train, deepspeed, zero, memory-optimization, training

What PyTorch Elastic actually recovers from
https://kubernai.vercel.app/articles/torch-elastic-fault-tolerance/
TorchElastic re-rendezvouses survivors when a worker fails, restarting from the last checkpoint. It does not recover from etcd outage or checkpoint corruption.
Published: Wed, 13 May 2026 17:47:11 GMT
Tags: train, torch-elastic, torchrun, fault-tolerance, rendezvous

Streaming training data from object storage without network saturation
https://kubernai.vercel.app/articles/dataset-streaming-from-object-storage/
MosaicML's streaming library and WebDataset shard datasets into tar files, allowing PyTorch DataLoaders to fetch and cache samples on-demand.
Published: Wed, 13 May 2026 17:47:03 GMT
Tags: train, dataset, streaming, webdataset, mosaicml-streaming

DDP vs FSDP: when to switch and what it costs
https://kubernai.vercel.app/articles/pytorch-ddp-vs-fsdp/
DDP replicates the full model on each GPU; FSDP shards parameters across GPUs. Switching costs ~15-25% throughput but enables models that exceed single-GPU memory.
Published: Wed, 13 May 2026 17:45:04 GMT
Tags: train, pytorch, ddp, fsdp, distributed-training, memory

How the Kubeflow Training Operator's PyTorchJob actually launches a job
https://kubernai.vercel.app/articles/training-operator-pytorchjob/
The PyTorchJob CRD creates a Headless Service and injects environment variables into pods; distributed training depends on DNS resolution of that Service name.
Published: Wed, 13 May 2026 17:44:06 GMT
Tags: train, training-operator, pytorchjob, kubeflow, distributed-training

What KServe adds over a plain Kubernetes Deployment
https://kubernai.vercel.app/articles/kserve-vs-roll-your-own/
KServe introduces an InferenceService CRD that manages model loading and traffic routing, while a Deployment manages only container replicas.
Published: Wed, 13 May 2026 17:39:27 GMT
Tags: serve, kserve, knative, serverless, inference, deployment

When to pick Triton, vLLM, or TGI: three inference servers, three different bets
https://kubernai.vercel.app/articles/triton-vs-vllm-vs-tgi/
The choice between Triton, vLLM, and TGI maps to model family and operational complexity, not feature parity.
Published: Wed, 13 May 2026 17:38:21 GMT
Tags: serve, triton, vllm, tgi, inference-server, kubernetes

Why CPU-based HPA is wrong for LLM serving
https://kubernai.vercel.app/articles/hpa-for-token-throughput/
CPU utilization metrics fail to capture load during GPU-bound LLM inference; custom metrics on token throughput or request count are required for accurate scaling.
Published: Wed, 13 May 2026 17:38:11 GMT
Tags: serve, hpa, kpa, metrics, autoscaling

What it takes to stream LLM responses through Kubernetes ingress
https://kubernai.vercel.app/articles/streaming-ingress-for-llm-responses/
Streaming LLM tokens through Kubernetes ingress fails silently unless the ingress controller disables response buffering and extends read timeouts.
Published: Wed, 13 May 2026 17:37:42 GMT
Tags: serve, ingress, streaming, sse, envoy

Why GPU nodes need taints, even on a single-tenant cluster
https://kubernai.vercel.app/articles/taints-and-tolerations-for-gpu-nodes/
The Kubernetes scheduler places CPU-only pods on GPU nodes unless taints block them, wasting expensive hardware capacity.
Published: Wed, 13 May 2026 17:36:30 GMT
Tags: cluster, taints, tolerations, gpu, scheduling

How node-feature-discovery actually labels GPU nodes
https://kubernai.vercel.app/articles/gpu-feature-discovery-and-labels/
Node Feature Discovery labels nodes with the exact GPU model string returned by the driver, and pod selectors must match that string exactly to schedule workloads.
Published: Wed, 13 May 2026 17:34:45 GMT
Tags: cluster, nfd, gpu-feature-discovery, nodelabels, scheduling

Sizing KV-cache memory for LLM inference
https://kubernai.vercel.app/articles/kv-cache-sizing-for-llm-serving/
KV-cache memory allocation is a deterministic arithmetic problem defined by model architecture and batch size, not a heuristic guess.
Published: Wed, 13 May 2026 17:34:25 GMT
Tags: serve, kv-cache, memory-budget, vllm, context-window

Why topology-aware placement matters for NCCL, and how to express it
https://kubernai.vercel.app/articles/pod-affinity-for-distributed-training/
Distributed training all-reduce throughput depends on whether pods land in the same rack, switch, or NVLink domain. PodAffinity and TopologySpreadConstraints control this, but the topology key must match node labels.
Published: Wed, 13 May 2026 17:33:31 GMT
Tags: cluster, pod-affinity, topology-spread, nccl, training

Eviction signals interrupt training checkpoints
https://kubernai.vercel.app/articles/eviction-and-gpu-workloads/
Node-pressure eviction sends SIGTERM to training pods, interrupting checkpoint writes before the process receives SIGKILL.
Published: Wed, 13 May 2026 17:31:35 GMT
Tags: cluster, eviction, node-pressure, checkpoints, training

Why GPU resource quotas behave differently than CPU quotas
https://kubernai.vercel.app/articles/resource-quotas-for-gpu-namespaces/
ResourceQuota enforces GPU limits independently of scheduler requests, causing OutOfQuota events when LimitRange injects default values.
Published: Wed, 13 May 2026 17:30:38 GMT
Tags: cluster, resourcequota, gpu, namespaces, multi-tenant

What gang scheduling actually guarantees with Volcano
https://kubernai.vercel.app/articles/gang-scheduling-with-volcano/
Volcano's PodGroup CRD gates pod placement until a minimum number of replicas fit, preventing distributed training jobs from deadlocking on partial allocation.
Published: Wed, 13 May 2026 17:29:36 GMT
Tags: cluster, volcano, gang-scheduling, podgroup, training

MIG, MPS, and time-slicing: three ways to share a GPU, only one of them is isolation
https://kubernai.vercel.app/articles/mig-vs-mps-vs-time-slicing/
NVIDIA's GPU sharing modes expose a critical distinction: only Multi-Instance GPU provides hardware memory isolation for Kubernetes workloads.
Published: Wed, 13 May 2026 17:29:36 GMT
Tags: cluster, mig, mps, time-slicing, gpu-partitioning

How the default scheduler scores nodes for GPU pods
https://kubernai.vercel.app/articles/kube-scheduler-scoring-for-gpu-pods/
The default scheduler assigns GPU pods using a weighted scoring system where NodeResourcesFit and InterPodAffinity outweigh custom pod annotations.
Published: Wed, 13 May 2026 17:28:24 GMT
Tags: cluster, scheduler, gpu, scoring, kube-scheduler-config
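
The KV-cache sizing entry in the list above calls the allocation "a deterministic arithmetic problem defined by model architecture and batch size". The feed summary does not spell the arithmetic out, so the following is only a minimal sketch under the commonly used assumption that each layer stores one key tensor and one value tensor per token. The function name `kv_cache_bytes` and the model dimensions in the example are hypothetical and are not taken from the linked article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    Assumes one K and one V tensor per layer, each of shape
    (batch_size, num_kv_heads, seq_len, head_dim), stored at
    bytes_per_elem bytes per element (2 for FP16 or BF16).
    """
    tensors_per_layer = 2  # one for keys, one for values
    return (tensors_per_layer * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       seq_len=4096, batch_size=8)
print(f"{total / 2**30:.1f} GiB")  # 4.0 GiB
```

With these assumed numbers each token costs 128 KiB, each 4096-token sequence 512 MiB, and a batch of eight sequences 4 GiB. Real inference servers add their own allocation overhead on top of this figure, so treat it as a lower bound rather than a budget.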