<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>kubernai</title><description>Mechanism reference for running AI workloads on Kubernetes. One article, one specific mechanism, with the real APIs and the real failure modes.</description><link>https://kubernai.vercel.app/</link><language>en-us</language><item><title>Model weights belong in object storage, not container images</title><link>https://kubernai.vercel.app/articles/model-registry-sidecar-pattern/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/model-registry-sidecar-pattern/</guid><description>An init-container downloads model weights to a shared emptyDir volume before the inference container starts, trading image size for cold-start latency.</description><pubDate>Wed, 13 May 2026 20:31:03 GMT</pubDate><category>serve</category><category>model-registry</category><category>init-container</category><category>huggingface</category><category>s3</category><category>emptydir</category></item><item><title>Checkpoint storage patterns for distributed training</title><link>https://kubernai.vercel.app/articles/checkpoint-storage-patterns/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/checkpoint-storage-patterns/</guid><description>Checkpointing writes model state to disk, but the storage tier determines whether a node failure costs minutes or days of training time.</description><pubDate>Wed, 13 May 2026 20:30:24 GMT</pubDate><category>train</category><category>checkpoints</category><category>pvc</category><category>object-storage</category><category>torch-save</category></item><item><title>How priority and preemption interact with GPU pods</title><link>https://kubernai.vercel.app/articles/priority-preemption-for-ai-pods/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/priority-preemption-for-ai-pods/</guid><description>PriorityClass values determine which pods survive resource contention; preemption can evict long-running training 
jobs for short inference requests.</description><pubDate>Wed, 13 May 2026 20:28:23 GMT</pubDate><category>cluster</category><category>priorityclass</category><category>preemption</category><category>gpu</category><category>interruption</category></item><item><title>Why the NVIDIA gpu-operator upgrade window is non-trivial</title><link>https://kubernai.vercel.app/articles/gpu-operator-upgrade-mechanics/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/gpu-operator-upgrade-mechanics/</guid><description>GPU Operator upgrades force a kernel module reload sequence that blocks pod scheduling until the node re-uncordons and the device plugin restarts.</description><pubDate>Wed, 13 May 2026 17:52:19 GMT</pubDate><category>operate</category><category>gpu-operator</category><category>upgrades</category><category>node-drain</category><category>drivers</category></item><item><title>What Velero backs up on an AI cluster</title><link>https://kubernai.vercel.app/articles/velero-backup-for-ai-clusters/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/velero-backup-for-ai-clusters/</guid><description>Velero captures Kubernetes API objects and optionally persistent volume data, but leaves stateless runtime artifacts like model weights outside its scope.</description><pubDate>Wed, 13 May 2026 17:52:15 GMT</pubDate><category>operate</category><category>velero</category><category>backup</category><category>dr</category><category>csi-snapshots</category></item><item><title>ArgoCD and Flux reconciliation cost for AI clusters</title><link>https://kubernai.vercel.app/articles/argocd-vs-flux-for-ai-clusters/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/argocd-vs-flux-for-ai-clusters/</guid><description>ArgoCD and Flux provide feature parity for GitOps, but the operational cost diverges in the reconciliation loop frequency and controller CPU consumption.</description><pubDate>Wed, 13 May 2026 17:51:11 
GMT</pubDate><category>operate</category><category>argocd</category><category>flux</category><category>gitops</category><category>reconciliation</category></item><item><title>Why GPU workloads need a custom Pod Security Admission baseline</title><link>https://kubernai.vercel.app/articles/pod-security-admission-for-gpu-workloads/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/pod-security-admission-for-gpu-workloads/</guid><description>Default PSA restricted profiles reject privileged pods, but the NVIDIA device plugin requires privileged access to initialize GPU drivers.</description><pubDate>Wed, 13 May 2026 17:51:10 GMT</pubDate><category>operate</category><category>psa</category><category>pod-security</category><category>privileged</category><category>gpu</category></item><item><title>Network policy isolation for multi-tenant AI workloads</title><link>https://kubernai.vercel.app/articles/network-policy-for-multi-tenant-ai/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/network-policy-for-multi-tenant-ai/</guid><description>NetworkPolicy enforces tenant isolation but requires a default-deny policy and CNI enforcement to block cross-namespace traffic by default.</description><pubDate>Wed, 13 May 2026 17:50:01 GMT</pubDate><category>operate</category><category>network-policy</category><category>multi-tenant</category><category>cilium</category><category>security</category></item><item><title>Which DCGM metrics actually matter for GPU monitoring</title><link>https://kubernai.vercel.app/articles/dcgm-exporter-metrics/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/dcgm-exporter-metrics/</guid><description>DCGM exports over 150 metrics but only eight carry operational signal; the rest are noise that creates alert fatigue.</description><pubDate>Wed, 13 May 2026 17:49:47 
GMT</pubDate><category>operate</category><category>dcgm</category><category>gpu-metrics</category><category>prometheus</category><category>observability</category></item><item><title>Why BF16 replaced FP16 for distributed training</title><link>https://kubernai.vercel.app/articles/mixed-precision-bf16-vs-fp16/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/mixed-precision-bf16-vs-fp16/</guid><description>BF16 matches FP32&apos;s exponent range to prevent gradient overflow during training, while FP16 remains viable for inference where values stay bounded.</description><pubDate>Wed, 13 May 2026 17:49:13 GMT</pubDate><category>train</category><category>mixed-precision</category><category>bf16</category><category>fp16</category><category>training</category><category>deepseek</category></item><item><title>How to detect GPU-specific failures via Kubernetes events</title><link>https://kubernai.vercel.app/articles/kubernetes-events-for-gpu-failures/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/kubernetes-events-for-gpu-failures/</guid><description>When a GPU pod fails, the root cause signal lives in three distinct streams: scheduler allocation events, kubelet lifecycle events, and DCGM health metrics.</description><pubDate>Wed, 13 May 2026 17:49:04 GMT</pubDate><category>operate</category><category>events</category><category>gpu</category><category>kubelet</category><category>monitoring</category></item><item><title>What gradient accumulation does to training throughput</title><link>https://kubernai.vercel.app/articles/gradient-accumulation-mechanics/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/gradient-accumulation-mechanics/</guid><description>Gradient accumulation sums gradients over N micro-batches before an optimizer step, reducing all-reduce frequency without lowering activation memory.</description><pubDate>Wed, 13 May 2026 17:47:53 
GMT</pubDate><category>train</category><category>gradient-accumulation</category><category>batch-size</category><category>training</category><category>throughput</category></item><item><title>DeepSpeed ZeRO stages partition training states across GPUs</title><link>https://kubernai.vercel.app/articles/deepspeed-zero-stages/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/deepspeed-zero-stages/</guid><description>ZeRO-1, 2, and 3 trade memory savings for communication overhead by partitioning optimizer states, gradients, and parameters respectively.</description><pubDate>Wed, 13 May 2026 17:47:24 GMT</pubDate><category>train</category><category>deepspeed</category><category>zero</category><category>memory-optimization</category><category>training</category></item><item><title>What PyTorch Elastic actually recovers from</title><link>https://kubernai.vercel.app/articles/torch-elastic-fault-tolerance/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/torch-elastic-fault-tolerance/</guid><description>TorchElastic re-rendezvouses survivors when a worker fails, restarting from the last checkpoint. 
It does not recover from an etcd outage or checkpoint corruption.</description><pubDate>Wed, 13 May 2026 17:47:11 GMT</pubDate><category>train</category><category>torch-elastic</category><category>torchrun</category><category>fault-tolerance</category><category>rendezvous</category></item><item><title>Streaming training data from object storage without network saturation</title><link>https://kubernai.vercel.app/articles/dataset-streaming-from-object-storage/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/dataset-streaming-from-object-storage/</guid><description>MosaicML&apos;s streaming library and WebDataset shard datasets into tar files, allowing PyTorch DataLoaders to fetch and cache samples on demand.</description><pubDate>Wed, 13 May 2026 17:47:03 GMT</pubDate><category>train</category><category>dataset</category><category>streaming</category><category>webdataset</category><category>mosaicml-streaming</category></item><item><title>DDP vs FSDP — when to switch and what it costs</title><link>https://kubernai.vercel.app/articles/pytorch-ddp-vs-fsdp/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/pytorch-ddp-vs-fsdp/</guid><description>DDP replicates the full model on each GPU; FSDP shards parameters across GPUs. 
Switching costs ~15-25% throughput but enables models that exceed single-GPU memory.</description><pubDate>Wed, 13 May 2026 17:45:04 GMT</pubDate><category>train</category><category>pytorch</category><category>ddp</category><category>fsdp</category><category>distributed-training</category><category>memory</category></item><item><title>How the Kubeflow Training Operator&apos;s PyTorchJob actually launches a job</title><link>https://kubernai.vercel.app/articles/training-operator-pytorchjob/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/training-operator-pytorchjob/</guid><description>The PyTorchJob CRD creates a Headless Service and injects environment variables into pods; distributed training depends on DNS resolution of that Service name.</description><pubDate>Wed, 13 May 2026 17:44:06 GMT</pubDate><category>train</category><category>training-operator</category><category>pytorchjob</category><category>kubeflow</category><category>distributed-training</category></item><item><title>What KServe adds over a plain Kubernetes Deployment</title><link>https://kubernai.vercel.app/articles/kserve-vs-roll-your-own/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/kserve-vs-roll-your-own/</guid><description>KServe introduces an InferenceService CRD that manages model loading and traffic routing, while a Deployment manages only container replicas.</description><pubDate>Wed, 13 May 2026 17:39:27 GMT</pubDate><category>serve</category><category>kserve</category><category>knative</category><category>serverless</category><category>inference</category><category>deployment</category></item><item><title>When to pick Triton, vLLM, or TGI — three inference servers, three different bets</title><link>https://kubernai.vercel.app/articles/triton-vs-vllm-vs-tgi/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/triton-vs-vllm-vs-tgi/</guid><description>The choice between Triton, vLLM, and TGI maps to model family and operational 
complexity, not feature parity.</description><pubDate>Wed, 13 May 2026 17:38:21 GMT</pubDate><category>serve</category><category>triton</category><category>vllm</category><category>tgi</category><category>inference-server</category><category>kubernetes</category></item><item><title>Why CPU-based HPA is wrong for LLM serving</title><link>https://kubernai.vercel.app/articles/hpa-for-token-throughput/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/hpa-for-token-throughput/</guid><description>CPU utilization metrics fail to capture load during GPU-bound LLM inference; custom metrics on token throughput or request count are required for accurate scaling.</description><pubDate>Wed, 13 May 2026 17:38:11 GMT</pubDate><category>serve</category><category>hpa</category><category>kpa</category><category>metrics</category><category>autoscaling</category></item><item><title>What it takes to stream LLM responses through Kubernetes ingress</title><link>https://kubernai.vercel.app/articles/streaming-ingress-for-llm-responses/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/streaming-ingress-for-llm-responses/</guid><description>Streaming LLM tokens through Kubernetes ingress fails silently unless the ingress controller disables response buffering and extends read timeouts.</description><pubDate>Wed, 13 May 2026 17:37:42 GMT</pubDate><category>serve</category><category>ingress</category><category>streaming</category><category>sse</category><category>envoy</category></item><item><title>Why GPU nodes need taints, even on a single-tenant cluster</title><link>https://kubernai.vercel.app/articles/taints-and-tolerations-for-gpu-nodes/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/taints-and-tolerations-for-gpu-nodes/</guid><description>The Kubernetes scheduler places CPU-only pods on GPU nodes unless taints block them, wasting expensive hardware capacity.</description><pubDate>Wed, 13 May 2026 17:36:30 
GMT</pubDate><category>cluster</category><category>taints</category><category>tolerations</category><category>gpu</category><category>scheduling</category></item><item><title>How node-feature-discovery actually labels GPU nodes</title><link>https://kubernai.vercel.app/articles/gpu-feature-discovery-and-labels/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/gpu-feature-discovery-and-labels/</guid><description>Node Feature Discovery labels nodes with the exact GPU model string returned by the driver, and pod selectors must match that string exactly to schedule workloads.</description><pubDate>Wed, 13 May 2026 17:34:45 GMT</pubDate><category>cluster</category><category>nfd</category><category>gpu-feature-discovery</category><category>nodelabels</category><category>scheduling</category></item><item><title>Sizing KV-cache memory for LLM inference</title><link>https://kubernai.vercel.app/articles/kv-cache-sizing-for-llm-serving/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/kv-cache-sizing-for-llm-serving/</guid><description>KV-cache memory allocation is a deterministic arithmetic problem defined by model architecture and batch size, not a heuristic guess.</description><pubDate>Wed, 13 May 2026 17:34:25 GMT</pubDate><category>serve</category><category>kv-cache</category><category>memory-budget</category><category>vllm</category><category>context-window</category></item><item><title>Why topology-aware placement matters for NCCL, and how to express it</title><link>https://kubernai.vercel.app/articles/pod-affinity-for-distributed-training/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/pod-affinity-for-distributed-training/</guid><description>Distributed training all-reduce throughput depends on whether pods land in the same rack, switch, or NVLink domain. 
PodAffinity and TopologySpreadConstraints control this, but the topology key must match node labels.</description><pubDate>Wed, 13 May 2026 17:33:31 GMT</pubDate><category>cluster</category><category>pod-affinity</category><category>topology-spread</category><category>nccl</category><category>training</category></item><item><title>Eviction signals interrupt training checkpoints</title><link>https://kubernai.vercel.app/articles/eviction-and-gpu-workloads/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/eviction-and-gpu-workloads/</guid><description>Node-pressure eviction sends SIGTERM to training pods, interrupting checkpoint writes before the process receives SIGKILL.</description><pubDate>Wed, 13 May 2026 17:31:35 GMT</pubDate><category>cluster</category><category>eviction</category><category>node-pressure</category><category>checkpoints</category><category>training</category></item><item><title>Why GPU resource quotas behave differently than CPU quotas</title><link>https://kubernai.vercel.app/articles/resource-quotas-for-gpu-namespaces/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/resource-quotas-for-gpu-namespaces/</guid><description>ResourceQuota enforces GPU limits independently of scheduler requests, causing OutOfQuota events when LimitRange injects default values.</description><pubDate>Wed, 13 May 2026 17:30:38 GMT</pubDate><category>cluster</category><category>resourcequota</category><category>gpu</category><category>namespaces</category><category>multi-tenant</category></item><item><title>What gang scheduling actually guarantees with Volcano</title><link>https://kubernai.vercel.app/articles/gang-scheduling-with-volcano/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/gang-scheduling-with-volcano/</guid><description>Volcano&apos;s PodGroup CRD gates pod placement until a minimum number of replicas fit, preventing distributed training jobs from deadlocking on partial 
allocation.</description><pubDate>Wed, 13 May 2026 17:29:36 GMT</pubDate><category>cluster</category><category>volcano</category><category>gang-scheduling</category><category>podgroup</category><category>training</category></item><item><title>MIG, MPS, and time-slicing — three ways to share a GPU, only one of them is isolation</title><link>https://kubernai.vercel.app/articles/mig-vs-mps-vs-time-slicing/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/mig-vs-mps-vs-time-slicing/</guid><description>NVIDIA&apos;s GPU sharing modes expose a critical distinction: only Multi-Instance GPU provides hardware memory isolation for Kubernetes workloads.</description><pubDate>Wed, 13 May 2026 17:29:36 GMT</pubDate><category>cluster</category><category>mig</category><category>mps</category><category>time-slicing</category><category>gpu-partitioning</category></item><item><title>How the default scheduler scores nodes for GPU pods</title><link>https://kubernai.vercel.app/articles/kube-scheduler-scoring-for-gpu-pods/</link><guid isPermaLink="true">https://kubernai.vercel.app/articles/kube-scheduler-scoring-for-gpu-pods/</guid><description>The default scheduler assigns GPU pods using a weighted scoring system where NodeResourcesFit and InterPodAffinity outweigh custom pod annotations.</description><pubDate>Wed, 13 May 2026 17:28:24 GMT</pubDate><category>cluster</category><category>scheduler</category><category>gpu</category><category>scoring</category><category>kube-scheduler-config</category></item></channel></rss>