Mechanism reference for Kubernetes at the helm of AI

One article, one mechanism, with the real APIs and the real failure modes. Forty charts across four surfaces.


Latest charts
Model weights belong in object storage, not container images
serve
Checkpoint storage patterns for distributed training
train
How priority and preemption interact with GPU pods
cluster
Why the NVIDIA gpu-operator upgrade window is non-trivial
operate
What Velero backs up on an AI cluster
operate