operate

Observability, security, GitOps, cost — the layer that keeps the other three honest.

operate·
Why the NVIDIA gpu-operator upgrade window is non-trivial
GPU Operator upgrades force a kernel module reload that blocks GPU pod scheduling until the node is uncordoned and the device plugin restarts.
operate·
What Velero backs up on an AI cluster
Velero captures Kubernetes API objects and optionally persistent volume data, but leaves stateless runtime artifacts like model weights outside its scope.
operate·
ArgoCD and Flux reconciliation cost for AI clusters
ArgoCD and Flux offer broadly comparable GitOps features, but their operational cost diverges in reconciliation loop frequency and controller CPU consumption.
operate·
Why GPU workloads need a custom Pod Security Admission baseline
The PSA baseline and restricted profiles both reject privileged pods, but the NVIDIA driver and device plugin pods typically need privileged or host-level access to set up GPUs on the node.
operate·
Network policy isolation for multi-tenant AI workloads
NetworkPolicy can isolate tenants, but only with a default-deny policy in each namespace and a CNI plugin that enforces it; without both, cross-namespace traffic stays open.
operate·
Which DCGM metrics actually matter for GPU monitoring
DCGM exports over 150 metrics but only eight carry operational signal; the rest are noise that creates alert fatigue.
operate·
How to detect GPU-specific failures via Kubernetes events
When a GPU pod fails, the root cause signal lives in three distinct streams: scheduler allocation events, kubelet lifecycle events, and DCGM health metrics.