operate

Observability, security, GitOps, cost — the layer that keeps the other three honest.

operate·May 13, 2026

Why the NVIDIA gpu-operator upgrade window is non-trivial

GPU Operator upgrades force a kernel module reload sequence that blocks pod scheduling on the node until it is uncordoned and the device plugin has restarted.

operate·May 13, 2026

What Velero backs up on an AI cluster

Velero captures Kubernetes API objects and, optionally, persistent volume data, but it does not cover runtime artifacts such as model weights that live outside those volumes.

operate·May 13, 2026

ArgoCD and Flux reconciliation cost for AI clusters

ArgoCD and Flux offer comparable GitOps features, but their operational costs diverge in reconciliation loop frequency and controller CPU consumption.

operate·May 13, 2026

Why GPU workloads need a custom Pod Security Admission baseline

The Pod Security Admission baseline and restricted profiles both reject privileged pods, but the NVIDIA device plugin requires privileged access to initialize GPU drivers.

operate·May 13, 2026

Network policy isolation for multi-tenant AI workloads

NetworkPolicy can enforce tenant isolation, but only with a default-deny policy in place and a CNI plugin that enforces it; without both, cross-namespace traffic is allowed by default.

operate·May 13, 2026

Which DCGM metrics actually matter for GPU monitoring

DCGM exports over 150 metrics but only eight carry operational signal; the rest are noise that creates alert fatigue.

operate·May 13, 2026

How to detect GPU-specific failures via Kubernetes events

When a GPU pod fails, the root cause signal lives in three distinct streams: scheduler allocation events, kubelet lifecycle events, and DCGM health metrics.