Why the NVIDIA gpu-operator upgrade window is non-trivial

The NVIDIA GPU Operator upgrade window is non-trivial because it forces a kernel module reload on every GPU node, and each node cannot accept GPU pods until the new driver loads, the device plugin re-registers, and the node is uncordoned.

The GPU Operator manages the NVIDIA driver stack, container runtime integration, and device plugins as a unified set of Kubernetes resources. It treats the NVIDIA driver not as a host-level package but as a workload component deployed via DaemonSets. This abstraction simplifies installation but couples the upgrade path to the Kubernetes node lifecycle. When an operator upgrade changes the driver version, the driver container images and kernel modules must be replaced on every GPU node.

This process requires the nvidia-driver-daemonset to stop the existing driver, load new kernel modules, and restart the container runtime. During this window, the node cannot schedule GPU workloads. The node is cordoned (spec.unschedulable is set through the API server; the kubelet does not do this itself), and the device plugin stops advertising resources. On a cluster running distributed training jobs, this forces pod eviction or suspension. On inference clusters, it forces scaling out to other nodes to maintain latency SLAs. The upgrade is not a background process; it is a maintenance window that consumes capacity.

The upgrade sequence

The upgrade process follows a strict dependency chain across the control plane and the data plane. The ClusterPolicy custom resource defines the desired version of the driver and components. The operator reconciles this state by updating the associated DaemonSets. The actual upgrade happens on each node individually, following the DaemonSet’s updateStrategy.

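The sequence begins with the operator marking the node as unschedulable. The operator (its upgrade controller, in recent releases) cordons the node through the API server to prevent new pods from being scheduled; neither the kube-controller-manager nor the kubelet initiates this step. Existing GPU pods are then deleted or evicted, depending on the configured upgrade policy. If those pods do not terminate cleanly, their processes may keep the GPU devices open, and the old kernel modules cannot be unloaded until they exit.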

Once the node is drained, the driver DaemonSet terminates the old driver container. The container unloads the kernel modules. The new driver container starts and loads the new kernel modules. This step requires a privileged container, because the modules are loaded into the host kernel. If the kernel modules fail to load, the container exits, and the node remains in a degraded state: cordoned, with no GPU capacity advertised.

After the driver container is running, the nvidia-device-plugin DaemonSet restarts. This pod registers the GPU resources with the kubelet. The kubelet updates the node status to report available GPUs. Finally, the operator uncordons the node, allowing the scheduler to place new pods.

The following table outlines the steps, the responsible component, and the impact on workloads.

Step	Component	Action	Workload Impact
1	`ClusterPolicy`	Updates desired version	None
2	GPU Operator (upgrade controller)	Cordon node (`SchedulingDisabled`)	No new pods
3	GPU Operator (upgrade controller)	Delete or evict GPU pods	Pods terminated or evicted
4	`nvidia-driver-daemonset`	Unload/Load kernel modules	GPU resources unavailable
5	`containerd`	Restart runtime	Pod starts fail until the runtime is back
6	`nvidia-device-plugin`	Restart and register resources	Resources available
7	GPU Operator (upgrade controller)	Uncordon node	Scheduling resumes
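
To make the ordering concrete, the steps in the table can be modeled as a small state walk over a single node. This is only a sketch for reasoning about the sequence, not the operator's implementation: the Node class, the step names, and the version strings are hypothetical, and a real node has far more state.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified view of one GPU node."""
    name: str
    driver_version: str
    unschedulable: bool = False
    gpu_capacity: int = 8          # GPUs currently advertised to the scheduler
    gpu_pods: list = field(default_factory=list)
    log: list = field(default_factory=list)


def upgrade_node(node, new_version, module_load_ok=True):
    """Walk one node through the upgrade steps; return True on success."""
    physical_gpus = node.gpu_capacity

    node.unschedulable = True               # step 2: cordon
    node.log.append("cordon")

    node.gpu_pods.clear()                   # step 3: delete or evict GPU pods
    node.log.append("drain")

    node.gpu_capacity = 0                   # step 4: old modules unloaded
    node.log.append("unload-modules")
    if not module_load_ok:
        # The node stays cordoned with zero GPUs until someone intervenes.
        node.log.append("load-failed")
        return False
    node.driver_version = new_version
    node.log.append("load-modules")

    node.log.append("restart-runtime")      # step 5

    node.gpu_capacity = physical_gpus       # step 6: device plugin re-registers
    node.log.append("register-gpus")

    node.unschedulable = False              # step 7: uncordon
    node.log.append("uncordon")
    return True
```

The failure branch is the case worth noticing: a failed module load leaves the node cordoned and advertising no GPUs, which is the degraded state described earlier.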

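The updateStrategy field on the DaemonSet controls the concurrency of this process. A RollingUpdate with maxUnavailable: 1 processes one node at a time. A maxUnavailable: 20% processes a percentage of the cluster simultaneously; for DaemonSets, Kubernetes rounds the percentage up to a whole number of nodes. The choice here determines the total upgrade duration versus the capacity loss during the window. Because the operator renders the DaemonSet from ClusterPolicy, set the value there rather than editing the DaemonSet directly. Recent operator releases also expose driver upgrade concurrency through the driver upgrade policy in ClusterPolicy, so check which mechanism your version uses.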

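# Illustrative fragment of the driver DaemonSet. The operator owns this object.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-daemonset
  namespace: nvidia-driver  # as used in this example; Helm installs commonly use gpu-operator
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # at most one node's driver pod is replaced at a time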
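
The duration-versus-capacity tradeoff is simple arithmetic, sketched below. The function names and the minutes_per_node figure are hypothetical; the only Kubernetes behavior assumed is that a percentage maxUnavailable on a DaemonSet is rounded up to a whole number of nodes. The estimate treats the rollout as sequential waves, which overstates the duration slightly for a rollout that refills slots as nodes finish.

```python
def resolve_max_unavailable(value, node_count):
    """Turn 1 or "20%" into an absolute node count (percentages round up)."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (node_count * percent + 99) // 100  # integer ceiling
    return int(value)


def estimate_rollout(node_count, max_unavailable, minutes_per_node):
    """Rough rollout estimate; minutes_per_node is a measured, site-specific value."""
    batch = resolve_max_unavailable(max_unavailable, node_count)
    batch = max(1, min(batch, node_count))     # at least one node, at most all
    batches = -(-node_count // batch)          # ceiling division
    return {
        "nodes_down_at_once": batch,
        "batches": batches,
        "total_minutes": batches * minutes_per_node,
    }
```

For a hypothetical 40-node cluster at 10 minutes per node, maxUnavailable: 1 gives 40 waves and about 400 minutes with one node out at a time, while 20% gives 5 waves and about 50 minutes with 8 nodes out at a time.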

Driver mismatch and resource allocation

The most common failure mode occurs when the driver container does not match the kernel running on the node. The driver container builds or ships kernel modules for a specific kernel version. If the host kernel is updated independently of the GPU Operator, the modules may fail to build or load, and the driver container fails to start.
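
One way to catch this before an upgrade is to compare each node's reported kernel against the kernel the driver image was validated for. The sketch below assumes the input is the parsed output of kubectl get nodes -o json, which reports the kernel under status.nodeInfo.kernelVersion; the function name and the expected_kernel value are hypothetical.

```python
def nodes_with_kernel_drift(node_list, expected_kernel):
    """Return (node name, kernel) pairs whose kernel differs from the expected one.

    node_list is the parsed JSON of `kubectl get nodes -o json`.
    """
    drifted = []
    for item in node_list.get("items", []):
        node_info = item.get("status", {}).get("nodeInfo", {})
        kernel = node_info.get("kernelVersion", "")
        if kernel != expected_kernel:
            drifted.append((item["metadata"]["name"], kernel))
    return drifted
```

An empty result does not prove the driver will load, but a non-empty one points at the nodes that deserve a closer look before the rollout starts.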

When the driver container fails, the nvidia-device-plugin cannot start. The kubelet does not report nvidia.com/gpu capacity on the node. Pods requesting GPUs enter a Pending state with events citing Insufficient nvidia.com/gpu. The scheduler sees the node as having zero GPU resources, even if physical GPUs are present.
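
The symptom described above, physical GPUs present but zero advertised, can be detected from node objects alone. The sketch below is a minimal check over the parsed output of kubectl get nodes -o json. It assumes GPU nodes carry the nvidia.com/gpu.present=true label that the operator normally applies; if your cluster labels GPU nodes differently, substitute that label. The function name is hypothetical.

```python
GPU_PRESENT_LABEL = "nvidia.com/gpu.present"  # assumed to be set on GPU nodes
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes_without_capacity(node_list):
    """Names of nodes labeled as having GPUs but advertising none as allocatable."""
    missing = []
    for item in node_list.get("items", []):
        labels = item.get("metadata", {}).get("labels", {})
        if labels.get(GPU_PRESENT_LABEL) != "true":
            continue  # not a GPU node, nothing to check
        allocatable = item.get("status", {}).get("allocatable", {})
        if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
            missing.append(item["metadata"]["name"])
    return missing
```

Any node this returns is one where pods requesting nvidia.com/gpu cannot be scheduled, regardless of what hardware is installed.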

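This mismatch is particularly acute when using Dynamic Resource Allocation (DRA). With DRA, a DRA driver running on the node, rather than the classic device plugin, publishes devices and prepares them for pods that reference a ResourceClaim. If that driver is down or cannot see the GPUs, the claim cannot be allocated or prepared. The pod remains in Pending or ContainerCreating indefinitely, and the kubectl describe pod output shows events tied to the ResourceClaim rather than a generic capacity shortage.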

Another failure mode involves workload interruption around the container runtime restart. The GPU Operator restarts containerd during the driver update, and GPU pods on the node are stopped so the old modules can be unloaded. If a training job is active on the node, the job is killed: its containers receive SIGTERM and then SIGKILL once the termination grace period expires, whether or not the job has finished. This behavior is standard for node maintenance, but it is destructive for long-running training workloads unless they checkpoint and can resume after an interruption.
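
A training job can survive this kind of interruption only if it saves state when asked to stop. The sketch below shows the general pattern under simple assumptions: the process receives SIGTERM before it is killed, and a checkpoint can be written within the pod's termination grace period. GracefulStop, train, and save_checkpoint are hypothetical names, and the training step itself is elided.

```python
import signal


class GracefulStop:
    """Records that a stop was requested so the loop can exit at a safe point."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        self.requested = True


def train(total_steps, stop, save_checkpoint, start_step=0):
    """Run steps until done or until a stop is requested; checkpoint either way."""
    step = start_step
    while step < total_steps:
        # ... one unit of training work would run here ...
        step += 1
        if stop.requested:
            save_checkpoint(step)   # persist progress before exiting
            return step
    save_checkpoint(step)
    return step


if __name__ == "__main__":
    stop = GracefulStop()
    # Signal handlers must be registered from the main thread.
    signal.signal(signal.SIGTERM, stop.handle)
    train(1000, stop, lambda step: print(f"checkpoint at step {step}"))
```

On restart, the job would pass the last saved step as start_step. Whether the checkpoint completes in time depends on its size and on the grace period configured for the pod.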

The kubelet logs on the node may show errors such as RuntimeHandlerNotFound or FailedCreatePodSandBox during this window. These errors persist until the runtime is back and the device plugin successfully registers the resources. The operator cannot recover on its own if the kernel modules are incompatible with the host kernel. Manual intervention is required to roll back the driver version or fix the kernel modules.

Decision frame

The question the next time a GPU cluster requires an upgrade is not “can we upgrade the operator.” It is “does the maxUnavailable setting in the DaemonSet updateStrategy match the cluster’s capacity buffer.” A maxUnavailable of 1 limits the blast radius but extends the upgrade window to roughly the number of nodes times the per-node reload time. A maxUnavailable of 20% speeds the upgrade but risks losing 20% of GPU capacity simultaneously. If the cluster runs training jobs at 100% utilization, there is no buffer: any node taken out for the upgrade displaces work, so the upgrade must be scheduled during a maintenance period where capacity loss is acceptable, or the jobs must be checkpointed and stopped first. Setting maxUnavailable to zero is not a way out, because Kubernetes rejects a DaemonSet rolling update that allows neither unavailability nor surge, and a rollout that may take no node offline cannot progress. The tradeoff is upgrade speed versus guaranteed workload availability, and the updateStrategy concurrency setting is the main lever to control it.
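
The buffer question reduces to a comparison between idle nodes and nodes taken down at once. The sketch below is a deliberately crude check under two assumptions: nodes are interchangeable, and idle nodes are upgraded first or can absorb displaced work. The function and field names are hypothetical.

```python
def upgrade_fits_buffer(total_nodes, busy_nodes, nodes_down_at_once):
    """Check whether the idle capacity covers the nodes removed in one wave."""
    idle = total_nodes - busy_nodes
    displaced = max(0, nodes_down_at_once - idle)  # busy nodes that must be interrupted
    return {
        "idle_nodes": idle,
        "displaced_nodes": displaced,
        "fits": displaced == 0,
    }
```

With 40 nodes, 30 of them busy, taking 8 down at once fits inside the 10 idle nodes. With all 40 busy, even a single node down displaces a workload, which is the fully utilized case described above.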