How to detect GPU-specific failures via Kubernetes events

When a GPU pod fails, the root cause signal lives in three distinct streams: scheduler allocation events, kubelet lifecycle events, and DCGM health metrics.

Standard Kubernetes troubleshooting assumes a uniform failure model where a container exits with a non-zero code. GPU workloads violate this assumption. A pod may remain Running while the underlying GPU enters a degraded state, or a pod may stay Pending because the scheduler cannot see available resources due to device-plugin registration lag. The kube-scheduler, kubelet, and nvidia-device-plugin each log specific signals that require different query patterns.

Relying solely on kubectl describe pod obscures the hardware context. The scheduler reports resource exhaustion, the kubelet reports container health, and the device plugin reports hardware topology. Distinguishing between these layers is the only way to route alerts correctly and avoid noise.

The scheduler allocation layer

The first failure point occurs before a pod reaches a node. The kube-scheduler evaluates resource requests against each node's allocatable capacity. For GPU workloads, this involves nvidia.com/gpu, an extended resource that the nvidia-device-plugin advertises to the kubelet. When the scheduler cannot find a node with sufficient free GPUs, it emits a FailedScheduling event.

The reason is always FailedScheduling; the nature of the shortage is carried in the event's message field. Even there, a generic Insufficient message can mask a configuration mismatch. For example, a pod requesting 2 GPUs can fail to schedule onto a node with 4 free GPUs if the nvidia-device-plugin has not advertised them to the kubelet, or if the node label nvidia.com/gpu.product does not match the pod’s nodeSelector.

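Operators can query these events directly using field selectors to isolate scheduling failures. Field selectors cannot match on message text, so the following command selects Warning events for Pods with the reason FailedScheduling and prints each message; piping the output through grep 'nvidia.com/gpu' narrows it to GPU shortages. The selectors must be combined in a single comma-separated --field-selector flag, because when the flag is repeated only the last value takes effect.

kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Pod,reason=FailedScheduling,type=Warning \
  -o jsonpath='{range .items[*]}{.involvedObject.namespace}/{.involvedObject.name}: {.message}{"\n"}{end}'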

The output typically displays messages like 0/3 nodes are available: 3 Insufficient nvidia.com/gpu. This confirms the issue is capacity, not image pull or startup logic. If the message cites node(s) didn't match Pod topology spread constraints, the failure comes from a placement constraint, not from capacity. Distinguishing these two requires inspecting the message field, not just the reason.
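The distinction above can be automated by parsing the message field. The following is a minimal sketch, assuming messages follow the "0/N nodes are available: ..." shape shown above. The function name classify_scheduling_message, the category labels, and the list of constraint hints are illustrative assumptions, not part of any Kubernetes API.

```python
import re

# Hypothetical category labels; adapt them to local alerting conventions.
CAPACITY = "capacity"
CONSTRAINT = "constraint"
UNKNOWN = "unknown"

# Matches "<count> Insufficient nvidia.com/gpu" inside a FailedScheduling message.
_GPU_SHORTAGE = re.compile(r"\d+ Insufficient nvidia\.com/gpu")

# Substrings assumed to indicate a placement constraint rather than a shortage.
# The list is illustrative and not exhaustive.
_CONSTRAINT_HINTS = (
    "didn't match pod topology spread constraints",
    "didn't match pod's node affinity/selector",
    "had untolerated taint",
)


def classify_scheduling_message(message: str) -> str:
    """Return a coarse category for a FailedScheduling event message."""
    # A GPU shortage takes precedence when a message lists several causes.
    if _GPU_SHORTAGE.search(message):
        return CAPACITY
    lowered = message.lower()
    if any(hint in lowered for hint in _CONSTRAINT_HINTS):
        return CONSTRAINT
    return UNKNOWN
```

Matching is case-insensitive for the constraint hints because the capitalization of "Pod" in these messages may vary between Kubernetes versions.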

The kubelet lifecycle layer

Once a pod is scheduled, the kubelet takes ownership. GPU-specific failures at this stage often manifest as Unhealthy or Failed events. Unlike CPU containers, GPU containers depend on driver state and hardware health. If the NVIDIA driver crashes or the GPU enters a reset state, the container runtime may not detect the failure immediately.

To bridge this gap, platform engineers configure liveness probes that query GPU health. Such a probe typically runs the nvidia-smi binary inside the container or checks a health sidecar. It has to be declared in the pod spec, since Kubernetes does not add GPU health probes on its own. When the probe fails, the kubelet records an Unhealthy event. This event is distinct from a container exit code: repeated failures trigger a container restart without necessarily changing the pod phase.

The kubectl describe pod command shows these events in the Events section at the bottom. A typical entry looks like this:

Type     Reason     Age   From               Message
----     ------     ----  ----               -------
Warning  Unhealthy  2m    kubelet            Liveness probe failed: exit status 1

A non-zero exit status from a GPU probe often corresponds to a driver error or a device reset, although the probe's own output should be checked to confirm. If the probe fails failureThreshold consecutive times (3 by default), the kubelet kills the container and records a Killing event whose message states that the container failed its liveness probe and will be restarted. This chain of events, Unhealthy followed by Killing, is the primary indicator of runtime GPU degradation rather than application logic errors.
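Repeated probe failures can be tallied per pod from the event list. The sketch below assumes its input is the items array from kubectl get events -o json, where each event carries reason, message, involvedObject, and an optional count field. The function names and the threshold default are illustrative. The tally is a total, not a count of consecutive failures, so it only approximates the kubelet's own threshold logic.

```python
from collections import Counter


def liveness_failures_by_pod(events):
    """Sum liveness-probe Unhealthy events per (namespace, pod name)."""
    totals = Counter()
    for ev in events:
        if ev.get("reason") != "Unhealthy":
            continue
        if "Liveness probe failed" not in ev.get("message", ""):
            continue
        obj = ev.get("involvedObject", {})
        key = (obj.get("namespace", ""), obj.get("name", ""))
        # Repeated occurrences are folded into one event object whose
        # count field is incremented; treat a missing count as 1.
        totals[key] += ev.get("count") or 1
    return totals


def pods_at_or_over(events, threshold=3):
    """Return pods whose liveness failure total reaches the threshold."""
    totals = liveness_failures_by_pod(events)
    return sorted(pod for pod, n in totals.items() if n >= threshold)
```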

DCGM health signals

The DCGM Exporter, built on NVIDIA Data Center GPU Manager (DCGM), provides the underlying telemetry for hardware health. DCGM does not emit Kubernetes Event objects; its metrics describe the hardware conditions that cause GPU health probes to fail. The critical metric for hardware failure is DCGM_FI_DEV_XID_ERR, which reports the code of the most recent XID error on the GPU rather than a running count.

When an XID error occurs, the GPU may become unusable. If the liveness probe does not check for XID errors, the pod may continue running on a degraded device. To surface this, operators often configure a Prometheus alert that fires when DCGM_FI_DEV_XID_ERR is non-zero or changes value. This alert does not create a Kubernetes event, but it can be correlated by timestamp with the Unhealthy events that the kubelet records.
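The alert condition can be prototyped offline before it is written as an alerting rule. This is a minimal sketch, assuming the samples are (timestamp, value) pairs for one GPU's DCGM_FI_DEV_XID_ERR series that have already been fetched from Prometheus; the fetch itself is out of scope. The function name xid_changes is hypothetical.

```python
def xid_changes(samples):
    """Return the (timestamp, value) pairs where the series moves to a
    new non-zero value. A non-zero first sample is also reported, since
    there is no earlier sample to compare against."""
    changes = []
    previous = None
    for ts, value in samples:
        if value != 0 and value != previous:
            changes.append((ts, value))
        previous = value
    return changes
```

For example, a series that reads 0, 0, 79, 79, 0, 31 at 30-second intervals yields two changes, one at the first 79 and one at 31. The numeric values here are placeholders for illustration only.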

The following table maps the failure source to the observable signal and the required query tool.

Failure Source	Signal Type	Observable Field	Query Tool
Scheduler	Kubernetes Event	`reason: FailedScheduling`	`kubectl get events`
Device Plugin	Kubernetes Event (emitted by the scheduler)	`message: Insufficient nvidia.com/gpu`	`kubectl get events`
Kubelet	Kubernetes Event	`reason: Unhealthy`	`kubectl describe pod`
GPU Hardware	Prometheus Metric	`DCGM_FI_DEV_XID_ERR`	`Prometheus`

This separation is critical for alert routing. A FailedScheduling event should trigger a capacity alert. An Unhealthy event should trigger a hardware maintenance alert. Merging these into a single “Pod Down” alert causes on-call fatigue because the remediation steps differ entirely.
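The routing rule described above can be sketched as a small lookup. The route names and the route_event function are illustrative assumptions; a real deployment would map them to whatever its alerting system expects.

```python
GPU_RESOURCE = "nvidia.com/gpu"

# Hypothetical route names keyed by event reason.
ROUTES = {
    "FailedScheduling": "capacity",
    "Unhealthy": "hardware-maintenance",
}


def route_event(event):
    """Return the alert route for an event dict, or None to ignore it."""
    reason = event.get("reason")
    # Only scheduling failures that mention the GPU resource are routed
    # to the capacity channel; other shortages are out of scope here.
    if reason == "FailedScheduling" and GPU_RESOURCE not in event.get("message", ""):
        return None
    return ROUTES.get(reason)
```

Keeping the mapping in one table makes it easy to review which reasons page which team, and unknown reasons fall through to None instead of to a catch-all alert.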

Failure modes and signal loss

Events in Kubernetes are ephemeral. The kube-apiserver stores events in etcd with a default TTL of 1 hour. For long-running GPU jobs, critical events may expire before an operator investigates. If a probe fails and an operator starts investigating 90 minutes later, the initial Unhealthy event will already be gone from the kubectl get events output.

This expiration window creates a blind spot for intermittent hardware faults. A GPU that fails once every 12 hours will leave no trace in the event log if the investigation happens 2 hours after the restart. The DCGM Exporter metrics persist in the time-series database for its configured retention period, making them more reliable for historical analysis than Kubernetes events.
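The retention arithmetic is simple enough to check directly. The sketch below assumes the TTL is measured from the last time the event object was written; the helper name event_still_retained is hypothetical, and the default mirrors the 1 hour TTL mentioned above.

```python
from datetime import datetime, timedelta, timezone

# Default event TTL discussed in the text.
DEFAULT_EVENT_TTL = timedelta(hours=1)


def event_still_retained(last_written, now, ttl=DEFAULT_EVENT_TTL):
    """Return True if an event last written at last_written would still
    be stored at time now under the given TTL."""
    return now - last_written < ttl


fault = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
after_45_min = event_still_retained(fault, fault + timedelta(minutes=45))
after_2_hours = event_still_retained(fault, fault + timedelta(hours=2))
```

With the default TTL, the event is still present 45 minutes after the fault and gone 2 hours after it, which is the blind spot described above.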

Another failure mode involves device plugin registration. If the nvidia-device-plugin crashes on a node, the node typically stays Ready, but the kubelet stops reporting allocatable nvidia.com/gpu devices until the plugin re-registers. No event clearly attributes this to the plugin. Pods created during the outage remain Pending with an Insufficient nvidia.com/gpu message, which looks identical to a genuine capacity shortage. The Status column of kubectl get nodes does not reveal the problem; kubectl describe node shows the allocatable GPU count, and the specific reason for the plugin crash is only visible in the kubelet logs or the nvidia-device-plugin container logs.
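One way to look for a device plugin gap through the API, independently of the node's Ready status, is to compare each node's GPU capacity with its allocatable count. This is a heuristic sketch, assuming the input is the parsed output of kubectl get nodes -o json and that a plugin outage shows up as allocatable falling below capacity; the function name is hypothetical, and a node whose GPU capacity has also dropped to zero will not be caught.

```python
def nodes_with_unallocatable_gpus(node_list, resource="nvidia.com/gpu"):
    """Return (node name, allocatable, capacity) for nodes that report
    GPU capacity but fewer allocatable GPUs."""
    flagged = []
    for node in node_list.get("items", []):
        status = node.get("status", {})
        # Extended resource quantities are serialized as integer strings.
        capacity = int(status.get("capacity", {}).get(resource, "0"))
        allocatable = int(status.get("allocatable", {}).get(resource, "0"))
        if capacity > 0 and allocatable < capacity:
            flagged.append((node["metadata"]["name"], allocatable, capacity))
    return flagged
```

A flagged node is only a starting point; the kubelet and nvidia-device-plugin logs on that node are still needed to confirm the cause.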

The most common mistake is assuming that events explain hardware degradation. A pod can run on a GPU that is reporting XID errors without triggering any Kubernetes event if the probe does not check for them. The event system reports container state, not silicon state.

Decision frame

The choice is not between events and metrics, but between event retention and signal noise. Kubernetes events are immediate but ephemeral; DCGM metrics are persistent but require external querying. If the platform requires audit trails for hardware failures, raise the --event-ttl flag on the kube-apiserver to 24 hours, at the cost of more event objects held in etcd. If the priority is low-noise alerting, route FailedScheduling events to capacity dashboards and Unhealthy events to hardware maintenance tickets. Relying on kubectl get events alone for GPU debugging is insufficient because events often expire before the root cause is identified.