Which DCGM metrics actually matter for GPU monitoring

DCGM exports over 150 metrics from the NVIDIA driver stack, but only eight carry the operational signal required to manage a Kubernetes cluster running AI workloads.

The NVIDIA Data Center GPU Manager (DCGM) Exporter typically runs as a DaemonSet on every GPU node. It reads GPU telemetry through DCGM, which sits on top of the NVIDIA Management Library (NVML), and exposes GPU state as Prometheus metrics. This mechanism is the standard for GPU observability in production Kubernetes environments. It bridges the gap between the hardware driver and the control plane, allowing the platform to make decisions based on physical resource state rather than pod requests alone.

Most operators configure the exporter to scrape every available field. This approach creates noise. The DCGM library exposes counters for ECC errors, clock throttling, power limits, and temperature across every GPU instance. Without filtering, a single node generates hundreds of time-series data points. The system becomes difficult to query, and alerting rules become brittle. The operational goal is not to see every number, but to see the numbers that indicate a failure or a bottleneck before the workload stalls.

The Core Four Metrics

Four metrics form the baseline for GPU health. These values indicate whether the hardware is available, utilized, and within safe operating parameters. The first is DCGM_FI_DEV_GPU_UTIL. This metric reports the percentage of time over the last sample period during which at least one compute kernel was active. It is not a measure of memory bandwidth or IO wait. A high value indicates the GPU is busy processing instructions. A low value on a running workload suggests the process is IO-bound or waiting on CPU synchronization.

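The second metric is DCGM_FI_DEV_FB_USED. This tracks the amount of frame buffer memory currently allocated on the GPU. Frame buffer memory holds model weights, activations, and KV caches. When this value approaches the total capacity reported by DCGM_FI_DEV_FB_TOTAL, the next allocation is likely to fail with a CUDA out-of-memory error inside the process. This is distinct from a Kubernetes OOMKilled event, which is triggered by host memory limits. The exporter reports both fields in MiB, not bytes, so dashboards that display GiB must divide by 1024. DCGM_FI_DEV_FB_TOTAL may not be part of the exporter's default counter set and may need to be enabled explicitly.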
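The conversion and the capacity check can be sketched in a few lines of Python. This is a minimal illustration, assuming the values arrive in MiB, which is how DCGM documents the frame buffer fields. The function name fb_usage and the sample numbers are hypothetical.

```python
MIB_PER_GIB = 1024


def fb_usage(fb_used_mib: float, fb_total_mib: float) -> tuple[float, float]:
    """Return (used GiB, percent of capacity used) for one GPU."""
    if fb_total_mib <= 0:
        raise ValueError("fb_total_mib must be positive")
    used_gib = fb_used_mib / MIB_PER_GIB
    percent_used = 100.0 * fb_used_mib / fb_total_mib
    return used_gib, percent_used


# Hypothetical reading: 73,728 MiB used on a card with 81,920 MiB total.
used_gib, pct = fb_usage(73_728, 81_920)
print(f"{used_gib:.1f} GiB used ({pct:.1f}%)")  # 72.0 GiB used (90.0%)
```

In practice the same arithmetic usually lives in the dashboard query rather than in application code; the sketch only makes the units explicit.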

The third metric is DCGM_FI_DEV_POWER_USAGE. This measures the power draw of the GPU in watts. Power draw correlates with thermal output and cooling requirements. Sustained high power draw without corresponding utilization suggests a configuration issue, such as a runaway process or inefficient kernel. The fourth metric is DCGM_FI_DEV_GPU_TEMP. This reports the core temperature in degrees Celsius. NVIDIA GPUs throttle performance when temperatures exceed specific thresholds, usually around 80°C to 85°C depending on the architecture.

The following table defines the specific metric names, units, and the operational signal each provides.

Metric Name	Unit	Operational Signal
`DCGM_FI_DEV_GPU_UTIL`	Percent	Compute activity; low values indicate IO or CPU bottlenecks.
`DCGM_FI_DEV_FB_USED`	MiB	Memory pressure; high values indicate risk of CUDA out-of-memory errors.
`DCGM_FI_DEV_POWER_USAGE`	Watts	Thermal load; sustained peaks indicate cooling stress.
`DCGM_FI_DEV_GPU_TEMP`	Celsius	Thermal throttling risk; spikes indicate cooling failure.

Error and Throttling Signals

Two additional metrics carry critical failure information. The first is DCGM_FI_DEV_XID_ERRORS. Despite the plural name, dcgm-exporter publishes it as a gauge that holds the code of the most recent XID error reported by the driver, not a cumulative count. A value of zero means no XID has been recorded. A non-zero value identifies the error type: some codes indicate hardware faults that call for immediate pod evacuation, while others point to application bugs. The second metric is DCGM_FI_DEV_CLOCKS_THROTTLE_REASONS. This bitmask reports why the GPU clock speed is currently limited. Common reasons include power caps, thermal slowdown, and application clock settings.

These two metrics require different query patterns than the core four. Because DCGM_FI_DEV_XID_ERRORS is a gauge carrying an error code, applying rate() to it produces meaningless results. Alert on a non-zero value, or use changes() over a window to detect that a new code has appeared. DCGM_FI_DEV_CLOCKS_THROTTLE_REASONS is also a gauge, but a non-zero value alone does not prove a problem, because the lowest bit is set whenever the GPU is idle. Mask the value for the specific bits of interest, such as the thermal slowdown bits, before alerting. Those bits are typically set when the cooling system cannot dissipate heat fast enough.
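Because DCGM_FI_DEV_CLOCKS_THROTTLE_REASONS is a bitmask, a raw reading has to be decoded before it is useful. The sketch below assumes the bit layout of NVML's nvmlClocksThrottleReason* constants; the short reason names and the helper functions are labels invented for this example, not part of any library.

```python
# Bit values follow NVML's nvmlClocksThrottleReason* constants; the short
# names on the right are labels chosen for this sketch.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# Bits that specifically indicate a thermal cause.
THERMAL_BITS = 0x020 | 0x040


def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the names of all reason bits set in the mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def is_thermally_throttled(mask: int) -> bool:
    """True only when a thermal slowdown bit is set."""
    return bool(mask & THERMAL_BITS)


print(decode_throttle_reasons(0x044))  # ['sw_power_cap', 'hw_thermal_slowdown']
print(is_thermally_throttled(0x001))   # False: the GPU is merely idle
```

A reading of 1 is therefore normal on an idle card, which is why an alert on any non-zero value tends to fire constantly.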

The dcgm-exporter configuration file determines which fields are exposed. It is a CSV in which each row names a DCGM field, its Prometheus metric type, and its help text. By default, the exporter exposes a standard set. Operators can modify the dcgm-field-config ConfigMap that holds this file to add or remove fields. Removing unused fields reduces the number of series stored in the time-series database. This improves query performance and reduces storage costs. The ConfigMap is mounted into the exporter DaemonSet.
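Trimming the field list can be scripted. The following sketch assumes the counters file uses the "field, type, help" CSV layout with # comments; the SAMPLE content and the filter_counters helper are hypothetical and only illustrate the filtering step, not how the result is loaded into the ConfigMap.

```python
# Fields discussed in this article; everything else is dropped.
KEEP = {
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_FB_TOTAL",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_XID_ERRORS",
    "DCGM_FI_DEV_CLOCKS_THROTTLE_REASONS",
}

# Hypothetical excerpt of a counters file.
SAMPLE = """\
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_GPU_UTIL,  gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED,   gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
"""


def filter_counters(csv_text: str, keep: set[str]) -> str:
    """Return only the rows whose first column is in `keep`."""
    kept = []
    for line in csv_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        field = stripped.split(",", 1)[0].strip()
        if field in keep:
            kept.append(stripped)
    return "\n".join(kept) + "\n"


print(filter_counters(SAMPLE, KEEP), end="")
```

An allowlist is used rather than a blocklist so that fields added by a future exporter version stay off until someone decides they are actionable.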

Prometheus Configuration and Queries

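The Prometheus scrape configuration must reach the dcgm-exporter pod on every GPU node, so a single static target such as localhost:9400 is not enough for a DaemonSet. Kubernetes service discovery finds each pod instead. The exporter listens on port 9400 by default and serves metrics at /metrics. It already labels every series with the GPU index (gpu) and UUID, so the scrape config only needs a relabel rule to attach the node name. This allows operators to filter queries by specific hardware.

scrape_configs:
  - job_name: 'dcgm-exporter'
    # Discover the exporter pod on every node instead of one static target.
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only exporter pods. The label value is an assumption; match your deployment.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: dcgm-exporter
        action: keep
      # Keep only the metrics port.
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9400"
        action: keep
      # Attach the Kubernetes node name to every series.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node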

A first attempt at a memory pressure query averages the used-memory percentage across all GPUs on a node.

avg by (node) (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)

This query calculates the percentage of frame buffer used, grouped by node. It returns a single value per node representing the average utilization across all GPUs. If a node has 4 GPUs, and one is full while the others are empty, the average will show 25%. This hides the saturation. A better query drops the aggregation so that each GPU keeps its own series.

DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100

This query returns a time series for every GPU on every node. It allows the dashboard to highlight specific cards that are near capacity. The alerting rule should trigger when this value exceeds 90% for more than 5 minutes. This duration prevents false positives during model loading phases where memory usage spikes temporarily.
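The two points above, that averaging hides a full card and that a duration filters out loading spikes, can be checked with a short Python sketch. It assumes a 30-second scrape interval, so 5 minutes corresponds to 10 consecutive samples; the function names and sample values are hypothetical, and the consecutive-sample check only approximates how an alerting rule with a hold duration behaves.

```python
def node_average(per_gpu_pct: list[float]) -> float:
    """Average memory utilization across the GPUs of one node."""
    return sum(per_gpu_pct) / len(per_gpu_pct)


def sustained_above(samples: list[float], threshold: float = 90.0,
                    min_samples: int = 10) -> bool:
    """True if the most recent `min_samples` readings all exceed the threshold."""
    if len(samples) < min_samples:
        return False
    return all(value > threshold for value in samples[-min_samples:])


# One full GPU and three empty ones: the average looks healthy.
gpus = [100.0, 0.0, 0.0, 0.0]
print(node_average(gpus), max(gpus))  # 25.0 100.0

# A short spike during model loading versus ten readings of real pressure.
loading_spike = [40.0] * 6 + [95.0] * 4
steady_pressure = [95.0] * 10
print(sustained_above(loading_spike))    # False
print(sustained_above(steady_pressure))  # True
```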

Failure Modes and Noise

Alert fatigue is the primary failure mode when monitoring GPU metrics. If the system alerts on every metric change, the operations team ignores the alerts. The dcgm-exporter exposes metrics for every clock domain, every fan speed, and every power rail. Most of these are not actionable. A fan speed change is usually automatic and self-correcting. Alerting on it creates noise.

Another failure mode is misinterpreting node readiness. Kubernetes has no GPU-specific node condition. The standard conditions are Ready, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable. GPU schedulability is determined by the nvidia.com/gpu entry under status.capacity and status.allocatable, which is advertised by the NVIDIA device plugin, not by the DCGM Exporter. If this value is zero or absent, the node cannot schedule GPU pods, yet the node status remains Ready and the pod stays Pending. The operator must check the device plugin logs and the node's allocatable resources, not the node conditions, to diagnose this. A failing DCGM Exporter has a different symptom: the metrics disappear from Prometheus while scheduling continues to work.

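A third failure mode is confusing utilization with occupancy. DCGM_FI_DEV_GPU_UTIL measures time during which at least one kernel was active. The profiling fields give a finer picture: DCGM_FI_PROF_SM_ACTIVE reports the fraction of time the streaming multiprocessors had at least one warp assigned, and DCGM_FI_PROF_SM_OCCUPANCY reports the ratio of resident warps to the maximum an SM supports. A workload can report high utilization while keeping only a few SMs busy, for example when it launches many small kernels, or while its warps stall on memory. In the second case the GPU is waiting for data, not computing, and DCGM_FI_PROF_DRAM_ACTIVE, which reports memory bandwidth activity, will be high. Relying solely on utilization metrics can lead to incorrect scaling decisions. If utilization and latency are both high while DRAM activity is near its ceiling, the bottleneck is memory bandwidth, not compute power. Scaling up the cluster will not fix the latency.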

Decision frame

The next time a GPU pod stays Pending or a node shows high latency, the question is not “is the GPU broken.” It is “which metric tells me the GPU is waiting.” Check DCGM_FI_DEV_GPU_UTIL first. If it is low, check DCGM_FI_DEV_FB_USED for memory saturation. If memory is free, check DCGM_FI_DEV_CLOCKS_THROTTLE_REASONS for thermal limits. For most of these metrics the signal is in the trend, not a single reading. The exception is DCGM_FI_DEV_XID_ERRORS: a new non-zero code that maps to a hardware fault requires a node drain. A gradual rise in temperature requires a cooling audit. The metric names are fixed, but the operational response depends on the trend.
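The check order above can be written down as a small triage function. This is only a sketch: the 30% low-utilization cutoff is a hypothetical value, the 90% memory cutoff is borrowed from the alerting rule earlier in the article, the thermal bits assume NVML's throttle-reason layout, and checking the XID value first (treated as the last-error gauge that dcgm-exporter publishes) is an ordering chosen for this example.

```python
LOW_UTIL_PCT = 30.0           # hypothetical cutoff for "low" utilization
FB_FULL_PCT = 90.0            # same threshold as the memory alert
THERMAL_BITS = 0x020 | 0x040  # SW and HW thermal slowdown bits (NVML layout)


def triage(gpu_util_pct: float, fb_used_pct: float,
           throttle_mask: int, xid_code: int) -> str:
    """Return the first finding, following the article's check order."""
    if xid_code != 0:
        return f"XID {xid_code} reported: look up the code, consider a drain"
    if gpu_util_pct >= LOW_UTIL_PCT:
        return "GPU is busy: look at the workload, not the hardware"
    if fb_used_pct >= FB_FULL_PCT:
        return "frame buffer is saturated"
    if throttle_mask & THERMAL_BITS:
        return "clocks are thermally throttled"
    return "GPU is idle and healthy: look upstream (IO, CPU, scheduling)"


print(triage(gpu_util_pct=5.0, fb_used_pct=95.0, throttle_mask=0x001, xid_code=0))
# frame buffer is saturated
```

A real runbook would feed this from Prometheus query results and would look at trends over a window rather than a single reading.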