Why GPU nodes need taints, even on a single-tenant cluster

A GPU node without a taint allows the kube-scheduler to place CPU-only workloads on hardware reserved for inference and training.

Kubernetes treats all nodes as a pool of generic resources unless constraints are applied. The scheduler matches pod requests against node capacity. If a node reports nvidia.com/gpu: 8 allocatable, the scheduler still considers it a candidate for any pod that fits, regardless of whether the pod actually requests a GPU. This default behavior assumes resources are fungible. In an AI platform, GPU capacity is not fungible with CPU capacity. A node with 8 A100s is not equivalent to a node with 64 vCPUs.

Taints and tolerations are the mechanism that breaks this fungibility. A taint marks a node as repelling every pod that does not explicitly tolerate it. A toleration marks a pod as allowed to ignore that repulsion. Without a taint on GPU nodes, the scheduler will place CPU workloads on GPU hardware simply because the node has available CPU and memory; the default scoring tends to favor the least-utilized nodes, which freshly provisioned GPU nodes often are. This strands GPU capacity behind CPU pods and fragments the cluster, making it harder to schedule large distributed training jobs later.

The scheduler logic for taints

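The kube-scheduler evaluates NoSchedule and NoExecute taints during the filtering phase. Before scoring a node, the scheduler checks whether the pod has a toleration that matches each of the node’s taints. If the pod does not tolerate one of them, the node is removed from the candidate list. In the default plugin order this check runs before the resource-fit check, so an untolerated taint rejects the node no matter how much capacity is free. A PreferNoSchedule taint is the exception: it only lowers the node's score and never filters it out.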
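The matching rule behind that filter can be sketched in a few lines of Python. This is a simplified model for illustration, not the scheduler's code: the `Taint` and `Toleration` classes and the function names are hypothetical, and only the documented matching semantics (empty effect matches any effect, `Exists` ignores the value, `Equal` is the default operator) are modeled.

```python
from dataclasses import dataclass
from typing import List

# Effects that remove a node during filtering; PreferNoSchedule only affects scoring.
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key + Exists matches every taint key
    operator: str = "Equal"   # Kubernetes treats a missing operator as Equal
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value  # Equal: values must match exactly


def node_passes_taint_filter(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """A node survives filtering only if every hard taint is tolerated."""
    for taint in taints:
        if taint.effect not in HARD_EFFECTS:
            continue
        if not any(tolerates(tol, taint) for tol in tolerations):
            return False
    return True


gpu_taint = Taint(key="nvidia.com/gpu", effect="NoSchedule")
cpu_pod: List[Toleration] = []
gpu_pod = [Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")]

print(node_passes_taint_filter([gpu_taint], cpu_pod))  # CPU pod is filtered out
print(node_passes_taint_filter([gpu_taint], gpu_pod))  # GPU pod may proceed to resource fit
```

Passing this filter only keeps the node in the candidate list; the pod still has to fit the node's free CPU, memory, and GPUs.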

The default scheduler configuration does not apply taints automatically. The kube-scheduler does not inspect the nvidia.com/gpu resource count to decide whether to taint. It relies on the node state provided by the kubelet. The kubelet reports resources based on the nvidia-device-plugin DaemonSet. The plugin registers the GPU resource but does not taint the node.

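This separation creates a gap between resource availability and scheduling policy. The nvidia-device-plugin ensures the resource exists. The cluster administrator, or whatever automation provisions the nodes, must ensure the policy exists. The nvidia-gpu-operator ClusterPolicy CRD lets you set tolerations for the operator's own DaemonSets, but do not assume it taints nodes for you; confirm that against the documentation for your operator version. Whatever applies the taint, the underlying mechanism remains the standard Kubernetes taint/toleration API.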

The following command shows how to manually apply a taint that repels all pods except those with a specific toleration:

kubectl taint nodes gpu-node-01 nvidia.com/gpu=:NoSchedule

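This command adds a taint with key nvidia.com/gpu, an empty value, and effect NoSchedule. The colon separates the key=value part from the effect; because nothing follows the equals sign, the value is empty, and the shorter form nvidia.com/gpu:NoSchedule produces the same taint. A pod tolerates it with operator: Exists, or with operator: Equal and an empty value. The effect NoSchedule prevents the scheduler from placing new pods on the node. Pods already running there are not evicted, and a container restart does not move them; they leave only when they are deleted and their replacements go through scheduling again.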
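To make the key, value, and effect split concrete, here is a small Python sketch that parses the same `key=value:effect` notation. It is an illustrative model, not kubectl's parser: `parse_taint` and `ParsedTaint` are hypothetical names, and removal forms such as a trailing `-` are deliberately not handled.

```python
from typing import NamedTuple

VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


class ParsedTaint(NamedTuple):
    key: str
    value: str
    effect: str


def parse_taint(spec: str) -> ParsedTaint:
    """Split 'key=value:effect' or 'key:effect' into its three parts."""
    # The effect is everything after the last colon.
    key_value, sep, effect = spec.rpartition(":")
    if not sep or effect not in VALID_EFFECTS:
        raise ValueError(f"missing or unknown effect in {spec!r}")
    # The value is whatever follows the first '='; it may be empty or absent.
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError(f"missing key in {spec!r}")
    return ParsedTaint(key=key, value=value, effect=effect)


print(parse_taint("nvidia.com/gpu=:NoSchedule"))
print(parse_taint("nvidia.com/gpu:NoSchedule"))
```

Both calls yield the same key, an empty value, and the NoSchedule effect, which is why the two spellings are interchangeable on the command line.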

A pod that needs to run on this node must include a matching toleration in its spec:

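apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: pytorch
    # Full image tag; adjust to the PyTorch/CUDA build you actually run.
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        # The GPU request is what steers the pod toward a GPU node.
        nvidia.com/gpu: 1
  tolerations:
  # The toleration only permits scheduling onto the tainted node.
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule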

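The operator: Exists allows the pod to tolerate any value for the nvidia.com/gpu key, including the empty one. This is standard for GPU workloads. A toleration does not attract the pod to GPU nodes; the nvidia.com/gpu limit does that. Without the tolerations block, the pod remains Pending whenever every node that can satisfy its GPU request is tainted.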

Distinguishing labels from taints

A common failure mode involves confusing node labels with node taints. Node Feature Discovery (NFD), often deployed by the nvidia-gpu-operator, scans hardware and applies labels such as feature.node.kubernetes.io/pci-10de.present=true; the operator builds on these and adds labels like nvidia.com/gpu.present=true. Labels are for selection, not restriction.

Labels allow pods to select nodes using nodeSelector or nodeAffinity. Taints force nodes to reject pods. The NodeFeatureRule CRD defines how labels are applied based on hardware detection. By default it does not taint anything. Recent NFD releases do offer a taints field in NodeFeatureRule, but it is experimental and has to be enabled explicitly, so a cluster that only runs NFD with default settings ends up with labeled, untainted GPU nodes.

This distinction matters when debugging scheduling issues. If a GPU node has the label nvidia.com/gpu.present=true but no taint, CPU pods will land on it. If the node has the taint nvidia.com/gpu=:NoSchedule but the pod lacks the toleration, the GPU pods will not land on it.

The nvidia-gpu-operator ClusterPolicy CRD manages the lifecycle of the device plugin and the other operator components. Its devicePlugin section configures the plugin, and its daemonsets section carries the tolerations applied to the operator's own pods. Neither should be assumed to taint GPU nodes: whether and how tainting is available depends on the operator version and must be configured explicitly. The operator ensures the nvidia-device-plugin runs and registers resources. It does not enforce scheduling policy on its own.

This means the cluster state must be verified independently of the operator’s health. The operator may report Ready while the nodes remain untainted. The kube-scheduler relies on the node object in the API server, not the operator’s internal state.
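One way to verify node state independently is to audit the node objects themselves. The sketch below assumes the input is the parsed JSON from `kubectl get nodes -o json` and that the guarding taint uses the key nvidia.com/gpu; the function name `untainted_gpu_nodes` is hypothetical.

```python
from typing import Any, Dict, List

GPU_RESOURCE = "nvidia.com/gpu"
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def untainted_gpu_nodes(node_list: Dict[str, Any], taint_key: str = GPU_RESOURCE) -> List[str]:
    """Return names of nodes that advertise GPUs but carry no hard GPU taint."""
    exposed = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resource quantities are reported as integer strings, e.g. "8".
        if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
            continue
        taints = node.get("spec", {}).get("taints") or []
        guarded = any(
            t.get("key") == taint_key and t.get("effect") in HARD_EFFECTS
            for t in taints
        )
        if not guarded:
            exposed.append(node["metadata"]["name"])
    return exposed


# Hand-written sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-01"},
         "spec": {"taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}]},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "gpu-node-02"},
         "spec": {},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "cpu-node-01"},
         "spec": {},
         "status": {"allocatable": {"cpu": "64"}}},
    ]
}
print(untainted_gpu_nodes(sample))
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the kubectl output piped in. A non-empty result means CPU pods can land on those nodes regardless of what the operator reports.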

Failure modes and symptoms

The primary failure mode is resource waste. CPU-only pods land on GPU nodes because the node has available CPU and memory. The GPU remains idle. This is silent waste. There is no error event. The pod runs successfully. The metric DCGM_FI_DEV_GPU_UTIL remains near zero.

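This waste compounds quickly. If a cluster has 10 GPU nodes and each accepts 10 CPU pods, 100 CPU pods occupy 10 GPU nodes. The CPU pods do not consume GPUs, but they do consume the CPU and memory that a GPU pod also needs. When a training pod requests 8 GPUs plus the CPU and memory to feed them, the scheduler finds nodes with 8 free GPUs but too little free CPU or memory. It reports Insufficient cpu or Insufficient memory, not Insufficient nvidia.com/gpu. The training job stays Pending while every GPU in the cluster is idle. The operator logs show no errors. A cluster autoscaler, if one is configured, reacts to the Pending pod rather than to utilization, so it may add another GPU node instead of exposing the crowding.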
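The arithmetic can be checked with a toy resource-fit model. The per-node capacity (64 CPUs, 8 GPUs) and the per-pod CPU requests (6 for each CPU pod, 16 for the training pod) are assumptions chosen for illustration; only the counts of 10 nodes, 10 CPU pods per node, and 8 requested GPUs come from the scenario. In this model the GPUs stay free and the rejection comes from CPU.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    name: str
    free: Dict[str, int]


def unmet_requests(node: Node, requests: Dict[str, int]) -> List[str]:
    """List the resources the node cannot supply, in scheduler-style wording."""
    return [
        f"Insufficient {resource}"
        for resource, amount in requests.items()
        if node.free.get(resource, 0) < amount
    ]


def place(node: Node, requests: Dict[str, int]) -> bool:
    """Reserve the requested resources if they all fit."""
    if unmet_requests(node, requests):
        return False
    for resource, amount in requests.items():
        node.free[resource] -= amount
    return True


# Assumed capacity: 64 CPUs and 8 GPUs per node, no taints.
nodes = [Node(f"gpu-node-{i:02d}", {"cpu": 64, "nvidia.com/gpu": 8}) for i in range(1, 11)]

# Ten CPU-only pods per node, each requesting an assumed 6 CPUs.
for node in nodes:
    for _ in range(10):
        place(node, {"cpu": 6})

# Training pod: 8 GPUs plus an assumed 16 CPUs for data loading.
training = {"cpu": 16, "nvidia.com/gpu": 8}
reasons = {node.name: unmet_requests(node, training) for node in nodes}
free_gpus = sum(node.free["nvidia.com/gpu"] for node in nodes)

print(reasons["gpu-node-01"])
print(free_gpus)
```

Every node keeps all 8 GPUs free yet has only 4 CPUs left, so the training pod is rejected everywhere for CPU while 80 GPUs sit idle.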

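The second failure mode is over-tainting. If every GPU node is tainted with nvidia.com/gpu=:NoSchedule but the operator's DaemonSets do not carry a matching toleration, the operator components never reach those nodes. The device-plugin DaemonSet pods require a toleration to run on tainted nodes. If the DaemonSet spec lacks it, current Kubernetes versions do not even create a pod for the tainted node, so there may be no Pending pod to find; the visible symptom is a DaemonSet whose desired count is lower than the number of GPU nodes. The node reports 0 GPUs because the plugin is not running.

Check the kube-scheduler events for FailedScheduling to find workload pods that cannot be placed, which covers both a training job crowded out in the first case and a GPU pod that lacks the toleration. To diagnose missing operator pods in the second case, compare the device-plugin DaemonSet's desired and ready counts with the number of GPU nodes.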

kubectl get events --all-namespaces --field-selector reason=FailedScheduling --sort-by='.lastTimestamp'

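This command lists pods that could not be placed. Look for a message such as node(s) had untolerated taint {nvidia.com/gpu: } (older releases word it as node(s) had taint {nvidia.com/gpu: }, that the pod didn't tolerate). Either form confirms the taint is active and the pod is missing the toleration.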
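When many events need triage, the taint can be pulled out of the message text. The sketch below is an assumption-laden helper, not part of any Kubernetes client: `untolerated_taints` is a hypothetical name, the sample messages are hand-written in the style of scheduler output, and the regular expression covers only the two wordings discussed here.

```python
import re
from typing import List, NamedTuple


class UntoleratedTaint(NamedTuple):
    node_count: int
    key: str
    value: str


# Matches both "had untolerated taint {k: v}" and the older "had taint {k: v}, that ...".
_TAINT_RE = re.compile(r"(\d+) node\(s\) had (?:untolerated )?taint \{([^:}]+): ?([^}]*)\}")


def untolerated_taints(message: str) -> List[UntoleratedTaint]:
    """Extract every taint a FailedScheduling message blames, with node counts."""
    return [
        UntoleratedTaint(int(count), key, value)
        for count, key, value in _TAINT_RE.findall(message)
    ]


# Illustrative messages, not captured from a real cluster.
newer = ("0/4 nodes are available: 4 node(s) had untolerated taint "
         "{nvidia.com/gpu: }. preemption: 0/4 nodes are available: "
         "4 Preemption is not helpful for scheduling.")
older = ("0/3 nodes are available: 3 node(s) had taint {nvidia.com/gpu: }, "
         "that the pod didn't tolerate.")

print(untolerated_taints(newer))
print(untolerated_taints(older))
```

An empty result for a Pending GPU pod suggests the blocker is something other than a taint, such as insufficient CPU or memory.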

The third failure mode is the NoExecute effect. A taint with effect NoExecute not only blocks new pods but also evicts running pods that do not tolerate it. This is useful for draining nodes but dangerous if applied to GPU nodes without verifying pod tolerations. If a GPU node is tainted with NoExecute, CPU pods running on that node are evicted immediately, and pods whose toleration sets tolerationSeconds are evicted once that interval expires. This causes unexpected downtime for workloads that were previously allowed.
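Before applying a NoExecute taint, it helps to list what it would evict. The following sketch assumes pod objects shaped like the items from `kubectl get pods -o json`; `eviction_plan` and the sample pods are hypothetical, and the matching follows the documented toleration semantics in simplified form.

```python
from typing import Any, Dict, List


def _tolerates(tol: Dict[str, Any], key: str, value: str, effect: str) -> bool:
    """Simplified toleration match: empty effect or key on the toleration matches anything."""
    if tol.get("effect") and tol["effect"] != effect:
        return False
    if tol.get("key") and tol["key"] != key:
        return False
    if tol.get("operator", "Equal") == "Exists":
        return True
    return tol.get("value", "") == value


def eviction_plan(pods: List[Dict[str, Any]], key: str, value: str = "",
                  effect: str = "NoExecute") -> Dict[str, str]:
    """Predict what happens to each pod if the node receives this taint."""
    plan = {}
    for pod in pods:
        name = pod["metadata"]["name"]
        tolerations = pod.get("spec", {}).get("tolerations") or []
        match = next((t for t in tolerations if _tolerates(t, key, value, effect)), None)
        if match is None:
            plan[name] = "evicted immediately"
        elif "tolerationSeconds" in match:
            plan[name] = f"evicted after {match['tolerationSeconds']}s"
        else:
            plan[name] = "keeps running"
    return plan


# Hypothetical pods currently running on one GPU node.
pods_on_node = [
    {"metadata": {"name": "cpu-api"}, "spec": {}},
    {"metadata": {"name": "trainer"},
     "spec": {"tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}]}},
    {"metadata": {"name": "batch"},
     "spec": {"tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists",
                               "effect": "NoExecute", "tolerationSeconds": 300}]}},
]
print(eviction_plan(pods_on_node, "nvidia.com/gpu"))
```

Note that a toleration written only for NoSchedule, like the one in the training-job manifest, would not protect a pod from a NoExecute taint.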

Decision frame

The question the next time a GPU pod stays Pending is not ‘is the autoscaler broken.’ It is ‘did the node receive a taint that the pod does not tolerate,’ or, on an untainted node, ‘did CPU pods take the CPU and memory the GPU pod needs.’ The nvidia-gpu-operator ClusterPolicy CRD manages the plugin, but it does not guarantee node taints are active. Verify the node object directly with kubectl describe node <gpu-node>. Look for the Taints section. If it shows <none>, the scheduler will place CPU pods there. If it shows nvidia.com/gpu:NoSchedule, which is how kubectl prints a taint with an empty value, ensure the training job pod spec includes the matching toleration. Consistency between the node state and the pod spec is the only guarantee against silent resource waste.