How the default scheduler scores nodes for GPU pods

The default scheduler places GPU pods through weighted scoring plugins such as NodeResourcesFit and InterPodAffinity; pod annotations carry no weight in that score.

The Kubernetes scheduler binds Pod objects to Node objects through a two-phase process: filtering and scoring. Filtering eliminates nodes that cannot run the pod based on hard constraints like taints or resource capacity. Scoring ranks the remaining nodes to determine the best placement. For GPU workloads, the nvidia.com/gpu resource request triggers specific logic in the NodeResourcesFit plugin, which calculates availability against status.allocatable rather than status.capacity.

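This mechanism determines bin-packing efficiency. Given two nodes of equal GPU capacity, a pod requesting 2 GPUs scores higher on the node with 4 GPUs available than on the node with 8 GPUs available under the MostAllocated strategy; under LeastAllocated, the default, the ranking reverses. NodeResourcesFit scores only the resources listed in its scoringStrategy.resources, which default to cpu and memory, so nvidia.com/gpu must be added there before GPU usage affects the score. The scheduler does not read pod annotations to make this decision. Annotations are metadata, not resource constraints, unless an admission controller explicitly maps them to spec fields. The KubeSchedulerConfiguration file defines the weight of each scoring plugin, making the configuration the source of truth for placement logic.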

Understanding the scoring pipeline allows operators to debug why a GPU pod stays Pending or why it lands on a specific node. The scheduler exposes its decisions through Kubernetes events, which appear in kubectl describe output. The kube-scheduler component runs as a control plane process, separate from the worker nodes. It watches for Pods that have no node assigned and typically binds each one within seconds when a feasible node exists.

The filtering and scoring pipeline

The scheduler first runs the Filter phase. This phase rejects nodes that do not meet the pod’s requirements. For GPU pods, the NodeResourcesFit plugin checks whether the node has enough nvidia.com/gpu resources. The check compares the pod’s resources.requests against the node’s status.allocatable minus the requests of pods already assigned to that node. If the node has 4 allocatable GPUs and 2 are already requested, a pod requesting 3 GPUs fails the filter.
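The filter check can be sketched in a few lines of Python. This is a simplified model, not the scheduler's implementation: the Node class, the fits function, and the node name gpu-node-a are hypothetical names introduced for illustration, and quantities are plain integers.

```python
from dataclasses import dataclass, field

GPU = "nvidia.com/gpu"


@dataclass
class Node:
    """Hypothetical, simplified view of a node as the filter sees it."""
    name: str
    allocatable: dict                               # mirrors status.allocatable
    requested: dict = field(default_factory=dict)   # requests of pods already placed


def fits(node, pod_requests):
    """Return the resources the node cannot satisfy; an empty list means the node passes."""
    insufficient = []
    for resource, amount in pod_requests.items():
        free = node.allocatable.get(resource, 0) - node.requested.get(resource, 0)
        if amount > free:
            insufficient.append(resource)
    return insufficient


# The example from the text: 4 allocatable GPUs, 2 already requested.
node = Node("gpu-node-a", allocatable={GPU: 4}, requested={GPU: 2})
print(fits(node, {GPU: 3}))  # ['nvidia.com/gpu'] -> node is filtered out
print(fits(node, {GPU: 2}))  # [] -> node stays a candidate
```

Returning the list of insufficient resources, rather than a boolean, mirrors how the scheduler reports which resource caused the rejection.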

Once filtering is complete, the scheduler enters the Score phase. In this phase each scoring plugin assigns every candidate node a numerical score between 0 and 100. The default profile includes multiple scoring plugins. NodeResourcesFit evaluates resource utilization. InterPodAffinity evaluates pod proximity to other pods. NodeAffinity evaluates node labels. TaintToleration acts in both phases: it filters out nodes with untolerated NoSchedule or NoExecute taints, and in scoring it prefers nodes with fewer untolerated PreferNoSchedule taints.

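The NodeResourcesFit plugin uses a scoring strategy to calculate the score. Three strategies exist: LeastAllocated (the default), MostAllocated, and RequestedToCapacityRatio. LeastAllocated favors nodes with lower utilization, which spreads pods. MostAllocated favors nodes with higher utilization, which packs them. RequestedToCapacityRatio scores through a user-defined shape function. For each scored resource, the LeastAllocated formula is shown here, where requested counts the requests of pods already on the node plus the incoming pod, and capacity is the node’s allocatable amount: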

score = (1 - (requested / capacity)) * 100

If a node has 8 GPUs and 6 are allocated, a pod requesting 2 GPUs brings utilization to 100%, and the score drops to 0. If a node has 8 GPUs and 2 are allocated, the same request brings utilization to 50%, and the score is 50. Each plugin’s scores are normalized to the 0 to 100 range and multiplied by the plugin’s weight, and the scheduler sums the weighted scores. The node with the highest total wins; ties are broken at random.
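The arithmetic can be checked with a small sketch. It models the per-resource LeastAllocated formula and its MostAllocated counterpart for a single resource, counting the GPUs already allocated plus the incoming request; the function names are illustrative, and the integer rounding is an assumption of this sketch rather than a statement about the scheduler's exact code.

```python
def least_allocated(requested, allocatable):
    """Higher score for emptier nodes (spreading). Both arguments are GPU counts."""
    if allocatable == 0 or requested > allocatable:
        return 0
    return (allocatable - requested) * 100 // allocatable


def most_allocated(requested, allocatable):
    """Higher score for fuller nodes (bin-packing)."""
    if allocatable == 0 or requested > allocatable:
        return 0
    return requested * 100 // allocatable


ALLOCATABLE = 8   # GPUs on the node
POD_REQUEST = 2   # GPUs requested by the incoming pod

scores = {}
for already_allocated in (6, 2):
    total = already_allocated + POD_REQUEST  # utilization after placing the pod
    scores[already_allocated] = (
        least_allocated(total, ALLOCATABLE),
        most_allocated(total, ALLOCATABLE),
    )
    print(already_allocated, scores[already_allocated])
# 6 GPUs already allocated -> (0, 100)
# 2 GPUs already allocated -> (50, 50)
```

The two strategies rank the same pair of nodes in opposite order, which is why the strategy choice, not the pod, decides between packing and spreading.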

Configuration and inspection

The weights for each scoring plugin are defined in the KubeSchedulerConfiguration API. This configuration is not stored in the Pod spec. It lives in a file that the kube-scheduler process reads through its --config flag; in some clusters that file is mounted from a ConfigMap. Changing pod annotations does not change these weights. To modify scoring behavior, an operator must update the scheduler configuration and restart the scheduler.

The following YAML shows a KubeSchedulerConfiguration snippet where the NodeResourcesFit plugin is explicitly enabled with a weight.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 100
      - name: InterPodAffinity
        weight: 1

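In this configuration, NodeResourcesFit dominates the decision. The InterPodAffinity plugin has a weight of 1, meaning it has minimal impact compared to resource fit. The weight alone does not decide between packing and spreading; the plugin’s scoringStrategy does. With MostAllocated and nvidia.com/gpu listed in scoringStrategy.resources, GPU pods bin-pack tightly: the scheduler fills one node’s GPU capacity before moving to the next. With the default LeastAllocated strategy, the same weight pushes pods toward the emptiest nodes instead.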
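A short sketch shows how these weights play out when the scheduler combines plugin scores. The weights match the snippet above; the per-node plugin scores are invented for illustration, and total_score is a hypothetical helper, not a scheduler API.

```python
# Weights taken from the KubeSchedulerConfiguration snippet above.
WEIGHTS = {"NodeResourcesFit": 100, "InterPodAffinity": 1}


def total_score(plugin_scores, weights=WEIGHTS):
    """Weighted sum of normalized (0-100) plugin scores for one node."""
    return sum(weights[name] * score for name, score in plugin_scores.items())


# Invented scores: node_b has perfect affinity but a slightly worse resource fit.
node_a = {"NodeResourcesFit": 60, "InterPodAffinity": 0}
node_b = {"NodeResourcesFit": 55, "InterPodAffinity": 100}

print(total_score(node_a))  # 6000
print(total_score(node_b))  # 5600
```

A five-point lead in resource fit outweighs a perfect affinity score here, which is what a 100-to-1 weight ratio implies.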

Operators can inspect placement decisions using kubectl. The describe command shows the events generated during scheduling. A Pending pod will show FailedScheduling events if no nodes pass the filter phase. A Scheduled pod shows the Scheduled event with the node name.

kubectl describe pod gpu-inference-0

The output includes an Events section at the bottom. Look for the Reason field. FailedScheduling indicates that no node passed the filter phase. Scheduled indicates that a node was selected and the pod was bound to it. The Message field often contains the specific reason, such as 0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
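When many pods are Pending, it can help to tally these messages programmatically. The sketch below parses a message of the shape quoted above. The event message format is not a stable API and newer messages may carry extra clauses, so the regular expressions are an assumption fitted to this one example; parse_failed_scheduling is a hypothetical helper.

```python
import re


def parse_failed_scheduling(message):
    """Split a FailedScheduling message into node counts and per-reason counts."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if head is None:
        return None  # message does not follow the assumed shape
    reasons = {}
    for part in head.group(3).rstrip(".").split(", "):
        item = re.match(r"(\d+) (.+)", part)
        if item:
            reasons[item.group(2)] = int(item.group(1))
    return {
        "available": int(head.group(1)),
        "total": int(head.group(2)),
        "reasons": reasons,
    }


msg = "0/3 nodes are available: 3 Insufficient nvidia.com/gpu."
print(parse_failed_scheduling(msg))
# {'available': 0, 'total': 3, 'reasons': {'Insufficient nvidia.com/gpu': 3}}
```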

Failure modes and symptoms

GPU pods frequently stay in the Pending state due to resource exhaustion. The NodeResourcesFit plugin rejects nodes that do not have enough free nvidia.com/gpu. The scheduler evicts existing pods only through preemption, which applies when the pending pod has a higher priority than the pods it would displace. Otherwise the pod stays in the scheduling queue and is retried as cluster state changes, for example when the cluster-autoscaler provisions a new node. If the autoscaler has no node group that provides GPUs, the pod remains Pending indefinitely.

Another common failure mode is the Insufficient error. This occurs when the node reports fewer allocatable GPUs than expected. For cpu and memory, the kubelet derives status.allocatable by subtracting reserved resources from status.capacity. GPUs are extended resources advertised by a device plugin, so the reservation settings do not apply; instead, allocatable counts only the devices the plugin reports as healthy. If one of 8 physical GPUs is marked unhealthy, the node reports 7 allocatable GPUs. A pod requesting 8 GPUs will fail to schedule.
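A gap between capacity and allocatable is easy to spot from the node object. The sketch below assumes a dictionary shaped like the JSON that kubectl get node -o json returns, with quantities as strings; the sample node and the gpu_gap helper are hypothetical.

```python
GPU = "nvidia.com/gpu"


def gpu_gap(node):
    """Return (capacity, allocatable, gap) for the GPU resource of one node object."""
    status = node.get("status", {})
    capacity = int(status.get("capacity", {}).get(GPU, "0"))
    allocatable = int(status.get("allocatable", {}).get(GPU, "0"))
    return capacity, allocatable, capacity - allocatable


# Hypothetical node: 8 physical GPUs, 7 allocatable.
sample = {
    "metadata": {"name": "gpu-node-a"},
    "status": {
        "capacity": {GPU: "8"},
        "allocatable": {GPU: "7"},
    },
}

print(gpu_gap(sample))  # (8, 7, 1)
```

A nonzero gap means a pod requesting the full physical GPU count cannot pass the filter on that node.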

Annotations do not resolve these issues. Adding an annotation such as the deprecated scheduler.alpha.kubernetes.io/critical-pod, or any custom annotation, does not change NodeResourcesFit filtering or scoring. The scheduler ignores pod annotations unless a custom plugin reads them. The KubeSchedulerConfiguration controls the logic. If the NodeResourcesFit strategy is set to MostAllocated, the scheduler packs pods onto the fullest nodes that still fit. If set to LeastAllocated, it spreads them across the emptiest nodes. The default is LeastAllocated, which favors even utilization over dense packing.

Misconfigured PriorityClass values can also cause issues. High-priority pods can preempt low-priority pods to free up resources. The reverse does not hold: a pending low-priority GPU pod cannot displace higher-priority pods that occupy the GPUs it needs. It waits for those pods to finish or for a new node to become available. A running low-priority GPU pod, in turn, may be evicted when a higher-priority pod needs its GPUs. The pod selects its class through spec.priorityClassName; the PriorityClass resource defines the priority value and the preemptionPolicy.
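The preemption rule can be approximated as a feasibility check on one node. This is a deliberately simplified sketch: it considers only GPU counts and priorities, and it ignores the other factors the scheduler weighs when choosing victims. The function name and the example priorities are hypothetical.

```python
def preemption_frees_enough(free_gpus, running, pending_priority, pending_gpus,
                            policy="PreemptLowerPriority"):
    """Could evicting lower-priority pods on this node make room for the pending pod?

    running is a list of (priority, gpus) tuples for pods already on the node.
    """
    if policy == "Never":
        return False  # the pending pod's class opts out of preempting others
    reclaimable = sum(gpus for priority, gpus in running if priority < pending_priority)
    return free_gpus + reclaimable >= pending_gpus


# Node with 8 GPUs, all in use: one pod at priority 1000 (6 GPUs), one at 100 (2 GPUs).
running = [(1000, 6), (100, 2)]

print(preemption_frees_enough(0, running, pending_priority=500, pending_gpus=2))  # True
print(preemption_frees_enough(0, running, pending_priority=50, pending_gpus=2))   # False
print(preemption_frees_enough(0, running, 500, 2, policy="Never"))                # False
```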

Decision frame

The question the next time a GPU pod stays Pending is not ‘is the scheduler broken.’ It is ‘did the pod request more nvidia.com/gpu than any node’s status.allocatable has free.’ The NodeResourcesFit plugin checks exact resource counts, not model types. If the pod requests nvidia.com/gpu: 1 and the node has 1 GPU available, it schedules. If the pod carries nvidia.com/gpu.product: NVIDIA-A100 only as an annotation, the default scheduler ignores it; a model constraint takes effect only when expressed as a nodeSelector or node affinity against a matching node label. Read the KubeSchedulerConfiguration weights and scoring strategy, not the pod annotations. The configuration dictates the bin-packing strategy; the pod spec dictates the resource demand.