Why GPU resource quotas behave differently than CPU quotas

A ResourceQuota configured for nvidia.com/gpu enforces limits independently of the scheduler’s placement logic, creating a state where pods are rejected for quota exhaustion even when cluster nodes have capacity.

Kubernetes checks resource consumption at two distinct layers: admission control and scheduling. The admission layer, which runs inside the API server, checks ResourceQuota and LimitRange objects to ensure a namespace stays within defined bounds. The scheduling layer, managed by the kube-scheduler, checks node capacity to place pods. For CPU and memory, requests and limits may differ, and a quota can track requests.cpu and limits.cpu as separate budgets. For GPUs, the rules are stricter. Extended resources like nvidia.com/gpu are countable, whole units that cannot be overcommitted: if a container sets both a request and a limit, the two must be equal, and a ResourceQuota accepts only the requests. prefix for them (requests.nvidia.com/gpu). In both cases the quota check never consults node capacity, and the scheduler never consults the quota.

This divergence creates a specific failure mode in multi-tenant environments. A namespace may exhaust its GPU quota through limits injected by a LimitRange, which also become the pods’ requests, while the nodes still have free GPUs that the kube-scheduler could assign. The result is a rejection at the API server level that looks like a capacity issue but is actually a configuration mismatch. Understanding the order of operations between LimitRange, ResourceQuota, and the scheduler is required to debug why GPU pods fail to start despite available nodes.

The admission control sequence

The lifecycle of a GPU pod begins with the API server, not the scheduler. When a user submits a Pod manifest, the API server runs admission plugins in a specific order. First, the LimitRange admission plugin checks whether default values need to be injected. If a container does not specify a limit for nvidia.com/gpu, and a LimitRange in the namespace defines a default for that resource, the API server mutates the Pod to add that default. The container’s request ends up equal to that limit as well, since extended resources require requests and limits to match.

Next, the ResourceQuota admission plugin validates the Pod against the namespace’s hard limits. It adds the Pod’s values to the usage already charged to the namespace by all pods that have not reached a terminal state. If the total would exceed any ResourceQuota hard value, the API server rejects the request immediately with a Forbidden response. The Pod object is never created, so the scheduler never sees it.

This sequence means quota exhaustion is determined by the Pod spec as it enters the API, not by the physical node state. For CPU, a quota can budget requests and limits separately. For GPUs, the request always equals the limit, so a default limit injected by a LimitRange is charged in full against requests.nvidia.com/gpu. This injection consumes the ResourceQuota budget before the scheduler even evaluates node capacity.
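The sequence above can be modeled in a few lines. The sketch below is a simplified simulation, not the API server’s implementation: the function names apply_limitrange_default and admit are hypothetical, the model covers a single container and a single resource, and it assumes that a container with a GPU limit but no GPU request ends up with a request equal to the limit.

```python
GPU = "nvidia.com/gpu"

def apply_limitrange_default(resources, default_limit):
    """Model the LimitRange step: fill in a missing GPU limit, then make the
    request equal to the limit (extended resources cannot be overcommitted)."""
    limits = dict(resources.get("limits", {}))
    requests = dict(resources.get("requests", {}))
    if GPU not in limits and default_limit is not None:
        limits[GPU] = default_limit
    if GPU in limits and GPU not in requests:
        requests[GPU] = limits[GPU]
    return {"limits": limits, "requests": requests}

def admit(resources, used, hard):
    """Model the ResourceQuota step: admit only if current usage plus the new
    request stays within the hard value. Returns (admitted, new_used)."""
    requested = resources["requests"].get(GPU, 0)
    if used + requested > hard:
        return False, used
    return True, used + requested

# Five pods submitted with no GPU fields, in a namespace whose LimitRange
# default is 1 GPU and whose quota is requests.nvidia.com/gpu: 4.
used = 0
results = []
for _ in range(5):
    mutated = apply_limitrange_default({}, default_limit=1)
    ok, used = admit(mutated, used, hard=4)
    results.append(ok)

print(results)  # [True, True, True, True, False]
print(used)     # 4
```

Node capacity appears nowhere in this model, which is the point: the fifth pod is rejected purely on the namespace budget.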

The following table compares how ResourceQuota and the scheduler treat CPU, memory, and GPU resources.

Resource Type	Quota Tracks	Scheduler Enforces	Typical LimitRange Behavior
`cpu`	`requests.cpu`, `limits.cpu`	`requests`	Often omitted or matches request
`memory`	`requests.memory`, `limits.memory`	`requests`	Often omitted or matches request
`nvidia.com/gpu`	`requests.nvidia.com/gpu` only	`requests`	May inject default `limits`, which also sets `requests`

The critical distinction is in the LimitRange behavior. Some operators inject a default GPU limit so that every container carries an explicit GPU count. However, the injected default becomes the container’s request and is charged to requests.nvidia.com/gpu. If the quota is set to 4 GPUs, and a LimitRange injects a limit of 1 GPU per container, four single-container pods will consume the entire quota budget, even if they were submitted without any GPU fields and the nodes have more than 4 GPUs free.

The quota enforcement mechanism

A ResourceQuota object defines the hard limits for a namespace. For CPU and memory it supports both requests and limits, tracked independently in the ResourceQuota status. For extended resources it supports only the requests. prefix, because overcommit is not allowed and the limit always equals the request. When a Pod is created, the API server adds the Pod’s requests to the namespace’s current usage and compares the total with the hard value.

If a namespace defines a ResourceQuota with requests.nvidia.com/gpu: 4, the namespace has a budget of 4 GPU units. When a Pod arrives without a GPU limit, the LimitRange admission plugin fills in the missing field. If the LimitRange specifies default: 1 for nvidia.com/gpu, the container is mutated to carry a limit of 1 and a matching request of 1. The ResourceQuota then charges 1 against the requests.nvidia.com/gpu budget.
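For concreteness, the two objects involved might look like the following. This is a sketch under assumptions: the object names gpu-quota and gpu-defaults and the namespace team-a are hypothetical. Note that the quota key uses the requests. prefix, the only prefix ResourceQuota accepts for extended resources, and that the LimitRange field holding a default limit is named default. The manifests are written as Python dictionaries and printed as JSON, which kubectl apply -f accepts in place of YAML.

```python
import json

# Namespace budget: at most 4 GPUs requested across all non-terminal pods.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "gpu-quota", "namespace": "team-a"},  # hypothetical names
    "spec": {"hard": {"requests.nvidia.com/gpu": "4"}},
}

# Default injected into any container that omits a GPU limit.
limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "gpu-defaults", "namespace": "team-a"},  # hypothetical names
    "spec": {
        "limits": [
            {
                "type": "Container",
                "default": {"nvidia.com/gpu": "1"},
            }
        ]
    },
}

# Bundle both objects so they can be applied in one step.
manifest = {"apiVersion": "v1", "kind": "List", "items": [resource_quota, limit_range]}
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and passing it to kubectl apply -f would create both objects; whether injecting a GPU default is desirable for a given namespace is a separate design decision.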

This mechanism works correctly in isolation, but it surprises operators who expect node capacity to drive admission. The kube-scheduler considers only requests and node allocatable capacity when filtering nodes; it does not read the ResourceQuota. A node with 8 GPUs has room for 8 pods that request 1 GPU each, but a namespace whose quota is 4 will have its fifth pod rejected at admission.

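The API server returns an error when the quota is breached. The error message is specific: exceeded quota: <name>, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=<used>, limited: requests.nvidia.com/gpu=<limit>. For a Pod created directly, the client receives this message in a Forbidden response. For a Pod created by a controller such as a Deployment or Job, the message appears in a FailedCreate event on the owning ReplicaSet or Job. It does not mention node capacity. It only mentions the namespace budget.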
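When these messages are collected from client output or controller events, it can help to split them into fields. The sketch below assumes the message layout quoted above; the function name parse_quota_error, the pod name trainer-0, and the quota name gpu-quota are hypothetical, and the sample string is constructed for illustration rather than captured from a cluster.

```python
import re

def parse_quota_error(message):
    """Split an 'exceeded quota' message into the quota name and three
    dicts (requested, used, limited) keyed by quota resource name."""
    match = re.search(
        r"exceeded quota: (?P<name>[^,]+), "
        r"requested: (?P<requested>.+?), "
        r"used: (?P<used>.+?), "
        r"limited: (?P<limited>.+)$",
        message,
    )
    if match is None:
        return None

    def to_dict(field):
        # Each field is a comma-separated list of key=value pairs.
        return dict(pair.split("=", 1) for pair in field.split(",") if pair)

    return {
        "quota": match.group("name"),
        "requested": to_dict(match.group("requested")),
        "used": to_dict(match.group("used")),
        "limited": to_dict(match.group("limited")),
    }

# Illustrative message, not captured from a real cluster.
sample = (
    'pods "trainer-0" is forbidden: exceeded quota: gpu-quota, '
    "requested: requests.nvidia.com/gpu=1, "
    "used: requests.nvidia.com/gpu=4, "
    "limited: requests.nvidia.com/gpu=4"
)
parsed = parse_quota_error(sample)
print(parsed["quota"])      # gpu-quota
print(parsed["requested"])  # {'requests.nvidia.com/gpu': '1'}
```

Comparing the used and limited fields shows at a glance that the namespace budget, not the cluster, is full.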

Operators often misdiagnose this as a cluster capacity issue. They check kubectl describe node and see available GPUs. They see a Deployment or Job running fewer pods than requested. They assume the scheduler cannot find a fit. The reality is the Pod never reached the scheduler. The API server rejected it during admission control because the ResourceQuota budget was full, regardless of node availability.

Failure modes and event diagnostics

The most common symptom is a workload whose pods never appear. No Pod sits in Pending, because a Pod rejected at admission is never stored. Instead, kubectl describe on the owning ReplicaSet, StatefulSet, or Job shows a FailedCreate event containing the exceeded quota message. The message explicitly states which resource exceeded the limit. If the ResourceQuota tracks requests.nvidia.com/gpu, and the LimitRange injected a limit, the event will cite requests.nvidia.com/gpu.

A second failure mode occurs when the LimitRange default reaches containers that were never meant to use a GPU. If a LimitRange injects a default GPU limit, every container that omits the field receives it, and the matching request is charged to requests.nvidia.com/gpu. The effective quota is then lower than intended, because pods the operator did not count as GPU consumers draw from the same budget.

Consider a namespace with a ResourceQuota of requests.nvidia.com/gpu: 4. If a Pod is submitted with a GPU limit of 1, the request defaults to 1 and the Pod consumes 1 from the budget. This is correct. However, if the Pod omits the GPU fields and the LimitRange injects a default limit of 2 for safety, the Pod consumes 2 from the budget. Two such pods exhaust the quota, even though the operator expected four to fit.
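The arithmetic in this example can be checked with a short calculation. This is a minimal sketch: the function name pods_admitted is hypothetical, the figure of 8 free GPUs on the nodes is an assumption added for contrast, and node capacity is treated as one pool, ignoring fragmentation across nodes.

```python
def pods_admitted(quota_gpus, node_gpus, gpus_per_pod):
    """Return (by_quota, by_capacity, admitted) for identical pods.
    The binding constraint is whichever count is smaller."""
    if gpus_per_pod <= 0:
        raise ValueError("gpus_per_pod must be a positive integer")
    by_quota = quota_gpus // gpus_per_pod
    by_capacity = node_gpus // gpus_per_pod
    return by_quota, by_capacity, min(by_quota, by_capacity)

# Quota of 4 and 8 free GPUs on the nodes (the node figure is an assumption).
print(pods_admitted(4, 8, gpus_per_pod=1))  # (4, 8, 4)
print(pods_admitted(4, 8, gpus_per_pod=2))  # (2, 4, 2)
```

In both cases the quota count is the smaller number, so raising node capacity would not admit a single additional pod.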

The kube-scheduler logs will not show these failures. The scheduler only sees pods that passed admission control. To debug, the operator must inspect the ResourceQuota status and the LimitRange configuration. The command kubectl get resourcequota -n <namespace> -o yaml reveals the used and hard values. The command kubectl get limitrange -n <namespace> -o yaml reveals the injected defaults.
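The used and hard values can also be compared programmatically. The sketch below reads a ResourceQuota serialized as JSON, as produced by kubectl get resourcequota <name> -n <namespace> -o json. The function name gpu_headroom is hypothetical, and the sample object is hand-written for illustration with hypothetical names.

```python
import json

def gpu_headroom(quota_json, key="requests.nvidia.com/gpu"):
    """Return (used, hard, remaining) for one quota key from a ResourceQuota
    object serialized as JSON. GPU quantities are whole numbers."""
    status = json.loads(quota_json)["status"]
    used = int(status.get("used", {}).get(key, "0"))
    hard = int(status["hard"][key])
    return used, hard, hard - used

# Hand-written sample; gpu-quota and team-a are hypothetical names.
sample = json.dumps({
    "kind": "ResourceQuota",
    "metadata": {"name": "gpu-quota", "namespace": "team-a"},
    "status": {
        "hard": {"requests.nvidia.com/gpu": "4"},
        "used": {"requests.nvidia.com/gpu": "4"},
    },
})
print(gpu_headroom(sample))  # (4, 4, 0)
```

A remaining value of 0 means the next GPU pod in the namespace will be rejected at admission no matter how many GPUs the nodes report as free.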

The discrepancy is often invisible in the manifest the operator submitted. For a Pod that was admitted, kubectl get pod -o yaml shows the final mutated spec, including the injected limits and matching requests. If the operator does not compare the used quota against the hard limit, they miss the fact that the namespace budget is the bottleneck, not node capacity.

Decision frame

The next time a GPU workload fails to start with an exceeded quota error, the investigation must start with the ResourceQuota status, not the node capacity. The question is not whether the cluster has available GPUs, but whether the namespace’s requests.nvidia.com/gpu budget has been consumed by injected defaults. If the LimitRange injects a default GPU limit, the quota budget fills up faster than a count of explicitly requested GPUs would suggest. Size the ResourceQuota hard value with the LimitRange defaults in mind to prevent the admission layer from blocking valid workloads.