A higher-priority inference pod can terminate a running training pod if the scheduler finds no node with enough free resources.
This dynamic exists because the Kubernetes scheduler treats resource requests as hard constraints. When a new pod requires resources that are unavailable on any node, the kube-scheduler checks if lower-priority pods can be evicted to make room. For AI workloads, this creates a critical risk where a 30-second inference request can interrupt a 6-hour distributed training job. The mechanism relies on integer values assigned to workloads, not on the workload type itself.
Platform engineers often assume that once a pod is running, it is safe. In a cluster with GPU partitioning and strict quotas, safety is conditional. The PriorityClass resource defines the relative importance of a workload. If a high-priority pod arrives and the cluster is full, the scheduler checks whether removing lower-priority pods from some node would make room for it. If it would, the eviction proceeds. This process is automatic and silent unless events are monitored.
What it is
PriorityClass is a cluster-scoped resource that maps a name to an integer value; a pod picks up that value by referencing the class in priorityClassName. Higher values indicate higher priority. The kube-scheduler uses this value to order pending pods in the scheduling queue and to determine preemption candidates. Preemption is the act of terminating existing pods to free resources for a new pod that cannot otherwise be scheduled.
The mechanism operates at the control plane level. The kube-scheduler does not interact with the GPU drivers, the container runtime, or the kubelet directly during preemption decisions. It asks the API server to delete the victim pods. The kubelet on the affected node observes the deletion, terminates the containers, and releases the GPU resources. This separation means the eviction latency depends on the kubelet sync loop and the container runtime, not just the scheduler logic.
The mechanism
The interaction begins with the PriorityClass definition. A cluster administrator creates a PriorityClass with a specific value. The value field is an integer. Higher integers represent higher priority. The preemptionPolicy field determines how the scheduler handles lower-priority pods when scheduling a pod with this class.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-inference
value: 1000000
globalDefault: false
description: "For latency-sensitive inference workloads"
preemptionPolicy: PreemptLowerPriority
An inference pod requests this class via priorityClassName. A lower-priority training pod uses a different class.
apiVersion: v1
kind: Pod
metadata:
  name: training-job-1
spec:
  priorityClassName: low-priority-training
  containers:
  - name: trainer
    image: pytorch:2.1
    resources:
      limits:
        nvidia.com/gpu: 4
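The training pod above references a low-priority-training class that this section never defines. A minimal sketch of what that class might look like follows. The value of 1000 is an assumption for illustration; any value below the 1000000 given to high-priority-inference produces the same ordering.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-training
# Assumed value for illustration: it only needs to be lower than
# the 1000000 assigned to high-priority-inference.
value: 1000
globalDefault: false
description: "For batch training workloads that may be preempted"
```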
When the kube-scheduler receives the high-priority inference pod, it first looks for a node with enough free GPUs for the pod's request (4 in this example). If no node has 4 free GPUs, the scheduler scans for lower-priority pods on nodes that could fit the inference pod if those pods were removed. The scheduler calculates the resource delta. If the inference pod fits after eviction, the scheduler selects the victim pods.
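For concreteness, the inference pod in this scenario might look like the sketch below. The pod name and image are hypothetical placeholders; only priorityClassName and the 4-GPU limit come from the text.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server-1            # hypothetical name
spec:
  # Ties the pod to the class defined earlier (value 1000000).
  priorityClassName: high-priority-inference
  containers:
  - name: server
    image: inference-server:latest    # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 4             # the 4-GPU request described in the text
```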
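The selection of victims follows specific rules. On each candidate node, the scheduler considers only pods whose priority is lower than the incoming pod's, and it spares higher-priority pods among them wherever the incoming pod still fits. When several nodes qualify, it prefers the node whose victims violate the fewest PodDisruptionBudgets and, among those, the node whose victims have the lowest priority. The goal is to minimize the disruption while satisfying the new request. This calculation happens in the kube-scheduler before any deletion occurs.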
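Preemption takes PodDisruptionBudgets into account on a best-effort basis: the scheduler prefers victims whose budgets would not be violated, but it still preempts if no such victims exist. A sketch of a budget covering training pods, assuming they carry a hypothetical workload: training label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb          # hypothetical name
spec:
  # Signals that no training pod should be disrupted voluntarily.
  maxUnavailable: 0
  selector:
    matchLabels:
      workload: training      # hypothetical label on the training pods
```

A budget of zero also blocks other voluntary disruptions such as node drains, so it steers victim selection rather than guaranteeing protection.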
The eviction sequence is observable via the API server.
kubectl get events --field-selector reason=Preempted --sort-by=.lastTimestamp
The output shows the Preempted reason, the victim pod, and a message identifying the preempting pod. The API server sets a deletion timestamp on the victim pod, which kubectl displays as Terminating. The kubelet on the node observes the update and begins the container shutdown sequence. The GPU is released only after the container exits.
| Policy | Behavior | Risk to Running Pods |
|---|---|---|
| PreemptLowerPriority | Pods of this class may evict pods with lower priority values. | High if priority classes are misconfigured. |
| Never | Pods of this class never evict other pods. | Low for running pods, high for scheduling failures. |
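The preemptionPolicy field on the PriorityClass controls this behavior, and these are its only two valid values. The default is PreemptLowerPriority. If set to Never, a pod of that class will remain Pending if resources are unavailable, rather than evicting others. Never restricts only what pods of that class may do to others; they can still be preempted themselves by a higher-priority pod. This tradeoff is static per PriorityClass and cannot be overridden at the pod level.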
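A non-preempting class is declared the same way as any other, with preemptionPolicy set explicitly. The sketch below uses a hypothetical class name and reuses the value 1000000 from the earlier example.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 1000000
# Pods of this class are placed ahead of lower-priority pods in the
# scheduling queue but never evict running pods.
preemptionPolicy: Never
globalDefault: false
description: "High priority in the queue, no preemption of running pods"
```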
Failure modes
The most common failure mode is unintended eviction of long-running jobs. A training job running for 6 hours may be terminated by a spike in inference traffic. The symptom is a training pod that enters Terminating and then disappears, with an event of reason Preempted recorded against it. A message such as The node was low on resource points to a kubelet node-pressure eviction instead, which is a different mechanism.
This failure is easy to miss in application logs. The victim is deleted with its graceful termination period: the container receives SIGTERM and, once terminationGracePeriodSeconds (30 seconds by default) has elapsed, SIGKILL. A training process that does not handle SIGTERM simply stops. The loss of progress depends on the checkpointing interval. If the job checkpoints every 10 minutes, it loses up to 10 minutes of work. If it checkpoints every hour, it loses up to an hour. The eviction itself often takes 30 to 60 seconds, depending on the grace period and the container runtime.
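Kubernetes deletes preemption victims with their normal graceful termination period, so one possible mitigation is to lengthen that period and have the trainer write a checkpoint when it receives SIGTERM. The sketch below assumes the training code already installs such a handler (not shown) and that 120 seconds is long enough to flush one checkpoint; both are assumptions, not something this section establishes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job-1
spec:
  priorityClassName: low-priority-training
  # Assumed value: time allowed between SIGTERM and SIGKILL so the
  # trainer can write a final checkpoint.
  terminationGracePeriodSeconds: 120
  containers:
  - name: trainer
    image: pytorch:2.1
    resources:
      limits:
        nvidia.com/gpu: 4
```

The cost is on the other side of the trade: the preempting inference pod cannot start until the victim has exited, so a longer grace period means a longer wait for the GPUs.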
Another failure mode is inverted priorities. If all inference pods are assigned a lower priority than training pods, the inference pods cannot preempt anything and may never schedule on a full cluster. The kube-scheduler will leave them in Pending until capacity frees up on its own. This happens when the value field on the PriorityClass is configured incorrectly. The cluster appears healthy, but the service level agreement (SLA) for inference latency is violated.
Resource fragmentation is a third failure mode. If the scheduler evicts a 4-GPU pod to fit a 1-GPU pod, the remaining 3 GPUs on that node are usable only by pods that request 3 GPUs or fewer. This reduces the effective capacity of the cluster. The kube-scheduler does not optimize for fragmentation during preemption; it optimizes for scheduling the new pod. This can lead to a state where the cluster has free GPUs but no pending pod can be scheduled, because the free GPUs are spread across nodes and no single node has enough for the remaining requests.
Decision frame
The choice between PreemptLowerPriority and Never is not about availability; it is about the cost of interruption. If the training job checkpoints every 5 minutes, the cost of eviction is low. If the job does not checkpoint, the cost is the entire runtime. The question the next time a GPU pod stays Pending is not ‘is the autoscaler broken.’ It is ‘did the priority class for the inference pod allow it to evict the training pod.’ The preemptionPolicy on the PriorityClass is the control for this tradeoff, and it belongs on the class of the pod that would do the evicting. Set it to Never on the inference class to protect critical training jobs, and accept that high-priority inference will queue. Leave it at PreemptLowerPriority for inference and accept that training jobs may be interrupted. The tradeoff is fixed: consistency costs time, and speed costs reliability.