The default Kubernetes scheduler places pods individually, which allows a distributed training job to start 6 of 8 replicas and leave the rest Pending indefinitely.

Distributed training workloads, such as PyTorch Distributed Data Parallel or MPI jobs, require all participating processes to start simultaneously to establish communication rings. If one worker fails to start, the entire job stalls. The native Kubernetes scheduler treats each Pod as an independent unit of work. It does not understand the relationship between replicas in a Job or StatefulSet. When resources are fragmented, the scheduler binds the first 6 pods it can fit, leaving the remaining 2 waiting for GPUs that are now occupied. This creates a deadlock where the running pods hold resources but cannot complete the task, and the pending pods cannot start because the resources are taken.

Volcano is a Kubernetes batch scheduler designed to handle high-throughput workloads. It introduces the PodGroup Custom Resource Definition to define scheduling units larger than a single Pod. A PodGroup aggregates a set of Pods that must be scheduled together. Volcano enforces a minimum threshold for the group before allowing any member to bind to a node. This mechanism shifts the scheduling logic from “fit any pod” to “fit the job.” The guarantee is binary: either all required pods start, or none of them do. This prevents the partial allocation state that causes distributed training jobs to hang.

The PodGroup CRD structure

The core mechanism in Volcano is the PodGroup resource. This object defines the minimum number of pods required for the group to be considered schedulable. Its apiVersion is scheduling.volcano.sh/v1beta1, that is, API group scheduling.volcano.sh at version v1beta1. The field spec.minMember specifies the minimum count of pods that must be placeable before the scheduler releases them from the queue.

When a user submits a distributed training Job, its pods must set spec.schedulerName to volcano and be associated with a PodGroup, typically through an annotation on the pod template. Volcano’s controller watches for new pods and automatically creates a PodGroup if one does not exist, or the user can define it explicitly. The scheduler then evaluates the group as a whole. If the number of placeable pods is less than minMember, the scheduler reports an “Unschedulable” reason for all pods in the group.
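As a minimal sketch of that association, the Pod below opts into Volcano and names the group it belongs to. The pod name, the image, and the group name training-job-pg are placeholders chosen for illustration. The annotation key scheduling.k8s.io/group-name is the one commonly used by Volcano for group membership, but it is worth checking against the installed release.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0                     # hypothetical pod name
  annotations:
    # Assumed group-membership key; verify against your Volcano version.
    scheduling.k8s.io/group-name: training-job-pg
spec:
  schedulerName: volcano              # hand the pod to Volcano, not kube-scheduler
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/ddp-trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1             # one GPU per replica
```

In practice the same annotation and schedulerName would sit in the pod template of the Job or StatefulSet, so that every replica carries them.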

The following YAML defines a PodGroup for an 8-replica training job. It specifies that at least 8 members must be found before any binding occurs.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job-pg
spec:
  minMember: 8
  minResources:
    cpu: "16"
    memory: 64Gi
    nvidia.com/gpu: 8

The minResources field aggregates the total resource request for the group. This allows the scheduler to check cluster capacity against the total job requirement rather than individual pod requirements. If the cluster has 10 GPUs but 2 are already allocated to other workloads, Volcano sees 8 available. If the job requires 8, it proceeds. If the job requires 10 and only 8 are available, the entire group waits. This check happens atomically during the scheduling cycle.
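When the workload is submitted as a Volcano Job instead of bare pods, the group does not have to be written by hand. The sketch below assumes the batch.volcano.sh/v1alpha1 Job kind, with a hypothetical job name and a placeholder image; Volcano's job controller is expected to create the matching PodGroup from minAvailable. Eight replicas at 2 CPUs, 8Gi of memory, and 1 GPU each add up to 16 CPUs, 64Gi, and 8 GPUs for the group.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-training                  # hypothetical job name
spec:
  schedulerName: volcano
  minAvailable: 8                     # gang size: all 8 workers or none
  queue: default
  tasks:
  - name: worker
    replicas: 8
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: registry.example.com/ddp-trainer:latest   # placeholder image
          resources:
            limits:                   # requests default to these limits
              cpu: "2"
              memory: 8Gi
              nvidia.com/gpu: 1
```

Keeping minAvailable equal to replicas preserves the all-or-nothing behavior described above.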

Resource locking without gang scheduling

Without Volcano, the native kube-scheduler processes pods from a queue based on priority and timestamp. It evaluates each pod against the current node state. If a node has 1 GPU available and a pod requests 1 GPU, the pod binds. It does not check if other pods in the same Job can also bind. This behavior is efficient for stateless services but catastrophic for stateful distributed systems.

Consider a cluster with 6 available GPUs. A user submits a Job with 8 replicas, each requesting 1 GPU. The native scheduler fills the 6 GPUs with 6 pods. These pods transition to Running. The remaining 2 pods enter the Pending state. They will remain Pending until a GPU frees up. However, the 6 running pods are waiting for the 2 pending pods to establish the distributed communication channel. The job is stuck in a “Running” state that cannot complete.
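The scenario can be reproduced with an ordinary batch/v1 Job such as the sketch below. The job name and image are placeholders. Because no schedulerName is set, the pods go to the default scheduler, which binds them one at a time.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-training-native           # hypothetical job name
spec:
  completions: 8
  parallelism: 8                      # all 8 replicas are created at once
  completionMode: Indexed             # each pod gets a stable index to use as its rank
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/ddp-trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1         # 8 GPUs requested in total, only 6 free
```

Nothing in this manifest tells the scheduler that the eight pods are useless unless all of them run, which is the gap the PodGroup fills.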

The symptoms of this failure mode are visible in the pod status. Running kubectl get pods shows a mix of Running and Pending statuses for the same Job. The events for the pending pods show FailedScheduling with reasons like Insufficient nvidia.com/gpu. The running pods show no errors, but the application logs indicate a timeout on the initialization barrier. This state persists until an operator manually deletes the running pods to free resources.

| State | Native Scheduler | Volcano Scheduler |
| --- | --- | --- |
| 6 of 8 pods fit | 6 Running, 2 Pending | 0 Running, 8 Pending |
| Resource usage | 6 GPUs occupied | 0 GPUs occupied |
| Job completion | Deadlocked | Queued |
| Recovery action | Manual intervention | Automatic when capacity frees |

The table illustrates the operational difference. The native scheduler maximizes immediate resource utilization but sacrifices job liveness. Volcano maximizes job liveness but may leave resources idle while waiting for the full group. This is the fundamental tradeoff of gang scheduling.

The tradeoff in cluster utilization

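The primary cost of gang scheduling is reduced cluster utilization during contention. While a large job waits for 8 GPUs, the GPUs that are already free stay idle unless something smaller is allowed to use them, and if the scheduler holds that capacity for the gang, smaller jobs that could fit on 2 GPUs cannot run. The effect resembles fragmentation: free capacity exists, but not in a quantity the waiting job can use. The cluster sits with idle capacity because the scheduler refuses to break the gang.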

This behavior is intentional. It prevents the head-of-line blocking that occurs when a large job starts partially and holds resources that queued jobs need. However, it requires operators to manage the queue carefully. If the cluster has 8 GPUs and a 10-GPU job is submitted, Volcano will hold the entire job. If a 2-GPU job is submitted simultaneously, it may also be held if the scheduler is configured to prioritize the gang.

Volcano provides a queue system to manage this. Multiple queues can be defined with different weights and resource limits, so a separate queue for small jobs can keep them running while a large gang waits. Within a single job, the minMember field is the lever that controls how strict the gang is. Setting minMember to a lower value allows partial starts but reintroduces the deadlock risk. Setting it to the full replica count enforces the all-or-nothing guarantee.
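A sketch of such a setup follows, assuming Volcano's Queue kind with its weight and capability fields. The queue names, weights, and GPU caps are illustrative values, not recommendations. The PodGroup selects its queue through spec.queue.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training-large                # hypothetical queue for gang jobs
spec:
  weight: 4                           # larger share under contention
  capability:
    nvidia.com/gpu: 8                 # upper bound for this queue
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: small-jobs                    # hypothetical queue for 1-2 GPU jobs
spec:
  weight: 1
  capability:
    nvidia.com/gpu: 2
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job-pg
spec:
  queue: training-large               # admit this gang through the large-job queue
  minMember: 8                        # full replica count: all or nothing
```

Separating the queues is one way to let small jobs use capacity the gang cannot yet fill. How much each queue actually receives depends on the scheduler plugins enabled in the cluster.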

Operators must also consider the minResources field. If the aggregated resources exceed the total capacity of the cluster, the job will never schedule. This is different from the native scheduler, which would schedule what it can. The operator must ensure the cluster has enough capacity for the largest expected job, or lower minMember to allow smaller batches and accept the risk of a partial start.

Decision frame

The next time a distributed training job hangs with some pods Running and others Pending, the question is not whether the scheduler is broken. It is whether the job was scheduled as a gang at all, and whether the cluster capacity can satisfy the job’s minMember requirement. The tradeoff is between immediate resource utilization and guaranteed job completion. If the workload is latency-sensitive and can tolerate partial execution, the native scheduler is sufficient. If the workload requires all-or-nothing consistency, Volcano’s PodGroup is the mechanism. The decision rests on whether the cost of idle GPUs is higher than the cost of a deadlocked job. The choice is not about scheduling speed; it is about whether the platform prioritizes filling nodes or finishing jobs.