A distributed training job can lose 40-60% of its all-reduce throughput when its NCCL ranks land behind different network switches, and the Kubernetes scheduler will not prevent this without explicit topology constraints.
NCCL (NVIDIA Collective Communications Library) is the communication layer for multi-GPU training. It handles all-reduce, broadcast, and gather operations across ranks. The performance of these operations depends heavily on the underlying network topology. Two GPUs in the same NVLink domain achieve 600GB/s bidirectional bandwidth. Two GPUs in different racks on a 100GbE interconnect get at most 12.5GB/s per link (100 gigabits per second is 12.5 gigabytes per second), so reaching 25-40GB/s takes several NICs per node. The gap is more than an order of magnitude, and it hurts most in communication-heavy workloads such as large models trained with model parallelism.
The Kubernetes scheduler places pods based on resource availability by default. It does not understand NCCL topology. It does not know which nodes share a top-of-rack switch, which nodes are in the same NVLink domain, or which nodes have RDMA-capable InfiniBand. Without explicit constraints, the scheduler may place an 8-rank training job across 3 different racks, degrading all-reduce bandwidth by 40-60% as measured by the nccl-tests benchmark suite.
Two mechanisms exist to express topology requirements: PodAffinity for co-location and TopologySpreadConstraints for spreading. They are not interchangeable. PodAffinity pulls pods into the same topology domain. TopologySpreadConstraints distributes pods evenly across topology domains. Both use the same topologyKey field, which must match a label that exists on the nodes.
How PodAffinity and TopologySpreadConstraints work
PodAffinity and TopologySpreadConstraints are both fields in the Pod spec, and the scheduler evaluates both during the same scheduling cycle, but they pull in opposite directions. PodAffinity attracts a pod toward domains that already hold matching pods. TopologySpreadConstraints limits how unevenly matching pods may pile up in any one domain.
Each PodAffinity term pairs a labelSelector with a topologyKey. The scheduler finds running pods that match the labelSelector, reads the topologyKey label value from the nodes they run on, and places the new pod only on a node whose label has the same value. For example, to place all 8 training ranks in the same rack, the term would use a topology key like rack-id, which has the same value on every node in that rack.
TopologySpreadConstraints sets topologyKey on each constraint. The scheduler counts how many pods matching the labelSelector exist in each topology domain. It then allows a new pod only in domains where the resulting count would exceed the count in the least-populated domain by at most maxSkew. This is for spreading, not co-location.
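The maxSkew rule is easier to see with numbers. The following is a minimal sketch of the hard (DoNotSchedule) check only, not the scheduler's actual implementation: the function name `allowed_domains` and the rack counts are hypothetical, and it ignores node eligibility, minDomains, and every other scheduler filter.

```python
def allowed_domains(counts: dict[str, int], max_skew: int) -> list[str]:
    """Return the domains where one more matching pod keeps skew within max_skew.

    counts maps each topology domain (for example a rack-id value) to the
    number of matching pods already running there.
    """
    global_min = min(counts.values())
    return sorted(
        domain
        for domain, running in counts.items()
        # Skew after placement = pods in this domain - pods in the emptiest domain.
        if (running + 1) - global_min <= max_skew
    )


# Hypothetical cluster: 7 ranks already share rack-1, rack-2 is empty.
counts = {"rack-1": 7, "rack-2": 0}
print(allowed_domains(counts, max_skew=1))  # ['rack-2']
```

With maxSkew: 1 the eighth rank is pushed out of rack-1, which is the opposite of what an NCCL job wants.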
The following table shows the two mechanisms, their purpose, and the valid field names in the Pod spec:
| Mechanism | Field Path | Purpose | WhenUnsatisfiable Values |
|---|---|---|---|
| PodAffinity | spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution | Co-locate pods | N/A (hard requirement) |
| TopologySpread | spec.topologySpreadConstraints | Spread pods | DoNotSchedule or ScheduleAnyway |
The topologyKey in both cases must be a label that exists on the nodes. Common values include topology.kubernetes.io/zone and topology.kubernetes.io/region (standard Kubernetes labels), or custom labels like rack-id, switch-id, or nvlink-domain. Nothing validates that the label exists on any node when the Pod is created. A node without the label can never satisfy a required PodAffinity term or a DoNotSchedule spread constraint, so hard constraints skip that node. Soft constraints only affect node scoring, so pods can still land on an unlabeled node.
A working Pod spec for an 8-rank training job using PodAffinity to co-locate all ranks in the same rack:
apiVersion: v1
kind: Pod
metadata:
  name: training-rank-0
  labels:
    app: distributed-training
    rank: "0"
spec:
  containers:
  - name: trainer
    image: pytorch-training:2.1
    resources:
      limits:
        nvidia.com/gpu: 8
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: distributed-training
        topologyKey: rack-id
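This Pod will only schedule on a node whose rack-id value matches that of a node already running a pod labeled app: distributed-training. The first pod in the job matches its own selector, so the scheduler lets it land on any node that has a rack-id label and enough free resources. Subsequent pods will only schedule to nodes in the same rack, and they stay Pending if that rack runs out of GPUs.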
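The spec above covers rank 0 only; the other ranks differ in name and rank label. As a rough sketch, the eight manifests could be generated rather than copied by hand. The helper name `rank_pod` is hypothetical, and the output relies on kubectl accepting JSON manifests wrapped in a v1 List.

```python
import json


def rank_pod(rank: int, topology_key: str = "rack-id") -> dict:
    """Build the Pod manifest for one rank, mirroring the spec above."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"training-rank-{rank}",
            "labels": {"app": "distributed-training", "rank": str(rank)},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": "pytorch-training:2.1",
                "resources": {"limits": {"nvidia.com/gpu": 8}},
            }],
            "affinity": {"podAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": {
                        "matchLabels": {"app": "distributed-training"},
                    },
                    "topologyKey": topology_key,
                }],
            }},
        },
    }


# A v1 List wraps all 8 Pods so they can be applied in one call.
manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [rank_pod(rank) for rank in range(8)],
}
print(json.dumps(manifest, indent=2))
```

Saved under a name of your choice (gen_pods.py here is just an example), the output can be piped straight into the cluster: python gen_pods.py | kubectl apply -f -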
How to express topology when labels exist
The topology key must match a label that the cluster actually exports. Kubernetes provides standard labels like topology.kubernetes.io/zone and topology.kubernetes.io/region through the cloud-controller-manager. Custom labels like rack-id or nvlink-domain must be added by the cluster operator or node labeler.
A cluster with 4 racks, each with 8 A100 GPUs, would have nodes labeled with rack-id: rack-1, rack-id: rack-2, and so on. The scheduler can use this label to make placement decisions. The label must be present on every GPU node: a hard constraint will never place a pod on an unlabeled node, and a soft constraint gives that node no preference.
The following command shows how to inspect the topology labels on a node:
kubectl get node worker-1 -o jsonpath='{.metadata.labels}'
Output:
{"kubernetes.io/hostname":"worker-1","topology.kubernetes.io/zone":"us-east-1a","rack-id":"rack-1","nvidia.com/gpu.product":"A100-80GB"}
The rack-id label is a custom label added by the cluster operator. The topology.kubernetes.io/zone label is standard. The nvidia.com/gpu.product label is added by NVIDIA GPU Feature Discovery, which is typically deployed alongside the NVIDIA device plugin.
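Inspecting one node at a time does not scale to a whole cluster. Here is a minimal sketch of a check that flags every node lacking a given topology label. It assumes the input is the parsed output of kubectl get nodes -o json; the function name `nodes_missing_label` and the sample node names are made up for illustration.

```python
def nodes_missing_label(node_list: dict, key: str) -> list[str]:
    """Return the names of nodes that do not carry the label `key`."""
    missing = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        # A node with no labels at all may omit the field entirely.
        if key not in (metadata.get("labels") or {}):
            missing.append(metadata.get("name", "<unnamed>"))
    return sorted(missing)


# Sample shaped like `kubectl get nodes -o json`; node names are hypothetical.
sample = {"items": [
    {"metadata": {"name": "worker-1", "labels": {"rack-id": "rack-1"}}},
    {"metadata": {"name": "worker-2",
                  "labels": {"kubernetes.io/hostname": "worker-2"}}},
    {"metadata": {"name": "worker-3"}},
]}
print(nodes_missing_label(sample, "rack-id"))  # ['worker-2', 'worker-3']
```

An empty result means every node exports the key, which is the precondition for using it as a topologyKey.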
For NCCL workloads, the ideal topology key is nvlink-domain if the cluster exports it. This label would be the same for all nodes that share an NVLink fabric. Most clusters do not export this label by default. The operator must add it using a DaemonSet or node-labeler.
If nvlink-domain is not available, use rack-id, and fall back to topology.kubernetes.io/zone only when no rack-level label exists. rack-id is less precise than nvlink-domain but still prevents cross-rack placement. A zone usually spans many racks, so a zone constraint rules out cross-zone traffic only and does not prevent cross-rack placement within the zone.
Failure modes when topology is wrong
When pods land in the wrong topology domain, NCCL throughput degrades. The symptom is not a pod failure. The symptom is slow training. The pod stays Running, the training loop completes, but training throughput is 40-60% lower than expected.
The nccl-tests benchmark measures this directly. A same-rack all-reduce on 8 A100s achieves 450-500GB/s. A cross-rack all-reduce on the same hardware achieves 200-250GB/s. For a communication-bound job, training throughput scales with the communication bandwidth, so a 50% drop in bandwidth approaches a 50% drop in tokens per second. The loss is smaller when compute dominates the step or overlaps with communication.
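To see how the bandwidth figures above translate into step time, the sketch below treats them as nccl-tests bus bandwidth, which is an assumption since the text does not say which metric they are. It uses the nccl-tests relation for all-reduce, busbw = algbw * 2(n-1)/n. The 10 GB gradient volume, the compute times, and the no-overlap model are illustrative assumptions, not measurements.

```python
def allreduce_seconds(size_bytes: float, ranks: int, bus_bw: float) -> float:
    """Time for one all-reduce, given bus bandwidth in bytes per second."""
    # nccl-tests defines busbw = algbw * 2 * (n - 1) / n for all-reduce,
    # where algbw = size / time.
    alg_bw = bus_bw * ranks / (2 * (ranks - 1))
    return size_bytes / alg_bw


def step_slowdown(compute_s: float, size_bytes: float, ranks: int,
                  fast_bw: float, slow_bw: float) -> float:
    """Ratio of step times (slow / fast), assuming compute and
    communication do not overlap."""
    fast = compute_s + allreduce_seconds(size_bytes, ranks, fast_bw)
    slow = compute_s + allreduce_seconds(size_bytes, ranks, slow_bw)
    return slow / fast


GB = 1e9
size = 10 * GB  # assumed gradient volume per step, not a measured value
# 475 and 225 GB/s are the midpoints of the same-rack and cross-rack ranges.
for compute_s in (0.0, 0.05, 0.5):
    ratio = step_slowdown(compute_s, size, 8, 475 * GB, 225 * GB)
    print(f"compute={compute_s:.2f}s  step time x{ratio:.2f}")
```

With zero compute time the slowdown equals the bandwidth ratio (475/225, roughly 2.1x). As compute time grows, the same bandwidth drop costs progressively less per step.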
The scheduler does not emit a warning when a soft topology constraint goes unsatisfied. With preferred affinity or whenUnsatisfiable: ScheduleAnyway, a missing or mismatched topology label only changes node scores. The pod schedules to a node in the wrong topology domain, and the training job runs slowly. The event log shows Successfully assigned with no indication of the topology mismatch.
The following command shows which node each running pod landed on and the affinity it was scheduled with:
kubectl get pods -l app=distributed-training -o jsonpath='{range .items[*]}{.metadata.name} {.spec.nodeName} {.spec.affinity}{"\n"}{end}'
This reveals which nodes the pods are on and what topology constraints were applied. Compare the node names against the output of kubectl get nodes -L rack-id. If the topologyKey is rack-id but the pods' nodes have different rack-id values, the constraint is not working.
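The comparison between pod placement and node labels can also be scripted. This is a minimal sketch, assuming the two inputs are the parsed output of kubectl get pods -l app=distributed-training -o json and kubectl get nodes -o json. The function name `pods_by_domain`, the "<unknown>" bucket, and the sample data are hypothetical.

```python
from collections import defaultdict


def pods_by_domain(pod_list: dict, node_list: dict, key: str) -> dict:
    """Group pod names by the topology label value of the node they run on."""
    node_domain = {
        node["metadata"]["name"]: (node["metadata"].get("labels") or {}).get(key)
        for node in node_list.get("items", [])
    }
    grouped = defaultdict(list)
    for pod in pod_list.get("items", []):
        node_name = pod.get("spec", {}).get("nodeName")  # absent while Pending
        domain = node_domain.get(node_name)
        bucket = domain if domain is not None else "<unknown>"
        grouped[bucket].append(pod["metadata"]["name"])
    return {domain: sorted(names) for domain, names in grouped.items()}


# Hypothetical sample: three scheduled ranks and one Pending rank.
pods = {"items": [
    {"metadata": {"name": "training-rank-0"}, "spec": {"nodeName": "worker-1"}},
    {"metadata": {"name": "training-rank-1"}, "spec": {"nodeName": "worker-2"}},
    {"metadata": {"name": "training-rank-2"}, "spec": {"nodeName": "worker-9"}},
    {"metadata": {"name": "training-rank-3"}, "spec": {}},
]}
nodes = {"items": [
    {"metadata": {"name": "worker-1", "labels": {"rack-id": "rack-1"}}},
    {"metadata": {"name": "worker-2", "labels": {"rack-id": "rack-1"}}},
    {"metadata": {"name": "worker-9", "labels": {"rack-id": "rack-3"}}},
]}
placement = pods_by_domain(pods, nodes, "rack-id")
print(placement)
```

More than one rack key in the result means the job is split across racks; an "<unknown>" entry points at a Pending pod or an unlabeled node.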
PodAffinity with requiredDuringSchedulingIgnoredDuringExecution is a hard constraint. If no node satisfies the constraint, the pod stays Pending. The event log shows a message such as 0/10 nodes are available: 10 node(s) didn't match pod affinity rules. This is the correct failure mode for hard constraints.
TopologySpreadConstraints with whenUnsatisfiable: DoNotSchedule is also a hard constraint. If the skew would exceed the limit, the pod stays Pending. The event log shows a message such as 0/10 nodes are available: 10 node(s) didn't match pod topology spread constraints, with the suffix (missing required label) for nodes that lack the topologyKey label. This is the correct failure mode for hard spreading.
TopologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway is a soft constraint. The pod schedules even if the skew is exceeded. The training job runs, but the throughput may be degraded. This is the failure mode for soft constraints.
The following table shows the three constraint types and their behavior:
| Constraint Type | whenUnsatisfiable | Behavior | Failure Mode |
|---|---|---|---|
| PodAffinity (hard) | N/A | Must co-locate or stay Pending | Pending with unmet affinity |
| TopologySpread (hard) | DoNotSchedule | Must spread or stay Pending | Pending with unsatisfied topology |
| TopologySpread (soft) | ScheduleAnyway | Spread if possible, schedule anyway | Running with degraded throughput |
Decision frame
The question the next time a training job runs slowly is not “is the scheduler broken.” It is “did the topology key match a label that actually exists on the nodes.” PodAffinity with topologyKey: rack-id only works if the nodes have rack-id labels. If the label is missing, a hard constraint leaves the pods Pending, and a soft constraint has no effect on placement, so the pods land in the wrong topology domain. The throughput delta is 40-60%. The fix is to add the label to the nodes, not to change the scheduler configuration.