A PyTorch worker pod failure triggers a re-rendezvous sequence that re-assigns ranks to survivors, but the system cannot recover if the etcd backend storing rendezvous state becomes unavailable.
TorchElastic is the fault-tolerance layer in torch.distributed.elastic. Its purpose is to maintain a distributed training job when individual worker processes fail. The mechanism relies on torchrun as the launcher, which manages the lifecycle of worker processes and coordinates re-joining after failures. This is distinct from the Kubernetes Job controller, which restarts the pod itself.
The system has two layers of restart logic. Kubernetes restarts the container when a pod dies. torchrun restarts the training process when a worker process fails or when a new pod joins the group. The two layers interact through a rendezvous backend, typically etcd, which stores the current group membership and rank assignments. Which layer fails determines the recovery path.
The torchrun architecture inside worker pods
The torchrun launcher does not run as a separate controller pod. It runs as the entrypoint process inside each worker pod. When a pod starts, the container’s command invokes torchrun with arguments specifying the number of nodes, the number of processes per node, and the rendezvous backend.
containers:
- name: worker
  image: pytorch/pytorch:2.3.0-cuda12.1
  command:
  - torchrun
  - --nnodes=4
  - --nproc_per_node=8
  - --rdzv_backend=etcd
  - --rdzv_endpoint=etcd-service:2379
  - --max-restarts=3
  - /app/train.py
The --max-restarts flag sets the ceiling for how many times torchrun will attempt to re-rendezvous after a failure. This is a per-job limit, not a per-pod limit. When the count is exceeded, the entire job terminates.
Each worker pod runs torchrun as PID 1. The launcher spawns the actual training process (e.g., the train.py script). If the training process crashes, torchrun detects the exit and attempts to re-join the rendezvous group. The launcher uses the c10d.Store API to communicate with the etcd backend. The store implementation is abstracted: c10d provides the interface, etcd provides the persistence layer.
kubectl logs worker-pod-0 -c worker | grep "Rendezvous"
This command shows the re-join sequence when a pod restarts. The log will display the rank assignment and the current group size. If the rank changes, the training process must reload its checkpoint to match the new topology.
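Because the rank can change between restarts, the training script is safer reading its rank from the environment on every start than caching it anywhere. A minimal sketch, assuming the standard RANK, LOCAL_RANK, and WORLD_SIZE variables that torchrun exports to each worker; the ElasticEnv class and read_elastic_env helper are hypothetical names used only for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ElasticEnv:
    """Rank layout for the current rendezvous round (hypothetical helper type)."""
    rank: int
    local_rank: int
    world_size: int


def read_elastic_env(env: Mapping[str, str] = os.environ) -> ElasticEnv:
    """Read the rank layout that torchrun exports to each worker process."""
    return ElasticEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )
```

Calling read_elastic_env() at the top of train.py, rather than hardcoding a rank per pod, means a restarted worker picks up whatever rank the new rendezvous round assigned.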
The re-rendezvous sequence
When a worker pod dies, Kubernetes restarts the container based on the pod’s restart policy. The torchrun process re-executes and contacts the etcd backend. The etcd store returns the current group membership, which may now exclude the failed node.
The re-join sequence follows these steps:
- The restarted pod’s torchrun process calls c10d.Store.set() to announce its presence.
- The etcd backend updates the group membership atomically.
- All surviving workers detect the membership change through the c10d.Store watch API.
- torchrun re-assigns ranks to the new group size.
- The training process reloads its checkpoint and resumes from the last valid step.
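The effect of the rank re-assignment step can be illustrated with a toy model. This is not TorchElastic's actual algorithm: the sorted-by-name ordering and the assign_ranks helper are assumptions made only to show that survivors can end up with different ranks after a member leaves.

```python
from typing import Dict, Iterable


def assign_ranks(members: Iterable[str]) -> Dict[str, int]:
    """Toy model: give each live member a dense rank in sorted-name order."""
    return {name: rank for rank, name in enumerate(sorted(members))}


before = assign_ranks(["worker-0", "worker-1", "worker-2", "worker-3"])
# worker-1 drops out of the group; the survivors are re-ranked densely.
after = assign_ranks(["worker-0", "worker-2", "worker-3"])
# Survivors whose rank differs from the one they held before the failure.
changed = {name for name in after if after[name] != before[name]}
```

In this model worker-2 and worker-3 both move down one rank, which is why any state keyed by rank has to be re-derived after each rendezvous round.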
The rank reassignment is the critical failure point. If a worker with rank 0 fails, a new worker takes rank 0. The checkpoint must be compatible with this new topology. Checkpoints saved with torch.distributed.checkpoint handle this automatically, but custom checkpointing logic may break if it assumes fixed ranks.
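One way custom logic can avoid assuming fixed ranks is to derive each worker's share of the data from the current rank and world size at load time, instead of storing a per-rank assignment in the checkpoint. A minimal sketch, assuming a simple strided split; shard_for_rank is a hypothetical helper, not part of torch.distributed.checkpoint:

```python
from typing import List


def shard_for_rank(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Return the sample indices owned by `rank` using a strided split."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return list(range(rank, num_samples, world_size))
```

Because the split is a pure function of rank and world size, a group that shrinks from 4 workers to 3 still covers every sample exactly once after the ranks are re-assigned.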
The following table shows the failure modes and their recovery paths.
| Failure Type | Kubernetes Action | TorchElastic Action | Recovery Possible |
|---|---|---|---|
| Worker process crash | Pod restarts | torchrun re-rendezvouses | Yes |
| Pod evicted (OOM) | Pod restarts | torchrun re-rendezvouses | Yes |
| Node failure | Pod rescheduled | torchrun re-rendezvouses | Yes |
| etcd outage | No action | torchrun blocks on rendezvous | No |
| Checkpoint corruption | No action | Training process fails | No |
| --max-restarts exceeded | No action | Job terminates | No |
The etcd outage is the hard failure mode. The c10d.Store blocks indefinitely waiting for the backend to respond. No worker can proceed because the group membership cannot be updated. The Kubernetes Job controller does not detect this because the pod is still running.
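Since Kubernetes does not surface this condition, an explicit probe with a short timeout can. The sketch below assumes the etcd client port serves etcd's HTTP /health endpoint over plain HTTP (a TLS-enabled cluster would need https and certificates); etcd_is_healthy and its opener parameter are hypothetical names introduced for illustration and testing.

```python
import json
import urllib.error
import urllib.request
from typing import Callable


def etcd_is_healthy(
    endpoint: str,
    timeout: float = 2.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True only if etcd's /health endpoint answers and reports healthy."""
    url = f"http://{endpoint}/health"  # assumption: plain HTTP, no client certs
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable, timed out, or returned something that is not JSON.
        return False
    if not isinstance(payload, dict):
        return False
    # etcd reports the flag as the string "true"; accept a boolean as well.
    return payload.get("health") in ("true", True)
```

For the manifest above the call would be etcd_is_healthy("etcd-service:2379"). The bounded timeout is the point of the design: the probe fails fast where the rendezvous call would keep waiting.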
Failure modes and their symptoms
Worker pod failures produce visible symptoms in both Kubernetes and application logs. When a worker container is killed for exceeding its memory limit, its last state in the pod status shows the reason OOMKilled. The torchrun logs on surviving workers show a rank reassignment. The restarted pod’s logs show a new rank assignment and a checkpoint reload.
kubectl get pods -l app=training-job -o wide
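This command shows the Kubernetes restart count for each worker pod in the RESTARTS column. That column counts container restarts performed by Kubernetes; worker restarts that torchrun performs inside a running container do not increment it. A pod whose count keeps climbing is failing repeatedly and is the first place to look when the job terminates after exceeding --max-restarts=3.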
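The same restart counts can be pulled out programmatically when there are many workers. A minimal sketch, assuming the pod list was fetched with the -o json output format instead of -o wide; restart_counts is a hypothetical helper that reads the restartCount field from each container status:

```python
import json
from typing import Dict


def restart_counts(pod_list_json: str) -> Dict[str, int]:
    """Map pod name to the summed restartCount of its containers."""
    counts: Dict[str, int] = {}
    for pod in json.loads(pod_list_json).get("items", []):
        # containerStatuses can be missing for pods that have not started yet.
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[pod["metadata"]["name"]] = sum(
            status.get("restartCount", 0) for status in statuses
        )
    return counts
```

Sorting the result by count puts the repeatedly failing worker at the top, which is usually the pod worth inspecting first.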
The etcd outage produces a different symptom. All worker pods remain in the Running state. The application logs show torchrun waiting at the rendezvous barrier. The c10d.Store does not time out by default, so the result is a silent hang that looks like a training stall.
kubectl logs worker-pod-0 -c worker | grep "Rendezvous barrier"
This command shows if the process is blocked at the rendezvous barrier. If the log shows the barrier being reached repeatedly without progress, the etcd backend is likely unavailable.
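A stall of this kind can also be made visible from inside the job by tracking when progress was last observed. The sketch below is one possible approach, not something torchrun provides: ProgressWatchdog is a hypothetical class, and the deadline value is an assumption that would need tuning to the job's real step time.

```python
import time
from typing import Callable


class ProgressWatchdog:
    """Flags a stall when no progress is recorded within `deadline_s` seconds."""

    def __init__(self, deadline_s: float, clock: Callable[[], float] = time.monotonic):
        self._deadline_s = deadline_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_progress = clock()

    def record_progress(self) -> None:
        """Call after each completed training step or rendezvous round."""
        self._last_progress = self._clock()

    def stalled(self) -> bool:
        """True once the time since the last recorded progress exceeds the deadline."""
        return self._clock() - self._last_progress > self._deadline_s
```

A training loop would call record_progress() after every step, and a separate thread or health endpoint would poll stalled() so that the hang shows up as a failed check instead of a pod that looks healthy.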
Checkpoint corruption is another unrecoverable failure mode. If the checkpoint file is corrupted, the training process fails on reload. The torch.distributed.checkpoint library validates the checkpoint metadata, but it cannot recover from corrupted shard data. The job terminates with an error.
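Corrupted shard data can at least be detected before the reload is attempted, so the job can fall back to an older checkpoint instead of failing. A minimal sketch, assuming a SHA-256 digest is written to a sidecar file next to each shard at save time; the helper names and the .sha256 suffix are hypothetical and independent of torch.distributed.checkpoint:

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_digest(path: Path) -> Path:
    """Record the shard's digest in a sidecar file right after saving it."""
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(file_sha256(path))
    return sidecar


def shard_is_intact(path: Path) -> bool:
    """Return True only if the shard exists and still matches its recorded digest."""
    sidecar = path.with_name(path.name + ".sha256")
    if not path.exists() or not sidecar.exists():
        return False
    return sidecar.read_text().strip() == file_sha256(path)
```

Checking every shard with shard_is_intact() before loading turns a crash during reload into a decision the job can make, such as selecting the previous checkpoint.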
The --max-restarts limit is a per-job ceiling. It counts total re-rendezvous attempts across all workers. If one worker fails and restarts 4 times, the job terminates even if the other workers are healthy. This is a design choice to prevent infinite restart loops.
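A worker can log how much of the restart budget is left, which makes an approaching termination visible before it happens. A minimal sketch, assuming the TORCHELASTIC_RESTART_COUNT and TORCHELASTIC_MAX_RESTARTS variables that torchrun exports to each worker; restarts_remaining is a hypothetical helper:

```python
import os
from typing import Mapping, Optional


def restarts_remaining(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return max restarts minus restarts so far, or None outside torchrun."""
    try:
        used = int(env["TORCHELASTIC_RESTART_COUNT"])
        limit = int(env["TORCHELASTIC_MAX_RESTARTS"])
    except (KeyError, ValueError):
        # Variables are absent or malformed, e.g. the script was run directly.
        return None
    return max(limit - used, 0)
```

Logging this value at startup gives a record in the worker logs of how close the job was to the ceiling on each restart.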
The decision frame
The question the next time a distributed training job stalls is not “did a worker fail.” It’s “is the etcd backend responding to c10d.Store calls.” Worker pod failures are recoverable through Kubernetes restarts and torchrun re-rendezvous. Etcd outages are not recoverable because the group membership cannot be updated. Check the etcd service health before investigating worker logs. The --max-restarts ceiling is a job-wide limit, not a per-pod limit, so a single failing worker can terminate the entire job if it exceeds the threshold.