The Kubeflow Training Operator’s PyTorchJob CRD does not manage a distributed process group directly — it creates a Headless Service and injects environment variables into pods, relying on DNS resolution to establish the training contract.

The Kubeflow Training Operator is a Kubernetes controller that manages the lifecycle of distributed training jobs. It defines custom resources for specific frameworks, including PyTorchJob for PyTorch distributed workloads. Its purpose is to translate a high-level training intent into the underlying Kubernetes objects: Pods, Services, and ConfigMaps. The operator does not execute the training code. It creates the environment in which the training code executes, then waits for the Pods to report success or failure.

This separation means the operator’s responsibility ends at Pod creation and environment injection. The actual distributed coordination happens inside the container, driven by the torch.distributed library. The contract between the operator and the training process is defined by a set of environment variables. If those variables are incorrect, the training process hangs or fails, regardless of the operator’s health. Understanding the injection mechanism is required to debug why a distributed job stays in a Pending or Running state indefinitely.

The mechanism of injection

The operator reconciles the PyTorchJob Custom Resource by creating a Headless Service for each replica type (Master and Worker). It then creates the Pods defined in spec.pytorchReplicaSpecs. During Pod creation, the operator injects a specific set of environment variables into the container spec. These variables provide the rendezvous information required by torch.distributed.run.

The critical variables are RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. RANK identifies the specific process within the group. WORLD_SIZE defines the total count. MASTER_ADDR points to the Kubernetes Service name, not a specific Pod IP. MASTER_PORT is the port exposed by that Service. The operator calculates RANK based on the Pod’s index (e.g., 0 for the first Master Pod, 1 for the first Worker Pod) and injects it via the env field in the Pod spec.

A standard PyTorchJob manifest defines the replica counts and the container image. The operator expands this into the actual Pods.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-example
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
              command: ["python", "-m", "torch.distributed.run", "--nnodes", "1", "--nproc_per_node", "1", "train.py"]
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
              command: ["python", "-m", "torch.distributed.run", "--nnodes", "1", "--nproc_per_node", "1", "train.py"]

The operator generates a Headless Service named <job-name>-master (or similar, depending on configuration). It injects MASTER_ADDR as the string pytorch-dist-example-master. It does not inject the Pod’s FQDN. This distinction is critical for DNS resolution. The Service acts as a load balancer that resolves to the current IP of the Master Pod.

To verify the injection, inspect the running Pod. The environment variables appear in the spec.containers.env section of the Pod description.

kubectl get pods -l training.kubeflow.org/job-name=pytorch-dist-example
kubectl describe pod pytorch-dist-example-master-0

The output reveals the injected values. MASTER_ADDR will show the Service name. RANK will show 0. WORLD_SIZE will show the sum of all replicas. This configuration is static at Pod creation time. If the Pod restarts, the RANK remains the same, but the Pod IP changes. The Service handles the IP update, but the torch.distributed initialization logic must handle the reconnection.

Failure modes

The most common failure mode occurs when the Master Pod terminates unexpectedly. In a standard PyTorchJob without elasticPolicy, the operator treats the Master Pod as a single instance. If the Master Pod OOMs and restarts, the RANK 0 process is lost. The Worker Pods, which are waiting for the Master to signal readiness via init_process_group, will block indefinitely.

The Kubernetes Service updates its endpoints list when the Pod restarts. DNS resolution for MASTER_ADDR will eventually return the new Pod IP. However, torch.distributed does not automatically retry connection failures during initialization. If the Workers initialized before the Master was ready, or if the Master restarts after the Workers have connected, the Workers may hang waiting for a heartbeat that never arrives. This manifests as Pods in a Running state with no logs, consuming CPU but making no progress.

Another failure mode involves the MASTER_PORT. The operator exposes a port on the Service. If the container does not bind to that port, or if a NetworkPolicy blocks traffic on that port, the RANK 0 process cannot accept connections. The kubectl describe output will show MASTER_PORT as 29500 by default, but the container must listen on that port. A mismatch between the injected variable and the application’s listening port causes an immediate connection refusal.

The third failure mode is specific to the elasticPolicy configuration. When elasticPolicy is set, the operator uses a different rendezvous backend (often etcd or a stateful store) to manage the job. In this mode, the operator can restart failed pods without failing the entire job. Without elasticPolicy, a single Master failure is a terminal condition for the job. The failure is silent: the job status remains Running even though no training is occurring.

ScenarioMechanismSymptomResolution
Master OOMStatic PyTorchJobWorkers hang in RunningDelete Job to restart
Master OOMelasticPolicyWorkers pause, Master restartsWait for operator to recover
DNS TimeoutService EndpointConnectionRefusedErrorCheck Service Selector
Port MismatchEnv vs ConfigTimeoutError on initAlign container port

The table summarizes the operational differences. The static PyTorchJob relies on the assumption that the Master Pod will not die during the critical initialization phase. If it does, the operator does not intervene to reset the distributed group. The elasticPolicy configuration changes this assumption, allowing the operator to manage the group state dynamically.

Decision frame

The choice between a standard PyTorchJob and one with elasticPolicy is a tradeoff between simplicity and fault tolerance. The standard PyTorchJob mechanism is static: it injects environment variables once and assumes the Pod lifecycle matches the training lifecycle. This reduces the operational complexity of the controller but increases the risk of silent hangs if the Master Pod fails.

The question the next time a distributed training job stalls is not “is the network policy blocking traffic.” It is “did the Master Pod restart without the Workers knowing.” If the job uses a standard PyTorchJob without elasticPolicy, a Master restart is fatal to the training run. If the job requires fault tolerance, the elasticPolicy must be enabled to allow the operator to manage the rendezvous state dynamically. The tradeoff is the additional complexity of the etcd backend or stateful tracking versus the ability to recover from node failures without manual intervention.