train

Distributed training, NCCL, checkpointing, fault tolerance — the layer that turns GPU-hours into weights.

train·May 13, 2026

Checkpoint storage patterns for distributed training

Checkpointing writes model state to disk, but the storage tier determines whether a node failure costs minutes or days of training time.

train·May 13, 2026

Why BF16 replaced FP16 for distributed training

BF16 matches FP32's exponent range to prevent gradient overflow during training, while FP16 remains viable for inference where values stay bounded.
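
The range argument can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard bit layouts (FP16: 5 exponent and 10 mantissa bits; BF16: 8 and 7; FP32: 8 and 23); the helper name `max_finite` is illustrative and not part of any library.

```python
def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value of an IEEE-style float with the given bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, so the largest
    # usable exponent code is 2**exp_bits - 2.
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -man_bits) * 2.0 ** max_exp


# (exponent bits, mantissa bits) for each format; assumed standard layouts.
FORMATS = {"fp16": (5, 10), "bf16": (8, 7), "fp32": (8, 23)}

if __name__ == "__main__":
    for name, (e, m) in FORMATS.items():
        print(f"{name}: max finite = {max_finite(e, m):.4e}")
```

Under these layouts a gradient value of 1e5 already exceeds the FP16 ceiling, while BF16 tops out at nearly the same magnitude as FP32, at the price of fewer mantissa bits.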

train·May 13, 2026

What gradient accumulation does to training throughput

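Gradient accumulation sums gradients over N micro-batches before each optimizer step. That cuts all-reduce frequency by a factor of N when synchronization is skipped on the intermediate micro-batches, while activation memory stays at the micro-batch level.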
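
A toy sketch of the idea in plain Python, with no framework involved. It assumes equal-sized micro-batches, a one-parameter model with mean-squared-error loss, and that gradient synchronization runs only on the last micro-batch of each group (the behavior PyTorch DDP's `no_sync()` context manager is meant to enable). All function names here are illustrative.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accum_steps):
    """Sum micro-batch gradients, scaling each by 1/accum_steps."""
    micro = len(xs) // accum_steps  # assumes len(xs) divides evenly
    total = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro, (i + 1) * micro
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


def allreduce_calls(micro_batches, accum_steps):
    """All-reduces issued if sync runs once per accumulation group."""
    return micro_batches // accum_steps


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    ys = [2 * x + 1 for x in xs]
    print(grad_mse(0.5, xs, ys), accumulated_grad(0.5, xs, ys, 4))
    print(allreduce_calls(1000, 8))
```

The accumulated gradient matches the full-batch gradient up to floating-point rounding, yet only one micro-batch of activations is live at a time.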

train·May 13, 2026

DeepSpeed ZeRO stages partition training states across GPUs

ZeRO stages 1, 2, and 3 trade communication overhead for memory savings. The stages are cumulative: ZeRO-1 partitions optimizer states, ZeRO-2 also partitions gradients, and ZeRO-3 also partitions parameters.
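
A rough back-of-the-envelope sketch of the per-GPU footprint at each stage. It assumes mixed-precision Adam with 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer states (FP32 master weights, momentum, and variance), and it ignores activations, buffers, and fragmentation. The function name is illustrative, not a DeepSpeed API.

```python
def zero_bytes_per_gpu(params: float, gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU for ZeRO stage 0 to 3."""
    weights = 2 * params      # FP16/BF16 parameters (assumed 2 bytes each)
    grads = 2 * params        # FP16/BF16 gradients (assumed 2 bytes each)
    optimizer = 12 * params   # assumed Adam states in FP32
    if stage >= 1:
        optimizer /= gpus     # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= gpus         # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= gpus       # ZeRO-3: also shard parameters
    return weights + grads + optimizer


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With these assumptions, each stage removes one more replicated term, which is why the savings grow with the GPU count while the unsharded baseline stays constant.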

train·May 13, 2026

What PyTorch Elastic actually recovers from

When a worker fails, TorchElastic re-rendezvouses the survivors and restarts training from the last checkpoint. It does not recover from an etcd outage or a corrupted checkpoint.

train·May 13, 2026

Streaming training data from object storage without network saturation

MosaicML's Streaming library and WebDataset split datasets into shards (MDS files and tar archives, respectively), allowing PyTorch DataLoaders to fetch and cache samples on demand.

train·May 13, 2026·stable since pytorch 2.0

DDP vs FSDP — when to switch and what it costs

DDP replicates the full model on each GPU; FSDP shards parameters, gradients, and optimizer states across GPUs. Switching costs roughly 15-25% of throughput but enables models that exceed single-GPU memory.

train·May 13, 2026

How the Kubeflow Training Operator's PyTorchJob actually launches a job

The PyTorchJob CRD creates a headless Service and injects environment variables into the pods; distributed training depends on DNS resolution of that Service name.