train S

Distributed training, NCCL, checkpointing, fault tolerance — the layer that turns GPU-hours into weights.

train·
Checkpoint storage patterns for distributed training
Checkpointing writes model state to disk, but the storage tier determines whether a node failure costs minutes or days of training time.
train·
Why BF16 replaced FP16 for distributed training
BF16 matches FP32's exponent range to prevent gradient overflow during training, while FP16 remains viable for inference where values stay bounded.
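A rough sketch of the range difference, using only the standard library. The BF16 conversion here is an emulation by truncating the FP32 bit pattern (real hardware usually rounds to nearest), and the sample values are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

large = 70000.0   # hypothetical gradient above FP16's largest finite value (65504)
tiny = 1e-8       # hypothetical gradient below FP16's smallest subnormal

print(to_fp16(large), to_bf16(large))  # FP16 overflows, BF16 stays finite
print(to_fp16(tiny), to_bf16(tiny))    # FP16 flushes to zero, BF16 does not
```

BF16 keeps the magnitude but only about three significant decimal digits, which is the precision cost paid for the wider exponent.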
train·
What gradient accumulation does to training throughput
Gradient accumulation sums gradients over N micro-batches before each optimizer step, which raises the effective batch size N-fold and can cut all-reduce frequency, while activation memory stays at the micro-batch level rather than shrinking.
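A minimal sketch of the arithmetic, assuming equal-sized micro-batches and a toy 1-D least-squares model (the model, data, and names are illustrative only): averaging the micro-batch gradients reproduces the full-batch gradient, while only one micro-batch is processed at a time.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
micro_batch = 2
n_micro = len(xs) // micro_batch  # N = 4 micro-batches per optimizer step

# Accumulate: each micro-batch gradient is scaled by 1/N so the sum is a mean.
accumulated = 0.0
for i in range(n_micro):
    lo, hi = i * micro_batch, (i + 1) * micro_batch
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro

full_batch = grad_mse(w, xs, ys)
print(accumulated, full_batch)
```

In a data-parallel run, the communication saving only appears if gradient synchronization is deferred to the last micro-batch of each step.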
train·
DeepSpeed ZeRO stages partition training states across GPUs
ZeRO-1, 2, and 3 buy memory savings at the cost of communication overhead by cumulatively partitioning optimizer states, then gradients, then parameters.
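A back-of-the-envelope sketch of per-GPU model-state memory under each stage. It assumes mixed-precision Adam with the usual accounting of 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state, and it ignores activations, buffers, and fragmentation. The function name and the example sizes are illustrative.

```python
def zero_bytes_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Model-state bytes per GPU for ZeRO stage 0 (plain data parallel) through 3."""
    params = 2 * n_params   # FP16 weights
    grads = 2 * n_params    # FP16 gradients
    optim = 12 * n_params   # FP32 master weights, momentum, variance
    if stage >= 1:
        optim /= n_gpus     # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus     # stage 2: also shard gradients
    if stage >= 3:
        params /= n_gpus    # stage 3: also shard parameters
    return params + grads + optim

# Hypothetical example: 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```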
train·
What PyTorch Elastic actually recovers from
TorchElastic re-rendezvouses the surviving workers when one fails and restarts the job from the last checkpoint. It does not recover from an etcd outage or a corrupted checkpoint.
train·
Streaming training data from object storage without network saturation
MosaicML's Streaming library and WebDataset split datasets into shards (MDS files and tar archives, respectively), allowing PyTorch DataLoaders to fetch and cache samples on demand.
train · stable since PyTorch 2.0
DDP vs FSDP — when to switch and what it costs
DDP replicates the full model on each GPU; FSDP shards parameters, gradients, and optimizer states across GPUs. Switching costs roughly 15-25% of throughput but enables models that exceed single-GPU memory.
train·
How the Kubeflow Training Operator's PyTorchJob actually launches a job
The PyTorchJob CRD creates a headless Service and injects environment variables into the pods; distributed training depends on DNS resolution of that Service name.