Checkpoint storage patterns for distributed training

A checkpoint write is not a single operation but a pipeline where the storage tier dictates the recovery point objective for the entire training run.

Distributed training jobs save model weights and optimizer states to disk at regular intervals. This process allows a job to resume after a node failure without restarting from epoch zero. The mechanism involves the training framework serializing tensors into files, which are then flushed to the underlying storage system. The speed and durability of that storage system determine the operational cost of a failure.

Most platform engineers treat storage as a binary choice between local disk and a network volume. This binary view ignores the latency and throughput characteristics that define the tradeoff. The weights of a 70B-parameter model in FP32 occupy roughly 280GB (70 billion parameters × 4 bytes), before any optimizer state is counted. Writing this to a local NVMe drive takes about a minute and a half. Writing the same data to a network file system takes close to ten minutes. Writing it to object storage takes more than twenty minutes. The choice of storage tier directly limits how frequently checkpoints can be taken without stalling the training loop.
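The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch that assumes decimal units (1 GB = 10^9 bytes), weights only, and the illustrative sustained throughputs used in the sections below; the function names are made up for this example.

```python
def checkpoint_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Size of the raw weights; FP32 uses 4 bytes per parameter."""
    return num_params * bytes_per_param


def write_seconds(size_bytes: int, throughput_mb_per_s: float) -> float:
    """Time to write size_bytes at a sustained throughput given in MB/s."""
    return size_bytes / (throughput_mb_per_s * 1_000_000)


size = checkpoint_bytes(70_000_000_000)  # 280_000_000_000 bytes = 280 GB

# Throughput values are the illustrative ones from this article, not measurements.
for tier, mb_per_s in [("local NVMe", 3000), ("network PVC", 500), ("object store", 200)]:
    print(f"{tier}: {write_seconds(size, mb_per_s):.0f} s")
```

Real checkpoints that include optimizer state are larger than the weights alone, so these times are a lower bound for a full training checkpoint.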

Local ephemeral storage

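Local storage offers the highest throughput but the lowest durability. In Kubernetes, this is typically implemented using emptyDir volumes on node-local disk (or with medium: Memory, which backs the volume with RAM instead of disk) or hostPath mounts to NVMe devices. The kubelet ties an emptyDir volume to the pod’s lifecycle and deletes it when the pod leaves the node; hostPath data outlives the pod but is only as durable as the node itself.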

When a training pod writes to a local volume, the data bypasses the network stack. The torch.save call writes through the local filesystem and page cache to the NVMe device. Sustained write throughput on modern NVMe SSDs can reach 3GB/s. For a 280GB checkpoint, this means a write completes in under 100 seconds. This speed allows for frequent checkpointing, minimizing the recovery point objective.

However, the data is volatile. If the node crashes, the local volume is lost with it. If the pod is evicted for resource pressure, an emptyDir volume is deleted, and hostPath data stays behind on a node the rescheduled pod may never return to. The kubelet does not replicate this data to another node. Recovery requires restarting the job from the last successful checkpoint stored on a durable backend. This creates a dependency on a secondary storage system for the actual recovery data, even if the active write is local.
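One way to handle that dependency is to write the checkpoint locally and copy it to the durable backend in the background. The following is a minimal sketch, assuming the durable backend is mounted as a directory; stage_to_durable and the .partial suffix are hypothetical names, and a production version would also fsync the copy and handle copy errors.

```python
import os
import shutil
import threading
from pathlib import Path


def stage_to_durable(local_path: Path, durable_dir: Path) -> threading.Thread:
    """Copy a finished local checkpoint to durable storage in the background."""

    def _copy() -> None:
        durable_dir.mkdir(parents=True, exist_ok=True)
        tmp = durable_dir / (local_path.name + ".partial")
        final = durable_dir / local_path.name
        shutil.copyfile(local_path, tmp)
        # Publish under the final name only after the copy has finished,
        # so a reader never picks up a half-copied file.
        os.replace(tmp, final)

    thread = threading.Thread(target=_copy, name="ckpt-stage")
    thread.start()
    return thread
```

The training loop would call this right after the local write returns and join the returned thread before starting the next checkpoint, so that at most one copy is in flight. Until the copy finishes, the recovery point is still the previous durable checkpoint.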

Network persistent volumes

Network persistent volumes provide durability across node failures but sacrifice throughput. A PersistentVolumeClaim (PVC) with ReadWriteMany access mode allows multiple pods to access the same storage backend. Common implementations include NFS, GPFS, or CSI drivers for cloud file systems like Amazon EFS.

The kubelet mounts the network volume into the pod’s filesystem. The torch.save call writes to this mount, but the data must traverse the network to reach the storage controller. Throughput is limited by the network bandwidth and the storage backend’s IOPS. Typical sustained throughput for shared network storage ranges from 200MB/s to 500MB/s under load.
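Because sustained throughput depends heavily on the backend and on current load, it is worth measuring on the actual mount instead of relying on typical figures. The probe below is a rough sketch: the function name, probe filename, and default sizes are assumptions, and a short probe can overstate what the backend sustains across hundreds of gigabytes.

```python
import os
import time
from pathlib import Path


def measure_write_mb_per_s(directory: Path, total_mb: int = 256, chunk_mb: int = 8) -> float:
    """Write about total_mb of data into directory, fsync it, and return MB/s."""
    chunk = os.urandom(chunk_mb * 1_000_000)
    probe = directory / ".throughput_probe"
    written = 0
    start = time.perf_counter()
    with open(probe, "wb") as f:
        while written < total_mb:
            f.write(chunk)
            written += chunk_mb
        f.flush()
        # Include the time needed to push the data out of the page cache.
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    probe.unlink()
    return written / elapsed
```

Running it from inside the training pod against the mounted PVC path, ideally while other jobs are active, gives a number that can be plugged into the write-time estimate.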

Writing 280GB to a 500MB/s network volume takes approximately 560 seconds, or roughly 9 minutes. A stall of this length can approach or exceed the timeout configured for the training loop’s synchronization barrier. If the checkpoint write blocks the main training thread, the GPU sits idle while waiting for the write to complete. This reduces overall training utilization. The StorageClass backing the PVC determines the performance tier, but the network path is the primary bottleneck.

Object storage offload

Object storage offers the highest durability and effectively unlimited capacity but the slowest write speed. Systems like Amazon S3 or Google Cloud Storage are accessed via HTTP APIs rather than POSIX filesystems. Training frameworks often use a sidecar container or a library to sync local checkpoints to object storage.

The write path involves serializing the file locally, then uploading it via multipart upload. This process is asynchronous in some implementations but synchronous in others. Throughput is limited by the upload bandwidth and the object storage service’s request rate limits. A typical sustained upload speed is around 200MB/s.

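Uploading 280GB to object storage takes approximately 1,400 seconds, or 23 minutes. This latency is often too high for frequent checkpointing during active training. Object storage is better suited for archiving the final model or periodic snapshots. Using it for every checkpoint introduces significant overhead. An individual S3 object write is atomic: an interrupted multipart upload never becomes a visible object, although its uploaded parts remain stored until the upload is aborted. The risk lies in checkpoints that span multiple objects. S3 has no atomic rename or multi-object commit, so an interrupted upload can leave a checkpoint prefix that contains only some of its shards.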
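A common guard against an interrupted multi-file upload is a commit marker: upload every shard first, then write a manifest last, and treat any prefix without a valid manifest as incomplete. The sketch below illustrates the idea under stated assumptions. A plain mapping from key to bytes stands in for the object store client, and the MANIFEST.json name and both function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, MutableMapping


def upload_checkpoint(store: MutableMapping[str, bytes], prefix: str,
                      shards: Dict[str, bytes]) -> None:
    """Upload all shards, then write the manifest as the commit marker."""
    manifest = {}
    for name, data in shards.items():
        store[f"{prefix}/{name}"] = data
        manifest[name] = {"size": len(data),
                          "sha256": hashlib.sha256(data).hexdigest()}
    # Written last: a prefix without a manifest is treated as incomplete.
    store[f"{prefix}/MANIFEST.json"] = json.dumps(manifest).encode()


def is_complete(store: MutableMapping[str, bytes], prefix: str) -> bool:
    """Return True only if the manifest exists and every shard matches it."""
    raw = store.get(f"{prefix}/MANIFEST.json")
    if raw is None:
        return False
    for name, meta in json.loads(raw).items():
        data = store.get(f"{prefix}/{name}")
        if data is None or len(data) != meta["size"]:
            return False
        if hashlib.sha256(data).hexdigest() != meta["sha256"]:
            return False
    return True
```

On resume, the job would walk checkpoint prefixes from newest to oldest and load the first one for which is_complete returns True. With a real object store the verification would more likely compare sizes or checksums reported by the service than download every shard.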

Storage Tier	Typical Throughput	280GB Write Time	Durability	Access Mode
Local NVMe	3,000 MB/s	~93 seconds	Node-bound	`ReadWriteOnce`
Network PVC	500 MB/s	~560 seconds	Cluster-wide	`ReadWriteMany`
Object Store	200 MB/s	~1,400 seconds	Region-wide	API (HTTP)

Failure modes

The most common failure mode is partial writes during network interruptions. When a pod writes to a network PVC and the connection drops, the file may be truncated. The torch.save operation writes the destination file in place and does not guarantee atomicity on any filesystem, so an interrupted write leaves an incomplete file under the final name. A subsequent resume attempt may fail with a RuntimeError or EOFError because the checkpoint file is incomplete.
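A standard mitigation is to write to a temporary file in the same directory and rename it into place only after the data has been flushed. This is a sketch of that write-then-rename pattern, not a feature of any framework. The atomic_write helper is a hypothetical name, and the rename is atomic within a single POSIX filesystem, a guarantee that some network filesystems may weaken.

```python
import os
import tempfile
from typing import BinaryIO, Callable


def atomic_write(path: str, write_fn: Callable[[BinaryIO], None]) -> None:
    """Write via a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())
        # The final name only ever points at a fully written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave any previous checkpoint at `path` untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With PyTorch, the call would look like atomic_write("ckpt.pt", lambda f: torch.save(state, f)), since torch.save accepts an open binary file object. If the process dies mid-write, the previous checkpoint remains loadable and only a stray .tmp file is left behind.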

Another failure mode is I/O contention. In multi-tenant clusters, multiple training jobs may write to the same storage backend simultaneously. This saturates the storage controller’s IOPS. The kubelet_volume_stats_available_bytes metric will show available space, but it does not reflect I/O latency. High latency manifests as GPU idle time during checkpoint writes, visible as gpu_util dropping while gpu_memory_util remains high.

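Node eviction is the third risk. If a node is under memory pressure, the kubelet may evict pods. If the checkpoint is stored locally, the data is lost. If the checkpoint is stored on a network PVC, the pod may be terminated before the file syncs. The preStop hook can help, but it runs inside the pod’s termination grace period (terminationGracePeriodSeconds), and the container is killed when that period expires whether or not the network write has completed. Evictions triggered by a hard eviction threshold skip the grace period entirely.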

Decision frame

The choice between storage tiers is not about speed but about the acceptable recovery point objective relative to training throughput. If the job can tolerate checkpoint intervals measured in hours, a network PVC with a write time of roughly 9 minutes is sufficient. If the job needs intervals of ten to fifteen minutes to minimize loss, local NVMe is mandatory, because only a write of about 93 seconds leaves most of that interval for training, and it requires a strategy to sync to durable storage before node termination. The question is not which storage is fastest, but whether the storage write time fits within the training loop’s synchronization budget without starving the GPU.
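That budget can be made concrete with a small calculation. The sketch assumes the write blocks training for its full duration and uses the write times from the table above; the 5% stall budget is an arbitrary example value, and asynchronous writes would relax the result.

```python
def stall_fraction(write_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent blocked on checkpoint writes.

    interval_s is the period from the start of one checkpoint to the next.
    """
    return write_s / interval_s


def min_interval_s(write_s: float, max_stall_fraction: float) -> float:
    """Shortest checkpoint period that keeps stalls within the budget."""
    return write_s / max_stall_fraction


# Write times come from the table above; the 5% budget is an example.
for tier, write_s in [("local NVMe", 93), ("network PVC", 560), ("object store", 1400)]:
    minutes = min_interval_s(write_s, 0.05) / 60
    print(f"{tier}: checkpoint no more often than every {minutes:.0f} min")
```

Under these assumptions a 5% budget allows a checkpoint roughly every half hour on local NVMe, about every three hours on the network PVC, and about every eight hours on object storage.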