The most efficient way to stream terabytes of training data from S3 is not to increase network bandwidth, but to shard the dataset into fixed-size tar files that amortize HTTP request overhead.

Training large language models requires datasets that exceed the capacity of local NVMe storage on GPU nodes. Copying a 10TB dataset to every node in a training cluster is prohibitively expensive and slow. Object storage services like S3 provide cheap capacity, but accessing individual files over HTTP introduces latency that stalls GPU training loops. The solution is not to treat the dataset as a filesystem, but as a stream of fixed-size chunks that can be prefetched and cached locally.

MosaicML’s streaming library and the WebDataset format solve this by converting datasets into tar files. These shards are downloaded on-demand by the PyTorch DataLoader and cached on the node’s local disk. This mechanism decouples the training speed from the network throughput, provided the shard size is tuned correctly. If the shards are too small, the HTTP request overhead dominates the epoch time. If the shards are too large, the local cache fills slowly, causing GPU starvation.

The shard structure

The dataset is preprocessed into a sequence of tar files, each containing a fixed number of samples. A typical shard size ranges from 100MB to 1GB. This size is chosen to balance two competing constraints: the overhead of establishing an HTTP connection and the capacity of the local cache.

When the training job starts, the StreamingDataset component on each worker node receives the list of shard URLs. It does not download the entire dataset. Instead, it calculates which shards are needed for the current epoch and initiates downloads in the background. The cache_limit argument in the library configuration dictates how much local disk space is available for these downloads.

The following table illustrates the relationship between shard size, request count, and total download time for a 1TB dataset.

Shard SizeShard CountHTTP RequestsNetwork OverheadLocal Cache Pressure
10MB100,000100,000High (TCP/SSL)Low
100MB10,00010,000ModerateModerate
1GB1,0001,000LowHigh
10GB100100Very LowVery High

For a 1TB dataset, using 100MB shards results in 10,000 HTTP requests. With a standard S3 request latency of 10ms, the total overhead is roughly 100 seconds per epoch just for connection establishment. This overhead is acceptable for a 10-hour training run. Using 10MB shards increases the overhead to 1,000 seconds, which can reduce effective training throughput by 5%.

The loading mechanism

The PyTorch DataLoader integrates with the StreamingDataset via the standard iterator protocol. The worker processes do not block on network I/O. They request the next sample from the local cache. If the sample is not present, the library triggers a download of the entire shard containing that sample.

The initialization of the StreamingDataset in a PyTorch training script looks like this:

from streaming import StreamingDataset
from torch.utils.data import DataLoader

dataset = StreamingDataset(
    remote="s3://my-bucket/dataset-shards",
    local="/local/cache",
    cache_limit="100GB",
    download=True,
)

dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,
    persistent_workers=True
)

The remote argument points to the object storage bucket. The local argument specifies the path on the training node’s filesystem where shards are cached. The cache_limit ensures the node does not fill its disk, evicting older shards if necessary. The download=True flag enables automatic fetching of shards as they are accessed.

The num_workers setting in the DataLoader is critical for this mechanism. Each worker process maintains its own download queue. If num_workers is set too low, the network bandwidth is underutilized. If set too high, the node saturates its network interface with concurrent requests, causing TCP retransmissions. For a single GPU node with a 10Gbps network, 4 to 8 worker processes is the typical sweet spot.

The sizing tradeoff

The shard size determines the granularity of the caching. A 1GB shard contains roughly 100,000 samples for a standard text dataset. Downloading this shard takes approximately 1 second on a 10Gbps link. Once downloaded, all 100,000 samples can be served from local disk at microsecond latency.

If the shard size is 100MB, the download takes 0.1 seconds. However, the cache fills 10 times faster, requiring more frequent evictions. If the cache limit is smaller than the dataset size, the node must re-download shards that were evicted. This re-download penalty is the primary performance risk.

The cache_limit must be large enough to hold at least 2 to 3 epochs of data if the dataset fits, or enough to hold the working set of the current epoch. For a 1TB dataset, a 100GB cache limit means the node will re-download 90% of the data every epoch. This is acceptable if the network is fast, but it increases the total training time.

Failure modes

The most common failure mode is GPU idle time caused by data starvation. This occurs when the local cache is exhausted and the network cannot keep up with the training speed. The torch.utils.data.DataLoader workers block on the StreamingDataset iterator, waiting for shards to download.

Symptoms include GPU utilization dropping to near zero while the top command shows high network activity. The dmesg logs may show TCP retransmissions if the network interface is saturated. This often happens when the shard size is too small, forcing the workers to open too many concurrent connections.

Another failure mode is disk space exhaustion. If the cache_limit is not set correctly, the StreamingDataset may fill the node’s root partition. This causes the training job to crash with an OSError. The cache_limit must be strictly enforced by the library, but the underlying filesystem must also have enough free space to handle temporary download buffers.

Network saturation is a third failure mode. If the training cluster has 100 nodes, each downloading 10 shards per epoch, the aggregate traffic can exceed the cluster’s ingress bandwidth. This is managed by configuring the object storage rate limits or by staggering the shard downloads across workers.

Decision frame

The question the next time a training job stalls on data loading is not “is the network fast enough.” It is “is the shard size large enough to amortize HTTP overhead but small enough to fit in the local cache.” The shard size is the lever that controls the tradeoff between network request latency and local cache eviction. If the GPU utilization is low and the network is saturated, increase the shard size. If the disk fills up or shards are re-downloaded every epoch, increase the cache_limit.