DDP replicates the full model on each GPU while FSDP shards parameters across GPUs; the switch costs ~15-25% throughput but enables models that exceed single-GPU memory.

The torch.distributed package provides two primary parallelism strategies for training large models. Distributed Data Parallel (DDP) maintains a complete copy of the model on each GPU and synchronizes gradients after each backward pass. Fully Sharded Data Parallel (FSDP), introduced in PyTorch 2.0, shards parameters, gradients, and optimizer states across GPUs, requiring all-gather operations before each forward pass.

Both strategies use the same torchrun launcher and the torch.distributed backend. The difference is in how they allocate memory and communicate during training. DDP works for models that fit on a single GPU; FSDP is required when the model state exceeds available memory. The decision to switch is not about model size alone — it is about the total memory footprint including optimizer states and activations.

The memory math for a 13B-parameter model

Consider a 13-billion-parameter model trained on 8×H100 80GB GPUs. The memory requirements differ dramatically between DDP and FSDP.

ComponentDDP (per GPU)FSDP (per GPU, 8 GPUs)
Model params (FP16)26 GB3.25 GB
Optimizer states (Adam FP32)104 GB13 GB
Gradients (FP16)26 GB3.25 GB
Activations (seq len 2048)~15 GB~15 GB
Total~171 GB~34.5 GB

The math is straightforward. A 13B parameter model in FP16 requires 13B × 2 bytes = 26GB for weights alone. Adam optimizer maintains two FP32 buffers per parameter, adding 13B × 8 bytes = 104GB. Under DDP, each GPU holds all three components, totaling 171GB — far exceeding the 80GB available on an H100.

FSDP shards the model parameters, gradients, and optimizer states across 8 GPUs. The per-GPU footprint becomes 26GB/8 + 104GB/8 + 26GB/8 = 19.5GB for static components, plus activations which remain largely unsharded. With activation checkpointing enabled, activation memory can be reduced by ~60%, bringing the total to approximately 34.5GB per GPU.

This calculation assumes FP16 for model parameters and gradients, FP32 for optimizer states. Mixed precision training with torch.cuda.amp is standard practice. The numbers shift if using BF16 instead of FP16, but the ratio between DDP and FSDP remains the same.

The communication pattern and throughput cost

DDP uses all-reduce to synchronize gradients after each backward pass. This is a single collective operation that scales well with the number of GPUs. FSDP uses all-gather before each forward pass to reconstruct the full model, then all-scatter after the backward pass to shard the gradients again.

The additional all-gather operation introduces latency. Each forward pass requires gathering parameters from all GPUs, performing the computation, then discarding the gathered parameters. This pattern repeats for every layer in the model. The FullyShardedDataParallel wrapper in PyTorch handles this automatically, but the communication overhead is measurable.

Benchmarking from the PyTorch team and independent evaluations show FSDP throughput is typically 15-25% lower than DDP for models that fit under both strategies. The penalty increases with model size and decreases with network bandwidth. On H100s with NVLink, the overhead is closer to 15%. On standard InfiniBand, it can reach 25%.

The ShardingStrategy enum in FSDP controls the communication pattern. FULL_SHARD shards parameters, gradients, and optimizer states — maximum memory savings, maximum communication. SHARD_GRAD_OP shards only gradients and optimizer states, keeping parameters replicated — less communication, less memory savings. NO_SHARD is equivalent to DDP. The choice depends on whether the model fits under FULL_SHARD or if SHARD_GRAD_OP is sufficient.

from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy

model = FullyShardedDataParallel(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=False,
)

This configuration wraps the model for FSDP training. The sharding_strategy parameter is set in the Python code, not passed to torchrun. The launcher remains unchanged:

torchrun --nproc_per_node=8 train.py

Activation checkpointing is configured separately, typically through the transformers library or manual wrapping of layers:

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import checkpoint_wrapper

for layer in model.layers:
    layer = checkpoint_wrapper(layer)

Activation checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. This reduces activation memory by ~60% but increases compute time by ~10-15%. For large models, this tradeoff is often necessary to fit within GPU memory.

Failure modes when the switch is wrong

Picking the wrong strategy manifests in specific, diagnosable symptoms. DDP for a model that exceeds GPU memory results in OOM errors during the backward pass. The error message from PyTorch is clear: “CUDA out of memory” followed by the attempted allocation size. This is a hard failure — training cannot proceed without reducing batch size or switching to FSDP.

FSDP for a model that fits under DDP wastes resources and throughput. The all-gather operations add latency that is unnecessary when parameters could remain replicated. The symptom is slower training without a memory benefit. Monitoring GPU utilization shows lower compute time and higher communication time compared to a DDP baseline.

Misconfigured FSDP can also cause silent failures. If sharding_strategy is set to NO_SHARD while the model exceeds memory, the training will fail with the same OOM error as DDP. The wrapper is present but not doing sharding. This is a common configuration mistake when porting training scripts from DDP to FSDP.

The transformers library provides FSDP integration through its Trainer class, but the configuration must be explicit:

torchrun --nproc_per_node=8 train.py \
  --fsdp "full_shard auto_wrap" \
  --fsdp_config fsdp_config.yaml

Here the --fsdp flag is a transformers argument, not a torchrun argument. The actual FSDP configuration lives in fsdp_config.yaml. This distinction is critical — passing --fsdp directly to torchrun will fail.

The switch threshold

The decision to switch from DDP to FSDP is determined by the total memory footprint relative to GPU capacity. For H100 80GB GPUs, the practical threshold is around 13B parameters for DDP with FP16. Beyond that, FSDP is required.

For smaller models under 10B parameters, DDP is preferable if the batch size is large enough to saturate the GPUs. The throughput advantage of DDP (15-25%) outweighs the memory savings of FSDP when the model fits. For models between 10B and 20B, FSDP with SHARD_GRAD_OP may be sufficient — parameters remain replicated while gradients and optimizer states are sharded. This reduces communication overhead while still saving memory.

For models above 20B, FULL_SHARD is required. The 70B parameter models common in LLM training cannot fit under DDP on any reasonable GPU configuration. FSDP with FULL_SHARD and activation checkpointing is the only viable option.

The table below shows the approximate model size thresholds for H100 80GB:

Model sizeDDP feasible?FSDP strategyThroughput vs DDP
<10BYesN/ABaseline
10-13BMarginalSHARD_GRAD_OP-15%
13-20BNoFULL_SHARD-20%
>20BNoFULL_SHARD + checkpointing-25%

These thresholds assume FP16 for model parameters and gradients. Using BF16 reduces memory by ~50% for model parameters, shifting the thresholds upward. Using mixed precision with torch.cuda.amp is standard and assumed in these calculations.

Decision frame

The question when a GPU training job fails with OOM is not “should I increase batch size.” It is “what is the total memory footprint including optimizer states.” DDP requires ~8 bytes per parameter for Adam optimizer states alone; a 13B model needs 104GB per GPU before accounting for activations. If the model exceeds 13B parameters on H100 80GB, switch to FSDP with FULL_SHARD and activation checkpointing. The 20% throughput cost is the price of fitting the model — not a configuration bug to debug.