Why BF16 replaced FP16 for distributed training

A kubectl describe on a training pod will never show why gradients overflow, but the exponent bits in the chosen dtype will.

FP16 and BF16 both occupy 2 bytes per value, yet they split those 16 bits differently. FP16 uses 1 sign bit, 5 exponent bits, and 10 mantissa bits, giving it a normal range of roughly 6e-5 to 6.5e4 (subnormals extend the low end to about 6e-8 at reduced precision). BF16 uses 1 sign bit, 8 exponent bits, and 7 mantissa bits, matching FP32’s range of roughly 1e-38 to 3e38. This difference can decide whether a large training run, such as a 70B parameter model, converges or silently fails.
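The ranges quoted above follow directly from the bit layouts. The sketch below derives them, assuming the usual IEEE-style encoding with an implicit leading mantissa bit; `normal_range` and `FORMATS` are illustrative names, not part of any library.

```python
# Derive the normal (non-subnormal) range of a binary float from its bit layout.
def normal_range(exponent_bits: int, mantissa_bits: int) -> tuple[float, float]:
    bias = 2 ** (exponent_bits - 1) - 1
    smallest_normal = 2.0 ** (1 - bias)
    largest_finite = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    return smallest_normal, largest_finite

# (exponent bits, mantissa bits) for each format discussed in this article
FORMATS = {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}

for name, (e_bits, m_bits) in FORMATS.items():
    lo, hi = normal_range(e_bits, m_bits)
    print(f"{name}: {lo:.3e} to {hi:.3e}")
```

The exponent width alone sets the order of magnitude of both ends; the mantissa only nudges the upper bound slightly.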

The mechanism is not about precision alone. It is about the exponent range that gradients need. During backpropagation, gradients are multiplied through many layers and accumulated across mini-batches, so their magnitudes spread widely. With FP16’s limited range, small gradients underflow to zero while large gradients overflow to infinity. BF16’s wider exponent range largely avoids both failure modes without requiring loss scaling.

The exponent range determines training stability

FP16’s 5-bit exponent limits its normal range to approximately 9 orders of magnitude (about 12 if subnormals are counted). Training loss values can span several orders of magnitude during the first epoch before narrowing. Gradient magnitudes across layers can vary by 10 orders of magnitude in deep networks. When a gradient value exceeds 65504 (FP16’s maximum), it becomes infinity. When it falls below about 6e-5, it is stored as a subnormal with reduced precision, and below about 6e-8 it becomes zero.
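These thresholds can be observed without a GPU. The sketch below uses the Python standard library’s half-precision codec (`struct` format `"e"`) to round values to FP16. The helper `to_fp16` is an illustrative name, and mapping `OverflowError` to infinity is an assumption made here to mimic what hardware does when a value exceeds the maximum.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value via the stdlib half-precision codec."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct raises instead of saturating; hardware would produce infinity.
        return float("inf") if x > 0 else float("-inf")

print(to_fp16(65504.0))  # largest finite FP16 value survives
print(to_fp16(70000.0))  # beyond the maximum: overflows to infinity
print(to_fp16(3e-5))     # below the smallest normal: kept as a subnormal
print(to_fp16(1e-8))     # below the smallest subnormal: flushed to zero
```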

This is not theoretical. In practice, gradient underflow causes certain layers to stop learning entirely. Gradient overflow causes NaN values that propagate through the entire network. Both failure modes are silent until the loss curve shows a spike or flatlines.
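The step from overflow to NaN is ordinary IEEE 754 arithmetic. The sketch below uses plain Python floats, which are 64-bit but follow the same rules for infinity and NaN; the variable names are illustrative.

```python
import math

overflowed = float("inf")        # what an FP16 value becomes past 65504
print(overflowed - overflowed)   # inf - inf has no defined value
print(overflowed * 0.0)          # inf * 0 has no defined value either

# Once a NaN exists, every value computed from it is NaN as well.
poisoned = 0.5 + (overflowed - overflowed)
print(math.isnan(poisoned))
```

A single infinite gradient is therefore enough to contaminate every weight it touches in the next update.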

BF16’s 8-bit exponent matches FP32’s range. A gradient of 1e-30 or 1e30 remains representable. This generally removes the need for dynamic loss scaling, a technique that multiplies the loss by a scale factor before backpropagation so that small gradients stay above FP16’s underflow threshold, then divides the gradients by the same factor before the optimizer step. Loss scaling adds complexity and introduces another hyperparameter to tune.
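A worked example makes the scale-then-unscale step concrete. This is a minimal sketch using the standard library’s half-precision codec; the gradient value and the scale of 2^16 are illustrative assumptions, and `to_fp16` is a hypothetical helper.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (assumes x is within FP16's finite range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

SCALE = 2.0 ** 16      # assumed loss scale
true_grad = 1e-8       # assumed gradient, too small for FP16

unscaled = to_fp16(true_grad)        # stored directly: underflows
scaled = to_fp16(true_grad * SCALE)  # stored after scaling: survives
recovered = scaled / SCALE           # unscaled in higher precision

print(unscaled, scaled, recovered)
```

The scaled value lands inside FP16’s normal range, so dividing by the same factor in higher precision recovers the gradient to within FP16’s rounding error.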

The following table compares the three formats relevant to training:

Format	Exponent Bits	Mantissa Bits	Approx. Normal Range	Approx. Precision	Training Viability
FP32	8	23	1e-38 to 3e38	~7 decimal digits	Baseline
FP16	5	10	6e-5 to 6.5e4	~3 decimal digits	Requires loss scaling
BF16	8	7	1e-38 to 3e38	~2 decimal digits	Generally stable without scaling

Both FP16 and BF16 use 2 bytes per value. The memory savings are identical. The difference is in how the 16 bits are split: BF16 trades 3 bits of mantissa precision for 3 more exponent bits.
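The trade between range and precision can be shown with a small simulation. The sketch below emulates BF16 by keeping the top 16 bits of a value’s FP32 encoding with round-to-nearest-even; `to_bf16` and `to_fp16` are illustrative helpers, and the code assumes finite inputs within FP32’s (and, for `to_fp16`, FP16’s) range.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to BF16: keep the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # bias for round-to-nearest-even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x: float) -> float:
    """Round x to FP16 via the stdlib half-precision codec."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

fine_step = 1.0 + 2.0 ** -10
print(to_fp16(fine_step))  # FP16 keeps the 2^-10 increment
print(to_bf16(fine_step))  # BF16 rounds it away
print(to_bf16(1e-30))      # BF16 keeps magnitudes FP16 cannot hold
print(to_bf16(1e30))
```

FP16 resolves the finer increment near 1.0, while BF16 represents magnitudes that FP16 would flush to zero or overflow.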

The configuration that obsoletes loss scaling

PyTorch’s torch.cuda.amp.autocast (also exposed as torch.autocast in newer releases) handles mixed precision automatically, but it defaults to FP16 on CUDA, so BF16 must be requested explicitly with dtype=torch.bfloat16 in the training loop. The torch.cuda.amp.GradScaler class exists to implement dynamic loss scaling for FP16. It is generally unnecessary for BF16.
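A single BF16 training step might look like the sketch below. The toy model, random data, and SGD optimizer are assumptions for illustration only; the code falls back to CPU autocast when no BF16-capable CUDA device is present so that it can run anywhere PyTorch is installed.

```python
import torch

# Assumption: a toy model and random data stand in for a real training setup.
use_cuda = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 4, device=device)

# dtype is passed explicitly; CUDA autocast would otherwise default to float16.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = torch.nn.functional.mse_loss(outputs.float(), targets)

# No GradScaler: the loss is backpropagated unscaled.
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(outputs.dtype, model.weight.grad.dtype)
```

The weights and their gradients stay in FP32; only the operations inside the autocast region run in BF16. An FP16 version of the same loop would wrap the backward pass and optimizer step in a GradScaler.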

DeepSpeed’s configuration file demonstrates the difference. For FP16 with loss scaling:

{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}

For BF16:

{
  "bf16": {
    "enabled": true
  }
}

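The FP16 configuration requires tuning initial_scale_power and loss_scale_window. Setting loss_scale to 0 selects dynamic loss scaling, which starts at 2 raised to initial_scale_power. If the scale is too high, scaled gradients overflow and the step is skipped. If it is too low, small gradients still underflow. The dynamic loss scaler adjusts the scale as training proceeds, but this adds overhead and can still fail on edge cases.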

The BF16 configuration is binary: enabled or not. No scaling parameters. No tuning. The exponent range handles the variation in gradient magnitudes.

NVIDIA’s A100 and H100 GPUs both support BF16 natively. The V100 supports FP16 but not BF16. This hardware constraint determines which training clusters can use BF16 without emulation overhead.
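The hardware constraint can be folded into config generation. The sketch below picks one of the two DeepSpeed fragments shown above based on the GPU’s compute capability major version, using the 8.0 threshold for native BF16. The function name `precision_config` is hypothetical, and the FP16 values are simply the ones from the earlier example.

```python
import json

def precision_config(compute_capability_major: int) -> dict:
    """Return a DeepSpeed precision fragment matching the GPU generation."""
    if compute_capability_major >= 8:
        # A100 (8.x) and H100 (9.x): native BF16, no scaling parameters
        return {"bf16": {"enabled": True}}
    # V100 (7.0) and older: FP16 with dynamic loss scaling
    return {
        "fp16": {
            "enabled": True,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "initial_scale_power": 16,
            "hysteresis": 2,
            "min_loss_scale": 1,
        }
    }

print(json.dumps(precision_config(8), indent=2))
print(json.dumps(precision_config(7), indent=2))
```

Generating the fragment from the detected hardware keeps a single launch script usable across mixed clusters.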

Where FP16 remains the correct choice

Inference workloads have different requirements than training. During inference, the model weights are static. The forward pass does not accumulate gradients. The values flowing through the network are activations, not gradients.

Activations in a well-initialized network stay within a bounded range. A transformer layer’s output typically has a norm between 0.1 and 10. This range fits comfortably within FP16’s 6e-5 to 6e4 envelope. No loss scaling is needed because there is no gradient accumulation.
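Whether a set of profiled activation magnitudes fits inside FP16’s normal range can be checked directly. This is a minimal sketch; `fits_fp16` is a hypothetical helper and the sample values are made up for illustration.

```python
FP16_MIN_NORMAL = 2.0 ** -14  # about 6.1e-5
FP16_MAX = 65504.0

def fits_fp16(magnitudes) -> bool:
    """True if every nonzero magnitude lies within FP16's normal range."""
    nonzero = [abs(v) for v in magnitudes if v != 0.0]
    return all(FP16_MIN_NORMAL <= v <= FP16_MAX for v in nonzero)

print(fits_fp16([0.1, 2.5, 10.0]))   # bounded activations
print(fits_fp16([1e-7, 0.5, 1e5]))   # values outside the range at both ends
```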

For inference, FP16 can outperform BF16 on older hardware. The V100’s Tensor Cores are optimized for FP16. BF16 requires the A100 or later. On a cluster of V100s, FP16 inference is faster and uses the same memory as BF16.

The decision is not about which format is “better.” It is about the workload’s range requirements. Training requires gradient accumulation across millions of parameters. Inference requires a single forward pass through a bounded activation range.

python -c "import torch; print(torch.cuda.get_device_capability(0))"

This command prints the GPU’s compute capability as a (major, minor) tuple. A V100 reports (7, 0). An A100 reports (8, 0). An H100 reports (9, 0). Only compute capability 8.0 and above support native BF16.

Failure modes when the wrong dtype is chosen

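Training with FP16 on a model that has large gradient spikes can produce NaN loss values. The symptom appears in the training loop as a torch.isnan(loss).any() check returning True. The root cause is often not the loss function. It is exponent overflow: a value exceeds FP16’s maximum, becomes infinity, and turns into NaN in later arithmetic.

import torch

# Stand-in for the loss tensor returned by a training step (hypothetical value)
loss = torch.tensor(float("nan"))

# A NaN loss is the downstream symptom of an overflow earlier in the step
if torch.isnan(loss).any():
    print("NaN loss detected, possible FP16 overflow")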

When overflow is the cause, the fix is not to adjust the learning rate. It is to switch to BF16 or implement loss scaling. Loss scaling is a workaround. BF16 is a structural fix.

Training with BF16 on hardware that does not support it natively (such as the V100) will either fail or fall back to emulation. The emulation is slow. The performance benefit disappears. The DeepSpeed configuration may not catch this at startup. The job may only fail when the first BF16 operation executes.

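The torch.cuda.is_bf16_supported() function exists to check hardware support before training begins. In recent PyTorch releases it can also return True when BF16 is only emulated, so pairing it with a compute capability check is the more conservative test.

import torch
# Assumes a CUDA device is present; compute capability 8.0+ means native BF16
if torch.cuda.get_device_capability(0)[0] < 8 or not torch.cuda.is_bf16_supported():
    raise RuntimeError("BF16 not natively supported on this GPU")

This check prevents a silent fallback to emulation that would waste compute hours.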

Gradient underflow is harder to detect. Small gradients become zero silently. The model stops learning in certain layers. The loss curve looks normal. The validation accuracy stagnates. From the metrics alone, this failure mode is hard to distinguish from a bad learning rate or insufficient training data.
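One way to make underflow visible is to measure how many nonzero gradient values would round to zero in FP16. The sketch below does this with the standard library’s half-precision codec; `flushed_to_zero_fraction` is a hypothetical helper, the sample values are made up, and the code assumes all magnitudes are below FP16’s maximum.

```python
import struct

def flushed_to_zero_fraction(grads) -> float:
    """Fraction of nonzero values that round to zero when stored as FP16."""
    nonzero = [g for g in grads if g != 0.0]
    if not nonzero:
        return 0.0
    flushed = sum(
        1 for g in nonzero
        if struct.unpack("<e", struct.pack("<e", g))[0] == 0.0
    )
    return flushed / len(nonzero)

# Assumed per-parameter gradient values from one layer, recorded in FP32
sample = [1e-3, 5e-6, 2e-8, 1e-9, 4e-10]
print(flushed_to_zero_fraction(sample))
```

Logged per layer from FP32 gradients, a rising fraction points at underflow rather than at the learning rate or the data.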

Decision frame

The next time a training job produces NaN gradients, the first question is not “is the learning rate too high.” It is “does the GPU support BF16 and is the DeepSpeed config using it.” FP16 requires loss scaling to keep small gradients from underflowing, and the scale itself must back off whenever it pushes gradients into overflow. BF16 needs neither. The tradeoff is hardware compatibility. A V100 cluster must use FP16 with careful loss scaling or accept FP32’s memory cost. An A100 cluster should use BF16 and remove the scaling parameters entirely. The decision is binary: check the compute capability, then choose the format that matches the hardware’s native precision.