FSDP vs DDP vs DeepSpeed: A Practical Guide With Real Benchmarks

By allocguy · February 2026 · 12 min read

Most teams spend weeks comparing A100 vs H100. They should be spending that time picking the right distributed training strategy. The choice between DDP, FSDP, and DeepSpeed ZeRO determines whether your model fits in memory at all, how much communication overhead you pay per step, and whether adding GPUs actually speeds things up.

Here's the real data: no handwaving, no "it depends" without numbers. Everything below comes from published benchmarks by PyTorch, HuggingFace, Microsoft Research, and IBM/ETH Zurich.

DDP: The Simplest Strategy (Until Your Model Outgrows It)

PyTorch DistributedDataParallel (DDP) replicates the full model on every GPU. Each GPU processes a different batch, then all-reduces gradients before the optimizer step. Simple, battle-tested, and fast when the model fits.
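To make the mechanics concrete, here is a minimal DDP training-step sketch. It assumes a torchrun launch (one process per GPU); MyModel, dataloader, and the loss are placeholders, not code from any of the benchmarks below.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = MyModel().cuda(local_rank)            # placeholder model; full replica on every GPU
    model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced during backward()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for inputs, targets in dataloader:            # placeholder loader; use a DistributedSampler in practice
        loss = torch.nn.functional.cross_entropy(model(inputs.cuda()), targets.cuda())
        loss.backward()                           # bucketed all-reduce overlaps with remaining backward compute
        optimizer.step()
        optimizer.zero_grad()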

The memory cost is straightforward. With mixed-precision training using AdamW, each parameter costs 16 bytes per GPU: 2 bytes for the FP16 parameter, 2 bytes for the FP16 gradient, and 12 bytes for the FP32 optimizer states (master copy + momentum + variance). That means a 1B parameter model requires roughly 16 GB per GPU before activations and framework overhead.

GPU Memory            | Max Model Size (DDP) | Notes
24 GB (RTX 3090/4090) | ~1B params           | 16 GB of model state + activations + overhead
40 GB (A100-40GB)     | ~1.5–2B params       | Tight at 2B with batch_size=1
80 GB (A100/H100)     | ~3.5B params         | Leaves room for small batches

Estimates assume mixed-precision AdamW (16 bytes/param) plus activation memory overhead. Exact limits depend on sequence length and batch size.
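The 16-bytes-per-parameter rule makes these estimates easy to script. A quick sketch of the arithmetic (model state only; activations and framework overhead are extra):

    # Back-of-the-envelope DDP memory estimate: fp16 param + fp16 grad + fp32
    # master copy, momentum, and variance = 16 bytes per parameter.
    def ddp_model_state_gb(n_params: float) -> float:
        bytes_per_param = 2 + 2 + 12
        return n_params * bytes_per_param / 1e9

    print(ddp_model_state_gb(1e9))    # ~16 GB for a 1B-parameter model
    print(ddp_model_state_gb(3.5e9))  # ~56 GB, near the 80 GB ceiling once activations are added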

DDP's strength is scalability. With gradient bucketing (overlapping communication with backward pass computation), DDP achieves near-linear scaling across GPUs. The PyTorch team demonstrated this rigorously in their VLDB 2020 paper (Li et al.), showing that gradient bucketing and computation/communication overlap are what make DDP practical at scale.

But DDP has a hard ceiling: if the model doesn't fit on a single GPU, DDP cannot train it. No amount of GPUs helps because every GPU needs a full copy. Once you cross that threshold, you need sharding.

FSDP: Sharding Inside PyTorch

Fully Sharded Data Parallel (FSDP) was added in PyTorch 1.11 (March 2022). It implements the ZeRO-3 algorithm natively in PyTorch: model parameters, gradients, and optimizer states are all sharded across GPUs. Each GPU only holds a slice. When a layer needs the full parameters for forward or backward, FSDP all-gathers them on the fly, computes, then discards them.
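In the simplest case, wrapping a model for full sharding is a small change from DDP. A minimal sketch using the FSDP1 API, assuming the same torchrun launch as the DDP example (MyModel is a placeholder; real configs usually add a wrapping policy, covered below):

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

    model = FSDP(
        MyModel(),                                      # placeholder model
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params + grads + optimizer state (ZeRO-3 equivalent)
        device_id=torch.cuda.current_device(),          # FSDP places this rank's shards on this GPU
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # construct AFTER wrapping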

The result is dramatic. Published benchmarks show FSDP achieves 4–6x peak memory reduction compared to DDP (arXiv:2505.12832). That means models that OOM on DDP can train comfortably with FSDP.

FSDP BENCHMARK: GPT-2 1.5B ON 2x TITAN RTX (24 GB)

Source: HuggingFace FSDP blog. Same model, same hardware. DDP cannot even start training.

DDP                | OOM at batch_size=1. Model too large for single GPU.
FSDP (ZeRO-3)      | Trains at batch_size=5 per GPU.
FSDP + CPU offload | Trains at batch_size=14 per GPU.

At larger scale, FSDP holds up well. A PyTorch/AWS benchmark trained T5-3B on 8x A100-40GB and achieved 2.3 TFLOPS with 95% GPU memory utilization at batch_size=14. Transformer-level wrapping (wrapping each transformer block individually instead of the whole model) delivered over 2x throughput improvement. Activation checkpointing added another 10x improvement. BF16 mixed precision gave roughly 5x over FP32, and full sharding (ZeRO-3 equivalent) provided 1.5x improvement over shard-grad-op (ZeRO-2 equivalent).

FSDP Benchmark | Hardware          | Result                       | Source
GPT-2 1.5B     | 2x Titan RTX 24GB | bs=5 (DDP OOMs)              | HuggingFace FSDP blog
T5-3B          | 8x A100-40GB      | 2.3 TFLOPS, 95% mem          | PyTorch/AWS blog
GPT 175B       | 128 A100-40GB     | 159–186 TFLOPS (51–60% peak) | arXiv:2304.11277
7B model       | 128 A100          | 3,700 tok/s/GPU, 57% MFU     | PyTorch blog

The Zhao et al. paper (arXiv:2304.11277) also demonstrated linear scalability from 128 to 512 GPUs for GPT 175B training. FSDP scales because its communication pattern (all-gather for forward, reduce-scatter for backward) overlaps well with computation when properly configured.
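Of the knobs called out in the T5-3B benchmark above, activation checkpointing is the one applied after wrapping. A minimal sketch using PyTorch's checkpoint-wrapper utility; the module path is what current releases expose, and fsdp_model and TransformerBlock are placeholders:

    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        checkpoint_wrapper,
        apply_activation_checkpointing,
    )

    # Recompute each transformer block's activations during backward instead of
    # storing them, trading extra recompute for a much smaller activation footprint.
    apply_activation_checkpointing(
        fsdp_model,                                          # an already-FSDP-wrapped model (placeholder)
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: isinstance(m, TransformerBlock),  # placeholder block class
    )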

FSDP2: The DTensor Rewrite

PyTorch 2.x introduced FSDP2, a ground-up rewrite based on the DTensor abstraction. The API is cleaner, composability with other parallelism strategies (tensor parallel, pipeline parallel) is significantly better, and performance improved.

FSDP2 IMPROVEMENTS OVER FSDP1

Measured on Llama models with TorchTitan. Source: arXiv:2410.06511

7% lower peak memory usage.
1.5% throughput gain from better overlap and scheduling.
Native DTensor composability. Combine FSDP + tensor parallelism + pipeline parallelism in a single model.
Float8 training with FSDP2: up to 50% throughput speedup over BF16.

If you're starting a new project on PyTorch 2.x, FSDP2 is the recommended path. The original FSDP1 API still works, but new development and optimizations are focused on FSDP2.
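A minimal FSDP2 sketch using the per-module fully_shard API. The import path below is what recent PyTorch 2.x releases expose, but it has moved between versions, so treat it as an assumption and check your install; MyTransformer and its .layers attribute are placeholders:

    import torch
    from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy  # path may differ on older 2.x releases

    mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)

    model = MyTransformer()               # placeholder model with a .layers list of blocks
    for block in model.layers:
        fully_shard(block, mp_policy=mp)  # each block becomes its own shard/communication unit
    fully_shard(model, mp_policy=mp)      # root call picks up the remaining parameters

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)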

DeepSpeed ZeRO: Three Stages of Memory Reduction

Microsoft's DeepSpeed library implements the ZeRO (Zero Redundancy Optimizer) algorithm in three stages, each sharding progressively more state across GPUs. The original ZeRO paper (Rajbhandari et al., SC'20) laid out the math clearly.

Stage   | What's Sharded                            | Memory Reduction              | Communication Cost
Stage 1 | Optimizer states                          | 4x                            | Same as DDP
Stage 2 | Optimizer states + gradients              | 8x                            | Same as DDP
Stage 3 | Optimizer states + gradients + parameters | Linear in Nd (64x on 64 GPUs) | 1.5x DDP volume

Nd = number of data-parallel GPUs. Source: Rajbhandari et al., SC'20

The practical impact on 64 GPUs: Stage 1 can train models up to roughly 7.5B parameters, Stage 2 pushes that to 14B, and Stage 3 enables training up to 128B parameters on the same hardware. The tradeoff is communication: Stages 1 and 2 have the same communication volume as DDP, while Stage 3 adds 1.5x more.
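For comparison with the FSDP snippets above, here is roughly what a ZeRO Stage 3 setup looks like through DeepSpeed's Python API. A minimal sketch using standard config keys; the model is a placeholder, and real jobs are usually started with the deepspeed launcher:

    import deepspeed

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,                                # shard optimizer states + gradients + parameters
            # "offload_optimizer": {"device": "cpu"},  # optional ZeRO-Offload
        },
    }

    # model is a placeholder nn.Module
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )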

FSDP vs DeepSpeed: Head-to-Head

Both FSDP (FULL_SHARD) and DeepSpeed ZeRO Stage 3 implement the same core algorithm: shard everything, all-gather before compute, reduce-scatter after. The differences are in implementation quality, ecosystem integration, and performance at different scales.

An IBM/ETH Zurich benchmark found that FSDP FULL_SHARD ran up to 5x faster per iteration than DeepSpeed ZeRO-3 for models up to a few billion parameters. The gap narrows as model size increases. At 10B+ parameters, the difference shrinks significantly. At 70B+, they largely converge.

Model Scale | FSDP vs ZeRO-3        | Notes
1–3B        | FSDP up to ~5x faster | FSDP's tighter PyTorch integration pays off
10–30B      | Gap narrows           | Communication becomes dominant factor
70B+        | Roughly equivalent    | Both bottlenecked by interconnect bandwidth

Source: IBM/ETH Zurich distributed training comparison

Decision Matrix: Which Strategy to Use

Model size is the primary decision variable. Everything else is secondary.

Criteria              | DDP      | FSDP         | DeepSpeed ZeRO
Model < 500M          | Best     | Overkill     | Overkill
Model 500M–2B         | OK       | Similar perf | Stage 1–2
Model > ~2.3B         | OOM      | Required     | Stage 3 required
Simplicity            | Easiest  | Moderate     | Config-heavy
Native PyTorch        | Yes      | Yes          | Third-party
CPU offloading        | No       | Yes          | Yes (ZeRO-Offload)
Checkpoint complexity | Simple   | Moderate     | Complex
HF Trainer support    | Built-in | Built-in     | Built-in

"OOM" assumes 80 GB GPU with mixed-precision AdamW. The exact threshold depends on sequence length and batch size.

The short version: if your model fits in GPU memory with room for reasonable batch sizes, use DDP. If it doesn't, use FSDP (for PyTorch-native) or DeepSpeed Stage 3 (if you need the broader DeepSpeed ecosystem or ZeRO-Offload). For models between 500M and 2B, either DDP or FSDP will work. Above 2.3B on a single 80 GB GPU, sharding is not optional.

Communication Overhead: NVLink vs PCIe

DDP only all-reduces gradients once per step. FSDP and ZeRO Stage 3 all-gather parameters before every forward and backward pass, then reduce-scatter gradients after. That means interconnect bandwidth matters much more for FSDP/ZeRO-3 than for DDP.

NVLINK VS PCIE: DDP BENCHMARK

GPT-2 on 2x Titan RTX. Source: HuggingFace docs

NVLink    | Runtime: 101.9s, throughput: 1.963 samples/sec
PCIe only | Runtime: 131.4s, throughput: 1.522 samples/sec
Delta     | NVLink is 23% faster even with DDP (gradient-only communication)

For FSDP and ZeRO-3, the impact is larger because the communication volume is 1.5x that of DDP. If you're running FSDP or ZeRO-3 on hardware without NVLink (consumer GPUs, PCIe-only multi-GPU setups), expect significant communication overhead. NVLink is critical for sharded strategies at any meaningful model size.
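If you're not sure what your interconnect actually delivers, a crude all-reduce timing loop gives a quick answer. This is a rough sketch (it ignores the ring all-reduce bandwidth factor) and assumes a torchrun launch with at least two GPUs:

    import os
    import time
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    x = torch.randn(256 * 1024 * 1024, device="cuda")   # 1 GiB of fp32, standing in for a gradient bucket

    for _ in range(5):                                   # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(10):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    per_call = (time.time() - start) / 10

    print(f"1 GiB all-reduce: {per_call * 1e3:.1f} ms (~{1.0 / per_call:.1f} GiB/s naive)")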

Practical Gotchas

1. CPU offloading costs 20–40% of iteration speed

Both FSDP and DeepSpeed support offloading optimizer states (and optionally parameters) to CPU RAM. This lets you train larger models on fewer GPUs. The cost: 20–40% slower iterations due to PCIe transfers between CPU and GPU. Use this as a last resort when sharding alone isn't enough.
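On the FSDP side, offloading is a constructor flag. A minimal sketch (model is a placeholder, already running under a torchrun launch):

    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

    # Keep sharded parameters (and their gradients) in host RAM between uses;
    # expect the 20–40% per-iteration slowdown described above.
    model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))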

2. Use BFloat16, not Float16, with FSDP

BFloat16 has the same dynamic range as Float32, so it doesn't require a gradient scaler. Float16 with FSDP requires careful loss scaling to avoid underflow, especially when gradients are sharded across GPUs. On A100 and H100, BFloat16 is roughly 4% faster than Float16 because it skips the scaling overhead entirely.
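A sketch of a BF16 policy using FSDP1's MixedPrecision config (model is a placeholder):

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,    # compute in bf16
        reduce_dtype=torch.bfloat16,   # gradient reduce-scatter in bf16
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=bf16_policy)   # no GradScaler needed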

3. FSDP checkpoints are not DDP checkpoints

FSDP saves sharded checkpoints by default. Each rank writes its own shard. To get a single consolidated checkpoint (for inference or resuming with a different GPU count), you need FullStateDictConfig(offload_to_cpu=True, rank0_only=True) or the newer torch.distributed.checkpoint. DeepSpeed has its own checkpoint format that also requires explicit consolidation. Plan your checkpoint strategy before you start training.
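A sketch of the rank-0 consolidation path mentioned above, using the FSDP1 state-dict API (model and the output path are placeholders):

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        StateDictType,
        FullStateDictConfig,
    )

    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()           # gathered to CPU; only populated on rank 0
    if dist.get_rank() == 0:
        torch.save(state, "consolidated.pt")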

4. Wrapping policy matters for FSDP performance

Wrapping each transformer block as its own FSDP unit (instead of wrapping the entire model) gave 2x throughput improvement in the T5-3B PyTorch/AWS benchmark. The wrapping granularity controls how much is all-gathered at once and how well communication overlaps with computation.
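A sketch of transformer-level auto-wrapping with FSDP1's wrap policy (TransformerBlock is a placeholder for your model's block class):

    import functools
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={TransformerBlock},    # placeholder: your model's block class
    )
    model = FSDP(model, auto_wrap_policy=wrap_policy)  # each block becomes its own FSDP unit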

5. DeepSpeed ZeRO-Infinity adds NVMe offloading

DeepSpeed can offload to NVMe SSDs in addition to CPU RAM, enabling training of extremely large models on limited GPU counts. FSDP does not have an NVMe offload equivalent. If you need to train a model that doesn't fit even with CPU offloading, ZeRO-Infinity is currently the only option.

Find Your Strategy Before You Launch

The right distributed strategy depends on your model size, your hardware, and your interconnect. Alloc profiles your training job, measures VRAM usage across all GPUs, and tells you whether you're communication-bound, memory-bound, or compute-bound.

pip install alloc && alloc run python train.py

30 seconds of calibration. Full VRAM breakdown, bottleneck detection, and right-sizing recommendations.

Sources

  • Li et al., "PyTorch Distributed: Experiences on Accelerating Data Parallel Training," VLDB 2020
  • Zhao et al., "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel," arXiv:2304.11277
  • Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," SC'20
  • HuggingFace, "Accelerate FSDP Integration" blog post and multi-GPU documentation
  • PyTorch/AWS, "Getting Started with Fully Sharded Data Parallel (FSDP)" blog post (T5-3B benchmarks)
  • PyTorch blog, "Training a 7B model on 128 GPUs with FSDP"
  • TorchTitan team, "TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training," arXiv:2410.06511
  • Siddharth et al., "Memory-efficient FSDP analysis," arXiv:2505.12832
  • IBM/ETH Zurich, distributed training strategy comparison (FSDP vs DeepSpeed ZeRO-3)

Related Reading