FSDP vs DDP vs DeepSpeed: A Practical Guide With Real Benchmarks
By allocguy · February 2026 · 12 min read
Most teams spend weeks comparing A100 vs H100. They should be spending that time picking the right distributed training strategy. The choice between DDP, FSDP, and DeepSpeed ZeRO determines whether your model fits in memory at all, how much communication overhead you pay per step, and whether adding GPUs actually speeds things up.
Here's the real data. No handwaving, no "it depends" without numbers. Published benchmarks from PyTorch, HuggingFace, Microsoft Research, and IBM/ETH Zurich.
DDP: The Simplest Strategy (Until Your Model Outgrows It)
PyTorch DistributedDataParallel (DDP) replicates the full model on every GPU. Each GPU processes a different batch, then all-reduces gradients before the optimizer step. Simple, battle-tested, and fast when the model fits.
The memory cost is straightforward. With mixed-precision training using AdamW, each parameter costs 16 bytes per GPU: 2 bytes for the FP16 parameter, 2 bytes for the FP16 gradient, and 12 bytes for the FP32 optimizer states (master copy + momentum + variance). That means a 1B parameter model requires roughly 16 GB per GPU before activations and framework overhead.
| GPU Memory | Max Model Size (DDP) | Notes |
|---|---|---|
| 24 GB (RTX 3090/4090) | ~1B params | ~16 GB of model states + activations + overhead |
| 40 GB (A100-40GB) | ~1.5–2B params | Tight at 2B with batch_size=1 |
| 80 GB (A100/H100) | ~3.5B params | Leaves room for small batches |
Estimates assume mixed-precision AdamW (16 bytes/param) plus activation memory overhead. Exact limits depend on sequence length and batch size.
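The arithmetic is easy to script. Here's a minimal sketch of the per-GPU model-state estimate under the 16 bytes/param assumption above; activation memory is deliberately left out because it depends on sequence length and batch size.

```python
def ddp_model_state_gb(n_params: float) -> float:
    """Per-GPU memory for model states under mixed-precision AdamW with DDP.

    2 bytes FP16 param + 2 bytes FP16 grad + 12 bytes FP32 optimizer state
    (master copy, momentum, variance) = 16 bytes per parameter.
    """
    return n_params * 16 / 1e9

for n in (1e9, 2e9, 3.5e9):
    print(f"{n / 1e9:.1f}B params -> {ddp_model_state_gb(n):.0f} GB before activations")
# 1.0B params -> 16 GB before activations
# 2.0B params -> 32 GB before activations
# 3.5B params -> 56 GB before activations
```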
DDP's strength is scalability. With gradient bucketing (overlapping communication with backward pass computation), DDP achieves near-linear scaling across GPUs. The PyTorch team demonstrated this rigorously in their VLDB 2020 paper (Li et al.), showing that gradient bucketing and computation/communication overlap are what make DDP practical at scale.
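For reference, a minimal DDP setup looks like the sketch below (assuming a `torchrun` launch, which sets the rank environment variables; `MyModel` is a placeholder for your own `nn.Module`). The `bucket_cap_mb` argument controls the gradient bucket size that drives the communication/computation overlap described above.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")   # torchrun provides RANK/WORLD_SIZE/LOCAL_RANK
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)        # placeholder: your nn.Module
model = DDP(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=25,   # default bucket size; gradients are all-reduced per bucket during backward
)
# training loop as usual: loss.backward() overlaps gradient all-reduce with backward compute
```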
But DDP has a hard ceiling: if the model doesn't fit on a single GPU, DDP cannot train it. No amount of GPUs helps because every GPU needs a full copy. Once you cross that threshold, you need sharding.
FSDP: Sharding Inside PyTorch
Fully Sharded Data Parallel (FSDP) was added in PyTorch 1.11 (March 2022). It implements the ZeRO-3 algorithm natively in PyTorch: model parameters, gradients, and optimizer states are all sharded across GPUs. Each GPU only holds a slice. When a layer needs the full parameters for forward or backward, FSDP all-gathers them on the fly, computes, then discards them.
The result is dramatic. Published benchmarks show FSDP achieves 4–6x peak memory reduction compared to DDP (arXiv:2505.12832). That means models that OOM on DDP can train comfortably with FSDP.
HuggingFace's FSDP benchmark makes the point concretely: GPT-2 1.5B trains on 2x Titan RTX (24 GB) with FSDP, while DDP on the same hardware cannot even start training (source: HuggingFace FSDP blog).
At larger scale, FSDP holds up well. A PyTorch/AWS benchmark trained T5-3B on 8x A100-40GB and achieved 2.3 TFLOPS with 95% GPU memory utilization at batch_size=14. Transformer-level wrapping (wrapping each transformer block individually instead of the whole model) delivered over 2x throughput improvement. Activation checkpointing added another 10x improvement. BF16 mixed precision gave roughly 5x over FP32, and full sharding (ZeRO-3 equivalent) provided 1.5x improvement over shard-grad-op (ZeRO-2 equivalent).
| FSDP Benchmark | Hardware | Result | Source |
|---|---|---|---|
| GPT-2 1.5B | 2x Titan RTX 24GB | bs=5 (DDP OOMs) | HuggingFace FSDP blog |
| T5-3B | 8x A100-40GB | 2.3 TFLOPS, 95% mem | PyTorch/AWS blog |
| GPT 175B | 128 A100-40GB | 159–186 TFLOPS (51–60% peak) | arXiv:2304.11277 |
| 7B model | 128 A100 | 3,700 tok/s/GPU, 57% MFU | PyTorch blog |
The Zhao et al. paper (arXiv:2304.11277) also demonstrated linear scalability from 128 to 512 GPUs for GPT 175B training. FSDP scales because its communication pattern (all-gather for forward, reduce-scatter for backward) overlaps well with computation when properly configured.
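The sharding knob from those benchmarks maps directly onto the FSDP1 API. A minimal sketch (`model` is a placeholder for an already-constructed `nn.Module`; mixed precision and wrapping policies are covered in the gotchas section below):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# FULL_SHARD shards parameters, gradients, and optimizer states (ZeRO-3 equivalent).
# SHARD_GRAD_OP keeps full parameters but shards gradients + optimizer states (ZeRO-2 equivalent).
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
```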
FSDP2: The DTensor Rewrite
PyTorch 2.x introduced FSDP2, a ground-up rewrite based on the DTensor abstraction. The API is cleaner, composability with other parallelism strategies (tensor parallel, pipeline parallel) is significantly better, and performance improved.
The TorchTitan team measured FSDP2's improvements over FSDP1 on Llama models (arXiv:2410.06511).
If you're starting a new project on PyTorch 2.x, FSDP2 is the recommended path. The original FSDP1 API still works, but new development and optimizations are focused on FSDP2.
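A minimal FSDP2 sketch, assuming PyTorch 2.6+ where `fully_shard` is exposed under `torch.distributed.fsdp` (in earlier 2.x releases it lived in a private composable namespace); `model` and its `layers` list are placeholders for your own module structure:

```python
from torch.distributed.fsdp import fully_shard

# Shard each transformer block as its own unit, then the root module.
# Parameters become DTensors sharded across the data-parallel group.
for block in model.layers:   # placeholder: your list of transformer blocks
    fully_shard(block)
fully_shard(model)
```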
DeepSpeed ZeRO: Three Stages of Memory Reduction
Microsoft's DeepSpeed library implements the ZeRO (Zero Redundancy Optimizer) algorithm in three stages, each sharding progressively more state across GPUs. The original ZeRO paper (Rajbhandari et al., SC'20) laid out the math clearly.
| Stage | What's Sharded | Memory Reduction | Communication Cost |
|---|---|---|---|
| Stage 1 | Optimizer states | 4x | Same as DDP |
| Stage 2 | Optimizer states + gradients | 8x | Same as DDP |
| Stage 3 | Optimizer states + gradients + parameters | Linear with Nd (64x on 64 GPUs) | 1.5x DDP volume |
Nd = number of data-parallel GPUs. Source: Rajbhandari et al., SC'20
The practical impact, using the ZeRO paper's 64-GPU example: Stage 1 can train models up to roughly 7.5B parameters, Stage 2 pushes that to about 14B, and Stage 3 enables training up to roughly 128B parameters on the same hardware. The tradeoff is communication: Stages 1 and 2 have the same communication volume as DDP, but Stage 3 adds 1.5x more.
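In practice the stage is a single field in the DeepSpeed config. A minimal ZeRO-3 sketch as a Python dict (the same keys work in a JSON file passed to the HF Trainer or to `deepspeed.initialize`); the batch size and bucket values here are illustrative, not recommendations:

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                          # 1 = optimizer states, 2 = + gradients, 3 = + parameters
        "overlap_comm": True,                # overlap reduce-scatter with backward compute
        "stage3_prefetch_bucket_size": 5e7,  # illustrative tuning knob for parameter prefetch
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config, ...)
```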
FSDP vs DeepSpeed: Head-to-Head
Both FSDP (FULL_SHARD) and DeepSpeed ZeRO Stage 3 implement the same core algorithm: shard everything, all-gather before compute, reduce-scatter after. The differences are in implementation quality, ecosystem integration, and performance at different scales.
An IBM/ETH Zurich benchmark found that FSDP FULL_SHARD ran up to 5x faster per iteration than DeepSpeed ZeRO-3 for models up to a few billion parameters. The gap narrows as model size increases. At 10B+ parameters, the difference shrinks significantly. At 70B+, they largely converge.
| Model Scale | FSDP vs ZeRO-3 | Notes |
|---|---|---|
| 1–3B | FSDP up to ~5x faster | FSDP's tighter PyTorch integration pays off |
| 10–30B | Gap narrows | Communication becomes dominant factor |
| 70B+ | Roughly equivalent | Both bottlenecked by interconnect bandwidth |
Source: IBM/ETH Zurich distributed training comparison
Decision Matrix: Which Strategy to Use
Model size is the primary decision variable. Everything else is secondary.
| Criteria | DDP | FSDP | DeepSpeed ZeRO |
|---|---|---|---|
| Model < 500M | Best | Overkill | Overkill |
| Model 500M–2B | OK | Similar perf | Stage 1–2 |
| Model > ~2.3B | OOM | Required | Stage 3 required |
| Simplicity | Easiest | Moderate | Config-heavy |
| Native PyTorch | Yes | Yes | Third-party |
| CPU offloading | No | Yes | Yes (ZeRO-Offload) |
| Checkpoint complexity | Simple | Moderate | Complex |
| HF Trainer support | Built-in | Built-in | Built-in |
"OOM" assumes 80 GB GPU with mixed-precision AdamW. The exact threshold depends on sequence length and batch size.
The short version: if your model fits in GPU memory with room for reasonable batch sizes, use DDP. If it doesn't, use FSDP (for PyTorch-native) or DeepSpeed Stage 3 (if you need the broader DeepSpeed ecosystem or ZeRO-Offload). For models between 500M and 2B, either DDP or FSDP will work. Above 2.3B on a single 80 GB GPU, sharding is not optional.
Communication Overhead: NVLink vs PCIe
DDP only all-reduces gradients once per step. FSDP and ZeRO Stage 3 all-gather parameters before every forward and backward pass, then reduce-scatter gradients after. That means interconnect bandwidth matters much more for FSDP/ZeRO-3 than for DDP.
HuggingFace's GPT-2 DDP benchmark on 2x Titan RTX shows training measurably faster with NVLink than over PCIe alone (source: HuggingFace multi-GPU docs).
For FSDP and ZeRO-3, the impact is larger because the communication volume is 1.5x that of DDP. If you're running FSDP or ZeRO-3 on hardware without NVLink (consumer GPUs, PCIe-only multi-GPU setups), expect significant communication overhead. NVLink is critical for sharded strategies at any meaningful model size.
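A back-of-the-envelope way to see the 1.5x: per step, DDP's ring all-reduce of gradients moves roughly 2x the parameter bytes, while ZeRO-3/FSDP does two all-gathers of parameters (forward and backward) plus one reduce-scatter of gradients, roughly 1x each. A rough sketch of that arithmetic:

```python
def per_step_comm_bytes(n_params: float, bytes_per_elem: int = 2):
    """Approximate per-GPU communication volume per training step with ring collectives."""
    psi = n_params * bytes_per_elem
    ddp = 2 * psi    # all-reduce of gradients = reduce-scatter + all-gather
    zero3 = 3 * psi  # all-gather params (fwd) + all-gather params (bwd) + reduce-scatter grads
    return ddp, zero3

ddp, zero3 = per_step_comm_bytes(7e9)      # 7B parameters in BF16
print(f"DDP:    {ddp / 1e9:.0f} GB/step")    # ~28 GB/step
print(f"ZeRO-3: {zero3 / 1e9:.0f} GB/step")  # ~42 GB/step -> 1.5x DDP
```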
Practical Gotchas
1. CPU offloading costs 20–40% iteration speed
Both FSDP and DeepSpeed support offloading optimizer states (and optionally parameters) to CPU RAM. This lets you train larger models on fewer GPUs. The cost: 20–40% slower iterations due to PCIe transfers between CPU and GPU. Use this as a last resort when sharding alone isn't enough.
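With FSDP1, offloading is a single flag; a minimal sketch is below (DeepSpeed expresses the same idea via `offload_optimizer`/`offload_param` entries in its config):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

# Parameters (and their gradients) live in CPU RAM when not needed on the GPU;
# expect slower steps from the extra PCIe traffic.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```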
2. Use BFloat16, not Float16, with FSDP
BFloat16 has the same dynamic range as Float32, so it doesn't require a gradient scaler. Float16 with FSDP requires careful loss scaling to avoid underflow, especially when gradients are sharded across GPUs. On A100 and H100, BFloat16 is roughly 4% faster than Float16 because it skips the scaling overhead entirely.
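With FSDP1 that's a `MixedPrecision` policy; a minimal sketch:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,    # parameters are all-gathered and computed in BF16
    reduce_dtype=torch.bfloat16,   # gradient reduce-scatter happens in BF16
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=bf16_policy)
```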
3. FSDP checkpoints are not DDP checkpoints
FSDP saves sharded checkpoints by default. Each rank writes its own shard. To get a single consolidated checkpoint (for inference or resuming with a different GPU count), you need FullStateDictConfig(offload_to_cpu=True, rank0_only=True) or the newer torch.distributed.checkpoint. DeepSpeed has its own checkpoint format that also requires explicit consolidation. Plan your checkpoint strategy before you start training.
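A sketch of the `FullStateDictConfig` route for a consolidated checkpoint on rank 0 (the `torch.distributed.checkpoint` APIs are the longer-term path):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    full_state = model.state_dict()   # gathered onto CPU; only rank 0 keeps a copy

if dist.get_rank() == 0:
    torch.save(full_state, "consolidated.pt")
```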
4. Wrapping policy matters for FSDP performance
Wrapping each transformer block as its own FSDP unit (instead of wrapping the entire model) gave 2x throughput improvement in the T5-3B PyTorch/AWS benchmark. The wrapping granularity controls how much is all-gathered at once and how well communication overlaps with computation.
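With FSDP1 this is the `auto_wrap_policy` argument. A sketch for a T5-style model, where `T5Block` is the transformer layer class from HuggingFace transformers:

```python
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.t5.modeling_t5 import T5Block

# Each T5Block becomes its own FSDP unit, so only one block's parameters are
# all-gathered at a time and prefetch can overlap with compute.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={T5Block},
)
model = FSDP(model, auto_wrap_policy=wrap_policy)
```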
5. DeepSpeed ZeRO-Infinity adds NVMe offloading
DeepSpeed can offload to NVMe SSDs in addition to CPU RAM, enabling training of extremely large models on limited GPU counts. FSDP does not have an NVMe offload equivalent. If you need to train a model that doesn't fit even with CPU offloading, ZeRO-Infinity is currently the only option.
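A sketch of the relevant config fields in Python dict form (`/local_nvme` is a placeholder for a fast local SSD mount):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    # ZeRO-Infinity also needs an async I/O ("aio") section tuned to the SSD;
    # see the DeepSpeed docs for the exact knobs.
}
```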
Find Your Strategy Before You Launch
The right distributed strategy depends on your model size, your hardware, and your interconnect. Alloc profiles your training job, measures VRAM usage across all GPUs, and tells you whether you're communication-bound, memory-bound, or compute-bound.
`pip install alloc && alloc run python train.py`
30 seconds of calibration. Full VRAM breakdown, bottleneck detection, and right-sizing recommendations.
Sources
- Li et al., "PyTorch Distributed: Experiences on Accelerating Data Parallel Training," VLDB 2020
- Zhao et al., "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel," arXiv:2304.11277
- Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," SC'20
- HuggingFace, "Accelerate FSDP Integration" blog post and multi-GPU documentation
- PyTorch/AWS, "Getting Started with Fully Sharded Data Parallel (FSDP)" blog post (T5-3B benchmarks)
- PyTorch blog, "Training a 7B model on 128 GPUs with FSDP"
- TorchTitan team, "TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training," arXiv:2410.06511
- Siddharth et al., "Memory-efficient FSDP analysis," arXiv:2505.12832
- IBM/ETH Zurich, distributed training strategy comparison (FSDP vs DeepSpeed ZeRO-3)