7 Training Performance Stragglers That Burn GPU Hours (and How to Find Them)

By allocguy · February 2026 · 14 min read

Chris Fregly's AI Systems Performance Engineering (O'Reilly, 2025) makes a compelling case: GPU training has its own discipline of performance engineering, distinct from traditional systems work. The failure modes aren't segfaults or 500 errors. They're silent. Your training job finishes, but it took 3x longer than it should have, and nobody noticed because nobody measured.

Fregly calls these "performance stragglers"—components that drag down end-to-end throughput without raising exceptions. Unlike crashes, stragglers don't appear in your logs. They appear in your cloud bill. An H100 node at $3.50/GPU/hr across 8 GPUs is $672/day. A straggler that adds 30% to your training time costs $200/day in waste—and most teams don't find it until weeks into a job, if ever.

This post covers 7 of the most common training performance stragglers with real numbers: bandwidth calculations, memory math, and the specific hardware thresholds where each straggler becomes the dominant bottleneck. Everything here is publicly documented—the problem isn't that the information doesn't exist, it's that most teams discover it after burning through their GPU budget.

1. The Compute-to-Memory Bandwidth Gap

This is the meta-straggler that explains why every other straggler on this list is getting worse, not better, with each GPU generation. GPU compute (TFLOPs) has been growing faster than GPU memory bandwidth (TB/s) for a decade. The result: the minimum arithmetic intensity (FLOPs per byte loaded from memory) required to keep the GPU's math units busy keeps rising.

| GPU | BF16 TFLOPs | HBM Bandwidth | FLOPs/Byte Ratio | Year |
|---|---|---|---|---|
| V100* | 125 | 900 GB/s | 139 | 2017 |
| A100 | 312 | 2.0 TB/s | 156 | 2020 |
| H100 SXM | 989 | 3.35 TB/s | 295 | 2022 |
| B200 | 2,250 | 8.0 TB/s | 281 | 2024 |

BF16 dense Tensor Core throughput (without sparsity). *V100 predates BF16 Tensor Cores; its figure is FP16. FLOPs/Byte = peak TFLOPs × 10³ / HBM bandwidth in GB/s.

From V100 to H100, the ratio more than doubled: 139 to 295. That means an operation that was compute-bound on V100 can be memory-bandwidth-bound on H100 without changing a single line of code. Your training script didn't get slower—the hardware got faster at math but didn't keep up on data movement.

This is the physics underneath every kernel-fusion optimization: Flash Attention, operator fusion in torch.compile, and FP8 quantization all solve the same problem of reducing bytes loaded from HBM per FLOP so the GPU's compute units aren't starved. An unfused layer norm on H100 spends ~90% of its time waiting on memory reads. Fuse it with the subsequent linear layer and the intermediate data stays in on-chip SRAM instead of round-tripping through HBM. This is the core of what NVIDIA's CuTiles (CUTLASS 4.0) is trying to democratize: tiled, SRAM-resident computation as a programmable abstraction rather than a hand-optimized one-off kernel.

The practical implication: if you upgrade from A100 to H100 and your MFU (Model FLOPs Utilization) drops, you haven't found a bug. You've hit the bandwidth wall. The fix isn't more GPUs; it's better kernels.
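
That ridge point is easy to sanity-check yourself. Here is a minimal sketch of the arithmetic-intensity test using the data-sheet numbers above; the per-element layer norm costs are rough illustrative assumptions, not profiler output:

```python
# Back-of-envelope roofline check: is an op compute-bound or memory-bound?
# Hardware numbers are from public data sheets; op byte counts are assumptions.

def ridge_point(tflops: float, hbm_gbps: float) -> float:
    """Minimum FLOPs/byte needed to saturate compute: TFLOPs * 1e3 / (GB/s)."""
    return tflops * 1e3 / hbm_gbps

def bound_by(op_flops: float, op_bytes: float, tflops: float, hbm_gbps: float) -> str:
    intensity = op_flops / op_bytes
    return "compute" if intensity >= ridge_point(tflops, hbm_gbps) else "memory"

H100 = dict(tflops=989, hbm_gbps=3350)

# Unfused layer norm over a 4096 x 4096 activation tile: roughly 10 FLOPs
# per element (mean/var/normalize), but every element is read from and
# written back to HBM in BF16.
n = 4096 * 4096            # elements in the tile
flops = 10 * n             # rough per-element arithmetic cost
bytes_moved = 2 * n * 2    # one read + one write, 2 bytes each in BF16

print(round(ridge_point(**H100)))             # 295
print(bound_by(flops, bytes_moved, **H100))   # memory
```

At 2.5 FLOPs/byte against a ridge point of ~295, the unfused op sits two orders of magnitude inside the memory-bound region; no amount of Tensor Core throughput helps until the bytes moved shrink.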

2. Gradient Synchronization Overhead

In distributed training with DDP or FSDP, every step ends with an AllReduce: all GPUs exchange and average their gradients. The fastest GPU in the group waits for the slowest one to finish its backward pass before any GPU can proceed to the next step. One slow rank holds back the entire job.

Let's do the math. A 7B parameter model in BF16 generates 14 GB of gradient data per step (7B × 2 bytes). In a ring-allreduce across 8 GPUs, each GPU sends and receives approximately 2 × (N-1)/N × 14 GB ≈ 24.5 GB of data. The time this takes depends entirely on your interconnect.

| Interconnect | Bandwidth (bidirectional) | AllReduce Time (7B, 8 GPUs) | Hardware |
|---|---|---|---|
| PCIe 4.0 | 64 GB/s | ~383 ms | Consumer/workstation GPUs |
| PCIe 5.0 | 128 GB/s | ~191 ms | L40S, some H100 PCIe |
| NVLink 4.0 | 900 GB/s | ~27 ms | H100 SXM, A100 SXM |
| NVLink 5.0 | 1.8 TB/s | ~14 ms | GB200 NVL72 |

AllReduce times are theoretical minimums assuming full bandwidth utilization. Real-world times are 1.5–2x higher due to protocol overhead and NCCL scheduling.

On PCIe 4.0, gradient sync for a 7B model takes almost 400 ms per step—that can easily exceed the compute time itself, meaning your GPUs spend more time communicating than training. Rajbhandari et al.'s ZeRO paper (SC'20) and Narayanan et al.'s Megatron-LM work showed that communication-compute overlap is the key to scaling—but overlap only works if your interconnect can keep up with the gradient volume.

At the other end, NVIDIA's GB200 NVL72 connects 72 Blackwell GPUs with NVLink 5.0 at 1.8 TB/s per GPU. Even at that bandwidth, a 70B model (140 GB of gradients) still needs ~155 ms for a ring-allreduce across all 72 GPUs by the same formula. Bandwidth is never "enough"—models grow to fill whatever interconnect is available.
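
The table's numbers fall straight out of the ring-allreduce formula. Here is a sketch (theoretical minimums; as noted above, real NCCL times run 1.5–2x higher):

```python
# Theoretical ring-allreduce time: each GPU sends 2*(N-1)/N * payload bytes.
# Bandwidth figures match the interconnect table above.

def allreduce_seconds(params: float, bytes_per_param: int,
                      n_gpus: int, bandwidth_gbps: float) -> float:
    payload_gb = params * bytes_per_param / 1e9          # gradient volume
    per_gpu_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb  # ring traffic per GPU
    return per_gpu_gb / bandwidth_gbps

# 7B model, BF16 gradients (2 bytes/param), 8 GPUs
for name, bw in [("PCIe 4.0", 64), ("PCIe 5.0", 128),
                 ("NVLink 4.0", 900), ("NVLink 5.0", 1800)]:
    ms = allreduce_seconds(7e9, 2, 8, bw) * 1e3
    print(f"{name}: {ms:.0f} ms")
# PCIe 4.0: 383 ms / PCIe 5.0: 191 ms / NVLink 4.0: 27 ms / NVLink 5.0: 14 ms
```

Plug in your own parameter count and GPU count to see whether gradient sync can hide behind your backward pass.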

DETECTION SIGNALS

Communication time as a percentage of total step time — above 20% indicates significant overhead
Per-rank timing variance — if the slowest rank is 2x+ the median, you have a straggler
Scaling efficiency below 80% when adding GPUs — communication is eating your gains

Fix hints: tune gradient bucketing (DDP buckets by default; adjust via torch.nn.parallel.DistributedDataParallel(model, bucket_cap_mb=25)), make sure communication overlaps with the backward pass (DDP's default behavior, which custom communication hooks can accidentally disable), and verify your interconnect. The difference between PCIe 4.0 and NVLink 4.0 is not 2x—it's 14x.

3. torch.compile Graph Breaks and Silent Fallback

torch.compile (PyTorch 2.0+) uses TorchDynamo to trace your model's forward pass into a graph, then optimizes it with TorchInductor, which generates fused, hardware-specific Triton kernels and eliminates redundant memory accesses. On a well-compiled model, it delivers 20–40% training speedup for free. On a poorly-compiled one, it delivers nothing—and doesn't tell you.

The failure mode is graph breaks: points where the tracer can't capture a Python operation into the graph and falls back to eager execution. Each graph break splits the model into smaller compilable subgraphs with eager gaps between them. Those gaps reintroduce the exact memory bandwidth overhead that compilation was supposed to eliminate.

COMMON GRAPH BREAK TRIGGERS

Data-dependent control flow — if tensor.item() > 0 forces eager execution because the branch depends on runtime values
Dynamic shapes — variable sequence lengths across batches trigger recompilation or graph breaks on every shape change
Custom CUDA extensions — any torch.autograd.Function with a custom backward is opaque to the tracer
Python side effects — logging, metric tracking, or printing inside forward() breaks the trace

The insidious part: your training still runs. PyTorch doesn't error on a graph break—it silently falls back to eager mode for that subgraph. You think you're running compiled code, but you're actually running a patchwork of compiled and eager regions. Set TORCH_LOGS="graph_breaks" to see every break, or use torch.compile(model, fullgraph=True) to force a hard error instead of silent degradation.

There's also the cold start problem. First-iteration compilation for a 7B model can take 10–60 minutes depending on model complexity. On spot instances or preemptible VMs, a preemption means a full recompile on restart. The compilation cache helps across restarts on the same machine, but doesn't transfer across nodes in distributed training without explicit cache sharing. For a 4-node job that gets preempted 3 times over a weekend, you've burned 1–3 hours of GPU time just recompiling.

4. Inefficient Attention Computation

Vanilla self-attention computes the full N×N attention matrix and stores it in GPU memory. Let's quantify this for a typical LLM architecture. Take a model with 32 attention heads and a sequence length of 4096: the attention matrix per layer is batch × 32 heads × 4096 × 4096 × 2 bytes = 1 GB per layer at batch_size=1 in BF16. With 32 layers, that's 32 GB of attention matrices alone—nearly half an 80 GB A100's memory, just for intermediate attention state.

Scale to 8192 tokens (common for code models and long-context fine-tuning) and it quadruples: 128 GB of attention matrices. That's more than any single GPU can hold, and you haven't even counted model weights, optimizer states, or activations.
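
The memory math above is worth scripting so you can plug in your own architecture. A sketch (GiB, batch size 1, BF16):

```python
# Memory for materialized attention matrices in vanilla attention:
# batch * heads * N * N * bytes per layer, summed over layers.
# Flash Attention never stores this matrix, so its cost is O(N), not O(N^2).

def attn_matrix_gb(batch: int, heads: int, seq_len: int,
                   layers: int, bytes_per_el: int = 2) -> float:
    per_layer = batch * heads * seq_len**2 * bytes_per_el
    return per_layer * layers / 2**30   # GiB

print(attn_matrix_gb(1, 32, 4096, 32))   # 32.0  (GiB at 4k context)
print(attn_matrix_gb(1, 32, 8192, 32))   # 128.0 (quadruples at 8k)
```

Doubling sequence length quadruples this term, which is why long-context fine-tuning hits the memory wall long before compute becomes the problem.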

Flash Attention (Dao et al., 2022) solved this by fusing the attention computation into a single CUDA kernel that tiles the computation in SRAM and never materializes the full attention matrix, reducing memory from O(N²) to O(N). On an A100, Flash Attention achieves 124 TFLOPs for forward pass (40% of peak BF16 throughput) vs. ~20 TFLOPs for a naive PyTorch implementation. That's a 6x speedup from a single kernel change.

Tri Dao's key insight was that the bottleneck isn't compute—it's memory bandwidth. Flash Attention is the poster child for why fused kernels matter in ML workloads. An H100 has 3.35 TB/s HBM bandwidth but 989 TFLOPs of BF16 compute. The arithmetic intensity (FLOPs per byte) required to saturate compute is ~295 FLOPs/byte. Vanilla attention reads and writes the entire N×N matrix from HBM, wasting bandwidth on data movement instead of math.

If your training script doesn't explicitly use torch.nn.functional.scaled_dot_product_attention (added in PyTorch 2.0) or FlashAttention-2, you're leaving a 3–6x speedup on the table. Profile the attention forward pass time relative to total forward time. If attention dominates (>50% of forward time), you're almost certainly using a naive implementation.

5. FP8: The 2x Throughput You're Not Using

Every H100 ships with FP8 Tensor Cores rated at 1,978 TFLOPs—exactly 2x the BF16 throughput of 989 TFLOPs. On Blackwell (B200), the gap is the same: ~4,500 TFLOPs FP8 vs. ~2,250 TFLOPs BF16. If you're training on Hopper or Blackwell in BF16, you're leaving half of the silicon's capability unused.

So why isn't everyone using FP8? Because it's hard. FP8 has two formats: E4M3 (4 exponent bits, 3 mantissa bits—range [−448, 448], better precision) and E5M2 (5 exponent bits, 2 mantissa bits—range [−57344, 57344], better dynamic range). The standard recipe is E4M3 for forward activations and weights, E5M2 for backward gradients, since gradients need more dynamic range to avoid underflow.

| Precision | H100 TFLOPs | B200 TFLOPs | Scaling Required | Typical Speedup |
|---|---|---|---|---|
| FP32 | ~67 | ~150 | None | Baseline |
| BF16 | 989 | 2,250 | None | ~2x over FP32 |
| FP8 | 1,978 | ~4,500 | Per-tensor delayed scaling | ~2x over BF16 |

Dense Tensor Core throughput. Actual training speedup is typically 1.3–1.7x over BF16 due to non-matmul operations remaining in higher precision.

The core challenge is per-tensor scaling. FP8's tiny representable range means you need to compute the absolute max of each tensor and scale it to fit within [−448, 448] (E4M3) before quantizing. The standard approach is "delayed scaling": use the previous iteration's absmax to scale the current iteration's tensors, avoiding an extra synchronization pass. But if the tensor distribution shifts between iterations—common during learning rate warmup or at loss spikes—the stale scale factor causes overflow or underflow, and the training loss diverges.
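
Here is a toy sketch of delayed scaling's failure mode. It models only the scale-and-clip step, not FP8's mantissa rounding, and it is purely illustrative, not TransformerEngine's implementation:

```python
# Sketch of E4M3 per-tensor delayed scaling: scale by the *previous*
# iteration's absmax, then clip into E4M3's representable range [-448, 448].

E4M3_MAX = 448.0

def scale_and_clip(values, prev_amax):
    scale = E4M3_MAX / prev_amax   # stale scale factor from last iteration
    return [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]

# Stable distribution: the stale scale still fits everything in range.
print(scale_and_clip([0.5, -1.0, 2.0], prev_amax=2.0))
# [112.0, -224.0, 448.0]

# Distribution shift (e.g. a loss spike): values larger than the stale
# amax get clipped to the format max, silently losing information.
print(scale_and_clip([8.0, -6.0, 1.0], prev_amax=2.0))
# [448.0, -448.0, 224.0]
```

The second call is the overflow case the section describes: the two largest gradients collapse to ±448, and nothing in the training loop errors.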

The other landmine is outlier channels. Some weight and activation channels have magnitudes 100x larger than the median. FP8 quantization clips these outliers, effectively zeroing out the information they carry. This manifests as loss spikes 500–2000 steps into training—long enough that you've invested significant compute before noticing.

NVIDIA's TransformerEngine handles the scaling and format management, but requires rewriting your model to use TE-specific layers. Microsoft's MS-AMP provides a more drop-in solution. Both are still maturing. If you're running on H100s in BF16, FP8 is the single largest throughput gain available to you—but it requires instrumentation to verify training stability.

6. Checkpointing I/O Blocking

Model checkpointing is essential—nobody wants to lose hours of training progress to a hardware fault. But synchronous checkpoint saves block training while the entire model state is serialized and written to disk. The checkpoint for a 7B parameter model includes the model weights (14 GB in BF16), optimizer states (56 GB for AdamW: two FP32 moment buffers, 8 bytes per parameter), and the gradient scaler state. That's roughly 70 GB total.

| Storage Type | Write Speed | Time for 70 GB Checkpoint |
|---|---|---|
| NVMe SSD (local) | 3–7 GB/s | 10–23 sec |
| EBS gp3 (AWS) | ~1 GB/s | ~70 sec |
| NFS / shared filesystem | 200–500 MB/s | 140–350 sec |
| S3 (direct) | ~500 MB/s (multipart) | ~140 sec |

Write speeds are approximate. Actual throughput depends on concurrent I/O, filesystem cache pressure, and network conditions.

On a shared NFS mount—common in on-prem clusters—a single checkpoint can block training for two to nearly six minutes. If you checkpoint every 1000 steps and your step time is 500 ms, a worst-case save costs almost 6 minutes of idle time for every 8.3 minutes of training: over 40% of your GPU time spent waiting for I/O.

The signature is a distinctive sawtooth pattern in step times: consistent fast steps interrupted by periodic spikes. Maeng et al.'s CPR paper (MLSys 2021) found checkpoint overhead consumes 12% of total training time on average, reaching 43% for the worst jobs. Fixes: use asynchronous checkpointing via torch.distributed.checkpoint (which writes sharded checkpoints), stage checkpoints to local NVMe before uploading to networked storage, and reduce checkpoint frequency when stability allows. PyTorch 2.3+ added torch.distributed.checkpoint.async_save, which overlaps serialization and I/O with subsequent training steps.
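
To estimate your own exposure, here is a sketch that computes checkpoint stall as a fraction of wall-clock time, assuming AdamW moments stored in FP32 (8 bytes per parameter on top of 2-byte BF16 weights):

```python
# Checkpoint stall fraction for synchronous saves.
# Assumption: BF16 weights (2 B/param) + two FP32 AdamW moments (8 B/param).

def ckpt_gb(params: float) -> float:
    return params * (2 + 8) / 1e9   # decimal GB

def stall_fraction(params: float, write_gbps: float,
                   interval_steps: int, step_s: float) -> float:
    save_s = ckpt_gb(params) / write_gbps          # blocking save time
    return save_s / (save_s + interval_steps * step_s)

# 7B model, NFS at 0.2 GB/s, checkpoint every 1000 steps of 500 ms
print(f"{stall_fraction(7e9, 0.2, 1000, 0.5):.0%}")   # 41%
```

Swap in local NVMe at 5 GB/s and the same job's stall fraction drops below 3%, which is the whole argument for staging.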

7. Stragglers in Multi-Node Training

Multi-node training introduces a class of stragglers that don't exist on a single machine. One slow node—whether from thermal throttling, noisy neighbors on shared cloud instances, network congestion, or OS-level interrupts—holds back the entire job. With synchronous training (DDP, FSDP), every node must finish its batch before any node can proceed.

Pipeline parallelism: bandwidth math

In pipeline parallelism, you split the model across nodes by layers. Inter-stage communication sends activations between pipeline stages for every micro-batch. Take a 70B model with d_model=8192 split across 8 pipeline stages, with sequence length 4096 and micro-batches of 4 sequences in BF16: each activation tensor crossing a stage boundary is 4 × 4096 × 8192 × 2 bytes = 256 MB. With 4 micro-batches flowing through 7 stage boundaries, that's roughly 7 GB of point-to-point transfers per step in the forward pass alone (the backward pass sends activation gradients of the same size). On InfiniBand HDR (200 Gbps = 25 GB/s), that's ~280 ms of pure data transfer per step—competitive with the compute time itself.

MoE: all-to-all bandwidth saturation

Mixture-of-Experts models are the worst case for interconnect bandwidth. With top-k=2 routing across 64 experts on 64 GPUs, every token dispatched to a remote expert triggers an all-to-all communication. For a single MoE layer with a batch of 2048 tokens, d_model=4096, and BF16: each token is 8 KB, and top-2 routing sends 2048 × 2 × 8 KB = 32 MB per layer. A Mixtral-style model with 32 MoE layers generates ~1 GB of all-to-all traffic per step. But all-to-all is far less bandwidth-efficient than allreduce—NCCL achieves only 40–60% of peak bandwidth on all-to-all patterns. On InfiniBand HDR, that 1 GB takes ~70–105 ms, repeated every step. This is why Google's TPU pods and NVIDIA's GB200 NVL72 (130 TB/s of aggregate NVLink bandwidth) exist: MoE models cannot train efficiently on commodity interconnects.
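
Both estimates can be reproduced with a short sketch (decimal GB; forward-pass activations only for the pipeline case; the MoE range applies the 40–60% all-to-all efficiency):

```python
# Interconnect traffic estimates for pipeline parallelism and MoE routing.
# HDR bandwidth and the NCCL all-to-all efficiency range come from the text.

HDR_GBPS = 25.0   # InfiniBand HDR: 200 Gbps

def pipeline_gb(micro_batches: int, mb_size: int, seq: int,
                d_model: int, stages: int, bytes_el: int = 2) -> float:
    act = mb_size * seq * d_model * bytes_el / 1e9   # one activation tensor
    return micro_batches * (stages - 1) * act        # forward pass only

def moe_gb(tokens: int, d_model: int, top_k: int,
           layers: int, bytes_el: int = 2) -> float:
    return tokens * top_k * d_model * bytes_el * layers / 1e9

pp = pipeline_gb(4, 4, 4096, 8192, 8)
print(f"pipeline: {pp:.1f} GB, {pp / HDR_GBPS * 1e3:.0f} ms")
# pipeline: 7.5 GB, 301 ms

moe = moe_gb(2048, 4096, 2, 32)
lo, hi = moe / (HDR_GBPS * 0.6), moe / (HDR_GBPS * 0.4)
print(f"moe: {moe:.1f} GB, {lo * 1e3:.0f}-{hi * 1e3:.0f} ms")
# moe: 1.1 GB, 72-107 ms
```

The decimal-GB figures land slightly above the rounded numbers in the text, but the conclusion is the same: both patterns spend hundreds of milliseconds per step on the wire.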

Infrastructure-level stragglers

This is where systems engineering becomes more important than ML engineering. Narayanan et al.'s PipeDream work at Microsoft Research showed that pipeline and data parallelism interact with hardware topology in non-obvious ways. The problem isn't a bug in your training code. It's a systems problem: thermal throttling on a GPU that's been running at 100% utilization for 12 hours (H100 throttles from 1980 MHz to ~1700 MHz under sustained thermal load—a 14% clock reduction), a cloud VM that shares physical cores with a noisy neighbor, or a network switch that's congested during peak hours.

MULTI-NODE STRAGGLER SOURCES

Thermal throttling — GPU clocks reduce 10–15% under sustained load, slowing one rank while all others wait at the next AllReduce
Noisy neighbors — shared cloud instances where CPU contention stalls DataLoader threads, starving the GPU between collectives
Network congestion — cross-rack traffic on 100 Gbps Ethernet vs. intra-rack NVLink creates 10x latency asymmetry for NCCL collectives
ECC memory errors — correctable ECC errors slow GPU memory access without crashing; check nvidia-smi -q -d ECC

Detection requires per-rank step time distribution analysis. Log step times per rank and look for outliers: if rank 5 is consistently 15% slower than the median, that node is the straggler. Mitigations include health checks before launching (GPU clock speeds via nvidia-smi, memory bandwidth tests via NVIDIA's nvbandwidth tool or dcgmi diag), using dedicated/bare-metal instances for multi-day jobs, and setting the process-group timeout (the timeout argument to torch.distributed.init_process_group) so a single slow rank surfaces as an error instead of a silent job-wide hang.
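
A minimal sketch of that per-rank analysis, assuming you already log step times per rank; the 15% threshold is a tunable assumption, not a standard:

```python
# Flag straggler ranks: any rank whose median step time exceeds the
# cross-rank median by a configurable threshold.

from statistics import median

def find_stragglers(step_times: dict[int, list[float]],
                    threshold: float = 1.15) -> list[int]:
    per_rank = {r: median(t) for r, t in step_times.items()}
    overall = median(per_rank.values())
    return [r for r, m in sorted(per_rank.items()) if m > threshold * overall]

logs = {0: [0.50, 0.51, 0.50],
        1: [0.50, 0.52, 0.49],
        2: [0.61, 0.63, 0.62],   # e.g. a thermally throttled node
        3: [0.50, 0.50, 0.51]}
print(find_stragglers(logs))   # [2]
```

Using medians rather than means keeps a single checkpoint spike or DataLoader hiccup from masquerading as a persistent straggler.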

A Checklist Before You Launch

Here's a quick reference for catching these stragglers before they cost you GPU hours.

| Straggler | What to Check | One-Line Fix |
|---|---|---|
| Bandwidth gap | MFU drop after GPU upgrade | Fuse kernels, use torch.compile |
| Gradient sync | Comm time as % of step time | Verify NVLink, tune bucket size |
| torch.compile | TORCH_LOGS="graph_breaks" | fullgraph=True, fix breaks |
| Attention | Attention forward time vs total | Use SDPA or FlashAttention-2 |
| FP8 underuse | BF16 on Hopper/Blackwell | TransformerEngine or MS-AMP |
| Checkpoint I/O | Step time spikes at intervals | Async checkpoint, NVMe staging |
| Multi-node | Per-rank step time distribution | Dedicated instances, health checks |

Fregly's book includes a checklist of 200+ items for production AI system performance—the table above is a starting point, not a replacement. For the comprehensive treatment, read the book.

Most Teams Find These After the Budget Is Gone

The pattern is always the same: a team launches a multi-day training job, notices it's slower than expected, and starts investigating two weeks in. By then they've already burned $10k–$50k on a misconfigured pipeline. The straggler was there from step 1, but nobody measured the right thing early enough.

Profiling after the fact means you're debugging a completed expense. What you need is a pre-flight check: profile the first 100 steps, surface the bottleneck breakdown, and catch the misconfiguration before it compounds over 100,000 steps.

Alloc does exactly this. It wraps your training command, profiles the first calibration window, and flags data pipeline starvation, gradient sync overhead, memory fragmentation, attention inefficiency, and mixed precision issues—automatically, without modifying your training code.

pip install alloc && alloc run python train.py

30 seconds of calibration. Full bottleneck report with actionable fix suggestions. Find the straggler before it finds your budget.

Sources

  • Chris Fregly, AI Systems Performance Engineering (O'Reilly, 2025)
  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," NeurIPS 2022
  • Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," SC'20
  • Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM," SC'21
  • Narayanan et al., "PipeDream: Generalized Pipeline Parallelism for DNN Training," SOSP'19
  • Maeng et al., "CPR: Understanding and Improving Failure Tolerant Training," MLSys 2021
  • Micikevicius et al., "FP8 Formats for Deep Learning," arXiv:2209.05433
  • NVIDIA H100, B200, GB200 NVL72 data sheets: TFLOPs, HBM bandwidth, NVLink specifications
  • NVIDIA CUTLASS 4.0 / CuTiles: cooperative tiling abstraction for GPU kernels
  • NVIDIA TransformerEngine: FP8 training, delayed scaling, per-tensor quantization
  • PyTorch documentation: torch.compile, TorchDynamo, TorchInductor, FSDP, DDP
  • PyTorch documentation: asynchronous distributed checkpointing (torch.distributed.checkpoint.async_save)
