A100 vs H100 vs L40S: Which GPU for Your Training Job?

By allocguy · February 2026 · 10 min read

"Which GPU should I use?" is one of the most common questions I see in ML engineering Slacks, Discord servers, and Reddit threads. The answer is always "it depends," which is technically correct and completely useless.

So here's my attempt at a practical framework. Not a vendor comparison chart that tells you the H100 is faster (no kidding), but a guide for actually deciding which GPU to rent for a specific workload, given real constraints like budget, availability, and what you're actually training.

In 2025-2026, the three GPUs most ML teams actually choose between are the A100 (40GB and 80GB variants), the H100 SXM, and increasingly, the L40S. There are other options (H200, A10G, L4, RTX 4090), but these three cover probably 80% of serious training workloads in the cloud. Let's break them down.

The Specs, Side by Side

Spec              | A100-40GB      | A100-80GB      | H100 SXM       | L40S
VRAM              | 40 GB HBM2     | 80 GB HBM2e    | 80 GB HBM3     | 48 GB GDDR6
FP16/BF16 TFLOPS  | 312            | 312            | 990            | 362
Memory BW         | 1.6 TB/s       | 2.0 TB/s       | 3.35 TB/s      | 864 GB/s
TDP               | 400W           | 400W           | 700W           | 350W
NVLink            | Yes (600 GB/s) | Yes (600 GB/s) | Yes (900 GB/s) | No
Cloud $/hr        | $1.50–3.10     | $2.00–3.90     | $3.00–4.90     | $0.80–1.50

Pricing ranges reflect on-demand rates across major cloud providers (AWS, GCP, Azure, Lambda, CoreWeave) as of early 2026. Spot/preemptible pricing can be 60-70% lower.

A100-40GB: The Reliable Workhorse

The A100-40GB is still the most widely available data center GPU in the cloud. It's been around since 2020 and supply is mature, which means availability is good and pricing is competitive. For a lot of workloads, this is the default choice, and that's not a bad thing.

Good for

  • Full fine-tuning of 7B-parameter models in fp16/bf16. Tight on memory, but workable with gradient checkpointing and a memory-efficient (e.g., 8-bit) optimizer.
  • LoRA and QLoRA on 13B+ models where you only need to fit the base model weights plus a small number of trainable parameters
  • Inference serving for most production models
  • Best price-to-performance ratio for mid-size training where you're not memory-starved

Watch out for

  • The 40GB ceiling gets real fast. Full fine-tuning a 13B model in bf16 needs roughly 50–60GB of VRAM even with a lean recipe (parameters + gradients + optimizer states + activations). You'll OOM unless you shard across multiple GPUs.
  • Batch sizes are constrained. If your workload benefits from large batches (e.g., contrastive learning), 40GB may force you into gradient accumulation, which slows things down.
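If accumulation is the price of staying on 40GB, the pattern itself is only a few lines. A minimal PyTorch sketch with stand-in model and data (your real loop will differ):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-ins for your own model, optimizer, and dataloader.
model = torch.nn.Linear(4096, 4096).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(8, 4096), torch.randn(8, 4096)) for _ in range(32)]

accum_steps = 4  # effective batch = per-device batch x accum_steps

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()   # scale so gradients average over the effective batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Accumulation trades wall-clock time for memory: you get the statistical effect of the larger batch while only ever holding one micro-batch of activations.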

A100-80GB: When 40GB Isn't Enough

The 80GB variant of the A100 is the same chip with the same compute, just double the VRAM and about 25% higher memory bandwidth (2.0 TB/s vs 1.6 TB/s). The extra 40GB of headroom is often worth the ~30% price premium over the 40GB variant, because it can be the difference between "fits on one GPU" and "needs multi-GPU sharding."

Good for

  • Full fine-tuning 7–13B models with room to breathe. No need to micro-optimize batch sizes or enable every memory-saving trick.
  • Pre-training runs where memory bandwidth matters and you want the 2.0 TB/s HBM2e
  • Large batch training that would OOM on 40GB
  • Multi-GPU setups where you want fewer GPUs (e.g., 2x80GB instead of 4x40GB) to reduce communication overhead

Watch out for

  • Defaulting to 80GB "just in case" when your workload only needs 28GB. I see this constantly. Run a quick VRAM estimate first. You might be paying a 30% premium for VRAM you never touch.
  • The compute is identical to the 40GB variant. If you're compute-bound (not memory-bound), the 80GB version won't train faster. It just has more headroom.

H100: The Current King

The H100 SXM is the flagship. 3.2x the FP16/BF16 compute of the A100, 1.7x the memory bandwidth, and 50% more NVLink bandwidth for multi-GPU communication. On paper, it's the obvious choice. In practice, whether that performance premium translates to actual savings depends entirely on your workload.

Good for

  • Pre-training runs where throughput matters most. The faster you finish, the less total you spend on GPU-hours
  • Large model training (70B+ parameters) where inter-GPU bandwidth is a real bottleneck for FSDP/tensor parallelism, and the extra NVLink headroom pays off
  • Compute-bound workloads that can actually saturate the 990 TFLOPS of BF16 throughput
  • FP8 training. The H100's Transformer Engine enables FP8 precision, which can nearly double throughput for supported architectures
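For reference, here's roughly what opting into FP8 looks like with NVIDIA's Transformer Engine, assuming the transformer_engine package is installed and you're on Hopper hardware. The recipe values below are illustrative defaults, not tuned settings:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# te.Linear is a drop-in, FP8-capable replacement for torch.nn.Linear (Hopper GPUs).
layer = te.Linear(4096, 4096, bias=True)
x = torch.randn(32, 4096, device="cuda")

# HYBRID = E4M3 for the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)

out.sum().backward()
```

In practice you'd swap te.Linear (or te.TransformerLayer) into your model and wrap the forward pass in fp8_autocast; optimizer states and master weights stay in higher precision.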

Watch out for

  • The price premium. If your workload is memory-bound (waiting on data to move, not waiting on compute), the 3.2x compute advantage doesn't help. You're paying H100 prices for A100-class performance.
  • Data pipeline bottlenecks. If your DataLoader can't keep the GPU fed, spending more on faster GPUs is like buying a faster car for a road with a 30 mph speed limit.
  • Availability. H100s are still harder to get than A100s in many regions. Queue times can eat into your wall-clock savings.

L40S: The Dark Horse

The L40S is the GPU nobody talks about but a lot of teams are quietly adopting. 48GB of VRAM at roughly half the hourly cost of an A100-40GB. The catch? It uses GDDR6 instead of HBM, so memory bandwidth is significantly lower (864 GB/s vs 2.0+ TB/s). That tradeoff matters, but for the right workloads, it's a great deal.

Good for

  • LoRA and QLoRA fine-tuning. 48GB comfortably fits most 7–13B LoRA setups, and LoRA is less bandwidth-intensive than full fine-tuning (a minimal setup is sketched after this list)
  • Inference workloads where latency requirements aren't extreme and you want maximum VRAM per dollar
  • Development and experimentation where cost matters more than wall-clock speed
  • Single-GPU training where NVLink doesn't matter
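To make the LoRA point concrete, here's a minimal setup with Hugging Face peft. The checkpoint name and hyperparameters are illustrative, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint -- substitute whatever 7-13B base model you actually use.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # typically well under 1% of the base model
```

Only the adapter weights train, so optimizer state is tiny; the bulk of the 48GB goes to the frozen base weights and activations.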

Watch out for

  • Memory bandwidth. At 864 GB/s, bandwidth-heavy operations (large matrix multiplications, attention layers on long sequences) will bottleneck much sooner than on HBM-based GPUs. If your workload is memory-bandwidth-bound, the L40S will feel sluggish.
  • No NVLink. Multi-GPU training is limited to PCIe bandwidth (~64 GB/s bidirectional), which is 10–14x slower than NVLink on A100/H100. FSDP and tensor parallelism across L40S GPUs will have significant communication overhead; a quick way to measure it yourself is sketched after this list.
  • Compute-heavy full fine-tuning. The FP16 throughput (362 TFLOPS) is only ~16% faster than an A100, so you're not getting meaningfully more compute, just more VRAM at a lower price.
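If you'd rather see the interconnect difference than trust the spec sheet, a simple all-reduce timing loop does it. A sketch meant to be launched with torchrun; the 1 GiB tensor size and iteration counts are arbitrary choices:

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 1 GiB of fp16: large enough that link bandwidth, not latency, dominates.
tensor = torch.ones(512 * 1024 * 1024, dtype=torch.float16, device="cuda")

for _ in range(5):                      # warmup
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
per_iter = (time.perf_counter() - start) / iters

gb = tensor.numel() * tensor.element_size() / 1e9
if dist.get_rank() == 0:
    # Effective throughput, not exact bus bandwidth (that depends on the NCCL algorithm).
    print(f"all_reduce of {gb:.1f} GB: {per_iter * 1000:.1f} ms  (~{gb / per_iter:.0f} GB/s)")
dist.destroy_process_group()
```

NVLink-connected A100/H100 boxes typically report an order of magnitude more effective bandwidth here than PCIe-connected L40S boxes, and that gap is exactly the overhead FSDP and tensor parallelism will feel.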

The Real Decision Framework

ALLOCGUY'S TAKE

GPU comparison articles usually list specs and stop. Here's the decision process I actually use when helping teams pick hardware:

1. Start with VRAM: what does your workload actually need?

Not "what GPU sounds right," but: what is the actual peak VRAM your job will consume? This includes model parameters, optimizer states, gradients, activations, and framework overhead. For a 7B model in bf16 with AdamW, that's roughly 14GB (params) + 28GB (optimizer) + activations. You can estimate this with a static analysis tool like alloc ghost before you spend a dollar on GPU time.

2. Check the bottleneck: compute, memory, or budget?

  • Compute-bound (GPU utilization near 100%, waiting on math) → H100. The 3.2x compute premium is worth it.
  • Memory-bandwidth-bound (GPU utilization moderate, large model weights constantly shuffling) → A100-80GB. HBM bandwidth matters more than raw TFLOPS here.
  • Budget-bound (need VRAM, don't need maximum throughput) → L40S. 48GB at $0.80–1.50/hr is hard to beat.
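The compute-versus-bandwidth call usually takes a profiler, but you can rule out the most common impostor, an input-pipeline bottleneck, with a few lines of timing. A rough sketch; model, loader, optimizer, and loss_fn are stand-ins for your own training objects, and it assumes a CUDA device:

```python
import time
import torch

def profile_step_split(model, loader, optimizer, loss_fn, n_steps=50):
    """Rough split of wall-clock time between waiting on data and running the GPU step."""
    data_time = gpu_time = 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        x, y = next(it)                                  # waiting on the input pipeline
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        t1 = time.perf_counter()

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()                         # make GPU work show up in wall time
        t2 = time.perf_counter()

        data_time += t1 - t0
        gpu_time += t2 - t1

    total = data_time + gpu_time
    print(f"data wait: {100 * data_time / total:.0f}%   gpu step: {100 * gpu_time / total:.0f}%")
```

If the data-wait share is large, no GPU upgrade will help until the input pipeline is fixed; if the GPU step dominates and nvidia-smi shows utilization pinned near 100%, you're in H100 territory.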

3. Then check total cost, not hourly cost.

This is the part most people get wrong. Total training cost = $/hr × hours. A faster GPU at a higher hourly rate can be cheaper overall if it finishes significantly sooner.

EXAMPLE: TRAINING A 7B MODEL FOR 1,000 STEPS

GPU        | $/hr  | Time     | Total Cost
H100 SXM   | $4.00 | ~2 hours | $8.00
A100-80GB  | $2.50 | ~5 hours | $12.50
L40S       | $1.00 | ~7 hours | $7.00

The "expensive" H100 costs less than the A100 here, because it finishes 2.5x sooner. But the L40S is cheapest of all if your workload isn't bandwidth-bound and you can wait 7 hours. The right answer depends on whether you value wall clock time or dollars more. Both are valid.

The Hidden Costs Nobody Mentions

GPU comparison articles focus on specs and hourly pricing. But there are real costs that never show up in the spec sheet:

Queue time

H100s are still the hardest to get. If you're waiting 2 hours in a queue before your job starts, that's 2 hours of wall-clock time you didn't account for. A100s and L40S typically have much shorter queue times.

Multi-GPU overhead

NVLink vs PCIe matters enormously for multi-GPU training. Scaling from 1 to 8 GPUs on NVLink-connected H100s gives near-linear speedup. On PCIe-connected L40S GPUs, communication overhead can eat 30–50% of your theoretical gains.

Spot vs. on-demand

Spot pricing can cut costs 60–70%, but with preemption risk. An interrupted 8-hour A100 run that needs to restart from a checkpoint is more expensive than an on-demand run that finishes uninterrupted. Factor in your checkpointing strategy.
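One way to sanity-check the spot decision is to put rough numbers on it. The sketch below is a simplified expected-cost model under stated assumptions (a flat discount, an expected number of preemptions, and an average of half a checkpoint interval of lost work per preemption); the helper name and inputs are illustrative, not a real API:

```python
def expected_spot_cost(
    on_demand_rate: float,          # $/hr on-demand
    spot_discount: float,           # e.g. 0.65 -> spot costs 35% of on-demand
    run_hours: float,               # uninterrupted runtime
    preemptions: float,             # expected interruptions during the run
    checkpoint_interval_hr: float,  # on average ~half of this is lost per interruption
) -> float:
    """Rough expected cost of a spot run, counting re-done work after preemptions."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    lost_hours = preemptions * (checkpoint_interval_hr / 2)
    return spot_rate * (run_hours + lost_hours)

on_demand = 2.50 * 8                # 8h A100-80GB on-demand: $20.00
spot = expected_spot_cost(2.50, 0.65, 8, preemptions=2, checkpoint_interval_hr=1.0)
print(f"on-demand ${on_demand:.2f} vs expected spot ${spot:.2f}")
```

Spot wins comfortably in this toy example; the picture changes when checkpoints are rare, preemptions are frequent, or requeueing and re-provisioning add hours of their own.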

Failed run cost

When you're experimenting, every OOM or config error costs you a full provisioning cycle. On cheaper GPUs, failures are cheaper. If you're in the "trying things" phase, the L40S at $1/hr makes your experiments 3x cheaper to fail.

The GPU Pricing Landscape in 2025–2026

GPU cloud pricing has been volatile. A few things worth knowing:

  • Prices vary 2–3x between providers for the same GPU. An A100-80GB might be $2.00/hr on one provider and $3.90/hr on another. Always check multiple providers. Lambda, CoreWeave, RunPod, and Vast.ai tend to undercut the hyperscalers significantly.
  • A100 pricing has stabilized as supply caught up with demand. This is arguably the best value tier right now: mature hardware, wide availability, well-understood performance characteristics.
  • H100 pricing is trending down but still carries a premium. As H200s and B100/B200 GPUs come online, expect H100 prices to drop further. If your timeline is flexible, waiting 3–6 months could save you 20–30%.
  • L40S is the sleeper pick for teams doing LoRA fine-tuning and inference. It's often available when everything else is sold out, and the $/VRAM ratio is unbeatable.

The broader trend: GPU compute is getting cheaper per FLOP, but models are growing faster than prices are dropping. The real savings come not from finding the cheapest GPU, but from making sure you're using the right one for your specific workload.

Quick Reference: Which GPU for What

Workload                    | Best Fit         | Why
LoRA fine-tune 7–13B        | L40S             | 48GB fits comfortably, lowest cost
Full fine-tune 7B           | A100-40GB        | Tight but works; best $/perf
Full fine-tune 13B          | A100-80GB        | Needs >40GB; HBM bandwidth helps
Pre-training (any size)     | H100             | Throughput is everything; lower total cost
Large model (70B+)          | H100             | NVLink bandwidth for multi-GPU sharding
Experimentation / iteration | L40S             | Cheapest to fail on
Inference serving           | L40S / A100-40GB | Depends on latency requirements

Not sure which GPU fits your workload?

Run alloc ghost your_model.py to estimate VRAM requirements before you provision anything. Or use alloc run to measure actual GPU utilization on whatever hardware you already have access to. Both work offline, take under a minute, and don't modify your training code.
