By allocguy · February 2026 · 10 min read
"Which GPU should I use?" is one of the most common questions I see in ML engineering Slacks, Discord servers, and Reddit threads. The answer is always "it depends," which is technically correct and completely useless.
So here's my attempt at a practical framework. Not a vendor comparison chart that tells you the H100 is faster (no kidding), but a guide for actually deciding which GPU to rent for a specific workload, given real constraints like budget, availability, and what you're actually training.
In 2025-2026, the three GPUs most ML teams actually choose between are the A100 (40GB and 80GB variants), the H100 SXM, and increasingly, the L40S. There are other options (H200, A10G, L4, RTX 4090), but these three cover probably 80% of serious training workloads in the cloud. Let's break them down.
| Spec | A100-40GB | A100-80GB | H100 SXM | L40S |
|---|---|---|---|---|
| VRAM | 40 GB HBM2 | 80 GB HBM2e | 80 GB HBM3 | 48 GB GDDR6 |
| FP16 TFLOPS | 312 | 312 | 990 | 362 |
| BF16 TFLOPS | 312 | 312 | 990 | 362 |
| Memory BW | 1.6 TB/s | 2.0 TB/s | 3.35 TB/s | 864 GB/s |
| TDP | 400W | 400W | 700W | 350W |
| NVLink | Yes (600 GB/s) | Yes (600 GB/s) | Yes (900 GB/s) | No |
| Cloud $/hr | $1.50–3.10 | $2.00–3.90 | $3.00–4.90 | $0.80–1.50 |
Pricing ranges reflect on-demand rates across major cloud providers (AWS, GCP, Azure, Lambda, CoreWeave) as of early 2026. Spot/preemptible pricing can be 60-70% lower.
The A100-40GB is still the most widely available data center GPU in the cloud. It's been around since 2020 and supply is mature, which means availability is good and pricing is competitive. For a lot of workloads, this is the default choice, and that's not a bad thing.
Good for
- Full fine-tunes of 7B-class models and anything else that fits comfortably in 40 GB
- Teams that want short queues and competitive on-demand pricing
Watch out for
- Workloads that need more than 40 GB: once you're sharding across multiple GPUs, the price advantage largely disappears
The 80GB variant of the A100 is the same chip, same compute, just double the VRAM and slightly higher memory bandwidth (2.0 TB/s vs 1.6 TB/s). The extra 40GB of headroom is often worth the ~30% price premium over the 40GB variant, because it means the difference between "fits on one GPU" and "needs multi-GPU sharding."
Good for
- Full fine-tunes of 13B-class models and other jobs that need more than 40 GB on a single GPU
- Memory-bandwidth-bound workloads, where 2.0 TB/s of HBM matters more than raw TFLOPS
Watch out for
- Paying the ~30% premium for headroom you don't use: if the job fits in 40 GB, the cheaper variant wins
The H100 SXM is the flagship. 3.2x the FP16/BF16 compute of the A100, 1.7x the memory bandwidth, and 50% more NVLink bandwidth for multi-GPU communication. On paper, it's the obvious choice. In practice, whether that performance premium translates to actual savings depends entirely on your workload.
Good for
- Pre-training and other compute-bound workloads, where throughput drives total cost
- Large-model (70B+) multi-GPU training that leans on the 900 GB/s NVLink fabric
Watch out for
- Availability: queue times and hourly rates are the highest of the three
- Workloads that aren't compute-bound, where you pay the premium without getting the speedup
The L40S is the GPU nobody talks about but a lot of teams are quietly adopting. 48GB of VRAM at roughly half the hourly cost of an A100-40GB. The catch? It uses GDDR6 instead of HBM, so memory bandwidth is significantly lower (864 GB/s vs 1.6–3.35 TB/s for the HBM cards). That tradeoff matters, but for the right workloads, it's a great deal.
Good for
- LoRA fine-tunes of 7–13B models, experimentation, and budget-conscious inference serving
- The "trying things" phase, where cheap failures matter more than peak throughput
Watch out for
- Memory-bandwidth-bound training: 864 GB/s of GDDR6 is a real ceiling
- Multi-GPU jobs: there's no NVLink, so communication goes over PCIe and eats into scaling
ALLOCGUY'S TAKE
GPU comparison articles usually list specs and stop. Here's the decision process I actually use when helping teams pick hardware:
1. Start with VRAM: what does your workload actually need?
Not "what GPU sounds right," but: what is the actual peak VRAM your job will consume? This includes model parameters, optimizer states, gradients, activations, and framework overhead. For a 7B model in bf16 with AdamW, that's roughly 14GB (params) + 28GB (optimizer) + activations. You can estimate this with a static analysis tool like alloc ghost before you spend a dollar on GPU time.
2. Check the bottleneck: compute, memory, or budget?
- Compute-bound (GPU utilization near 100%, waiting on math) → H100. The 3.2x compute premium is worth it.
- Memory-bandwidth-bound (GPU utilization moderate, large model weights constantly shuffling) → A100-80GB. HBM bandwidth matters more than raw TFLOPS here.
- Budget-bound (need VRAM, don't need maximum throughput) → L40S. 48GB at $0.80–1.50/hr is hard to beat. (A rough sketch of this triage as code follows below.)
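To make step 2 concrete, here's a toy triage function. It assumes you've already profiled a few training steps and have rough SM-utilization and memory-bandwidth-utilization numbers (from `nvidia-smi dmon`, Nsight, or your framework's profiler); the thresholds are illustrative assumptions, not official guidance.

```python
# Toy triage for "which GPU class fits this job?".
# Inputs are fractions in [0, 1] from profiling a few steps.

def suggest_gpu(sm_util: float, mem_bw_util: float, budget_bound: bool = False) -> str:
    if budget_bound:
        return "L40S: you need VRAM and low $/hr more than peak throughput"
    if sm_util >= 0.9:
        return "H100: compute-bound, the ~3.2x FLOPS premium should pay for itself"
    if mem_bw_util >= 0.7 and sm_util < 0.7:
        return "A100-80GB: bandwidth-bound, HBM matters more than raw TFLOPS"
    return "Unclear: profile longer before committing to expensive hardware"

print(suggest_gpu(sm_util=0.95, mem_bw_util=0.50))
print(suggest_gpu(sm_util=0.55, mem_bw_util=0.85))
```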
3. Then check total cost, not hourly cost.
This is the part most people get wrong. Total training cost = $/hr × hours. A faster GPU at a higher hourly rate can be cheaper overall if it finishes significantly sooner.
EXAMPLE: TRAINING A 7B MODEL FOR 1,000 STEPS
| GPU | $/hr | Time | Total Cost |
|---|---|---|---|
| H100 SXM | $4.00 | ~2 hours | $8.00 |
| A100-80GB | $2.50 | ~5 hours | $12.50 |
| L40S | $1.00 | ~7 hours | $7.00 |
The "expensive" H100 costs less than the A100 here, because it finishes 2.5x sooner. But the L40S is cheapest of all if your workload isn't bandwidth-bound and you can wait 7 hours. The right answer depends on whether you value wall clock time or dollars more. Both are valid.
GPU comparison articles focus on specs and hourly pricing. But there are real costs that never show up in the spec sheet:
Queue time
H100s are still the hardest to get. If you're waiting 2 hours in a queue before your job starts, that's 2 hours of wall-clock time you didn't account for. A100s and L40S typically have much shorter queue times.
Multi-GPU overhead
NVLink vs PCIe matters enormously for multi-GPU training. Scaling from 1 to 8 GPUs on NVLink-connected H100s gives near-linear speedup. On PCIe-connected L40S GPUs, communication overhead can eat 30–50% of your theoretical gains.
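A quick way to reason about that overhead is to treat scaling efficiency as the fraction of the ideal speedup each additional GPU actually delivers. The efficiency values below are assumptions chosen to match the 30–50% range above, not measurements.

```python
# Effective speedup when communication eats part of the ideal N-GPU gain.
# "efficiency" = fraction of each extra GPU's ideal contribution you keep.

def effective_speedup(n_gpus: int, efficiency: float) -> float:
    return 1 + (n_gpus - 1) * efficiency  # 1 GPU is the baseline

for label, eff in [("NVLink H100, near-linear", 0.95),
                   ("PCIe L40S, heavy comm overhead", 0.60)]:
    print(f"{label}: 8 GPUs -> {effective_speedup(8, eff):.1f}x (ideal: 8x)")
```

At 5.2x on 8 GPUs you're effectively paying for almost three idle GPUs' worth of capacity, which can tip the total-cost math back toward NVLink hardware even at a higher hourly rate.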
Spot vs. on-demand
Spot pricing can cut costs 60–70%, but with preemption risk. An interrupted 8-hour A100 run that has to redo hours of work from a stale checkpoint can end up costing more, in dollars and in wall-clock time, than an on-demand run that finishes uninterrupted. Factor in your checkpointing strategy.
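One way to factor that in is to model expected cost as the hourly rate times the run length plus the expected rework from preemptions. Everything here (preemption rate, checkpoint interval, prices) is an assumption chosen to illustrate the shape of the tradeoff, not data.

```python
# Expected cost of a run with preemptions, assuming preemptions arrive at a
# constant hourly rate and each one loses, on average, half a checkpoint
# interval of work. All numbers are illustrative assumptions.

def expected_cost(rate_per_hr: float, run_hours: float,
                  preempts_per_hr: float = 0.0, ckpt_interval_hr: float = 1.0) -> float:
    expected_preemptions = preempts_per_hr * run_hours
    rework_hours = expected_preemptions * ckpt_interval_hr / 2
    return rate_per_hr * (run_hours + rework_hours)

on_demand = expected_cost(rate_per_hr=2.50, run_hours=8)   # A100-80GB on-demand
print(f"on-demand: ${on_demand:.2f}")
for ckpt in (0.5, 4.0):                                     # checkpoint every 30 min vs 4 h
    spot = expected_cost(rate_per_hr=0.90, run_hours=8,
                         preempts_per_hr=0.25, ckpt_interval_hr=ckpt)
    print(f"spot, checkpoint every {ckpt} h: ${spot:.2f}")
```

With a deep spot discount the dollar math often still favors spot, but the gap narrows fast as checkpoints get sparser or preemption rates climb, and the rework hours are also wall-clock time you sit through.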
Failed run cost
When you're experimenting, every OOM or config error costs you a full provisioning cycle. On cheaper GPUs, failures are cheaper. If you're in the "trying things" phase, the L40S at $1/hr makes your experiments 3x cheaper to fail.
GPU cloud pricing has been volatile, so treat the ranges above as a snapshot rather than a promise.
The broader trend: GPU compute is getting cheaper per FLOP, but models are growing faster than prices are dropping. The real savings come not from finding the cheapest GPU, but from making sure you're using the right one for your specific workload.
| Workload | Best Fit | Why |
|---|---|---|
| LoRA fine-tune 7–13B | L40S | 48GB fits comfortably, lowest cost |
| Full fine-tune 7B | A100-40GB | Tight; needs memory-saving tricks, but best $/perf |
| Full fine-tune 13B | A100-80GB | Needs >40GB; HBM bandwidth helps |
| Pre-training (any size) | H100 | Throughput is everything; lower total cost |
| Large model (70B+) | H100 | NVLink bandwidth for multi-GPU sharding |
| Experimentation / iteration | L40S | Cheapest to fail on |
| Inference serving | L40S / A100-40GB | Depends on latency requirements |
Run `alloc ghost your_model.py` to estimate VRAM requirements before you provision anything. Or use `alloc run` to measure actual GPU utilization on whatever hardware you already have access to. Both work offline, take under a minute, and don't modify your training code.