We Wasted $12k on GPUs in One Month. Here's What We Learned.

By allocguy · February 2026 · 8 min read

Last quarter we decided to track every dollar of GPU spend across our team. Not the monthly invoice total. We already knew that number was bad. We wanted to know where the money actually went: which runs, which instances, which decisions burned through budget.

The answer, over one particularly painful month, was $12,000 in wasted GPU spend. Not total spend, wasted spend. Money that bought us zero training progress. Zero model improvement. Zero completed experiments.

The utilization numbers were embarrassing. We were averaging 22% GPU utilization across our fleet. Some instances were running at single-digit utilization for hours. And we're not a particularly careless team. We'd just never had the visibility to know how bad it was.

Here's the breakdown of where that $12k went, and what we eventually did about it.

OOM failures: $2,800 wasted

Out-of-memory crashes were our most expensive surprise. Not because any single OOM is costly. It's because they compound. We averaged 3–4 OOM runs per experiment before finding a configuration that actually fit in VRAM.

Each failed attempt looked the same: 15–20 minutes of provisioning and queue time on the cloud provider, then a crash somewhere between step 0 and step 50. At $3.50/hr per A100-80GB, that's $1–2 per GPU burned on each failed attempt before you even see a gradient, and most of those attempts ran on our default multi-GPU nodes. Across 40+ experiments in a month, it added up to $2,800.

The frustrating part: every one of those OOMs was preventable. We just didn't know our VRAM footprint before launching. We were guessing batch sizes, guessing sequence lengths, guessing how much overhead the optimizer state would add. And every wrong guess cost us money and time.
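
Here's the kind of back-of-envelope math we should have been doing before every launch. A rough sketch, assuming mixed-precision full fine-tuning with AdamW; the per-component byte counts are standard rules of thumb, not measurements from our runs.

# Rough VRAM estimate for full fine-tuning a 7B-parameter model
# with AdamW in mixed precision (bf16 weights, fp32 optimizer state).
params = 7e9
weights = params * 2        # bf16 working copy for forward/backward
grads = params * 2          # bf16 gradients
optimizer = params * 4 * 3  # fp32 master weights + Adam first/second moments
static_gb = (weights + grads + optimizer) / 1e9
print(f"~{static_gb:.0f} GB before activations")  # ~112 GB: no single 80GB card holds it

That's before activations, which grow with batch size and sequence length. A LoRA run only carries gradients and optimizer state for the adapter weights, which is why a single 24GB card usually covers a 7B fine-tune.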

Over-provisioned instances: $4,500 wasted

This was the biggest line item and the easiest mistake to make. We had a team default: 8xA100-80GB for everything. Fine-tuning a 70B model? 8xA100. Running a 7B LoRA adapter? Also 8xA100. Quick experiment with a 1.3B model? You guessed it: 8xA100.

We were paying ~$28/hr for multi-GPU nodes when a single A10G at $0.75/hr would have been plenty. For LoRA fine-tunes of 7B models, a single 24GB card is almost always enough. But nobody on the team wanted to be the person whose run OOMed because they "cheaped out" on hardware. So we all over-provisioned. The "just in case" tax.
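
To put per-run numbers on it, here's a quick cost comparison. The 12-hour duration and 20 runs a month are illustrative assumptions, not figures from our bill; the hourly rates are the ones quoted above.

# Monthly cost of hypothetical 12-hour LoRA fine-tunes, 20 per month,
# on the two setups discussed above.
hours_per_run, runs_per_month = 12, 20
print(f"8xA100 node: ${28.00 * hours_per_run * runs_per_month:,.0f}/month")  # $6,720
print(f"1x A10G:     ${0.75 * hours_per_run * runs_per_month:,.0f}/month")   # $180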

Over the month, we estimated $4,500 went to GPUs that were either completely idle within a node (unused cards in a multi-GPU allocation) or dramatically oversized for the workload they were running.

Idle GPU time: $3,200 wasted

This one was sneaky. It wasn't about launching the wrong instance. It was about what happened during a run. DataLoader bottlenecks where the GPU sat at 5–15% utilization while the CPU churned through preprocessing. Slow tokenization. Image augmentation pipelines that couldn't keep up with a fast GPU.

Then there were the notebook sessions. Someone would spin up a GPU instance for interactive development, run a few cells, go to lunch, come back, run a few more cells, go home. An 8-hour session where the GPU did maybe 40 minutes of actual work. We found instances left running overnight. Entire weekends, in one case. GPU at 0% utilization.

$3,200 in a month, just in GPU time where the hardware was technically allocated to us but doing essentially nothing.
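
A dumb watchdog would have caught most of this. Here's a minimal sketch of the kind of check worth running on long-lived instances; the 5% threshold and one-hour window are arbitrary choices, and it only prints a warning rather than shutting anything down.

import subprocess, time

# Poll nvidia-smi once a minute; warn if utilization stays below 5%
# across every GPU on the node for a full hour.
IDLE_THRESHOLD, WINDOW_MINUTES = 5, 60
idle_minutes = 0
while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    util = max(int(x) for x in out.split())  # busiest GPU on the node
    idle_minutes = idle_minutes + 1 if util < IDLE_THRESHOLD else 0
    if idle_minutes >= WINDOW_MINUTES:
        print("GPU idle for an hour; consider shutting this instance down")
        idle_minutes = 0
    time.sleep(60)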

The ablation tax: $1,500 wasted

Every experiment needs some trial-and-error. You try different batch sizes, learning rate schedules, warmup steps, gradient accumulation settings. Some of those are hyperparameter searches. Others are just trying to find a configuration that runs at all on a given GPU.

We tracked 10–30 trial runs per experiment across the team. Most of these were going to fail, but we didn't know which ones ahead of time. Runs that OOMed on step 200 after ten minutes of training. Runs that technically worked but were so slow it was clear the config was wrong. Runs where gradient accumulation was set so high that the per-GPU micro-batch shrank to nothing and throughput tanked.

It was $1,500 in GPU time spent learning things we could have known before hitting "launch."
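
For reference, the arithmetic behind those throughput-tanking runs is simple data parallelism. A small sketch with illustrative numbers, not values from our configs.

def effective_batch(micro_batch_per_gpu: int, grad_accum_steps: int, num_gpus: int) -> int:
    # Standard data-parallel relationship; nothing specific to any framework.
    return micro_batch_per_gpu * grad_accum_steps * num_gpus

# Same effective batch of 256, very different GPU occupancy per step:
print(effective_batch(32, 8, 1))   # 256, healthy micro-batch
print(effective_batch(1, 256, 1))  # 256, GPU starved on every forward pass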

Why this happens

The root cause is simple: nobody has visibility into GPU utilization at the workload level.

Cloud provider dashboards show instance-level metrics. You can see that your A100-80GB instance is running, that the GPU is drawing power, that memory is allocated. But you can't see that your training loop is only using 14GB out of 80GB of VRAM. You can't see that your DataLoader is the bottleneck and the GPU is idle 70% of the time waiting for data. You can't see that switching from a 4-GPU setup to a single GPU with gradient accumulation would give you the same throughput at a quarter of the cost.
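
The workload-level numbers are not hard to get from inside the training loop; the dashboards just don't surface them. A minimal sketch with plain PyTorch, assuming a CUDA device is available:

import torch

# What the cloud dashboard won't show: how much of the card the
# training loop has actually touched so far.
device = torch.device("cuda:0")
total = torch.cuda.get_device_properties(device).total_memory
peak = torch.cuda.max_memory_allocated(device)  # peak tensor allocations
print(f"peak VRAM: {peak / 1e9:.1f} GB of {total / 1e9:.1f} GB "
      f"({100 * peak / total:.0f}%)")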

Teams don't know their VRAM requirements before they launch, so they guess high. And once a run is going, there's no feedback loop to tell you the hardware is wrong. Unless it OOMs, which only tells you the hardware was too small.

There's an information asymmetry built into the entire GPU cloud ecosystem. Providers have no incentive to tell you you're over-provisioned. You only find out when you audit your bill.

What we changed

Our playbook: four changes that cut our monthly GPU spend by roughly half.

1. Pre-flight VRAM checks before every run

Before launching any training job, we started running a static scan of the model architecture and training config to estimate VRAM requirements. No GPU needed, just a quick analysis of parameter counts, optimizer states, activation memory, and gradient buffers. This alone eliminated most of our OOM failures. If the estimate says your config needs 62GB of VRAM and you're targeting a 24GB card, you know before you spend a dime.
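
The core of the check is nothing more than comparing an estimate against the card you plan to rent. A simplified sketch of the idea rather than any tool's actual implementation; the hard-coded VRAM table and the 10% headroom margin are placeholder choices.

# Compare a static VRAM estimate against candidate GPUs, keeping headroom.
GPU_VRAM_GB = {"A10G": 24, "L4": 24, "A100-40GB": 40, "A100-80GB": 80}

def fits(estimated_gb: float, gpu: str, headroom: float = 0.10) -> bool:
    return estimated_gb <= GPU_VRAM_GB[gpu] * (1 - headroom)

estimate_gb = 62.0  # from whatever estimator you trust
for gpu in GPU_VRAM_GB:
    print(f"{gpu:>10}: {'fits' if fits(estimate_gb, gpu) else 'will not fit'}")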

2. 60-second calibration runs

For jobs where we needed real utilization data (not just VRAM estimates), we added a short calibration step. Launch the training, let it run for about 60 seconds until GPU metrics stabilize, then capture the actual utilization profile: VRAM usage, SM activity, memory bandwidth, throughput. This gives you a ground-truth picture of what the workload actually needs before you commit to a 12-hour training run.
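
The calibration itself needs nothing exotic, just a metrics sampler. A minimal sketch assuming the pynvml bindings are installed; the one-second interval and 60-sample window mirror the description above, not anything canonical.

import time
import pynvml

# Sample GPU utilization and VRAM once a second for ~60 seconds
# while the job warms up, then report the profile.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util, mem = [], []
for _ in range(60):
    util.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    mem.append(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
    time.sleep(1)
pynvml.nvmlShutdown()

print(f"mean util: {sum(util) / len(util):.0f}%  peak VRAM: {max(mem) / 1e9:.1f} GB")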

3. Right-sizing: smallest viable GPU

Once we had real utilization data, we started matching workloads to the smallest GPU that could run them comfortably. LoRA fine-tunes went from 8xA100 to 1xA10G. Medium-sized training jobs went from A100 to L4 or A10G. We only pulled out the big hardware when the numbers justified it.

4. Bottleneck detection

We started monitoring for DataLoader bottlenecks and idle GPU patterns during runs. If the GPU is sitting below 30% utilization for sustained periods, something's wrong upstream, usually a slow DataLoader or an I/O-bound preprocessing step. Fixing these is often free and can double effective throughput on the same hardware.
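
The fix is usually a handful of DataLoader arguments. A typical starting point in PyTorch; the worker count and prefetch depth are starting guesses to tune, not magic numbers.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; swap in your real Dataset here.
dataset = TensorDataset(torch.zeros(1024, 3, 224, 224))

# Typical knobs for a starved GPU: parallel workers, pinned host memory,
# workers kept alive between epochs, and a deeper prefetch queue.
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # start near the number of physical CPU cores
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid respawning workers every epoch
    prefetch_factor=4,        # batches each worker keeps queued ahead
)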

Result: 40–60% reduction in monthly GPU spend. Our runs started succeeding on the first attempt instead of the fourth. We stopped defaulting to the biggest instance available. And we actually knew what our workloads needed before we launched them.

The bigger picture

Our $12k is a rounding error compared to what the industry is spending. NVIDIA reported $47.5 billion in data center revenue for fiscal year 2024, a 217% year-over-year increase. That money is coming from somewhere, and a meaningful chunk of it is paying for underutilized hardware.

Training GPT-3 famously cost around $4.6 million in compute. Frontier models today are estimated at $50–100 million or more. Meta announced over $30 billion in AI infrastructure spending for 2024 alone. The scale is staggering and it keeps growing.

The energy side is just as stark. The IEA projects global data center electricity consumption to reach 1,000 TWh by 2026. A single H100 draws 700W at peak. A 128-GPU training cluster burns roughly 15,000 kWh per week. That's what about 17 average US households use in a month.

Here's what bothers me most: the "GPU shortage" everyone talks about is partly an efficiency problem. If organizations right-sized their workloads, if they stopped parking 8-GPU nodes on jobs that need one card, stopped leaving instances running overnight at 0% utilization, stopped burning through OOM retry cycles, there would be meaningfully more hardware available. The shortage is real, but it's amplified by waste.

I'm not saying every organization is as wasteful as we were. Some teams have excellent infrastructure discipline. But the default state of most ML teams, the path of least resistance, is to over-provision and hope for the best. That default is expensive, and it doesn't have to be.

Try it yourself

This experience is why we built Alloc. We wanted a way to know what a training job needs before we commit to expensive hardware. A ghost scan that analyzes your model architecture and config, tells you the estimated VRAM footprint, and suggests the right GPU. All in about 60 seconds, without modifying your training code.

If you want to see where your GPU budget actually goes:

pip install alloc && alloc scan train.py

It's free, it runs locally, and it takes less than a minute. You might be surprised what you find.
