
GPU OPTIMIZATION

GPU Right-Sizing: Stop Over-Provisioning GPUs

Match your ML workload to the smallest GPU that meets your requirements. Stop paying for compute you don't use.

FROM ALLOCGUY

GPU pricing has gone vertical. An 8xA100 instance on AWS (p4d.24xlarge) runs $32.77/hr on-demand, and H100 instances cost considerably more. Even on spot markets, 8xA100 clusters run $12–20/hr. Meta reportedly spent over $30B on GPU infrastructure in 2024 alone. Training costs for frontier models are now measured in the tens of millions of dollars. Llama 3.1 405B used 30.84M GPU-hours across 16,384 H100s.

But here's what most people miss: the cost explosion isn't just at the frontier. Even fine-tuning a 7B model can run $500–2,000 when you factor in the trial-and-error loop. Teams pick the wrong GPU, hit OOM, switch to something bigger, discover they're data-pipeline-bound, realize the expensive GPU sits idle 60% of the time. Each iteration costs real money.

Meanwhile, electricity costs are becoming a real constraint. A single H100 SXM pulls 700W under load. A 1,024-GPU training cluster draws roughly 0.7 MW, comparable to a small factory. The IEA estimates AI will drive data center electricity demand to 945 TWh by 2030, up from 460 TWh in 2022. In the US, some utilities are already rejecting new data center connections because the grid can't keep up.

Right-sizing isn't just an optimization. It's becoming a necessity. If you're paying for an H100 when an A10G would do the job, you're not just wasting money. You're wasting power, queue time, and opportunity cost for everyone else waiting for that hardware.

– allocguy

The Problem: Default to Biggest

Most ML teams over-provision GPUs because they don't know what their workload actually needs. The default decision is "just request the biggest GPU available," which means $3-5/hr H100 instances running at 20% utilization. Teams burn thousands per month on idle VRAM and underused compute simply because there's no easy way to measure requirements before launch.

The irony: the information needed to make a good GPU decision exists in the first 30-60 seconds of any training job. Nobody captures it.

What Right-Sizing Means

GPU right-sizing means matching your workload to the smallest GPU (or multi-GPU configuration) that meets three requirements:

VRAM

Enough memory for your model, optimizer states, activations, and gradients at your target batch size.

COMPUTE

Sufficient FLOPS and memory bandwidth so the GPU is not the bottleneck in your training loop.

THROUGHPUT

Acceptable samples/second for your training timeline. The cheapest GPU that finishes on schedule wins.
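
To make the VRAM requirement concrete, here is a rough back-of-the-envelope estimator for training memory. This is a minimal sketch, not Alloc's Ghost Scan logic: it assumes fp16 weights and gradients, fp32 AdamW moments, ignores the fp32 master copy some mixed-precision setups keep, and treats activation memory as a per-sample constant you supply, since activations depend heavily on architecture, sequence length, and batch size.

def estimate_training_vram_gb(
    params,                          # number of model parameters
    bytes_per_weight=2,              # fp16/bf16 weights
    bytes_per_grad=2,                # fp16/bf16 gradients
    optimizer_bytes_per_param=8,     # two fp32 AdamW moments per parameter
    activation_gb_per_sample=0.04,   # assumption; highly architecture-dependent
    batch_size=32,
    overhead=1.2,                    # CUDA context, fragmentation, temp buffers
):
    """Rough training VRAM estimate in GB. Illustrative only."""
    weights = params * bytes_per_weight
    grads = params * bytes_per_grad
    optim = params * optimizer_bytes_per_param
    activations = activation_gb_per_sample * batch_size * 1e9
    return (weights + grads + optim + activations) * overhead / 1e9

# Example: a 125M-parameter model at batch_size=32 comes out around 3 GB
print(f"{estimate_training_vram_gb(125e6):.1f} GB")

Alloc's Ghost Scan (next section) arrives at its numbers by inspecting the actual model architecture rather than relying on a flat activation constant.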

How Alloc Helps

Alloc's CLI gathers the data you need to make an informed GPU decision, without modifying your training code.

1. Ghost Scan Estimates VRAM Before Launch

Ghost Scan inspects your model's architecture and estimates VRAM requirements (parameters, optimizer states, gradients, and activation memory) without running a single forward pass.

$ alloc ghost model.py

Ghost Scan Results
  Parameters:    125M (250 MB fp16)
  Optimizer:     500 MB (AdamW states)
  Activations:   ~1.2 GB (batch_size=32)
  Total VRAM:    ~2.0 GB estimated

  Minimum GPU:   T4 (16 GB), plenty of headroom
  Suggested:     L4 (24 GB) for batch_size scaling

2. Probe Monitors Actual GPU Utilization

Wrap your training command with alloc run and Alloc launches a lightweight sidecar probe. It monitors GPU memory, compute utilization, and power draw during a short calibration window, then auto-stops once metrics stabilize (typically 30-60 seconds).

$ alloc run python train.py

[alloc] Calibrating... watching GPU metrics
[alloc] GPU 0: 6.2 GB / 80 GB VRAM (7.7% utilization)
[alloc] GPU 1: 6.1 GB / 80 GB VRAM (7.6% utilization)
[alloc] Metrics stabilized after 42s. Stopping probe
[alloc] Artifact saved: .alloc/run-20250210-143022.json
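
The probe's internals aren't something you need to touch, but the underlying idea is easy to sketch. Below is a minimal, hypothetical version using NVML via the pynvml package: poll memory and utilization, and stop once a rolling window of memory readings stops changing. It illustrates the sidecar-probe concept under those assumptions; it is not Alloc's implementation.

import time
import pynvml

def calibrate(window=10, poll_s=2.0, tolerance_mb=64, max_s=120):
    """Poll GPU 0 until peak memory stops growing, then report. Illustrative only."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples, start = [], time.time()
    while time.time() - start < max_s:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent
        samples.append((mem.used / 2**20, util.gpu))
        recent = [m for m, _ in samples[-window:]]
        # "Stabilized" = the last `window` memory readings span less than tolerance_mb
        if len(recent) == window and max(recent) - min(recent) < tolerance_mb:
            break
        time.sleep(poll_s)
    pynvml.nvmlShutdown()
    return {
        "peak_vram_mb": max(m for m, _ in samples),
        "avg_gpu_util_pct": sum(u for _, u in samples) / len(samples),
    }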

3. Alloc Recommends the Best-Fit GPU

Based on observed VRAM usage and compute patterns, Alloc compares your workload against its catalog of 13+ GPU types with real cloud pricing. If you've configured your GPU fleet (via alloc init or the dashboard), recommendations prioritize GPUs you can actually provision. The recommendation includes a bottleneck classification (compute-bound, memory-bound, or data-pipeline-bound) and a suggested GPU with estimated cost per hour.
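
Alloc's exact heuristics aren't published here, but the shape of a bottleneck classification is easy to convey. The toy classifier below uses made-up thresholds and splits "memory-bound" into VRAM capacity and memory bandwidth for clarity; it is illustrative only, not Alloc's logic.

def classify_bottleneck(sm_util_pct, mem_bw_util_pct, vram_used_gb, vram_total_gb):
    """Toy classifier with illustrative thresholds."""
    if vram_used_gb / vram_total_gb > 0.90:
        return "memory-bound (capacity): near VRAM limit; larger-VRAM GPU or smaller batch"
    if sm_util_pct < 40 and mem_bw_util_pct < 40:
        return "data-pipeline-bound: GPU mostly idle; fix the input pipeline before upsizing"
    if mem_bw_util_pct > sm_util_pct:
        return "memory-bound (bandwidth): prefer GPUs with higher memory bandwidth"
    return "compute-bound: faster or additional GPUs will actually help"

# Example: a job using 6.2 GB of an 80 GB GPU with single-digit compute utilization
print(classify_bottleneck(sm_util_pct=8, mem_bw_util_pct=5,
                          vram_used_gb=6.2, vram_total_gb=80))
# -> data-pipeline-bound: GPU mostly idle; fix the input pipeline before upsizing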

Common Waste Patterns

These are real patterns we see across ML teams. If any of these look familiar, you're leaving money on the table.

Over-provisioned

BERT-base on 4xH100s

BERT-base has 110M parameters. It fits comfortably on a single A10G (24 GB). Running it on four H100s at $12-20/hr is burning 95% of your budget on idle hardware.

Wasted VRAM

7B Model with batch_size=1

A 7B parameter model in fp16 uses about 14 GB for weights alone. On an 80 GB A100 with batch_size=1, you're leaving more than 80% of VRAM unused. Increase the batch size or drop to a smaller GPU.

Wrong Technique

Full Fine-Tuning When LoRA Suffices

Full fine-tuning updates every parameter, requiring optimizer states for the entire model. LoRA fine-tuning can reduce VRAM requirements by 60-80%, often letting you drop from an A100 to an A10G or L4.
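
To see why the choice of technique moves you across GPU tiers, compare the state that has to live in VRAM in each case. The numbers below are illustrative assumptions (fp16 base weights and gradients, fp32 AdamW moments, roughly 40M trainable LoRA parameters), not measurements, and they ignore activations.

def finetune_state_gb(total_params, trainable_params):
    """Weights + gradients + AdamW moments, ignoring activations. Illustrative only."""
    weights = total_params * 2        # fp16 base weights
    grads = trainable_params * 2      # gradients only for trainable parameters
    optim = trainable_params * 8      # two fp32 AdamW moments per trainable parameter
    return (weights + grads + optim) / 1e9

print(f"full: {finetune_state_gb(7e9, 7e9):.0f} GB")   # ~84 GB before activations
print(f"LoRA: {finetune_state_gb(7e9, 40e6):.1f} GB")  # ~14.4 GB before activations

In practice, full fine-tuning at this scale leans on sharded or offloaded optimizer states, or lower-precision optimizers, to fit; LoRA sidesteps most of that state entirely.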

Bottleneck Mismatch

DataLoader Bottleneck on Expensive GPUs

If your DataLoader can't feed data fast enough, the GPU sits idle between batches. Throwing a bigger GPU at the problem only increases the waste: the bottleneck is CPU or I/O, not compute. Fix the pipeline first.
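
A quick way to confirm this pattern before changing hardware is to measure how long each step waits on data versus how long it computes. The PyTorch-style sketch below is a hypothetical helper; it assumes the dataloader yields (inputs, targets) pairs and has at least 50 batches.

import time
import torch
import torch.nn.functional as F

def data_wait_fraction(dataloader, model, optimizer, steps=50, device="cuda"):
    """Fraction of wall time spent waiting on the DataLoader. Illustrative sketch."""
    wait = compute = 0.0
    it = iter(dataloader)
    for _ in range(steps):
        t0 = time.perf_counter()
        inputs, targets = next(it)                 # time spent fetching the batch
        t1 = time.perf_counter()
        loss = F.cross_entropy(model(inputs.to(device)), targets.to(device))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()                   # make GPU work visible to the timer
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait / (wait + compute)

If the returned fraction is well above roughly 0.2, more DataLoader workers, prefetching, or faster storage will likely buy more than a bigger GPU.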

The Right-Sizing Workflow

A four-step process from unknown requirements to a data-backed GPU recommendation.

1. Estimate Requirements

Run alloc ghost or alloc scan to get a static VRAM estimate based on model architecture, optimizer choice, and batch size. No GPU required.

2. Calibrate with a Short Run

Run alloc run python train.py for a short calibration. Alloc's probe monitors actual GPU utilization and auto-stops once metrics stabilize, typically 30-60 seconds. Your training continues normally.

3. Review the Verdict

Alloc classifies your bottleneck (compute-bound, memory-bound, or data-pipeline-bound), shows your actual VRAM usage, and recommends a GPU with an estimated cost range.

4. Compare Configs on the Dashboard

Upload the artifact to the Alloc dashboard with alloc upload to see a full config comparison across GPU options: VRAM headroom, estimated cost per hour, and fit scores side by side.
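
Put together, the loop is a handful of commands. The session below only uses commands documented on this page; the artifact path argument and its exact form are assumptions, and flags may differ in your version of the CLI.

# 1. Static estimate, no GPU required
$ alloc ghost model.py

# 2. Short calibration run wrapped around normal training
$ alloc run python train.py

# 3. Review the verdict, then push the saved artifact to the dashboard
$ alloc upload .alloc/run-<timestamp>.json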

GPU Selection Cheat Sheet

General guidance based on common ML workloads. These are starting points. Your actual requirements depend on model architecture, precision, optimizer, and batch size.

Fine-tuning 7B models (LoRA): A10G or L4. 24 GB of VRAM is usually sufficient for LoRA on 7B-parameter models.
Fine-tuning 7B models (full): A100-40GB or A100-80GB. Full fine-tuning needs VRAM for optimizer states and gradients.
Pre-training 13B+ models: A100-80GB or H100. Large models need high memory bandwidth and multi-GPU setups.
Inference serving (small models): T4 or L4. Low cost, good throughput for models under 3B parameters.
Inference serving (large models): A100 or H100. Large-model inference benefits from high memory bandwidth.
Multi-GPU training: start with the minimum count. Scale up GPU count only if single-GPU VRAM is insufficient.

Alloc GPU Catalog

Alloc maintains a catalog of 13+ GPU types with real cloud pricing from major providers. Browse it from the CLI to see specs, VRAM, and cost-per-hour ranges.

# List all GPUs in the catalog
$ alloc catalog list

GPU           VRAM     Compute    $/hr (range)
─────────────────────────────────────────────
T4            16 GB    65 TFLOPS  $0.35-0.76
L4            24 GB    121 TFLOPS $0.50-0.80
A10G          24 GB    125 TFLOPS $0.75-1.10
A100-40GB     40 GB    312 TFLOPS $1.50-3.10
A100-80GB     80 GB    312 TFLOPS $2.00-3.90
H100          80 GB    990 TFLOPS $3.00-4.90
...
# Get details for a specific GPU
$ alloc catalog show H100

NVIDIA H100 SXM
  VRAM:           80 GB HBM3
  FP16 Compute:   990 TFLOPS
  Memory BW:      3.35 TB/s
  TDP:            700W
  Cloud pricing:  $3.00-4.90/hr (varies by provider)

Stop guessing. Start measuring.

Alloc tells you what your workload actually needs in under 60 seconds. No code changes. No infra setup.
