VRAM ESTIMATION

Ghost Scan: Estimate VRAM Before You Launch

Ghost Scan analyzes your training script to estimate peak VRAM usage. No GPU required. Runs locally in seconds. Know whether your job will fit before you burn a dollar of compute.

FROM ALLOCGUY

Here's a number that surprised me: the average ML team runs 10–30 trial configurations before finding hardware that works. Each failed run costs real money. OOM after 20 minutes of provisioning and queuing, wrong dtype, batch size too aggressive. On an A100-80GB at $2.50/hr, even short failed runs add up to hundreds of dollars a week.

The irony is that most of these failures are predictable. If you know the model's parameter count, dtype, optimizer, and batch size, you can estimate peak VRAM within a reasonable range without ever touching a GPU. That's what Ghost Scan does. It's not magic. It's arithmetic that most people just don't do.

Training a 7B model in bf16 with AdamW? You're looking at ~14 GB of weights + ~14 GB of gradients + ~28 GB of optimizer states before a single batch runs, and activations plus allocator overhead push the peak to 67+ GB. An A100-40GB was never going to work, but a lot of people find that out the expensive way.
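If you want to check that arithmetic yourself, it's a few lines of plain Python. Nothing Alloc-specific here, and it assumes the optimizer's m and v are kept in bf16 like the weights (fp32 states would roughly triple that line item):

python

# Back-of-envelope peak VRAM for a 7B model in bf16 with AdamW (decimal GB).
params = 7e9
GB = 1e9

weights     = params * 2 / GB      # bf16, 2 bytes/param -> ~14 GB
gradients   = weights              # same shape as the weights
optimizer   = 2 * weights          # AdamW m + v, assumed bf16 -> ~28 GB
activations = 8.2                  # rough figure for batch=32, seq=2048
overhead    = 5.0                  # allocator cache, CUDA context, buffers

print(f"peak ~= {weights + gradients + optimizer + activations + overhead:.0f} GB")  # ~69 GB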

Ghost Scan runs in seconds, costs nothing, and catches the obvious mismatches before they waste real GPU hours. We think every training job should start with a pre-flight check.

– allocguy

What is Ghost Scan?

Ghost Scan is Alloc's VRAM estimation tool. Point it at your training script and it analyzes your model architecture (parameter count, data type, optimizer choice, and batch size) to produce an itemized breakdown of GPU memory usage. It does not require access to a GPU. Everything runs locally on your machine, which means it works in air-gapped and VPC-locked environments with no outbound internet.

Think of it as a pre-flight check for your training job. The same way you'd run terraform plan before provisioning infrastructure, Ghost Scan tells you what resources your job is likely to need before it hits the queue.

What It Tells You

VRAM Breakdown

Itemized estimate for weights, gradients, optimizer states, activations, and buffer overhead.

Total Estimate

Estimated peak VRAM as a range, not a single number. Honest about the uncertainty inherent in static analysis.

GPU Fit Check

Whether your target GPU can handle the workload, or which GPUs are viable alternatives.

When to Use Ghost Scan

  • Before requesting GPU allocation. Know what you need before submitting a cluster request or spinning up cloud instances.
  • Before launching expensive training runs. Catch OOM-destined jobs at scan time, not after hours of provisioning and queuing.
  • During experiment planning. Quickly check whether a batch size increase or dtype change will still fit in memory.
  • Comparing GPU options. See if an A10G, L40S, A100-40GB, or A100-80GB is the right fit for your workload without trial and error.

CLI Usage

Run Ghost Scan from the command line. Point it at your training script and specify your target dtype and batch size:

terminal

$ alloc ghost train_7b.py --dtype bf16 --batch-size 32

Alloc Ghost Scan v0.3.0
Analyzing: train_7b.py

VRAM Breakdown (estimated)
──────────────────────────────────────
Model weights        14.0 GB   (7B params × 2 bytes, bf16)
Gradients            14.0 GB
Optimizer states     28.0 GB   (AdamW m + v, bf16)
Activations          ~8.2 GB   (batch=32, seq=2048)
Buffer / overhead    ~3-6 GB
──────────────────────────────────────
Total estimate       ~67-70 GB

⚠ Will not fit on A100-40GB (40 GB).
✓ Fits on A100-80GB (80 GB) with ~10-13 GB headroom.
✓ Fits on H100-80GB (80 GB) with ~10-13 GB headroom.

Python API

Use Ghost Scan programmatically in notebooks or scripts. Pass your model's parameter count and dtype to get a VRAM report object:

python

from alloc import ghost

# Estimate VRAM for a 7B parameter model in bf16
report = ghost(
    param_count_b=7.0,
    dtype="bf16",
    batch_size=32,
    seq_len=2048,
)

# Access the breakdown
print(report.total_gb)               # e.g. 67.2
print(report.fits_gpu("A100-80GB"))  # True
print(report.breakdown)              # dict of components
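Because the report object exposes fits_gpu, you can sweep candidate GPUs in a loop instead of eyeballing spec sheets. A minimal sketch, assuming the same ghost signature shown above and that fits_gpu accepts the GPU names used elsewhere in these docs:

python

from alloc import ghost

# Compare candidate GPUs for the same 7B / bf16 / batch=32 workload.
# Assumes fits_gpu accepts these GPU names (only "A100-80GB" is shown above).
report = ghost(param_count_b=7.0, dtype="bf16", batch_size=32, seq_len=2048)

for gpu in ["A10G", "L40S", "A100-40GB", "A100-80GB"]:
    verdict = "fits" if report.fits_gpu(gpu) else "does not fit"
    print(f"{gpu}: {verdict} (estimated peak ~{report.total_gb:.1f} GB)")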

Remote Scan

Don't have the model locally? Use alloc scan to estimate VRAM for well-known model architectures via Alloc's API. No local setup needed:

terminal

$ alloc scan --model llama-3-70b --gpu A100-80GB

Scanning: llama-3-70b on A100-80GB

⚠ Estimated peak VRAM ~142 GB. Does not fit on a single A100-80GB.
✓ Consider: 2x A100-80GB with FSDP, or 1x H100-80GB with quantization (4-bit).

VRAM Components Explained

GPU memory during training is consumed by several distinct components. Understanding each one helps you reason about where memory goes and which levers you have for reducing it.

Model Weights

The learned parameters of your model. Memory usage scales directly with parameter count and the bytes-per-element of your chosen data type. A 7B parameter model in fp32 uses ~28 GB for weights alone (7 billion x 4 bytes). In bf16 or fp16 that halves to ~14 GB. Quantized formats (int8, int4) reduce it further.
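The arithmetic here is just parameter count × bytes per element. A quick illustration in plain Python (not the Ghost Scan API):

python

# Weight memory = parameter count × bytes per element (decimal GB).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(param_count: float, dtype: str) -> float:
    return param_count * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "bf16", "int8", "int4"):
    print(f"7B in {dtype}: ~{weight_memory_gb(7e9, dtype):.1f} GB")
# fp32 ~28.0, bf16 ~14.0, int8 ~7.0, int4 ~3.5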

Gradients

During training, a gradient tensor is stored for each trainable parameter. Gradients are the same shape as the weights, so they consume roughly the same amount of memory. If your weights are 14 GB in bf16, expect another ~14 GB for gradients.

Optimizer States

Adaptive optimizers like Adam and AdamW maintain additional state for each parameter: a first-moment estimate (m) and a second-moment estimate (v), each the same size as the weights. Kept in the weight dtype, that is roughly 2x weight memory (the ~28 GB in the example above). In a standard mixed-precision recipe that stores m, v, and an fp32 master copy of the weights, it grows to about 12 bytes per parameter, often the single largest consumer of VRAM during training.
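A short sketch of how those per-parameter costs compare under the two accountings described above (plain Python, illustrative only):

python

# Optimizer-state memory per parameter under two common AdamW setups (decimal GB).
def adamw_state_gb(param_count: float, setup: str) -> float:
    bytes_per_param = {
        "bf16_states": 2 + 2,               # m + v kept in bf16
        "fp32_mixed_precision": 4 + 4 + 4,  # fp32 master copy + m + v
    }[setup]
    return param_count * bytes_per_param / 1e9

print(adamw_state_gb(7e9, "bf16_states"))          # ~28 GB (the example above)
print(adamw_state_gb(7e9, "fp32_mixed_precision")) # ~84 GB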

Activations

Intermediate tensors stored during the forward pass so they can be reused in the backward pass. Activation memory depends on batch size, sequence length, hidden dimension, and the number of layers. Doubling your batch size roughly doubles activation memory. Techniques like gradient checkpointing trade compute for memory by recomputing activations instead of storing them.
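The scaling intuition above can be turned into a crude extrapolation helper; this is a back-of-envelope sketch, not how Ghost Scan actually models activations:

python

# Crude activation-memory extrapolation: scale a known estimate linearly with
# batch size and sequence length. Attention terms can scale worse than linearly
# in sequence length without FlashAttention, so treat the seq scaling as optimistic.
def scale_activations_gb(ref_gb: float, ref_batch: int, ref_seq: int,
                         batch: int, seq: int) -> float:
    return ref_gb * (batch / ref_batch) * (seq / ref_seq)

# Starting from the ~8.2 GB estimate at batch=32, seq=2048:
print(scale_activations_gb(8.2, 32, 2048, batch=64, seq=2048))  # ~16.4 GB
print(scale_activations_gb(8.2, 32, 2048, batch=32, seq=4096))  # ~16.4 GB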

Buffer / Overhead

GPU memory allocators (like PyTorch's caching allocator) reserve extra memory beyond what tensors strictly need. This includes memory fragmentation, temporary buffers for operations like matrix multiplications, CUDA context overhead, and framework bookkeeping. Ghost Scan adds a headroom buffer to account for this, because the theoretical minimum rarely matches what you actually observe.
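To see this gap on real hardware, PyTorch reports both numbers; the difference between reserved and allocated memory is the caching allocator's cushion. A small illustration (requires a CUDA GPU, and has nothing to do with Ghost Scan itself):

python

import torch

# Tensors actually allocated vs. memory held by PyTorch's caching allocator.
x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

allocated = torch.cuda.memory_allocated() / 1e9  # bytes backing live tensors
reserved  = torch.cuda.memory_reserved() / 1e9   # bytes held by the allocator
print(f"allocated {allocated:.3f} GB, reserved {reserved:.3f} GB, "
      f"cushion {reserved - allocated:.3f} GB")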

Limitations

Ghost Scan is a static estimator. It gives you a strong starting point, but it cannot account for everything. Here is what it cannot do:

  • Dynamic memory from custom ops. If your model uses custom CUDA kernels or operators that allocate memory at runtime, Ghost Scan cannot see those allocations ahead of time.
  • Exact activation sizes for complex architectures. Models with dynamic control flow (variable-length inputs, mixture-of-experts routing, conditional computation) produce activation patterns that depend on the data. Ghost Scan uses heuristic estimates that may differ from actual runtime usage.
  • Runtime memory fragmentation. The GPU memory allocator's behavior depends on allocation order, which varies between runs. Ghost Scan includes a buffer for this, but actual fragmentation can be higher in pathological cases.

For workloads where Ghost Scan's static analysis is not enough, use Alloc Probe to run a short calibration on real hardware and get measured VRAM usage.
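Independent of Alloc Probe, a quick way to sanity-check any estimate is to record the actual peak during a short run; PyTorch tracks this for you:

python

import torch

# Record peak tensor memory over a short burn-in of real training steps.
torch.cuda.reset_peak_memory_stats()

# ... run a few training steps here ...

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"measured peak: {peak_gb:.1f} GB")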

Next Steps

Want to see it in action? Try the interactive demo