Lab Notebook · Capacity & Measurement

What quantization actually saves: storage vs compute precision

Capacity planning on one 48 GB card comes down to three questions: will it fit, how fast, and how many can I run at once. Two controlled benchmarks settle a distinction that quietly wrecks those answers — a model being “quantized” on disk is not the same as it being quantized in VRAM.

Node

phx-ai04 · Blackwell 48 GB

Cross-check

RTX 4070 Ti SUPER (Ada)

Dates

2026-05-28 / 05-29

Status

Measured · caveats noted

The distinction this proves Counter-intuitive

Quantization is, by default, a throughput-and-storage optimization — not a memory-capacity one. Whether it saves VRAM depends on whether the runtime carries the low precision through the matmuls, and at high resolution the peak is set by activations regardless.

0 MiB

VRAM saved by fp8-on-disk

~2.6×

VRAM cut by NVFP4 compute

single-variable A/Bs, 2 GPU archs

I ran two controlled studies on the inference node. The first compared an fp8 and an fp16 image-model checkpoint, changing nothing else. The second ran a 4-bit (NVFP4) checkpoint across two serving images. Together they decompose the whole question into its three independent levers.

fp8 on disk: half the size, faster — and zero VRAM relief

Single variable: only the SD3.5 Large transformer precision changed — same resolution, steps, seed, sampler, and text encoders. Any delta is attributable to precision alone.

Axis	fp16 transformer	fp8 transformer	Δ
Disk / download	16.5 GB	8.15 GB	−50% (real)
Resident weights (loaded)	~15.37 GB	~15.37 GB	0 (identical)
Sampling peak VRAM	46,801 MiB	46,801 MiB	0 (byte-identical)
s/gen (warm)	~40 s*	29.1 s	~28% faster*

An fp8 checkpoint is half the size on disk and generates meaningfully faster — yet produced zero runtime VRAM savings. The sampling peak was byte-identical to fp16. The load log resolves why: the fp8 checkpoint reported the same resident footprint as fp16 because the runtime upcast the fp8 weights to bf16 at load for compute.

# fp8 checkpoint, but resident size is unchanged — runtime upcast to bf16
loaded completely; 15366.80 MB loaded, full load: True   # same as fp16

Storage precision ≠ compute precision

“fp8” describes how a model is stored on disk. It does not, by itself, reduce serving VRAM — the runtime decides the compute dtype, and absent an explicit fp8-compute path it upcasts. Saving VRAM requires fp8 to stay fp8 through the matmuls (a launch flag like --fp8_e4m3fn-unet or native fp8 kernels), not merely fp8 on disk. What fp8-on-disk did buy: ~28% throughput from the tensor cores, and ~50% off disk and transfer — a logistics win, not a capacity one.

Same behaviour on a second architecture

This was not a single-card artifact. On a 16 GB RTX 4070 Ti SUPER (Ada), the fp8 checkpoint alone was faster with no VRAM relief; only adding the fp8-compute flag recovered headroom — and on that card the recovered headroom was the difference between fitting larger work and not. The mechanism is identical across both: fp8 storage buys speed; only fp8 compute buys VRAM.

NVFP4: compute precision engaged

Where fp8 storage bought no VRAM, this study confirms NVFP4 keeps its 4-bit residency through load on Blackwell — the compute-precision path is live, and it changes the capacity math.

8763.64 MiB

resident transformer (4-bit)

~9.2 GB

NVFP4 weights resident

~21 GB

per-process peak @ 1024²

FLUX.1-dev in NVFP4 dispatched natively on the Blackwell card — the transformer's resident size (8763.64 MiB) matched its exact 4-bit on-disk size, ~2.6× smaller than bf16. The peak was weight-bound here: co-resident weights ~18.2 GB (transformer 8763 + encoder 9319 + VAE 160) and only a ~2.8 GB activation transient at 1024².

On “FLUX is lighter than SD3.5”

FLUX is the larger model (12B vs SD3.5's 8.1B). The low peak isn't a lighter model — it's NVFP4 cutting weights to ~9.2 GB plus a small working set. The fp8 study measured SD3.5 at a 46.8 GB peak that was activation-bound at its resolution; here the peak is weight-bound. Same lesson both directions: resolution and attention set the ceiling, not parameter count — and weight precision only moves the peak when the working set is weight-bound, as it is at 1024².

The throughput gap that wasn't the quantization

The two NVFP4 serving images differed in render time, but the difference tracked their memory-management launch flags — not the 4-bit path and not the attention backend. One image ran --highvram (everything resident); the other omitted it and added --disable-smart-memory --disable-dynamic-vram, so it offloaded and reloaded the encoder and transformer between the encode and sample phases on every generation. That per-render juggling accounts for the whole gap. Residency was identical across both, so the 4-bit path is not implicated.

Attribution discipline

By the same rule I hold elsewhere — don't credit a throughput delta without controlling the confounds — the gap is attributed to the memory-management flag asymmetry, not to NVFP4, the attention layer, or the image. The CLI asymmetry is sufficient to explain the full delta; a flag-matched re-run makes it airtight, and is the named follow-up.

The operational payoff: one fewer thing to maintain

I had been keeping two serving images on the Blackwell node on the premise that the SageAttention build might suppress native NVFP4 dispatch. This measurement removes that premise: NVFP4 runs at the same 8763.64 MiB resident on both images, so the SageAttention image hosts the 4-bit path natively, and the throughput difference is a recoverable launch-flag property. The dedicated second pod is no longer required on suppression grounds — a consolidation that takes a redundant image and a parallel pod out of the platform, which is the kind of simplification I'd rather earn from a measurement than assume.

What this means for capacity planning

Storage precision buys disk and transfer — download, registry, backup. Real, but it does not make a model fit.
Compute precision buys resident VRAM and tensor-core throughput — engaged for NVFP4 here, absent for fp8-on-disk.
Capacity is set by resolution and attention: weight-bound at 1024², activation-bound higher. When the peak is activation-bound, the only levers are resolution and attention — not weight precision.
Flipping a precision path is a GitOps env change — commit, reconcile, revert — not a hand-edit to a live workload.

Honesty & caveats

VRAM and residency figures are measured and clean (loader logs; single-variable A/Bs; the NVFP4 dispatch verdict replicated across both images). The fp8 ~28% speed figure is pending confirmation that the fp16 baseline used matched 40-step settings; the Blackwell fp8-compute arm is explicitly untested pending a follow-up, and no conclusion here depends on it. The NVFP4 throughput attribution to memory-management flags is established by the launch-flag asymmetry; a flag-matched re-run is the definitive confirmation. Outputs were produced at 1024²; a visual spot-check precedes any external use.

Lab note adapted from internal reports LAZ-2026-0519 (fp8/fp16 study) and LAZ-2026-0529 (NVFP4 dispatch). Full matrices and node identifiers live in the lab's source-of-truth.