Capacity planning on one 48 GB card comes down to three questions: will it fit, how fast, and how many can I run at once. Two controlled benchmarks settle a distinction that quietly wrecks those answers — a model being “quantized” on disk is not the same as it being quantized in VRAM.
Quantization is, by default, a throughput-and-storage optimization — not a memory-capacity one. Whether it saves VRAM depends on whether the runtime carries the low precision through the matmuls, and at high resolution the peak is set by activations regardless.
I ran two controlled studies on the inference node. The first compared an fp8 and an fp16 image-model checkpoint, changing nothing else. The second ran a 4-bit (NVFP4) checkpoint across two serving images. Together they decompose the whole question into its three independent levers.
Single variable: only the SD3.5 Large transformer precision changed, with resolution, steps, seed, sampler, and text encoders held constant. Any VRAM delta is attributable to precision alone; the timing row carries the caveat marked * in the table below.
| Axis | fp16 transformer | fp8 transformer | Δ |
|---|---|---|---|
| Disk / download | 16.5 GB | 8.15 GB | −50% (real) |
| Resident weights (loaded) | ~15.37 GB | ~15.37 GB | 0 (identical) |
| Sampling peak VRAM | 46,801 MiB | 46,801 MiB | 0 (identical to the MiB) |
| s/gen (warm) | ~40 s* | 29.1 s | ~28% faster* |
An fp8 checkpoint is half the size on disk and generates meaningfully faster, yet it produced zero runtime VRAM savings. The sampling peak was identical to fp16 down to the MiB. The load log resolves why: the fp8 checkpoint reported the same resident footprint as fp16 because the runtime upcast the fp8 weights to bf16 at load for compute. The starred timing figures are provisional: the ~40 s fp16 baseline still needs confirmation that it used matched 40-step settings.
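The deltas in the table can be recomputed directly, which also shows that "faster" reads differently depending on convention. The sketch below uses only the table's figures and treats the approximate fp16 time as exactly 40 s, which is an assumption. The helper names are illustrative.

```python
# Figures from the table above; the fp16 time is approximate (~40 s).
FP16_DISK_GB = 16.5
FP8_DISK_GB = 8.15
FP16_SECONDS = 40.0  # assumption: "~40 s" taken as exactly 40
FP8_SECONDS = 29.1


def percent_change(baseline: float, measured: float) -> float:
    """Signed change of `measured` relative to `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


def speedup(baseline_seconds: float, measured_seconds: float) -> float:
    """Throughput ratio: generations per unit time relative to the baseline."""
    return baseline_seconds / measured_seconds


disk_delta = percent_change(FP16_DISK_GB, FP8_DISK_GB)
time_delta = percent_change(FP16_SECONDS, FP8_SECONDS)
throughput_ratio = speedup(FP16_SECONDS, FP8_SECONDS)

print(f"disk: {disk_delta:.1f}%")
print(f"time per generation: {time_delta:.1f}%")
print(f"throughput: {throughput_ratio:.2f}x")
```

With these inputs the time per generation drops by roughly 27%, while throughput rises by a factor of about 1.37. Both describe the same measurement, so it is worth stating which one a quoted percentage means.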
“fp8” describes how a model is stored on disk. It does not, by itself, reduce serving VRAM: the runtime decides the compute dtype, and absent an explicit fp8-compute path it upcasts. Saving VRAM requires fp8 to stay fp8 through the matmuls (a launch flag like --fp8_e4m3fn-unet or native fp8 kernels), not merely fp8 on disk. What fp8-on-disk did buy: ~28% throughput from the tensor cores (provisional until the fp16 baseline settings are confirmed), and ~50% off disk and transfer. That is a logistics win, not a capacity one.
This was not a single-card artifact. On a 16 GB RTX 4070 Ti SUPER (Ada), the fp8 checkpoint alone was faster with no VRAM relief; only adding the fp8-compute flag recovered headroom — and on that card the recovered headroom was the difference between fitting larger work and not. The mechanism is identical across both: fp8 storage buys speed; only fp8 compute buys VRAM.
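The storage-versus-compute distinction reduces to bytes per parameter at the dtype the weights occupy in VRAM. The sketch below is an idealized, weights-only estimate and not a measurement: it uses the 8.1B parameter count quoted in this note and assumes an unflagged runtime upcasts fp8 to bf16 at load, as the load log showed. Function names are illustrative, and encoders, activations, and offloading are ignored.

```python
BYTES_PER_PARAM = {"fp8": 1, "fp16": 2, "bf16": 2}


def weight_gb(params: float, dtype: str) -> float:
    """Weight footprint in decimal GB for `params` parameters held as `dtype`."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def resident_dtype(storage_dtype: str, fp8_compute: bool) -> str:
    """Dtype the weights occupy in VRAM.

    Assumption: without an fp8-compute path, fp8 weights are upcast to bf16.
    """
    if storage_dtype == "fp8" and not fp8_compute:
        return "bf16"
    return storage_dtype


SD35_LARGE_PARAMS = 8.1e9  # parameter count quoted in this note

for storage, fp8_compute in [("fp16", False), ("fp8", False), ("fp8", True)]:
    on_disk = weight_gb(SD35_LARGE_PARAMS, storage)
    in_vram = weight_gb(SD35_LARGE_PARAMS, resident_dtype(storage, fp8_compute))
    print(f"{storage:>4} storage, fp8 compute={fp8_compute!s:<5}: "
          f"disk ~{on_disk:.1f} GB, resident ~{in_vram:.1f} GB")
```

Only the third row, fp8 storage with fp8 compute, shrinks the resident estimate. The estimate will not match loader-log figures exactly, since real checkpoints carry extra tensors and metadata.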
Where fp8 storage bought no VRAM, the NVFP4 study confirms that the 4-bit checkpoint keeps its 4-bit residency through load on Blackwell: the compute-precision path is live, and it changes the capacity math.
FLUX.1-dev in NVFP4 dispatched natively on the Blackwell card: the transformer's resident size (8763.64 MiB) exactly matched its 4-bit on-disk size, ~2.6× smaller than bf16. The peak was weight-bound here: co-resident weights totaled ~18,242 MiB (transformer 8,763 + encoder 9,319 + VAE 160), against only a ~2.8 GB activation transient at 1024².
FLUX is the larger model (12B vs SD3.5's 8.1B). The low peak doesn't come from a lighter model; it comes from NVFP4 cutting the transformer weights to ~9.2 GB, plus a small working set. The fp8 study measured SD3.5 at a 46,801 MiB peak that was activation-bound at its resolution; here the peak is weight-bound. Same lesson in both directions: resolution and attention set the ceiling, not parameter count, and weight precision only moves the peak when the working set is weight-bound, as it is for FLUX at 1024².
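The NVFP4 arithmetic can be checked from the reported figures. This sketch sums the co-resident weights, derives the shrink factor against a 2-bytes-per-parameter bf16 transformer, and applies a simple weight-bound test. It assumes the ~2.8 GB transient is in decimal GB, and the `is_weight_bound` helper is an illustrative simplification.

```python
MIB = 2**20

# Resident sizes in MiB as reported in this note (encoder and VAE rounded).
RESIDENT_MIB = {"transformer": 8763.64, "encoder": 9319.0, "vae": 160.0}

FLUX_PARAMS = 12e9  # parameter count quoted in this note
BF16_BYTES_PER_PARAM = 2

# Assumption: the "~2.8 GB" activation transient is decimal GB.
ACTIVATION_TRANSIENT_MIB = 2.8e9 / MIB


def bf16_size_mib(params: float) -> float:
    """Size in MiB of `params` weights held at 2 bytes each."""
    return params * BF16_BYTES_PER_PARAM / MIB


def is_weight_bound(weights_mib: float, activations_mib: float) -> bool:
    """True when resident weights dominate the activation transient."""
    return weights_mib > activations_mib


total_weights_mib = sum(RESIDENT_MIB.values())
shrink_vs_bf16 = bf16_size_mib(FLUX_PARAMS) / RESIDENT_MIB["transformer"]

print(f"co-resident weights: {total_weights_mib:.2f} MiB")
print(f"transformer shrink vs bf16: {shrink_vs_bf16:.2f}x")
print("weight-bound:", is_weight_bound(total_weights_mib, ACTIVATION_TRANSIENT_MIB))
```

The shrink factor lands at about 2.6×, consistent with the figure above. When the weights dominate the peak like this, a lower weight precision moves the peak almost one-for-one.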
The two NVFP4 serving images differed in render time, but the difference tracked their memory-management launch flags, not the 4-bit path and not the attention backend. One image ran --highvram (everything resident); the other omitted it and added --disable-smart-memory --disable-dynamic-vram, so it offloaded and reloaded the encoder and transformer between the encode and sample phases on every generation. That per-render juggling is sufficient to account for the whole gap. Residency was identical across both, so the 4-bit path is not implicated.
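A quick way to surface this kind of confound before comparing timings is to diff the launch flags of the two deployments. The sketch below uses the flags named above. The labels `image_a` and `image_b` are placeholders, since which image carried which flag set is not restated here, and `flag_asymmetry` is an illustrative helper.

```python
# Launch flags as described above; "image_a" and "image_b" are placeholder names.
LAUNCH_FLAGS = {
    "image_a": {"--highvram"},
    "image_b": {"--disable-smart-memory", "--disable-dynamic-vram"},
}


def flag_asymmetry(a: set[str], b: set[str]) -> dict[str, list[str]]:
    """Flags present on only one side; each entry is an uncontrolled variable."""
    return {"only_a": sorted(a - b), "only_b": sorted(b - a)}


diff = flag_asymmetry(LAUNCH_FLAGS["image_a"], LAUNCH_FLAGS["image_b"])
confounded = bool(diff["only_a"] or diff["only_b"])

print(diff)
print("throughput comparison confounded by flags:", confounded)
```

Any non-empty side of the diff means a throughput delta between the two images cannot yet be credited to the image itself.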
By the same rule I hold elsewhere (don't credit a throughput delta without controlling the confounds), the gap is attributed to the memory-management flag asymmetry, not to NVFP4, the attention layer, or the image. The CLI asymmetry is sufficient to explain the full delta; a flag-matched re-run would make it airtight, and is the named follow-up.
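One possible shape for the flag-matched re-run is a full cross of image and flag set, so that each factor varies while the other is held fixed. This is a sketch of the design only. The image labels and the profile names `resident` and `offload` are hypothetical, and nothing here launches a server.

```python
from itertools import product

IMAGES = ["image_a", "image_b"]  # placeholder names for the two serving images
FLAG_SETS = {
    "resident": ("--highvram",),
    "offload": ("--disable-smart-memory", "--disable-dynamic-vram"),
}


def run_matrix(images, flag_sets):
    """Pair every image with every flag set so each factor can vary alone."""
    return [
        {"image": image, "profile": name, "flags": flags}
        for image, (name, flags) in product(images, flag_sets.items())
    ]


for run in run_matrix(IMAGES, FLAG_SETS):
    print(run["image"], run["profile"], " ".join(run["flags"]))
```

If render time follows the profile and not the image across the four cells, the attribution to memory-management flags holds.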
I had been keeping two serving images on the Blackwell node on the premise that the SageAttention build might suppress native NVFP4 dispatch. This measurement removes that premise: NVFP4 runs at the same 8763.64 MiB resident on both images, so the SageAttention image hosts the 4-bit path natively, and the throughput difference is a recoverable launch-flag property. The dedicated second pod is no longer required on suppression grounds — a consolidation that takes a redundant image and a parallel pod out of the platform, which is the kind of simplification I'd rather earn from a measurement than assume.
VRAM and residency figures are measured and clean (loader logs; single-variable A/Bs; the NVFP4 dispatch verdict replicated across both images). The fp8 ~28% speed figure is pending confirmation that the fp16 baseline used matched 40-step settings; the Blackwell fp8-compute arm is explicitly untested pending a follow-up, and no conclusion here depends on it. The NVFP4 throughput attribution to memory-management flags is established by the launch-flag asymmetry; a flag-matched re-run is the definitive confirmation. Outputs were produced at 1024²; a visual spot-check precedes any external use.
Lab note adapted from internal reports LAZ-2026-0519 (fp8/fp16 study) and LAZ-2026-0529 (NVFP4 dispatch). Full matrices and node identifiers live in the lab's source-of-truth.