I refit a single bare-metal worker to a 48 GB Blackwell card to unlock 70B-class inference on-prem. The capability now exists on the node. It is not the default serving model — and that decision is the part of this work worth writing down.
One card, 48 GB. A 70B model and a diffusion stack cannot be co-resident in it. The node is single-tenant by necessity, and most of the engineering below is about making that explicit and safe rather than pretending it away.
phx-ai04 (a Dell Precision 5820) does double duty: it serves a private reasoning workload that needs to stay on-prem for sensitive analysis, and it runs a ComfyUI image/video stack. The single hard limit that shapes every decision here is physical — 48 GB will not hold both at once. So the two workloads swap the card by an explicit, GitOps-committed Recreate operation; they never coexist. Four things from this refit hold up as evidence: a procurement pivot, a silent runtime trap, the first sustained-load characterization of the new silicon, and the serving decision that followed.
The card that landed in the chassis was not the one I first selected. I had an RTX A6000 (Ampere) ordered as a price-conscious fallback; while it was still pre-ship, a new bulk-OEM RTX PRO 5000 Blackwell 48 GB appeared on the same seller below the price of a used RTX 6000 Ada. I cancelled the A6000 and re-ran the comparison from scratch rather than rationalizing the order I already had.
| Axis | RTX 6000 Ada (refurb) | PRO 5000 Blackwell (new OEM) | Verdict |
|---|---|---|---|
| Memory bandwidth | 960 GB/s (GDDR6) | 1,344 GB/s (GDDR7) | +40% on the decode-bound axis |
| Architecture | Ada SM89, no FP4 | Blackwell SM120, native FP4 | unlocks a serving point Ada can't |
| Condition | refurb, unknown hours | new, zero hours | full lifespan ahead |
| Price | $5,500–7,000 used | ~$5,200 new | cheaper, not pricier |
My earlier planning said the 5820's PCIe 3.0 slot would “partially handicap” the card. On review that was wrong, and I corrected it in the decision log. PCIe link speed touches the one-time CPU→GPU model load, not the GPU→VRAM path where the 1,344 GB/s lives and where inference actually runs. A 40 GB model loads in ~3 s on Gen3 x16 vs ~1.5 s on Gen5 — a one-time cost outside the steady state. Saying so in the record was cheaper than letting the framing propagate.
The most expensive bring-up lesson on this card had no error attached to it. The card loaded the model and ran — at roughly half its throughput, with no signal that anything was wrong.
On Blackwell (SM120), Ollama ≥ 0.24.0 is a hard prerequisite. Below it, the two levers that make the 70B fit and fly — OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 — silently no-op. No warning. The model loads, but flash-attention and the q8_0 KV path never engage, and the 70B trickles along at ~40% CPU offload instead of running entirely on the GPU. Pinning the node to Ollama 0.24.0 on bare-metal systemd is what turned a card stuck at ~40% CPU offload into a clean, 100%-GPU 70B.
Treat a silent capability dependency as a gated prerequisite, not a tuning detail. The version check now precedes the env-var block in the runbook, stated explicitly — because a no-op-without-an-error is otherwise rediscovered the hard way, at the cost of half the card. Verify the running reality of a dependency; don't trust that “it loaded, so it's fine.”
With the runtime corrected, I drove the card under sustained 70B load for the first time. It ran clean — and the measurement surfaced a binding constraint that is neither obvious nor tunable.
| Metric | Measured | Reading |
|---|---|---|
| Fit @ num_ctx 16384 | ~46 / 48 GB, 100% GPU | clean operating point |
| Fit @ num_ctx 32768 | ~5% offload to CPU | not a clean point |
| 70B throughput | ~22–27 tok/s | at 16k, fully resident |
| 32B throughput (same card) | ~47–53 tok/s | ~2× the 70B |
| Thermal | 67–70 °C | ~23 °C headroom, zero throttle |
| Power | 300 W (hard cap) | software cap, no unlock on this SKU |
Two findings matter for operating it. First, the speed deficit is the power envelope, not heat — there is ~23 °C of cooling headroom going unused because Max Power Limit is pinned at 300 W. So the meaningful monitoring signal is sustained 300 W draw, not temperature; I re-pointed alerting accordingly. Second, the production context ceiling is 16k, not the 32k I had assumed — 32768 offloads ~5% to CPU on Q4_K_M at 48 GB, so 16k is the clean operating point and 32k is capture-once-degraded only. I corrected that assumption in the decision log before this run, not after.
The whole refit existed to make 70B-class serving possible. When I ran the A/B, I did not promote the 70B to default. Declining to deploy a capability you built is the part of this worth an interview.
I set the promotion bar in advance: the 70B had to clearly win on grounding and fabrication to justify ~2× the latency and the loss of all concurrency. Against the incumbent qwen2.5:32b on a retrieval-augmented workload, it did not clear the bar — and on one probe it regressed. So qwen2.5:32b stays the default; the 70B capability remains on the card, available and unshipped.
The verdict is STANDING, not final. Two limits sit next to it: the retrieval metrics (Hit@5 / MRR / nDCG) were structurally invalid for this run — the gold set was empty on every eval query — so the decision rests on manual answer-faithfulness review, not scores; and the A/B ran over an un-deduped collection, with textless duplicate chunks polluting the top results. Both make the call overturnable by a clean re-run on the now-deduped corpus, scored on answer-correctness. That re-run is an open item, not a closed one.
The same node runs ComfyUI, characterized in depth in a companion report. The transferable finding: a batch of 80 images that used to take 3–4 hours now finishes in under 16 minutes (two runs at 928 s and 930 s — deterministic). The lazy writeup is “new card, big speedup.” The honest root cause was a wait_for_queue submit throttle I'd added as OOM protection on the previous card and long since pure overhead — ComfyUI executes its queue serially, so per-job VRAM was already bounded. The throttle was inserting idle GPU gaps. Removing it and raising batch size did most of the work; the card's headroom was secondary.
Two more decisions followed from the single-card reality. There is no MIG on a workstation Blackwell, and time-slicing an SM array already at 98% utilization adds context-switch overhead, not throughput — so a second GPU-requesting pod stays Pending by design, and batch size is the real lever. And because ComfyUI's extension ecosystem executes arbitrary code — with active in-the-wild custom-node supply-chain compromise in 2026 — I treat it like Postgres, not a public API: a thin gateway owns auth and validates requests against a fixed registry of known-good workflows referenced by hash, models pinned, image pipelines treated as code.
“It got faster after the upgrade” is a correlation you owe it to yourself to take apart. A guard written as protection on yesterday's hardware becomes a silent tax on today's; the attribution was tested, not assumed.
This is an in-flight record, not a victory lap. The card was installed 2026-05-22 and remains inside its 30-day return window; the outgoing RTX 3090 stays on the bench as rollback hardware for the full window rather than the originally-planned week. The clean serving re-run on the deduped corpus is the named open item.
All figures here were measured on phx-ai04 during the 2026-05-22 → 05-25 bring-up window. Where a value is a target rather than a measurement it is labeled as such. The serving verdict rests on manual answer-faithfulness review; the retrieval metrics for that run were structurally invalid and are not relied upon. No conclusion rests on an unverified attribution.
Lab note adapted from internal report LAZ-2026-0526 (“Operation MaxQ”). Identifiers and the broader topology live in the lab's source-of-truth, not here.