Lab Notebook · Inference Platform

Standing up a 70B inference node — and deciding not to ship it

I refit a single bare-metal worker to a 48 GB Blackwell card to unlock 70B-class inference on-prem. The capability now exists on the node. It is not the default serving model — and that decision is the part of this work worth writing down.

Node

phx-ai04 · K3s worker

GPU

RTX PRO 5000 Blackwell 48 GB

Window

2026-05-18 → 05-26

Status

In-flight / honest record

The constraint everything hangs on

One card, 48 GB. A 70B model and a diffusion stack cannot be co-resident in it. The node is single-tenant by necessity, and most of the engineering below is about making that explicit and safe rather than pretending it away.

phx-ai04 (a Dell Precision 5820) does double duty: it serves a private reasoning workload that needs to stay on-prem for sensitive analysis, and it runs a ComfyUI image/video stack. The single hard limit that shapes every decision here is physical — 48 GB will not hold both at once. So the two workloads swap the card by an explicit, GitOps-committed Recreate operation; they never coexist. Four things from this refit hold up as evidence: a procurement pivot, a silent runtime trap, the first sustained-load characterization of the new silicon, and the serving decision that followed.

A procurement pivot, and a claim I had to walk back

The card that landed in the chassis was not the one I first selected. I had an RTX A6000 (Ampere) ordered as a price-conscious fallback; while it was still pre-ship, a new bulk-OEM RTX PRO 5000 Blackwell 48 GB appeared on the same seller below the price of a used RTX 6000 Ada. I cancelled the A6000 and re-ran the comparison from scratch rather than rationalizing the order I already had.

Axis	RTX 6000 Ada (refurb)	PRO 5000 Blackwell (new OEM)	Verdict
Memory bandwidth	960 GB/s (GDDR6)	1,344 GB/s (GDDR7)	+40% on the decode-bound axis
Architecture	Ada SM89, no FP4	Blackwell SM120, native FP4	unlocks a serving point Ada can't
Condition	refurb, unknown hours	new, zero hours	full lifespan ahead
Price	$5,500–7,000 used	~$5,200 new	cheaper, not pricier

Self-correction worth recording

My earlier planning said the 5820's PCIe 3.0 slot would “partially handicap” the card. On review that was wrong, and I corrected it in the decision log. PCIe link speed touches the one-time CPU→GPU model load, not the GPU→VRAM path where the 1,344 GB/s lives and where inference actually runs. A 40 GB model loads in ~3 s on Gen3 x16 vs ~1.5 s on Gen5 — a one-time cost outside the steady state. Saying so in the record was cheaper than letting the framing propagate.

The silent runtime trap Expensive lesson

The most expensive bring-up lesson on this card had no error attached to it. The card loaded the model and ran — at roughly half its throughput, with no signal that anything was wrong.

On Blackwell (SM120), Ollama ≥ 0.24.0 is a hard prerequisite. Below it, the two levers that make the 70B fit and fly — OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 — silently no-op. No warning. The model loads, but flash-attention and the q8_0 KV path never engage, and the 70B trickles along at ~40% CPU offload instead of running entirely on the GPU. Pinning the node to Ollama 0.24.0 on bare-metal systemd is what turned a card stuck at ~40% CPU offload into a clean, 100%-GPU 70B.

The rule that came out of it

Treat a silent capability dependency as a gated prerequisite, not a tuning detail. The version check now precedes the env-var block in the runbook, stated explicitly — because a no-op-without-an-error is otherwise rediscovered the hard way, at the cost of half the card. Verify the running reality of a dependency; don't trust that “it loaded, so it's fine.”

First sustained-load characterization

With the runtime corrected, I drove the card under sustained 70B load for the first time. It ran clean — and the measurement surfaced a binding constraint that is neither obvious nor tunable.

Metric	Measured	Reading
Fit @ num_ctx 16384	~46 / 48 GB, 100% GPU	clean operating point
Fit @ num_ctx 32768	~5% offload to CPU	not a clean point
70B throughput	~22–27 tok/s	at 16k, fully resident
32B throughput (same card)	~47–53 tok/s	~2× the 70B
Thermal	67–70 °C	~23 °C headroom, zero throttle
Power	300 W (hard cap)	software cap, no unlock on this SKU

Two findings matter for operating it. First, the speed deficit is the power envelope, not heat — there is ~23 °C of cooling headroom going unused because Max Power Limit is pinned at 300 W. So the meaningful monitoring signal is sustained 300 W draw, not temperature; I re-pointed alerting accordingly. Second, the production context ceiling is 16k, not the 32k I had assumed — 32768 offloads ~5% to CPU on Q4_K_M at 48 GB, so 16k is the clean operating point and 32k is capture-once-degraded only. I corrected that assumption in the decision log before this run, not after.

Built it, measured it, didn't ship it Not promoted

The whole refit existed to make 70B-class serving possible. When I ran the A/B, I did not promote the 70B to default. Declining to deploy a capability you built is the part of this worth an interview.

I set the promotion bar in advance: the 70B had to clearly win on grounding and fabrication to justify ~2× the latency and the loss of all concurrency. Against the incumbent qwen2.5:32b on a retrieval-augmented workload, it did not clear the bar — and on one probe it regressed. So qwen2.5:32b stays the default; the 70B capability remains on the card, available and unshipped.

No grounding win on the adversarial probe. Asked to list every model benchmarked on a node, both models emitted an identical phantom list including a nonexistent entry, each with a fabricated citation. The 70B fabricated exactly as confidently as the 32B.
An actual regression. On an infrastructure-fact query the 70B parroted a chunk in context and returned the wrong model ID; the 32B answered correctly.
Marginal wins elsewhere — on two queries the 70B hedged or stayed sourced where the 32B added unsourced specifics. Better, but nowhere near enough to pay for 2× latency and zero concurrency.

Validity caveats — stated, not buried

The verdict is STANDING, not final. Two limits sit next to it: the retrieval metrics (Hit@5 / MRR / nDCG) were structurally invalid for this run — the gold set was empty on every eval query — so the decision rests on manual answer-faithfulness review, not scores; and the A/B ran over an un-deduped collection, with textless duplicate chunks polluting the top results. Both make the call overturnable by a clean re-run on the now-deduped corpus, scored on answer-correctness. That re-run is an open item, not a closed one.

A speedup I refused to credit to the new card

The same node runs ComfyUI, characterized in depth in a companion report. The transferable finding: a batch of 80 images that used to take 3–4 hours now finishes in under 16 minutes (two runs at 928 s and 930 s — deterministic). The lazy writeup is “new card, big speedup.” The honest root cause was a wait_for_queue submit throttle I'd added as OOM protection on the previous card and long since pure overhead — ComfyUI executes its queue serially, so per-job VRAM was already bounded. The throttle was inserting idle GPU gaps. Removing it and raising batch size did most of the work; the card's headroom was secondary.

Two more decisions followed from the single-card reality. There is no MIG on a workstation Blackwell, and time-slicing an SM array already at 98% utilization adds context-switch overhead, not throughput — so a second GPU-requesting pod stays Pending by design, and batch size is the real lever. And because ComfyUI's extension ecosystem executes arbitrary code — with active in-the-wild custom-node supply-chain compromise in 2026 — I treat it like Postgres, not a public API: a thin gateway owns auth and validates requests against a fixed registry of known-good workflows referenced by hash, models pinned, image pipelines treated as code.

Break the correlation before claiming the cause

“It got faster after the upgrade” is a correlation you owe it to yourself to take apart. A guard written as protection on yesterday's hardware becomes a silent tax on today's; the attribution was tested, not assumed.

What I take from it

Verify running reality, not declared config — the Ollama no-op had no error; only checking the live runtime version surfaced it.
Set the promotion bar before you measure, so a null result can't be rationalized after the fact.
Decline to deploy when the evidence doesn't support it — capability built is not capability shipped.
State your own validity caveats next to the verdict; that is why this one is honestly labeled STANDING rather than final.
Keep model selection, card hand-off, and the workflow registry as committed, reviewable GitOps artifacts — never a direct edit to a live cluster object.

Status & honesty

This is an in-flight record, not a victory lap. The card was installed 2026-05-22 and remains inside its 30-day return window; the outgoing RTX 3090 stays on the bench as rollback hardware for the full window rather than the originally-planned week. The clean serving re-run on the deduped corpus is the named open item.

All figures here were measured on phx-ai04 during the 2026-05-22 → 05-25 bring-up window. Where a value is a target rather than a measurement it is labeled as such. The serving verdict rests on manual answer-faithfulness review; the retrieval metrics for that run were structurally invalid and are not relied upon. No conclusion rests on an unverified attribution.

Lab note adapted from internal report LAZ-2026-0526 (“Operation MaxQ”). Identifiers and the broader topology live in the lab's source-of-truth, not here.