Lab Notebook · Incident Remediation

A backup that reported success — and refuting its recorded root cause

A vector-store backup was reporting success while doing nothing. The interesting part isn't the fix — it's that the root cause written into the incident record didn't survive contact with the live system, and correcting it changed the entire remediation.

Ticket

LAZ-INC-2026-05-25

Hosts

phx-ai01 · phx-ai02

Date

2026-05-26 · Day 1

Severity

P0 (latent)

Summary Corpus safe In progress

No data was at risk at any point. Two corpus copies are checksum-verified; the backup script is hardened and gates clean; an on-prem alert channel is stood up. The backup timer remains disabled by design until the alerting is fully wired.

2 / 2

corpus copies checksum-verified

collections now covered

downtime during remediation

I consolidated several design and red-team passes into one vetted plan and executed it in a locked order, under two standing constraints: cluster changes go through Git and Flux only, and the alert channel had to move on-prem.

What actually failed

The incident record attributed the silent failure to a dead ERRORS counter — a script that returned exit 0 straight through a failure. I checked that against the running script before building anything on top of it.

Recorded root cause — refuted

The live script does check ERRORS and exits 1. That root cause is wrong. The true silent vector was operational, not a code bug: a disabled timer still carrying a frozen Result=success from a stale prior run, and a freshness monitor that logged CRITICAL into a channel with no alert path attached. The system looked healthy because the last good run's status never expired and the warning had nowhere to go. The incident doc and the case-study narrative are being amended to match.

This is why I check the live system before trusting its record: a remediation built on the wrong cause would have hardened a counter that already worked and left the real failure — a stale status with no live alert — fully intact.

Hardening the backup — fail-closed

The rewritten qdrant-backup.sh is built to refuse to report success unless it can prove it. Installed on phx-ai01 with the timer deliberately left off.

Dynamic collection enumeration — the script discovers collections at run time instead of trusting a hardcoded list (which had silently omitted one; see §04).
Fail-closed accounting — any collection that isn't provably backed up fails the run; success is earned, not assumed.
Downloaded-artifact integrity — every snapshot is checked on size and SHA256, then verified again at the remote after an additive rsync (no --delete, so a bad run can never prune good copies).
Retention after verification only — old copies are aged out after the new copy is confirmed, never before.
Single-instance lock (flock) — overlapping runs can't corrupt each other.

Two traps the rewrite surfaced

Hardening the script exposed two latent failures that the old path would have hit eventually.

Permissions trap — discovered before it fired

Snapshots land on the hostPath as root:root 0600. The first-pass gate only passed because I ran it under sudo — a timer-fired run as the service user would have failed silently on read. The v1.1 script switches to an API-download transport, which resolves this without root and without a unit change.

Coverage gap. The old hardcoded collection list omitted claude_canon entirely — it was never being backed up. Dynamic enumeration eliminates that whole class of error. (I also found a stale comment in the live script pointing at the wrong host; corrected in the record.)

Alerting that actually alerts

The original silent failure was half “stale status” and half “a CRITICAL with nowhere to go.” So I stood up a self-hosted ntfy channel on phx-ai02 (bare Docker, deny-all auth) to replace the public, throttled, off-prem service — and verified it behaves before relying on it.

Check	Result
Health endpoint	OK
Anonymous publish	403 (denied)
Authenticated publish	200 (delivered)
Persistence across restart	retained

The v1.1 script carries bearer-token auth against that channel. It is shellcheck-clean and smoke-tested; the final install gate is the one remaining step before the timer comes back on.

A detour I flagged and contained Scope-drift

Mid-session, a host banner on phx-ai02 showed more than 31,000 zombie processes (31,059), traced to the Gitea pod running as PID 1 with no init reaper — orphaned git subprocesses accumulating over 22 days. This was off the remediation plan, so I flagged it as scope-drift, contained it, and kept the actual root cause for the backlog.

Contained without touching anything load-bearing

A clean kubectl delete pod let the PID-namespace teardown reap all of them — no reboot, no exposure to cluster quorum, etcd, or the backup server. Gitea came back healthy and the Flux source reported Ready=True. The real fix (a proper init / tini as PID 1) is deferred to the backlog rather than improvised into an unrelated incident. A separate accidental run of the wrong step on this host was confirmed a no-op — it preflight-aborted with no Qdrant present and left the off-host backups untouched.

Current posture & what's left

Corpus — two copies, checksum-verified. Safe throughout.
Backup script — hardened v1.0 installed on phx-ai01, gates clean; v1.1 (API transport + token auth) staged, shellcheck-clean, smoke-tested. Final install gate pending
Alert channel — on-prem ntfy verified end to end.
Timer — deliberately disabled until alerting is fully wired. A backup you can't trust to alert on failure is worse than one you know is off.

This is a Day-1 remediation record, not a closed incident. The fixes described are installed or staged as noted; the timer re-enable follows the final gate and the amended incident doc. No data was lost or at risk at any point.

Lab note adapted from internal worklog LAZ-INC-2026-05-25 (Qdrant backup remediation). Host identifiers, hashes, and the parent incident live in the lab's source-of-truth.