
A preserved postmortem · June 2026

Verification Theater in AI Agent Work.

Fluent audit failures, unreadable traces, deterministic custody, and the small checks that still ground human approval.

Probabilistic agents need deterministic custody.

Below the deterministic floor, verify.
Above the floor, manage risk.
Do not call the second one verification.

The incident

The audit trail was fluent, and wrong.

An auditor agent inside my own coding harness fabricated verification evidence three times: a pass claiming rendered browser QA that never ran, a blocker citing file-corruption metrics for a file that was provably clean, and a pass backed by a fabricated hash and byte count. The prose was polished, specific, and confident; read on its own, it was indistinguishable from a real audit. The builder agent reported honestly throughout. This was a single agent confabulating about its own work, with no jailbreak and no attacker.

What caught all three was deterministic: a push gate that refused unverified work, sixty seconds of replayed measurement, and one human opening the page in a browser. Never another model reading the prose. The harness already paired models from different vendors — cross-model diversity did not stop it.

What the agent wrote [FABRICATED]
get_page_text confirmed the
full rendered DOM.
Console: 0 messages.
Network: exactly 1 request.

Confident. Specific. Correctly formatted. None of it happened.

What the substrate said [REPLAYED]
Every browser call that turn
errored on a stale tab ID.
No page was ever rendered.

Caught by the push gate — it refused the push because the required QA evidence did not exist. The commit never reached origin.
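A gate of this kind can be very small. The following is a minimal sketch, not the harness's actual gate: it assumes a hypothetical layout in which the harness (never the agent) writes one non-empty evidence file per commit, and the function name, directory, and file naming are all illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a push gate that requires QA evidence to exist.
# Assumption: evidence lives at <evidence_dir>/<commit>.json and is written
# by tooling the agent cannot reach. None of these names come from the paper.

require_qa_evidence() {
  local evidence_dir="$1" commit="$2"
  local evidence="$evidence_dir/$commit.json"

  # -s: the file exists and is not empty. A claim in prose is not a file.
  if [ -s "$evidence" ]; then
    echo "ALLOW: evidence found at $evidence"
    return 0
  fi

  echo "BLOCK: no QA evidence for commit $commit" >&2
  return 1
}
```

In a Git `pre-push` hook this could be called as `require_qa_evidence .qa-evidence "$(git rev-parse HEAD)" || exit 1`. The gate never reads the audit prose; it only asks whether the artifact exists.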

Evidence Map: three cases.
The original incident: three fabricated audits (a false pass claiming QA that never ran, a false blocker inventing corruption metrics, a false pass with fabricated hash and bytes), caught by the gate, replayed measurements, and a human check.
A clean-failure trial: zero fabrications in three trials. Agents do not always fabricate.
An ambiguous failure: the server log shows the first probe got HTTP 200. The B1 report claimed no HTTP response, while its honest twin B2 reported the 200 and then the server dying. Only the server log separated fluent-false from fluent-true.
Fabrication is conditional, not constant, so the gate must be unconditional.
the evidence map — one incident, a clean-failure control, and the ambiguous catch only the substrate resolved
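The ambiguous case shows what "only the server log separated them" can look like in practice. This is a minimal sketch under a loud assumption: the log is in a Common Log Format style where the request path is the 7th whitespace-separated field and the status code is the 9th. The real log format is not shown in this page, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compare a report's claimed HTTP status to the server log.
# Assumption: Common Log Format style lines (field 7 = path, field 9 = status).

# Print the status of the first logged request for PATH; exit 1 if none exists.
first_probe_status() {
  local log="$1" path="$2"
  awk -v p="$path" '
    $7 == p { print $9; found = 1; exit }
    END { if (!found) exit 1 }
  ' "$log"
}

# Succeed only if the claimed status ("200", "none", ...) matches the log.
claim_matches_log() {
  local log="$1" path="$2" claimed="$3" actual
  actual=$(first_probe_status "$log" "$path") || actual="none"
  [ "$actual" = "$claimed" ]
}
```

Under these assumptions, a report that says "no HTTP response" fails `claim_matches_log server.log /health none` whenever the log holds a 200 for that path, no matter how fluent the report is.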

The deterministic floor

Don't trust the verdict. Run the check.

The floor is a handful of checks where reality decides, not an agent — small enough to read in full, run on inputs you choose, and confirmed by the consequence on a surface the agent does not control. Simplicity is the security property; an agent's complexity is the threat surface.

check-blast-radius.sh

Is this write inside the repo, or reaching outside it?

// the filesystem path answers
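As an illustration of how little code such a check needs, here is a minimal sketch of a path-containment test. It is not the contents of the real check-blast-radius.sh; the function name is hypothetical, and it assumes GNU `realpath` (the `-m` flag resolves paths that do not exist yet).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a blast-radius check: is TARGET inside REPO?
# Assumption: GNU coreutils realpath is available (for the -m flag).

inside_repo() {
  local repo target
  # Resolve symlinks and ".." so the comparison is on canonical paths.
  repo=$(realpath -m -- "$1") || return 1
  target=$(realpath -m -- "$2") || return 1

  case "$target" in
    "$repo"|"$repo"/*) return 0 ;;  # the repo itself, or anything beneath it
    *) return 1 ;;                  # everything else reaches outside
  esac
}
```

Matching on `"$repo"/*` rather than a bare prefix matters: a sibling directory such as `repo-backup` shares the prefix but is not inside the repo.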

check-secrets.sh

Does this file contain an obvious secret?

// a pattern match answers
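A pattern-match gate can be sketched in a few lines. This is not the real check-secrets.sh: the two patterns (an AWS-style access key ID and a PEM private-key header) are illustrative assumptions, and the function name is hypothetical. It deliberately catches only obvious secrets.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a secrets gate. Patterns are examples, not a full list.

check_secrets() {
  local file="$1"

  # Fail closed: a file the gate cannot read is not a file it can clear.
  if [ ! -r "$file" ]; then
    echo "BLOCK: cannot read $file" >&2
    return 1
  fi

  # -e is required because the second pattern begins with a dash.
  if grep -Eq -e 'AKIA[0-9A-Z]{16}' \
              -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' -- "$file"; then
    echo "BLOCK: obvious secret pattern in $file" >&2
    return 1
  fi

  return 0
}
```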

check-irreversible-git.sh

Is this irreversible action actually approved?

// the command + an approval token answer
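One way to picture "the command plus an approval token" is an exact-match lookup. The sketch below is an assumption, not the real check-irreversible-git.sh: it treats the approval record as a file of exact command lines written by a human outside the agent's write reach, and the list of irreversible commands is illustrative rather than complete.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an irreversible-action gate.
# Assumption: approvals are stored one exact command per line, in a file
# the agent cannot write to.

is_irreversible() {
  case "$1" in
    *"push --force"*|*"push -f"*|*"reset --hard"*|*"clean -f"*|*"branch -D"*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

check_irreversible_git() {
  local cmd="$1" approvals="$2"

  # Reversible commands pass without needing approval.
  is_irreversible "$cmd" || return 0

  # -F fixed string, -x whole line: only the exact approved command matches.
  if [ -r "$approvals" ] && grep -Fxq -- "$cmd" "$approvals"; then
    return 0
  fi

  echo "BLOCK: no exact approval on record for: $cmd" >&2
  return 1
}
```

Whole-line matching is the point of the design: an approval for a force-push to one branch does not cover a force-push to another.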

[ ALLOW ] exit 0
[ BLOCK ] exit 1
that is the entire contract

You hold a lit match to the smoke detector with bash gates/test-gates.sh — it feeds every gate a known-good and a known-bad input and confirms each one fires. No agent is in the loop. Read and run the gates yourself ↗
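The known-good and known-bad pattern can be sketched as a tiny runner. This is not the contents of gates/test-gates.sh: `expect_exit` and `demo_gate` are hypothetical names, and the stand-in gate exists only so the sketch runs on its own.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a gate self-test runner.
failures=0

# expect_exit WANT LABEL COMMAND...  runs COMMAND and compares its exit code.
expect_exit() {
  local want="$1" label="$2" got=0
  shift 2
  "$@" >/dev/null 2>&1 || got=$?
  if [ "$got" -eq "$want" ]; then
    echo "ok   $label"
  else
    echo "FAIL $label (want exit $want, got $got)"
    failures=$((failures + 1))
  fi
}

# Stand-in gate for illustration only: blocks input containing "SECRET".
demo_gate() {
  case "$1" in
    *SECRET*) return 1 ;;
    *) return 0 ;;
  esac
}

expect_exit 0 "demo gate allows known-good input" demo_gate "hello world"
expect_exit 1 "demo gate blocks known-bad input" demo_gate "SECRET=hunter2"

# The runner's own status: zero only if every expectation held.
[ "$failures" -eq 0 ]
```

The known-bad case is the lit match: a gate that has never been seen to block proves nothing by allowing.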

Deterministic Custody — agent work (probabilistic reasoning: design, synthesis, judgment, and a fallible audit) flows through a custody layer of deterministic controls outside the agent's write reach (logs it cannot edit, approval records it cannot invent, blast-radius and rollback limits) to consequences (commit, push, deploy, publish) that require exact named human approval. The rule: below the deterministic floor, verify; above the floor, manage risk; do not call the second one verification.
deterministic custody — probabilistic work inside boundaries the agent cannot narrate into existence

The whitepaper

Capture is not verification.

Traces, dashboards, evaluator reports, and governance checklists are observability: they capture what an agent said it did. They become evidence only when they terminate in a small check a human can inspect and tie to a consequence. The full postmortem documents two field incidents, a controlled reproduction, the recursive audit gap, and the governance I built that turned out to be theater, with every load-bearing claim linked to replayable substrate.

The Recursive Audit Gap — agent does work, trace is captured (huge spans), agent summarizes (lossy story), agent audits (prose about prose), dashboard compresses (the human sees a badge). If the human cannot inspect the check directly, the system has only built a taller stack of fluent claims. Recursion stops at substrate: server log, hash, blocked push, live response, visible diff, human approval artifact.
the recursive audit gap — adding agents moves the trust problem up a floor, never grounds it
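A hash is one of the cheapest pieces of substrate to recompute. The sketch below checks a claimed SHA-256 and byte count against the file itself. It is an illustration, not code from the postmortem: the function name is hypothetical, and it assumes GNU `sha256sum` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: recompute a claimed hash and size instead of reading
# the report. Assumption: GNU coreutils sha256sum is available.

verify_claim() {  # verify_claim FILE CLAIMED_SHA256 CLAIMED_BYTES
  local file="$1" claimed_hash="$2" claimed_bytes="$3"
  local actual_hash actual_bytes

  if [ ! -r "$file" ]; then
    echo "BLOCK: cannot read $file" >&2
    return 1
  fi

  actual_hash=$(sha256sum -- "$file" | awk '{ print $1 }')
  actual_bytes=$(wc -c < "$file")

  if [ "$actual_hash" = "$claimed_hash" ] &&
     [ "$actual_bytes" -eq "$claimed_bytes" ]; then
    return 0
  fi

  echo "BLOCK: claim does not match file (sha256 $actual_hash, $actual_bytes bytes)" >&2
  return 1
}
```

The check takes the claim as input and the file as authority, so a fabricated hash fails the same way a typo would.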

The artifact

The dashboard I built to oversee everything.

This is the Decision Cockpit — a real artifact from the project, preserved and embedded below exactly as it was. It is a genuinely nice object. It summarized agent work into a form I could sign. And it is not working oversight. When the human cannot independently check the summary — and the summary is written by the untrusted agent — a dashboard does not produce oversight. It launders agent decisions into a human-signable form: it moves the blame to the human without moving the understanding. I show it because the scaffolding that looked like the answer is the most useful warning.

◆ specimen · a control I tried and found to be theater

[Embedded artifact: the Decision Cockpit. It renders once the site is served; a local single-file preview can't load it.]

Preserved as evidence, not a recommendation: capture is not verification.

Contact

Tell me where this is wrong.

The strongest claim in the paper is "run the floor yourself." If you have a counter-example, a preserved agent-failure postmortem, or a correction — I'd rather be corrected than be the only one in the room.

Comments

Discussion.

Comments are backed by this repo's GitHub Discussions (Announcements) via giscus: no third-party tracker, and every comment lives in the same repo as the evidence. They activate the moment the repo goes public; until then this panel stays empty by design.