assay
as·say (n.) — a test of whether something is what it claims to be. Here: every AI decision in a regulated workflow.
Trustworthy AI for compliance — by design, not by hope. An audit-first governance layer that puts an LLM to work on messy filings, trade records, and change tickets — then catches everything the model gets wrong before it ships.
🛡 Govern — the system · part 3 of a 3-part series on measuring & governing AI in regulated domains — 🔎 Validate · 📊 Measure · Govern (here)
You don’t make the model trustworthy — you make the system trustworthy despite the model.
At a glance
- Measured — control-F1 0.87 [0.73, 1.00] (bootstrap CI) for the deterministic baseline on the gold set. The eval earns its keep by surfacing a real weakness, not a vanity number: that baseline over-claims and over-blocks (gate-accuracy 0.60).
- Tested — 35 tests, all run offline (no API key).
- Not vendor-locked — model-backed runs use Claude or any OpenAI-compatible backend.
The problem
LLMs are good at reading messy documents and drawing conclusions. The problem in regulated industries — finance, healthcare, legal — is that a wrong answer that ships quietly is worse than no answer at all. Compliance decisions have to be provable, auditable, and defensible to a regulator. A confident hallucination doesn’t cut it.
The standard approach is to prompt-engineer your way to accuracy and hope. assay takes a different position: treat the LLM as one unreliable component in a system that’s designed to catch its failures, not trust them.
What it does
assay is a governance layer for AI decisions in high-consequence domains. The LLM does the reading. A gauntlet of deterministic checks, independent review, and human escalation handles everything the LLM can get wrong.
Two real compliance domains are implemented as proof it generalizes:
Personal account dealing (PAD) surveillance — in financial services, employees must get pre-approval before trading securities their firm is involved with. assay reconciles an employee’s trade against approval emails, a blackout list, covered accounts, and timing rules. It flags violations, routes ambiguous cases to humans, and produces a workpaper a compliance officer can defend.
SOX change-management testing — Sarbanes-Oxley requires that production code changes are authorized, tested, and approved by someone other than the person who made the change. assay evaluates a change ticket against those controls and flags failures with evidence.
Same engine under both — that’s the point.
Three layers of defense
No single check is trusted. Every AI decision passes through:
- Grounding gate — every claim the model makes must cite verbatim evidence from the source documents. A fabricated citation is blocked before it can proceed. Catches hallucination.
- Deterministic rules — timing violations, blackout list membership, and segregation-of-duties checks are computed in code, not left to the model. Catches wrong conclusions drawn from real evidence.
- Abstention → human — when evidence is genuinely ambiguous, the model flags it rather than guessing (“an informal ‘go ahead’ isn’t a formal approval”) and routes to a human review queue. Never guesses on the unknowable.
Around the run: temperature 0 with a pinned model version, every prompt and raw output logged, independent review by a separate operator, maker-checker approval, and a tamper-evident audit log.
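The first layer, the grounding gate, can be sketched as a verbatim-span check. This is a minimal illustration, not the code in gate.py: the `Claim` type, the `grounding_gate` function, the returned field names, and the sample evidence strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Claim:
    control_id: str  # e.g. "ITGC-CM-01"
    quote: str       # text the model says it copied from the evidence


def normalize(text: str) -> str:
    """Collapse whitespace so line wrapping alone cannot fail a match."""
    return " ".join(text.split())


def grounding_gate(claims: List[Claim], evidence: List[str]) -> dict:
    """Block the run if any cited quote is not a verbatim span of the evidence."""
    corpus = [normalize(doc) for doc in evidence]
    ungrounded = [
        c for c in claims
        # An empty quote would trivially match every document, so reject it.
        if not c.quote.strip()
        or not any(normalize(c.quote) in doc for doc in corpus)
    ]
    return {
        "status": "blocked" if ungrounded else "pass",
        "grounded": len(claims) - len(ungrounded),
        "ungrounded": len(ungrounded),
        "ungrounded_controls": [c.control_id for c in ungrounded],
    }


# Hypothetical evidence text, for illustration only.
evidence = [
    "CHG-1042 was approved by j.lee on the change board.",
    "tests passed; results attached 2026-03-03",
]
clean = [
    Claim("ITGC-CM-01", "approved by j.lee"),
    Claim("ITGC-CM-03", "tests passed; results attached 2026-03-03"),
]
fabricated = [clean[1], Claim("ITGC-CM-01", "Approved by the CEO on January 1st.")]

clean_result = grounding_gate(clean, evidence)
blocked_result = grounding_gate(fabricated, evidence)
```

Matching on exact substrings is deliberately strict: a paraphrased citation fails the same way a fabricated one does, which keeps the gate's behavior easy to audit.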
The gated pipeline

Walkthrough: one change, two outcomes
The same input — CHG-1042, “adjust invoice rounding in the production billing pipeline,” author m.chu — runs the pipeline. The only thing that differs is what the model returns at assess.
Clean run — it ships:
| Stage | What happens | Decision |
|---|---|---|
| assess (LLM) | Maps two controls to verbatim evidence: ITGC-CM-01 ← “approved by j.lee … prior to deployment”; ITGC-CM-03 ← “tests passed; results attached 2026-03-03”. temp 0, prompt + raw output logged. | 2 cited |
| ① grounding gate | Both citations found verbatim in the evidence. | grounded: 2, ungrounded: 0 → pass |
| ② rule checks | Author m.chu ≠ change-approver j.lee (SoD); tests dated before deploy. Computed in code. | no exception |
| ③ independent review | A separate operator re-performs the check. | accept |
| ④ maker-checker | Workpaper signer a.singh ≠ author m.chu. | approved |
| output | workpaper.json: "verdict": "approve", "exceptions": [], "conclusion": "no exceptions", + hash-chained audit log. | ships |
Fabricated run — it’s blocked: same change, but the model cites an approval that isn’t in the evidence — “Approved by the CEO on January 1st.” The grounding gate (①) finds no matching span:
step_result … "status": "blocked" … "reason": "anti-fabrication gate: 1 unciteable assertion(s)"
Nothing ships. The prompt and raw output are preserved in llm_mapping.json so the failure itself is auditable. → examples/sample_run/
How exceptions are handled
“Exception” means four different things here, and each routes differently — nothing is dropped silently:
| Situation | Caught at | What happens |
|---|---|---|
| Fabrication — a claim cites evidence that doesn’t exist | ① grounding gate | BLOCK — nothing ships. The gate is not ops-overridable; a block is fixed at the source, never bypassed. |
| Ambiguity — evidence can’t confirm a formal control (an informal “go ahead” ≠ a pre-clearance) | assess → gate | the model abstains; verdict becomes REVIEW and routes to the human queue instead of guessing. |
| Control failure — the check ran on real evidence, but a control isn’t met (late pre-approval, approver = author) | ② rule checks | recorded as an exception in the workpaper (conclusion: VIOLATION); SEV-4 → control-owner queue. |
| Step failure — a step raises at runtime | run loop | caught, logged as step_failed with the error, and checkpointed → the run returns FAILED and can be resumed from that step without recomputing earlier steps. Never silent. |
Who actually looks: blocks and abstentions are human-reviewed at 100%; clean approvals are stratified-sampled (default 20%) to bound the missed-error rate — assurance by sampling, not by reading every item.
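The review policy above can be sketched as a queue builder. This is a hedged illustration, not the code in review.py: the `build_review_queue` name, the record fields, the uppercase verdict strings, and the choice of domain as the stratum key are all assumptions.

```python
import math
import random
from collections import defaultdict


def build_review_queue(results, sample_rate=0.20, seed=0):
    """All blocks and abstentions, plus a per-stratum sample of clean approvals."""
    rng = random.Random(seed)  # seeded so the selection itself is reproducible

    # 100% of blocks and abstentions go to a human.
    queue = [r["run_id"] for r in results if r["verdict"] in ("BLOCK", "REVIEW")]

    # Group clean approvals by stratum (assumed here to be the domain).
    strata = defaultdict(list)
    for r in results:
        if r["verdict"] == "APPROVE":
            strata[r["domain"]].append(r["run_id"])

    # Sample each stratum separately so a small domain is never skipped.
    for domain in sorted(strata):
        ids = sorted(strata[domain])
        k = max(1, math.ceil(sample_rate * len(ids)))
        queue.extend(rng.sample(ids, k))
    return queue


# Toy data: 10 approvals per domain, one block, one abstention.
results = (
    [{"run_id": f"sox-{i}", "domain": "sox", "verdict": "APPROVE"} for i in range(10)]
    + [{"run_id": f"pad-{i}", "domain": "pad", "verdict": "APPROVE"} for i in range(10)]
    + [{"run_id": "sox-blocked", "domain": "sox", "verdict": "BLOCK"},
       {"run_id": "pad-review", "domain": "pad", "verdict": "REVIEW"}]
)
queue = build_review_queue(results)
```

Sampling within each stratum rather than across the whole pool means a low-volume domain still gets at least one reviewed approval per batch.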
When it’s bigger than one workpaper: the escalation playbook grades severity — from SEV-1 (audit log fails verify() or a gate bypass → halt, notify immediately) down to SEV-4 (a single workpaper exception → normal queue) — every incident referencing the immutable run_id + hash-chained log.
Every guarantee is backed by a test and an artifact
| Guarantee | Proof |
|---|---|
| Reproducible — temp 0, pinned model, prompt + raw output logged | test_llm_mapping_is_logged_for_reproducibility · sample_run/clean/artifacts/llm_mapping.json |
| Anti-hallucination — fabricated citation is blocked | test_llm_fabricated_citation_is_blocked · sample_run/blocked/audit.jsonl |
| Abstention works — ambiguous evidence routes to human, not a guess | test_abstention_routes_to_review, test_ambiguous_punts_to_human |
| Rules in code — timing / blackout decided deterministically, not by the model | test_late_preapproval_is_a_violation, test_blackout_trade_is_a_violation |
| Independent review — a separate operator re-performs the check | test_independent_reviewer_can_reject |
| Tamper-evident log — edit any record and verify() fails | test_audit_tamper_is_detected |
| Resumable — a paused run resumes without recomputation | test_block_halts_and_resume_is_idempotent |
| Measured — gold set with control-F1 + bootstrap CIs, honest baseline finding | docs/EVAL.md · test_deterministic_baseline_runs_and_scores |
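For readers unfamiliar with the bootstrap CIs in the "Measured" row, here is a minimal percentile-bootstrap sketch. It assumes each gold item reduces to a binary (gold, predicted) pair and uses toy data; the actual control-F1 definition and gold set are described in docs/EVAL.md, and the function names here are hypothetical.

```python
import random


def f1(pairs):
    """F1 over (gold, predicted) boolean pairs; 0.0 when there are no positives."""
    tp = sum(1 for gold, pred in pairs if gold and pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    fn = sum(1 for gold, pred in pairs if gold and not pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample items with replacement, rescore, take quantiles."""
    rng = random.Random(seed)
    n = len(pairs)
    scores = sorted(
        f1([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return f1(pairs), lo, hi


# Toy data only: 6 true positives, 1 miss, 1 false alarm, 2 true negatives.
toy = [(True, True)] * 6 + [(True, False)] + [(False, True)] + [(False, False)] * 2
point, lo, hi = bootstrap_ci(toy)
```

On a small gold set the interval is wide, which is the point of reporting it: the width shows how much of the headline number depends on which items happened to be included.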
Status — what’s built, what’s next
Working today
- Eval core — grounding + anti-fabrication gate, composable graders
- Control plane — checkpointed run loop, tamper-evident audit log, content-hashed artifacts, maker-checker (SoD)
- LLM judgment — Claude does control mapping + evidence sufficiency, with abstention on the unknowable
- Two reference domains — SOX change-management + PAD trade surveillance, one engine
- Measured eval — gold set + control-F1 / precision / recall + bootstrap CIs, runnable offline (control-F1 0.87 on the deterministic baseline)
- Observability — OpenTelemetry / OpenInference spans, Phoenix UI
Next
- LLM-vs-baseline headline number on a larger, less-synthetic gold set — the deterministic baseline is measured; the model result lands here (an honest null if it doesn’t beat the baseline)
- Team-handoff & ops reports as first-class artifacts carrying the audit-log hash
Full plan: ROADMAP.md.
Design decisions
The grounding gate is not ops-overridable. An exception workflow that can bypass the gate defeats its purpose — a compliance officer who can route around a block has no block. Blocks are resolved at the source (fixing the evidence or the model output) or escalated as SEV-1, never bypassed in production.
Abstention routes to human, not to a low-confidence answer. When evidence is genuinely ambiguous, the model is prompted to abstain and flag it explicitly. A hedge (“probably approved”) gives a downstream system something to act on that shouldn’t be acted on. Ambiguity gets a human, not a probability.
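One way to enforce this is to give the model's answer a closed vocabulary with no confidence field, so that a hedge has nowhere to go. The sketch below is an assumption about how that could look, not the project's implementation: the `Finding` values, `parse_finding`, and `route` are hypothetical names, and treating unrecognized output as an abstention is a design choice of this example.

```python
from enum import Enum
from typing import Dict


class Finding(Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    ABSTAIN = "abstain"


def parse_finding(raw: str) -> Finding:
    """Map model output onto the closed vocabulary; anything else is an abstention."""
    try:
        return Finding(raw.strip().lower())
    except ValueError:
        # "probably approved", "likely", "0.8" and similar hedges land here.
        return Finding.ABSTAIN


def route(findings: Dict[str, Finding]) -> str:
    """Any abstention sends the whole item to the human queue."""
    if any(f is Finding.ABSTAIN for f in findings.values()):
        return "REVIEW"
    return "CONTINUE"  # proceeds to the deterministic rule checks


hedged = {"ITGC-CM-01": parse_finding("probably approved"),
          "ITGC-CM-03": parse_finding("supported")}
clear = {"ITGC-CM-01": parse_finding("Supported"),
         "ITGC-CM-03": parse_finding("supported")}
```

Because there is no numeric score in the schema, a downstream system cannot threshold its way into acting on an ambiguous case.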
Timing, blackout, and SoD checks run in code, not in the model. The LLM does judgment (does this text constitute sufficient evidence of approval?); deterministic rules do arithmetic and lookups (is the approval date before the trade date? is the approver the same person as the author? is the security on the blackout list?). These checks produce the same answer every time, are auditable by inspection, and don’t vary with sampling temperature or prompt wording.
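A minimal sketch of such rule checks follows. The function name, parameters, exception strings, and sample values are illustrative assumptions, and the strict "approval before action" comparison is one possible reading of the timing rule.

```python
from datetime import datetime
from typing import FrozenSet, List, Optional


def rule_exceptions(
    author: str,
    approver: str,
    approved_at: Optional[datetime],
    acted_at: datetime,
    ticker: Optional[str] = None,
    blackout: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return every failed control as a string; an empty list means no exception."""
    exceptions = []
    # Segregation of duties: the approver must not be the author.
    if approver == author:
        exceptions.append("SoD: approver is the author")
    # Timing: approval must exist and strictly precede the trade or deployment.
    if approved_at is None:
        exceptions.append("timing: no approval on record")
    elif approved_at >= acted_at:
        exceptions.append("timing: approval is not before the action")
    # Blackout: only relevant when a security is involved.
    if ticker is not None and ticker in blackout:
        exceptions.append(f"blackout: {ticker} is on the blackout list")
    return exceptions


# Hypothetical timestamps and tickers, for illustration only.
clean = rule_exceptions(
    "m.chu", "j.lee",
    approved_at=datetime(2026, 3, 3, 9, 0), acted_at=datetime(2026, 3, 4, 9, 0),
)
late_self_approved = rule_exceptions(
    "m.chu", "m.chu",
    approved_at=datetime(2026, 3, 5, 9, 0), acted_at=datetime(2026, 3, 4, 9, 0),
    ticker="ACME", blackout=frozenset({"ACME"}),
)
```

Returning all failed controls, rather than stopping at the first, gives the workpaper a complete exception list for a single run.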
Stratified sampling, not exhaustive review. 100% of blocks and abstentions are human-reviewed; clean approvals are sampled at 20%. This mirrors how actual compliance teams operate and is the honest claim: assurance by measured reliability and sampling, not by reading every item.
Hash-chained audit log. Sequential SHA-256 chaining means any edit to any past record invalidates the chain forward. The log’s integrity is provable on demand (verify()) — not promised by policy.
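The chaining idea can be sketched in a few lines. This is an illustration under assumptions, not the code in plane/audit.py: the entry layout, the all-zero genesis value, and the canonical-JSON serialization are choices made for the example; only SHA-256 chaining and the `verify()` name come from the description above.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """SHA-256 over the previous hash plus a canonical serialization of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Add a record whose hash commits to everything that came before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev_hash": prev, "hash": _digest(prev, record)})


def verify(log: list) -> bool:
    """Recompute the chain from the start; any edited record breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev or entry["hash"] != _digest(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"step": "assess", "status": "ok"})
append(log, {"step": "grounding_gate", "status": "blocked"})
append(log, {"step": "run", "status": "halted"})
intact = verify(log)

log[1]["record"]["status"] = "pass"  # tamper with a past record
tampered_ok = verify(log)
```

A chain like this detects edits to individual records. Someone who can rewrite the entire log could recompute every hash, so the latest hash also needs to be recorded somewhere the log's writer cannot change.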
Quickstart
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest -q # full test suite runs offline, no API key
python examples/change_approval_demo.py # SOX: gate blocks a fabricated approval
python examples/personal_trade_demo.py # PAD: clears / violations / routes to human
python examples/review_queue_demo.py # stratified human-review queue
Model-backed runs use any backend — Claude or a free OpenAI-compatible one (see .env.example). The whole test suite runs offline.
Scope and limits
- Verifies a claim is traceable to the provided evidence — not that the evidence itself is authentic. A forged approval that’s faithfully cited still passes the gate; evidence authenticity is a separate control.
- Assurance is by measured reliability + sampling, not exhaustive review. The gold set is small and synthetic.
- This is a reference implementation, not a production platform — no concurrency, multi-tenancy, or durability hardening.
Layout
src/assay/
grounding.py gate.py graders.py faithfulness.py llm.py eval.py review.py
plane/ audit.py core.py
apps/personal_trade/ # PAD surveillance
apps/change_approval/ # SOX change-management
Full design docs including data flow, RACI, control register, and runbook: docs/
Honest measurement is the brand. Every repo in this three-part series reports its own null or limitation, not a vanity number — assay’s eval surfaces a real weakness (the baseline over-blocks, gate-acc 0.60) rather than hiding it.
Public / synthetic data only. No proprietary content.