assay — roadmap & plan

The third part of a three-part arc — 🔎 Validate (agentic-test-eval) → 📊 Measure (filing-event-eval) → 🛡 Govern (assay). assay is the governance layer: an audit-first evaluation + control plane that makes AI behavior provable and auditable in regulated environments.

Status

Done

Eval core — grounding, anti-fabrication gate, composable graders.
Control plane — deterministic checkpointed run loop, tamper-evident audit log, content-hashed artifacts, maker-checker (SoD) human approval.
LLM judgment step — Claude does control mapping + evidence sufficiency, with abstention.
Reference domains — SOX ITGC change-management and PAD trade surveillance on one engine (synthetic data).
Observability — OpenTelemetry / OpenInference spans, Phoenix UI.

Next

Eval layer — control-F1 + bootstrap CIs on a larger, less-synthetic gold set (the measured result).
Team-handoff & ops reports as first-class, audit-hash-carrying artifacts.

Decided direction

AI-first, then refine. Put a real LLM (Claude) in the judgment steps (control mapping + evidence sufficiency) so there’s genuine AI behavior to evaluate, then refine prompts/graders. (needs ANTHROPIC_API_KEY)
- ⏰ Reminder to revisit: gold-set strategy — hand-verified vs. AI-drafted-then-reviewed (the recall-rigor decision).
Auditable team-handoff report. Beyond the internal workpaper, each run emits a control report for a consuming team (operations / audit / risk): a clean, relyable summary (scope, results, exceptions, period) carrying the audit-log hash so the receiving team can verify it. SOC-style handoff.
Observability now, with an operations report. Phoenix / OpenTelemetry traces across the run, rolled up into an ops report — clean vs. issues — so operations can triage which runs passed and which hit gates / exceptions / failures.
Reference app #2 — incident runbook (SRE). A runbook workflow on the same plane with SLAs and escalation paths (breach → escalate; human approval for risky steps) — proving the plane generalizes beyond audit.
The eval layer. Gold set + metrics: precision / recall of control mapping, faithfulness of the rationale, deficiency-call accuracy. This is what makes it a measured result, not just orchestration.

Build order

Each step ships code and its governance doc together — the doc isn’t an afterthought, it’s part of the deliverable. (✅ = doc already drafted in docs/; 🔧 = doc finalized when this step lands.)

Governance spine — DATA_FLOW, STAKEHOLDERS/RACI, CONTROLS, ARTIFACTS, VALIDATION drafted up front. ✅ (done)
LLM judgment step (control mapping + evidence sufficiency, behind the existing deterministic step interface). → exercises CONTROLS.md against real model behavior; needs ANTHROPIC_API_KEY.
Team-handoff + ops reports as first-class artifacts. 🔧 finalizes ARTIFACTS.md (report schemas + audit-hash).
Observability — OpenTelemetry / OpenInference (vendor-neutral; Phoenix default, W&B Weave export, self-host for on-prem) feeding the ops “clean vs. issues” report. 🔧 finalizes OBSERVABILITY.md (spans → SLIs → ops queue).
Eval layer + gold set (revisit hand-verification). 🔧 finalizes VALIDATION.md (metrics + acceptance thresholds).
Reference app #2 — incident runbook (SLA + escalation). 🔧 finalizes RUNBOOK.md + ESCALATION.md (SLA targets, SEV tiers, on-call).

Public / synthetic data only — never employer control, ticket, or audit data.