Evaluation — gold set + measured results

The review’s sharpest hit (Anthropic ML engineer): “the eval itself is unvalidated — no gold set, no precision/recall, no honest experiment.” This is the answer: a human-verifiable gold set + a harness that scores any control mapper with bootstrap CIs, runnable offline (python examples/eval_demo.py).

Gold set

src/assay/apps/change_approval/gold.py holds 5 synthetic, labeled change records (public/synthetic data only). Each record labels the controls that the evidence grounds, any seeded deficiency, and whether the gate should block. This is a starter set; expand it via the AI-draft → human-verify workflow in VALIDATION.md, the human pass that breaks AI-grading-AI circularity.
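To make the three labels concrete, here is a minimal sketch of what one gold record could look like. The `GoldRecord` class, its field names, the control ids, and the change id are all hypothetical and chosen for illustration; the real schema in gold.py may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldRecord:
    """One labeled change record (hypothetical shape, not the real gold.py schema)."""
    change_id: str
    evidence: dict[str, str]            # evidence span id -> span text
    grounded_controls: frozenset[str]   # controls the evidence actually supports
    seeded_deficiency: Optional[str]    # e.g. "SoD", or None for a clean change
    should_block: bool                  # the gate decision a human reviewer expects


# Illustrative record only; the id and control names are invented.
clean_example = GoldRecord(
    change_id="CHG-EXAMPLE-1",
    evidence={
        "approval": "Approved by a reviewer other than the author.",
        "test": "Regression suite passed.",
    },
    grounded_controls=frozenset({"CTRL-APPROVAL", "CTRL-TESTING"}),
    seeded_deficiency=None,
    should_block=False,
)
```

Keeping the human-written labels next to the evidence they refer to is what lets a verifier check each label by reading the record alone.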

Metrics

Control mapping: precision / recall / F1 of the mapper’s asserted controls vs the gold labels.
Grounding rate: fraction of the mapper’s citations that are actually present in the evidence.
Deficiency accuracy: did the mapper correctly flag the seeded segregation-of-duties (SoD) failure?
Gate efficacy: did the gate’s block/allow decision match the label?
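The first two metrics can be sketched as below. This is an illustration under assumptions, not the harness code: the function names, the control ids, and the convention that an empty citation list scores 1.0 are all hypothetical.

```python
def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall and F1 for one change record."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def grounding_rate(citations: list[str], evidence_ids: set[str]) -> float:
    """Fraction of cited span ids that exist in the evidence."""
    if not citations:
        return 1.0  # assumed convention: nothing cited means nothing fabricated
    return sum(c in evidence_ids for c in citations) / len(citations)


# A mapper that asserts two controls when the labels support only one.
p, r, f1 = precision_recall_f1({"CTRL-APPROVAL", "CTRL-TESTING"}, {"CTRL-APPROVAL"})
g = grounding_rate(["approval", "test"], {"approval"})
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} grounding={g:.2f}")
# prints: precision=0.50 recall=1.00 F1=0.67 grounding=0.50
```

Deficiency accuracy and gate efficacy are simpler: each is a per-record equality check between the prediction and the label.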

All aggregated with a percentile bootstrap CI.
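A percentile bootstrap resamples the per-record scores with replacement, recomputes the mean each time, and reads the interval off the sorted resampled means. The sketch below uses only the standard library; the function name, the 95% level (`alpha=0.05`), the resample count, and the example scores are assumptions rather than the harness defaults.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(values)
    # Each resample draws n scores with replacement and records their mean.
    resampled = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(values), lo, hi


# Hypothetical per-record scores, not taken from the gold set.
scores = [1.0, 0.5, 1.0, 0.8, 1.0]
point, lo, hi = percentile_bootstrap_ci(scores)
print(f"{point:.2f} [{lo:.2f}, {hi:.2f}]")
```

With only a handful of records the resampled means take few distinct values, so the interval is wide and coarse; that is expected at this sample size and narrows only as the gold set grows.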

Result — deterministic baseline (n=5)

control-F1 = 0.87 [0.73, 1.00]   grounding = 0.80   deficiency-acc = 1.00   gate-acc = 0.60

| change | F1 | grounding | gate | correct? |
|---|---|---|---|---|
| CHG-1042 (clean) | 1.00 | 1.00 | approve | ✅ |
| CHG-1120 (clean) | 1.00 | 1.00 | approve | ✅ |
| CHG-1101 (self-approval) | 1.00 | 1.00 | approve + SoD finding | ✅ |
| CHG-1077 (missing test evidence) | 0.67 | 0.50 | block | ❌ |
| CHG-1099 (no approval evidence) | 0.67 | 0.50 | block | ❌ |
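As a sanity check, the headline numbers can be recomputed from the per-change rows, assuming they are plain unweighted means over the five changes (an assumption about the harness, though the arithmetic agrees). The tuple layout is illustrative.

```python
from statistics import mean

# (change id, control F1, grounding rate, gate decision matched the label?)
rows = [
    ("CHG-1042", 1.00, 1.00, True),
    ("CHG-1120", 1.00, 1.00, True),
    ("CHG-1101", 1.00, 1.00, True),
    ("CHG-1077", 0.67, 0.50, False),
    ("CHG-1099", 0.67, 0.50, False),
]

control_f1 = mean([row[1] for row in rows])
grounding = mean([row[2] for row in rows])
gate_acc = mean([1.0 if row[3] else 0.0 for row in rows])

print(f"control-F1 = {control_f1:.2f}   grounding = {grounding:.2f}   gate-acc = {gate_acc:.2f}")
# prints: control-F1 = 0.87   grounding = 0.80   gate-acc = 0.60
```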

The honest finding

The deterministic baseline over-claims: it asserts every control regardless of evidence. On partial-evidence changes it therefore cites spans that are absent from the evidence, which the anti-fabrication gate correctly flags as ungrounded. It also over-blocks: both partial-evidence changes are blocked against their labels, giving gate-acc 0.60 (3 of 5 correct). This is a real weakness surfaced by the eval, not a vanity number.
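The failure mode can be reproduced in a toy setting. Everything here is hypothetical: the two control ids, the span names, and the gate rule "block if any citation is ungrounded" are assumptions made for illustration, not the project's actual mapper or gate.

```python
# Hypothetical control ids mapped to the evidence span each one would cite.
REQUIRED_SPAN = {"CTRL-APPROVAL": "approval", "CTRL-TESTING": "test"}


def overclaiming_baseline(evidence: dict[str, str]) -> dict[str, str]:
    """Assert every control and cite its span, whether or not the span exists."""
    return dict(REQUIRED_SPAN)


def ungrounded_controls(claims: dict[str, str], evidence: dict[str, str]) -> list[str]:
    """Controls whose cited span is absent from the evidence."""
    return sorted(c for c, span in claims.items() if span not in evidence)


# A partial-evidence change: approval evidence exists, test evidence does not.
evidence = {"approval": "Approved by a second reviewer."}
gold_controls = {"CTRL-APPROVAL"}

claims = overclaiming_baseline(evidence)
flagged = ungrounded_controls(claims, evidence)

true_pos = len(set(claims) & gold_controls)
precision = true_pos / len(claims)
recall = true_pos / len(gold_controls)
f1 = 2 * precision * recall / (precision + recall)
grounding = 1 - len(flagged) / len(claims)
gate_decision = "block" if flagged else "approve"  # assumed gate rule

print(flagged, f"F1={f1:.2f}", f"grounding={grounding:.2f}", gate_decision)
# prints: ['CTRL-TESTING'] F1=0.67 grounding=0.50 block
```

With two controls and one missing span, the toy yields the same 0.67 / 0.50 pattern as the partial-evidence rows in the results table. The real control set may be larger, so this shows only one way such numbers can arise.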

The experiment (LLM vs baseline)

The LLM mapper is prompted to omit controls the evidence doesn’t support, so it should beat the baseline on the two partial-evidence cases (higher precision, fewer false blocks) — unless it fabricates citations, which the gate would catch. Run it:

export ANTHROPIC_API_KEY=...          # or the free ASSAY_LLM_* backend
python examples/eval_demo.py

The live model-vs-baseline number will be reported here as it lands — including an honest null if the model doesn’t beat the baseline.

What this answers from the review

🤖 “the eval is unvalidated” → gold set + P/R/F1 + bootstrap CIs + a real, honest finding.
💼 CFO “show me a number” → control-F1 0.87, gate-acc 0.60, with CIs.
🏛️/🧮 completeness / self-review → limitations are named, not hidden (see Scope & limits, forthcoming): the gold set is small and synthetic, and recall is measured “vs the labels we wrote,” not against omniscient ground truth.

Public / synthetic data only.