SRE runbook & SLAs (design — matures with reference app #2: incident runbook)

Operating the pipeline

Start a run: Run(...).execute(payload).
Resume: re-execute() the same payload — idempotent; resumes from checkpoint.
Approve (maker-checker): record_approval(rundir, key, approver), then re-execute.
Verify integrity: AuditLog(rundir / "audit.jsonl").verify().

SLAs (targets — business hours, tiered by change type)

Three separate clocks — don’t collapse them into one “human SLA”:

Clock	Standard in-scope change	Emergency	Breach
Independent operator review (the gating judge)	4 business hrs (½ day)	1 hr	escalate to control owner
Maker-checker approval	1 business day	1 hr + retroactive	escalate to process owner
3rd-party observation — exceptions / blocks	24 hrs (next business day)	4 hrs	SEV-2/3
3rd-party observation — clean population	periodic (e.g. monthly sample)	—	completeness review

The clock starts at handoff. A breached SLA is itself an exception (a finding), not just a late ticket — it is logged and escalated, never silently cleared. The 3rd party does not observe every clean run in real time: exceptions fast, completeness over a period (this is the continuous-controls-monitoring model).

Conditions & response

Retention & data lifecycle

SLAs govern timeliness of review, not deletion. Run evidence — the hash-chained audit log, workpaper, handoff report, and the logged model prompt/output — is retained per the records-retention policy (SOX-relevant evidence is typically kept ~7 years), so any control test can be re-performed long after the run.

A 24-hour review SLA does not clear data at 24 hours — the evidence persists.
Retention is its own control; early deletion is itself a control failure.
▶ Exact retention period + storage (ideally WORM / write-once) are owner-assigned.

Targets marked ▶ are owner-assigned.