Jaypore Labs
Engineering

SRE: runbook generation that captures the response

Most runbooks are written from memory months after the incident. AI-assisted runbook generation captures the actual response while it's fresh.

Yash Shah · April 15, 2026 · 5 min read

A senior SRE at a fintech told us his team's runbook library was "85% out-of-date and 15% wrong." Runbooks were written from memory weeks after incidents. By the time anyone wrote them, the incident's specifics had blurred. The result: when the next similar incident hit, the runbook was either irrelevant or actively misleading.

Claude Code makes runbook generation a fresh-incident discipline rather than a quarterly cleanup project. The runbook gets written from the incident's actual artifacts — chat logs, terminal commands, dashboard screenshots — while the response is recent. The library stays current.

The incident-to-runbook pipeline

The pattern: every incident produces a runbook entry, drafted within 48 hours of resolution.

Inputs:

  • Incident channel chat history.
  • The on-call engineer's terminal commands (where shareable).
  • Dashboard screenshots and metrics references.
  • The PagerDuty incident timeline.
  • The post-mortem (if one was written).

The AI synthesises these into a runbook draft:

  • Symptom. What the alert said. What customers saw. What dashboards looked like.
  • Diagnosis steps. What the on-call did to identify the cause, in order.
  • Resolution. The specific commands or actions that fixed it.
  • Prevention. What's been done (or filed) to prevent the next occurrence.
  • Owner. Who knows this system best.

The on-call reviews and tightens. The runbook lands in the team's library.
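
A minimal sketch of that assembly step, assuming the artifacts have been exported as text files and the drafting call goes through the Anthropic messages API rather than an interactive Claude Code session. The paths, model name, and section list are placeholders:

    # Assemble incident artifacts into a single runbook-drafting prompt.
    from pathlib import Path

    import anthropic

    SECTIONS = ["Symptom", "Diagnosis steps", "Resolution", "Prevention", "Owner"]

    def load_artifacts(incident_dir: str) -> str:
        """Concatenate every shareable artifact exported for the incident."""
        parts = []
        for path in sorted(Path(incident_dir).glob("*.txt")):
            parts.append(f"--- {path.name} ---\n{path.read_text()}")
        return "\n\n".join(parts)

    def draft_runbook(incident_dir: str) -> str:
        prompt = (
            "Draft a runbook entry from these incident artifacts. "
            f"Use exactly these sections: {', '.join(SECTIONS)}. "
            "Title it with the symptom, not the resolution.\n\n"
            + load_artifacts(incident_dir)
        )
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=4000,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    if __name__ == "__main__":
        print(draft_runbook("incidents/2026-04-12-pg-connections"))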

Owner discipline

Every runbook has a named owner. Not "the team owns it." A specific person. The owner:

  • Reviews the runbook quarterly.
  • Updates it when the underlying system changes.
  • Retires it if the system or the failure mode is gone.

Without a named owner, runbooks decay. With one, they survive.

The AI's draft includes the owner field. The owner is usually the on-call engineer who handled the incident, or whoever owns the affected service.
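
The owner discipline is easy to audit mechanically. A sketch, assuming each runbook file carries an "Owner:" line and a "Last reviewed: YYYY-MM-DD" line; both field names and the file layout are stand-ins for whatever your template uses:

    # Flag runbooks with no named owner or a review older than a quarter.
    from datetime import date, timedelta
    from pathlib import Path
    import re

    QUARTER = timedelta(days=92)

    def audit(runbook_dir: str) -> None:
        for path in sorted(Path(runbook_dir).glob("*.md")):  # file layout is an assumption
            text = path.read_text()
            owner = re.search(r"^Owner:\s*(\S+)", text, re.MULTILINE)
            reviewed = re.search(r"^Last reviewed:\s*(\d{4}-\d{2}-\d{2})", text, re.MULTILINE)
            if not owner:
                print(f"{path.name}: no named owner")
            elif not reviewed or date.fromisoformat(reviewed.group(1)) < date.today() - QUARTER:
                print(f"{path.name}: review overdue (owner {owner.group(1)})")

    if __name__ == "__main__":
        audit("runbooks")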

The drill loop

Runbooks that aren't drilled don't work. The team's drill loop:

  • Quarterly, pick a runbook entry.
  • Run a tabletop exercise based on it.
  • The on-call engineer follows the runbook step-by-step.
  • The team identifies gaps, ambiguities, and steps that don't actually work.
  • The runbook gets updated.

The AI helps facilitate the drill — generating the scenario from the runbook, surfacing the unstated assumptions, drafting the post-drill update notes. The exercise is human; the prep is AI-accelerated.
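
A sketch of the drill-prep step, built from an existing runbook entry; the helper and its wording are illustrative, not a fixed template:

    # Build the tabletop-drill prompt from an existing runbook entry.
    def drill_prompt(runbook_text: str) -> str:
        return (
            "You are preparing a tabletop exercise from this runbook.\n"
            "1. Write a realistic incident scenario that should trigger it.\n"
            "2. List every assumption the runbook makes but does not state "
            "(access, credentials, dashboards, environment).\n"
            "3. Leave the resolution out of the scenario so the on-call has "
            "to follow the runbook to reach it.\n\n"
            + runbook_text
        )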

Search-friendly format

Runbooks that can't be found in a moment of need are runbooks that don't exist. The format prioritises searchability:

  • Title is the symptom, not the resolution. ("Customer-API 500s rising" not "Restart the api-gateway pod.")
  • Symptoms section uses the literal language of alerts and customer reports.
  • Tags include the affected service, the alert name, the dashboard URL.
  • Cross-links to related runbooks.

The AI ensures consistency in formatting. The on-call ensures the content matches the system.
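
One way to keep that consistency checkable is a small metadata block the library indexes for search. The field names, alert name, dashboard URL, and related-runbook slug below are illustrative, not real values:

    # The metadata the library indexes; title and symptoms stay in the
    # literal language of alerts and customer reports.
    RUNBOOK_META = {
        "title": "Customer-API 500s rising",  # the symptom, never the fix
        "symptoms": [
            "API latency spikes >5s",
            '"too many connections" in api-server logs',
        ],
        "service": "api-server",
        "alert": "CustomerAPIHighErrorRate",  # hypothetical alert name
        "dashboard": "https://grafana.example.internal/d/api-server",  # placeholder URL
        "related": ["postgres-connection-exhaustion"],
        "owner": "@yash-platform",
    }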

A real runbook

A scenario: an incident in which Postgres connections were exhausted, causing API failures.

The runbook drafted from the incident artifacts:

Symptom. API latency spikes >5s, accompanied by "connection refused" or "too many connections" in api-server logs. Customer-facing 500 errors.

Diagnosis.

  1. Check pg_stat_activity for connection count: SELECT count(*) FROM pg_stat_activity WHERE state = 'active'.
  2. If count is at or near max_connections (currently 200), connection-pool exhaustion is the issue.
  3. Check api-server pod count and per-pod connection setting (PGBOUNCER_POOL_SIZE).

Resolution.

  1. If api-server pod count > 30 due to autoscaling: temporarily scale down (kubectl scale deployment api-server --replicas=20).
  2. If individual queries are stuck (> 30s active): investigate per-query, kill if needed (SELECT pg_terminate_backend(pid)).
  3. After incident: review the autoscaling threshold; raise max_connections if needed.

Prevention.

  • Filed PR-1923 to add a connection-count alert at 80% threshold.
  • Filed ENG-447 to evaluate pgbouncer for connection pooling.

Owner. @yash-platform (Yash, platform team).

Total drafting time with AI assistance: 15 minutes. The prevention items are real (linked to PRs). The diagnosis steps are testable. The next on-call can follow this end-to-end.
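
For a team that wants the first diagnosis step to be a single command, the connection check can be wrapped in a short script. A sketch only; the connection string and the 80% warning threshold are assumptions, not the team's actual values:

    # Confirm (or rule out) connection-pool exhaustion in one command.
    import psycopg2  # assumes a read-only DSN is available to on-call

    def check_connections(dsn: str = "postgresql://readonly@db.internal/postgres") -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'active'")
            active = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
            status = "exhaustion likely" if active >= 0.8 * limit else "ok"
            print(f"active={active} max_connections={limit} -> {status}")

    if __name__ == "__main__":
        check_connections()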

What stays human

  • The decision about which incidents become runbooks (some are unique enough to skip).
  • The owner assignment.
  • The drill schedule.
  • Reviewing and signing off on the prevention work.

The AI handles the typing. The team handles the discipline.

What we won't ship

  • Auto-publishing runbooks without on-call review.
  • Runbooks for systems the engineer doesn't fully understand.
  • Generic runbooks copied from elsewhere; runbooks must be specific to your system.
  • Runbooks without prevention items. Recurring incidents without prevention indicate a process gap.

How to start

Pick the next 5 incidents. Run the workflow. Build the first 5 runbook entries. Establish the quarterly drill cadence. Within two quarters, the library has 30+ entries and on-call quality measurably improves.

Close

Runbook generation with Claude Code is the discipline of capturing knowledge while it's fresh. The library stays current. On-call quality improves because the next on-call has access to the last on-call's learning. The incidents stop being one-off heroics and start being shared institutional knowledge.

We build AI-enabled software and help businesses put AI to work. If you're improving on-call quality, we'd love to hear about it. Get in touch.

Tagged
Claude Code · SRE · Runbooks · Incident Response · AI Development