A senior SRE at a fintech told us his team's runbook library was "85% out-of-date and 15% wrong." Runbooks were written from memory weeks after incidents. By the time anyone wrote them, the incident's specifics had blurred. The result: when the next similar incident hit, the runbook was either irrelevant or actively misleading.
Claude Code makes runbook generation a fresh-incident discipline rather than a quarterly cleanup project. The runbook gets written from the incident's actual artifacts — chat logs, terminal commands, dashboard screenshots — while the response is recent. The library stays current.
The incident-to-runbook pipeline
The pattern: every incident produces a runbook entry, drafted within 48 hours of resolution.
Inputs:
- Incident channel chat history.
- The on-call engineer's terminal commands (where shareable).
- Dashboard screenshots and metrics references.
- The PagerDuty incident timeline.
- The post-mortem (if one was written).
The AI synthesises these into a runbook draft:
- Symptom. What the alert said. What customers saw. What dashboards looked like.
- Diagnosis steps. What the on-call did to identify the cause, in order.
- Resolution. The specific commands or actions that fixed it.
- Prevention. What's been done (or filed) to prevent the next occurrence.
- Owner. Who knows this system best.
The on-call reviews and tightens. The runbook lands in the team's library.
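The synthesis step above can be sketched as a small script that stitches the exported artifacts into one drafting prompt. This is a minimal illustration, assuming the artifacts have already been dumped to plain text; the helper name and artifact labels are hypothetical, not a real API.

```python
# Sketch of the synthesis step: assemble incident artifacts into a single
# drafting prompt. The section list mirrors the runbook format described
# above; the labels and helper are illustrative assumptions.
RUNBOOK_SECTIONS = ["Symptom", "Diagnosis steps", "Resolution", "Prevention", "Owner"]

def build_draft_prompt(artifacts: dict) -> str:
    """Combine labelled artifact text blocks into one prompt for the AI."""
    header = ("Draft a runbook entry with these sections: "
              + ", ".join(RUNBOOK_SECTIONS) + ".")
    blocks = [f"--- {label} ---\n{text.strip()}" for label, text in artifacts.items()]
    return "\n\n".join([header, *blocks])

prompt = build_draft_prompt({
    "Incident channel chat history": "14:02 alert: Customer-API 500s rising ...",
    "Terminal commands": "kubectl get pods -n api",
    "PagerDuty timeline": "14:00 triggered, 14:35 resolved",
})
```

The point of the structure is that every input lands under an explicit label, so the draft can cite which artifact each runbook step came from.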
Owner discipline
Every runbook has a named owner. Not "the team owns it." A specific person. The owner:
- Reviews the runbook quarterly.
- Updates it when the underlying system changes.
- Retires it if the system or the failure mode is gone.
Without a named owner, runbooks decay. With one, they survive.
The AI's draft includes the owner field. The owner is usually the on-call engineer who handled the incident, or whoever owns the affected service.
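The owner and review cadence are easy to enforce mechanically. A minimal staleness check might look like the following; the metadata field names (`owner`, `last_reviewed`) are assumptions for illustration, not a prescribed schema.

```python
from datetime import date, timedelta

# Minimal runbook lint: flag entries with no named owner or an overdue
# quarterly review. Field names are illustrative assumptions.
REVIEW_INTERVAL = timedelta(days=90)

def lint(runbook: dict, today: date) -> list:
    problems = []
    if not runbook.get("owner"):
        problems.append("no named owner")
    last = runbook.get("last_reviewed")
    if last is None or today - last > REVIEW_INTERVAL:
        problems.append("review overdue")
    return problems

print(lint({"title": "Customer-API 500s rising",
            "owner": "@yash-platform",
            "last_reviewed": date(2024, 1, 10)},
           today=date(2024, 6, 1)))  # flags the overdue review
```

Run in CI over the library, this turns "runbooks decay without an owner" from a slogan into a failing check.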
The drill loop
Runbooks that aren't drilled don't work. The team's drill loop:
- Quarterly, pick a runbook entry.
- Run a tabletop exercise based on it.
- The on-call engineer follows the runbook step-by-step.
- The team identifies gaps, ambiguities, and steps that don't actually work.
- The runbook gets updated.
The AI helps facilitate the drill — generating the scenario from the runbook, surfacing the unstated assumptions, drafting the post-drill update notes. The exercise is human; the prep is AI-accelerated.
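One piece of the drill prep can be shown concretely: generating the scenario from the runbook by presenting the symptom and withholding the fix. A sketch, assuming the runbook is parsed into a section map (the dict schema here is an illustration):

```python
# Sketch: turn a runbook entry into a tabletop scenario by presenting the
# symptom and diagnosis context while withholding the fix, so the drill
# tests the runbook rather than memory of the incident.
def drill_scenario(runbook: dict) -> str:
    hidden = {"Resolution", "Prevention"}
    lines = [f"Tabletop drill: {runbook['Title']}", ""]
    for section, body in runbook.items():
        if section == "Title" or section in hidden:
            continue
        lines.append(f"{section}: {body}")
    lines.append("")
    lines.append("Walk the diagnosis steps aloud; the facilitator holds the fix until the end.")
    return "\n".join(lines)

scenario = drill_scenario({
    "Title": "Customer-API 500s rising",
    "Symptom": "API latency spikes, customer-facing 500s",
    "Diagnosis steps": "Check pg_stat_activity connection count",
    "Resolution": "Scale down api-server replicas",
    "Prevention": "Connection-count alert at 80%",
})
```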
Search-friendly format
Runbooks that can't be found in a moment of need are runbooks that don't exist. The format prioritises searchability:
- Title is the symptom, not the resolution. ("Customer-API 500s rising" not "Restart the api-gateway pod.")
- Symptoms section uses the literal language of alerts and customer reports.
- Tags include the affected service, the alert name, the dashboard URL.
- Cross-links to related runbooks.
The AI ensures consistency in formatting. The on-call ensures the content matches the system.
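The formatting conventions above are also checkable. A sketch of a searchability lint, where the required tag keys and the resolution-verb heuristic are assumptions drawn from the conventions, not a standard:

```python
# Sketch of a searchability check: the title should name the symptom,
# not the fix, and tags should cover service, alert, and dashboard.
# Verb list and tag keys are illustrative assumptions.
RESOLUTION_VERBS = ("restart", "scale", "rollback", "delete", "redeploy")

def check_searchable(entry: dict) -> list:
    issues = []
    title = entry.get("title", "").lower()
    if title.startswith(RESOLUTION_VERBS):
        issues.append("title reads like a resolution, not a symptom")
    for key in ("service", "alert", "dashboard"):
        if key not in entry.get("tags", {}):
            issues.append(f"missing tag: {key}")
    return issues
```

So "Restart the api-gateway pod" fails the title check, while "Customer-API 500s rising" passes, which matches how an on-call actually searches: by what they're seeing, not by what they'll eventually do.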
A real runbook
A scenario: an incident where Postgres connection exhaustion caused API failures.
The runbook drafted from the incident artifacts:
Symptom. API latency spikes >5s, accompanied by `connection refused` or `too many connections` in api-server logs. Customer-facing 500 errors.

Diagnosis.

- Check `pg_stat_activity` for connection count: `SELECT count(*) FROM pg_stat_activity WHERE state = 'active'`.
- If count is at or near `max_connections` (currently 200), connection-pool exhaustion is the issue.
- Check api-server pod count and per-pod connection setting (`PGBOUNCER_POOL_SIZE`).

Resolution.

- If api-server pod count > 30 due to autoscaling: temporarily scale down (`kubectl scale deployment api-server --replicas=20`).
- If individual queries are stuck (> 30s active): investigate per-query, kill if needed (`SELECT pg_terminate_backend(pid)`).
- After incident: review the autoscaling threshold; raise `max_connections` if needed.

Prevention.

- Filed PR-1923 to add a connection-count alert at 80% threshold.
- Filed ENG-447 to evaluate pgbouncer for connection pooling.

Owner. @yash-platform (Yash, platform team).
Total drafting time with AI assistance: 15 minutes. The prevention items are real (linked to PRs). The diagnosis steps are testable. The next on-call can follow this end-to-end.
What stays human
- The decision about which incidents become runbooks (some are unique enough to skip).
- The owner assignment.
- The drill schedule.
- Reviewing and signing off on the prevention work.
The AI handles the typing. The team handles the discipline.
What we won't ship
- Auto-publishing runbooks without on-call review.
- Runbooks for systems the engineer doesn't fully understand.
- Generic runbooks copied from elsewhere; runbooks must be specific to your system.
- Runbooks without prevention items. Recurring incidents without prevention indicate a process gap.
How to start
Pick the next 5 incidents. Run the workflow. Build the first 5 runbook entries. Establish the quarterly drill cadence. Within two quarters, the library has 30+ entries and on-call quality measurably improves.
Close
Runbook generation with Claude Code is the discipline of capturing knowledge while it's fresh. The library stays current. On-call quality improves because the next on-call has access to the last on-call's learning. The incidents stop being one-off heroics and start being shared institutional knowledge.
Related reading
- SRE: postmortem first drafts — companion role.
- DevOps: CI pipeline diagnosis at 2am — same fix-and-document pattern.
- A senior engineer's day with Claude Code
We build AI-enabled software and help businesses put AI to work. If you're improving on-call quality, we'd love to hear about it. Get in touch.