Jaypore Labs
Engineering

SRE: runbook generation that captures the response

Most runbooks are written from memory months after the incident. AI-assisted runbook generation captures the actual response while it's fresh.

Yash Shah · April 15, 2026 · 5 min read

A senior SRE at a fintech told us his team's runbook library was "85% out-of-date and 15% wrong." Runbooks were written from memory weeks after incidents. By the time anyone wrote them, the incident's specifics had blurred. The result: when the next similar incident hit, the runbook was either irrelevant or actively misleading.

Claude Code makes runbook generation a fresh-incident discipline rather than a quarterly cleanup project. The runbook gets written from the incident's actual artifacts — chat logs, terminal commands, dashboard screenshots — while the response is recent. The library stays current.

The incident-to-runbook pipeline

The pattern: every incident produces a runbook entry, drafted within 48 hours of resolution.

Inputs:

  • Incident channel chat history.
  • The on-call engineer's terminal commands (where shareable).
  • Dashboard screenshots and metrics references.
  • The PagerDuty incident timeline.
  • The post-mortem (if one was written).

The AI synthesises these into a runbook draft:

  • Symptom. What the alert said. What customers saw. What dashboards looked like.
  • Diagnosis steps. What the on-call did to identify the cause, in order.
  • Resolution. The specific commands or actions that fixed it.
  • Prevention. What's been done (or filed) to prevent the next occurrence.
  • Owner. Who knows this system best.

The on-call reviews and tightens. The runbook lands in the team's library.
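
A minimal sketch of that assembly step, assuming the artifacts have been exported as text files and the drafting call goes through the Anthropic messages API rather than an interactive Claude Code session. The paths, model name, and section list are placeholders:

    # Assemble incident artifacts into a single runbook-drafting prompt.
    from pathlib import Path

    import anthropic

    SECTIONS = ["Symptom", "Diagnosis steps", "Resolution", "Prevention", "Owner"]

    def load_artifacts(incident_dir: str) -> str:
        """Concatenate every shareable artifact exported for the incident."""
        parts = []
        for path in sorted(Path(incident_dir).glob("*.txt")):
            parts.append(f"--- {path.name} ---\n{path.read_text()}")
        return "\n\n".join(parts)

    def draft_runbook(incident_dir: str) -> str:
        prompt = (
            "Draft a runbook entry from these incident artifacts. "
            f"Use exactly these sections: {', '.join(SECTIONS)}. "
            "Title it with the symptom, not the resolution.\n\n"
            + load_artifacts(incident_dir)
        )
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=4000,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    if __name__ == "__main__":
        print(draft_runbook("incidents/2026-04-12-pg-connections"))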

Owner discipline

Every runbook has a named owner. Not "the team owns it." A specific person. The owner:

  • Reviews the runbook quarterly.
  • Updates it when the underlying system changes.
  • Retires it if the system or the failure mode is gone.

Without a named owner, runbooks decay. With one, they survive.

The AI's draft includes the owner field. The owner is usually the on-call engineer who handled the incident, or whoever owns the affected service.
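
The owner discipline is easy to audit mechanically. A sketch, assuming each runbook file carries an "Owner:" line and a "Last reviewed: YYYY-MM-DD" line; both field names and the file layout are stand-ins for whatever your template uses:

    # Flag runbooks with no named owner or a review older than a quarter.
    from datetime import date, timedelta
    from pathlib import Path
    import re

    QUARTER = timedelta(days=92)

    def audit(runbook_dir: str) -> None:
        for path in sorted(Path(runbook_dir).glob("*.md")):  # file layout is an assumption
            text = path.read_text()
            owner = re.search(r"^Owner:\s*(\S+)", text, re.MULTILINE)
            reviewed = re.search(r"^Last reviewed:\s*(\d{4}-\d{2}-\d{2})", text, re.MULTILINE)
            if not owner:
                print(f"{path.name}: no named owner")
            elif not reviewed or date.fromisoformat(reviewed.group(1)) < date.today() - QUARTER:
                print(f"{path.name}: review overdue (owner {owner.group(1)})")

    if __name__ == "__main__":
        audit("runbooks")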

The drill loop

Runbooks that aren't drilled don't work. The team's drill loop:

  • Quarterly, pick a runbook entry.
  • Run a tabletop exercise based on it.
  • The on-call engineer follows the runbook step-by-step.
  • The team identifies gaps, ambiguities, and steps that don't actually work.
  • The runbook gets updated.

The AI helps facilitate the drill — generating the scenario from the runbook, surfacing the unstated assumptions, drafting the post-drill update notes. The exercise is human; the prep is AI-accelerated.
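
A sketch of the drill-prep step, built from an existing runbook entry; the helper and its wording are illustrative, not a fixed template:

    # Build the tabletop-drill prompt from an existing runbook entry.
    def drill_prompt(runbook_text: str) -> str:
        return (
            "You are preparing a tabletop exercise from this runbook.\n"
            "1. Write a realistic incident scenario that should trigger it.\n"
            "2. List every assumption the runbook makes but does not state "
            "(access, credentials, dashboards, environment).\n"
            "3. Leave the resolution out of the scenario so the on-call has "
            "to follow the runbook to reach it.\n\n"
            + runbook_text
        )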

Search-friendly format

Runbooks that can't be found in a moment of need are runbooks that don't exist. The format prioritises searchability:

  • Title is the symptom, not the resolution. ("Customer-API 500s rising" not "Restart the api-gateway pod.")
  • Symptoms section uses the literal language of alerts and customer reports.
  • Tags include the affected service, the alert name, the dashboard URL.
  • Cross-links to related runbooks.

The AI ensures consistency in formatting. The on-call ensures the content matches the system.
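
One way to keep that consistency checkable is a small metadata block the library indexes for search. The field names, alert name, dashboard URL, and related-runbook slug below are illustrative, not real values:

    # The metadata the library indexes; title and symptoms stay in the
    # literal language of alerts and customer reports.
    RUNBOOK_META = {
        "title": "Customer-API 500s rising",  # the symptom, never the fix
        "symptoms": [
            "API latency spikes >5s",
            '"too many connections" in api-server logs',
        ],
        "service": "api-server",
        "alert": "CustomerAPIHighErrorRate",  # hypothetical alert name
        "dashboard": "https://grafana.example.internal/d/api-server",  # placeholder URL
        "related": ["postgres-connection-exhaustion"],
        "owner": "@yash-platform",
    }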

A real runbook

A scenario: an incident in which Postgres connections were exhausted, causing API failures.

The runbook drafted from the incident artifacts:

Symptom. API latency spikes >5s, accompanied by "connection refused" or "too many connections" in api-server logs. Customer-facing 500 errors.

Diagnosis.

  1. Check pg_stat_activity for connection count: SELECT count(*) FROM pg_stat_activity WHERE state = 'active'.
  2. If count is at or near max_connections (currently 200), connection-pool exhaustion is the issue.
  3. Check api-server pod count and per-pod connection setting (PGBOUNCER_POOL_SIZE).

Resolution.

  1. If api-server pod count > 30 due to autoscaling: temporarily scale down (kubectl scale deployment api-server --replicas=20).
  2. If individual queries are stuck (> 30s active): investigate per-query, kill if needed (SELECT pg_terminate_backend(pid)).
  3. After incident: review the autoscaling threshold; raise max_connections if needed.

Prevention.

  • Filed PR-1923 to add a connection-count alert at 80% threshold.
  • Filed ENG-447 to evaluate pgbouncer for connection pooling.

Owner. @yash-platform (Yash, platform team).

Total drafting time with AI assistance: 15 minutes. The prevention items are real (linked to PRs). The diagnosis steps are testable. The next on-call can follow this end-to-end.
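
For a team that wants the first diagnosis step to be a single command, the connection check can be wrapped in a short script. A sketch only; the connection string and the 80% warning threshold are assumptions, not the team's actual values:

    # Confirm (or rule out) connection-pool exhaustion in one command.
    import psycopg2  # assumes a read-only DSN is available to on-call

    def check_connections(dsn: str = "postgresql://readonly@db.internal/postgres") -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'active'")
            active = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
            status = "exhaustion likely" if active >= 0.8 * limit else "ok"
            print(f"active={active} max_connections={limit} -> {status}")

    if __name__ == "__main__":
        check_connections()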

What stays human

  • The decision about which incidents become runbooks (some are unique enough to skip).
  • The owner assignment.
  • The drill schedule.
  • Reviewing and signing off on the prevention work.

The AI handles the typing. The team handles the discipline.

What we won't ship

  • Auto-publishing runbooks without on-call review.
  • Runbooks for systems the engineer doesn't fully understand.
  • Generic runbooks copied from elsewhere; runbooks must be specific to your system.
  • Runbooks without prevention items. Recurring incidents without prevention indicate a process gap.

How to start

Pick the next 5 incidents. Run the workflow. Build the first 5 runbook entries. Establish the quarterly drill cadence. Within two quarters, the library has 30+ entries and on-call quality measurably improves.

Close

Runbook generation with Claude Code is the discipline of capturing knowledge while it's fresh. The library stays current. On-call quality improves because the next on-call has access to the last on-call's learning. The incidents stop being one-off heroics and start being shared institutional knowledge.

We build AI-enabled software and help businesses put AI to work. If you're improving on-call quality, we'd love to hear about it. Get in touch.

Tagged
Claude Code · SRE · Runbooks · Incident Response · AI Development