Incident Response
The First 15 Minutes
Industry estimates put the cost of a serious incident at $300K to $1M+ per hour. But the real cost isn't the downtime. It's the 15 minutes your on-call engineer spends searching Slack for a thread that ends in 'nvm, figured it out,' reading a runbook last updated by someone who left, and escalating because the knowledge walked out the door. We build systems where that context is already there when they open their laptop.

3:47 AM
What happens next
ClearTrack is a Series C expense management company with about 95 engineers. Their transaction sync service pulls corporate card transactions from a banking partner API. A deploy goes out. Transactions stop flowing. Here's how the incident plays out when AI agents are part of your incident response.
Alert fires.
Transaction sync error rate breaches its threshold; transactions have stopped flowing. Two AI agents activate in parallel.
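For a concrete picture of the kind of check behind an alert like this, here's a minimal sketch. The event shape, the 5-minute window, and the 95% threshold are all assumptions for illustration, not ClearTrack's actual configuration:

import time

def sync_alert_should_fire(events, window_s=300, min_success_rate=0.95):
    """Fire when recent sync attempts drop below the success-rate threshold."""
    cutoff = time.time() - window_s
    recent = [e for e in events if e["ts"] >= cutoff]
    if not recent:
        return True  # zero syncs in the window is itself an anomaly
    ok = sum(1 for e in recent if e["status"] == "ok")
    return ok / len(recent) < min_success_rate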
Incident opened.
SEV-3 ticket created. Slack channel #inc-txn-sync-0422 opened.
Impact assessed: ~340 customers affected, no data loss. Severity upgraded to SEV-2.
Quick triage posted.
Sync success rate: 99.8% → 12%.
Recent deploy: v14.6.0 changed retry logic. All other services healthy.
Probable cause: v14.6.0 retry backoff change.
Root cause confirmed.
TimeoutError on /v2/transactions/list. Retry backoff changed from 2s to 5s in v14.6.0.
Past incident found: similar timeout config issue three months ago. Runbook pulled.
Recommended action.
Revert to v14.5.12. Retry backoff change is causing pagination timeouts on the banking API.
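Why would a slower backoff cause timeouts? One plausible mechanism, sketched below under an assumed 15-second per-page budget on the partner's side (only the 2s → 5s change and the /v2/transactions/list endpoint come from the incident), is that the retry waits no longer fit inside that budget:

# Sketch of the failure mode: retries must fit inside the banking
# partner's per-page deadline. The 15s deadline and exponential
# doubling are assumptions for illustration.

PAGE_DEADLINE_S = 15  # assumed: partner API invalidates the page cursor after this

def worst_case_retry_wait(base_backoff_s, attempts=3):
    """Total time spent waiting between retries, with exponential doubling."""
    return sum(base_backoff_s * (2 ** i) for i in range(attempts))

# v14.5.12: 2s base -> 2 + 4 + 8 = 14s of backoff, still under the deadline.
# v14.6.0:  5s base -> 5 + 10 + 20 = 35s, so the cursor expires mid-fetch
# and /v2/transactions/list starts returning TimeoutError.
assert worst_case_retry_wait(2.0) < PAGE_DEADLINE_S
assert worst_case_retry_wait(5.0) > PAGE_DEADLINE_S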
On-call arrives.
Opens the Slack channel. Impact, root cause, past incidents, and recommended action already there. No searching. No guessing.
Mitigation applied.
Reviews the triage. Confirms the rollback.
$ cleartrack-infra deploy-app 14.5.12
Deploying v14.5.12... done.
Transaction sync service restarting... healthy.
Mitigation verified.
Sync success rate recovering. Incident marked mitigated. Backfill in progress.
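"No data loss" depends on the backfill being idempotent: re-running it, or racing the recovered live sync, must never duplicate a transaction. A sketch of that idea; store, partner_api, fetch_transactions, and upsert are hypothetical stand-ins for whatever ClearTrack actually runs:

from datetime import datetime, timedelta

def backfill(store, partner_api, outage_start: datetime) -> int:
    """Re-fetch the outage window and upsert by transaction id."""
    since = outage_start - timedelta(minutes=5)  # pad for in-flight syncs
    count = 0
    for txn in partner_api.fetch_transactions(since=since):
        store.upsert(key=txn["id"], value=txn)  # insert-or-replace: safe to re-run
        count += 1
    return count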
Incident resolved.
Backfill complete. No data loss. Incident marked resolved.
Total time: 19 minutes. Human time: 10 minutes.
PIR generated.
The post-incident review is drafted from the Slack thread. No 90-minute writing session.
Root cause: retry backoff change caused pagination timeouts
Impact: ~340 customers, no data loss
Detection gap: 4 min between deploy and alert
PIR reviewed.
Team reviews the AI-drafted PIR. Discusses the detection gap. Confirms three repair items. Assigns owners and deadlines.
Repair items filed.
Add sync lag monitoring (P1, 3 days; sketched below)
Update retry config runbook (P2, 1 week)
Add integration tests for retry config (P2, 1 week)
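The sync lag monitor is what closes the 4-minute detection gap: it alerts on freshness, not error rate, so a stalled pipeline gets caught even when requests fail quietly. A minimal sketch, with an assumed 3-minute threshold:

import time

MAX_SYNC_LAG_S = 180  # assumed threshold, tighter than the 4-minute gap

def sync_lag_alert(last_successful_sync_ts: float) -> bool:
    """Fire when too much time has passed since the last successful sync."""
    return time.time() - last_successful_sync_ts > MAX_SYNC_LAG_S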
Incident closed.
Repairs filed and assigned. Incident closed. Learnings flow back into the runbook.
Get Started
Let's build yours.
The teams that run this well stop losing weeks to problems no one surfaced. They ship faster, respond faster, and improve every week. Tell us about your org, and we'll build the system that gets you there.