Active Incidents — Tell Me Which Cluster Is Spiking Right Now
Last week's failure clusters told you what failure patterns exist. They didn't tell you which one is on fire right now. Two new rate windows split clusters into "active incidents" (escalating) and everything else, so when you arrive during an incident, the page tells you where to look.
The Other Half of Cluster View
Last week we shipped failure clusters — distinct failure patterns aggregated by fingerprint. Useful for understanding what's happening, less useful for understanding what's urgent. A cluster with 5,000 deliveries from last week is bigger than one with 30 from the last 5 minutes, but the small fresh one is the active fire.
Today the cluster page learns to tell the difference.
Two Rate Windows
Each cluster now computes two rates inside the existing fn_delivery_clusters SQL function:
- Recent rate — failed deliveries in the last 5 minutes ÷ 5
- Baseline rate — failed deliveries in the 5–65 minute window ÷ 60
A cluster is escalating when its recent rate is ≥ 0.5/min (at least three failures in the last 5 minutes, since 0.5 × 5 = 2.5) AND > 5× its baseline rate. The threshold is intentionally simple: a real rolling-baseline z-score is overkill for this signal, and we already have a separate anomaly_volume alert type for source-side traffic anomalies that does the statistical heavy lifting.
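As a rough illustration of the rule above, here is a minimal TypeScript sketch. The real logic lives in the fn_delivery_clusters SQL function, which isn't shown in this post, so the `ClusterCounts` shape, the constant names, and the `classify` helper are all hypothetical; only the window sizes and the two thresholds come from the text.

```typescript
// Hypothetical input shape: the two raw failure counts for one cluster.
interface ClusterCounts {
  failedLast5Min: number;   // failed deliveries in the last 5 minutes
  failedPrior60Min: number; // failed deliveries between 5 and 65 minutes ago
}

const RECENT_WINDOW_MIN = 5;
const BASELINE_WINDOW_MIN = 60;
const MIN_RECENT_RATE = 0.5; // failures per minute
const ESCALATION_FACTOR = 5; // recent rate must exceed 5x the baseline rate

// Computes both rates and applies the two escalation conditions.
function classify(counts: ClusterCounts) {
  const recentRatePerMin = counts.failedLast5Min / RECENT_WINDOW_MIN;
  const baselineRatePerMin = counts.failedPrior60Min / BASELINE_WINDOW_MIN;
  const escalating =
    recentRatePerMin >= MIN_RECENT_RATE &&
    recentRatePerMin > ESCALATION_FACTOR * baselineRatePerMin;
  return { recentRatePerMin, baselineRatePerMin, escalating };
}

// 30 failures in the last 5 minutes against 12 in the prior hour:
// recent = 6/min, baseline = 0.2/min, and 6 > 5 * 0.2, so it escalates.
const fresh = classify({ failedLast5Min: 30, failedPrior60Min: 12 });

// A steady 2/min cluster: recent = 2/min, baseline = 2/min, so it does not.
const steady = classify({ failedLast5Min: 10, failedPrior60Min: 120 });

console.log(fresh.escalating, steady.escalating);
```

One consequence worth noting: a cluster with no failures in the baseline window has a baseline rate of 0, so under this sketch the rate floor is the only condition it has to clear.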
The Active Incidents Section
The clusters page now splits results into two sections:
Active incidents (pulsing destructive-themed banner, pinned at top) shows clusters flagged as escalating, meaning the recent rate clears both conditions above. Each card shows the recent rate, the baseline rate, and a prominent Edit & Replay all button, because if it's escalating, the answer is usually "fix it now."
Other clusters (below, normal styling) shows everything else — high-count historical patterns that aren't currently active, slow-burn issues, transient blips that already resolved.
When you arrive at the clusters page during an incident, the active-incidents section tells you exactly where to look — no scrolling, no scanning, no doing the math in your head.
On the MCP Side
The same signal flows out through MCP for AI-driven triage. hookbase_list_delivery_clusters now returns recentRatePerMin, baselineRatePerMin, and a boolean escalating flag per cluster. An agent investigating an incident can ask for clusters, filter to escalating === true, and prioritize those.
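A sketch of what that agent-side triage step might look like in TypeScript. It assumes the tool result has already been parsed into an array of cluster objects; the response envelope isn't described here, the `id` field is a hypothetical identifier added for illustration, and ordering by recent rate is our suggestion rather than something the tool does for you.

```typescript
// The three rate fields are the ones named in this post.
// `id` is hypothetical, included only so the example has something to print.
interface DeliveryCluster {
  id: string;
  recentRatePerMin: number;
  baselineRatePerMin: number;
  escalating: boolean;
}

// Keeps only escalating clusters and puts the fastest-failing one first.
// filter() returns a new array, so the caller's input is not reordered.
function prioritize(clusters: DeliveryCluster[]): DeliveryCluster[] {
  return clusters
    .filter((c) => c.escalating === true)
    .sort((a, b) => b.recentRatePerMin - a.recentRatePerMin);
}

// Made-up sample data standing in for a parsed tool result.
const sample: DeliveryCluster[] = [
  { id: "timeout-upstream", recentRatePerMin: 1.2, baselineRatePerMin: 0.1, escalating: true },
  { id: "old-404s", recentRatePerMin: 0.0, baselineRatePerMin: 3.5, escalating: false },
  { id: "tls-handshake", recentRatePerMin: 6.0, baselineRatePerMin: 0.2, escalating: true },
];

const triageQueue = prioritize(sample);
console.log(triageQueue.map((c) => c.id));
```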
The Tradeoff
A 5-minute / 60-minute comparison is a deliberate choice. Shorter recent windows are noisier (a single brief blip looks like an active incident); longer windows lag (an incident has to last several minutes before it moves the rate). 5/60 with a 5× threshold catches anything that jumps to more than five times its baseline within a few minutes, without firing on isolated retries.
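To make the noise side of that tradeoff concrete, here is a small TypeScript sketch that applies the same rule with configurable windows. Everything in it is illustrative: `WindowConfig`, `isEscalating`, and the 1-minute variant are hypothetical, the half-open window boundaries are an assumption, and keeping the 0.5/min floor unchanged for the shorter window is also an assumption.

```typescript
// Hypothetical configuration for the two windows and thresholds.
interface WindowConfig {
  recentMin: number;     // length of the recent window, in minutes
  baselineMin: number;   // length of the baseline window, in minutes
  factor: number;        // recent rate must exceed factor * baseline rate
  minRecentRate: number; // floor on the recent rate, failures per minute
}

// failureAgesMin: how many minutes ago each failed delivery happened.
// Assumed boundaries: recent = [0, recentMin),
// baseline = [recentMin, recentMin + baselineMin).
function isEscalating(failureAgesMin: number[], cfg: WindowConfig): boolean {
  const recentCount = failureAgesMin.filter(
    (age) => age >= 0 && age < cfg.recentMin
  ).length;
  const baselineCount = failureAgesMin.filter(
    (age) => age >= cfg.recentMin && age < cfg.recentMin + cfg.baselineMin
  ).length;
  const recentRate = recentCount / cfg.recentMin;
  const baselineRate = baselineCount / cfg.baselineMin;
  return recentRate >= cfg.minRecentRate && recentRate > cfg.factor * baselineRate;
}

const shipped: WindowConfig = { recentMin: 5, baselineMin: 60, factor: 5, minRecentRate: 0.5 };
const oneMinute: WindowConfig = { ...shipped, recentMin: 1 }; // hypothetical shorter window

// One failed retry 30 seconds ago, nothing else in the past hour.
const isolatedRetry = [0.5];
const retryWithShipped = isEscalating(isolatedRetry, shipped);     // 1/5 = 0.2/min, under the floor
const retryWithOneMinute = isEscalating(isolatedRetry, oneMinute); // 1/1 = 1/min, clears the floor

// Six failures spread over the last 5 minutes, nothing before that.
const burst = [0.2, 0.8, 1.5, 2.1, 3.0, 4.4];
const burstWithShipped = isEscalating(burst, shipped); // 6/5 = 1.2/min against a zero baseline

console.log({ retryWithShipped, retryWithOneMinute, burstWithShipped });
```

Under these assumptions the single retry stays quiet with the 5-minute window but trips the 1-minute one, while the sustained burst is flagged with the shipped settings.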
We'll tune these if customer signal points elsewhere. For now, the cluster page is one place that answers two related but different questions: what failure patterns exist? and which one is happening right now?