Failure Clusters — Collapse 1,000 Identical Failures Into One Root Cause
When something breaks in webhook land, it breaks in waves. Every failed delivery now gets a normalized fingerprint — same root cause, same hash — and a new Failure Clusters page groups the wave into one row. Click any cluster to inspect a sample and replay every matching delivery with one override.
The Wave Problem
A destination goes down for 10 minutes: that's not one failure, that's every delivery to that destination for those 10 minutes. A Stripe webhook secret rotation breaks every delivery from that source until you swap the secret on your side. A bad transform deploy breaks every delivery on that route between deploy and revert.
The events page can sort and filter, but you're still scrolling a list of identical-looking failures. Root-cause analysis (RCA) runs per delivery, so analyzing the first failure, then the second, then the third means paying for 1,000 Claude calls to learn that it's all one bug.
Today we're shipping failure clusters — a different way to look at the failure surface.
What a "Cluster" Is
A cluster is a group of failed deliveries that share a fingerprint — a SHA-256 hash of:
- The route id
- The destination id
- The HTTP response status (or null)
- A normalized excerpt of the response body (UUIDs, timestamps, and runs of digits replaced with placeholders)
- A normalized excerpt of the error message (same normalization)
- The RCA category, if RCA has run
Normalization is the part that does the work. "Connection timed out after 5234ms" and "Connection timed out after 8901ms" produce the same fingerprint because we replace runs of digits with <num>. UUIDs become <uuid>, ISO timestamps become <ts>, whitespace is collapsed, and everything is lowercased.
Two failures that differ only in run-specific numbers cluster together. Two failures that differ in actual cause stay separate.
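To make the normalization concrete, here is a minimal Python sketch of how such a fingerprint could be computed. The placeholders <uuid>, <ts>, and <num> come from the description above. Everything else is an assumption for illustration: the regular expressions, the order in which they are applied, the 200-character excerpt length, the JSON serialization of the fields, and the function names normalize and fingerprint.

```python
import hashlib
import json
import re
from typing import Optional

# Assumed patterns; the real ones may be stricter or looser.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
TS_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?",
    re.IGNORECASE,
)
NUM_RE = re.compile(r"\d+")
WS_RE = re.compile(r"\s+")

EXCERPT_LEN = 200  # assumed excerpt length, not stated in the post


def normalize(text: Optional[str]) -> str:
    """Replace run-specific values with placeholders, then collapse and lowercase."""
    if not text:
        return ""
    # UUIDs and timestamps go first so their digits are not turned into <num>.
    text = UUID_RE.sub("<uuid>", text)
    text = TS_RE.sub("<ts>", text)
    text = NUM_RE.sub("<num>", text)
    text = WS_RE.sub(" ", text).strip().lower()
    return text[:EXCERPT_LEN]


def fingerprint(
    route_id: str,
    destination_id: str,
    status: Optional[int],
    response_body: Optional[str],
    error_message: Optional[str],
    rca_category: Optional[str] = None,
) -> str:
    """SHA-256 over the six fields listed above (serialization is an assumption)."""
    parts = [
        route_id,
        destination_id,
        status,
        normalize(response_body),
        normalize(error_message),
        rca_category,
    ]
    payload = json.dumps(parts, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this sketch, two timeouts that differ only in their millisecond count hash to the same value, while the same error on a different HTTP status does not.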
The Cluster Page
A new Failure Clusters entry in the sidebar opens a page that lists every distinct failure pattern in your chosen window — last hour, 6h, 24h, or 7d. Each row shows:
- Total count
- Number of distinct events affected
- Route → destination
- Response status (if any)
- Error excerpt
- RCA category (if RCA ran on a sample)
- First and last seen
Click any cluster to drill into the sample delivery, with a banner reminding you that N other deliveries failed the same way.
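The row fields above amount to a group-by on the fingerprint within a time window. The Python sketch below shows one way that aggregation could look. The record types FailedDelivery and ClusterRow, their field names, sorting by count, and picking the most recent delivery as the sample are all assumptions for illustration, not the actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class FailedDelivery:  # hypothetical record shape
    delivery_id: str
    event_id: str
    fingerprint: str
    failed_at: datetime


@dataclass
class ClusterRow:  # hypothetical row shape
    fingerprint: str
    total: int
    distinct_events: int
    first_seen: datetime
    last_seen: datetime
    sample_delivery_id: str


def cluster_rows(
    deliveries: Iterable[FailedDelivery], now: datetime, window: timedelta
) -> List[ClusterRow]:
    """Group failures inside the window by fingerprint and summarize each group."""
    cutoff = now - window
    groups = defaultdict(list)
    for d in deliveries:
        if cutoff <= d.failed_at <= now:
            groups[d.fingerprint].append(d)

    rows = []
    for fp, members in groups.items():
        members.sort(key=lambda d: d.failed_at)
        rows.append(
            ClusterRow(
                fingerprint=fp,
                total=len(members),
                distinct_events=len({d.event_id for d in members}),
                first_seen=members[0].failed_at,
                last_seen=members[-1].failed_at,
                # Assumption: the most recent failure serves as the sample.
                sample_delivery_id=members[-1].delivery_id,
            )
        )
    # Assumption: biggest clusters first.
    rows.sort(key=lambda r: r.total, reverse=True)
    return rows
```

Total count and distinct events can differ because one event may be delivered, and fail, more than once.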
Edit & Replay All
On each cluster row, Edit & Replay all opens a modal you've already seen — same destination / transform / headers / persistTransform override matrix as per-delivery replay-with-edit. Submit, and every delivery in the cluster gets a fresh replay with your overrides.
persistTransform is only allowed when every delivery in the cluster shares a single route — otherwise saving the new code is ambiguous. We reject the request with an explicit PERSIST_TRANSFORM_MULTI_ROUTE error rather than picking one route silently.
The cluster cap is 2,000 deliveries per replay call. Real waves usually fit inside that.
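The two guardrails above can be sketched as a pre-flight check, shown here in Python. The PERSIST_TRANSFORM_MULTI_ROUTE code and the 2,000 cap come from this post. The exception class, the function name, the CLUSTER_REPLAY_LIMIT code, and the choice to reject (rather than truncate) an oversized call are assumptions for illustration.

```python
from typing import Sequence

MAX_DELIVERIES_PER_REPLAY = 2000  # cap stated in the post


class ReplayRejected(Exception):
    """Hypothetical error type carrying a machine-readable code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def validate_cluster_replay(route_ids: Sequence[str], persist_transform: bool) -> None:
    """route_ids holds one route id per delivery selected for replay."""
    if len(route_ids) > MAX_DELIVERIES_PER_REPLAY:
        # Assumption: oversized calls are rejected; the code name is hypothetical.
        raise ReplayRejected(
            "CLUSTER_REPLAY_LIMIT",
            f"at most {MAX_DELIVERIES_PER_REPLAY} deliveries per replay call",
        )
    if persist_transform and len(set(route_ids)) > 1:
        # Saving the edited transform is ambiguous across several routes.
        raise ReplayRejected(
            "PERSIST_TRANSFORM_MULTI_ROUTE",
            "persistTransform requires all deliveries to share one route",
        )
```

Failing loudly keeps the caller in control: they can narrow the selection to one route or drop persistTransform and retry.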
The Inline Banner
Even if you arrive at a failed delivery via the normal events page, the event debugger now shows a small amber banner inside that delivery's card whenever at least one other delivery shares its fingerprint:
N other deliveries failed the same way — Same route, destination, status, and error pattern. Replay or fix them together. → View cluster
One click and you're in the cluster view, ready to fix everything at once.
Why This Is the Centerpiece
Replay-with-edit gives you the primitive. Clusters give you the scope. Together they're the difference between "I'll deal with the rest after I figure out the first one" and "I just fixed 1,400 failures with one click."
This is the load-bearing piece for several follow-up features: anomaly-rooted alerts, the failure pattern library, MCP-driven recovery loops — all of them reference clusters. More on those in the coming weeks.