Failure Clusters — Collapse 1,000 Identical Failures Into One Root Cause

The Wave Problem

When something breaks in webhook land, it breaks in waves. A destination goes down for 10 minutes — that's not one failure, that's every delivery to that destination for those 10 minutes. A Stripe webhook secret rotation — every delivery from that source until you swap the secret on your side. A bad transform deploy — every delivery on that route between deploy and revert.

The events page can sort and filter, but you're still scrolling a list of identical-looking failures. Root cause analysis (RCA) runs per delivery, so analyzing the first, then the second, then the third means paying for 1,000 Claude calls to learn it's all one bug.

Today we're shipping failure clusters — a different way to look at the failure surface.

What a "Cluster" Is

A cluster is a group of failed deliveries that share a fingerprint — a SHA-256 hash of:

The route id
The destination id
The HTTP response status (or null)
A normalized excerpt of the response body (UUIDs, timestamps, and runs of digits replaced with placeholders)
A normalized excerpt of the error message (same normalization)
The RCA category, if RCA has run
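As a rough illustration of how those six fields could be combined, here is a minimal Python sketch. The post only says the fingerprint is a SHA-256 hash of these fields; the function name, the field order, and the JSON serialization are assumptions made for the example, not Hookbase's actual implementation. The two excerpts are expected to be normalized before they are passed in.

```python
import hashlib
import json
from typing import Optional


def fingerprint(
    route_id: str,
    destination_id: str,
    status: Optional[int],
    body_excerpt: str,
    error_excerpt: str,
    rca_category: Optional[str] = None,
) -> str:
    """Hash the six cluster fields into a hex SHA-256 digest.

    Assumption: fields are serialized as a compact JSON array. That keeps
    field boundaries unambiguous and encodes a missing status or RCA
    category as null.
    """
    canonical = json.dumps(
        [route_id, destination_id, status, body_excerpt, error_excerpt, rca_category],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two failures with identical fields land in the same cluster.
a = fingerprint("rt_1", "dst_1", 503, "service unavailable", "upstream returned <num>")
b = fingerprint("rt_1", "dst_1", 503, "service unavailable", "upstream returned <num>")
# A timeout with no HTTP response is a different pattern.
c = fingerprint("rt_1", "dst_1", None, "", "connection timed out after <num>ms")
```

Serializing to a delimited structure rather than concatenating raw strings avoids accidental collisions when one field's text runs into the next.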

Normalization is the part that does the work. "Connection timed out after 5234ms" and "Connection timed out after 8901ms" produce the same fingerprint because we replace runs of digits with <num>. UUIDs become <uuid>, and ISO timestamps become <ts>. Whitespace is collapsed and the text is lowercased.
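The normalization rules above can be sketched in a few lines of Python. This is an approximation under stated assumptions: the regular expressions, the order of substitutions, and the 200-character excerpt length are all choices made for the example, and the real patterns may differ.

```python
import re

# Assumed patterns; UUIDs and timestamps are replaced before bare digit
# runs so that their digits are not turned into <num> first.
_UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
_ISO_TS = re.compile(
    r"\d{4}-\d{2}-\d{2}[Tt ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:[Zz]|[+-]\d{2}:?\d{2})?"
)
_DIGITS = re.compile(r"\d+")
_WHITESPACE = re.compile(r"\s+")


def normalize(text: str, max_len: int = 200) -> str:
    """Strip run-specific values so equivalent failures compare equal."""
    text = _UUID.sub("<uuid>", text)
    text = _ISO_TS.sub("<ts>", text)
    text = _DIGITS.sub("<num>", text)
    text = _WHITESPACE.sub(" ", text).strip().lower()
    # max_len is a hypothetical excerpt length, applied after normalization.
    return text[:max_len]


first = normalize("Connection timed out after 5234ms")
second = normalize("Connection timed out after 8901ms")
# Both are "connection timed out after <num>ms"
```

The substitution order matters: running the digit rule first would break up UUIDs and timestamps before their own patterns could match.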

Two failures that differ only in run-specific numbers cluster together. Two failures that differ in actual cause stay separate.

The Cluster Page

A new Failure Clusters entry in the sidebar opens a page that lists every distinct failure pattern in your chosen window — last hour, 6h, 24h, or 7d. Each row shows:

Total count
Number of distinct events affected
Route → destination
Response status (if any)
Error excerpt
RCA category (if RCA ran on a sample)
First and last seen

Click any cluster to drill into the sample delivery, with a banner reminding you that N other deliveries failed the same way.
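To make the row fields concrete, here is a minimal sketch of how a cluster list could be built from failed deliveries that already carry a fingerprint. The FailedDelivery and ClusterRow types, the choice of the earliest delivery as the sample, and the sort by total count are assumptions for illustration; the post does not describe how the page is computed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Optional


@dataclass
class FailedDelivery:  # hypothetical record shape
    delivery_id: str
    event_id: str
    fingerprint: str
    failed_at: datetime


@dataclass
class ClusterRow:  # hypothetical row shape
    fingerprint: str
    total: int
    distinct_events: int
    first_seen: datetime
    last_seen: datetime
    sample_delivery_id: str


def cluster_rows(
    deliveries: Iterable[FailedDelivery],
    window: timedelta,
    now: Optional[datetime] = None,
) -> List[ClusterRow]:
    """Group failures inside the window by fingerprint, largest cluster first."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    groups: Dict[str, List[FailedDelivery]] = {}
    for d in deliveries:
        if d.failed_at >= cutoff:
            groups.setdefault(d.fingerprint, []).append(d)

    rows = []
    for fp, members in groups.items():
        members.sort(key=lambda d: d.failed_at)
        rows.append(
            ClusterRow(
                fingerprint=fp,
                total=len(members),
                # Retries of one event count once here.
                distinct_events=len({d.event_id for d in members}),
                first_seen=members[0].failed_at,
                last_seen=members[-1].failed_at,
                # Assumption: the earliest failure serves as the sample.
                sample_delivery_id=members[0].delivery_id,
            )
        )
    rows.sort(key=lambda r: r.total, reverse=True)
    return rows
```

Calling cluster_rows(deliveries, timedelta(hours=24)) would correspond to the 24h window on the page.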

Edit & Replay All

On each cluster row, Edit & Replay all opens a modal you've already seen — same destination / transform / headers / persistTransform override matrix as per-delivery replay-with-edit. Submit, and every delivery in the cluster gets a fresh replay with your overrides.

persistTransform is only allowed when every delivery in the cluster shares a single route — otherwise saving the new code is ambiguous. We reject the request with an explicit PERSIST_TRANSFORM_MULTI_ROUTE error rather than picking one route silently.

The cluster cap is 2,000 deliveries per replay call. Real waves usually fit inside that.
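The two guardrails described in this section could be checked before any replay is queued, roughly as in the sketch below. Only the PERSIST_TRANSFORM_MULTI_ROUTE error code and the 2,000 limit come from the post. The function and type names are hypothetical, and rejecting an oversized cluster with a CLUSTER_TOO_LARGE code is an assumption, since the post does not say what happens above the cap.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

MAX_REPLAY_DELIVERIES = 2000  # cap per replay call, from the post


class ReplayError(Exception):
    """Hypothetical error type carrying a machine-readable code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


@dataclass
class ReplayOverrides:  # hypothetical shape of the override matrix
    destination_id: Optional[str] = None
    transform_code: Optional[str] = None
    headers: Optional[Dict[str, str]] = None
    persist_transform: bool = False


def validate_cluster_replay(
    route_ids: Sequence[str], overrides: ReplayOverrides
) -> None:
    """Raise ReplayError if the request breaks a guardrail.

    route_ids holds one route id per delivery in the cluster.
    """
    if len(route_ids) > MAX_REPLAY_DELIVERIES:
        # Assumption: oversized clusters are rejected rather than truncated.
        raise ReplayError(
            "CLUSTER_TOO_LARGE",
            f"cluster has {len(route_ids)} deliveries; "
            f"the limit is {MAX_REPLAY_DELIVERIES}",
        )
    if overrides.persist_transform and len(set(route_ids)) > 1:
        # Saving new transform code is ambiguous across several routes.
        raise ReplayError(
            "PERSIST_TRANSFORM_MULTI_ROUTE",
            "persistTransform requires every delivery to share one route",
        )
```

Failing loudly keeps the behavior predictable: a caller either gets every delivery replayed with the saved transform on one known route, or gets an error it can act on.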

The Inline Banner

Even if you arrive at a failed delivery via the normal events page, the event debugger now shows a small amber banner inside that delivery's card whenever at least one other delivery shares its fingerprint:

N other deliveries failed the same way — Same route, destination, status, and error pattern. Replay or fix them together. → View cluster

One click and you're in the cluster view, ready to fix everything at once.

Why This Is the Centerpiece

Replay-with-edit gives you the primitive. Clusters give you the scope. Together they're the difference between "I'll deal with the rest after I figure out the first one" and "I just fixed 1,400 failures with one click."

This is the load-bearing piece for several follow-up features: anomaly-rooted alerts, the failure pattern library, MCP-driven recovery loops — all of them reference clusters. More on those in the coming weeks.
