Hookbase
Product Update

Failure Clusters — Collapse 1,000 Identical Failures Into One Root Cause

When something breaks in webhook land, it breaks in waves. Every failed delivery now gets a normalized fingerprint — same root cause, same hash — and a new Failure Clusters page groups the wave into one row. Click any cluster to inspect a sample and replay every matching delivery with one override.

Hookbase Team
May 20, 2026
5 min read

The Wave Problem

When something breaks in webhook land, it breaks in waves. A destination goes down for 10 minutes — that's not one failure, that's every delivery to that destination for those 10 minutes. A Stripe webhook secret rotation — every delivery from that source until you swap the secret on your side. A bad transform deploy — every delivery on that route between deploy and revert.

The events page can sort and filter, but you're still scrolling a list of identical-looking failures. Root-cause analysis (RCA) runs per delivery, so analyzing the failures one at a time means paying for 1,000 Claude calls to learn that it's all one bug.

Today we're shipping failure clusters — a different way to look at the failure surface.

What a "Cluster" Is

A cluster is a group of failed deliveries that share a fingerprint — a SHA-256 hash of:

  • The route id
  • The destination id
  • The HTTP response status (or null)
  • A normalized excerpt of the response body (UUIDs, timestamps, and digit runs replaced with placeholders)
  • A normalized excerpt of the error message (same normalization)
  • The RCA category, if RCA has run

Normalization is the part that does the work. "Connection timed out after 5234ms" and "Connection timed out after 8901ms" produce the same fingerprint because we replace runs of digits with <num>. UUIDs become <uuid>, ISO timestamps become <ts>, whitespace is collapsed, and the text is lowercased.

Two failures that differ only in run-specific numbers cluster together. Two failures that differ in actual cause stay separate.
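To make the normalization concrete, here is a minimal Python sketch of the idea. It is not our production code: the regexes, the 200-character excerpt length, the field order, and the JSON serialization are all assumptions chosen for illustration. Only the placeholders (<num>, <uuid>, <ts>), the list of hashed fields, and the use of SHA-256 come from the description above.

```python
import hashlib
import json
import re
from typing import Optional

# Illustrative patterns only; the real patterns may differ.
UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}[t ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:z|[+-]\d{2}:?\d{2})?")
NUM_RE = re.compile(r"\d+")
WS_RE = re.compile(r"\s+")


def normalize(text: Optional[str], max_len: int = 200) -> str:
    """Replace run-specific values with placeholders (max_len is an assumption)."""
    if not text:
        return ""
    out = text.lower()
    # Order matters: UUIDs and timestamps contain digits, so replace them
    # before the generic digit-run rule.
    out = UUID_RE.sub("<uuid>", out)
    out = TS_RE.sub("<ts>", out)
    out = NUM_RE.sub("<num>", out)
    out = WS_RE.sub(" ", out).strip()
    return out[:max_len]


def fingerprint(
    route_id: str,
    destination_id: str,
    status: Optional[int],
    body: Optional[str],
    error: Optional[str],
    rca_category: Optional[str] = None,
) -> str:
    """SHA-256 over the fields listed above, serialized as a JSON array (assumed)."""
    parts = [route_id, destination_id, status,
             normalize(body), normalize(error), rca_category]
    payload = json.dumps(parts, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this sketch, the two timeout messages above both normalize to "connection timed out after <num>ms" and hash to the same value, while changing the status or the route produces a different fingerprint.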

The Cluster Page

A new Failure Clusters entry in the sidebar opens a page that lists every distinct failure pattern in your chosen window — last hour, 6h, 24h, or 7d. Each row shows:

  • Total count
  • Number of distinct events affected
  • Route → destination
  • Response status (if any)
  • Error excerpt
  • RCA category (if RCA ran on a sample)
  • First and last seen

Click any cluster to drill into the sample delivery, with a banner reminding you that N other deliveries failed the same way.
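The rows on the page are essentially a group-by over fingerprints inside the chosen window. A rough Python sketch, assuming a hypothetical FailedDelivery record and an in-memory list rather than our actual storage layer (sorting by count is also an assumption):

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FailedDelivery:
    # Hypothetical record shape, for illustration only.
    delivery_id: str
    event_id: str
    fingerprint: str
    failed_at: datetime


def summarize_clusters(deliveries, now: datetime, window: timedelta):
    """Group failed deliveries inside the window by fingerprint."""
    cutoff = now - window
    groups = defaultdict(list)
    for d in deliveries:
        if d.failed_at >= cutoff:
            groups[d.fingerprint].append(d)

    rows = []
    for fp, members in groups.items():
        rows.append({
            "fingerprint": fp,
            "total": len(members),
            # Several deliveries (for example retries) can belong to one event.
            "distinct_events": len({m.event_id for m in members}),
            "first_seen": min(m.failed_at for m in members),
            "last_seen": max(m.failed_at for m in members),
        })
    # Largest clusters first (assumed ordering).
    rows.sort(key=lambda r: r["total"], reverse=True)
    return rows
```

Total count and distinct events differ whenever the same event fails more than once, which is why the page shows both.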

Edit & Replay All

On each cluster row, Edit & Replay all opens a modal you've already seen — same destination / transform / headers / persistTransform override matrix as per-delivery replay-with-edit. Submit, and every delivery in the cluster gets a fresh replay with your overrides.

persistTransform is only allowed when every delivery in the cluster shares a single route — otherwise saving the new code is ambiguous. We reject the request with an explicit PERSIST_TRANSFORM_MULTI_ROUTE error rather than picking one route silently.

The cluster cap is 2,000 deliveries per replay call. Real waves usually fit inside that.
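The two guardrails above can be sketched as a pre-flight check. This is an illustration, not our API: the Overrides shape and ReplayError class are hypothetical, and so is the CLUSTER_TOO_LARGE code, since this post doesn't specify what happens when a cluster exceeds the cap. Only PERSIST_TRANSFORM_MULTI_ROUTE and the 2,000 limit come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_DELIVERIES_PER_REPLAY = 2000  # cap stated above


class ReplayError(Exception):
    """Hypothetical error type carrying a machine-readable code."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code


@dataclass(frozen=True)
class Overrides:
    # Hypothetical shape mirroring the override matrix in the modal.
    destination_id: Optional[str] = None
    transform_code: Optional[str] = None
    headers: Optional[Dict[str, str]] = None
    persist_transform: bool = False


def validate_cluster_replay(route_ids: List[str], overrides: Overrides) -> None:
    """route_ids holds one route id per delivery selected for replay."""
    if len(route_ids) > MAX_DELIVERIES_PER_REPLAY:
        # Assumption: reject outright. The code name is made up for this sketch.
        raise ReplayError(
            "CLUSTER_TOO_LARGE",
            f"{len(route_ids)} deliveries exceeds the cap of {MAX_DELIVERIES_PER_REPLAY}",
        )
    if overrides.persist_transform and len(set(route_ids)) > 1:
        # Saving new transform code is ambiguous across several routes.
        raise ReplayError(
            "PERSIST_TRANSFORM_MULTI_ROUTE",
            "persistTransform requires every delivery to share a single route",
        )
```

Failing loudly here is the point: a silent choice of route would save transform code somewhere the user did not intend.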

The Inline Banner

Even if you arrive at a failed delivery via the normal events page, the event debugger now shows a small amber banner inside that delivery's card whenever at least one other delivery shares its fingerprint:

N other deliveries failed the same way — Same route, destination, status, and error pattern. Replay or fix them together. → View cluster

One click and you're in the cluster view, ready to fix everything at once.
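The banner rule itself is tiny. A sketch in Python, assuming a hypothetical cluster_size value that counts every delivery sharing the fingerprint, including the one being viewed (the singular/plural handling is our own addition for the sketch):

```python
from typing import Optional


def cluster_banner(cluster_size: int) -> Optional[str]:
    """Return the banner headline, or None when the delivery is alone in its cluster."""
    others = cluster_size - 1  # exclude the delivery currently on screen
    if others < 1:
        return None
    noun = "delivery" if others == 1 else "deliveries"
    return f"{others} other {noun} failed the same way"
```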

Why This Is the Centerpiece

Replay-with-edit gives you the primitive. Clusters give you the scope. Together they're the difference between "I'll deal with the rest after I figure out the first one" and "I just fixed 1,400 failures with one click."

This is the load-bearing piece for several follow-up features: anomaly-rooted alerts, the failure pattern library, MCP-driven recovery loops — all of them reference clusters. More on those in the coming weeks.

product-update, observability, clusters, fingerprinting, recovery
