Why Your Webhooks Keep Timing Out (and How to Fix Them)
Webhook timeouts cascade into duplicates, missed events, and angry providers throttling your endpoint. Here is what actually causes them and the fixes that hold up under real traffic.
The Symptom
Your webhook endpoint works fine in testing. You ship it. Two weeks later, you notice deliveries failing in your provider's dashboard with timeout errors. Or worse, you don't notice — until a customer asks why their order never processed.
Webhook timeouts are sneaky because they're often invisible from your side. Your handler logs say everything succeeded. The provider's logs say it never finished. Both are correct.
How Provider Timeouts Actually Work
Most webhook providers give you a hard time budget — usually shorter than you think:
| Provider | Timeout |
|----------|---------|
| Stripe | 30 seconds |
| GitHub | 10 seconds |
| Shopify | 5 seconds |
| Slack | 3 seconds |
| Twilio | 15 seconds |
If you don't return a 2xx response within that window, the provider treats it as a failure and retries. From your handler's perspective, the work might still be running — but the provider has moved on, and the next retry will trigger another run of the same logic.
This is where idempotency problems compound timeout problems. Each timeout-triggered retry is another chance for duplicate side effects.
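A minimal dedupe guard, keyed on the provider's event ID, blunts those duplicate side effects. This is a sketch: the in-memory Set stands in for a durable store (Redis SETNX, a unique-indexed table) that survives restarts and is shared across processes.

```javascript
// Skip side effects for event IDs we've already seen.
// NOTE: a process-local Set is a sketch only — use a shared,
// durable store in production so dedupe survives restarts.
const seenEvents = new Set();

function shouldProcess(eventId) {
  if (seenEvents.has(eventId)) return false; // duplicate delivery
  seenEvents.add(eventId);
  return true;
}
```

Call it before any side effect runs; a duplicate delivery then short-circuits to a 200 without re-running the work.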
The Real Causes
1. Doing the Work Synchronously
The most common cause: the handler does all the work — database writes, third-party API calls, email sends — before responding.
```javascript
// This will time out under load
app.post('/webhook', async (req, res) => {
  await db.insertOrder(req.body);
  await stripe.charge(req.body.amount);
  await sendgrid.sendReceipt(req.body.email);
  await slack.notify('#orders', req.body);
  res.send('ok');
});
```
Each of those calls has its own latency and failure mode. A slow Stripe API or a SendGrid hiccup blows your entire handler past the timeout.
Fix: Acknowledge first, process async. Validate the signature, persist the raw payload to a queue or table, return 200, then process out of band.
```javascript
app.post('/webhook', async (req, res) => {
  if (!verifySignature(req)) return res.status(401).send();
  await queue.enqueue(req.body);
  res.send('ok');
});
```
2. Cold Starts on Serverless
If you're on Lambda, Cloud Functions, or Vercel, your handler can sit idle for hours. The first request after that idle period pays the cold start tax — anywhere from 200ms to 8 seconds depending on runtime, dependencies, and VPC config.
Fix: Use a runtime with negligible cold starts (Cloudflare Workers, Bun on Lambda) or keep the function warm. For Lambda, Provisioned Concurrency removes cold starts entirely but costs more.
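Provisioned Concurrency is configured per published version or alias. A sketch with the AWS CLI (function name, alias, and count are placeholders):

```shell
# Keep 5 execution environments warm for the "live" alias
aws lambda put-provisioned-concurrency-config \
  --function-name webhook-handler \
  --qualifier live \
  --provisioned-concurrent-executions 5
```

You pay for the warm environments whether or not they serve traffic, so size the count to your typical webhook burst, not your peak.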
3. Database Connection Exhaustion
Your handler opens a database connection on each request. Under burst traffic — Stripe sending 50 events in two seconds when you re-enable a webhook — you exhaust the pool. Subsequent requests wait, then time out.
Fix: Use a connection pooler (PgBouncer, Supabase pooler, RDS Proxy). Cap your handler's connection acquisition timeout below the provider's webhook timeout so you fail fast rather than hanging.
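With node-postgres, the acquisition cap is a pool option. A configuration sketch (the specific values are assumptions; tune `max` to your pooler's limits):

```javascript
const { Pool } = require('pg');

// Fail fast: if no connection frees up within 2s, error out and let
// the provider's retry pick the event up, instead of hanging past
// the webhook window.
const pool = new Pool({
  max: 10,
  connectionTimeoutMillis: 2000,
  idleTimeoutMillis: 30000,
});
```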
4. Slow Downstream APIs
You call a third-party service inside your handler. That service is slow today. Your handler waits. The provider times out.
Fix: Set aggressive timeouts on every outbound call inside a webhook handler — usually 2-3 seconds maximum. If the call doesn't finish, queue it for retry instead of blocking.
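A sketch of a capped outbound call using `AbortSignal.timeout` (built into Node 17.3+); the URL and the `queue` object are placeholders for your own service and job queue:

```javascript
// Cap an outbound call at 2.5s; on failure or timeout, defer the
// work to a background job instead of blocking the handler.
const queue = { enqueue: async (job) => { /* hand off to a worker */ } };

async function notifyWithTimeout(payload) {
  try {
    const res = await fetch('https://api.example.com/notify', {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(2500), // well under a 5s webhook window
    });
    return res.ok;
  } catch (err) {
    await queue.enqueue({ kind: 'notify', payload }); // retry out of band
    return false;
  }
}
```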
5. Synchronous Logging
It sounds trivial, but synchronous writes to a logging service over the network can add 500ms+ to every request. Multiply by a few log calls per handler and you've eaten half your timeout budget.
Fix: Use async logging that batches in the background. Most modern logging libraries do this by default; verify yours does.
How to Detect Timeouts You're Missing
If your handler runs to completion, you'll see "success" in your own logs even when the provider gave up on you 25 seconds ago. To catch this:
- Compare your provider's delivery log against your processing log. If the provider shows more attempts than you have processed events, you have silent timeouts.
- Track end-to-end latency from provider to acknowledgement. Most providers expose this. Alert when p99 crosses 50% of the timeout budget.
- Log the duration of every handler. Not just slow ones — all of them. Patterns emerge from the histogram.
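A sketch of that last point as Express middleware; the 5-second budget is an assumption — use your tightest provider's window:

```javascript
// Log every handler's duration; flag anything past half the budget.
const BUDGET_MS = 5000; // e.g. Shopify's 5s window

function timeHandler(req, res, next) {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    const level = ms > BUDGET_MS / 2 ? 'warn' : 'info';
    console[level](JSON.stringify({ path: req.path, ms: Math.round(ms) }));
  });
  next();
}
```

Mount it ahead of the webhook route with `app.use(timeHandler)` and feed the durations into whatever builds your histograms.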
The Architecture That Doesn't Time Out
The pattern that scales:
- Edge function or lightweight handler receives the webhook
- Verify the signature (always)
- Write the raw payload to durable storage (queue, log, or database)
- Return 200 immediately
- A separate worker picks up the payload and does the actual processing — at its own pace, with its own retry logic, isolated from the provider's timeout
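The worker in the last step can be sketched as a drain function. The queue methods here (`dequeue`, `markDone`, `retryLater`) are hypothetical — map them onto SQS, BullMQ, or a `SELECT ... FOR UPDATE SKIP LOCKED` poll:

```javascript
// Process one job from the queue; returns what happened so the
// outer loop can decide whether to sleep or keep draining.
async function processNext(queue, handle) {
  const job = await queue.dequeue();
  if (!job) return 'idle';
  try {
    await handle(job.payload);      // the real work, no provider deadline
    await queue.markDone(job.id);
    return 'done';
  } catch (err) {
    await queue.retryLater(job.id); // the worker owns its own retry policy
    return 'retry';
  }
}
```

Run it in a loop with a short sleep when idle; because acknowledgement already happened at the edge, this loop can take seconds per job without any provider noticing.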
The acknowledge-then-process pattern is the only one that survives real traffic. Everything else is a question of when, not if, you start dropping webhooks.
How Hookbase Eliminates This Class of Problem
Hookbase sits between the provider and your handler. We acknowledge the webhook in milliseconds — your handler gets the event from us with no provider timeout pressure. If your handler is slow, we retry on your behalf. If it's down entirely, we hold the events in our DLQ until you're ready.
Your provider stays happy. Your handler runs at whatever pace makes sense. Timeouts become a problem you don't have to solve.