
Webhook Failures, Retries, and How to Handle Them

Webhooks fail. Services retry. Your handler might run twice. Here's how to build something that doesn't break when that happens.

April 2025 · 9 min read

The Failure Is Part of the Design

Webhook delivery is not guaranteed. The service sending the webhook doesn't know whether your server was up, whether your handler crashed halfway through, or whether your database was running a slow migration and the response timed out. From their perspective, if they didn't get a 200 response within their timeout window, the delivery failed — full stop.

Most services handle this by retrying. Stripe will retry a failed webhook for up to three days. GitHub, by contrast, does not retry automatically at all: failed deliveries sit in the delivery log until you redeliver them manually. For providers that do retry, the behavior is consistent: they will try again, and they may succeed on a subsequent attempt, which means your handler might process the same event more than once.

This is not a bug — it's a deliberate design trade-off. "At least once" delivery is much easier to guarantee than "exactly once" delivery, and so that's what almost everyone offers. Your job as the handler author is to write code that doesn't break when an event arrives twice.

The contract:

Webhook providers guarantee at-least-once delivery. They do not guarantee exactly-once delivery. Design your handler around this from day one — retrofitting idempotency into existing code is significantly more painful than building it in upfront.

What Actually Causes Webhook Failures

Before you can handle failures well, it helps to know what you're actually defending against. Webhook deliveries fail for a surprisingly small number of distinct reasons:

1. Timeout before you respond

Most services expect a response within 5–30 seconds. Stripe times out at 30 seconds. GitHub gives you 10. If your handler takes longer than that — because it's doing a slow database write, calling a third-party API, or sending an email — the sender marks it as failed and queues a retry, even if your code eventually finishes successfully. This is the most common failure mode by far.

2. Your server returns a 4xx or 5xx

A 500 from an unhandled exception, a 503 during a deploy, a 422 from a validation error — any non-2xx response is treated as a failure. Be careful with 4xx codes: returning a 400 because you couldn't parse the event type will trigger retries for an event your code will never be able to handle. Use 400 only for genuinely malformed requests; for events you don't know about, return 200 and discard.

3. Your server is simply down

Deploys, reboots, a crashed process, a runaway memory leak that took down Node — any of these mean incoming requests get connection-refused or never arrive at your handler. The retry mechanism is designed exactly for this case: you come back up, and the next retry lands successfully.

4. Network blips between the sender and you

Transient DNS failures, routing hiccups, or TCP connection resets can all cause a delivery to fail even when your server is healthy. These are rare but they happen, especially if the sender and receiver are in different cloud regions.
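The "return 200 for events you don't recognize" advice from point 2 can be implemented with a simple dispatch table. A minimal sketch, assuming hypothetical handler functions (the names are made up for illustration):

```javascript
// Dispatch table mapping event types to handlers (handler bodies are placeholders)
const handlers = {
  'payment_intent.succeeded': async (obj) => { /* fulfill the order */ },
  'payment_intent.payment_failed': async (obj) => { /* notify the customer */ },
};

async function dispatch(event) {
  const handler = handlers[event.type];
  if (!handler) {
    // Unknown event type: acknowledge and discard. A 4xx here would only
    // trigger retries for an event we will never be able to handle.
    return { status: 200, handled: false };
  }
  await handler(event.data.object);
  return { status: 200, handled: true };
}
```

The point of the `handled` flag is observability: you can log unknown types without failing the delivery, so a new event type the provider starts sending doesn't fill your retry queue.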

How Retry Schedules Work in Practice

Retry schedules use exponential backoff — each attempt waits longer than the previous one. This is the right approach because it avoids hammering a server that's already struggling, and it gives you time to recover from an outage before the next attempt arrives.
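The backoff computation itself is simple. A sketch with full jitter added so a burst of retries doesn't land at the same instant (the constants are illustrative, not any provider's actual schedule):

```javascript
// Exponential backoff with full jitter (illustrative constants, not a real provider's schedule)
function backoffDelayMs(attempt, baseMs = 60_000, maxMs = 10 * 60 * 60 * 1000) {
  // Cap the exponential term so late attempts don't grow into absurd delays
  const cappedMs = Math.min(baseMs * 2 ** attempt, maxMs);
  // Full jitter: pick uniformly in [0, cappedMs) to avoid synchronized retry storms
  return Math.floor(Math.random() * cappedMs);
}
```

The jitter matters on the sending side: if a receiver goes down and comes back, thousands of deterministic retries would otherwise arrive in the same second.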

Stripe's retry behavior is worth knowing specifically: they retry with increasing delays — roughly at 5 minutes, 30 minutes, 2 hours, 5 hours, 10 hours, and so on — for up to 72 hours (3 days). That means if your server has a bad night, events can keep arriving the next morning and you need to handle them correctly even if significant time has passed.

// Approximate retry schedule (illustrative)
// Attempt 1: immediate
// Attempt 2: ~5 minutes later
// Attempt 3: ~30 minutes later
// Attempt 4: ~2 hours later
// Attempt 5: ~5 hours later
// Attempt 6: ~10 hours later
// ...continues for up to 72 hours (Stripe)
// (GitHub: no automatic retries; failed deliveries must be redelivered manually)

GitHub takes the opposite approach: it does not automatically retry failed webhook deliveries at all. A failed delivery stays in the repository's webhook delivery log until you redeliver it manually from the settings UI or via the API. For CI/CD pipelines where a missed push event means a build didn't trigger, that's worth knowing.

The practical implication of long retry windows: your handler will sometimes receive an event that was originally sent many hours ago. Don't assume recency. An order fulfillment event might arrive 6 hours after the customer paid. Build your logic to handle that.
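One concrete defense is to check the event's age before acting on it. A sketch, where the one-hour threshold is an arbitrary choice for illustration:

```javascript
// Stripe events carry a `created` Unix timestamp in seconds. Flag old events
// so fulfillment logic can re-check current state instead of trusting a stale payload.
function isStale(eventCreatedUnixSeconds, maxAgeHours = 1, nowMs = Date.now()) {
  const ageMs = nowMs - eventCreatedUnixSeconds * 1000;
  return ageMs > maxAgeHours * 3600 * 1000;
}
```

For a stale event you don't skip the work; you re-fetch the object from the provider's API first, since its state may have changed in the hours since the event was generated.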

Idempotency: Process the Same Event Twice, Get the Same Result

Idempotency is a fancy word for a simple idea: calling a function multiple times with the same input should produce the same result as calling it once. For webhook handlers, it means processing the same event ID twice should leave your system in exactly the same state as processing it once.

The naive non-idempotent version looks like this: you receive payment_intent.succeeded with event ID evt_abc123, create an order, send a confirmation email, and charge a loyalty reward. If that event arrives again due to a retry, you create a second order, send a second email, and charge the reward twice. The customer gets charged once but gets two orders and two emails. That's a bad day.

The fix is to track which event IDs you've already processed and short-circuit if you see one again:

// Idempotent webhook handler pattern
app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), async (req, res) => {
  // 1. Verify signature first (always)
  let event;
  try {
    event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    return res.status(400).send('Signature verification failed');
  }

  // 2. Check if we've already processed this event
  const alreadyProcessed = await db.processedEvents.findOne({ eventId: event.id });
  if (alreadyProcessed) {
    // Acknowledge it — we just won't do the work again
    return res.json({ received: true, duplicate: true });
  }

  // 3. Process the event
  if (event.type === 'payment_intent.succeeded') {
    await fulfillOrder(event.data.object);
  }

  // 4. Record that we processed it
  await db.processedEvents.insertOne({
    eventId: event.id,
    processedAt: new Date(),
    type: event.type,
  });

  res.json({ received: true });
});

The event ID is your idempotency key. Stripe event IDs (the evt_ prefixed strings) are stable — retries of the same event will always carry the same ID. Store processed event IDs in a database table with a unique index on the event ID column, and check it before doing any meaningful work.

Race condition warning

If two retries arrive close together and you check-then-insert non-atomically, both might pass the "already processed?" check before either has inserted. Use a database-level unique constraint on the event ID column so that the second insert fails with a conflict error. Catch that error and return 200 — you're in a race, but the other handler already did the work.
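A sketch of that claim-by-insert pattern, using the same MongoDB-style `db.processedEvents` collection as the handler above (error code 11000 is MongoDB's duplicate-key error; other databases surface the conflict differently):

```javascript
// Atomic claim: a unique index on eventId makes insertOne the race arbiter
async function claimEvent(db, eventId) {
  try {
    await db.processedEvents.insertOne({ eventId, claimedAt: new Date() });
    return true; // we won the race, safe to do the work
  } catch (err) {
    if (err.code === 11000) return false; // duplicate key: another handler already claimed it
    throw err; // anything else is a real error
  }
}
```

Note the trade-off: claiming before processing turns a potential duplicate into a potential gap if the worker crashes mid-job. Pair it with a status field you flip on completion, or with a reconciliation job, so crashed claims get retried.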

Acknowledge First, Process Later

The single most important pattern for reliable webhook handling is this: return 200 immediately, then do the real work asynchronously in a background job. Don't process the event in the request handler itself.

The reasoning is straightforward. You have maybe 10–30 seconds to respond. Fulfilling an order might involve writing to a database, calling a shipping API, sending a transactional email, and updating an inventory count. Any one of those could take longer than expected or fail partway through. If you're doing all of that synchronously in the HTTP handler, you're one slow query away from a timeout — which triggers a retry, which doubles your problem.

// Pattern: enqueue immediately, process in background
app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), async (req, res) => {
  // Verify the signature (fast — just a crypto operation)
  let event;
  try {
    event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    return res.status(400).send('Invalid signature');
  }

  // Push the raw event onto a queue — this should take < 50ms
  await queue.push({
    id: event.id,
    type: event.type,
    data: event.data,
    receivedAt: Date.now(),
  });

  // Acknowledge immediately — we're done here
  res.json({ received: true });
});

// Separate worker process picks up queue items and processes them
// with full retry logic, error handling, and idempotency checks
worker.process(async (job) => {
  const { id, type, data } = job.data;
  // ... actual business logic here
});

The queue doesn't have to be complicated. For most applications, a simple database-backed job queue works fine. For higher volume, something like SQS, BullMQ (Redis-backed), or a managed queue service is worth considering. The key property you want is persistence — if your worker crashes, the job should survive and be retried.

One practical note: when you enqueue the event, store the full raw payload. Don't try to pre-process it or extract just the fields you think you need. If you later discover you needed a field you discarded, you have no way to get it back. Store everything, parse what you need in the worker. When debugging a queue item, being able to paste the payload into JsonFormatter.ai to pretty-print and inspect the full structure is a genuine time-saver.

At-Least-Once vs Exactly-Once Delivery

"At least once" delivery means the provider guarantees that every event will eventually be delivered to your endpoint at least one time. It may be delivered more than once (due to retries), but it won't be silently dropped. This is the standard guarantee.

"Exactly once" delivery — where every event arrives precisely once, no duplicates, no missed events — is distributed systems hard mode. Achieving it requires coordination between sender and receiver that most webhook implementations simply don't do. Some queue systems provide it, but it comes with significant complexity and performance trade-offs.

The pragmatic approach is to embrace at-least-once delivery and make your handlers idempotent. That way duplicates are harmless, and you still benefit from the retry mechanism when genuine failures occur. It's not as theoretically clean as exactly-once, but it's far simpler to implement correctly and it works well in practice.

Stripe's best practices documentation explicitly tells you to design for at-least-once delivery and use event IDs as idempotency keys. They've seen every way this goes wrong, and that's the pattern they recommend.

When Retries Run Out: Dead Letters and Reconciliation

What happens when Stripe has been retrying for 72 hours and your server is still failing? They stop. The event is lost — or more precisely, it's sitting in Stripe's delivery log marked as failed, and Stripe will never try again unless you manually trigger a retry from their dashboard.

For low-stakes events this might be acceptable. For payments, subscriptions, and anything financially significant, it's not. You need a recovery strategy.

The Stripe approach is to use their Events API as a source of truth. You can query GET /v1/events to retrieve all events from a time range, with filtering by type. A background reconciliation job that runs periodically — once a day, or after you recover from an outage — can pull all events from the past 24 hours and check whether you have a processed record for each one. Anything missing gets reprocessed. This is called the reconciliation pattern, and it's a critical backstop for production systems handling real money.

// Reconciliation job (run periodically)
async function reconcileStripeEvents(lookbackHours = 24) {
  const since = Math.floor(Date.now() / 1000) - (lookbackHours * 3600);

  // Pull all events from the past N hours; iterating with for-await uses
  // stripe-node's auto-pagination, so we see every page, not just the first 100
  const events = stripe.events.list({
    created: { gte: since },
    type: 'payment_intent.succeeded',
    limit: 100,
  });

  for await (const event of events) {
    const alreadyProcessed = await db.processedEvents.findOne({ eventId: event.id });
    if (!alreadyProcessed) {
      console.warn(`Missed event: ${event.id} (${event.type}) — reprocessing`);
      await processStripeEvent(event);
    }
  }
}

For your own job queue, the analogous concept is a dead letter queue (DLQ) — a holding place for jobs that have exceeded their maximum retry attempts. Instead of silently discarding them, you route them to the DLQ and alert. A developer can then inspect the failed jobs (their payload, the error, the retry history), fix the underlying issue, and requeue them manually.

Most job queue libraries (BullMQ, Sidekiq, Delayed Job) have built-in support for tracking jobs that exhaust their retries. AWS SQS has a dedicated dead letter queue feature where you configure a maximum receive count, after which messages are automatically moved to the DLQ rather than deleted. Set up alerting on your DLQ depth so you know within minutes when events are failing — don't find out days later from a customer complaint.
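For SQS specifically, the redrive policy is a queue attribute you pass to `SetQueueAttributes`. A sketch of building it (the ARN below is a placeholder):

```javascript
// Build an SQS RedrivePolicy attribute: after maxReceiveCount failed receives,
// SQS moves the message to the dead letter queue instead of deleting it.
function buildRedrivePolicy(dlqArn, maxReceiveCount = 5) {
  return {
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: String(maxReceiveCount), // SQS expects string values in this JSON
    }),
  };
}
```

A `maxReceiveCount` of 5 means a message gets five processing attempts before landing in the DLQ; tune it to your worker's own retry behavior so the two don't multiply.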

Stripe's Webhook Dashboard and Manual Retries

One practical advantage of working with Stripe specifically: they surface webhook delivery history directly in their dashboard. Under Developers → Webhooks, you can see every delivery attempt for every event — the request, the response code, the response body, and whether it succeeded or failed. You can manually trigger a retry for any failed event from the dashboard UI without touching any code.

This is genuinely useful during development and after outages. If you had a bug in your handler that caused 50 events to fail over a two-hour window, you can fix the bug, deploy, then bulk-retry them from the dashboard. No reconciliation job needed for those specific events.

The Stripe CLI also lets you replay specific events locally, which is great for testing your idempotency logic. You can fire the same event ID at your local server twice and verify your handler handles the duplicate correctly before deploying to production.

# Replay a specific event using the Stripe CLI
# (run `stripe listen --forward-to localhost:3000/webhooks/stripe` in another terminal first)
stripe events resend evt_abc123

# Resend the same event again — same event ID, so your handler should dedupe it
stripe events resend evt_abc123

# Trigger a fresh test event (note: each `stripe trigger` creates a NEW event
# with a new ID, so triggering twice does not exercise your dedupe logic)
stripe trigger payment_intent.succeeded

Not every webhook provider gives you this visibility. GitHub has a delivery log in repository settings, but it's less detailed. If you're building a platform that sends webhooks to your own users, consider investing in a delivery dashboard — it dramatically reduces the support burden when integrations aren't working as expected. Check the HTTP status code reference to make sure you're logging the right response codes and surfacing them clearly.

The Production-Ready Webhook Handler Checklist

Pull these threads together and you get a handler that's genuinely resilient. Here's what a production webhook receiver should do:

1. Verify the signature on the raw body before doing anything

Reject requests that don't have a valid signature with a 400, as in the examples above. Do this before any database access or business logic.

2. Enqueue the event immediately and return 200

Push the full raw payload to a persistent queue. Don't process inline. The entire handler should complete in under a second.

3. Check the event ID before processing in your worker

Use a database table with a unique constraint on event ID. If the insert fails due to conflict, the event was already processed — skip it and mark the job complete.

4. Route failed jobs to a dead letter queue and alert

Don't silently discard jobs that exhaust retries. Route them to a DLQ and set up an alert. Finding out about failures via a customer email is not an acceptable monitoring strategy.

5. Run a periodic reconciliation job for critical event types

For anything involving payments or subscriptions, cross-reference the provider's event API against your processed events table daily. Anything missing gets reprocessed.
