The deploy that tripled our error rate at 2 AM
What we saw, what we did, and the three guardrails we added so it can't happen the same way again.
The silence of a 2:00 AM bedroom is broken by the specific, rhythmic buzz of a PagerDuty alert. You open the dashboard half-awake and see a vertical wall: the error rate hasn't just spiked, it has tripled. If you're on a per-event plan, your very first thought—before "what broke?"—is often "how much is this outage costing me in overages?" That is a deranged thing to be thinking during an incident, and yet it is exactly where per-event pricing puts your head.
For us, the focus stayed where it belonged: on a checkout flow that had quietly stopped working for a slice of users, throwing a TypeError that passed every CI check we had. This is the full postmortem—what we saw, what we did, the root cause that turned out to be a Cloudflare Worker binding, and the three guardrails we added so that the next 2 AM page is a ticket instead of a fire.
The Ghost in the Deploy
The deploy that caused it was the most boring kind. A routine config change went out late in the evening, sailed through staging green, and got merged. Nobody thought about it again until the pager went off hours later.
The false security of green CI/CD
CI tells you the code does what the tests assert. It does not tell you the code does what production needs, because the test environment is not production. Our staging environment had a particular secret and binding configuration; production had a slightly different one. The tests exercised the code path, but against staging's bindings, where everything resolved. Green CI gave us false confidence in a change whose failure mode only existed in the production environment.
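One way to catch this class of mismatch before a deploy is a parity check that compares the bindings and secrets the code expects against what each environment has configured. The following is a minimal sketch, not our actual tooling: `EnvManifest`, `findMissingBindings`, and the hard-coded manifests are hypothetical, and in practice the lists would have to come from your own configuration source.

```typescript
// Hypothetical shape: the names each environment actually has configured.
interface EnvManifest {
  name: string;
  bindings: string[]; // e.g. D1 bindings such as "DB"
  secrets: string[]; // e.g. "CHECKOUT_SIGNING_KEY"
}

// Returns, per environment, the required names that are not configured.
// Environments with nothing missing are left out of the report.
function findMissingBindings(
  required: string[],
  envs: EnvManifest[],
): Record<string, string[]> {
  const report: Record<string, string[]> = {};
  for (const env of envs) {
    const present = new Set([...env.bindings, ...env.secrets]);
    const missing = required.filter((n) => !present.has(n));
    if (missing.length > 0) report[env.name] = missing;
  }
  return report;
}

// Illustrative data mirroring the incident: the secret exists only in staging.
const requiredNames = ["DB", "CHECKOUT_SIGNING_KEY"];
const parityReport = findMissingBindings(requiredNames, [
  { name: "staging", bindings: ["DB"], secrets: ["CHECKOUT_SIGNING_KEY"] },
  { name: "production", bindings: ["DB"], secrets: [] },
]);
console.log(parityReport);
```

Run as a CI step, a non-empty report would fail the build before the change ever reaches production.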
Why the 2 AM window is the danger zone
Late deploys correlate with incidents for unglamorous reasons: fewer eyes on the dashboard, slower human response, and traffic patterns that differ from peak hours. The specific edge case that broke us only triggered for a subset of requests that happened to route through a particular configuration. During the day that subset is a firehose and the problem would have been obvious in minutes. At 2 AM it was a trickle that still tripled our (low, overnight) baseline error rate—loud enough to page, quiet enough to be easy to misread. A config that looked perfectly fine was the culprit:
# wrangler.toml -- looks completely normal
name = "checkout-api"
main = "src/index.ts"
[[d1_databases]]
binding = "DB" # code references env.DB
database_name = "checkout-prod"
database_id = "..."
# The secret CHECKOUT_SIGNING_KEY was set in staging
# but never promoted to the production environment.
# Nothing here fails to deploy. It fails at runtime.
Triaging the Spike: Noise vs. Signal
The first job in any incident is not to fix anything. It is to figure out whether you are looking at one problem or a thousand, and whether it is real or noise.
Reading the wall of red
The error distribution chart told the story immediately. Before the deploy marker, a flat, low overnight baseline. After it, a sharp step-change—a wall of red—that started precisely at the deploy timestamp. That alignment is the single most useful signal in incident triage: when the error rate steps up at the exact moment of a release, you have your suspect before you've read a single stack trace.
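The deploy-marker check can be approximated by counting errors in equal windows on either side of the release timestamp. This is a sketch under stated assumptions: `stepChangeAtDeploy`, the ten-minute window, and the sample data are all illustrative, and raw counts only stand in for rates when traffic is roughly equal across both windows.

```typescript
// Counts errors in equal windows before and after a deploy timestamp
// and reports the ratio between them. Timestamps are epoch milliseconds.
function stepChangeAtDeploy(
  errorTimestamps: number[],
  deployAt: number,
  windowMs: number,
): { before: number; after: number; ratio: number } {
  const before = errorTimestamps.filter(
    (t) => t >= deployAt - windowMs && t < deployAt,
  ).length;
  const after = errorTimestamps.filter(
    (t) => t >= deployAt && t < deployAt + windowMs,
  ).length;
  // Avoid dividing by zero when the pre-deploy window is completely clean.
  const ratio = before === 0 ? (after === 0 ? 1 : Infinity) : after / before;
  return { before, after, ratio };
}

const MINUTE_MS = 60000;
const deployTime = 100 * MINUTE_MS;
// Made-up data: 2 errors in the 10 minutes before the deploy, 6 in the 10 after.
const sampleErrors = [92, 97, 100, 101, 103, 104, 106, 108].map(
  (m) => m * MINUTE_MS,
);
const change = stepChangeAtDeploy(sampleErrors, deployTime, 10 * MINUTE_MS);
console.log(change);
```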
Fingerprinting 10,000 instances into one cause
Ten thousand individual error events is not ten thousand problems. Good fingerprinting groups them by their structural signature—same error type, same call site—so a flood collapses into a handful of distinct issues. Our wall of red collapsed into exactly one fingerprint: a single TypeError originating from the checkout submission path. That instantly told us this was one root cause with massive blast radius, not a systemic meltdown across the app. (We go deep on this in error fingerprinting and deduplication.)
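To make the grouping concrete, here is a minimal sketch of fingerprinting by error type and call site. The `CapturedError` shape and the `type@frame` key format are assumptions for illustration; real trackers use richer signatures.

```typescript
// Hypothetical event shape; real SDK payloads carry many more fields.
interface CapturedError {
  type: string; // e.g. "TypeError"
  message: string;
  topFrame: string; // function and file of the top in-app frame
}

// Structural signature: error type plus call site. The message is ignored
// because it often contains per-request values that would split one cause.
function fingerprint(e: CapturedError): string {
  return `${e.type}@${e.topFrame}`;
}

// Collapses a list of events into a count per fingerprint.
function groupByFingerprint(events: CapturedError[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const key = fingerprint(e);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Ten thousand events with different messages but the same type and call site.
const flood: CapturedError[] = Array.from({ length: 10000 }, (_, i) => ({
  type: "TypeError",
  message: `checkout failed for request ${i}`,
  topFrame: "submitCheckout (src/checkout/submit.ts)",
}));
const groups = groupByFingerprint(flood);
console.log(groups.size, Array.from(groups.entries()));
```

Leaving the message out of the key is the design choice that matters here: every event carries a different request number, yet they all land in a single group.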
Beyond the Stack Trace: Why Logs Weren't Enough
Knowing which deploy and which fingerprint still doesn't tell you why. And the raw stack trace actively worked against us.
The frustration of minified traces
The first thing the trace showed was the classic edge-and-bundler garbage: undefined is not a function pointing at index.js:1:48201. That column number is meaningless on its own—it is a position inside a single-line minified bundle that stitches together dozens of source files. At 2 AM, the last thing you want is to do arithmetic on a minified blob.
How source maps complicate the 2 AM triage
Source maps fix this—if the map uploaded at build time matches the exact artifact that's running. During a hurried late deploy that linkage is fragile, and a desynced release tag means the resolver can't find the right map and you're back to staring at column 48201. When it works, the difference is night and day:
// Before (raw):
TypeError: undefined is not a function
    at o (index.js:1:48201)
// After (source-mapped):
TypeError: Cannot read properties of undefined (reading 'prepare')
    at submitCheckout (src/checkout/submit.ts:42:18)
    at handler (src/index.ts:71:9)
That resolved trace pointed straight at the database accessor derived from env.DB being undefined inside submitCheckout. If you have a stray minified trace right now, our free deminifier resolves it in the browser; for the durable version, see automatic source-map resolution on ingest.
The Aha Moment: Session Replay to the Rescue
The resolved trace told us a binding was undefined. It did not tell us what the user experienced, or whether the failure was as total as it looked. For that we jumped from the error to the replay.
Connecting the error ID to a replay ID
Each captured error carried a replay ID, so going from "here is the TypeError" to "here is a recording of a user hitting it" was a single click. No reproduction, no asking a customer what they did.
Watching the rage click happen
The replay was the gut-punch confirmation. The user filled in checkout, clicked "Submit," and nothing happened. The button click fired, the request to the Worker failed because the database accessor was undefined, the promise rejected, and there was no UI-level catch to surface anything. So from the user's side the page just sat there. They clicked Submit again, and again (textbook rage clicks), then abandoned. The combination of the resolved stack trace (the what) and the replay (the experienced impact) gave us total clarity within minutes of waking up.
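The silent failure in that replay comes down to a rejected promise with nothing catching it on the client. A hedged sketch of a wrapper that would have surfaced it follows; `submitWithFeedback`, the `SubmitResult` shape, and the message text are hypothetical and not taken from our codebase.

```typescript
type SubmitResult =
  | { ok: true }
  | { ok: false; userMessage: string };

// Wraps any submit function so a rejection becomes a visible state
// instead of a button that appears to do nothing.
async function submitWithFeedback(
  send: () => Promise<void>,
): Promise<SubmitResult> {
  try {
    await send();
    return { ok: true };
  } catch (err) {
    // Report to your error tracker here, then tell the user something.
    console.error("checkout submit failed:", err);
    return {
      ok: false,
      userMessage: "We couldn't place your order. Please try again.",
    };
  }
}

// Stand-in for the failing request to the Worker.
const failingSend = async (): Promise<void> => {
  throw new Error("checkout request failed");
};

submitWithFeedback(failingSend).then((result) => {
  console.log(result);
});
```

The caller can render `userMessage` next to the button, which gives the user a reason to stop clicking and gives the replay something visible to show.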
The Root Cause: A Cloudflare Worker Binding Issue
With the trace pointing at env.DB and the replay confirming a total submit failure, the root cause came together fast.
The environment mismatch
The config change had restructured how the Worker accessed its D1 database and its signing secret. In staging, both the binding and the secret were present, so the code path resolved and tests passed. In production, the signing secret had never been promoted to the production environment, and a guard clause that depended on it caused the database accessor to short-circuit to undefined. Calling .prepare() on that undefined accessor threw the TypeError on every checkout submission.
// The failing path, simplified
async function submitCheckout(env, body) {
  const key = env.CHECKOUT_SIGNING_KEY; // undefined in prod
  const db = key ? env.DB : undefined; // guard short-circuits
  // env.DB was fine; the missing secret left db undefined via this guard
  return db.prepare("INSERT ...").bind(...).run(); // throws here: db is undefined
}
It was not a code bug in the traditional sense. The logic was internally consistent. It was an environment-parity bug: the kind that is invisible until the exact production configuration meets the exact code path. For the deeper mechanics of running and instrumenting Workers so these failures actually report, see error tracking on Cloudflare Workers. The fix itself was trivial: promote the secret, redeploy. Total time from page to mitigation was under twenty minutes, most of it spent confirming rather than guessing.
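A runtime complement to promoting the secret is to fail fast with a named error, so a missing secret never quietly turns another dependency into undefined. This is a sketch, assuming a hypothetical `requireBindings` helper that is not part of any Cloudflare API; the env object below is a stand-in for the Worker's real `env`.

```typescript
// Throws a descriptive error naming every missing binding or secret.
function requireBindings(env: Record<string, unknown>, names: string[]): void {
  const missing = names.filter(
    (n) => env[n] === undefined || env[n] === null,
  );
  if (missing.length > 0) {
    throw new Error(`Missing required bindings/secrets: ${missing.join(", ")}`);
  }
}

// Stand-in for production that night: DB is bound, the secret is not set.
const prodEnv = { DB: {}, CHECKOUT_SIGNING_KEY: undefined };

let failure = "";
try {
  requireBindings(prodEnv, ["DB", "CHECKOUT_SIGNING_KEY"]);
} catch (err) {
  failure = err instanceof Error ? err.message : String(err);
}
console.log(failure);
```

Called at the top of the request handler, a check like this would have replaced the opaque TypeError with an error message that names the missing secret.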
The Hidden Cost of Errors (It's Not Just the Bill)
Here is the broader point the incident drove home. The real cost of an outage is engineering time and lost conversions—not the line item on your monitoring invoice. But per-event pricing inserts the invoice into your incident response anyway.
Why flat-rate changes the psychology
When a tripled error rate also means a tripled (or worse) monitoring bill, part of your brain is doing cost arithmetic during the exact minutes you need to be debugging. Worse, the long-term effect is that teams pre-emptively sample or filter errors to control spend—which means the one event that explains the incident may be the one you chose not to capture. Flat-rate pricing removes that pressure entirely: you capture everything, always, and the cost is the same whether it's a quiet Tuesday or a 2 AM meltdown.
The cost comparison that matters
Cost to track the incident (per-event):   spikes with the outage
Cost to track the incident (flat-rate):   $0 marginal -- same as always
Cost to fix the incident (engineering):   ~20 min x on-call eng
Cost of NOT seeing the incident clearly:  hours of lost checkout
The tooling fee should be the most boring number in the entire postmortem. We dig into the per-event-vs-flat-rate math in our pricing breakdown.
Guardrail 1: Error Budget Alerts
The first guardrail was alerting that triggers on the right shape of failure. Absolute thresholds are noisy; a fixed "100 errors" trigger means something different at 2 AM than at noon. We alert on a percentage deviation against a rolling baseline, scoped to a release.
{
  "metric": "error_rate",
  "comparison": { "baseline": "trailing_1h", "increase_pct": 200 },
  "scope": { "release": "current" },
  "severity": {
    "warning": { "increase_pct": 100, "notify": "slack:#deploys" },
    "critical": { "increase_pct": 200, "notify": "pagerduty" }
  }
}
Warning-level deviations go to Slack where someone catches them in the morning; a true tripling pages a human. Tying this to a formal error budget also forces the healthy conversation about how much failure you're willing to spend on velocity; see calculating an error budget from your tracker.
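The arithmetic behind those thresholds is simple enough to sketch. The `increasePct` and `classify` functions below are illustrative names, not part of any alerting product, and the decision to treat any error on a zero baseline as critical is an assumption you may not share.

```typescript
type Severity = "ok" | "warning" | "critical";

// Percentage increase of the current rate over the baseline rate.
// A tripling (baseline 10, current 30) is a 200% increase.
function increasePct(baseline: number, current: number): number {
  // Assumed policy: with a zero baseline, any error counts as unbounded growth.
  if (baseline <= 0) return current > 0 ? Infinity : 0;
  return ((current - baseline) / baseline) * 100;
}

function classify(
  baseline: number,
  current: number,
  thresholds = { warning: 100, critical: 200 },
): Severity {
  const pct = increasePct(baseline, current);
  if (pct >= thresholds.critical) return "critical";
  if (pct >= thresholds.warning) return "warning";
  return "ok";
}

// Rates expressed as errors per 1,000 requests to keep the math exact.
console.log(classify(10, 15), classify(10, 20), classify(10, 30));
```

Expressing rates as whole numbers (errors per 1,000 requests) sidesteps floating-point rounding right at a threshold, where a tripling computed from fractional rates can land a hair under 200%.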
Guardrail 2: Automated Replay on New Errors
The second guardrail: any error fingerprint we've never seen before gets prioritized for full replay capture and a high-priority flag, automatically. New fingerprints are exactly what a bad deploy produces—a failure mode that didn't exist an hour ago. Prioritizing "first seen" events means the explanatory replay is already waiting when the alert fires, instead of being something you hope you captured. The same workflow doubles as QA: after we shipped the secret fix, we used live replays of real checkout sessions to confirm the flow was healthy again before standing down.
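The "first seen" rule reduces to a set-membership check. The sketch below is in-memory and purely illustrative: `FirstSeenTracker`, `decideCapture`, and the `CaptureDecision` fields are hypothetical names, and a real tracker would persist known fingerprints across restarts.

```typescript
// In-memory sketch; a real tracker would persist known fingerprints.
class FirstSeenTracker {
  private known = new Set<string>();

  // Returns true only the first time a fingerprint is observed.
  observe(fingerprint: string): boolean {
    if (this.known.has(fingerprint)) return false;
    this.known.add(fingerprint);
    return true;
  }
}

interface CaptureDecision {
  prioritizeReplay: boolean;
  priority: "high" | "normal";
}

// New fingerprints get replay priority and a high-priority flag.
function decideCapture(
  tracker: FirstSeenTracker,
  fingerprint: string,
): CaptureDecision {
  const isNew = tracker.observe(fingerprint);
  return { prioritizeReplay: isNew, priority: isNew ? "high" : "normal" };
}

const tracker = new FirstSeenTracker();
const first = decideCapture(tracker, "TypeError@submitCheckout");
const repeat = decideCapture(tracker, "TypeError@submitCheckout");
console.log(first, repeat);
```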
Guardrail 3: Canary Deployments
The third guardrail attacks blast radius. The bad deploy hit all production traffic the instant it went live. Cloudflare's gradual deployments let you route a small percentage of traffic to a new version first.
# Roll the new version to 10% of traffic, watch, then ramp.
# Upload the version without shifting traffic; a plain `wrangler deploy`
# would send it to 100% immediately.
wrangler versions upload
wrangler versions deploy --percentage 10 # canary
# Monitor GlitchReplay filtered by the new version/release tag.
# If error_rate on that release spikes, roll back before
# the other 90% of users ever see it.
By filtering our error data by the release tag, we can watch the canary's error rate in isolation. Had this been in place that night, the missing-secret TypeError would have shown up on 10% of traffic, tripped the budget alert on that release alone, and triggered an automated rollback, so the other 90% of users would never have seen a broken checkout. The page might not have happened at all.
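The rollback decision itself can be sketched as a comparison between the canary release and the stable one. Everything here is an assumption for illustration: `canaryDecision`, the 500-request minimum, and the 200% threshold are hypothetical, and the code only returns a decision; wiring it to an actual rollback is left to your own automation.

```typescript
type CanaryAction = "rollback" | "hold" | "ramp";

interface ReleaseStats {
  requests: number;
  errors: number;
}

function canaryDecision(
  stable: ReleaseStats,
  canary: ReleaseStats,
  minRequests = 500, // assumed: don't judge the canary on too little traffic
  maxIncreasePct = 200, // assumed: same threshold as the critical alert
): CanaryAction {
  if (canary.requests < minRequests) return "hold";
  // Compare error rates by cross-multiplying integers, which avoids both
  // floating-point rounding and division by a zero stable error rate.
  const lhs = canary.errors * stable.requests * 100;
  const rhs = stable.errors * canary.requests * (100 + maxIncreasePct);
  return canary.errors > 0 && lhs >= rhs ? "rollback" : "ramp";
}

// Illustrative numbers: stable runs at 1% errors, the canary at 5%.
const stableStats = { requests: 9000, errors: 90 };
console.log(canaryDecision(stableStats, { requests: 1000, errors: 50 }));
```

The `hold` state matters at 2 AM in particular: with overnight traffic, a 10% canary may need a while to collect enough requests for the comparison to mean anything.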
Making 2 AM Boring Again
The point of all this is not heroics. It is the opposite. The right tooling turns a 2 AM fire into a 2 AM ticket: a deploy marker that names the suspect, fingerprinting that collapses ten thousand events into one cause, source maps that turn gibberish into a file and line, a replay that confirms the human impact, and guardrails that shrink the blast radius before a human is even involved. That is the difference between monitoring—knowing something is wrong—and observability—knowing what is wrong and why.
We built GlitchReplay around exactly this stack: a Sentry-compatible SDK, automatic source-map resolution, error-to-replay linking, and budget-based alerts, all at a flat rate so the cost of an outage is never the cost of seeing the outage. Stop paying a success tax on your own incidents. Track every error and watch every replay for one predictable price—and make 2 AM boring again.
GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.