We migrated 50M events/month off Sentry. Here's what broke.
The four edge cases the migration script missed, the alerting gap that lasted six hours, and the customer-facing communication that worked.
At 50 million events per month, Sentry stops being a tool and becomes a line item that starts rivaling your infrastructure spend. We had watched the "Sentry tax" eat into our margins for two years, and we had also told ourselves the comforting lie that "just switching" was impossible—too risky, too entangled, too much custom alerting to rebuild. Eventually the math won. When we finally flipped the DSN over to GlitchReplay, we did find the cost savings. We also found four specific edge cases our migration scripts missed, including a six-hour alerting blackout that taught us exactly where the "SDK-compatible" abstraction leaks.
This is the honest version of that migration, not the marketing one. If you are moving a high event volume off a per-event-priced platform, these are the four things most likely to break, and the dual-write safety net that made the breakage survivable.
The Financial Gravity of 50M Events
Per-event pricing creates a perverse incentive: the better your product does, the more it costs to know how it's doing. At 50M events a month, that pressure stops being abstract.
The Sentry tax and the cost of visibility
We had quietly built a culture of not tracking things to save money—sampling traces down to a few percent, inbound-filtering high-frequency "noise" errors, dropping breadcrumbs. Every one of those decisions was a small surrender of visibility made for budgetary reasons, and collectively they meant our most expensive incidents were the ones we had the least data about. The rough comparison at our volume looked like this:
Monthly events:              ~50,000,000
Per-event platform (est.):   $9,000 - $12,000 / mo
                             + overage spikes during incidents
Flat-rate (GlitchReplay):    fixed monthly, no per-event meter
                             no overage, no sampling pressure
Result we cared about most:  we could stop sampling
Why SDK-compatibility was non-negotiable
We had Sentry SDKs threaded through a dozen services, frontends, and Workers. A migration that required rewriting instrumentation everywhere was a non-starter. The only acceptable path was a destination that spoke the same SDK and protocol, so the change could be, in theory, a DSN swap. "In theory" is doing a lot of work in that sentence—which is the whole point of this post.
The Migration Strategy: The Dual-Write Bridge
You do not cut over 50M events/month in one move and hope. We built a dual-write bridge: for the migration window, every event went to both Sentry and GlitchReplay simultaneously, so we could compare them side by side and keep the old system as a live safety net.
A beforeSend wrapper for dual ingestion
Rather than run two full SDKs, we wrapped the existing client so the same captured event was built once and delivered twice: the SDK's normal transport still sent it to Sentry, and a mirrored POST sent a JSON copy to GlitchReplay. The GlitchReplay write was fire-and-forget so it could never delay or block the primary path:
Sentry.init({
  dsn: SENTRY_DSN,
  beforeSend(event) {
    // Mirror the already-built event to the new ingest.
    // Fire-and-forget: never block or fail the primary write.
    try {
      const envelope = JSON.stringify(event);
      fetch(GLITCHREPLAY_INGEST, {
        method: "POST",
        body: envelope,
        keepalive: true, // survive page unload
      }).catch(() => {}); // swallow -- mirror must never break prod
    } catch {}
    return event; // primary path is untouched
  },
});
Does double-reporting hurt Core Web Vitals?
The reasonable worry: does a second fetch in beforeSend add user-facing latency? In practice, no—the call is asynchronous, fire-and-forget, and uses keepalive, so it doesn't sit on the critical path or block navigation. We measured no meaningful INP or LCP change during the dual-write window. The dual write was a backend cost (briefly paying for two systems), not a frontend one.
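One caveat is worth checking before copying the wrapper above. The Fetch standard caps the combined body size of in-flight keepalive requests at 64 KiB, so an oversized mirrored event would be rejected and then silently swallowed by the catch. Below is a minimal sketch of a size guard. The helper name `buildMirrorInit` is hypothetical; it is not part of the Sentry SDK or of our original wrapper.

```javascript
// Hypothetical helper (not from the Sentry SDK): build fetch() options for the
// mirror write, enabling keepalive only when the body fits within the 64 KiB
// budget the Fetch standard allows for in-flight keepalive requests.
const KEEPALIVE_LIMIT_BYTES = 64 * 1024;

function buildMirrorInit(event, limitBytes = KEEPALIVE_LIMIT_BYTES) {
  const body = JSON.stringify(event);
  // Measure UTF-8 bytes, not string length: multi-byte characters count more.
  const byteLength = new TextEncoder().encode(body).length;
  return {
    method: "POST",
    body,
    keepalive: byteLength <= limitBytes,
  };
}

// Usage inside beforeSend (still fire-and-forget):
//   fetch(GLITCHREPLAY_INGEST, buildMirrorInit(event)).catch(() => {});
```

The budget is shared across all keepalive requests currently in flight, so a body under the limit can still be refused during a burst. The guard only rules out the case that can never succeed.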
Failure #1: The Sourcemap Release Desync
The first thing to break was deminification. Stack traces that resolved perfectly in Sentry came through as minified gibberish on the new side.
Why "latest" is a dangerous release tag
In Sentry, releases had become a loose, almost global concept—much of our tooling leaned on a latest-style tag. Source-map resolution, though, is exact: the engine matches the release on the incoming event to the release the map was uploaded under. Our build uploaded maps under a precise commit-hash release, but a chunk of our SDK config still reported the loose tag. The two didn't match, so the resolver had no map to apply.
The 404 for the artifact hash
The symptom in the ingest logs was unambiguous once we knew to look:
[ingest] event release=web@latest
[sourcemap] lookup release=web@latest -> 404 no artifacts
[sourcemap] uploaded releases: web@a1b9f3c, web@7d2e0aa, ...
[stacktrace] de-minify SKIPPED, serving raw frames
The fix was to make the SDK's release identical to the build's upload release (the exact commit hash) everywhere, with no exceptions. The lesson generalizes: during a migration, every place a release is named must agree, byte for byte. Our source-maps documentation covers the upload-on-build flow; for one-off triage during the gap we leaned on the browser-based deminifier.
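One way to enforce that agreement is to derive the release string in exactly one place and refuse loose tags outright. The sketch below assumes a `project@commit` naming scheme like the `web@a1b9f3c` entries in the log; `resolveRelease` and `assertReleasesAgree` are hypothetical helper names, not part of any SDK or CLI.

```javascript
// Hypothetical build helpers: derive the single release string that both the
// source-map upload step and the SDK's `release` option must use.
const LOOSE_TAGS = new Set(["latest", "main", "master", "head"]);

function resolveRelease(project, commitSha) {
  const sha = String(commitSha ?? "").trim();
  if (LOOSE_TAGS.has(sha.toLowerCase())) {
    throw new Error(`refusing loose release tag "${sha}"; use the commit hash`);
  }
  // Accept a 7-40 character lowercase hex hash, as in web@a1b9f3c.
  if (!/^[0-9a-f]{7,40}$/.test(sha)) {
    throw new Error(`"${sha}" does not look like a commit hash`);
  }
  return `${project}@${sha}`;
}

// Byte-for-byte comparison, suitable for a CI step before deploy.
function assertReleasesAgree(sdkRelease, uploadRelease) {
  if (sdkRelease !== uploadRelease) {
    throw new Error(`release desync: SDK=${sdkRelease} upload=${uploadRelease}`);
  }
}
```

Failing the build on a mismatch is a deliberate choice here: a desynced release is invisible at runtime until someone opens a minified stack trace.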
Failure #2: The PII Scrubbing Leak
The second failure was the scary one, because it was about data we never wanted stored at all.
The password-field edge case
Our PII protection relied on a client-side beforeSend scrubber with a homegrown regex set. It had been tuned, over years, to Sentry's exact serialization quirks. When events started flowing through a second pipeline, the assumptions baked into those regexes—field ordering, nesting depth, how the SDK rendered certain objects—didn't all hold, and a nested billing_address.phone_number that the client regex happened to catch in one shape slipped through in another. Client-side scrubbing is brittle precisely because it is coupled to the exact shape of the data at one moment in one environment.
Why client-side scrubbing is never enough at scale
The real fix was not to patch the regex. It was to stop treating the client as the last line of defense. We moved authoritative scrubbing to the ingest side, where a single set of rules applies to every event from every service, unbypassable, regardless of which SDK or which environment produced it:
// Old: client-side, brittle, easy to bypass or desync.
beforeSend(event) {
  if (event.request?.data?.password) delete event.request.data.password;
  return event; // misses nested + new fields by construction
}
// New: server-side ingest rule, applied to 100% of events.
// rules:
//   - match: key ~= /(password|ssn|card|cvv|phone)/i  -> redact
//   - match: value ~= CREDIT_CARD | SSN | JWT         -> redact
//   - scope: ALL projects, ALL environments
This is a big enough topic that it has its own writeup (testing your error tracker for PII leaks) and a dedicated reference in our PII docs. The short version: client-side scrubbing is a bandwidth optimization, not a security control. You can audit your own payloads with the free PII scanner.
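To make the difference from the one-field client check concrete, here is a minimal sketch of a recursive scrubber of the kind an ingest pipeline could apply. The key pattern is the one from the rules above. The value patterns and the `scrub` function itself are simplified assumptions for illustration, not GlitchReplay's actual detectors.

```javascript
// Illustrative ingest-side scrubber. The key pattern comes from the rules
// above; the value patterns are simplified assumptions, not production-grade
// detectors.
const SENSITIVE_KEY = /(password|ssn|card|cvv|phone)/i;
const SENSITIVE_VALUES = [
  /\b\d{3}-\d{2}-\d{4}\b/, // US SSN shape
  /\b(?:\d[ -]?){12,18}\d\b/, // 13-19 digit card-like run
  /\beyJ[\w-]+\.[\w-]+\.[\w-]+/, // JWT-like three-part token
];
const REDACTED = "[redacted]";

function scrub(value, key = "") {
  // A sensitive key redacts its whole value, whatever the nesting depth.
  if (key && SENSITIVE_KEY.test(key)) return REDACTED;
  if (typeof value === "string") {
    return SENSITIVE_VALUES.some((re) => re.test(value)) ? REDACTED : value;
  }
  if (Array.isArray(value)) return value.map((item) => scrub(item));
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, scrub(v, k)])
    );
  }
  return value; // numbers, booleans, null pass through untouched
}
```

Because the walk is structural rather than tied to one serialization, a field such as `billing_address.phone_number` is caught wherever it appears in the payload.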
Failure #3: The Breadcrumb Overflow
The third failure came from the sheer volume of context we'd been carrying without noticing.
The memory pressure of too much context
Once we stopped sampling—the whole reason we'd migrated—we were suddenly capturing full breadcrumb trails on events we used to drop. Many of our services attached dozens of info-level breadcrumbs per event: every XHR, every console log, every state transition. At full fidelity, that turned modest error payloads into heavy ones, and our session replays ballooned past 20MB for long sessions.
Pruning the event stream
The distribution was lopsided—the exception itself was tiny; the breadcrumbs and serialized DOM were the bulk:
Event component       Share of payload
--------------------  ----------------
exception + stack     ~5%
tags + context        ~10%
breadcrumbs           ~35%
replay DOM snapshots  ~50%
The fix was to cap breadcrumbs to the last 20–30 genuinely useful ones, drop noisy info-level entries we never read, and lean on replay's incremental snapshotting instead of fat breadcrumb trails. Our always-on replay with a 3-day prune writeup covers how we keep full-fidelity capture affordable to store; you can model the storage yourself with the replay storage calculator.
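The capping step can be sketched as a small pure function. Which entries count as noise is an assumption here (debug- and info-level `console` and `xhr` breadcrumbs), and `pruneBreadcrumbs` is a hypothetical helper name; the right filter depends on what your team actually reads.

```javascript
// Hypothetical pruning helper. The definition of "noise" is an assumption:
// debug/info-level console and xhr breadcrumbs.
const NOISY_LEVELS = new Set(["debug", "info"]);
const NOISY_CATEGORIES = new Set(["console", "xhr"]);

function pruneBreadcrumbs(breadcrumbs, max = 25) {
  const useful = breadcrumbs.filter(
    (crumb) =>
      !(NOISY_LEVELS.has(crumb.level) && NOISY_CATEGORIES.has(crumb.category))
  );
  // Breadcrumbs are ordered oldest-first, so keep the tail.
  return max > 0 ? useful.slice(-max) : [];
}

// Usage inside beforeSend:
//   event.breadcrumbs = pruneBreadcrumbs(event.breadcrumbs ?? []);
```

The Sentry JavaScript SDKs also expose `maxBreadcrumbs` and `beforeBreadcrumb` options, which can enforce a similar policy at capture time rather than at send time.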
Failure #4: The 6-Hour Alerting Gap
The most painful failure produced no data corruption and no leak. It produced silence.
The regression-logic mismatch
We cut alerting over to the new platform assuming "resolved" and "regressed" meant the same thing on both sides. They didn't, quite. Our old setup leaned heavily on "resolved in next release" semantics—an issue marked resolved would re-alert only if it reappeared in a later release. During the cutover, the release-tracking that this logic depends on was exactly what was desynced (see Failure #1). So a batch of P0-class errors came in tagged with a release the alerting engine considered already-resolved, and the notifications never fired.
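A toy model shows how this kind of suppression can happen. It is an assumption about the general shape of "resolved in next release" logic, not either platform's real implementation; `shouldReAlert`, the issue fields, and the release ordering are all hypothetical.

```javascript
// Toy model of "resolved in next release" alerting. An assumption about the
// general shape of such logic, not either platform's actual code.
function shouldReAlert(issue, eventRelease, releaseOrder) {
  if (issue.status !== "resolved") return true;
  const resolvedAt = releaseOrder.indexOf(issue.resolvedInRelease);
  const seenAt = releaseOrder.indexOf(eventRelease);
  // Only a release known to be newer than the resolving one is a regression.
  // An unrecognized release has index -1, so it can never satisfy this check
  // and the event is treated as already resolved.
  return seenAt > resolvedAt;
}

const order = ["web@7d2e0aa", "web@a1b9f3c"]; // oldest first (hypothetical)
const issue = { status: "resolved", resolvedInRelease: "web@7d2e0aa" };
const regressed = shouldReAlert(issue, "web@a1b9f3c", order);
const desynced = shouldReAlert(issue, "web@latest", order);
```

In this model the properly tagged event re-alerts, while the event carrying the loose tag is suppressed even though it is a genuine regression.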
Why Slack went silent for P0
For about six hours overnight, real production errors were being captured perfectly—and saying nothing. The data was all there in the dashboard; the routing to humans was broken. We only caught it because the dual-write bridge meant Sentry was still alerting in parallel, and someone noticed Sentry paging while the new channel stayed quiet. That parallel-alerting safety net is the single most important thing we did right. The post-mortem timeline:
00:00 Alerting cut over to new platform
00:10 Release tags desync (Failure #1 still in flight)
00:15 P0 errors arrive, tagged "already resolved" -> suppressed
02:30 Sentry (still dual-writing) pages on the same errors
02:35 On-call notices: Sentry loud, new channel silent
06:?? Root cause found: regression logic + release desync
06:30 Release tags fixed, alerts replay, notifications fire
Lesson: never cut alerting over in a single step. Run alerts in parallel on both systems until you have proven, with a real fired alert, that the new one matches.
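Proving parity can be automated rather than left to whoever is on call. The sketch below assumes each system can export its fired alerts as records with a `fingerprint` and a `firedAt` timestamp in milliseconds; both field names and the `findMissingAlerts` helper are hypothetical.

```javascript
// Hypothetical parity check: every alert the old system fired should have a
// counterpart on the new system for the same issue within a tolerance window.
function findMissingAlerts(oldAlerts, newAlerts, toleranceMs = 5 * 60 * 1000) {
  return oldAlerts.filter(
    (old) =>
      !newAlerts.some(
        (candidate) =>
          candidate.fingerprint === old.fingerprint &&
          Math.abs(candidate.firedAt - old.firedAt) <= toleranceMs
      )
  );
}
```

Run on a schedule during the dual-write window, a non-empty result is itself worth paging on, since it means the old system is loud while the new one is silent.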
Customer Communication: Transparency as a Feature
Migrations of this size affect more than engineers. Stakeholders who depend on the dashboards needed to trust the new numbers before we removed the old ones.
The shadow-mode dashboard
During the dual-write window we gave stakeholders a side-by-side view: the same time range, the same filters, Sentry on the left and GlitchReplay on the right. Watching the two charts track each other day after day built more confidence than any reassurance could. When they diverged, it was usually because the new side was catching events the old side had sampled away—which made the case for the migration better than we could have.
Success metrics beyond "it's cheaper"
We defined success up front as: zero data loss versus the dual-write baseline, alert parity proven by matched fired alerts, source-map resolution rate at or above the old system, and only then cost. Leading with cost would have made it look like we were trading reliability for savings. We weren't.
Final Results: 90% Cost Reduction, 0% Data Loss
After fixing the four edge cases, the payoff was exactly what we'd hoped—and one thing we hadn't.
The freedom of zero sampling
The cost dropped roughly 90% at our volume, which was the headline. But the change that altered how we actually work was zero sampling. We stopped making little budgetary surrenders of visibility. We could finally capture the 1-in-a-million error—the rare race condition that only a fully-sampled stream ever sees—and the high-frequency "noise" errors we used to filter turned out to contain a real bug we'd been blind to for a year.
The operational overhead of a Cloudflare-native, SDK-compatible stack turned out to be lower than the SaaS we left, not higher, mostly because we no longer spent meetings deciding what not to track. If you're weighing a similar move, our migration guide walks the happy path and the alternatives comparison covers the landscape.
The meta-lesson of the whole thing: "SDK-compatible" gets you 95% of the way in an afternoon, and the last 5%—releases, PII, breadcrumbs, alert semantics—is where the real work lives. Plan a dual-write bridge, keep the old alerts running in parallel until the new ones prove themselves, and treat the cutover as a sequence, not a switch. That is how you move 50M events/month and live to write the postmortem. GlitchReplay is built for exactly this destination: Sentry-compatible ingest, full session replay, ingest-side PII scrubbing, and flat-rate pricing so growth never taxes your visibility.