Always-on session replay, priced like on-error: how 3-day pruning makes it work

Why we record every session and prune anything not tied to an error after 3 days. Storage math, the late-arriving-error buffer, and why this beats the 30-second pre-error window.

GlitchReplay team
replay · pricing

It's 2:00 AM. The pager goes off because checkout is throwing a 500 for one specific user, and only that user. You open the replay attached to the error — the feature that's supposed to make these moments survivable. It shows the last 30 seconds before the throw: the user clicks "Pay," the spinner spins, the error fires. That's it. No browse history, no cart-edit sequence, no clue what got them into the broken state.

So you do what every on-call has done in 2026: you DM the customer to ask what they did before the error. They reply at 9 AM with a vague "I think I changed quantities a couple times?" You spend the morning trying to reproduce. The bug, you eventually discover, was triggered by a coupon-application step the user took four minutes before checkout. Your replay started thirty seconds before the throw. The fix isn't a longer buffer. It's recording the entire session and pruning aggressively when nothing breaks.

Why the 30-second buffer was always a compromise

The standard SDK pattern in 2025 looked like this: replaysSessionSampleRate: 0, replaysOnErrorSampleRate: 1.0. Translated, that means "don't record anything by default; when an error fires, flush the last ~30 seconds of buffered events." Sentry shipped it that way. Most replay vendors did. It existed for one reason: per-session pricing made full recording financially insane.
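
For reference, here is a sketch of that buffered-only default, using the same option names the post quotes and the same placeholder DSN as the config at the end of this post (the replay integration itself is omitted, as it is in that config):

import * as Sentry from "@sentry/browser"; // or whichever Sentry-compatible SDK you already use

Sentry.init({
  dsn: "https://your_api_token@glitchreplay.com/project_id",
  replaysSessionSampleRate: 0,   // don't record sessions by default
  replaysOnErrorSampleRate: 1.0, // on error, flush the ~30s pre-error buffer
});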

The "buffer was too short" problem

Real bugs don't politely announce themselves within 30 seconds of the user action that caused them. State accumulates. A user changes a shipping address, navigates away, comes back, applies a discount, and only then does the cart calculation throw. Your 30-second window catches the throw. It misses the address change two minutes earlier — the actual root cause. Every on-call engineer has had this conversation: "I can see the error, but the replay doesn't go back far enough to show me what they did." You end up debugging by interview rather than by replay.

The "we missed the lead-up" gap

The other failure mode is the inverse: an error that fires within seconds of the page load, before the buffer has anything meaningful in it. The user lands on the page, an effect fires, something blows up. Your replay shows two seconds of a half-rendered DOM. You can't see the URL params they came from, the cookie state, the prior-session context the SPA hydrated from local storage. The 30 seconds you saved didn't buy you debuggability — it bought you a tape that started rolling after the interesting part was over.

Why this compromise was forced by per-session pricing

If your vendor charges $6 per 1,000 replays, "record every session" is a tax on every visitor. A site with 100,000 monthly sessions would be looking at $600/mo just for replays — most of which would never be watched. The on-error buffer was the rational compromise: only pay for replays you might actually need. Decouple the storage cost from the per-session billing model and a different default becomes obvious.

The two-bucket retention model

Here's what GlitchReplay does instead. Every session gets recorded in full. Then there are two retention buckets, and a session moves between them based on whether it produced an error.

3 days is enough for a bug reported a day or two late

The non-error bucket keeps every session for 3 days, then prunes. Three days is the floor on every plan, including Free. The reasoning is empirical: a customer reports a bug within 1–2 days of hitting it about 95% of the time, and the long tail past 3 days almost always corresponds to errors that already fired (and therefore already moved into the error-attached bucket). Three days covers the "I noticed something weird yesterday" cases without ballooning storage.

3 days is short enough not to become a UX-research product

This is deliberate. Tools like FullStory and Hotjar exist to keep months of session data so PMs can analyze funnels and watch heatmaps. Different product, different buyer, different price tag. GlitchReplay is for engineers debugging errors. The 3-day floor keeps us out of the UX-research category, keeps storage costs predictable, and lets us include replay on every plan instead of charging $6/1k for it.

Linking is automatic via the SDK's replay_id

When an error event fires, the SDK attaches the active replay_id to it. The ingest pipeline writes a row into issue_replays the first time it sees that pairing. The nightly retention sweep checks each replay segment: if any error in the project references its replay_id, the segment moves into the error-attached bucket and is kept for the plan's retention window — 14 days on Free, 90 days on Pro, 365 days on Business. Otherwise it's pruned at the 3-day floor. No tagging, no manual classification, no PM clicking "save this replay."
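
A minimal sketch of the ingest-side half of that, assuming a Postgres-backed pipeline: the issue_replays table name comes from this post, while the column names, event shape, and the node-postgres client are illustrative assumptions.

import { Pool } from "pg";

const pool = new Pool(); // connection settings via PG* env vars

// Hypothetical shape of an ingested error event. The only field that
// matters here is the replay_id the SDK attached, if any.
interface IngestedError {
  issueId: string;
  replayId: string | null;
}

// Record the error-to-replay pairing the first time it's seen. This row
// is what the nightly retention sweep checks before pruning a segment.
async function linkReplay(event: IngestedError): Promise<void> {
  if (!event.replayId) return; // error fired with no active replay
  await pool.query(
    `INSERT INTO issue_replays (issue_id, replay_id)
     VALUES ($1, $2)
     ON CONFLICT DO NOTHING`,
    [event.issueId, event.replayId],
  );
}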

The storage math

The whole model only works if "record everything" doesn't become "pay everything." The arithmetic falls out cleanly once you accept that 99% of sessions never see an error, which means 99% of bytes evaporate at the 3-day prune.

Worked example: 100k sessions/day

Take a Pro-tier customer running 100,000 sessions/day with the rrweb defaults the replay storage estimator assumes — a 120 KB initial DOM serialization plus roughly 40 mutations/minute at 400 bytes each over a 4-minute average session. Raw, that's about 184 KB per session. Brotli-compressed at the standard ~0.3 ratio, you store roughly 55 KB per session.
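
Spelled out, with the estimator's default assumptions as inputs (these are the post's figures, not measurements from any particular app):

// Per-session size under the estimator's default assumptions.
const initialSnapshotKB = 120;  // full DOM serialization on page load
const mutationsPerMinute = 40;
const bytesPerMutation = 400;
const sessionMinutes = 4;
const brotliRatio = 0.3;        // compressed size / raw size

const mutationKB = (mutationsPerMinute * sessionMinutes * bytesPerMutation) / 1000; // 64 KB
const rawKB = initialSnapshotKB + mutationKB;  // 184 KB
const storedKB = rawKB * brotliRatio;          // ~55 KB on disk

console.log({ rawKB, storedKB }); // { rawKB: 184, storedKB: 55.2 }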

Without the prune model, holding every session for 365 days means roughly 5.5 GB of new segments a day, about 2 TB resident at steady state, or around $30/mo at Cloudflare R2's $0.015/GB·mo. Not ruinous, but most replay vendors avoid even that by simply not offering it: they charge per session and quietly cap retention. With the two-bucket model, the math collapses. Assume a 1% error rate (generous; most production apps sit closer to 0.3%). 99% of sessions live for 3 days; 1% live for 365 days. Steady-state storage works out to roughly 36 GB resident (about 16 GB of short-lived non-error sessions plus 20 GB of error-attached ones), or roughly $0.55/mo. The error-attached bucket, the part you actually need long-term, is two orders of magnitude smaller than the raw 365-day stream.
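
The same numbers carried through to the fleet level, with a 1% error rate and R2 at $0.015/GB per month:

// Steady-state storage for 100k sessions/day at ~55 KB stored per session.
const sessionsPerDay = 100_000;
const storedKB = 55;
const gbPerDay = (sessionsPerDay * storedKB) / 1e6;  // ~5.5 GB of new segments per day

const r2PerGBMonth = 0.015;
const errorRate = 0.01;

const holdEverythingGB = gbPerDay * 365;             // ~2,000 GB resident
const twoBucketGB =
  gbPerDay * (1 - errorRate) * 3 +                   // non-error bucket: ~16 GB
  gbPerDay * errorRate * 365;                        // error bucket: ~20 GB

console.log({
  holdEverything: holdEverythingGB * r2PerGBMonth,   // ≈ $30/mo
  twoBucket: twoBucketGB * r2PerGBMonth,             // ≈ $0.55/mo
});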

On-error 30s buffer vs always-on with 3-day prune

Here's the comparison nobody runs in public: at 100k sessions/day, a 30-second on-error buffer with 30-day retention holds on the order of a gigabyte; always-on recording with the 3-day prune and 365-day error retention holds a few tens of gigabytes. On R2, both land well under a dollar a month. The storage bill is a rounding error either way. One of those two designs lets you open every error with the full session that produced it. The other one lets you watch the user click "Pay" and nothing else.
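
And the buffered design under the same assumptions (a 30-second buffer for the 1% of sessions that error, kept 30 days), for comparison:

// On-error 30s buffer: only the 1% of sessions that error get stored.
const errorSessionsPerDay = 100_000 * 0.01;
const bufferRawKB = 120 + (40 * 0.5 * 400) / 1000;  // snapshot + 30s of mutations ≈ 128 KB
const bufferStoredKB = bufferRawKB * 0.3;            // ≈ 38 KB after compression

const bufferGB = (errorSessionsPerDay * bufferStoredKB * 30) / 1e6; // ≈ 1.2 GB resident
console.log({ bufferGB, dollarsPerMonth: bufferGB * 0.015 });       // ≈ $0.02/mo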

The late-arriving-error gotcha

The naive implementation of this is a SQL job that runs DELETE FROM replay_segments WHERE created_at < NOW() - INTERVAL '3 days' AND id NOT IN (SELECT replay_id FROM issue_replays). Run that nightly and you will, occasionally, delete a session that's about to be linked to an error.

Errors can arrive seconds-to-minutes after a session ends

A user hits an error at the end of their session, the SDK queues the event, the user closes the tab, sendBeacon fires on unload. In the happy path that arrives in milliseconds. In the unhappy path the network was flaky, the device went to sleep, and the event lands when it reconnects. We've seen errors arrive 90+ seconds after the session segment closed on normal traffic, and much longer on mobile networks. If the prune ran in that window, you'd link the error to a replay that no longer exists.

Solution: prune at 4 days, not 3

The fix is a 1-day buffer. The retention sweep deletes non-error sessions older than 4 days, even though the advertised retention is 3 days. Day 4 is the safety margin. For an error to bind to a now-missing session, the error event would have to arrive more than 24 hours after the session ended — well outside any plausible network-delay or buffered-beacon scenario. You give up nothing user-facing (sessions are still "gone" at 3 days from the customer's perspective; they're unlinkable in the UI), and you sidestep the late-arrival race entirely.
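
Concretely, a sketch of the adjusted sweep, reusing the schema implied by the naive query (replay_segments.id matched against issue_replays.replay_id) with an illustrative node-postgres client:

import { Pool } from "pg";

const pool = new Pool();

// Nightly retention sweep. The advertised floor is 3 days; the DELETE waits
// until day 4 so a late-arriving error can still bind to its session.
async function retentionSweep(planRetentionDays: number): Promise<void> {
  // Error-attached segments: keep for the plan window (14/90/365 days).
  await pool.query(
    `DELETE FROM replay_segments
      WHERE created_at < NOW() - make_interval(days => $1::int)
        AND id IN (SELECT replay_id FROM issue_replays)`,
    [planRetentionDays],
  );

  // Everything else: prune one day past the advertised 3-day floor.
  await pool.query(
    `DELETE FROM replay_segments
      WHERE created_at < NOW() - INTERVAL '4 days'
        AND id NOT IN (SELECT replay_id FROM issue_replays)`,
  );
}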

Edge case: errors that match a session that was already pruned

You will rarely see an error event come in with a replay_id that no longer has segments on disk. The pipeline logs it and moves on. The error itself is fine — stack trace, breadcrumbs, tags all intact. The replay link in the UI renders as "replay no longer available." Nobody's data is corrupted. We see it on the order of a handful of events per month per project. Not worth a more elaborate solution.

What this enables product-side

Once the storage model stops fighting the debugging model, a few things become possible that weren't before.

Every error opens with the full session

Click into any error in the dashboard and the replay player loads from page-load through the throw. Not the last 30 seconds. The whole thing. The 2 AM SRE story stops being a story — you scrub back four minutes, watch the coupon get applied, see the bug. You don't DM the customer. You don't try to reproduce. You write the fix.

Free tier becomes evaluation-grade

Free is 10,000 sessions/month, recorded in full, errors kept 14 days, non-error sessions pruned at 3 days. That's a real evaluation. Sentry's Developer free tier gives you 50 replays per month — enough to confirm the feature exists, not enough to actually debug with. We'd rather you see the product working on your real traffic for a month and then upgrade because you want the longer error-attached retention, not because you ran out of replay quota in week one.

The upgrade lever moves

The question stops being "do you get replay" — everyone gets replay — and becomes "how long do we keep the session that broke you." Free keeps it for 14 days. Pro keeps it for 90. Business keeps it for 365. That's a clean, honest upgrade conversation, and it lines up with how compliance and incident-review processes actually work. Nobody wants to argue with finance about replay sample rates. They want to know whether they can pull the replay for an incident that surfaced six months later in a customer support escalation.

Try it

If you're running the standard replaysSessionSampleRate: 0, replaysOnErrorSampleRate: 1.0 config and you've had even one 2 AM debugging slog where the buffer was too short, the change is one line:

Sentry.init({
  dsn: "https://your_api_token@glitchreplay.com/project_id",
  replaysSessionSampleRate: 1.0, // record every session
  replaysOnErrorSampleRate: 1.0,
});

You can sign up free and point an existing Sentry-compatible SDK at the GlitchReplay DSN in under five minutes — the full walkthrough is in our migration guide. The next time the pager goes off, the replay will start where the session started. Not where the buffer caught up.

Stop watching your error bill spike.

GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.