Reading your error budget: a guide for engineering managers
How to convert raw error counts into a number leadership cares about, without hiring an SRE or installing Datadog.
It's 4:00 PM on a Thursday. Your dashboard shows 14,000 new issues in the last 24 hours. Your product manager is pushing for a Friday release of the new checkout feature. Is the system stable enough to ship? If you can't answer that question with a single number that reflects what users are actually experiencing, you're not managing technical debt—you're reacting to it. The error budget is the tool that turns "it feels risky" into "we're at 90% budget exhaustion, so no."
Error budgets get framed as SRE magic: the kind of thing that requires a dedicated reliability team and a six-figure observability contract. They require neither. For an engineering manager, an error budget is a negotiating instrument, and you can build one from the error data you already have. This guide shows how.
The Noise Problem: Why Raw Error Counts Are Useless
"14,000 errors" sounds catastrophic and tells you nothing. It doesn't say whether the business is dying or whether a deprecated analytics script is throwing a harmless warning on every page load. Raw counts have no denominator and no severity weighting, which makes them worse than useless—they're actively misleading.
The volume-vs-impact trap
One hundred errors on the login page—where every failure is a user who can't get in—matter far more than one thousand errors on a footer link nobody clicks. Counting events treats both the same. A high-volume, low-severity notice that breaks nothing can completely drown out a low-volume, critical failure that's blocking payments. The metric you need weights by impact, not occurrence.
Why inbox-zero for errors is a path to burnout
Some managers chase a clean error inbox, treating every issue as a ticket to close. That's a treadmill to nowhere. A healthy production system always emits some errors. The goal isn't zero; it's "below the threshold where users notice." An error budget defines that threshold explicitly, so your team stops triaging noise and starts protecting the experiences that matter.
Defining Your Currency: SLIs and SLOs for the Rest of Us
Three terms, demystified in the context of application errors.
The SLI: your good vs. bad events
A Service Level Indicator is a measurement of one thing users care about, expressed as a ratio of good events to total events. For error tracking, the simplest SLI is: requests that completed without an unhandled exception, divided by all requests. That's it. You're classifying every event as "good" or "bad" and taking the ratio.
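As a minimal sketch of that good-versus-bad classification, assuming each request record carries a hypothetical `unhandled_exception` flag (the class and field names are illustrations, not something your tracker necessarily exposes):

```python
from dataclasses import dataclass


@dataclass
class RequestEvent:
    """One request as seen by the tracker. The field name is hypothetical."""
    unhandled_exception: bool


def sli(events: list[RequestEvent]) -> float:
    """Return the share of good events: requests without an unhandled exception."""
    if not events:
        raise ValueError("no events in the window, so the SLI is undefined")
    good = sum(1 for e in events if not e.unhandled_exception)
    return good / len(events)


# Illustrative data: 998 good requests and 2 bad ones.
events = [RequestEvent(False)] * 998 + [RequestEvent(True)] * 2
print(f"SLI: {sli(events):.1%}")
```

The empty-window check is a deliberate choice: with no traffic there is no ratio to report, and silently returning 100% would hide an outage that stopped all requests.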
The SLO: your target
A Service Level Objective is the target you hold that SLI to—say, 99.9% of requests succeed. It's a number you choose, a promise about the experience you intend to deliver.
The error budget: one minus the SLO
The error budget is simply 100% - SLO. If your SLO is 99.9%, your budget is 0.1%: the amount of failure you're allowed to ship before you stop and fix things. It reframes reliability from "avoid all errors" (impossible) to "spend a fixed allowance wisely." Here's what the nines translate to over a 30-day month:
- 99% — ~7.2 hours of downtime, or 1 bad request in 100.
- 99.9% — ~43 minutes, or 1 in 1,000.
- 99.99% — ~4.3 minutes, or 1 in 10,000.
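The figures in this list follow directly from the budget formula. A small sketch, assuming a 30-day month (the function name `monthly_budget` is made up for illustration):

```python
def monthly_budget(slo_pct: float, days: int = 30) -> tuple[float, float]:
    """Return (allowed downtime in minutes, number of requests per one allowed failure)."""
    budget_fraction = (100.0 - slo_pct) / 100.0
    if budget_fraction <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    downtime_minutes = budget_fraction * days * 24 * 60
    return downtime_minutes, 1.0 / budget_fraction


for slo in (99.0, 99.9, 99.99):
    minutes, one_in = monthly_budget(slo)
    print(f"{slo}% -> {minutes:.1f} min/month, 1 bad request in {one_in:,.0f}")
```

For 99% this gives 432 minutes, which is the 7.2 hours in the list above.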
The Math: Calculating Your Budget from Error Data
You don't need a metrics platform. You need two numbers your error tracker already has.
Step 1: count total transactions
Count total requests or page loads over your window. If your error SDK captures transaction volume, use that. If not, your CDN or edge analytics will—Cloudflare gives you request counts for free.
Step 2: count unsuccessful transactions
Count the bad events: unhandled exceptions and HTTP 5xx responses. Be deliberate about exclusions: 404s from bots and errors thrown by third-party browser extensions are not your failures and shouldn't spend your budget.
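One way to make those exclusions explicit is a single predicate that every event passes through. This is a sketch under stated assumptions: the `Event` fields are hypothetical, and how you detect bots or extension-originated errors depends on your tracker.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event shape; real trackers name these fields differently."""
    status: int            # HTTP status code of the response
    unhandled: bool        # True if an unhandled exception was raised
    is_bot: bool           # True if the request was classified as bot traffic
    from_extension: bool   # True if the error originated in a browser extension


def spends_budget(e: Event) -> bool:
    """Decide whether an event counts as a bad event against the budget."""
    if e.from_extension:
        return False  # third-party extension errors are not our failures
    if e.status == 404 and e.is_bot:
        return False  # bots probing missing URLs are not user-facing failures
    # Treats the whole 5xx range as server failures (an assumption).
    return e.unhandled or 500 <= e.status <= 599


events = [
    Event(status=200, unhandled=False, is_bot=False, from_extension=False),
    Event(status=500, unhandled=True, is_bot=False, from_extension=False),
    Event(status=404, unhandled=False, is_bot=True, from_extension=False),
    Event(status=200, unhandled=True, is_bot=False, from_extension=True),
]
bad = sum(spends_budget(e) for e in events)
print(f"{bad} of {len(events)} events spend budget")
```

Keeping the rules in one function means the exclusion list is reviewable: product and engineering can both read what does and does not count.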
Step 3: the calculation
```sql
-- Success rate over the last 30 days (e.g. Cloudflare D1)
-- Assumes `events` holds one row per transaction, successful or not;
-- if it stores only errors, take the denominator from your CDN request counts.
SELECT
  (1.0 - (
    -- Bad events: unhandled errors only
    CAST(SUM(CASE WHEN level = 'error' AND handled = 0 THEN 1 ELSE 0 END) AS REAL)
    / COUNT(*)
  )) * 100 AS success_rate_pct
FROM events
WHERE ts >= unixepoch('now', '-30 days');
```

If that returns 99.94% and your SLO is 99.9%, you're under budget: ship. If it returns 99.82%, you've blown it: stop and fix. One query, one decision. We also built a free error budget calculator if you'd rather not write SQL.
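To see how much of the budget a given success rate has used, divide the observed error rate by the allowed one. A minimal sketch, assuming the measurement window matches the budget period (the function name is illustrative):

```python
def budget_consumed(success_rate_pct: float, slo_pct: float) -> float:
    """Return the fraction of the error budget spent (1.0 means fully spent)."""
    budget = 100.0 - slo_pct
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (100.0 - success_rate_pct) / budget


# The two cases from the text, against a 99.9% SLO.
print(f"{budget_consumed(99.94, 99.9):.0%} of budget spent")
print(f"{budget_consumed(99.82, 99.9):.0%} of budget spent")
```

A 99.94% success rate has spent roughly 60% of the budget, while 99.82% has spent roughly 180%, which is why the second case is a stop.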
The Burn Rate: Knowing When to Stop Shipping
A static budget number tells you where you stand. Burn rate tells you how fast you're spending it, and that is what turns a budget into an early-warning system.
Predicting budget exhaustion
Burn rate is the speed at which you're consuming the budget relative to a sustainable pace. A burn rate of 1x means you'll spend exactly your monthly allowance over the month—fine. A burn rate of 10x means you'll exhaust a 30-day budget in three days. Watching the rate lets you predict the exhaustion date before you hit it.
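The arithmetic behind those figures can be sketched as follows. The function names are illustrative, and the projection assumes the current burn rate holds steady, which real traffic rarely does.

```python
def burn_rate(error_rate: float, slo_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = (100.0 - slo_pct) / 100.0
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_rate / allowed


def days_to_exhaustion(rate: float, window_days: int = 30, spent: float = 0.0) -> float:
    """Days until the remaining budget is gone if the current burn rate holds.

    `spent` is the fraction of the budget already used (0.0 to 1.0).
    """
    if rate <= 0:
        return float("inf")  # not burning, so the budget never runs out
    remaining = max(0.0, 1.0 - spent)
    return remaining * window_days / rate


# A 1% error rate against a 99.9% SLO burns the budget about 10 times too fast.
rate = burn_rate(0.01, 99.9)
print(f"burn rate: {rate:.1f}x, exhausted in {days_to_exhaustion(rate):.1f} days")
```

Passing `spent` lets you project from mid-month: with half the budget already gone, a 10x burn leaves about a day and a half.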
Setting budget policy
Decide the thresholds in advance, when nobody's panicking. A reasonable policy: at 50% spent, the team reviews what's consuming it. At 100% spent, you enter a feature freeze until the budget recovers. Writing this down before an incident is what makes it stick during one.
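Writing the policy as code is one way to keep it unambiguous. This sketch uses the 50% and 100% thresholds from the policy above; the action labels, including "ship normally" for the case below 50%, are placeholders to replace with your own wording.

```python
def policy_action(budget_spent: float) -> str:
    """Map the fraction of budget spent (1.0 = 100%) to the agreed action."""
    if budget_spent >= 1.0:
        return "feature freeze"
    if budget_spent >= 0.5:
        return "review budget consumers"
    return "ship normally"


for spent in (0.2, 0.6, 1.8):
    print(f"{spent:.0%} spent -> {policy_action(spent)}")
```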
Catching regressions, ignoring blips
Burn-rate alerts are smarter than count alerts because they ignore a momentary spike that self-corrects while firing hard on a deploy that starts consuming 10% of the monthly budget per hour. That's your signal to roll back immediately—a clean-looking deploy quietly draining the budget is exactly the systemic regression burn rate is built to catch. (For the deploy-spike scenario specifically, see when a bad deploy spikes your bill.)
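A common way to get that behavior is multi-window burn-rate alerting, the approach described in Google's SRE Workbook: page only when both a long and a short window exceed the threshold. The sketch below is one possible shape, not a prescribed implementation; the threshold of 10 and the window choices are illustrative assumptions.

```python
HOURS_PER_MONTH = 30 * 24  # 720, assuming a 30-day budget window


def burn_rate_from_budget_per_hour(fraction_per_hour: float) -> float:
    """Convert 'fraction of the monthly budget spent per hour' into a burn rate."""
    return fraction_per_hour * HOURS_PER_MONTH


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 10.0) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window (e.g. the last hour) filters out brief spikes; the short
    window (e.g. the last five minutes) confirms the burn is still happening.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


# A deploy consuming 10% of the monthly budget per hour is a 72x burn rate.
regression = burn_rate_from_budget_per_hour(0.10)
print(f"regression burn rate: {regression:.0f}x")
print("page on regression:", should_page(regression, regression))
# A momentary spike: high in the short window, low over the last hour.
print("page on blip:", should_page(long_window_rate=2.0, short_window_rate=60.0))
```

Requiring both windows also means the alert clears shortly after a rollback, because the short window drops back below the threshold first.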
Communicating Up: Translating "500s" into Revenue Risk
The error budget's real power is in the room with your VP of Product or CEO.
Stop talking about stack traces
Nobody in leadership wants to hear about a TypeError. They want to hear about user success rate. "97% of checkout attempts succeeded this week" lands; "4,000 unhandled exceptions" does not. Same data, completely different conversation.
The budget as a technical-debt shield
When you need to push back on a risky launch, the budget gives you defensible language: "We're at 90% budget exhaustion this month. Shipping this feature now risks dropping us below the 99.5% reliability we committed to." That's not an engineer being conservative—it's a shared, pre-agreed number doing the arguing for you.
The Goldilocks SLO
Resist the urge to target 100%. A perfect-reliability goal means you can never take a risk, never ship fast, never experiment—and your competitors will eat you alive. The budget formalizes the idea that some failure is acceptable in exchange for velocity. Pick the lowest number your users will tolerate, and spend the rest on shipping.
Cultural Implementation: Making the Budget Stick
The math is the easy part. The human part is what determines whether it survives contact with a deadline.
Shared ownership
Both engineering and product must sign off on the SLO. If product doesn't own the number, they'll treat a feature freeze as engineering being obstructionist. When they've agreed to 99.9% in advance, the freeze is just the policy they helped write.
Tie spend to incidents
Every meaningful budget burn should link to a post-mortem. That connection turns an abstract percentage into concrete learning: "the Tuesday deploy cost us 30% of the month's budget—here's what we changed." And consider rewarding budget preservation in performance reviews, so stability work gets the same recognition as feature delivery.
The Low-SRE Approach: Reliability on a Budget
You can run this entire framework without a reliability team, and pricing is the reason it's feasible.
Flat-rate pricing
Accurate SLIs need a complete denominator—you have to count all the events, good and bad. Under per-event pricing, capturing high-volume data to compute an honest success rate is exactly what spikes your bill, so teams sample and end up with a budget calculated on guesses. GlitchReplay's flat-rate model means you can capture everything and trust the number.
Replay tells you why
The budget tells you when to worry; session replay tells you how to fix it. When burn rate spikes, you open the replays behind the failing transactions and watch the bug happen instead of guessing from a stack trace.
Built on Cloudflare
Because GlitchReplay runs on Cloudflare, you can pull total-request counts from the same edge that serves your traffic—a more accurate denominator than client-side sampling for the math in this post. If you want to formalize the policy side, our companion piece on setting up SLOs from your error tracker walks through the dashboard and freeze process.
An error budget converts the chaos of a 14,000-issue dashboard into one number you can take into any meeting and defend. You don't need a new tool, a new hire, or a new contract—just two counts, one ratio, and the discipline to act on it. Start measuring the success rate; the conversations get a lot easier from there.
GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.