Setting up SLOs from your error tracker (without an SRE team)
A pragmatic SLO definition that uses the data you already have, and the dashboard that makes it visible to the rest of the company.
Monday morning fatigue: you walk in to 437 unread Slack notifications from your error tracker. Ninety percent are noise, nine percent are expected failures, and the one percent that actually matters is buried somewhere in the middle. You want to care about reliability, but the signal-to-noise ratio has made it impossible. This is the moment most teams realize they don't need more alerts. They need a Service Level Objective that tells them, with a single number, when to stop shipping features and start fixing the foundation.
SLOs have a reputation for requiring Prometheus, a dedicated SRE team, and a quarter of setup work. They don't. Your error tracker already captures the one thing an SLO is built on—the unhappy path. This post shows how to turn the SDK data you already have into a pragmatic SLO framework that a non-SRE team can actually maintain, and how to make reliability visible to the rest of the company.
Why Error Trackers Are Secretly SLO Engines
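An SLO is, at its core, a ratio of good events to total events held against a target. Your error tracker is already counting bad events all day long. What's left is the denominator, the count of total events, which you need alongside the error count. You're most of the way there and didn't know it.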
The good-event vs. bad-event mindset
Stop thinking of errors as a to-do list and start thinking of them as one half of a ratio. Every request is either good (completed without an unhandled exception) or bad (didn't). The SLO is just "what fraction were good, and is that fraction high enough?" That reframe is the whole game.
Why your error tracker is more user-centric than your infra logs
Infrastructure metrics lie by omission. A request can return a clean 200 OK at the load balancer while a JavaScript error completely breaks the page the user is staring at. Your Nginx log says success; your user says the site is down. Error trackers capture the failure that infra logs miss because they run where the user actually is—in the browser, in the render. That makes them a more honest SLI source than a raw 500 count.
Defining Your First Pragmatic SLO
The hardest part is picking a number that isn't a trap.
The 99.9% trap
Three nines sounds like the responsible default. For many products it's too many. Each additional nine cuts your allowed failure by a factor of ten, and the engineering effort needed to stay inside it climbs steeply. If your users genuinely don't notice the difference between 99.5% and 99.9%, those extra nines are velocity you burned for nothing. Pick the lowest target your users will actually tolerate, not the most impressive-looking one.
Categorizing accountable errors
Before you compute anything, decide which errors count against you. A 404 from a bot crawling dead links, a ResizeObserver warning, an error thrown by a user's browser extension: none of these are your reliability failures, and including them poisons the SLI. Define "accountable" errors narrowly: unhandled exceptions and server failures in code you own. With that scope settled, here's what common targets translate to over a 30-day month:
- 99% — ~7.2 hours of failure per month, 1 bad event in 100.
- 99.5% — ~3.6 hours, 1 in 200.
- 99.9% — ~43 minutes, 1 in 1,000.
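The figures in the list above come from simple arithmetic, sketched here so you can plug in your own target. The function name `allowedFailure` is hypothetical, and the 30-day default is an assumption that matches the monthly numbers quoted above.

```javascript
// Convert an SLO target (in percent) into the failure it allows per window.
// Assumes a 30-day window (43,200 minutes) unless told otherwise.
function allowedFailure(targetPercent, windowDays = 30) {
  const budgetFraction = (100 - targetPercent) / 100;
  const windowMinutes = windowDays * 24 * 60;
  return {
    minutes: budgetFraction * windowMinutes, // time-based view of the budget
    oneIn: Math.round(1 / budgetFraction),   // event-based view: 1 bad event in N
  };
}

for (const target of [99, 99.5, 99.9]) {
  const { minutes, oneIn } = allowedFailure(target);
  console.log(
    `${target}%: about ${(minutes / 60).toFixed(1)} hours per 30 days, 1 bad event in ${oneIn}`
  );
}
```

The time-based and event-based views only line up if traffic is roughly even across the window; for spiky traffic the event-based number is usually the one to trust.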
Turning Error Counts into an Error Budget
Once you have an SLO, the error budget is what makes it operational. Think of it as a bank account for unreliability: the SLO sets the balance, and every bad event is a withdrawal.
The math
The SLI is (Total Events - Bad Events) / Total Events. The budget is 100% - SLO—the maximum bad-event fraction you're allowed before you've overdrawn. If your SLO is 99.5% and 10,000 requests came in, you can absorb 50 accountable failures before the account is empty.
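Here is a minimal sketch of that math, assuming you can pull a total-event count and an accountable-failure count for the window. The function and field names are hypothetical, not part of any SDK.

```javascript
// Compute the SLI and the state of the error budget for one window.
function errorBudget(totalEvents, badEvents, sloPercent) {
  if (totalEvents <= 0) {
    throw new RangeError("totalEvents must be positive");
  }
  const sliPercent = ((totalEvents - badEvents) / totalEvents) * 100;
  // Bad events the SLO tolerates. The tiny epsilon guards against
  // floating-point results such as 9.999999999 where 10 is intended.
  const allowedBad = Math.floor((totalEvents * (100 - sloPercent)) / 100 + 1e-9);
  const remaining = allowedBad - badEvents;
  return {
    sliPercent,
    allowedBad,
    remaining, // negative means the budget is overdrawn
    remainingFraction: allowedBad > 0 ? remaining / allowedBad : 0,
  };
}

// The example from the text: a 99.5% SLO over 10,000 requests, with 20 failures so far.
console.log(errorBudget(10000, 20, 99.5));
```

Returning the remaining budget as both a count and a fraction is a design choice: the count is useful when triaging specific failures, and the fraction is what a policy or dashboard would typically key off.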
The look-back window
Compute the budget over a rolling window, not a calendar reset. A 30-day rolling window is the common default—long enough to smooth out a single bad afternoon, short enough to reflect current reality. A 7-day window reacts faster but is noisier. Define it once in config:
# reliability.yml: a pragmatic SLO definition
slo:
  name: checkout-availability
  target: 99.5        # percent of good events
  window: 30d         # rolling look-back
  good_event: "transaction completed, no unhandled exception"
  bad_event:
    - unhandled_exception
    - http_5xx
  exclude:            # not accountable to us
    - http_404
    - third_party_script
    - browser_extension

Identifying Budget Drainers with Session Replay
A budget line on a graph tells you the account is draining. It doesn't tell you why. That's where replay turns a number into a fix.
Connecting the dip to a cohort
When the SLI drops, the first question is "who is this hitting?" A 0.1% budget hit sounds trivial—until you open the replays behind it and discover it's 100% of Safari users failing at checkout while everyone else is fine. The aggregate number hid a total outage for one cohort. Replay surfaces that immediately.
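Before opening replays, a quick per-cohort breakdown can show where to look. This is a sketch under assumptions: the event shape (`browser`, `bad`) and the function name are hypothetical, and the sample data is invented to mirror the Safari scenario above.

```javascript
// Group events by a cohort field and compute each cohort's failure rate,
// sorted so the worst-hit cohort comes first.
function failureRateByCohort(events, cohortKey) {
  const stats = new Map();
  for (const event of events) {
    const cohort = event[cohortKey] ?? "unknown";
    const entry = stats.get(cohort) ?? { total: 0, bad: 0 };
    entry.total += 1;
    if (event.bad) entry.bad += 1;
    stats.set(cohort, entry);
  }
  return [...stats.entries()]
    .map(([cohort, { total, bad }]) => ({ cohort, total, bad, failureRate: bad / total }))
    .sort((a, b) => b.failureRate - a.failureRate);
}

// Invented sample: 10 failures in 10,000 events is a 0.1% aggregate hit,
// but every one of them is a Safari user.
const sampleEvents = [
  ...Array.from({ length: 9990 }, () => ({ browser: "Chrome", bad: false })),
  ...Array.from({ length: 10 }, () => ({ browser: "Safari", bad: true })),
];
console.log(failureRateByCohort(sampleEvents, "browser"));
```

The same function works for any tag you already attach to events, such as release, region, or plan tier.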
Critical failure vs. nuisance
Two errors can have identical counts and wildly different severity. Watching the replay tells you whether a "payment failed" error is a retried transient blip the user shrugged off, or a hard wall they hit three times before rage-quitting. The budget says when to look; replay says how bad it really is. Our walk-through on reading your error budget goes deeper on the management side of this.
Building the "Rest of the Company" Dashboard
SLOs die when only engineers can see them. To change behavior, the number has to be legible to product and leadership.
The traffic-light system
Collapse the math into three states everyone understands. Green: budget healthy, ship freely. Yellow: budget draining faster than planned, investigate before shipping anything risky. Red: budget blown, feature freeze until it recovers. A PM doesn't need to know what an SLI is to respect a red light.
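One way to collapse the math into those three states is sketched below. It assumes two inputs you compute elsewhere: the fraction of budget remaining, and a burn rate where 1 means spending at exactly the pace that would use up the budget by the end of the window. Treating "faster than planned" as a burn rate above 1 is an assumption, not a rule from this post.

```javascript
// Map budget state to a traffic-light colour for a dashboard.
// budgetRemainingFraction: 1 = untouched, 0 = fully spent, negative = overdrawn.
// burnRate: 1 = on pace to spend the whole budget by the end of the window.
function trafficLight(budgetRemainingFraction, burnRate) {
  if (budgetRemainingFraction <= 0) return "red";   // budget blown: feature freeze
  if (burnRate > 1) return "yellow";                // draining faster than planned
  return "green";                                   // healthy: ship freely
}

console.log(trafficLight(0.8, 0.4));
console.log(trafficLight(0.6, 2.5));
console.log(trafficLight(-0.1, 0.2));
```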
From total errors to percentage of success
Reframe the headline metric. "4,000 errors occurred" means nothing to a VP and sounds alarming to everyone. "98.4% of checkout attempts succeeded this week" is the same data, framed as a business outcome, and it's a sentence you can put on a leadership dashboard without translation.
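The reframing is one line of arithmetic. In this sketch the function name is hypothetical, and the 250,000 total is an assumed denominator chosen because it is the figure that makes 4,000 errors equal 98.4% success; the post itself does not state the total.

```javascript
// Turn raw counts into a leadership-friendly headline.
function successHeadline(totalAttempts, failedAttempts, label) {
  const successPercent = ((totalAttempts - failedAttempts) / totalAttempts) * 100;
  return `${successPercent.toFixed(1)}% of ${label} succeeded`;
}

// 4,000 failures out of an assumed 250,000 checkout attempts.
console.log(successHeadline(250000, 4000, "checkout attempts"));
```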
Operationalizing the Budget: The Policy
An SLO with no consequence is a vanity metric. The policy is what gives it teeth.
The feature-freeze conversation
The budget converts "engineering wants to slow down" into "we agreed that below 20% budget we freeze, and we're below 20%." It removes the personality from the argument. Scope creep meets a number everyone signed off on, not an engineer's gut feeling. Keep the policy to a few lines:
- Budget above 50%: ship normally.
- Budget 20–50%: new risky work needs reliability sign-off.
- Budget below 20%: feature freeze; the next sprint prioritizes stability.
- Every budget breach gets a post-mortem linked to the spend.
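The tiers above are simple enough to encode directly, which makes them easy to show next to the budget number. This is a sketch; the function name and return strings are hypothetical, and the boundaries follow the list above (exactly 50% and exactly 20% both fall in the sign-off tier).

```javascript
// Map the percentage of error budget remaining to the agreed policy tier.
function policyFor(budgetRemainingPercent) {
  if (budgetRemainingPercent > 50) return "ship normally";
  if (budgetRemainingPercent >= 20) return "risky work needs reliability sign-off";
  return "feature freeze";
}

for (const remaining of [72, 35, 8]) {
  console.log(`${remaining}% remaining: ${policyFor(remaining)}`);
}
```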
Automated vs. manual intervention
Start manual—a human reads the dashboard and calls the freeze. As trust builds, automate the alerting on burn rate so you're warned before the budget is gone, not after.
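Burn rate here means how fast you are spending the budget relative to the pace that would exhaust it exactly at the end of the window. A minimal sketch, with hypothetical function names and a 30-day window assumed:

```javascript
// Burn rate: observed bad-event fraction divided by the fraction the SLO allows.
// 1 means on pace to spend the whole budget over the window; 2 means twice as fast.
function burnRate(badEvents, totalEvents, sloPercent) {
  if (totalEvents === 0) return 0; // no traffic, nothing burned
  const observedBadFraction = badEvents / totalEvents;
  const budgetFraction = (100 - sloPercent) / 100;
  return observedBadFraction / budgetFraction;
}

// Rough days until the remaining budget is gone if the current rate holds.
function daysUntilExhausted(budgetRemainingFraction, rate, windowDays = 30) {
  if (rate <= 0) return Infinity;
  return (budgetRemainingFraction * windowDays) / rate;
}

// Example: 10 failures in the last 1,000 requests against a 99.5% SLO,
// with half the budget still left.
const rate = burnRate(10, 1000, 99.5);
console.log(rate, daysUntilExhausted(0.5, rate));
```

Alerting on the projected exhaustion date rather than the raw rate is one way to get the "warned before, not after" behaviour; the threshold you page on is a judgment call for your team.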
Implementation: Setting This Up in Your Error Tracker
You can build all of this on existing SDK features—no new platform required.
Tags and environments to segment SLOs
Different surfaces deserve different targets. Checkout might warrant 99.9% while the marketing blog is fine at 99%. Tag events so you can compute per-surface SLIs instead of one blurry company-wide number.
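Once events carry a surface tag, per-surface SLIs are a group-by. The sketch below assumes each event exposes `tags.slo_surface` and a boolean `bad`; that shape, the function name, and the `SURFACE_TARGETS` map are all hypothetical, with targets taken from the checkout and blog examples above.

```javascript
// Hypothetical per-surface targets, in percent.
const SURFACE_TARGETS = { checkout: 99.9, blog: 99 };

// Compute an SLI per surface and compare it with that surface's target.
function sliBySurface(events, targets = SURFACE_TARGETS) {
  const counts = new Map();
  for (const event of events) {
    const surface = event.tags?.slo_surface ?? "untagged";
    const entry = counts.get(surface) ?? { total: 0, bad: 0 };
    entry.total += 1;
    if (event.bad) entry.bad += 1;
    counts.set(surface, entry);
  }
  const report = {};
  for (const [surface, { total, bad }] of counts) {
    const target = targets[surface];
    const good = total - bad;
    report[surface] = {
      total,
      bad,
      sliPercent: (good / total) * 100,
      target: target ?? null,
      // Compare on counts rather than the rounded percentage.
      meetsTarget: target === undefined ? null : good * 100 >= target * total,
    };
  }
  return report;
}

// Invented sample traffic: 2 bad of 1,000 on checkout, 1 bad of 200 on the blog.
const surfaceEvents = [
  ...Array.from({ length: 1000 }, (_, i) => ({ tags: { slo_surface: "checkout" }, bad: i < 2 })),
  ...Array.from({ length: 200 }, (_, i) => ({ tags: { slo_surface: "blog" }, bad: i < 1 })),
];
console.log(sliBySurface(surfaceEvents));
```

In this sample the blog is within its looser target while checkout misses its stricter one, which a single company-wide number would not show.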
Alert on rate, not instance
The mindset shift that ends alert fatigue: stop alerting on every error instance and start alerting on the rate crossing a threshold. One PaymentError is noise; PaymentError climbing to 5% of checkout traffic in ten minutes is a budget emergency. Mark the events that matter so your alerting can focus:
// Assumes the browser SDK; swap in your platform's Sentry-compatible package.
import * as Sentry from "@sentry/browser";

// Flag SLO-relevant events so rate alerts can target them.
Sentry.setTag("slo_surface", "checkout");
Sentry.setTag("is_critical", "true");

// Alert config (pseudocode): fire when the bad-event RATE,
// not a single instance, breaches the budget burn threshold.
// WHEN rate(is_critical = true AND handled = false) > 5% OVER 10m

GlitchReplay supports this end to end, with Sentry-compatible tags and environments for segmenting SLIs, rate-based alerting, and session replay attached to the events draining your budget, all on flat-rate pricing so capturing the complete denominator never costs you extra. You can sketch the math first with our free error budget calculator.
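To make the rate-versus-instance rule concrete, here is a sketch of how such a check could be evaluated over a list of events. It is not any tracker's alerting API: the event shape (`timestampMs`, `isCritical`, `handled`), the function name, and the `minEvents` floor are assumptions. The floor is there so that one failure out of one request does not count as a 100% rate.

```javascript
// Fire only when the share of critical, unhandled events in the recent
// window exceeds the threshold, never on a single instance.
function shouldAlert(events, nowMs, options = {}) {
  const {
    windowMs = 10 * 60 * 1000, // 10-minute look-back
    threshold = 0.05,          // 5% of traffic
    minEvents = 20,            // assumed floor to avoid alerting on tiny samples
  } = options;
  const recent = events.filter(
    (e) => e.timestampMs > nowMs - windowMs && e.timestampMs <= nowMs
  );
  if (recent.length < minEvents) return false;
  const bad = recent.filter((e) => e.isCritical && !e.handled).length;
  return bad / recent.length > threshold;
}

// Invented sample: 100 events in the last five minutes, 6 critical and unhandled.
const now = Date.now();
const recentEvents = Array.from({ length: 100 }, (_, i) => ({
  timestampMs: now - i * 3000,
  isCritical: i < 6,
  handled: false,
}));
console.log(shouldAlert(recentEvents, now));
```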
You don't need an SRE team to run SLOs. You need a ratio, a target your users actually care about, a policy with teeth, and a dashboard the whole company can read. The data is already in your error tracker. Turn it into a percentage of success, put a traffic light on it, and stop firefighting one alert at a time.
GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.