How to alert on Web Vitals without alert fatigue
Threshold tuning, baseline drift, and the routing rules that send the right alert to the right team — without paging at 3 AM for a country-specific blip.
It's 3:00 AM and your pager is screaming. The culprit? LCP > 2.5s on your checkout page. You drag yourself to a laptop, pull up the dashboard, and spend the next hour digging through traces—only to discover that the "regression" was a handful of users on congested mobile networks in a region your company doesn't even ship to. You acknowledge the alert, mute it, and go back to bed knowing that tomorrow night it will fire again, for the same non-reason.
This is the Web Vitals noise machine, and it is how teams take their most valuable UX signal and turn it into their single biggest source of on-call burnout. The problem is almost never the metric. It is that we alert on Web Vitals the way we alert on errors—as if a slow LCP were a binary, page-worthy failure—when in reality Web Vitals are a noisy, high-variance distribution that demands a completely different alerting philosophy.
Why Web Vitals Are Alert-Fatigue Magnets
A 500 error is a clean signal. The request either threw or it didn't. When it throws, something is genuinely wrong, and there is usually a specific line of code to go fix. You can reasonably page a human for it.
A Web Vital is the opposite of clean. Largest Contentful Paint for a single page view depends on the user's device, their network, their distance from your nearest edge, whether a third-party script decided to block the main thread this time, and pure chance. Plot LCP across your real traffic and you don't get a line; you get a wide, lumpy distribution with a long, ugly tail.
The high variance of real-user data
One slow third-party tag (a chat widget, an A/B testing script, an analytics beacon that occasionally stalls) can drag a few percent of your sessions into "poor" territory while the median stays perfectly healthy. If your alert watches the average, it is now hostage to the worst-behaved fraction of your traffic: a handful of very slow page views can move the mean by seconds while the median and P75 barely change. For a high-variance metric, the mean is one of the most easily poisoned statistics you can alert on.
Why Google's "Good" thresholds aren't alert definitions
Google's Core Web Vitals guidance gives you tidy buckets: LCP good at or under 2.5s, CLS good at or under 0.1, INP good at or under 200ms, all assessed at the 75th percentile (P75). Those are reporting and SEO targets, evaluated over a rolling 28-day field window. They are emphatically not real-time alert thresholds. Wiring LCP > 2.5s straight into PagerDuty means you are paging on the instantaneous edge of a noisy distribution against a number that was designed to be read over weeks. Of course it never stops firing.
Threshold Tuning: Beyond the "Good" Label
The first move toward a quiet on-call is to stop alerting on absolute Google thresholds and start alerting on percentiles that mean something for your application.
From averages to P75 and P99
Alert on P75, not the mean. P75 is the experience of your typical-to-slightly-unlucky user, and crucially it is resistant to the handful of pathological outliers that wreck an average. Track P99 too, but treat it as a dashboard metric for investigation, not a paging trigger—the P99 of any RUM metric is so volatile that paging on it is self-inflicted suffering. The rule of thumb: page on the percentile that represents a real cohort, watch the extreme tail, and never page on the mean.
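To make the difference concrete, here is a minimal sketch that computes the mean, P75, and P99 of a small set of LCP samples. The sample values are invented for illustration, and the nearest-rank percentile used here is one common definition; your RUM backend may interpolate differently.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical LCP samples in milliseconds: 18 healthy page views...
const healthy = [
  1200, 1300, 1400, 1500, 1500, 1600, 1600, 1700, 1700,
  1800, 1800, 1900, 1900, 2000, 2000, 2100, 2200, 2300,
];
// ...plus 2 page views stalled by a slow third-party tag.
const stalled = [15000, 20000];
const lcpSamples = [...healthy, ...stalled];

console.log("mean without stalls:", mean(healthy));
console.log("mean with stalls:   ", mean(lcpSamples));
console.log("P75 without stalls: ", percentile(healthy, 75));
console.log("P75 with stalls:    ", percentile(lcpSamples, 75));
console.log("P99 with stalls:    ", percentile(lcpSamples, 99));
```

With these made-up numbers, the two stalled page views move the mean from 1750ms to 3325ms while P75 stays at 2000ms, and P99 simply reports the single worst sample. That is the behavior that makes P75 a reasonable paging input and P99 a dashboard metric.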
Critical vs. warning, by business impact
Not every vital deserves the same severity, and not every page deserves the same vital. A layout shift (CLS) on a checkout button that makes users misclick is a genuine money problem. A 300ms LCP drift on your documentation index is a Tuesday. Build an explicit table:

Page group   Metric   P75 condition           Action
----------   ------   ---------------------   --------------------
checkout     INP      > 350ms (3 windows)     PagerDuty (critical)
checkout     CLS      > 0.15 sustained        PagerDuty (critical)
checkout     LCP      +20% vs 7d baseline     Slack (warning)
marketing    LCP      +30% vs 7d baseline     Slack (warning)
docs         any      n/a                     dashboard only

The number that should page you is rarely an absolute value; it is a change against your own baseline on a page that actually makes money.
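The severity table above could be encoded as data and evaluated with a small function. This is a sketch only: the field names (`pageGroup`, `kind`, `limit`, `limitPct`) and the `decideAction` helper are assumptions made for illustration, not any tool's real schema, and the "sustained across windows" condition is deliberately left out here.

```javascript
// Illustrative encoding of the severity table; field names are hypothetical.
const severityRules = [
  { pageGroup: "checkout", metric: "inp", kind: "absolute", limit: 350, action: "pagerduty" },
  { pageGroup: "checkout", metric: "cls", kind: "absolute", limit: 0.15, action: "pagerduty" },
  { pageGroup: "checkout", metric: "lcp", kind: "relative", limitPct: 20, action: "slack" },
  { pageGroup: "marketing", metric: "lcp", kind: "relative", limitPct: 30, action: "slack" },
];

// Returns the action for a P75 reading, or "dashboard" when no rule
// exists for the page group and metric, or the rule is not breached.
function decideAction(pageGroup, metric, p75, baselineP75) {
  const rule = severityRules.find(
    (r) => r.pageGroup === pageGroup && r.metric === metric
  );
  if (!rule) return "dashboard";

  const breached =
    rule.kind === "absolute"
      ? p75 > rule.limit
      : baselineP75 > 0 &&
        ((p75 - baselineP75) / baselineP75) * 100 > rule.limitPct;

  return breached ? rule.action : "dashboard";
}

console.log(decideAction("checkout", "inp", 400, 300));
console.log(decideAction("marketing", "lcp", 2500, 2000));
console.log(decideAction("docs", "lcp", 9000, 2000));
```

Keeping the rules as data rather than scattered conditionals makes the table itself reviewable: changing a severity is a one-line diff that the owning team can approve.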
Baselining and Trend Detection
"Slow" is relative. A 3-second LCP is a disaster for a landing page and completely normal for a data-dense internal dashboard. Static thresholds cannot encode that, which is why they generate so much noise. Baselines can.
Environment- and route-specific baselines
Compute a baseline per route group, per environment. Staging traffic is synthetic and sparse; holding it to the same bar as production is meaningless. A heavy reporting page legitimately has a higher baseline than your homepage. The alert should ask "is this route slower than it normally is?" not "is this route slower than some global constant?"
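A rough sketch of per-route, per-environment baselines follows. The sample shape (`env`, `routeGroup`, `lcpMs`) and both function names are hypothetical; in practice the grouping would more likely happen in your RUM store's query layer than in application code.

```javascript
// Nearest-rank P75 of a non-empty array.
function p75(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.max(Math.ceil(0.75 * sorted.length), 1) - 1];
}

// Group samples by "env:routeGroup" and compute one P75 baseline per group.
function buildBaselines(samples) {
  const groups = new Map();
  for (const { env, routeGroup, lcpMs } of samples) {
    const key = `${env}:${routeGroup}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(lcpMs);
  }
  const baselines = new Map();
  for (const [key, values] of groups) baselines.set(key, p75(values));
  return baselines;
}

// "Is this route slower than it normally is?" rather than "slower than a constant?"
function isSlowerThanUsual(baselines, env, routeGroup, currentP75, tolerancePct) {
  const baseline = baselines.get(`${env}:${routeGroup}`);
  if (!baseline) return false; // no history for this group yet: stay quiet
  return ((currentP75 - baseline) / baseline) * 100 > tolerancePct;
}

// Invented history: a light homepage and a heavy reporting page.
const history = [
  { env: "prod", routeGroup: "home", lcpMs: 1000 },
  { env: "prod", routeGroup: "home", lcpMs: 1200 },
  { env: "prod", routeGroup: "home", lcpMs: 1400 },
  { env: "prod", routeGroup: "home", lcpMs: 1600 },
  { env: "prod", routeGroup: "reports", lcpMs: 3000 },
  { env: "prod", routeGroup: "reports", lcpMs: 3200 },
  { env: "prod", routeGroup: "reports", lcpMs: 3400 },
  { env: "prod", routeGroup: "reports", lcpMs: 3600 },
];
const baselines = buildBaselines(history);
```

In this invented data set, a current P75 of 3500ms is well within tolerance for the reporting page but a large regression for the homepage, which is the distinction a single global constant cannot express.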
Drift vs. spikes
There are two failure shapes and they want different handling. A spike is a sudden jump—usually a deploy or an incident—and warrants a fast, possibly paging alert. Drift is a slow creep over days as you accrete dependencies and bloat; it deserves a low-urgency ticket, never a 3 AM page. Use a sliding window to separate them:

// Drift detection: compare a short window to a longer trailing baseline.
// p75, lastWindow, trailingWindow, sustainedFor, and alert are placeholders
// for whatever your RUM store and alerting client provide.
const recent = p75(lcp, lastWindow("6h"));
const baseline = p75(lcp, trailingWindow("7d", { exclude: "6h" }));
// Guard against an empty or zero baseline before dividing.
if (baseline > 0) {
  const deltaPct = ((recent - baseline) / baseline) * 100;
  if (deltaPct > 30 && sustainedFor("3 windows")) {
    alert("warning", { metric: "lcp", route, deltaPct, recent, baseline });
  }
}

Requiring the condition to hold across several consecutive windows is one of the most effective noise filters you can add. A momentary blip resolves itself before the third window; a real regression does not.
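The snippet above leaves the sustained check undefined. One way it might be implemented, assuming you keep an ordered list of per-window deltas (oldest first), is sketched below; `isSustained` and its parameters are illustrative names, not part of any library.

```javascript
// True only when the most recent `required` windows ALL exceed the threshold.
// windowDeltasPct: percentage change vs. baseline for each window, oldest first.
function isSustained(windowDeltasPct, thresholdPct, required) {
  if (windowDeltasPct.length < required) return false; // not enough history yet
  return windowDeltasPct.slice(-required).every((delta) => delta > thresholdPct);
}

// A single noisy window surrounded by normal ones: no alert.
console.log(isSustained([5, 42, 8, 4], 30, 3));

// Three consecutive windows over the threshold: alert.
console.log(isSustained([10, 35, 38, 41], 30, 3));
```

The trade-off is detection latency: with 6-hour windows, three consecutive windows means a regression is confirmed only after 18 hours. That suits drift; for spikes on a money path you would likely pair the same check with much shorter windows.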
Segmenting Alerts to Filter Noise
The most powerful way to kill fatigue is to stop alerting on things you cannot or will not fix.
Geographic and device segmentation
If you are a B2B SaaS whose users are on corporate Wi-Fi and laptops, a poor LCP for 3G mobile users in a region you don't serve is not your problem—and it should never reach your pager. Segment your RUM data and scope alerts to the cohorts you actually own. Conversely, segmentation is also how you find real problems: a regression that looks like a mild aggregate blip often turns out to be a 200ms hit concentrated in one country, which usually means a localized infrastructure or edge-routing issue rather than a code bug.
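Both uses of segmentation, scoping alerts to cohorts you own and breaking an aggregate down by country, can be sketched in a few lines. The `ownedScope` policy, the sample shape, and the numbers below are all invented for illustration.

```javascript
// Hypothetical scope: the countries and device classes this team serves and can fix.
const ownedScope = {
  countries: new Set(["US", "DE", "GB"]),
  devices: new Set(["desktop"]),
};

function isOwned(sample, scope) {
  return scope.countries.has(sample.country) && scope.devices.has(sample.device);
}

// Nearest-rank P75 of a non-empty array.
function p75(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.max(Math.ceil(0.75 * sorted.length), 1) - 1];
}

// P75 per country, to spot a regression concentrated in one region.
function p75ByCountry(samples) {
  const byCountry = new Map();
  for (const s of samples) {
    if (!byCountry.has(s.country)) byCountry.set(s.country, []);
    byCountry.get(s.country).push(s.lcpMs);
  }
  return new Map([...byCountry].map(([country, values]) => [country, p75(values)]));
}

const samples = [
  { country: "US", device: "desktop", lcpMs: 1500 },
  { country: "US", device: "desktop", lcpMs: 1700 },
  { country: "DE", device: "desktop", lcpMs: 2900 },
  { country: "DE", device: "desktop", lcpMs: 3100 },
  { country: "BR", device: "mobile", lcpMs: 9000 }, // outside the owned scope
];

const owned = samples.filter((s) => isOwned(s, ownedScope));
const perCountry = p75ByCountry(owned);
console.log([...perCountry]);
```

Filtering first keeps the out-of-scope mobile sample from ever reaching an alert rule, and the per-country breakdown then shows the remaining slowness sitting in one country rather than spread evenly.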
Filtering by page importance
Separate the money path from everything else. Checkout, signup, and the core product flow get tight, sometimes-paging thresholds. Marketing and docs get warning-only Slack notifications or dashboard tracking. Spending your alerting budget on pages that don't convert is how you train your team to ignore the channel entirely.
Routing: Right Alert, Right Team, Right Time
An alert that lands in a generic #dev-alerts firehose is an alert nobody owns. Route by URL ownership and by severity.

routes:
  - match: "/checkout/*"
    owner: payments-team
    rules:
      - { metric: inp, p75_over: 350ms, sustained: 3, notify: pagerduty }
      - { metric: lcp, drift_pct: 20, notify: slack:#payments-perf }
  - match: "/dashboard/*"
    owner: app-team
    rules:
      - { metric: lcp, drift_pct: 25, notify: slack:#app-perf }
  - match: "/blog/*"
    owner: web-team
    rules:
      - { metric: lcp, drift_pct: 40, notify: dashboard-only }

The escalation principle is simple: drift and warnings go to Slack where someone can pick them up during working hours; a sustained "poor" on a money path pages the team that owns that path, and only that team.
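A minimal sketch of how such a config might be resolved at alert time is shown below. It mirrors the routes above as plain data; the `warn` and `page` fields, `findRoute`, and `destinationFor` are hypothetical names, and the matcher only understands a trailing `*`, which is an assumption about what the patterns mean.

```javascript
// Mirrors the routing config as plain data; the shape is illustrative only.
const routes = [
  { match: "/checkout/*", owner: "payments-team", warn: "slack:#payments-perf", page: "pagerduty" },
  { match: "/dashboard/*", owner: "app-team", warn: "slack:#app-perf", page: null },
  { match: "/blog/*", owner: "web-team", warn: "dashboard-only", page: null },
];

// Treats a trailing "*" as "any path under this prefix"; the first matching route wins.
function findRoute(path) {
  const found = routes.find((r) => {
    if (r.match.endsWith("*")) return path.startsWith(r.match.slice(0, -1));
    return path === r.match;
  });
  return found ?? null;
}

// Critical alerts page only when the owning route has a pager target;
// everything else goes to that route's warning channel.
function destinationFor(path, severity) {
  const route = findRoute(path);
  if (!route) return { owner: null, channel: "dashboard-only" };
  const channel = severity === "critical" && route.page ? route.page : route.warn;
  return { owner: route.owner, channel };
}

console.log(destinationFor("/checkout/payment", "critical"));
console.log(destinationFor("/dashboard/home", "critical"));
console.log(destinationFor("/pricing", "critical"));
```

Returning an explicit `owner: null` for unmatched paths, rather than falling back to a shared channel, makes unowned routes visible as a gap to fix instead of quietly recreating the firehose.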
The Why Behind the What: Connecting Vitals to Replays
A well-tuned alert tells you that a page got slow. It almost never tells you why. This is where a number-only monitoring stack leaves you stuck, and where session replay closes the loop.
When an INP or CLS alert fires, the question is always "what did the user actually experience?" A session replay tied to the slow sessions answers it directly. You watch a layout shift happen—an ad slot or a late-loading hero image pushing the "Add to Cart" button down half a second after the user's thumb is already moving—and you see the rage clicks that follow. The CLS number told you 0.18; the replay tells you it was a single late image and shows you the misclicks it caused. That is the difference between an alert that creates work and an alert that resolves work.
Implementation Checklist for a Quiet On-Call
If you are starting from a noisy setup, work through these in order:
- Audit existing thresholds. List every Web Vitals alert you have. For each, ask: when this last fired, did anyone take action? If the honest answer is "no" more than once, it is noise. Delete it or downgrade it to dashboard-only.
- Switch averages to P75, page only on sustained change. Replace absolute-threshold alerts with baseline-relative ones that must hold across multiple windows.
- Segment and scope. Restrict every paging alert to the cohorts and routes you actually own and can fix.
- Route by ownership. No more shared firehose. Each alert names a team.
- Add mute periods for known maintenance. A scheduled deploy or load test should not page anyone.
- Run a weekly fatigue report. Track alert volume per person and the acknowledge-to-action ratio. If alerts are climbing and actions aren't, your thresholds have drifted out of tune again.
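The last two checklist items lend themselves to a short sketch: a mute-window check and a per-person fatigue summary. The alert record shape (`assignee`, `acknowledged`, `actioned`) is an assumption, and the ratio is computed here as actions taken per acknowledged alert, which is one reasonable reading of "acknowledge-to-action".

```javascript
// True when a timestamp falls inside any scheduled mute window (end is exclusive).
function isMuted(atMs, muteWindows) {
  return muteWindows.some((w) => atMs >= w.startMs && atMs < w.endMs);
}

// Per-person alert volume and the share of acknowledged alerts that led to action.
function fatigueReport(alerts) {
  const perPerson = {};
  for (const a of alerts) {
    if (!perPerson[a.assignee]) {
      perPerson[a.assignee] = { fired: 0, acknowledged: 0, actioned: 0 };
    }
    const row = perPerson[a.assignee];
    row.fired += 1;
    if (a.acknowledged) row.acknowledged += 1;
    if (a.actioned) row.actioned += 1;
  }
  for (const row of Object.values(perPerson)) {
    row.actionedPerAck = row.acknowledged === 0 ? 0 : row.actioned / row.acknowledged;
  }
  return perPerson;
}

// Invented week of alerts.
const weekOfAlerts = [
  { assignee: "alice", acknowledged: true, actioned: true },
  { assignee: "alice", acknowledged: true, actioned: false },
  { assignee: "alice", acknowledged: false, actioned: false },
  { assignee: "bob", acknowledged: false, actioned: false },
];
console.log(fatigueReport(weekOfAlerts));
```

A falling `actionedPerAck` alongside a rising `fired` count is the pattern the checklist warns about: people are still acknowledging alerts, but fewer of them lead to any work.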
The goal is a channel your team trusts: when it fires, something real is happening, on a page that matters, for users you serve. Everything else is a dashboard you check on your own schedule. You can sanity-check any URL's field vitals with our free Web Vitals checker, and if you want to formalize how much regression you're willing to tolerate, fold these signals into an error budget rather than treating every blip as a fire. For the business case behind the metrics you're alerting on, see how INP, LCP, and CLS map to conversion.
This is the model GlitchReplay is built for: full-fidelity, flat-rate RUM, so your percentiles are real measurements and not undersampled guesses, with session replay attached to every alert, so the moment a vital regresses you can watch exactly what your users saw instead of waking up to interrogate a chart. Stop paging on every blip in P75. Build alerts that earn the interruption.
GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.