The "noisy errors" problem and how to triage at scale
Inbox-zero for error trackers: the five rules that turn a 10,000-issue backlog into a manageable weekly review.

It's 9:00 AM on Monday, and your Slack "alerts" channel has 452 new messages. By 9:05 AM, you've muted the channel. This is the "noisy error" death spiral: when the volume of signals exceeds a team's capacity to process them, the tool becomes a graveyard of ignored data. The most dangerous bug isn't the one that crashes the server; it's the one buried under 5,000 "Script error" notifications from a legacy browser extension that a single user in a remote office happens to have installed. When your monitoring tool becomes a source of anxiety rather than a source of truth, you haven't just lost visibility—you've lost the ability to respond to reality.
Most engineering teams treat error tracking as a passive "log of everything that went wrong" rather than a high-signal stream. They wait for the dashboard to turn red, or for a customer to complain, before they actually look at the data. This reactive posture is exactly why "on-call" is a dirty word in so many organizations. This post provides a tactical framework for moving from alert fatigue to "Inbox Zero" by applying five structural rules to error management. You don't need more data; you need a better filter.
Why "Ignoring" is Not a Strategy
Technical debt is often discussed in terms of messy code or outdated libraries, but there is a specific kind of "observability debt" that occurs when a team allows their error backlog to grow unchecked. Every time you ignore a recurring error because "we know about that one," you are training your brain to ignore the entire system. This is the "boy who cried wolf" effect in software engineering. When the signal-to-noise ratio drops, the cognitive load required to identify a genuine regression becomes too high, and people simply stop looking.
The Hidden Cost of Alert Fatigue
Alert fatigue isn't just about being annoyed by notifications; it's a significant risk to system stability. When a developer is bombarded with 500 alerts a day, their threshold for "normal" shifts. They start to assume that a spike in errors is just "the usual noise" from a recent deployment or a bot crawling the site. I've seen cases where a critical API failure—one that was preventing 10% of users from checking out—was missed for six hours because it shared a generic "Internal Server Error" fingerprint with a known, harmless bug in a background cleanup task. The team saw the error count go up, but because "that error always happens," nobody clicked through to see that the context had changed.
Signal vs. Noise: Defining the Triageable Error
We need to distinguish between "telemetry" and "triageable errors." Telemetry is data you keep for historical analysis or debugging after the fact. A triageable error is an actionable event that represents a failure of the system to meet a user's expectation. If an error is not actionable—if there is literally nothing a developer can do to fix it (like a user's ISP dropping a connection halfway through a script download)—it shouldn't be in your primary triage queue. It should be filtered out or moved to a secondary "telemetry" bucket. By narrowing the definition of what constitutes an error, you immediately reclaim 50% of your attention.
Rule 1: Master the Fingerprint
The core of any error tracking tool is its grouping algorithm. Sentry, and by extension GlitchReplay, uses a "fingerprint" to determine if two error events are the same. By default, this is usually based on the stack trace. But stack traces are often misleading. A single bug might produce five different stack traces depending on the code path, or five different bugs might produce the same stack trace if they all pass through a common middleware or utility function.
When Stack Traces Lie
Consider a database connection timeout. Depending on which part of your application is trying to reach the database, the stack trace will look completely different. Your error tracker will show you 50 different "unique" issues, even though they all have the same root cause: the database is overloaded. Conversely, if you have a generic try/catch block in your main Express or Next.js layout that logs every error to the console, the stack trace might just show that catch block, effectively merging unrelated bugs into one "noisy" bucket. This is where custom fingerprinting becomes mandatory.
Custom Grouping Logic
To fix this, you should override the default grouping logic for known categories of errors. If you know that any two errors with a code of E_DB_TIMEOUT are logically the same, tell your SDK to group them that way. This collapses those 50 unique issues into a single, high-volume event that clearly signals a systemic problem. Here is an example of a beforeSend hook that applies custom fingerprinting for common issues in a Cloudflare Workers or Next.js environment:
// In your Sentry/GlitchReplay initialization
Sentry.init({
  dsn: '<your-project-dsn>',
  beforeSend(event, hint) {
    const error = hint.originalException;
    // Group all database timeouts together, regardless of stack trace
    if (error && error.message && error.message.includes('Database timeout')) {
      event.fingerprint = ['database-connection-timeout'];
    }
    // Group rate-limit (429) errors by the hostname of the request that reported them
    if (error && error.status === 429 && event.request && event.request.url) {
      const url = new URL(event.request.url);
      event.fingerprint = ['rate-limit', url.hostname];
    }
    return event;
  },
});
Rule 2: Filter at the Edge
The most efficient way to manage noise is to ensure it never reaches your database in the first place. Every event you process costs you—if not in money, then in cognitive cycles. There is a specific class of "junk" errors that plague almost every web application, and they should be discarded at the SDK level.
Discarding "Junk" Errors
If you are running a client-side JavaScript application, you are at the mercy of the user's environment. This means you will see errors caused by browser extensions, ad blockers, and malware injectors. These are not your bugs, and you cannot fix them. For example, ResizeObserver loop limit exceeded is a notorious "junk" error that happens in almost every modern React app due to how browsers handle layout transitions; it almost never impacts user experience in a meaningful way. Other common offenders include Script error (which happens when a cross-origin script fails to load without CORS headers) and errors originating from chrome-extension:// or safari-extension:// URLs.
Inbound Filters vs. SDK Filtering
You have two places to filter: in the SDK (the user's browser) or via Inbound Filters (on the server). Filtering in the SDK is better for performance and privacy, as the data never leaves the user's machine. However, server-side filters are easier to update globally without a new deployment. A robust strategy uses both. Here is a "Top 5 Junk Errors" list that you should likely be filtering right now:
- ResizeObserver loop limit exceeded
- Non-Error promise rejection captured (usually from third-party libraries)
- Errors containing top.GLOBALS (usually from legacy toolbars)
- Errors from ucbrowser or other non-standard mobile browsers with aggressive injections
- 404s for apple-touch-icon.png or favicon.ico generated by bots
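As a starting point, here is a minimal sketch of the SDK half of that strategy, assuming a Sentry-compatible browser SDK that supports the standard ignoreErrors and denyUrls options (exact option names can vary between SDK versions):
// Dropped in the browser before they are ever sent to the server
Sentry.init({
  dsn: '<your-project-dsn>',
  // Matched against the error message
  ignoreErrors: [
    'ResizeObserver loop limit exceeded',
    'Non-Error promise rejection captured',
    /top\.GLOBALS/,
  ],
  // Matched against the URL of the script that threw the error
  denyUrls: [
    /^chrome-extension:\/\//,
    /^safari-extension:\/\//,
  ],
});
Anything that slips past this list can still be caught with a server-side Inbound Filter, which you can adjust globally without shipping a new build.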
Rule 3: The "Impact Score" Triage
When you have 100 unresolved issues, how do you decide what to fix on Monday morning? Most people sort by "Total Events," but this is a trap. An error that happens 10,000 times to a single bot crawling your site is significantly less important than an error that happens 5 times to 5 different users attempting to pay you money.
Volume vs. Users Affected
The most important metric in triage is "Users Affected." This immediately separates "flappy" errors from systemic failures. If the event-to-user ratio is 1:1, you have a widespread problem. If it's 1000:1, you likely have a specific edge case or a retry-loop bug that is annoying for one person but not breaking the product for everyone else. Your triage workflow should prioritize the "Unique Users" count over the "Total Events" count every time.
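As a rough sketch of that ordering (the issue objects below are hypothetical stand-ins for whatever your tracker's API returns, not a real schema):
// Hypothetical issue summaries: sort by unique users, not raw event volume
const issues = [
  { title: 'TypeError in checkout', events: 5, usersAffected: 5 },
  { title: 'Timeout hit by a single crawler', events: 10000, usersAffected: 1 },
];
const triageOrder = [...issues].sort((a, b) =>
  // Unique users first; fall back to the event-to-user ratio as a tiebreaker
  b.usersAffected - a.usersAffected ||
  b.events / b.usersAffected - a.events / a.usersAffected
);
console.log(triageOrder[0].title); // 'TypeError in checkout'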
Tagging for Context
To make this even more effective, you should attach business context to your errors. If you use tags like plan_type: enterprise or is_checkout: true, you can create "High Priority" views in your dashboard. An error in the checkout flow of an enterprise customer should trigger an immediate page; an error in the "Settings > Change Avatar" flow for a free user can wait until the next sprint. By weighting your priority based on business value, you align engineering effort with company goals.
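With a Sentry-compatible SDK this is usually a one-liner wherever you already know the user's plan or the flow they are in; the currentUser and isCheckoutFlow names below are illustrative, not part of any SDK:
// Attach business context so dashboard views can filter and weight by it
Sentry.setTag('plan_type', currentUser.plan); // e.g. 'enterprise' or 'free'
Sentry.setTag('is_checkout', String(isCheckoutFlow)); // tag values are strings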
Rule 4: Visual Verification with Session Replay
One of the biggest time-wasters in triage is the "reproduction" phase. A developer sees an error, looks at the stack trace, and says, "I don't see how this could happen." They then spend two hours trying to mimic the user's state, often failing because they don't know the exact sequence of clicks that led to the crash. This is why "Cannot Reproduce" is the most common resolution for noisy errors—and also the most dangerous, because the bug is still there.
Reducing "Cannot Reproduce" Closures
Session Replay changes the math of triage. Instead of guessing what the user did, you watch a video of it. You see them click a button three times, wait for a slow API response, and then try to navigate away—causing a race condition that your code didn't handle. When you can see the user's journey, the "reproduction" time drops to zero. In our experience, teams using session replay see a 30-40% reduction in their average time-to-resolution (TTR) because they skip the back-and-forth between QA and Dev.
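If you are on a Sentry-style SDK, the usual pattern is to record a replay for every errored session and only a small sample of healthy ones; the option and integration names below follow the current Sentry JavaScript SDK and may differ in your version:
// Keep replay volume low, but always have the video when something breaks
Sentry.init({
  dsn: '<your-project-dsn>',
  integrations: [Sentry.replayIntegration()],
  replaysSessionSampleRate: 0.1, // record 10% of ordinary sessions
  replaysOnErrorSampleRate: 1.0, // record 100% of sessions that hit an error
});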
Distinguishing UI Glitches from Logic Errors
Sometimes an error is "noisy" but harmless. By using the console log overlay in a replay, you can see if a reported error actually broke the layout or just logged a warning. If the user continues their journey successfully despite a console error, that error is low priority. If the screen goes white or the "Submit" button stops responding, it's a blocker. Replay provides the "human" context that a stack trace lacks.
Rule 5: The Weekly "Error Budget" Review
Triage is not a one-time event; it's a recurring process. You need to treat your error queue like a financial budget. The Google SRE book popularized the concept of an "Error Budget": the idea that you have a certain amount of unreliability you are willing to tolerate in exchange for development speed. But for most teams, the budget isn't about uptime; it's about attention.
The 20-Minute Triage
Every week (or every sprint), the on-call engineer should spend exactly 20 minutes clearing the "Unresolved" queue. This is not for fixing bugs; it's for categorizing them. Every new issue should end up in one of four states:
- Fix Now: High impact, many users, clear path to resolution.
- Snooze: Not enough data yet. "Ignore until this happens 50 more times" or "Ignore until this affects 10 unique users."
- Archive/Filter: This is junk. Filter it at the SDK or server level so we never see it again.
- Known Issue: We know why this is happening but we aren't fixing it yet. Mute it so it doesn't clutter the dashboard.
The goal of error budget management is to ensure that the "Unresolved" tab is as empty as possible. If it stays empty, you'll actually notice when something new and dangerous appears.
Flat-Rate Freedom: Triage Without the Bill
There is a hidden reason why error triage is so difficult for many teams: pricing models. Most traditional error tracking tools charge per-event. This creates a perverse incentive for engineering managers to be over-aggressive with filtering. If your tool starts costing $5,000 a month because a bot hit a 404 page 10 million times, your first instinct isn't to triage the data—it's to delete it.
The "Per-Event" Tax
When you pay per-event, you are effectively being taxed for having more observability. This leads to "blind spots" where teams turn off tracking for certain environments or aggressively sample their data to save money. But the most important signals are often found in the high-cardinality data that you would otherwise discard. You shouldn't have to choose between a manageable bill and a visible system.
Triage by Logic, Not by Budget
At GlitchReplay, we believe triage should be driven by technical logic, not financial constraints. By offering a flat-rate model, we allow teams to capture every signal—the noisy ones, the rare ones, and the "junk" ones—and then use the rules above to filter them properly. This "capture everything, filter locally" approach ensures you have the data when you need it (like for a post-mortem) without being punished for it in the meantime. You can afford to be thorough when the noise is free.
Stop paying for noise and start fixing bugs. GlitchReplay gives you Sentry-SDK compatibility, full session replays, and flat-rate pricing so you can capture every signal without the financial noise. If you're tired of alert fatigue, it's time to move from a "log of everything" to a high-signal stream that your team actually trusts.
GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.