Tracking errors in Cloudflare Workers (the right way)
`waitUntil`, tail workers, and SDK transport limits — the three things that make Workers error tracking different from Node.
You've just deployed a mission-critical Cloudflare Worker. It's handling millions of requests with sub-10ms latency, and your Cloudflare Analytics dashboard looks beautiful—except for a growing sliver of red indicating 500 errors. You check your error tracking dashboard, expecting a flood of stack traces. Silence. Zero errors. You refresh, check the API keys, and even trigger a manual exception in staging. Staging reports fine, but production remains a ghost town. This is the "Ghost Failure" problem, and it's the most common reason developers lose faith in edge computing observability.
The culprit isn't usually a bug in your code or a service outage at your provider. It's a fundamental misunderstanding of the V8 isolate architecture. In the ephemeral world of edge computing, if you don't explicitly tell the runtime to wait for your telemetry, your errors simply don't exist. When the response is sent to the user, the isolate is often frozen or destroyed immediately. If your Error SDK is still mid-way through an HTTP POST to an ingest server, that request is killed. No retry, no log, just a silent failure.
At GlitchReplay, we spent months debugging why high-throughput Workers were dropping up to 30% of their error reports when using standard Node-style SDKs. This post is the distillation of what we learned about tracking errors the right way on Cloudflare.
The Isolate Architecture: Why Workers ≠ Node.js
To understand why error tracking fails at the edge, you have to understand the difference between a container and an isolate. In a standard Node.js environment (like AWS Lambda or a VPS), you have a persistent process. When an error occurs, the process stays alive long enough for a background thread to ship that error off to a server. Even in a "serverless" Lambda, the runtime usually waits for the event loop to empty before freezing the execution environment.
Cloudflare Workers use V8 isolates. Think of an isolate as a clean room for your code. It starts up in less than 5 milliseconds because it doesn't need to boot an entire operating system or even a full Node.js runtime. But that speed comes with a trade-off: strict resource management. You don't get a persistent event loop that survives the request-response cycle by default.
CPU Time vs. Wall Time
On a standard Workers plan, you are often limited to 50ms of CPU time. This is not the same as "wall time" (the total time the request takes). If you make a fetch request to a slow database, your Worker "sleeps" and doesn't consume CPU time while waiting. However, the moment your code starts executing again—including the code inside an Error SDK that is serializing a massive stack trace and encrypting a JSON payload—the clock starts ticking. If you hit that 50ms limit during error reporting, Cloudflare will kill the isolate instantly. Your user gets their response, but your telemetry dies in memory.
The "Clean Room" Problem
Because isolates are designed to be reused across different requests for efficiency, they must be perfectly reset. Global state is dangerous. If an SDK tries to buffer errors in a global array to send them in batches (a common strategy in Node.js to save on HTTP overhead), those errors might never be sent if the isolate is purged or if the next request doesn't trigger a flush. At the edge, persistence is an illusion.
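To make that concrete, here is a sketch of the batching anti-pattern described above; `doWork`, `serialize`, and `flush` are hypothetical helpers, not part of any SDK:

```js
// Anti-pattern: buffering errors in module-level state between requests.
// The buffer lives only inside this isolate, and Cloudflare can evict the
// isolate at any moment, taking every unflushed entry with it.
const errorBuffer = [];

export default {
  async fetch(request, env, ctx) {
    try {
      return await doWork(request);
    } catch (e) {
      errorBuffer.push(serialize(e)); // queued in memory, never guaranteed to ship
      if (errorBuffer.length >= 10) {
        // Only runs if a later request happens to land on this same isolate.
        ctx.waitUntil(flush(errorBuffer.splice(0)));
      }
      return new Response("Error", { status: 500 });
    }
  },
};
```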
The ctx.waitUntil Requirement
The most important tool in your Cloudflare Workers toolkit is the ExecutionContext, specifically the ctx.waitUntil() method. This is the only way to tell the Cloudflare runtime: "I know I've sent the response to the user, but I've still got work to do. Please don't kill this isolate yet."
When you call captureException(e), most SDKs return a Promise. In a browser or Node, you might ignore that promise. At the edge, ignoring that promise is a guarantee of data loss. Here is what the "silent failure" pattern looks like:
export default {
  async fetch(request, env, ctx) {
    try {
      return await doWork(request);
    } catch (e) {
      // This will likely fail in production!
      // The response is sent, and the isolate is killed
      // before the SDK finishes its network call.
      SDK.captureException(e);
      return new Response("Error", { status: 500 });
    }
  }
}
To fix this, you must pass the promise returned by your tracking tool into ctx.waitUntil. This extends the lifetime of the isolate until the promise settles. Here is the correct pattern:
export default {
  async fetch(request, env, ctx) {
    try {
      return await doWork(request);
    } catch (e) {
      // We tell the runtime to keep the isolate alive
      // until the error is fully reported.
      ctx.waitUntil(SDK.captureException(e));
      return new Response("Internal Server Error", {
        status: 500,
        headers: { "Content-Type": "text/plain" }
      });
    }
  }
}
By using ctx.waitUntil, you ensure that even if the client receives the 500 error in 10ms, the isolate stays active for the extra 50-100ms required to hand off the telemetry to your ingest endpoint.
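This also means threading the context through your code. If lower layers can fail, give them access to `ctx` (or a thin wrapper around it) instead of calling the SDK fire-and-forget. A minimal sketch, where `reportError` and `doWork` are illustrative names rather than part of the Workers API or any SDK:

```js
// A thin helper so utility code can report errors without awaiting the
// flush itself. reportError and doWork are hypothetical names.
function reportError(ctx, error) {
  // Hand the promise to the runtime instead of awaiting (or dropping) it.
  ctx.waitUntil(SDK.captureException(error));
}

export default {
  async fetch(request, env, ctx) {
    try {
      return await doWork(request, ctx); // pass ctx down so deeper layers can report too
    } catch (e) {
      reportError(ctx, e);
      return new Response("Internal Server Error", { status: 500 });
    }
  },
};
```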
The SDK Transport Bottleneck
Even with ctx.waitUntil, you aren't out of the woods. Standard error-tracking SDKs are "heavy." They were built for an era of cheap RAM and persistent CPUs. When you run a full-featured SDK at the edge, you are paying a hidden tax in memory and CPU cycles.
Consider the process of reporting an error. The SDK must:
- Capture the current stack trace.
- Walk the stack and extract context for every frame.
- Serialize the entire object (including tags, breadcrumbs, and environment data) into a large JSON string.
- Compress or encrypt that string.
- Initiate a `fetch` request.
In a Cloudflare Worker with a 128MB memory limit, a large stack trace combined with a few dozen breadcrumbs can easily consume several megabytes of heap during serialization. If your Worker was already near the limit because it was processing a large CSV or image, the act of reporting the error can trigger an Out of Memory (OOM) exception. This is the ultimate irony: the error tracker causes the final crash, and because the isolate is now dead, it can't report the OOM it just caused.
To avoid this, you need a transport layer specifically designed for V8 isolates—one that prioritizes low-allocation serialization and doesn't attempt to do "heavy lifting" like minification or complex source map resolution on the edge itself. This is why GlitchReplay focuses on a lightweight ingest; we move the processing work to our backend so your Worker stays under the 50ms limit.
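As a rough illustration of what "lightweight" means here, the sketch below sends a small, flat payload and leaves grouping and source map resolution to the backend. The ingest URL and payload shape are made up for the example; they are not GlitchReplay's actual API.

```js
// A low-allocation transport sketch: serialize a small, flat payload and hand
// the fetch promise to waitUntil. Heavy work (grouping, source map resolution)
// is left to the ingest backend. URL and fields are illustrative.
function shipError(ctx, error) {
  const payload = JSON.stringify({
    message: String(error?.message ?? error),
    // Cap the stack to a fixed number of frames to bound memory use.
    stack: typeof error?.stack === "string"
      ? error.stack.split("\n").slice(0, 50).join("\n")
      : undefined,
    ts: Date.now(),
  });

  ctx.waitUntil(
    fetch("https://ingest.example.com/v1/events", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: payload,
    })
  );
}
```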
Tail Workers: The Zero-Latency Alternative
If you are running a high-traffic service where every millisecond of wall time (and every cent of compute cost) matters, you should look into Tail Workers. This is Cloudflare's "pro" solution for observability.
Instead of your "Producer" Worker (the one serving users) being responsible for sending its own errors, you configure a separate "Tail" Worker. Cloudflare automatically sends the results of every execution—including logs, unhandled exceptions, and metadata—to the Tail Worker after the main execution is finished. This happens out-of-band, meaning it has zero impact on the latency of the request being served to your customer.
The architecture looks like this:
- Producer Worker: Executes business logic. If it crashes, it just crashes.
- Cloudflare Runtime: Collects the crash details and logs.
- Tail Worker: Receives a JSON payload containing the Producer's fate. It then ships that data to GlitchReplay or Sentry.
Here is a simplified example of what a Tail Worker looks like:
export default {
  async tail(events) {
    for (const event of events) {
      if (event.outcome === "exception") {
        // Forward the error to your tracking service
        await fetch("https://ingest.glitchreplay.com/api/v1/log", {
          method: "POST",
          body: JSON.stringify({
            exceptions: event.exceptions,
            logs: event.logs,
            scriptName: event.scriptName,
          }),
        });
      }
    }
  }
}
Tail Workers are powerful because they capture the "uncapturable"—errors that happen before your code even starts (like syntax errors in a dynamic import) or errors that happen because of runtime limits (like the dreaded "CPU Limit Exceeded"). Since the Tail Worker is a separate execution context, it has its own fresh memory and CPU budget to handle the reporting.
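Building on the earlier example, you can widen the filter to cover those runtime-limit outcomes as well. The extra outcome strings below are an assumption to check against Cloudflare's tail event documentation; everything else mirrors the sketch above.

```js
// A sketch that forwards runtime-limit outcomes as well as exceptions.
// Verify the exact outcome strings against the current Cloudflare docs.
const REPORTABLE = new Set(["exception", "exceededCpu", "exceededMemory"]);

export default {
  async tail(events) {
    const reportable = events.filter((event) => REPORTABLE.has(event.outcome));
    if (reportable.length === 0) return;

    await fetch("https://ingest.glitchreplay.com/api/v1/log", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(
        reportable.map((event) => ({
          outcome: event.outcome,
          exceptions: event.exceptions,
          scriptName: event.scriptName,
        }))
      ),
    });
  },
};
```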
The Source Map Mystery at the Edge
An error report without a source map is just a collection of gibberish. Because Cloudflare Workers are almost always bundled (using Wrangler, esbuild, or Vite), the stack trace you get from a V8 isolate will point to a line like worker.js:1:24503. This is useless for debugging.
In the Node.js world, you might load source maps from the file system. In a Worker, there is no file system. You have two choices:
- Inline Source Maps: You bundle the `.map` file directly into the `.js` file. Do not do this. It doubles your bundle size, increasing your cold start time and potentially pushing you over the 1MB script size limit.
- Build-time Uploads: You upload your source maps to your tracking provider during your CI/CD process.
To make this work with Cloudflare, point your wrangler.toml at the bundled output and make sure your build step emits external source maps:
# In current Wrangler, `main` is a top-level key pointing at the built entry
main = "./dist/index.mjs"

[build]
# Your build command must emit an external .map file next to the bundle
# (for example, esbuild with --sourcemap)
command = "npm run build"
And in your build script, use the GlitchReplay or Sentry CLI to upload the resulting `.map` files. The key is that the filename in each stack frame matches the name of the uploaded source map exactly: stack traces from the Workers runtime reference the bundled name (typically worker.js or your dist output), so the map must be uploaded under that same name or symbolication will silently fail.
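If your `npm run build` step wraps esbuild, the part that actually produces the external map file might look like this sketch; the entry point and output paths are examples, and the upload itself still happens in CI:

```js
// build.mjs — a minimal esbuild script that bundles the Worker and writes an
// external dist/index.mjs.map for your CI step to upload. Paths are examples.
import { build } from "esbuild";

await build({
  entryPoints: ["src/index.js"],
  bundle: true,
  format: "esm",
  outfile: "dist/index.mjs",
  sourcemap: true, // emits dist/index.mjs.map next to the bundle
  conditions: ["workerd", "worker", "browser"],
});
```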
The Economics of Scale: Avoiding the "Success Tax"
Cloudflare Workers are often chosen because they are incredibly cheap at scale. You can serve millions of requests for a few dollars. However, many developers find that their observability bill quickly outpaces their Cloudflare bill. This is the "Success Tax."
Traditional error tracking companies charge per event. If you have a 0.1% error rate (which is quite good) on a Worker handling 100 million requests a month, you are looking at 100,000 error events. In a standard Sentry plan, that can cost you hundreds of dollars. It creates a perverse incentive: you want to track errors to improve your app, but the more your app grows, the more you are punished for having even a tiny fraction of failures.
This is even more acute with Cloudflare Workers because they are often used for "noisy" tasks like bot detection or edge-side redirects where 4xx and 5xx errors are common. You shouldn't have to decide which errors are "worth" tracking based on your monthly budget.
At GlitchReplay, we take a different approach. Because we built our platform on the same Cloudflare infrastructure we're helping you monitor, our ingest costs are a fraction of the legacy players. We offer flat-rate pricing. Whether you have 10,000 errors or 1,000,000, your price stays the same. We believe observability is a right, not a luxury taxed by volume.
Best Practices Checklist for Workers
Before you ship your next Worker, go through this checklist to ensure you won't be left in the dark when things go wrong:
- Use `try/catch` at the top level: Don't rely on the runtime to catch everything. Wrap your `fetch` handler.
- Pass the context: Ensure every utility function that might need to report an error has access to `ctx.waitUntil`.
- Sanitize PII at the Edge: Use the `beforeSend` hook in your SDK to scrub passwords or API keys before they leave the Cloudflare network (see the sketch after this list). This helps with GDPR compliance since the data is cleaned before it even hits the open internet.
- Keep breadcrumbs lean: Every breadcrumb you add (like "SQL query started") takes up memory. In a 128MB environment, keep only the last 10-20 most important events.
- Monitor CPU usage: Use `wrangler tail` to watch for exceeded-CPU outcomes and see whether error reporting is pushing you close to the 50ms limit. If it is, switch to a Tail Worker.
- Use a Sentry-SDK-compatible endpoint: Don't reinvent the wheel. Use the existing Sentry SDKs but point them at a more cost-effective, edge-native backend like GlitchReplay.
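For the PII and breadcrumb items above, a single `beforeSend` hook usually does the job in Sentry-compatible SDKs. A minimal sketch, reusing the generic `SDK` placeholder from earlier; the header and field names are illustrative:

```js
// A beforeSend sketch that scrubs obvious secrets and trims breadcrumbs
// before the event leaves the Worker. Header and field names are examples;
// adjust them to whatever your application actually sends.
SDK.init({
  dsn: "<your-ingest-dsn>",
  beforeSend(event) {
    if (event.request?.headers) {
      delete event.request.headers["authorization"];
      delete event.request.headers["cookie"];
    }
    if (event.user) {
      delete event.user.email;
    }
    // Keep only the most recent breadcrumbs to stay memory-lean.
    if (Array.isArray(event.breadcrumbs)) {
      event.breadcrumbs = event.breadcrumbs.slice(-15);
    }
    return event;
  },
});
```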
Cloudflare Workers are the future of the web, but they require a shift in how we think about the lifecycle of our code. Stop treating your Workers like tiny servers and start treating them like the ephemeral, high-performance isolates they are. When you respect the isolate, your observability becomes rock solid.
If you're tired of missing errors or paying a fortune for the privilege of seeing your own stack traces, give GlitchReplay a try. We're built by Cloudflare fans, for Cloudflare fans, and we'll never charge you a success tax for growing your application.
GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.