Cloudflare D1 timeout errors: causes and fixes

Connection limits, query plan surprises, and the indexes that turn a 30-second query into a 30ms one.

·
cloudflared1

It's the classic "it worked in staging" nightmare. You deployed your Cloudflare Worker, everything was fast, and then your D1 table crossed 100,000 rows. Suddenly your logs are peppered with D1_ERROR: execution timeout. A query that ran in 20ms last week now hits a hard limit — not because the data is genuinely huge, but because your edge database is doing a full table scan while waiting on a write lock. This post is about why that happens, how to diagnose it with the tools D1 gives you, and how to turn a 30-second query back into a 30-millisecond one.

Anatomy of a D1 Timeout: The 30-Second Wall

D1 is SQLite running as a managed, serverless database at the edge. That heritage explains most of its surprising behavior, starting with the timeout.

Understanding the execution cap

A single D1 query has a hard execution ceiling — currently 30 seconds. When a statement exceeds it, D1 aborts and returns a D1_ERROR to your Worker. Thirty seconds sounds generous, but a query should never come close; if you're hitting it, you almost certainly have an unindexed scan, a lock-contention problem, or both. The timeout is a symptom, not a tuning knob. The first step is to catch it explicitly so it doesn't surface as an opaque 500:

try {
  const result = await env.DB.prepare(
    "SELECT * FROM events WHERE team_id = ?"
  ).bind(teamId).all();
  return Response.json(result.results);
} catch (e) {
  if (e.message?.includes("D1_ERROR") && e.message?.includes("timeout")) {
    captureException(e, { query: "events.byTeam", teamId });
    return new Response("Query timed out", { status: 504 });
  }
  throw e;
}

Why time to first byte is different at the edge

With a traditional database you hold a persistent connection and round trips are cheap. D1 is accessed over Cloudflare's internal network from an ephemeral Worker isolate, so each statement carries network overhead, and there's no long-lived connection to amortize it across. This makes the number of round trips matter as much as the cost of any single query — a point that becomes important when we get to batching.

The SQLite Locking Model at the Edge

Because D1 is SQLite, it inherits SQLite's locking model, and that model is the source of many timeouts that have nothing to do with how much data you're reading.

Shared vs. exclusive locks

SQLite uses shared locks for reads and an exclusive lock for writes. A write must wait for in-flight reads to finish, and once a write holds the exclusive lock, new reads queue behind it. In a low-traffic staging environment this is invisible. Under production concurrency, a single slow write can serialize a pile of reads behind it, and those queued reads burn their clock waiting for the lock — eventually tripping the 30-second timeout despite being trivial queries.

The cron job that kills your reads

The textbook version of this: a nightly "cleanup" cron job runs a big DELETE or bulk UPDATE, which grabs the exclusive lock for an extended period. Every user-facing read that arrives during that window waits on the lock. Your users see timeouts, your dashboard shows a spike, and the actual culprit is a maintenance task you forgot was running. Heavy writes should be chunked into small batches so they release the lock frequently, letting reads slip through.

The Missing Index Trap in Serverless

Small datasets are deceptive. They mask architectural problems because even a full table scan is fast over a thousand rows. The trouble is that scan cost grows with row count, so the problem stays invisible right up until it's an outage.

Full table scans, the silent killer

A WHERE clause on an unindexed column forces SQLite to read every row and test each one. At 1,000 rows that's instant. At 100,000 rows it's a meaningful fraction of a second per query, and under concurrency — with each scan also holding a shared lock — the whole thing snowballs into timeouts. The degradation isn't linear with your traffic; it's the product of row count and concurrency.

Identifying unindexed lookups

The rough shape of the problem: the same WHERE team_id = ? query that takes well under a millisecond at 1k rows climbs steadily as the table grows, while the indexed version stays essentially flat regardless of size. An index turns "read everything and filter" into "jump straight to the matching rows." The cost of adding one is a single DDL statement; the cost of not adding one is a production incident.

Using EXPLAIN QUERY PLAN with Wrangler

You don't have to guess which queries scan. SQLite will tell you exactly, and you can ask it from the command line before anything reaches production.

Running the command

wrangler d1 execute my-db --remote \
  --command="EXPLAIN QUERY PLAN SELECT * FROM events WHERE team_id = 42"

Interpreting the output

The output has two phrases you care about. SCAN TABLE events means a full table scan — SQLite is reading every row, and this query will degrade as the table grows. SEARCH TABLE events USING INDEX idx_team means SQLite found an index and is jumping straight to the matching rows. Your goal is to never see SCAN on a hot query. Create the index and re-run the plan to confirm:

-- Before: EXPLAIN shows "SCAN TABLE events"
CREATE INDEX idx_events_team ON events(team_id);
-- After: EXPLAIN shows "SEARCH TABLE events USING INDEX idx_events_team"

Make this a habit: any query that runs on a user-facing path gets its plan checked before it ships. It takes ten seconds and it's the single highest-leverage thing you can do for D1 performance.

Optimizing with Statement Batching

Remember that each round trip to D1 carries network overhead. When you need to run several statements, doing them one at a time multiplies that overhead and lengthens the window during which you might hit a lock.

The batch API

D1's .batch() API sends multiple prepared statements in a single round trip, and executes them as an implicit transaction. This collapses the network overhead and shortens lock-hold time:

// SLOW: three separate round trips, three lock acquisitions.
await env.DB.prepare("INSERT INTO logs (msg) VALUES (?)").bind(a).run();
await env.DB.prepare("INSERT INTO logs (msg) VALUES (?)").bind(b).run();
await env.DB.prepare("INSERT INTO logs (msg) VALUES (?)").bind(c).run();

// FAST: one round trip, one transaction.
const stmt = env.DB.prepare("INSERT INTO logs (msg) VALUES (?)");
await env.DB.batch([
  stmt.bind(a),
  stmt.bind(b),
  stmt.bind(c),
]);

When batching backfires

Batching has a ceiling. SQLite limits the number of bound variables in a single statement, so don't try to cram tens of thousands of parameters into one giant batch — you'll hit a "too many SQL variables" error. For very large bulk operations, chunk into batches of a few hundred. You still get most of the round-trip savings while staying under the variable limit and keeping each batch's lock-hold short.

Architecting for High Concurrency

Sometimes the right answer is to take load off D1 entirely. Not every read needs to touch the database, and not every write needs to contend for the same lock.

KV and Durable Objects as a caching layer

For read-heavy data that tolerates slight staleness — config, feature flags, rendered fragments — put Cloudflare KV in front of D1. KV reads are fast and don't contend with your database locks at all. For data that's both hot and frequently mutated, a Durable Object can serialize access in memory and flush to D1 periodically, keeping the contention out of the shared database.

Moving hot counters out of D1

The canonical example is a view counter. If every page view does UPDATE posts SET views = views + 1, every view grabs the exclusive write lock on a popular row, and under traffic that single statement becomes a bottleneck that times out other queries. Move the counter into a Durable Object, increment it in memory, and persist the total to D1 every few seconds. The write contention vanishes, and D1 sees one update instead of ten thousand.

Monitoring D1 Health with GlitchReplay

The thing about D1 timeouts is that they're conditional. They don't fire for every user — they fire for the user with the big dataset, during the window when the cron job is running, on the query that scans. Logs alone rarely capture that context, which is why these are so hard to reproduce.

Seeing the context that triggers the timeout

Capturing the timeout error together with the surrounding context — which team ID, which query, what the user was doing — is what turns "D1 is sometimes slow" into "D1 times out for team 4017 because they have 2 million rows and this query isn't indexed for their access pattern." A session replay alongside the error shows you the user-side symptom — the spinner that hung, the action that triggered the slow query — so you can connect the database failure to the real-world trigger. For more on why those moments matter, see our broader guides on error tracking on Cloudflare Workers and debugging Workers in production.

Alerting on D1-specific error strings

Set up an alert that fires on the D1_ERROR: execution timeout string specifically. A sudden cluster of these is your earliest signal that a table crossed the threshold where a missing index started to matter, or that a maintenance job is holding a lock — usually well before users start filing tickets.

D1 timeouts feel mysterious because the same query that's instant in staging falls over in production, but the causes are almost always boring and fixable: a missing index turning a lookup into a scan, a write lock serializing your reads, or too many round trips stretching a request past the limit. Check your query plans, index your hot columns, batch your writes, and move the truly hot paths off the database. GlitchReplay tracks D1 errors with full session context, Sentry-compatible and flat-rate priced, so you can see not just that a query timed out but exactly which user and which data made it happen. Stop guessing why your edge queries are slow — give GlitchReplay a try.

Stop watching your error bill spike.

GlitchReplay is Sentry-SDK compatible, includes session replay and security signals, and never charges per event. Free to start, five minutes to first event.