| Field | Value |
| --- | --- |
| Severity | high |
| Status | fixed |
| Found | 2026-04-18 |
| Fixed | 2026-04-18 |
| Area | gateway/session |
| Commit | 0655ac1 |
## Symptom
Two `/plan` requests arriving for the same `session_id` within a few
hundred milliseconds both triggered compaction. Three observable
effects:

- Double LLM cost. Both requests summarized the same history into a paragraph. Each paid for an Anthropic/OpenAI call.
- Silent fact loss. LLMs are non-deterministic even at `temperature: 0.2`. The two summaries diverged in which facts they preserved. The second `UPDATE` to `gateway_sessions.compaction_summary` overwrote the first. If the winning paragraph happened to omit facts the losing paragraph captured, those facts were gone.
- Non-reproducibility. Replaying identical inputs 5 seconds later could produce a different session state, because which compaction “won” depended on sub-millisecond timing.
## Root cause
No serialization anywhere in the compaction path. `compactIfNeeded` was
a pure Effect — called directly from the planner’s `runPlan`, no lock,
no mutex, no dedupe map. Concurrent invocations interleaved: both read
the same history, both called the LLM, both wrote. The
`SET compacted = true` UPDATE is idempotent, so that part
was harmless. The summary-column UPDATE was the unsafe operation, and
it had no concurrency control.
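A minimal sketch of the interleaving, assuming nothing about the real `compactIfNeeded` beyond its read-then-compute-then-write shape (all names and timings here are hypothetical, not the actual gateway code):

```typescript
// Two concurrent read-then-compute-then-write sequences against one
// shared "column" simulate the race: neither serializes, so the last
// writer wins and the other summary is silently discarded.
let compactionSummary = "";
let llmCalls = 0;

async function compactIfNeededUnsafe(requestId: number): Promise<string> {
  // 1. read the (same) session history -- elided here
  // 2. call the LLM: simulated latency; each concurrent caller pays
  llmCalls++;
  await new Promise((resolve) => setTimeout(resolve, 5));
  // 3. blind UPDATE of the summary column: no lock, no version check
  compactionSummary = `summary-from-request-${requestId}`;
  return compactionSummary;
}
```

Running `Promise.all([compactIfNeededUnsafe(1), compactIfNeededUnsafe(2)])` pays for two LLM calls and keeps only whichever summary landed last.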
## Fix
Application-layer promise dedupe in the compaction layer (gateway/src/session/compaction.ts):
a per-session map of in-flight compactions. The first caller for a
session_id starts the real work and registers its promise; concurrent
callers for the same session_id await that promise and receive the same
CompactOutcome as the first caller.
Only one LLM call. Only one set of UPDATEs. No race.
The entry is cleared in a finally so a resolved (or failed)
compaction does not block subsequent ones. On rejection, the next
caller starts a fresh producer, enabling retry.
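The wrapper described above can be sketched as follows — an illustration of the pattern, not the actual compaction.ts code; `CompactOutcome`, `dedupeCompaction`, and `producer` are assumed names:

```typescript
// Per-process promise dedupe: concurrent callers with the same key share
// one in-flight producer; the entry is cleared on settle (resolve or
// reject) so the next caller starts fresh.
type CompactOutcome = { compacted: boolean; summary?: string };

const inFlight = new Map<string, Promise<CompactOutcome>>();

function dedupeCompaction(
  sessionId: string,
  producer: () => Promise<CompactOutcome>,
): Promise<CompactOutcome> {
  const existing = inFlight.get(sessionId);
  if (existing) return existing; // reuse the in-flight compaction

  const p = producer().finally(() => {
    // Clear on settle: a finished (or failed) compaction must not block
    // later ones. On rejection the next caller starts a fresh producer.
    inFlight.delete(sessionId);
  });
  inFlight.set(sessionId, p);
  return p;
}
```

The four properties the regression tests check — in-flight reuse, post-settle re-entry, per-session key isolation, and error-path recovery — are all visible in this shape.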
Regression tests at gateway/tests/session/compaction-dedupe.test.ts:
four cases covering reuse of in-flight promise, post-settle re-entry,
per-session isolation (different keys don’t collide), and
error-path recovery.
Known limitation documented in the commit and code comment: this
is per-process state. A horizontally-scaled deployment of the gateway
(multiple processes fronting one Supabase) could still race across
processes. Single-process Phase 1 is correct today; Phase 2 should
add either a Postgres version column with optimistic concurrency or
a wrap-everything-in-a-stored-procedure approach. Tracked in
known-issues.md under compaction-dedupe-single-process-only.
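As a sketch of the Phase 2 optimistic-concurrency option (purely illustrative; the row shape and names are assumptions — in Postgres this becomes `UPDATE ... SET version = version + 1 WHERE id = $1 AND version = $2` plus a check of the affected-row count):

```typescript
// Compare-and-swap against an in-memory stand-in for the row: a write
// applies only if the version the writer read is still current. A loser
// must re-read and retry instead of silently overwriting.
interface SessionRow {
  compactionSummary: string;
  version: number;
}

function casWriteSummary(
  row: SessionRow,
  expectedVersion: number,
  summary: string,
): boolean {
  if (row.version !== expectedVersion) return false; // lost the race
  row.compactionSummary = summary;
  row.version += 1;
  return true;
}
```

Unlike the promise dedupe, this holds across processes, which is exactly the gap it is meant to close.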
## Why the tests didn’t catch it
No concurrency tests existed in the session layer. Compaction was
untested end-to-end (separate issue — see 2026-04-18-compaction-lossy-second-pass.md and
2026-04-18-compaction-mid-turn-cut.md, which share the root cause
of “compaction had no test harness”).
Even with a proper compaction harness, this specific bug requires
concurrent invocation to manifest. Vitest tests are sequential by
default; you have to explicitly construct `Promise.all([call(), call()])`
to reproduce the race. That’s the kind of test you write only after
you already know concurrency is a failure mode.
The regression test for this fix is itself instructive: it tests the
dedupe wrapper (a pure function) rather than the full compaction
path, because mocking Supabase + the LLM together is expensive. The
wrapper’s properties (reuse, clear-on-settle, key isolation,
error recovery) are what actually prevent the race; if those hold,
the application of the wrapper to compaction is trivial.
## Class of bug — where else to watch
“Non-idempotent operation on a shared row” — any database write that
replaces state based on read-then-compute must either serialize across
concurrent writers or use compare-and-swap semantics. Ask for each
write path: “if two requests did this concurrently, would the outcome
depend on timing?” Specific candidates in the codebase:

- `db.updateSessionCatalog` wholesale-overwrites the `agent_catalog` column on every `/plan`. Two concurrent requests for the same session could race on which catalog version wins. If catalogs differ (different tenants, different external configs), the losing one’s agents are silently dropped. Noted in known-issues.md under agent-catalog-overwrite and agent-catalog-race.
- `db.touchSession` does an `UPDATE ... SET last_active_at = now()`. Idempotent, safe — this is the right shape.
- Any future payment-state updates (x402 processing in Phase 5) will need concurrency control by definition. If using Postgres, prefer `SELECT ... FOR UPDATE` or advisory locks; if using application-level control, the dedupe pattern here generalizes.
- The `gateway_tasks.state` column is updated by `finishTask`. Single writer per task id, so safe today. If a future feature lets the gateway retry a task (re-finishing it with a different outcome), serialization becomes necessary.