The Dual-Write Problem: The Bug I Kept Shipping Without Knowing Its Name

2:47 a.m.

My phone buzzes.

The fleet dashboard says we're short twenty-three robots. The database swears all twenty-three are Provisioning — and have been for hours. The queue that's supposed to provision them? Empty. The workers? Idle. Doing absolutely nothing, with great confidence.

No errors. No alerts. No stack trace. Just numbers that refuse to add up.

I spent the next three hours hunting a bug that, technically, doesn't exist in any line of code. It lives between two lines. Two lines that passed every code review, passed every test, and ran quietly in production for months:

db.save(robot_state)
queue.publish(provision_msg)

db.save(robot_state)
queue.publish(provision_msg)

If your system has a database and a message queue — or a cache, or a search index, or Stripe — I'd bet real money these two lines, or their cousins, are sitting in your codebase right now. Ticking.

The bug has a name: the dual-write problem. The night someone finally said that name out loud to me, two years of "impossible" tickets suddenly made sense. By the end of this article, you'll see it everywhere — and you'll know the pattern that turns it from inevitable into impossible.

Grab a coffee. Let's go ghost hunting. ☕

Part 1 — The Problem: Two Lines That Lie

First, a story about a meteor

Imagine paying a bill the old-fashioned way. Two steps:

You write "−$500" in your ledger.
You drop the check in the mailbox.

Now imagine a very small, very rude meteor hits you between step 1 and step 2.

Your ledger says "paid." There's no check in the mail. And here's the part that matters: nothing in the universe will ever automatically fix this. Your ledger is simply wrong until some human stumbles onto the gap, scratches their head, and digs through old receipts.

A dual write is exactly that scenario — except it runs millions of times a day, on machines that crash far more often than meteors fall. At that scale, the unlucky millisecond isn't a possibility. It's a calendar entry.

The many costumes of a dual write

Two quick definitions, so we're speaking the same language:

Atomic means all-or-nothing. A database transaction is atomic: either every change in it commits, or none does. You can never observe "half."

A dual write is two writes to two different systems that you wish were atomic but aren't — because no transaction spans both.

Once you know the shape, you'll spot it wearing a dozen different costumes:

Database + message queue (the classic)
Database + Redis — "we'll just invalidate the cache after the commit"
Database + Elasticsearch — "index it right after saving"
Database + email — "save the user, then send the welcome mail"
Database + Stripe — "create the order, then charge the card"
Database + webhook to that other team's service
Database + a file on S3

(Yes, I have personally written at least four of these. Hi.)

The trap is that the code always looks atomic. Two lines, one right after the other. It reads like one operation.

It's two. And the gap between them is where consistency goes to die.

The five stages of dual-write grief

Every engineer who meets this bug goes through the same journey. I certainly did.

Stage 1: Denial.

"That gap is, what, a millisecond? It'll basically never happen."

Let's do the napkin math. Say the second write fails once per 10,000 attempts — that's 99.99% reliable, which sounds amazing on a slide. Now run a million operations a week.

That's a hundred silent inconsistencies. Every. Week.

"Rare" multiplied by "at scale" equals Tuesday.

Stage 2: Anger.

You blame the network. You blame SQS. You add a retry decorator to everything and feel briefly productive. (The retries help with blips. They do nothing for the real problem, but we're not ready to hear that yet.)

Stage 3: Bargaining.

This is where we get creative. Surely there's a clever ordering?

"I'll write to the database first."

1. Commit "Provisioning" to the database    ✅
2. 💥 process dies right here
3. Publish the message to the queue         ❌ never happens

1. Commit "Provisioning" to the database    ✅
2. 💥 process dies right here
3. Publish the message to the queue         ❌ never happens

Result: a ghost.

The database swears work is in progress. No worker ever got the memo.

"Fine, queue first!"

1. Publish the message to the queue         ✅
2. 💥 process dies right here
3. Commit to the database                   ❌ never happens

1. Publish the message to the queue         ✅
2. 💥 process dies right here
3. Commit to the database                   ❌ never happens

Result: an orphan.

A worker happily does a job for a record that doesn't exist.

Reordering just swaps which flavor of broken you get.

"Okay — retry the second write if it fails!" If the process crashed, there is no code left running to do the retrying. You can't retry from the afterlife.

"Then I'll wrap it in try/except and roll back the database when the publish fails!" Closer. But you can crash during the rollback. You've shrunk the gap, not closed it.

Stage 4: Depression.

You read distributed-systems papers at 2 a.m. You learn phrases like "at-least-once delivery." You stare at the ceiling thinking about the CAP theorem.

Stage 5: Acceptance.

Two independent systems with no shared commit cannot be made atomic. Not by ordering, not by retries, not by vibes.

The only way to win is to stop playing.

Why your tests never caught it

Here's what makes this bug genuinely cruel: it's the bug that isn't in the code review. Every line is individually correct. It passes every test — because your tests don't kill the process between line 1 and line 2. It demos flawlessly. It works for months.

It's also the secret root cause behind a whole genre of "impossible" tickets:

"This job is stuck in Provisioning, but there's no message anywhere."
"We charged the customer, but the order never showed up."
"Search is showing a product we deleted last week."

Different teams, different stacks — the same bug. The first time you walk into someone else's incident and say "that's a dual write" before opening a single log file, people look at you like you're psychic. You're not. You just know the shape.

Part 2 — The Fix: Write Once, Derive the Rest

Stop trying to make two writes

Stay with me, because this is the part where it finally clicked for me.

You can't make two writes atomic. So don't. Make one write — atomically — and let the second one be a consequence of the first.

That's the transactional outbox pattern. Here's the whole machine:

one atomic transaction
Handler  ───────────────────────────►  Database
                                       ├── business row  (status = Provisioning)
                                       └── outbox row    (event: ProvisionRequested)

Relay (poller or CDC):
    reads committed-but-unsent outbox rows
    → publishes each to the queue (retrying until the broker ACKs)
    → marks the row "sent"
Consumer:
    processes the message - idempotently

one atomic transaction
Handler  ───────────────────────────►  Database
                                       ├── business row  (status = Provisioning)
                                       └── outbox row    (event: ProvisionRequested)

Relay (poller or CDC):
    reads committed-but-unsent outbox rows
    → publishes each to the queue (retrying until the broker ACKs)
    → marks the row "sent"
Consumer:
    processes the message - idempotently

Three moves:

1. One transaction, two rows. In a single database transaction, write the business row (status = 'Provisioning') and a row into an outbox table describing the event you want published. Because they share one transaction, atomicity comes free: both rows commit or neither does. There is no gap. Your intent to publish is now exactly as durable as the state change itself.

2. A separate relay does the publishing. A little worker reads unsent rows from the outbox and publishes each one to the queue. Only after the broker acknowledges does it mark the row sent. Relay crashes mid-publish? No drama — on restart it re-reads the still-unsent row and tries again. The relay can't lose anything, because the thing it's relaying is sitting safely in your database.

3. Re-publishing means duplicates, so consumers must be idempotent. The relay might publish a row, crash before marking it sent, and publish it again after restarting. Same event, twice. Put the outbox row's id on the message as an idempotency key and have the consumer dedupe. (If you've ever wondered why everyone is so obsessed with idempotency keys — this is why.)

Notice what just happened. You didn't make the impossible possible. You moved the hard problem somewhere it's actually solvable. "Two systems, always consistent" — impossible. "One transaction, a retry-until-acked relay, and an idempotent consumer" — just engineering.

Show me the code

A minimal, copy-pasteable version with a polling relay (Postgres-flavored):

-- The outbox lives in the SAME database as your business tables,
-- so it can share a transaction with them. That's the whole trick.
CREATE TABLE outbox (
  id          BIGSERIAL PRIMARY KEY,   -- doubles as the idempotency key
  aggregate   TEXT NOT NULL,           -- e.g. 'robot:RPA42'
  type        TEXT NOT NULL,           -- e.g. 'ProvisionRequested'
  payload     JSONB NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  sent_at     TIMESTAMPTZ              -- NULL = not yet published
);
CREATE INDEX ON outbox (id) WHERE sent_at IS NULL;  -- cheap "find unsent"
# WRITE PATH — one atomic transaction. No queue call here. At all.
def request_provision(robot_id, count):
    with db.transaction():   # atomic: both rows or neither
        db.execute(
            "UPDATE robot_state SET status='Provisioning' WHERE id=%s",
            robot_id,
        )
        db.execute(
            "INSERT INTO outbox (aggregate, type, payload) VALUES (%s, %s, %s)",
            f"robot:{robot_id}", "ProvisionRequested",
            json.dumps({"robot": robot_id, "count": count}),
        )
    # Crash anywhere above? The transaction rolls back cleanly. No ghost.

# RELAY - a separate worker (cron Lambda, sidecar, whatever). Runs forever.
def relay_once():
    rows = db.query(
        "SELECT * FROM outbox WHERE sent_at IS NULL ORDER BY id LIMIT 100"
    )
    for r in rows:
        sqs.send_message(   # at-least-once delivery
            QueueUrl=Q,
            MessageBody=json.dumps({**r.payload, "idempotency_key": r.id}),
        )
        db.execute("UPDATE outbox SET sent_at=now() WHERE id=%s", r.id)

-- The outbox lives in the SAME database as your business tables,
-- so it can share a transaction with them. That's the whole trick.
CREATE TABLE outbox (
  id          BIGSERIAL PRIMARY KEY,   -- doubles as the idempotency key
  aggregate   TEXT NOT NULL,           -- e.g. 'robot:RPA42'
  type        TEXT NOT NULL,           -- e.g. 'ProvisionRequested'
  payload     JSONB NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  sent_at     TIMESTAMPTZ              -- NULL = not yet published
);
CREATE INDEX ON outbox (id) WHERE sent_at IS NULL;  -- cheap "find unsent"
# WRITE PATH — one atomic transaction. No queue call here. At all.
def request_provision(robot_id, count):
    with db.transaction():   # atomic: both rows or neither
        db.execute(
            "UPDATE robot_state SET status='Provisioning' WHERE id=%s",
            robot_id,
        )
        db.execute(
            "INSERT INTO outbox (aggregate, type, payload) VALUES (%s, %s, %s)",
            f"robot:{robot_id}", "ProvisionRequested",
            json.dumps({"robot": robot_id, "count": count}),
        )
    # Crash anywhere above? The transaction rolls back cleanly. No ghost.

# RELAY - a separate worker (cron Lambda, sidecar, whatever). Runs forever.
def relay_once():
    rows = db.query(
        "SELECT * FROM outbox WHERE sent_at IS NULL ORDER BY id LIMIT 100"
    )
    for r in rows:
        sqs.send_message(   # at-least-once delivery
            QueueUrl=Q,
            MessageBody=json.dumps({**r.payload, "idempotency_key": r.id}),
        )
        db.execute("UPDATE outbox SET sent_at=now() WHERE id=%s", r.id)

The three rules that make it correct

The business write and the outbox insert share one transaction. If they don't, you haven't fixed the dual write — you've just relocated it.
Mark "sent" only after the broker acknowledges. Think of the ack like WhatsApp's checkmark: you don't tell your friend "message delivered" just because you hit send — you wait for the tick. Crash before marking → harmless re-send (the consumer dedupes). Mark before publishing → silent message loss. Always order things so a crash causes a re-send, never a loss.
The consumer is idempotent, keyed on the outbox id. Non-negotiable — the relay is at-least-once by design.

Bonus for AWS folks: on a FIFO SQS queue, set MessageDeduplicationId = str(r.id) and MessageGroupId = aggregate, and you get deduplication plus per-aggregate ordering for free.

Operationally, you'll also want: an alarm on the age of the oldest unsent row (relay down = a backlog you can actually see), a reaper that archives sent rows so the table doesn't grow forever, and a single relay leader (or SELECT … FOR UPDATE SKIP LOCKED) so two relays don't fight over the same rows.

Two ways to build the relay

Polling publisher. A loop or scheduled job runs SELECT * FROM outbox WHERE sent_at IS NULL, publishes, marks sent. Dead simple, works on any database, trivially easy to reason about. Costs a little latency and some read load.

Change data capture (CDC). Instead of polling, tail the database's change log — Postgres WAL, MySQL binlog, DynamoDB Streams — and publish events as rows land. Lower latency, no polling load, more moving parts. On AWS, the native combo is DynamoDB + Streams + Lambda, or RDS → Debezium/DMS → Kinesis or Kafka.

My honest advice: start with polling. Graduate to CDC when latency or load tells you to.

Part 3 — Real Life: Stories from Production

Theory is nice. Here's what this bug actually looks like when it walks into your week.

Scenario 1: The vanishing order

Friday, 6:04 p.m. Sara orders a birthday gift for her dad — express shipping, arrives Tuesday, perfect. Her card is charged. She closes the tab and forgets about it.

Monday evening, she checks for tracking info. Her account shows no orders. Just a charge on her card statement for a thing that, according to the website, she never bought. She opens a support ticket with the energy of someone who has been charged $89 for nothing.

Support checks the admin panel — no order, anywhere. Checks the payment provider — the charge exists, settled, very real money. They escalate to engineering with the worst kind of ticket: "Customer charged, no order in system, please advise."

The engineer pulls the logs from Friday 18:04. And there it is:

18:04:11  payment 8841 saved                          ✅
18:04:11  publish OrderPlaced → connection reset      💥

18:04:11  payment 8841 saved                          ✅
18:04:11  publish OrderPlaced → connection reset      💥

Three milliseconds later, the pod terminated — a routine deploy was rolling the cluster at exactly 18:04. The payment row committed. The event that tells the warehouse "go pack this" died with the pod. The code was our old friend:

db.save(payment)             # committed
queue.publish(order_placed)  # 💥 the deploy killed the pod right here

db.save(payment)             # committed
queue.publish(order_placed)  # 💥 the deploy killed the pod right here

The kicker? Six months earlier, someone had added a retry around that publish. Retries help with network blips — and do absolutely nothing when the process itself dies. (Stage 3 of grief. We've all been there.)

The cleanup: a refund, an apology email, a discount code, twenty minutes of support time, and one engineer's entire Monday. Now multiply that by every deploy, forever.

The fix: one transaction writes the payment row and an outbox row (type: OrderPlaced). The relay publishes it — retrying straight through deploys, crashes, and throttles. Worst case, the event goes out twice, and the fulfillment service dedupes on the order id, so Sara's dad gets exactly one gift. The "charged-but-no-order" ticket genre doesn't get rarer.

It goes extinct.

Scenario 2: The ghost robots — finishing that 2:47 a.m. story

Remember the twenty-three missing robots from the top of this article? Here's the autopsy.

The provisioning Lambda did exactly what you'd expect:

dynamo.update_item(robot_id, status="Provisioning")  # write #1
sqs.send_message(provision_payload)                  # write #2

dynamo.update_item(robot_id, status="Provisioning")  # write #1
sqs.send_message(provision_payload)                  # write #2

About 1 in 10,000 invocations, write #2 failed — an SQS throttle, a transient network hiccup, a Lambda timeout. Write #1 had already committed.

The result: a robot stuck at Provisioning in DynamoDB, with no message in any queue and no worker doing anything. A ghost. 👻 The autoscaler couldn't see it. No alarm fired, because nothing technically errored in a way anyone was watching. At a few hundred thousand provisions a week, "1 in 10,000" wasn't rare — it was a standing appointment. Hence: 2:47 a.m., twenty-three ghosts, and me squinting at a dashboard.

The fix, AWS-native flavor: write only to DynamoDB — the status update (plus an explicit outbox item, if you like) in a single TransactWriteItems call, so everything commits atomically. Then enable DynamoDB Streams on the table and attach a small relay Lambda: for every new Provisioning record, it publishes the provision message to SQS, retrying until SQS acks, with the stream sequence number as the dedup id. The already-idempotent worker absorbs the rare duplicate.

Ghost robots didn't become rare. They became impossible — every committed Provisioning row provably has a message behind it, because the stream guarantees that committed changes get delivered. Add an alarm on the oldest unprocessed stream record, and you can see the relay's health instead of guessing at it.

(Lovely DynamoDB bonus: with Streams, you often don't even need a separate outbox table. The stream is the ordered, durable change log of your writes.)

And the 2:47 a.m. pages? Gone. My phone and I have never been closer.

"Okay, but what about…" — the questions I always get

Every time I explain this pattern, the same hands go up. Let's take them in order.

"Why not a distributed transaction? Two-phase commit exists!" It does — and the systems you actually want to write to never RSVP'd to that party. 2PC needs every participant to speak the same protocol (XA), and SQS doesn't, Kafka doesn't, most managed cloud services don't. Even where it works, 2PC brings a coordinator that's a single point of failure and a blocking protocol that holds locks while it waits — one slow participant and everyone's stuck. 2PC isn't wrong; it's just from an era when both systems lived in the same building.

"Doesn't Kafka have transactions and exactly-once semantics?" Kafka transactions make writes atomic across Kafka — multiple topics, plus consumer offsets. They do not span Kafka and your Postgres. The dual write between your database and Kafka is still right there. Kafka's exactly-once is great; it just solves a different problem.

"Can't I flip it — make the queue the source of truth and update the DB from a consumer?" You can! That's a real pattern (sometimes called listen to yourself): publish the event first, and your own service consumes it to update its database. It's the same trick in a mirror — one authoritative write, one derived write. The tradeoffs: your database is now eventually consistent with your own actions (no read-your-writes), and all validation has to happen before you publish. Outbox keeps the database authoritative; listen-to-yourself makes the log authoritative. Choose based on whose read-your-write story you need.

"What about third-party APIs like Stripe? I can't put Stripe in my transaction." True — but the same idea generalizes beautifully: record the intent first. One transaction writes the order and a pending_charge row. A worker picks up pending charges and calls Stripe with an idempotency key (Stripe supports these natively, for exactly this reason), then marks the row done. Crash anywhere → safe retry, no double charge. The outbox was never really about queues. It's about durable intent + retry until done + idempotency — and that shape works for any external side effect.

"Isn't this over-engineering for my small app?" Maybe! If you do fifty writes a day and a human can reconcile the occasional weirdness, honestly — ship the dual write and leave a TODO. The outbox earns its keep when volume guarantees the gap will be hit, or when an inconsistency costs money, customers, or 3 a.m. pages. The skill isn't "always use the outbox." The skill is knowing the gap exists, so that living with it is a decision instead of a surprise.

Part 4 — Your Playbook

When you actually need this (and when you don't)

Reach for the outbox when:

One action must update a database and emit an event, and divergence between them causes real pain.
You're publishing to SQS, Kafka, or SNS right after a DB commit — the textbook setup.
A lost or phantom event means a real incident: ghost jobs, unshipped orders, missed billing.
You can tolerate the event arriving slightly later, and occasionally twice.

Skip it when:

You only ever write to one system. No dual write, nothing to fix.
The side effect is genuinely best-effort — a lossy analytics ping isn't worth the machinery.
The second system can pull instead of being pushed. If it can just read your database on demand, there is no second write at all.
The second system must be updated synchronously, read-your-write, right now. The outbox is asynchronous by nature.

The deeper point — and the one I'd tattoo on a whiteboard if facilities would let me: the outbox doesn't make two writes atomic. It eliminates the second write by turning it into a derived consequence of the first. If you can design the second write away entirely, that's even better. The outbox is the best answer when the second system genuinely must be pushed.

The pitfalls that bite everyone (me included)

Putting the outbox table in a different database. Now the insert and the business write still can't share a transaction. Congratulations — you've rebuilt the dual write with extra steps.
Marking "sent" before the broker acks. A crash in between, and the event is gone forever. Silently. Publish, wait for the tick, then mark sent.
Forgetting consumer idempotency. Duplicates will arrive. A non-idempotent consumer double-provisions, double-charges, double-emails. Dedupe on the outbox id.
No reaper. The outbox grows unbounded, the "find unsent" query slowly degrades, and the relay quietly falls behind.
Multiple relays brawling over the same rows. Not strictly wrong — consumers dedupe — but wasteful and ordering-hostile. One leader, or FOR UPDATE SKIP LOCKED.
Believing you now have exactly-once delivery. You have at-least-once delivery plus idempotent processing, which behaves like exactly-once end to end. True exactly-once across independent systems doesn't exist. Don't design as if it does.

One more habit: trust, but verify

Even with a flawless outbox, I run one more thing: a boring little reconciler. Every few minutes, it asks one question — "is anything stuck in an in-progress state for longer than it should be?" — and alerts (or re-enqueues) if so.

It's defense in depth. The outbox prevents the failure you know about; the reconciler catches the ones you don't — a bug in the relay itself, a manual data edit, the migration nobody mentioned in standup. It's two lines of SQL and a cron job, and it has saved my weekend more than once.

Belt and suspenders. Distributed systems have a way of pantsing the overconfident.

Try this, this week

Grep your codebase for a database write immediately followed by sqs.send, sns.publish, or an HTTP call. For each hit, finish this sentence: _"If we crash between these two lines, the inconsistent state is __." You're training your eye to see the gap.
Build the toy version. Business table + outbox table + relay loop + a local queue (LocalStack works fine). Kill the relay after it publishes but before it marks the row sent, restart it, and watch the duplicate arrive — then watch your dedup quietly eat it. Seeing at-least-once with your own eyes is the whole lesson.
Add "spot the dual write" to your design-review checklist. Catching this in review costs one comment. Catching it in production costs a 3 a.m. incident and a very awkward postmortem.

The takeaway

Two systems. Two writes. No shared transaction. That's the whole bug — and no ordering, retry, or rollback can paper over it.

So the next time you see db.save() followed by queue.publish(), ask the only question that matters:

"What's the state of the world if we die right here?"

If the answer makes you wince, congratulations — you just spotted a dual write. It took me five years, two production incidents, and one very long night at 2:47 a.m. to learn its name.

It took you one article. 🙂

If this saved your future self a 3 a.m. page, a clap or a follow helps more engineers find it. And I'm genuinely curious — what's the sneakiest dual write hiding in your codebase? The cache one? The email one? Confess in the comments. This is a safe space.