September 2, 2025
What is a Distributed Transaction?
Last year, we hit an incident where a customer was charged for a cloud subscription upgrade, but their account never reflected the new…
The Abstract Engineer
4 min read
Last year, we hit an incident where a customer was charged for a cloud subscription upgrade, but their account never reflected the new plan. The payment went through our billing provider, the audit logs were written, but the service that updates entitlements failed mid-request. From the customer's perspective, money was gone and nothing changed. From our perspective, the system was inconsistent in three different places.
This is the kind of failure distributed transactions are meant to prevent. In a world where a single workflow touches payment gateways, internal services, and external partners, we need a way to guarantee that either every system reflects the change or none of them do. That's the role of a distributed transaction: to make multi-system operations behave like one atomic unit of work.
The Cost of Getting Distributed Transactions Wrong
Modern applications don't operate within a single database anymore. A single user action, like placing an order, often spans multiple systems: decrementing inventory in one service, charging a payment through a third-party provider, and updating customer records elsewhere. If these steps don't stay in sync, the system drifts into inconsistent states: a customer charged without inventory reserved, or inventory reduced without payment collected. Both are costly failures.
Distributed transactions address this by extending the guarantees we expect from local transactions, atomicity and consistency, across service and network boundaries. They ensure that a business operation either commits everywhere or rolls back everywhere, preserving system integrity even when individual components fail.
Real-World Scenarios
Distributed transactions aren't theoretical, they show up in the systems we rely on every day.
- E-commerce orders: Confirming a purchase requires coordination between payment providers, inventory services, and shipping systems. If one step fails, the transaction must be rolled back or retried in a controlled way.
- Ride-sharing platforms: A single trip involves charging the rider, crediting the driver, updating trip history, and writing regulatory audit logs. Any mismatch here creates disputes and compliance risks.
- Cloud service billing: When a user upgrades their subscription, multiple systems must stay in sync — entitlements, invoices, usage metering, and third-party tax providers. A partial commit can leave customers overbilled or under-provisioned.
In each case, the core challenge is the same: the business treats the action as one thing, but under the hood it spans multiple systems with independent failure modes. Distributed transactions exist to bridge that gap, ensuring the system's external behavior matches the user's expectation of a single, consistent operation.
How Distributed Transactions Work
At their core, distributed transactions rely on coordination protocols that let independent systems act as if they were part of a single database. Two approaches are most common in practice:
Two-Phase Commit (2PC):
- A coordinator asks each participant if they're ready to commit. If all confirm, the coordinator issues a commit command; if any refuse, it instructs everyone to roll back. This provides strong atomicity but at the cost of performance, participants may sit in a "locked" state while waiting for consensus.
Think of a bank transfer between two accounts held in different databases. The coordinator (the banking system) first asks both databases if they can reserve the required updates — debit from account A and credit to account B. If both agree, the system instructs them to commit. If one refuses (e.g., insufficient funds, or a node crash), both sides roll back.
This guarantees correctness, but comes at a cost: while waiting, both accounts may be "locked," blocking other transactions. At scale, that can create contention and latency spikes — exactly why many high-throughput systems avoid pure 2PC.
Consensus Protocols (e.g., Raft, Paxos):
- Instead of a single coordinator, the system relies on agreement among multiple nodes. This makes the system more fault-tolerant (no single point of failure), but it adds complexity and network overhead.
Consensus shows up in systems like etcd, Zookeeper, or Consul, which underpin Kubernetes clusters and service discovery. For example, when Kubernetes schedules a pod, the decision must be written to the cluster's state store. Multiple nodes run Raft to agree on the new state, ensuring that even if one node crashes, the cluster maintains a consistent view.
This avoids a single coordinator bottleneck, but adds more message-passing overhead. In practice, consensus protocols are better suited for small, critical datasets (like cluster metadata) rather than large-scale transactional workflows.
Both methods enforce the same principle: either all participants move forward together, or none do. The implementation details differ, but the goal is always to align system state with business intent, even when failures occur mid-flight.
Trade-offs and Limitations
Distributed transactions provide strong correctness guarantees, but they don't come for free. The moment coordination crosses network boundaries, the system inherits new costs:
- Performance impact: Protocols like 2PC introduce latency because participants must wait on each other. Locks held during coordination can block unrelated work, reducing throughput.
- Operational complexity: Coordinators, consensus groups, and retry logic add moving parts that must be monitored and tuned. A poorly implemented coordinator can become a single point of failure.
- Scalability limits: At small to medium scale, strict atomicity is manageable. But at the scale of global services, coordinating every cross-system operation quickly becomes impractical. Many high-volume platforms choose looser guarantees in exchange for better performance.
- Failure handling: Rolling back a partially completed action isn't always straightforward. Real-world systems often need compensating transactions (e.g., issuing a refund after a failed charge) because true rollback isn't possible once external side effects occur.
The key point: distributed transactions are a powerful tool, but they should be used deliberately. Over-applying them in the name of safety can lead to brittle systems that don't scale.
When to Use Distributed Transactions (and When Not To)
Distributed transactions are worth the cost when correctness is non-negotiable. Banking transfers, compliance logging, healthcare records, telecom billing — these are domains where the system simply cannot tolerate inconsistency. In these cases, the coordination overhead is justified because the alternative is customer harm, regulatory violations, or financial loss.
But many systems don't require such strict guarantees. A shopping cart count being off by one item for a few seconds, or a recommendation service lagging behind recent activity, isn't catastrophic. In fact, forcing strict distributed transactions in these cases usually slows the system down, adds operational risk, and limits scalability.
The industry trend reflects this: mission-critical workflows often rely on distributed transactions or consensus, while everything else moves toward eventual consistency, idempotent retries, and compensating transactions. The art is knowing which category your workflow falls into.
Put simply:
- Use distributed transactions when the cost of inconsistency is higher than the cost of coordination.
- Avoid them when the business can tolerate temporary divergence, and faster, more resilient patterns will do.