Distributed Transactions: Consistency Across Multiple Systems
Distributed transactions coordinate multiple independent systems to ensure atomicity across all participants. While traditional ACID transactions work within a single database, distributed transactions span services, databases, or message queues.
Distributed Transactions: Consistency Across Multiple Systems
Distributed transactions coordinate multiple independent systems to ensure atomicity across all participants. While traditional ACID transactions work within a single database, distributed transactions span services, databases, message queues, or even different organizations. They ensure that a series of operations across these systems either all succeed or all fail, maintaining data consistency in distributed environments. However, distributed transactions come with significant performance and complexity costs, leading many systems to adopt alternative patterns like sagas or eventual consistency.
To understand distributed transactions properly, it helps to be familiar with distributed systems, ACID transaction properties, and CAP theorem.
What Are Distributed Transactions?
A distributed transaction is a set of operations performed on multiple distinct databases or services that must be atomic, consistent, isolated, and durable (ACID) across all participants. Unlike a local transaction confined to a single database, distributed transactions involve a coordinator that manages the commit protocol across multiple resource managers.
- Atomicity: All participating systems must commit or all must abort. No partial commits.
- Consistency: Transaction moves system from one consistent state to another across all participants.
- Isolation: Concurrent transactions do not interfere with each other.
- Durability: Committed changes persist despite failures.
- Coordinator: Central component managing commit protocol and failure recovery.
- Participants: Resource managers (databases, message queues) that perform work.
Why Distributed Transactions Matter
Modern applications often span multiple databases and services. Without distributed transactions, maintaining consistency becomes extremely difficult.
- Cross-Database Consistency: Financial systems may need to update multiple databases atomically. Inventory and order systems must stay consistent. Payment processing requires funds transfer between accounts atomically.
- Microservices Coordination: An order service creates order, payment service processes payment, inventory service reserves stock, and shipping service creates shipment — all must succeed or rollback together.
- Legacy Integration: New and old systems may need coordinated updates. Non-atomic updates cause data inconsistency.
- Regulatory Compliance: Financial regulations require atomicity for transfers, audits, chain-of-custody.
| Aspect | Local Transaction | Distributed Transaction |
|---|---|---|
| Scope | Single database | Multiple databases/services |
| Coordinator | Database itself | External coordinator |
| Locking | Row/page locks | Global locks (coordination) |
| Performance | Fast (milliseconds) | Slow (seconds+) |
| Failure Handling | Database recovery log | Complex (coordinator logs) |
| Isolation | Database guarantees | Weaker (or expensive) |
| Consistency | ACID | May use 2PC (ACID) or relax |
| Complexity | Low | High |
Two-Phase Commit (2PC)
Two-Phase Commit is the classic protocol for achieving atomic distributed transactions. It ensures all participants agree on commit or abort before any actually commits.
Challenges of Two-Phase Commit
- Blocking Problem: If coordinator fails after sending prepare but before commit, participants wait indefinitely (need manual intervention). Participants cannot decide to commit or abort alone (lack global knowledge).
- Coordinator Single Point of Failure: If coordinator crashes, transaction may be locked. Participants hold locks waiting for coordinator decision (resource contention). Recovery requires coordinator logs.
- Performance Overhead: Multiple network round trips across participants. Disk writes for prepare records (extra latency). Locks held across all participants for duration (contention).
- Scalability Limitations: Coordinator becomes bottleneck for many participants. Latency increases with more participants (multiple prepare timeouts). Not suitable for large-scale microservices.
- Network Partitions: Under network partition, 2PC may block (no decision). Neither commit nor abort possible. CAP theorem forces choice between availability and consistency.
Coordinator fails after sending Prepare (and receiving YES)
State:
• Participant 1: Prepared (waiting for commit/abort), holding locks
• Participant 2: Prepared, holding locks
• Participant 3: Prepared, holding locks
Problem:
• No participant can decide to commit (others may have voted NO)
• No participant can abort (others may have already committed)
• System waits indefinitely for coordinator to recover
Solution: Three-Phase Commit (3PC) or Paxos-based commit
Alternatives to Distributed Transactions
Due to their limitations, many systems avoid distributed transactions and use alternative patterns for consistency.
| Pattern | Consistency | Complexity | Use Case |
|---|---|---|---|
| Saga Pattern | Eventual | Moderate | Long-running business transactions | Eventual Consistency + Compensation | Weak (eventual) | Low | Non-critical, tolerant of temporary inconsistency |
| Idempotent Operations | Best effort | Low | Retryable operations (no side effects) |
| Distributed Locks / Lease | Moderate | High | Coordination of critical section |
| Outbox Pattern | Eventual | Moderate | Message delivery with database transaction |
Alternative Consistency Performance Complexity
─────────────────────────────────────────────────────────────────────────────
Two-Phase Commit ACID (strong) Slow High
Saga Eventual Fast Moderate
TCC (Try-Confirm-Cancel) Strong (with compensation) Fast Moderate
Outbox + Polling Eventual Fast Low
Idempotent Retries Best effort Fast Low
Saga Pattern
The Saga pattern breaks a distributed transaction into a sequence of local transactions, each with a compensating action. If a step fails, compensations undo previous steps.
Transaction Steps:
T1: Create order (PENDING)
T2: Process payment
T3: Reserve inventory
T4: Ship order
Compensations:
C1: Cancel order (delete if still PENDING)
C2: Refund payment
C3: Release inventory
C4: N/A (shipping irreversible, must succeed)
Normal flow:
T1 → T2 → T3 → T4 → Done
Failure handling:
If T3 fails:
Execute C2 → Execute C1 → Done (order failed)
If T4 fails (shipping fails but payment taken):
Execute C3 → Execute C2 → Execute C1 → Customer refunded
Choreography-based Saga:
• No central coordinator
• Services publish events
• Each service listens for events and triggers next step/compensation
• Event-driven (Kafka, RabbitMQ)
Orchestration-based Saga:
• Central orchestrator service
• Orchestrator invokes steps sequentially
• Handles compensation logic centrally
• Easier to manage, but single point of coordination
Compare:
Aspect Choreography Orchestration
─────────────────────────────────────────────────────────
Decentralization High Low
Complexity (tech) High (event tracing) Moderate
Coordination Service-specific Centralized
Visibility Event logs Orchestrator logs
Try-Confirm-Cancel (TCC) Pattern
TCC is a distributed transaction pattern that reserves resources before confirming, ensuring consistency without long-held locks.
- Try Phase: Reserve resources (but do not commit). Example: reserve stock in inventory, check and hold funds in payment account.
- Confirm Phase: Commit reserved resources if all Try phases succeed. No failure expected at this point.
- Cancel Phase: Release reserved resources if any Try fails. Compensate (release holds).
Try Phase:
Payment Service: Hold $100 from customer account
Inventory Service: Reserve product stock
Shipping Service: Reserve shipping slot
Confirm Phase (if all Try succeeded):
Payment Service: Charge $100 (complete transaction)
Inventory Service: Reduce stock (commit)
Shipping Service: Book shipment (confirm)
Cancel Phase (if any Try fails):
Payment Service: Release hold (no charge)
Inventory Service: Release stock reservation
Shipping Service: Release slot
Properties:
• No long-held locks (reservation only)
• Confirm phase must succeed (no failures expected)
• Cancel phase must be idempotent
Distributed Transactions Anti-Patterns
- Using 2PC Across Many Participants: Coordinator becomes bottleneck, high latency, blocking risk. Use saga for many participants (10+). Use 2PC only for 2-3 participants with low latency requirements.
- No Idempotent Operations: Retries may cause duplicate operations (double payment). Each operation must be idempotent: can be applied multiple times safely. Use unique request IDs for deduplication.
- Long-Held Database Locks (2PC): Locks held across all participants during prepare phase (seconds to minutes). Contention kills throughput. Use saga or TCC (reservation instead of lock) for long transactions.
- No Compensating Transactions: Without compensation, partial failure leaves system inconsistent. Every step must have compensating action (logical rollback). Compensations must be idempotent.
- Coordinator Without Recovery: Coordinator crashes without logging state. Use transactional logs (durable storage). Implement recovery procedure to resolve in-doubt transactions.
Design Decisions:
□ Can you avoid distributed transaction? (redesign)
□ If needed, choose appropriate pattern:
- 2PC for low latency, few participants
- Saga for long-running, many participants
- TCC for resource reservation
□ All operations idempotent
□ Compensating transactions defined
□ Coordinator logs are durable
Implementation:
□ Unique transaction ID (correlation ID)
□ Timeouts for each phase
□ Retry with backoff
□ Monitoring for hung transactions
Distributed Transaction Best Practices
- Avoid Distributed Transactions When Possible: Redesign to avoid cross-service transactions (shard data by service boundary). Use eventual consistency if acceptable, or combine services that need ACID into single service.
- Use Idempotent Operations: Each operation can be safely retried. Use unique request ID for deduplication; store processed IDs in database and check before execution.
- Prefer Saga over 2PC for Large Scale: Saga provides eventual consistency without locking and coordinator bottleneck. Easier to scale, supports long-running transactions, but requires compensating transactions.
- Implement Transaction Timeouts: No transaction lasts indefinitely. Participant timeouts (abort if no commit/abort received). Coordinator timeouts (assume abort on participant timeout).
- Monitor Transaction Health: Track metrics: transaction duration, success/failure rate, compensation execution count. Alert on hung transactions (in-doubt for too long). Log all phases for audit.
- Use Correlated Identifiers for Tracing: Propagate transaction ID across services. Log transaction ID with each operation. Use distributed tracing for debugging.
Use Case Recommended Pattern
─────────────────────────────────────────────────────────────────────────────
Financial transfer (2-3 participants) 2PC (strong consistency)
E-commerce order (many services) Saga
Flight + hotel booking (long-running) Saga
Inventory reservation TCC
Update database + send event Outbox
Cross-database update (no microservices) 2PC (database native)
Low latency, high throughput Avoid distributed transaction
Popular Distributed Transaction Implementations
| Technology | Pattern | Language | Description |
|---|---|---|---|
| XA (JTA, MSSQL DTC) | 2PC | Java, .NET, databases | Standard 2PC across databases |
| Seata | AT, TCC, Saga | Java | Distributed transaction solution (Alibaba) |
| Temporal | Saga (workflow) | Java, Go, TypeScript | Workflow engine with saga support |
| Camunda | Saga (BPMN) | Java | Workflow engine (BPMN for sagas) |
Frequently Asked Questions
- What is the difference between 2PC and Saga?
2PC provides strong consistency (ACID) but blocks and has single point of failure. Saga provides eventual consistency, non-blocking, and better scalability, but requires compensating transactions and applications must handle temporary inconsistency. - Are distributed transactions practical in microservices?
Generally no (2PC not recommended). Microservices favor eventual consistency using saga pattern. 2PC causes coupling, performance bottlenecks, and blocking issues. Most microservices architectures avoid distributed transactions and use event-driven sagas. - What is the difference between TCC and Saga?
TCC reserves resources first (Try phase) then confirms, ensuring strong consistency with shorter locking. Saga commits immediately and compensates on failure (logical rollback). TCC requires more complex implementation (resource reservation logic). TCC may hold reservations longer than sagas hold locks. - Does the CAP theorem affect distributed transactions?
Yes. 2PC chooses consistency over availability (CP system). During network partition, 2PC cannot make progress (blocks). This violates availability requirement. Saga chooses availability (AP system) with eventual consistency. - How do idempotent operations help distributed transactions?
Enables safe retries without duplication. Prevents double-payment, double-inventory deduction. Use unique request ID stored in database, check before processing, and ignore duplicate requests. - What should I learn next after distributed transactions?
After mastering distributed transactions, explore saga pattern in depth, outbox pattern for reliable messaging, eventual consistency patterns, CQRS for separation of concerns, and event sourcing for auditability.
