Distributed Transactions: Consistency Across Multiple Systems

Distributed transactions coordinate multiple independent systems to ensure atomicity across all participants. While traditional ACID transactions work within a single database, distributed transactions span services, databases, or message queues.

Distributed Transactions: Consistency Across Multiple Systems

Distributed transactions coordinate multiple independent systems to ensure atomicity across all participants. While traditional ACID transactions work within a single database, distributed transactions span services, databases, message queues, or even different organizations. They ensure that a series of operations across these systems either all succeed or all fail, maintaining data consistency in distributed environments. However, distributed transactions come with significant performance and complexity costs, leading many systems to adopt alternative patterns like sagas or eventual consistency.

To understand distributed transactions properly, it helps to be familiar with distributed systems, ACID transaction properties, and CAP theorem.

Distributed Transactions

What Are Distributed Transactions?

A distributed transaction is a set of operations performed on multiple distinct databases or services that must be atomic, consistent, isolated, and durable (ACID) across all participants. Unlike a local transaction confined to a single database, distributed transactions involve a coordinator that manages the commit protocol across multiple resource managers.

  • Atomicity: All participating systems must commit or all must abort. No partial commits.
  • Consistency: Transaction moves system from one consistent state to another across all participants.
  • Isolation: Concurrent transactions do not interfere with each other.
  • Durability: Committed changes persist despite failures.
  • Coordinator: Central component managing commit protocol and failure recovery.
  • Participants: Resource managers (databases, message queues) that perform work.

Why Distributed Transactions Matter

Modern applications often span multiple databases and services. Without distributed transactions, maintaining consistency becomes extremely difficult.

  • Cross-Database Consistency: Financial systems may need to update multiple databases atomically. Inventory and order systems must stay consistent. Payment processing requires funds transfer between accounts atomically.
  • Microservices Coordination: An order service creates order, payment service processes payment, inventory service reserves stock, and shipping service creates shipment — all must succeed or rollback together.
  • Legacy Integration: New and old systems may need coordinated updates. Non-atomic updates cause data inconsistency.
  • Regulatory Compliance: Financial regulations require atomicity for transfers, audits, chain-of-custody.
Aspect Local Transaction Distributed Transaction
Scope Single database Multiple databases/services
Coordinator Database itself External coordinator
Locking Row/page locks Global locks (coordination)
Performance Fast (milliseconds) Slow (seconds+)
Failure Handling Database recovery log Complex (coordinator logs)
Isolation Database guarantees Weaker (or expensive)
Consistency ACID May use 2PC (ACID) or relax
Complexity Low High

Two-Phase Commit (2PC)

Two-Phase Commit is the classic protocol for achieving atomic distributed transactions. It ensures all participants agree on commit or abort before any actually commits.

Two-Phase Commit protocol 2PC example order creation

Challenges of Two-Phase Commit

  • Blocking Problem: If coordinator fails after sending prepare but before commit, participants wait indefinitely (need manual intervention). Participants cannot decide to commit or abort alone (lack global knowledge).
  • Coordinator Single Point of Failure: If coordinator crashes, transaction may be locked. Participants hold locks waiting for coordinator decision (resource contention). Recovery requires coordinator logs.
  • Performance Overhead: Multiple network round trips across participants. Disk writes for prepare records (extra latency). Locks held across all participants for duration (contention).
  • Scalability Limitations: Coordinator becomes bottleneck for many participants. Latency increases with more participants (multiple prepare timeouts). Not suitable for large-scale microservices.
  • Network Partitions: Under network partition, 2PC may block (no decision). Neither commit nor abort possible. CAP theorem forces choice between availability and consistency.
2PC blocking scenario:
Coordinator fails after sending Prepare (and receiving YES)

State:
  • Participant 1: Prepared (waiting for commit/abort), holding locks
  • Participant 2: Prepared, holding locks
  • Participant 3: Prepared, holding locks

Problem:
  • No participant can decide to commit (others may have voted NO)
  • No participant can abort (others may have already committed)
  • System waits indefinitely for coordinator to recover

Solution: Three-Phase Commit (3PC) or Paxos-based commit

Alternatives to Distributed Transactions

Due to their limitations, many systems avoid distributed transactions and use alternative patterns for consistency.

Pattern Consistency Complexity Use Case
Saga Pattern Eventual Moderate Long-running business transactions
Eventual Consistency + Compensation Weak (eventual) Low Non-critical, tolerant of temporary inconsistency
Idempotent Operations Best effort Low Retryable operations (no side effects)
Distributed Locks / Lease Moderate High Coordination of critical section
Outbox Pattern Eventual Moderate Message delivery with database transaction
Alternatives to 2PC:
Alternative               Consistency       Performance   Complexity
─────────────────────────────────────────────────────────────────────────────
Two-Phase Commit          ACID (strong)     Slow          High
Saga                      Eventual          Fast          Moderate
TCC (Try-Confirm-Cancel)  Strong (with compensation) Fast Moderate
Outbox + Polling          Eventual          Fast          Low
Idempotent Retries        Best effort       Fast          Low

Saga Pattern

The Saga pattern breaks a distributed transaction into a sequence of local transactions, each with a compensating action. If a step fails, compensations undo previous steps.

Saga example (order creation):
Transaction Steps:
  T1: Create order (PENDING)
  T2: Process payment
  T3: Reserve inventory
  T4: Ship order

Compensations:
  C1: Cancel order (delete if still PENDING)
  C2: Refund payment
  C3: Release inventory
  C4: N/A (shipping irreversible, must succeed)

Normal flow:
  T1 → T2 → T3 → T4 → Done

Failure handling:
  If T3 fails:
    Execute C2 → Execute C1 → Done (order failed)
  If T4 fails (shipping fails but payment taken):
    Execute C3 → Execute C2 → Execute C1 → Customer refunded
Saga coordination models:
Choreography-based Saga:
  • No central coordinator
  • Services publish events
  • Each service listens for events and triggers next step/compensation
  • Event-driven (Kafka, RabbitMQ)

Orchestration-based Saga:
  • Central orchestrator service
  • Orchestrator invokes steps sequentially
  • Handles compensation logic centrally
  • Easier to manage, but single point of coordination

Compare:
  Aspect              Choreography          Orchestration
  ─────────────────────────────────────────────────────────
  Decentralization    High                  Low
  Complexity (tech)   High (event tracing)  Moderate
  Coordination        Service-specific      Centralized
  Visibility          Event logs            Orchestrator logs

Try-Confirm-Cancel (TCC) Pattern

TCC is a distributed transaction pattern that reserves resources before confirming, ensuring consistency without long-held locks.

  • Try Phase: Reserve resources (but do not commit). Example: reserve stock in inventory, check and hold funds in payment account.
  • Confirm Phase: Commit reserved resources if all Try phases succeed. No failure expected at this point.
  • Cancel Phase: Release reserved resources if any Try fails. Compensate (release holds).
TCC example (product purchase):
Try Phase:
  Payment Service: Hold $100 from customer account
  Inventory Service: Reserve product stock
  Shipping Service: Reserve shipping slot

Confirm Phase (if all Try succeeded):
  Payment Service: Charge $100 (complete transaction)
  Inventory Service: Reduce stock (commit)
  Shipping Service: Book shipment (confirm)

Cancel Phase (if any Try fails):
  Payment Service: Release hold (no charge)
  Inventory Service: Release stock reservation
  Shipping Service: Release slot

Properties:
  • No long-held locks (reservation only)
  • Confirm phase must succeed (no failures expected)
  • Cancel phase must be idempotent

Distributed Transactions Anti-Patterns

  • Using 2PC Across Many Participants: Coordinator becomes bottleneck, high latency, blocking risk. Use saga for many participants (10+). Use 2PC only for 2-3 participants with low latency requirements.
  • No Idempotent Operations: Retries may cause duplicate operations (double payment). Each operation must be idempotent: can be applied multiple times safely. Use unique request IDs for deduplication.
  • Long-Held Database Locks (2PC): Locks held across all participants during prepare phase (seconds to minutes). Contention kills throughput. Use saga or TCC (reservation instead of lock) for long transactions.
  • No Compensating Transactions: Without compensation, partial failure leaves system inconsistent. Every step must have compensating action (logical rollback). Compensations must be idempotent.
  • Coordinator Without Recovery: Coordinator crashes without logging state. Use transactional logs (durable storage). Implement recovery procedure to resolve in-doubt transactions.
Distributed transaction checklist:
Design Decisions:
□ Can you avoid distributed transaction? (redesign)
□ If needed, choose appropriate pattern:
   - 2PC for low latency, few participants
   - Saga for long-running, many participants
   - TCC for resource reservation
□ All operations idempotent
□ Compensating transactions defined
□ Coordinator logs are durable

Implementation:
□ Unique transaction ID (correlation ID)
□ Timeouts for each phase
□ Retry with backoff
□ Monitoring for hung transactions

Distributed Transaction Best Practices

  • Avoid Distributed Transactions When Possible: Redesign to avoid cross-service transactions (shard data by service boundary). Use eventual consistency if acceptable, or combine services that need ACID into single service.
  • Use Idempotent Operations: Each operation can be safely retried. Use unique request ID for deduplication; store processed IDs in database and check before execution.
  • Prefer Saga over 2PC for Large Scale: Saga provides eventual consistency without locking and coordinator bottleneck. Easier to scale, supports long-running transactions, but requires compensating transactions.
  • Implement Transaction Timeouts: No transaction lasts indefinitely. Participant timeouts (abort if no commit/abort received). Coordinator timeouts (assume abort on participant timeout).
  • Monitor Transaction Health: Track metrics: transaction duration, success/failure rate, compensation execution count. Alert on hung transactions (in-doubt for too long). Log all phases for audit.
  • Use Correlated Identifiers for Tracing: Propagate transaction ID across services. Log transaction ID with each operation. Use distributed tracing for debugging.
Pattern selection guide:
Use Case                               Recommended Pattern
─────────────────────────────────────────────────────────────────────────────
Financial transfer (2-3 participants)  2PC (strong consistency)
E-commerce order (many services)       Saga
Flight + hotel booking (long-running)  Saga
Inventory reservation                  TCC
Update database + send event           Outbox
Cross-database update (no microservices) 2PC (database native)
Low latency, high throughput           Avoid distributed transaction

Popular Distributed Transaction Implementations

Technology Pattern Language Description
XA (JTA, MSSQL DTC) 2PC Java, .NET, databases Standard 2PC across databases
Seata AT, TCC, Saga Java Distributed transaction solution (Alibaba)
Temporal Saga (workflow) Java, Go, TypeScript Workflow engine with saga support
Camunda Saga (BPMN) Java Workflow engine (BPMN for sagas)

Frequently Asked Questions

  1. What is the difference between 2PC and Saga?
    2PC provides strong consistency (ACID) but blocks and has single point of failure. Saga provides eventual consistency, non-blocking, and better scalability, but requires compensating transactions and applications must handle temporary inconsistency.
  2. Are distributed transactions practical in microservices?
    Generally no (2PC not recommended). Microservices favor eventual consistency using saga pattern. 2PC causes coupling, performance bottlenecks, and blocking issues. Most microservices architectures avoid distributed transactions and use event-driven sagas.
  3. What is the difference between TCC and Saga?
    TCC reserves resources first (Try phase) then confirms, ensuring strong consistency with shorter locking. Saga commits immediately and compensates on failure (logical rollback). TCC requires more complex implementation (resource reservation logic). TCC may hold reservations longer than sagas hold locks.
  4. Does the CAP theorem affect distributed transactions?
    Yes. 2PC chooses consistency over availability (CP system). During network partition, 2PC cannot make progress (blocks). This violates availability requirement. Saga chooses availability (AP system) with eventual consistency.
  5. How do idempotent operations help distributed transactions?
    Enables safe retries without duplication. Prevents double-payment, double-inventory deduction. Use unique request ID stored in database, check before processing, and ignore duplicate requests.
  6. What should I learn next after distributed transactions?
    After mastering distributed transactions, explore saga pattern in depth, outbox pattern for reliable messaging, eventual consistency patterns, CQRS for separation of concerns, and event sourcing for auditability.