
Distributed Transactions in Microservices: Why Consistency Gets Complicated Fast
Short description:
Distributed transactions sound straightforward in theory: multiple services should either succeed together or fail together. In practice, things become messy very quickly. Network failures, retries, partial commits, and service crashes make consistency one of the hardest problems in distributed systems. This post takes a deep dive into distributed transactions, why traditional approaches struggle in microservices, and how Two-Phase Commit and Saga patterns behave in real production systems.
The Monolith Advantage Nobody Appreciates Enough
In a monolith, transactions feel easy.
You update multiple tables, wrap everything inside a database transaction, and either commit or roll back.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
The database guarantees atomicity.
If something fails halfway through, everything rolls back automatically.
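To make the rollback behavior concrete, here is a minimal sketch using Python's built-in sqlite3 module. The transfer function, the in-memory accounts table, and the insufficient-funds check are assumptions added for illustration, not part of any particular system.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts inside one local transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if row[0] < 0:
                # Fail halfway through: the debit above has already run.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    rows = conn.execute("SELECT id, balance FROM accounts ORDER BY id")
    return dict(rows.fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 150), (2, 0)])
conn.commit()

ok_first = transfer(conn, 1, 2, 100)   # enough money, both updates commit
ok_second = transfer(conn, 1, 2, 100)  # fails after the debit, which is undone
```

The second transfer fails after the debit has executed, yet the balances are unchanged afterward. The database did the cleanup, not the application.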
Most engineers grow up with this mental model.
Then microservices arrive.
Why Transactions Become Hard in Microservices
Microservices intentionally split data ownership across services.
Each service has:
Its own database
Its own deployment lifecycle
Its own failure modes
This improves scalability and team independence.
But it destroys the simplicity of local database transactions.
Now imagine a checkout flow:
Order Service creates order
Payment Service charges customer
Inventory Service reserves stock
Notification Service sends confirmation
What happens if payment succeeds but inventory fails?
You no longer have a single database transaction protecting consistency.
You have a distributed systems problem.
The Core Problem: Partial Success
Distributed transactions are difficult because partial success is normal.
In distributed systems:
Networks fail
Services restart
Requests time out
Messages arrive late
The dangerous state is not total failure.
The dangerous state is when half the system thinks the operation succeeded and the other half thinks it failed.
This is where consistency breaks.
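A small simulation shows how easily this state appears. The two service classes and the naive_checkout function below are hypothetical stand-ins, each keeping its own in-memory store in place of a real database.

```python
class PaymentService:
    def __init__(self):
        self.charges = {}  # the payment service's own data store

    def charge(self, order_id, amount):
        self.charges[order_id] = amount  # committed locally, right away


class InventoryService:
    def __init__(self, stock):
        self.stock = stock
        self.reserved = {}  # the inventory service's own data store

    def reserve(self, order_id, qty):
        if qty > self.stock:
            raise RuntimeError("out of stock")
        self.stock -= qty
        self.reserved[order_id] = qty


def naive_checkout(order_id, amount, qty, payments, inventory):
    # Two independent commits with nothing tying them together.
    payments.charge(order_id, amount)
    inventory.reserve(order_id, qty)


payments = PaymentService()
inventory = InventoryService(stock=1)

try:
    naive_checkout("order-1", amount=100, qty=2,
                   payments=payments, inventory=inventory)
    outcome = "ok"
except RuntimeError:
    outcome = "failed"
```

The caller sees a failure, but the charge is still recorded in the payment service. Nothing rolls it back automatically.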
The Two Main Approaches
Modern distributed systems usually address consistency with one of two patterns:
Two-Phase Commit (2PC)
Saga Pattern
Both attempt to coordinate changes across services.
Both involve trade-offs.
Neither is perfect.
Two-Phase Commit (2PC): The Traditional Approach
Two-Phase Commit tries to preserve strong consistency across distributed systems.
It works using a coordinator.
The flow looks like this:
Step 1: Prepare Phase
Coordinator asks all services:
"Can you commit?"
Step 2: Commit Phase
If all say YES:
"Commit transaction"
Else:
"Rollback transaction"
This sounds elegant.
And under ideal conditions, it works.
How 2PC Works Internally
During the prepare phase, each participant:
Executes the transaction locally
Locks required resources
Waits for coordinator decision
Nothing is fully committed yet.
Then the coordinator decides:
If all participants are ready → commit
If even one fails or does not respond → roll back
As long as the coordinator and every participant eventually recover, this gives atomicity across services.
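The prepare and commit phases can be sketched in a few lines. This is a toy in-memory model, not a production protocol: the Participant class and two_phase_commit function are hypothetical names, and a real coordinator would also write its decision to durable storage before telling anyone.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"  # idle -> prepared -> committed, or aborted

    def prepare(self):
        # A real participant would execute the work locally and take locks here.
        if not self.can_commit:
            self.state = "aborted"
            return False
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every participant to vote.
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"

    for p in participants:
        p.rollback()
    return "aborted"


happy = [Participant("payment"), Participant("inventory")]
happy_result = two_phase_commit(happy)

unhappy = [Participant("payment"), Participant("inventory", can_commit=False)]
unhappy_result = two_phase_commit(unhappy)
```

In the second run, the payment participant had already voted yes and was prepared, but a single no vote from inventory aborts everyone.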
Why 2PC Looks Great on Whiteboards
2PC provides:
Strong consistency
Clear transactional guarantees
Predictable rollback behavior
From a business perspective, this is attractive.
Especially in financial systems, consistency matters deeply.
The Real Problems With Two-Phase Commit
The problems appear under failure.
1. Blocking Behavior
During the prepare phase, participants lock resources.
If the coordinator crashes before sending commit or rollback, participants remain blocked waiting for instructions.
This creates:
Stuck transactions
Resource contention
Reduced throughput
In high-scale systems, this becomes dangerous quickly.
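Here is a rough sketch of what blocking looks like from the participant's side. The LockingParticipant class and the crash_before_decision flag are invented for this illustration. The crash is simulated by simply returning before any decision is sent.

```python
class LockingParticipant:
    def __init__(self):
        self.locked_rows = set()
        self.state = "idle"

    def prepare(self, row):
        if row in self.locked_rows:
            return False  # someone else is still holding this row
        self.locked_rows.add(row)
        self.state = "prepared"
        return True

    def finish(self, row):
        # Called on commit or rollback: release the lock either way.
        self.locked_rows.discard(row)
        self.state = "idle"


def coordinator_run(participant, row, crash_before_decision=False):
    if not participant.prepare(row):
        return "rejected"
    if crash_before_decision:
        # Simulated crash: no commit or rollback message is ever sent.
        return "coordinator crashed"
    participant.finish(row)
    return "committed"


p = LockingParticipant()
first = coordinator_run(p, "account:1", crash_before_decision=True)
second = coordinator_run(p, "account:1")  # a later, unrelated transaction
```

The participant cannot safely decide on its own, so the lock stays held and the second transaction is turned away until someone resolves the first.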
2. Coordinator Becomes a Single Point of Failure
The coordinator controls transaction state.
If it becomes unavailable, the entire transaction pipeline suffers.
Even with replication, complexity increases significantly.
3. Poor Scalability
2PC performs poorly in highly distributed environments.
Why?
Multiple synchronous network round trips
Long-lived locks
Cross-service coordination overhead
Latency compounds rapidly.
4. Availability Suffers
2PC prioritizes consistency over availability.
Under network partitions, systems often pause instead of risking inconsistent state.
In CAP theorem terms, that makes 2PC a CP design.
The Industry Shift Toward Sagas
Because of these limitations, many microservice architectures moved toward eventual consistency.
This is where the Saga pattern became popular.
Saga Pattern: Distributed Transactions Through Compensation
Instead of one large atomic transaction, a Saga breaks the workflow into smaller local transactions.
Each service commits independently.
If something fails later, compensating actions undo previous steps.
Order Created
|
v
Payment Processed
|
v
Inventory Reserved
|
v
Notification Sent
If inventory reservation fails:
Compensation:
Refund Payment
Cancel Order
This fundamentally changes how consistency is handled.
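The flow above can be sketched as a list of steps, each paired with its compensation. The run_saga function and the step functions are hypothetical, and the shared state dict stands in for what would be separate service databases. Compensations run in reverse order of completion.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns (status, log)."""
    log, completed = [], []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            # Undo everything that already committed, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name} compensated")
            return "compensated", log
        completed.append((name, compensate))
        log.append(f"{name} done")
    return "completed", log


state = {"order": None, "payment": None}

def create_order():
    state["order"] = "created"

def cancel_order():
    state["order"] = "cancelled"

def charge_payment():
    state["payment"] = "charged"

def refund_payment():
    state["payment"] = "refunded"

def reserve_inventory():
    raise RuntimeError("out of stock")  # simulated failure

def release_inventory():
    pass

status, log = run_saga([
    ("order", create_order, cancel_order),
    ("payment", charge_payment, refund_payment),
    ("inventory", reserve_inventory, release_inventory),
])
```

Note that the order and payment really were committed for a moment. The saga does not hide that. It corrects it afterward.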
The Core Philosophy Behind Sagas
Sagas accept that distributed systems fail.
Instead of preventing partial success, they embrace it and recover afterward.
This trades immediate consistency for resilience and scalability.
Two Saga Models
Sagas are generally implemented in two ways.
1. Choreography-Based Saga
Services communicate through events.
Order Service
|
v
OrderCreated Event
|
v
Payment Service
|
v
PaymentProcessed Event
No central coordinator exists.
Each service reacts independently.
Advantages
Loosely coupled
Highly scalable
No central bottleneck
Disadvantages
Harder debugging
Complex event chains
Difficult observability
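A choreography can be sketched with a tiny in-memory event bus. Everything here is an assumption made for illustration: the EventBus class, the handler names, the extra InventoryFailed and PaymentRefunded events, and the rule that quantities above 5 are out of stock. A real system would use a message broker and deliver events asynchronously.

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


bus = EventBus()

def payment_service(order):
    # Reacts to OrderCreated, then announces its own result.
    bus.publish("PaymentProcessed", order)

def inventory_service(order):
    # Reacts to PaymentProcessed.
    if order["qty"] > 5:
        bus.publish("InventoryFailed", order)
    else:
        bus.publish("InventoryReserved", order)

def payment_refund_handler(order):
    # The payment service also listens for failures and compensates.
    bus.publish("PaymentRefunded", order)

bus.subscribe("OrderCreated", payment_service)
bus.subscribe("PaymentProcessed", inventory_service)
bus.subscribe("InventoryFailed", payment_refund_handler)

bus.publish("OrderCreated", {"id": "order-1", "qty": 10})
```

No single function describes the whole workflow. You can only reconstruct it by reading the subscriptions, which is exactly the debugging cost listed above.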
2. Orchestration-Based Saga
A central orchestrator controls workflow execution.
Saga Orchestrator
|
+--> Payment Service
+--> Inventory Service
+--> Notification Service
The orchestrator tracks state and triggers compensations.
Advantages
Easier observability
Centralized control flow
Simpler debugging
Disadvantages
More coupling
Coordinator complexity
Potential bottleneck
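For comparison, here is a minimal orchestrator sketch. The SagaOrchestrator and FakeService classes are hypothetical, and the state dict would be a durable store in a real system so the orchestrator can resume after a restart.

```python
class FakeService:
    """Stand-in for a remote service; records the calls it receives."""
    def __init__(self, succeeds=True):
        self.succeeds = succeeds
        self.calls = []

    def execute(self, order):
        self.calls.append("execute")
        return self.succeeds

    def compensate(self, order):
        self.calls.append("compensate")


class SagaOrchestrator:
    def __init__(self, services):
        self.services = services  # ordered list of (name, service)
        self.state = {}           # saga_id -> current status

    def execute(self, saga_id, order):
        done = []
        for name, service in self.services:
            self.state[saga_id] = f"running:{name}"
            if service.execute(order):
                done.append(service)
                continue
            # A step failed: compensate finished steps, newest first.
            self.state[saga_id] = f"compensating:{name}"
            for finished in reversed(done):
                finished.compensate(order)
            self.state[saga_id] = "compensated"
            return self.state[saga_id]
        self.state[saga_id] = "completed"
        return self.state[saga_id]


payment = FakeService()
inventory = FakeService(succeeds=False)
notification = FakeService()

orchestrator = SagaOrchestrator([
    ("payment", payment),
    ("inventory", inventory),
    ("notification", notification),
])
result = orchestrator.execute("saga-1", {"id": "order-1"})
```

The whole workflow is readable in one place, and the status of any saga can be looked up by ID. The price is that the orchestrator now knows about every service.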
The Hidden Complexity of Compensation
Compensation sounds simple in diagrams.
Reality is harder.
Some operations are difficult or impossible to reverse:
Emails already sent
External bank transfers
SMS notifications
Compensation often means “business correction,” not true rollback.
This distinction matters enormously.
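One way to picture "business correction" is an append-only ledger, sketched below under the assumption that charges can never be deleted once recorded. The ledger list and the charge, refund, and balance functions are hypothetical.

```python
ledger = []  # append-only: entries are never edited or deleted

def charge(order_id, amount):
    ledger.append({"order": order_id, "type": "charge", "amount": amount})

def balance(order_id):
    return sum(e["amount"] for e in ledger if e["order"] == order_id)

def refund(order_id):
    """Compensate by recording an offsetting entry, not by erasing history."""
    net = balance(order_id)
    if net > 0:
        ledger.append({"order": order_id, "type": "refund", "amount": -net})

charge("order-1", 100)
refund("order-1")
```

The net balance is back to zero, but the ledger now holds two entries. The charge happened, and the system remembers that it happened.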
Idempotency Becomes Mandatory
Saga systems rely heavily on retries.
Messages may be delivered multiple times.
Services must handle duplicate requests safely.
Without idempotency:
Payments may double-charge
Inventory may over-reserve
Notifications may duplicate
Idempotency is not optional in Saga-based systems.
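A common way to get there is an idempotency key supplied by the caller. The sketch below assumes a hypothetical IdempotentPaymentService that remembers results in memory. A real service would keep the keys in durable storage and record them in the same local transaction as the charge.

```python
class IdempotentPaymentService:
    def __init__(self):
        self.processed = {}     # idempotency key -> stored result
        self.total_charged = 0

    def charge(self, idempotency_key, amount):
        # A repeated key means a retry: return the earlier result unchanged.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]

        self.total_charged += amount
        result = {"status": "charged", "amount": amount}
        self.processed[idempotency_key] = result
        return result


service = IdempotentPaymentService()
first = service.charge("order-1:payment", 100)
retry = service.charge("order-1:payment", 100)  # duplicate delivery
```

The retry returns the same result as the first call, and the customer is charged once.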
Observability Is Much Harder Than Traditional Transactions
In monoliths, a transaction is visible inside one database.
In distributed systems, a transaction spans:
Multiple services
Multiple queues
Multiple databases
Tracing becomes significantly harder.
Mature systems use:
Distributed tracing
Correlation IDs
Centralized event logging
Without observability, debugging distributed transactions becomes nearly impossible.
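Correlation IDs are the simplest of these to sketch. In the example below, the three service functions and the event_log list are hypothetical stand-ins for real services and a centralized log store. The only point is that the same ID travels with every hop.

```python
import uuid

event_log = []  # stand-in for centralized logging

def log_event(service, message, correlation_id):
    event_log.append({
        "service": service,
        "message": message,
        "correlation_id": correlation_id,
    })

def order_service(order):
    # The first service in the flow creates the ID.
    correlation_id = str(uuid.uuid4())
    log_event("order", "order created", correlation_id)
    payment_service(order, correlation_id)
    return correlation_id

def payment_service(order, correlation_id):
    # Every downstream call passes the same ID along.
    log_event("payment", "payment processed", correlation_id)
    inventory_service(order, correlation_id)

def inventory_service(order, correlation_id):
    log_event("inventory", "stock reserved", correlation_id)

def trace(correlation_id):
    """Reassemble one transaction's path from the shared log."""
    return [e["service"] for e in event_log
            if e["correlation_id"] == correlation_id]

first_id = order_service({"id": "order-1"})
second_id = order_service({"id": "order-2"})
```

Even with both orders interleaved in one log, filtering by ID recovers each transaction's path across services.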
When Two-Phase Commit Makes Sense
Despite its problems, 2PC is not obsolete.
It still makes sense when:
Strong consistency is mandatory
Transaction volume is relatively low
Participants are tightly controlled
Financial settlement systems are common examples.
When Sagas Make More Sense
Sagas work well when:
High scalability matters
Temporary inconsistency is acceptable
Services are loosely coupled
This is why Sagas are so common in modern microservice architectures.
The Most Important Mindset Shift
The biggest lesson in distributed transactions is this:
Consistency is not free.
Every consistency guarantee introduces trade-offs:
Latency
Availability
Operational complexity
The real engineering challenge is choosing which trade-offs your business can tolerate.
Final Thoughts
Distributed transactions are one of the clearest examples of why distributed systems are fundamentally different from traditional application development.
Two-Phase Commit gives stronger guarantees but struggles with scalability and availability.
Sagas improve resilience and scalability but introduce eventual consistency and compensation complexity.
Neither pattern is universally better.
The right choice depends entirely on system requirements, business guarantees, and operational realities.
And once systems scale, understanding those trade-offs becomes more important than the implementation itself.