Database Replication: Master-Slave and Multi-Master

Database replication copies data from one database server to another. Master-slave replication uses a single primary for writes and multiple replicas for reads. Multi-master replication allows writes on multiple nodes.

Database Replication: Master-Slave and Multi-Master

Database replication is the process of copying data from one database server (the primary) to one or more secondary servers (replicas). Replication serves multiple critical purposes: it improves read performance by distributing queries across multiple servers, provides high availability by allowing failover if the primary fails, and enables disaster recovery by maintaining copies of data in different locations.

Replication is essential for applications that need to scale beyond the capacity of a single database server. As read traffic grows, you can add replicas to handle the load. When the primary server fails, you can promote a replica to become the new primary. To understand replication properly, it is helpful to be familiar with database basics, database transactions, and database sharding.

Database replication overview:

Master-Slave Replication:
     ┌─────────────┐
     │   Master    │ (writes: INSERT, UPDATE, DELETE)
     │  (Primary)  │
     └──────┬──────┘
            │
     ┌──────┴──────────────────────┐
     │                             │
     ▼                             ▼
┌───────────┐                 ┌───────────┐
│  Slave    │                 │  Slave    │ (reads: SELECT)
│ (Replica) │                 │ (Replica) │
└───────────┘                 └───────────┘

Multi-Master Replication:
┌───────────┐     ┌───────────┐
│  Master   │◄───►│  Master   │ (writes allowed on both)
│  Node 1   │     │  Node 2   │
└───────────┘     └───────────┘

What Is Database Replication

Database replication is the process of maintaining multiple copies of the same data across different database servers. Changes made on one server are automatically propagated to other servers. Replication can be synchronous (waiting for replicas to confirm before committing) or asynchronous (committing locally and replicating later).

Primary (Master): The main server that accepts write operations.
Replica (Slave): A copy of the primary that serves read operations.
Replication Lag: The delay between a write on the primary and its appearance on a replica.
Failover: Promoting a replica to primary when the original primary fails.
Replication Topology: How replicas are connected (single-tier, cascading, circular).

Why Database Replication Matters

Replication is a fundamental building block of scalable, highly available database architectures. It provides benefits that single-server databases cannot offer.

Read Scalability: Distribute read queries across multiple replicas. Add more replicas as read traffic grows.
High Availability: If the primary fails, promote a replica to primary. Application downtime is minimized.
Disaster Recovery: Maintain replicas in different geographic regions. Survive region-wide outages.
Backup Offloading: Take backups from a replica instead of the primary, avoiding performance impact.
Analytics Offloading: Run reporting and analytics queries on replicas, keeping the primary responsive for transactions.
Geographic Distribution: Place replicas closer to users in different regions to reduce read latency.

Types of Replication

1. Master-Slave (Primary-Replica) Replication

Master-slave is the most common replication pattern. One server (the master) handles all write operations. One or more slaves replicate data from the master and handle read operations. This pattern is simple and works well for read-heavy workloads.

Master-slave architecture:

Application
     │
     ├──► Writes ──► Master
     │
     └──► Reads ──► Slave 1
              └──► Slave 2
              └──► Slave 3

Benefits:
- Simple to set up and understand
- Writes go to one place (no conflict resolution)
- Easy to add more slaves for read scaling

Drawbacks:
- Master is a single point of failure
- Read replicas may have replication lag
- Writes cannot scale beyond one server

2. Multi-Master Replication

Multi-master replication allows writes on multiple servers. Changes made on any master are replicated to all other masters. This pattern enables write scaling but introduces complexity around conflict resolution.

Multi-master architecture:

Application
     │
     ├──► Master 1 (writes)
     │        │
     │        ▼
     │    Replication
     │        │
     └──► Master 2 (writes)
              │
              ▼
           Replica (optional)

Benefits:
- Write scaling across multiple servers
- High availability (other masters can serve writes)
- Geographic distribution (writes close to users)

Drawbacks:
- Complex conflict resolution
- Risk of data inconsistency
- Slower writes due to synchronization

3. Synchronous vs Asynchronous Replication

Aspect	Synchronous	Asynchronous
Data Durability	No data loss (transaction committed on all nodes)	Possible data loss if master fails before replication
Write Performance	Slower (waits for replicas)	Faster (commits locally)
Availability	Lower (replica failure blocks writes)	Higher (master can continue alone)
Replication Lag	Zero (by definition)	Possible (replicas may be behind)
Best For	Financial, critical data	Most web applications

Replication Methods

Statement-Based Replication

The master logs the actual SQL statements (INSERT, UPDATE, DELETE) and sends them to replicas to execute. Simple but can cause inconsistencies with non-deterministic functions like NOW() or RAND().

-- Master logs this statement
UPDATE users SET last_login = NOW() WHERE user_id = 123;

-- Replica executes same statement
-- NOW() on replica may be different → data inconsistency!

Row-Based Replication

The master logs the actual changed rows (before and after images) rather than the SQL statements. More consistent but generates more data.

-- Master logs changed rows, not the SQL
-- Before: (user_id=123, last_login='2024-01-01 10:00:00')
-- After:  (user_id=123, last_login='2024-01-01 11:00:00')

-- Replica applies the row changes directly
-- Result: Consistent across all nodes

Write-Ahead Log (WAL) Shipping

Used by PostgreSQL. The master streams WAL segments to replicas, which replay them. This is the most efficient and reliable method.

-- PostgreSQL streaming replication
primary_conninfo = 'host=replica1 user=replicator'
primary_slot_name = 'replica1_slot'

-- Replica continuously applies WAL from primary
-- Data is always consistent (physical replication)

Popular Database Replication Implementations

Database	Replication Type	Configuration
PostgreSQL	Streaming replication (physical or logical)	hot_standby = on, primary_conninfo
MySQL	Asynchronous master-slave, Group Replication (multi-master)	CHANGE MASTER TO, gtid_mode=ON
MongoDB	Replica sets (automatic failover)	rs.initiate(), rs.add()
Redis	Master-replica with sentinel for failover	replicaof, sentinel monitor
Cassandra	Multi-master (peer-to-peer)	Replication factor, snitch configuration

Setting Up Replication

PostgreSQL Streaming Replication Example

-- On Master (postgresql.conf)
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
hot_standby = on

-- Create replication user
CREATE USER replicator WITH REPLICATION PASSWORD 'secure_password';

-- On Replica (pg_basebackup)
pg_basebackup -h master_host -U replicator -D /var/lib/postgresql/data -Fp -Xs -R

-- Replica automatically connects and starts replication

MySQL Replication Example

-- On Master (my.cnf)
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = myapp

-- Create replication user
CREATE USER 'replicator'@'%' IDENTIFIED BY 'secure_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';

-- On Replica
CHANGE MASTER TO
  MASTER_HOST='master_host',
  MASTER_USER='replicator',
  MASTER_PASSWORD='secure_password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=0;

START SLAVE;
SHOW SLAVE STATUS\G

Handling Replication Lag

Replication lag is the delay between a write on the primary and its appearance on a replica. High lag can cause stale reads and consistency issues.

Monitoring replication lag:

-- PostgreSQL
SELECT client_addr, state, replay_lag, flush_lag
FROM pg_stat_replication;

-- MySQL
SHOW SLAVE STATUS\G
-- Look for: Seconds_Behind_Master

-- MongoDB
rs.printSlaveReplicationInfo();

Causes of replication lag and solutions:

Cause: Large write transactions
Solution: Break into smaller transactions

Cause: Slow replica hardware
Solution: Upgrade replica hardware, use same specs as master

Cause: Long-running queries on replica
Solution: Move reporting queries to dedicated replicas

Cause: Network latency
Solution: Co-locate replicas, use faster network

Failover Strategies

When the primary fails, you need to promote a replica to become the new primary. Failover can be manual or automatic.

Manual failover (PostgreSQL):

-- On the replica (promote to primary)
pg_ctl promote

-- Replica is now primary, accepts writes
-- Update application connection strings

-- Former primary (after recovery) becomes replica
-- Rebuild as replica of the new primary

Automatic failover with Patroni (PostgreSQL):

# Patroni configuration (YAML)
scope: postgres
name: postgresql0

restapi:
  listen: 0.0.0.0:8008

etcd:
  host: 10.0.0.1:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

Common Replication Mistakes to Avoid

Ignoring Replication Lag: Applications assume replicas are always up to date. Read-your-writes consistency requires reading from the primary.
Unmonitored Replication: Replication can break silently. Monitor lag and replication status continuously.
Same Hardware for All Replicas: Use consistent hardware. Slow replicas increase lag.
No Failover Testing: Test failover regularly. The first failover should not be during an actual outage.
Mixed Storage Engines: Different MySQL storage engines on master and slave cause replication errors.
Not Using Replication for Backups: Backups on the primary cause performance impact. Use a replica.

Replication Best Practices

Monitor Replication Lag: Set up alerts for high lag. Investigate and resolve lag promptly.
Use Read-Write Splitting: Route writes to primary, reads to replicas. Use connection pools or proxies.
Test Failover Regularly: Automate failover testing. Ensure promotion works and data is consistent.
Keep Replicas Read-Only: Prevent accidental writes to replicas. Set read_only = ON.
Use at Least Two Replicas: One replica can fail, leaving you without redundancy. Use multiple replicas.
Place Replicas Geographically: For disaster recovery, put replicas in different regions.

Frequently Asked Questions

What is the difference between replication and sharding?
Replication copies the same data to multiple servers (for read scaling and availability). Sharding partitions data across different servers (for write scaling). They are often used together.
How many replicas do I need?
At minimum, one replica for redundancy. For high availability, three replicas allow one to fail while still having quorum. For read scaling, add replicas until read latency is acceptable.
What is the difference between logical and physical replication?
Physical replication copies raw disk blocks (byte-for-byte identical). Logical replication copies changes at the SQL level, allowing selective replication and different schemas.
Can replication cause data loss?
Asynchronous replication can lose data if the master fails before replicating recent writes. Synchronous replication prevents data loss but impacts performance.
What is a replication slot?
A replication slot (PostgreSQL) ensures that the master retains WAL segments until all replicas have received them. Prevents replicas from falling too far behind.
What should I learn next after database replication?
After mastering replication, explore database sharding, high availability architectures, load balancing, and disaster recovery for complete database scalability and reliability.

Database Replication: Master-Slave and Multi-Master