Replication

Ensuring High Availability and Scalability

Database Replication is the process of copying data from one database to another to ensure redundancy, improve availability, and scale read operations.

Why Replication?
================

Without Replication:

+----------+
| Database |  (Single point of failure)
|   DB1    |
+----------+

With Replication:

+----------+     +----------+
| Primary  | --> | Replica  |
|    DB    |     |    DB    |
+----------+     +----------+

Benefits:

- High availability
- Fault tolerance
- Read scalability
- Geographic distribution
- Backup offloading (backups can be taken from a replica; a replica is not itself a backup)

Primary-Replica Architecture
============================

                     Application
                          |
             +------------+------------+
             |            |            |
             v            v            v
           Read/        Write        Read
           Write      (Primary)   (Replicas)
         (Primary)
                          |
                          v
                   +-------------+
                   |   Primary   |
                   |     DB      |
                   +-------------+
                          |
                          | Replicate
                          v
+-------------+    +-------------+    +-------------+
|   Replica   |    |   Replica   |    |   Replica   |
|      1      |    |      2      |    |      3      |
+-------------+    +-------------+    +-------------+

Flow:

1. Writes go to Primary
2. Primary replicates to Replicas
3. Reads can be served by Replicas

Master-Slave (Same as Primary-Replica)
======================================

- Master = Primary
- Slave = Replica

Terminology:

- "Master" and "Slave" are older names that many projects are phasing out
- "Primary" and "Replica" are preferred

Read/Write Splitting
====================

  Application
       |
       v
+-------------+
|    Load     |
|  Balancer   |
+-------------+
       |
  +----+----+
  |         |
  v         v
Write     Read
Node      Nodes
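
The splitting step can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `ReadWriteRouter`, the node names, and the rule "decide by the first SQL keyword" are all hypothetical and not taken from any particular proxy or driver. Real routers also have to deal with transactions and statements such as `SELECT ... FOR UPDATE`, which this sketch ignores.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and rotate reads across the replicas."""

    # Assumption: a statement is a write if it starts with one of these verbs.
    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE",
                   "CREATE", "ALTER", "DROP", "TRUNCATE"}

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._readers = itertools.cycle(replicas or [primary])

    def route(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in self.WRITE_VERBS:
            return self.primary
        return next(self._readers)


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.route("UPDATE users SET name = 'a' WHERE id = 1"))
print(router.route("SELECT * FROM users"))
```

Keeping the decision in one place means the application code does not need to know how many replicas exist.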

Multi-Master Architecture
=========================

+-------------+     +-------------+
|  Master A   |<--->|  Master B   |
|  (Primary)  |     |  (Primary)  |
+-------------+     +-------------+
       |                   |
       v                   v
+-------------+     +-------------+
|   Replica   |     |   Replica   |
|      1      |     |      2      |
+-------------+     +-------------+

Write Flow:

- Application can write to any Master
- Masters replicate to each other
- Masters replicate to Replicas

Advantages:

- No single point of write failure
- Lower write latency for geographically distributed clients (each writes to the nearest Master)

Challenges:

- Conflict resolution (two Masters can change the same row at the same time)
- Operational complexity

Synchronous Replication
=======================

Application -> Primary -> Replica1 -> Replica2 -> Response
                  |           |           |
                  v           v           v
                 Write to all, wait for ACK

Timeline (elapsed time since the write started):

+--------+    +--------+    +--------+    +--------+
|Write to|    |Wait for|    |Wait for|    |Return  |
|Primary | -> |Replica1| -> |Replica2| -> |to App  |
+--------+    +--------+    +--------+    +--------+
   10ms          20ms          30ms          40ms

Pros:

- Strong consistency
- No loss of acknowledged writes

Cons:

- High write latency (every write waits for all replicas)
- If one replica is down, writes block or fail

Asynchronous Replication
========================

Application -> Primary -> Return to App -> Background Replica
                  |                                |
                  v                                v
                Write                       Replicate async
             immediately

Timeline (elapsed time since the write started):

+--------+    +--------+    +--------+
|Write to|    |Return  |    |Replica |
|Primary | -> |to App  | -> |writes  |
+--------+    +--------+    +--------+
   10ms          10ms         100ms

Pros:

- Low latency
- Writes continue if replicas are down

Cons:

- Eventual consistency (replicas may serve stale reads)
- Potential data loss (writes not yet replicated are lost if the primary fails)

Semi-Synchronous Replication
============================

Primary -> At least one replica confirms -> Return to App
   |
   v
All other replicas

Compromise:

- Waits for at least 1 replica (not all)
- Better performance than sync
- Better consistency than async
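
The three modes differ only in how many replica acknowledgements the primary waits for before answering. The sketch below makes that explicit. It assumes, purely for illustration, that the primary sends to all replicas in parallel; the function name `commit_latency_ms` and the sample timings are hypothetical.

```python
def commit_latency_ms(primary_ms, replica_ack_ms, required_acks):
    """Time until the application gets its response.

    primary_ms:     time for the primary's local write.
    replica_ack_ms: time for each replica's ACK to arrive, measured from the
                    end of the primary's write (replicas work in parallel).
    required_acks:  0 = asynchronous, 1 = semi-synchronous,
                    len(replica_ack_ms) = fully synchronous.
    """
    if not 0 <= required_acks <= len(replica_ack_ms):
        raise ValueError("required_acks must be between 0 and the replica count")
    if required_acks == 0:
        return primary_ms
    # The primary can answer once the k-th fastest ACK has arrived.
    return primary_ms + sorted(replica_ack_ms)[required_acks - 1]


acks = [15, 40, 90]  # hypothetical per-replica ACK times in ms
print("async:    ", commit_latency_ms(10, acks, 0))
print("semi-sync:", commit_latency_ms(10, acks, 1))
print("sync:     ", commit_latency_ms(10, acks, len(acks)))
```

With these sample numbers the slowest replica dominates the synchronous case, while the semi-synchronous case only pays for the fastest one.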

Single Primary Topology
=======================

              Primary
           (Read/Write)
                 |
     +-----------+-----------+
     |           |           |
     v           v           v
 Replica1    Replica2    Replica3

Use Cases:

- Standard read scaling
- High availability
- Geographic distribution

Chain Replication
=================

Primary -> Replica1 -> Replica2 -> Replica3

Write flow:

1. Write to Primary
2. Primary to Replica1
3. Replica1 to Replica2
4. Replica2 to Replica3
5. Return to application

Pros:

- Simple
- Lower bandwidth per node (each node sends to only one successor)
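
The write flow above can be modelled as a linked list in which each node applies the write and hands it to its successor. This is a toy in-memory sketch, not a real replication protocol; `ChainNode` and `build_chain` are hypothetical names, and failure handling is left out.

```python
class ChainNode:
    """One database node that forwards every write to at most one successor."""

    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.data = {}

    def write(self, key, value):
        """Apply locally, forward down the chain, return the nodes that applied it."""
        self.data[key] = value
        if self.successor is None:
            return [self.name]  # tail reached: the write is fully replicated
        return [self.name] + self.successor.write(key, value)


def build_chain(names):
    """Link the named nodes in order and return the head (the primary)."""
    head = None
    for name in reversed(names):
        head = ChainNode(name, successor=head)
    return head


primary = build_chain(["Primary", "Replica1", "Replica2", "Replica3"])
print(primary.write("x", 1))  # the call returns only after the tail has applied it
```

Each node talks to a single neighbour, which is why per-node bandwidth stays low, but the response waits for the whole chain.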

All-to-All Replication
======================

Node A <-> Node B <-> Node C <-> Node A

Each node replicates to all others.

Pros:

- Most resilient
- Any node can be primary

Cons:

- Complex
- High bandwidth

Write Conflicts
===============

Node A writes: user.balance = 100
Node B writes: user.balance = 50

Conflict occurs! Which one wins?

Resolution Strategies:

+------------------+--------------------------------+
| Strategy         | Description                    |
+------------------+--------------------------------+
| Last-Write-Wins  | Timestamp-based (simple)       |
| Vector Clocks    | Track causality                |
| CRDTs            | Conflict-free data types       |
| Manual           | Queue for resolution           |
| Merge            | Automatic merge rules          |
+------------------+--------------------------------+
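
To show what "track causality" means for vector clocks, here is a minimal comparison function. It assumes each clock is a plain dict mapping a node name to a counter; the function name `compare` and this representation are choices made for the example only.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts of node -> counter.

    Returns "before" (a happened before b), "after", "equal",
    or "concurrent" (neither saw the other: a real conflict).
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


print(compare({"A": 1}, {"A": 2}))          # second write saw the first
print(compare({"A": 1}, {"B": 1}))          # independent writes on two nodes
```

Only the "concurrent" case needs one of the other strategies in the table; ordered writes can simply be applied in order.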
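
LWW Implementation
==================

Each write carries a timestamp; on conflict, the write with the latest timestamp wins.

Example:

Node A: Write X=100 at T1
Node B: Write X=50 at T2

Resolution: X=50 (T2 > T1); the write X=100 is discarded.

Issues:

- Clock synchronization needed (skewed clocks pick the wrong winner)
- May lose updates
- Not suitable for all cases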
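
A minimal sketch of the rule follows. The `Versioned` record and the `lww_merge` function are hypothetical names; breaking timestamp ties by node ID is an added assumption so that every node picks the same winner, and is not something the example above specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: object
    timestamp: float
    node_id: str  # assumption: used only to break timestamp ties


def lww_merge(a, b):
    """Return the write with the larger (timestamp, node_id) pair."""
    if (a.timestamp, a.node_id) >= (b.timestamp, b.node_id):
        return a
    return b


write_a = Versioned(value=100, timestamp=1.0, node_id="A")  # X=100 at T1
write_b = Versioned(value=50, timestamp=2.0, node_id="B")   # X=50 at T2
print(lww_merge(write_a, write_b).value)
```

The merge gives the same result whichever order the two writes arrive in, which is what lets replicas converge without coordinating.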

Failover Steps
==============

1. Detect Failure

   +----------------+
   | Primary fails  |
   | Replica can't  |
   | connect        |
   +----------------+

2. Elect New Primary

   +----------------+
   | Replicas vote  |
   | Choose new     |
   | primary        |
   +----------------+

3. Promote

   +----------------+
   | New primary    |
   | promoted       |
   | Writes enabled |
   +----------------+

4. Reconfigure

   +----------------+
   | Old primary    |
   | removed from   |
   | pool           |
   +----------------+

5. Heal

   +----------------+
   | Old primary    |
   | comes back     |
   | as replica     |
   +----------------+
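
+---------------+---------------------------------+
| Consideration | Impact                          |
+---------------+---------------------------------+
| Data Loss     | Async replication may lose data |
| Downtime      | Time to detect and promote      |
| Split Brain   | Two primaries active            |
| Clients       | Must redirect to new primary    |
+---------------+---------------------------------+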
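
The detect and elect steps can be sketched as follows. Everything here is an illustrative assumption rather than the behaviour of a specific tool: the dict keys `name`, `healthy` and `applied_position`, the majority rule, and the policy of promoting the most up-to-date replica.

```python
def choose_new_primary(replicas):
    """Pick the healthy replica that has applied the most of the primary's log."""
    candidates = [r for r in replicas if r["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    # Promoting the most advanced replica minimises lost writes under
    # asynchronous replication.
    return max(candidates, key=lambda r: r["applied_position"])


def fail_over(replicas, down_votes):
    """Return the name of the replica to promote, or None if there is no majority.

    down_votes: how many of the replicas report that they cannot reach the primary.
    """
    # Requiring a strict majority guards against split brain when only a
    # minority has lost its connection to a primary that is still alive.
    if down_votes * 2 <= len(replicas):
        return None
    return choose_new_primary(replicas)["name"]


replicas = [
    {"name": "replica-1", "healthy": True, "applied_position": 120},
    {"name": "replica-2", "healthy": True, "applied_position": 150},
    {"name": "replica-3", "healthy": False, "applied_position": 200},
]
print(fail_over(replicas, down_votes=2))
```

The gap between the old primary's last position and the promoted replica's `applied_position` is the data-loss window from the table above.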

Read Replica Architecture
=========================

Write Path:

App -> Primary DB -> Primary Storage

Read Path:

App -> Load Balancer -> Replica 1
                     -> Replica 2
                     -> Replica 3

Read Distribution:

- Round robin
- Least connections
- Geographic affinity
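
Two of the distribution policies are simple enough to show directly. The helper names and the shape of the inputs (a list of replica names, a dict of connection counts) are assumptions for this sketch; geographic affinity is omitted because it depends on how client and replica locations are known.

```python
import itertools


def round_robin(replicas):
    """Return an iterator that yields the replicas in a repeating fixed order."""
    return itertools.cycle(replicas)


def least_connections(open_connections):
    """Pick the replica with the fewest open connections.

    open_connections: dict of replica name -> current connection count.
    """
    return min(open_connections, key=open_connections.get)


picker = round_robin(["replica-1", "replica-2", "replica-3"])
print([next(picker) for _ in range(4)])
print(least_connections({"replica-1": 12, "replica-2": 3, "replica-3": 7}))
```

Round robin needs no state about the replicas, while least connections adapts when some queries run much longer than others.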

Replication Lag
===============

How to measure (MySQL 8.0.22 and later; older versions use SHOW SLAVE STATUS and Seconds_Behind_Master):

+----------------------------------+
| SHOW REPLICA STATUS\G            |
| Seconds_Behind_Source: 5         |
+----------------------------------+

Impact of Lag:

+------------------+------------------------+
| Lag              | Impact                 |
+------------------+------------------------+
| < 1 second       | Minimal                |
| 1-30 seconds     | Noticeable             |
| 30 s - 5 minutes | Significant stale data |
| > 5 minutes      | Major issue            |
+------------------+------------------------+

Solutions:

- Add replicas to spread the read load
- Use a faster network
- Reduce write volume
- Read from the primary when fresh data is required
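
The last solution, falling back to the primary when fresh data is required, can be combined with the lag bands from the table above. In this sketch the function names, the `max_lag_seconds` parameter and the use of `None` for "lag unknown" (for example when replication is stopped) are all assumptions.

```python
def classify_lag(seconds):
    """Map a lag measurement onto the impact bands from the table above."""
    if seconds < 1:
        return "minimal"
    if seconds <= 30:
        return "noticeable"
    if seconds <= 300:
        return "significant stale data"
    return "major issue"


def pick_read_target(replica_lag, max_lag_seconds, primary="primary"):
    """Choose the least-lagged replica within the limit, else the primary.

    replica_lag: dict of replica name -> lag in seconds, or None if unknown.
    """
    fresh = {name: lag for name, lag in replica_lag.items()
             if lag is not None and lag <= max_lag_seconds}
    if not fresh:
        return primary  # no replica is fresh enough for this read
    return min(fresh, key=fresh.get)


lag = {"replica-1": 5, "replica-2": 0.4, "replica-3": None}
print(classify_lag(lag["replica-1"]))
print(pick_read_target(lag, max_lag_seconds=1))
```

A read that must see the user's own latest write would pass a small `max_lag_seconds`; a dashboard query can tolerate a much larger one.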

AWS RDS Multi-AZ
================

+--------------------------------------------------+
|                    AWS Region                    |
+--------------------------------------------------+

Availability Zone 1              Availability Zone 2
+-------------+                  +-------------+
|   Primary   |      Sync        |   Standby   |
|     DB      |   Replication    |     DB      |
|             |                  |             |
+-------------+                  +-------------+

Features:

- Automatic failover
- Synchronous replication
- Single endpoint
- Automatic backups

AWS RDS Read Replicas
=====================

Primary Region                     Read Replica Region
+----------+                       +----------+
| Primary  |    Cross-region       | Replica  |
|    DB    |  Async Replication    |    DB    |
+----------+                       +----------+

Use Cases:

- Read scaling
- Cross-region disaster recovery
- Dev/Test environments
- Analytics queries

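+---------------+---------------------------------------+
| Best Practice | Description                           |
+---------------+---------------------------------------+
| Monitor lag   | Track replication delay               |
| Test failover | Run regular DR tests                  |
| Size replicas | Same capacity as the primary          |
| Network       | Low latency between nodes             |
| Backups       | Keep taking backups while replicating |
| Security      | Encrypt replication traffic           |
+---------------+---------------------------------------+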

Key replication concepts:

  1. Choose replication type - Sync for consistency, async for performance
  2. Plan for failures - Automatic failover ensures availability
  3. Monitor lag - Track replica delay
  4. Handle conflicts - Choose resolution strategy
  5. Scale reads - Use read replicas effectively
  6. Consider multi-region - Global distribution
