Chapter 8: Database Replication

Ensuring High Availability and Scalability

8.1 Introduction to Replication

Database replication is the process of continuously copying data from one database server to one or more others to provide redundancy, improve availability, and scale read operations.

    Why Replication?
    ===============

    Without Replication:
    +----------+
    |Database  |  (Single point of failure)
    |   DB1    |
    +----------+

    With Replication:
    +----------+     +----------+
    |  Primary | --> | Replica  |
    |    DB    |     |    DB    |
    +----------+     +----------+

    Benefits:
    - High availability
    - Fault tolerance
    - Read scalability
    - Geographic distribution
    - Backup capability

8.2 Replication Types

8.2.1 Primary-Replica Replication

    Primary-Replica Architecture
    ============================

                    Application
                        |
           +------------+------------+
           |            |            |
           v            v            v
        Read/       Write        Read
        Write       (Primary)    (Replicas)
        (Primary)
           |
           v
    +-------------+
    |  Primary    |
    |   DB        |
    +-------------+
           |
           | Replicate
           v
    +-------------+     +-------------+     +-------------+
    |   Replica   |     |   Replica   |     |   Replica   |
    |      1      |     |      2      |     |      3      |
    +-------------+     +-------------+     +-------------+

    Flow:
    1. Writes go to Primary
    2. Primary replicates to Replicas
    3. Reads can be served by Replicas

8.2.2 Master-Slave Replication

    Master-Slave (Same as Primary-Replica)
    ======================================

    Master = Primary
    Slave = Replica

    Terminology:
    - "Master" and "Slave" are legacy terms that many projects are phasing out
    - "Primary" and "Replica" are preferred

    Read/Write Splitting:
    ====================

    Application
         |
         v
    +-------------+
    | Load        |
    | Balancer    |
    +-------------+
         |
    +----+----+
    |         |
    v         v
    Write    Read
    Node     Nodes
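
The read/write split in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not a production proxy: the `ReadWriteRouter` class and the node names are hypothetical, and the rule "anything that starts with SELECT is a read" is a simplifying assumption.

```python
from itertools import cycle


class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._read_pool = cycle(replicas or [primary])

    def route(self, statement):
        # Assumption: only plain SELECT statements count as reads.
        if statement.lstrip().upper().startswith("SELECT"):
            return next(self._read_pool)
        return self.primary


router = ReadWriteRouter("primary", ["replica1", "replica2"])
targets = [
    router.route("SELECT * FROM users"),
    router.route("UPDATE users SET name = 'a' WHERE id = 1"),
    router.route("SELECT 1"),
]
print(targets)  # ['replica1', 'primary', 'replica2']
```

A real router would also need to send locking reads (for example `SELECT ... FOR UPDATE`) and reads inside a write transaction to the primary.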

8.2.3 Multi-Primary (Multi-Master)

    Multi-Master Architecture
    =========================

    +-------------+     +-------------+
    |  Master A   |<--->|  Master B   |
    |  (Primary)  |     |  (Primary)  |
    +-------------+     +-------------+
         |                   |
         v                   v
    +-------------+     +-------------+
    |   Replica   |     |   Replica   |
    |      1      |     |      2      |
    +-------------+     +-------------+

    Write Flow:
    - Application can write to any Master
    - Masters replicate to each other
    - Masters replicate to Replicas

    Advantages:
    - No single point of write failure
    - Lower latency for writes (geographic)

    Challenges:
    - Conflict resolution
    - Complexity

8.3 Replication Methods

8.3.1 Synchronous Replication

    Synchronous Replication
    ======================

    Application -> Primary -> Replica1 -> Replica2 -> Response
                       |          |          |
                       v          v          v
                    Write to all, wait for ACK

    Timeline:
    +--------+    +--------+    +--------+    +--------+
    |Write to|    |Wait for|    |Wait for|    |Return  |
    |Primary | -> |Replica1| -> |Replica2| -> |to App  |
    +--------+    +--------+    +--------+    +--------+
    10ms          20ms         30ms         40ms

    Pros:
    - Strong consistency
    - No loss of acknowledged writes if the primary fails

    Cons:
    - High write latency (every write waits for all replicas)
    - If any replica is down, writes block or fail
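
The trade-off above can be illustrated with a small in-memory sketch. The `Node` class and `sync_write` function are hypothetical names, and the up-front reachability check is a simplification standing in for a real commit protocol.

```python
class Node:
    """In-memory stand-in for a database server (hypothetical)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}


def sync_write(primary, replicas, key, value):
    """Apply a write everywhere, or nowhere if any replica is unreachable."""
    down = [r.name for r in replicas if not r.up]
    if down:
        # Simplified pre-check standing in for a real commit protocol.
        raise ConnectionError(f"replicas unreachable: {down}")
    primary.data[key] = value
    for replica in replicas:
        replica.data[key] = value  # each replica applies before the ACK
    return "ACK"


primary = Node("primary")
replicas = [Node("replica1"), Node("replica2")]
first = sync_write(primary, replicas, "x", 100)

replicas[1].up = False  # simulate a replica outage
try:
    sync_write(primary, replicas, "x", 200)
    second = "ACK"
except ConnectionError:
    second = "FAILED"

print(first, second, primary.data)  # ACK FAILED {'x': 100}
```

The second write is rejected because one replica is down, which is the availability cost listed under Cons.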

8.3.2 Asynchronous Replication

    Asynchronous Replication
    =======================

    Application -> Primary -> Return to App -> Background Replica
                       |          |
                       v          v
                    Write     Replicate async
                    immediately

    Timeline:
    +--------+    +--------+    +--------+
    |Write to|    |Return  |    |Replica |
    |Primary | -> |to App  | -> |writes  |
    +--------+    +--------+    +--------+
    10ms          10ms         100ms

    Pros:
    - Low write latency
    - Writes continue if replicas are down

    Cons:
    - Eventual consistency (replicas may serve stale reads)
    - Potential data loss (if primary fails before changes are replicated)
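
A minimal sketch of the asynchronous case, assuming a single replica and an in-memory backlog of unshipped changes. `AsyncPrimary` and `replicate` are hypothetical names used only for illustration.

```python
from collections import deque


class AsyncPrimary:
    """Primary that acknowledges writes before they reach the replica."""

    def __init__(self):
        self.data = {}
        self.backlog = deque()  # changes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.backlog.append((key, value))
        return "ACK"  # returned before the replica has the change


def replicate(primary, replica_data, max_changes=None):
    """Background step: ship queued changes to the replica in order."""
    shipped = 0
    while primary.backlog and (max_changes is None or shipped < max_changes):
        key, value = primary.backlog.popleft()
        replica_data[key] = value
        shipped += 1
    return shipped


primary = AsyncPrimary()
replica = {}
primary.write("a", 1)
primary.write("b", 2)
replicate(primary, replica, max_changes=1)  # replica is one change behind

print(replica)               # {'a': 1}
print(len(primary.backlog))  # 1
```

If the primary failed at this point, the acknowledged write to `b` would exist only in its backlog and would be lost.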

8.3.3 Semi-Synchronous Replication

    Semi-Synchronous Replication
    ===========================

    Primary -> At least one replica confirms -> Return to App
                    |
                    v
              All other replicas

    Compromise:
    - Waits for at least 1 replica (not all)
    - Lower write latency than fully synchronous replication
    - Stronger durability than asynchronous replication
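
The compromise can be sketched as a write that returns once a minimum number of replicas has confirmed. The function name, the dictionary layout, and the choice to raise an error when too few replicas respond are all assumptions made for this example; some systems fall back to asynchronous mode instead.

```python
def semi_sync_write(primary, replicas, key, value, min_acks=1):
    """Commit on the primary, then require at least min_acks replica ACKs."""
    primary["data"][key] = value
    acked, pending = [], []
    for replica in replicas:
        if replica["up"] and len(acked) < min_acks:
            replica["data"][key] = value  # applied before we reply
            acked.append(replica["name"])
        else:
            pending.append(replica["name"])  # catches up asynchronously
    if len(acked) < min_acks:
        raise TimeoutError("not enough replica acknowledgements")
    return acked, pending


primary = {"name": "primary", "up": True, "data": {}}
replicas = [
    {"name": "replica1", "up": False, "data": {}},
    {"name": "replica2", "up": True, "data": {}},
    {"name": "replica3", "up": True, "data": {}},
]
acked, pending = semi_sync_write(primary, replicas, "x", 100)
print(acked)    # ['replica2']
print(pending)  # ['replica1', 'replica3']
```

Note that the write succeeds even though `replica1` is down, which a fully synchronous setup would not allow.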

8.4 Replication Topologies

8.4.1 Single Primary

    Single Primary Topology
    ======================

              Primary
           (Read/Write)
                |
    +-----------+-----------+
    |           |           |
    v           v           v
    Replica1  Replica2   Replica3

    Use Cases:
    - Standard read scaling
    - High availability
    - Geographic distribution

8.4.2 Chain Replication

    Chain Replication
    =================

    Primary -> Replica1 -> Replica2 -> Replica3

    Write flow:
    1. Write to Primary
    2. Primary to Replica1
    3. Replica1 to Replica2
    4. Replica2 to Replica3
    5. Return to application

    Pros:
    - Simple
    - Lower bandwidth per node
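
The write flow above can be sketched as a loop over an ordered list of nodes. This is an illustrative simulation with hypothetical names; a real chain forwards each write over the network from one node to its successor.

```python
def chain_write(chain, key, value):
    """Pass a write from the head of the chain to the tail, one hop at a time."""
    hops = []
    for node in chain:  # chain[0] is the primary, chain[-1] is the tail
        node["data"][key] = value
        hops.append(node["name"])
    # The application is answered only after the tail has applied the write.
    return hops


names = ("primary", "replica1", "replica2", "replica3")
chain = [{"name": name, "data": {}} for name in names]
print(chain_write(chain, "x", 100))
# ['primary', 'replica1', 'replica2', 'replica3']
```

Each node sends to only one successor, which is where the lower per-node bandwidth comes from. The cost is that write latency grows with the length of the chain.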

8.4.3 All-to-All Replication

    All-to-All Replication
    =====================

    Node A <-> Node B <-> Node C <-> Node A

    Each node replicates to all others

    Pros:
    - Most resilient
    - Any node can be primary

    Cons:
    - Complex
    - High bandwidth

8.5 Conflict Resolution

Types of Conflicts

    Write Conflicts
    =============

    Node A writes: user.balance = 100
    Node B writes: user.balance = 50   (same record, at the same time)

    Both writes were accepted locally. Which one wins?

    Resolution Strategies:
    +------------------+--------------------------------+
    | Strategy         | Description                    |
    +------------------+--------------------------------+
    | Last-Write-Wins  | Timestamp-based (simple)       |
    | Vector Clocks    | Track causality                |
    | CRDTs            | Conflict-free data types       |
    | Manual           | Queue for resolution           |
    | Merge            | Automatic merge rules          |
    +------------------+--------------------------------+
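
As an illustration of the "track causality" row, the sketch below compares two vector clocks to tell an ordinary overwrite from a true conflict. The `compare` function and the dictionary representation (`{node: counter}`) are assumptions chosen for this example.

```python
def compare(a, b):
    """Compare two vector clocks given as {node: counter} dicts."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # true conflict: needs a resolution rule
    if a_ahead:
        return "a_after_b"
    if b_ahead:
        return "b_after_a"
    return "equal"


print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # a_after_b
print(compare({"A": 2, "B": 0}, {"A": 1, "B": 1}))  # concurrent
```

Only the "concurrent" case needs one of the other strategies in the table; when one clock is strictly ahead, the newer write can safely replace the older one.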

Last Write Wins (LWW)

    LWW Implementation
    ==================

    Each write includes timestamp

    Example:
    Node A: Write X=100 at T1
    Node B: Write X=50 at T2

    Resolution: X=50 (T2 > T1)

    Issues:
    - Requires synchronized clocks across nodes
    - Silently discards the losing write
    - Not suitable where every update must be preserved (e.g. counters, balances)
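
The example above can be sketched as follows. The `Versioned` record and the `node_id` tie-breaker are assumptions added for illustration; the original example only specifies that the later timestamp wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: int
    timestamp: float
    node_id: str  # assumed tie-breaker when timestamps are equal


def lww_merge(a, b):
    """Return the write with the later timestamp; ties go to the higher node_id."""
    return max(a, b, key=lambda w: (w.timestamp, w.node_id))


a = Versioned(value=100, timestamp=1.0, node_id="A")  # Node A at T1
b = Versioned(value=50, timestamp=2.0, node_id="B")   # Node B at T2
print(lww_merge(a, b).value)  # 50
```

Because the merge gives the same answer regardless of argument order, every node that sees both writes converges on the same value. The write of 100 is dropped without any error, which is the lost-update issue noted above.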

8.6 Failover

Automatic Failover Process

    Failover Steps
    =============

    1. Detect Failure
    +---------------+
    | Primary fails |
    | Replica can't |
    | connect       |
    +---------------+

    2. Elect New Primary
    +---------------+
    | Replicas vote |
    | Choose new    |
    | primary       |
    +---------------+

    3. Promote
    +---------------+
    | New primary   |
    | promoted      |
    | Writes enabled|
    +---------------+

    4. Reconfigure
    +---------------+
    | Old primary   |
    | removed from  |
    | pool          |
    +---------------+

    5. Heal
    +---------------+
    | Old primary   |
    | comes back    |
    | as replica    |
    +---------------+
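
Steps 2 to 4 can be sketched as below. Note the substitution: instead of a real vote among replicas, this example uses a simple deterministic rule (promote the healthy replica with the highest applied log position). The field names `healthy`, `log_position`, and `read_only` are hypothetical.

```python
def elect_new_primary(replicas):
    """Pick the healthy replica that has applied the most of the primary's log."""
    candidates = [r for r in replicas if r["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    return max(candidates, key=lambda r: r["log_position"])


def fail_over(cluster):
    """Promote a replica and take the failed primary out of the pool."""
    new_primary = elect_new_primary(cluster["replicas"])  # step 2: elect
    cluster["replicas"].remove(new_primary)
    new_primary["read_only"] = False                      # step 3: promote
    cluster["primary"] = new_primary                      # step 4: reconfigure
    return cluster


cluster = {
    "primary": {"name": "db1", "healthy": False, "log_position": 120, "read_only": False},
    "replicas": [
        {"name": "db2", "healthy": True, "log_position": 118, "read_only": True},
        {"name": "db3", "healthy": True, "log_position": 120, "read_only": True},
        {"name": "db4", "healthy": False, "log_position": 120, "read_only": True},
    ],
}
fail_over(cluster)
print(cluster["primary"]["name"])                # db3
print([r["name"] for r in cluster["replicas"]])  # ['db2', 'db4']
```

Choosing the most up-to-date replica minimizes the data lost under asynchronous replication. Step 5 (re-adding the old primary as a replica) is left out because it happens only after the failed node has recovered.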

Failover Considerations

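+---------------+---------------------------------+
| Consideration | Impact                          |
+---------------+---------------------------------+
| Data Loss     | Async replication may lose data |
| Downtime      | Time to detect and promote      |
| Split Brain   | Two primaries active            |
| Clients       | Must redirect to new primary    |
+---------------+---------------------------------+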

8.7 Read Replicas

Scaling Reads

    Read Replica Architecture
    ========================

    Write Path:
    App -> Primary DB -> Primary Storage

    Read Path:
    App -> Load Balancer -> Replica 1
                      -> Replica 2
                      -> Replica 3

    Read Distribution:
    - Round robin
    - Least connections
    - Geographic affinity
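
The three distribution strategies can each be sketched in a few lines. The replica names, connection counts, and region labels below are made-up illustration data, and the helper functions are hypothetical.

```python
from itertools import cycle

replicas = ["replica1", "replica2", "replica3"]

# Round robin: hand out replicas in a fixed rotation.
rotation = cycle(replicas)
round_robin_picks = [next(rotation) for _ in range(4)]
print(round_robin_picks)  # ['replica1', 'replica2', 'replica3', 'replica1']


def least_connections(open_connections):
    """Return the replica that currently has the fewest open connections."""
    return min(open_connections, key=open_connections.get)


def nearest(replica_regions, client_region, fallback):
    """Geographic affinity: prefer a replica in the client's own region."""
    for name, region in replica_regions.items():
        if region == client_region:
            return name
    return fallback


open_connections = {"replica1": 12, "replica2": 4, "replica3": 9}
regions = {"replica1": "us-east", "replica2": "eu-west", "replica3": "ap-south"}
print(least_connections(open_connections))      # replica2
print(nearest(regions, "eu-west", "replica1"))  # replica2
```

Round robin is the simplest choice when replicas are identical; least connections adapts better when query costs vary widely.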

Lag Monitoring

    Replication Lag
    ==============

    How to measure (MySQL 8.0.22 and later):
    +----------------------------------+
    |  SHOW REPLICA STATUS\G           |
    |  Seconds_Behind_Source: 5        |
    +----------------------------------+
    (Older MySQL versions: SHOW SLAVE STATUS\G, Seconds_Behind_Master)

    Impact of Lag:
    +------------------+------------------------+
    | Lag              | Impact                 |
    +------------------+------------------------+
    | < 1 second       | Minimal                |
    | 1-30 seconds     | Noticeable             |
    | 30 sec - 5 min   | Significant stale data |
    | > 5 minutes      | Major issue            |
    +------------------+------------------------+

    Solutions:
    - Add replicas to spread read load
    - Use a faster network
    - Reduce write volume
    - Read from primary when fresh data is required
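
The last item, falling back to the primary when freshness matters, can be sketched as a lag-aware read router. The function name, the 1-second default threshold, and the convention that a lag of `None` means replication is stopped are assumptions for this example.

```python
def choose_read_target(replica_lag, max_lag_seconds=1.0, need_fresh=False):
    """Pick the least-lagged replica, or the primary if none is fresh enough."""
    if need_fresh:
        return "primary"
    fresh = {
        name: lag
        for name, lag in replica_lag.items()
        if lag is not None and lag <= max_lag_seconds  # None: replication stopped
    }
    if not fresh:
        return "primary"
    return min(fresh, key=fresh.get)


lag = {"replica1": 0.4, "replica2": 5.0, "replica3": None}
print(choose_read_target(lag))                   # replica1
print(choose_read_target(lag, need_fresh=True))  # primary
print(choose_read_target({"replica2": 45.0}))    # primary
```

The lag values would come from monitoring such as the status query shown above. A common use of `need_fresh` is reading back data the same user has just written.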

8.8 Cloud Database Replication

AWS RDS Multi-AZ

    AWS RDS Multi-AZ
    ================

    +--------------------------------------------------+
    |                  AWS Region                      |
    +--------------------------------------------------+

    Availability Zone 1       Availability Zone 2
    +-------------+           +-------------+
    |  Primary    |           |  Standby    |
    |  DB         |  Sync     |  DB         |
    |             |  Replic.  |             |
    +-------------+           +-------------+

    Features:
    - Automatic failover
    - Synchronous replication
    - Single endpoint
    - Automatic backups

Read Replica Configuration

    AWS RDS Read Replicas
    ====================

    Primary Region                Read Replica Region
    +----------+                 +----------+
    | Primary  |  Cross-region   | Replica  |
    | DB       |  Async Replic.  | DB       |
    +----------+                 +----------+

    Use Cases:
    - Read scaling
    - Cross-region disaster recovery
    - Dev/Test environments
    - Analytics queries

8.9 Best Practices

Replication Design

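+---------------+-----------------------------+
| Best Practice | Description                 |
+---------------+-----------------------------+
| Monitor lag   | Track replication delay     |
| Test failover | Regular DR tests            |
| Size replicas | Same as primary             |
| Network       | Low latency between nodes   |
| Backups       | Continue during replication |
| Security      | Encrypt replication traffic |
+---------------+-----------------------------+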

Summary

Key replication concepts:

- Choose replication type: sync for consistency, async for performance
- Plan for failures: automatic failover ensures availability
- Monitor lag: track replica delay
- Handle conflicts: choose a resolution strategy
- Scale reads: use read replicas effectively
- Consider multi-region: global distribution
