Database Replication
Chapter 73: Database Replication
Section titled “Chapter 73: Database Replication”Comprehensive Guide to Database Replication
Section titled “Comprehensive Guide to Database Replication”Why This Matters in DevOps/SRE
Section titled “Why This Matters in DevOps/SRE”Database replication is fundamental to building high-availability systems. It enables read scaling, provides failover capability, and ensures data redundancy. As a DevOps/SRE, you’ll configure replication, monitor lag, and handle failover scenarios - critical skills for maintaining uptime.
┌─────────────────────────────────────────────────────────────────────────────┐│ DATABASE REPLICATION IN HIGH AVAILABILITY │├─────────────────────────────────────────────────────────────────────────────┤│ ││ ┌─────────────────────────────────────────────────────────────────────┐ ││ │ REPLICATION TOPOLOGIES │ ││ │ │ ││ │ Master-Slave: │ Master-Master: │ ││ │ ┌──────┐ │ ┌──────┐ ┌──────┐ │ ││ │ │Master│──► Slave │ │Master1│◄────►│Master2│ │ ││ │ │ │──► Slave │ └──────┘ └──────┘ │ ││ │ └──────┘ │ (active-active) │ ││ │ (read scaling) │ │ ││ │ │ ││ └─────────────────────────────────────────────────────────────────────┘ ││ ││ ┌─────────────────────────────────────────────────────────────────────┐ ││ │ FAILOVER SCENARIO │ ││ │ │ ││ │ Primary DB fails │ ││ │ │ │ ││ │ ▼ │ ││ │ Detect failure (health check) │ ││ │ │ │ ││ │ ▼ │ ││ │ Promote replica to primary │ ││ │ │ │ ││ │ ▼ │ ││ │ Update application connection string │ ││ │ │ │ ││ │ ▼ │ ││ │ Traffic flows to new primary │ ││ │ │ ││ └─────────────────────────────────────────────────────────────────────┘ ││ ││ ┌─────────────────────────────────────────────────────────────────────┐ ││ │ MONITORING METRICS │ ││ │ │ ││ │ • Replication lag (seconds behind master) │ ││ │ • IO thread status (Running/Stopped) │ ││ │ • SQL thread status │ ││ │ • Binlog position │ ││ │ │ ││ └─────────────────────────────────────────────────────────────────────┘ ││ │└─────────────────────────────────────────────────────────────────────────────┘73.1 MySQL/MariaDB Replication
Section titled “73.1 MySQL/MariaDB Replication”Replication Types
Section titled “Replication Types” MySQL Replication Types+------------------------------------------------------------------+| || Master-Slave: || +----------------------------------------------------------+ || | • One master, multiple slaves | || | • Read scaling | || | • Backup offloading | || | • Asynchronous | || +----------------------------------------------------------+ || || Master-Master: || +----------------------------------------------------------+ || | • Two masters active | || | • Write to either | || | • Auto-increment offset for conflict avoidance | || +----------------------------------------------------------+ || || Replication Formats: || +----------------------------------------------------------+ || | Statement-based | SQL statements replicated | || | Row-based | Rows changes replicated | || | Mixed | Auto-select based on statement | || +----------------------------------------------------------+ || |+------------------------------------------------------------------+Master Configuration
Section titled “Master Configuration”# /etc/mysql/my.cnf or /etc/mysql/mariadb.conf.d/server.cnf[mysqld]server-id = 1log-bin = mysql-binbinlog-do-db = myapp # Replicate specific databasebinlog-ignore-db = mysql # Ignore system databaseexpire_logs_days = 7max_binlog_size = 100M
# For GTID (recommended)gtid-mode = ONenforce-gtid-consistency = ON
# Create replication userCREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';FLUSH PRIVILEGES;Slave Configuration
Section titled “Slave Configuration”[mysqld]server-id = 2relay-log = relay-binrelay-log-index = relay-bin.indexread-only = 1 # Prevent writes on slavelog-bin = mysql-bin # For cascading replication
# For GTIDgtid-mode = ONenforce-gtid-consistency = ON
# Skip errors (optional)# skip-slave-start = ON# slave-skip-errors = 1062,1050Setting Up Replication
Section titled “Setting Up Replication”# On master - get binlog positionSHOW MASTER STATUS;# Note: File and Position values
# Or for GTIDSHOW MASTER STATUS\G
# On slave - configure masterCHANGE MASTER TO MASTER_HOST='master_host', MASTER_USER='repl', MASTER_PASSWORD='repl_password', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=1234;
# For GTIDCHANGE MASTER TO MASTER_HOST='master_host', MASTER_USER='repl', MASTER_PASSWORD='repl_password', MASTER_AUTO_POSITION=1;
# Start replicationSTART SLAVE;
# Check statusSHOW SLAVE STATUS\G
# Stop replicationSTOP SLAVE;RESET SLAVE ALL;Monitoring Replication
Section titled “Monitoring Replication”# Check slave statusSHOW SLAVE STATUS\G
# Key metrics:# Slave_IO_Running: Yes# Slave_SQL_Running: Yes# Seconds_Behind_Master: 0# Last_IO_Errno / Last_SQL_Errno
# On masterSHOW PROCESSLIST\GSHOW MASTER STATUS;
# Replication threadsSHOW PROCESSLIST;-- Should show "Binlog Dump" on master
# Skip failed transactionSTOP SLAVE;SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;START SLAVE;73.2 PostgreSQL Replication
Section titled “73.2 PostgreSQL Replication”Streaming Replication
Section titled “Streaming Replication”# On master - postgresql.confwal_level = replicamax_wal_senders = 3max_replication_slots = 3wal_keep_size = 1GB
# Create replication userCREATE USER replicator WITH REPLICATION PASSWORD 'rep_password';
# pg_hba.conf - allow replicationhost replication replicator 10.0.0.0/24 md5
# Create base backup on replicapg_basebackup -h master -D /var/lib/postgresql/14/main -U replicator -P -Xs
# Or with replication slotpg_basebackup -h master -D /var/lib/postgresql/14/main -U replicator -P --slot=replica_slot -Xs
# On replica - postgresql.confprimary_conninfo = 'host=master port=5432 user=replicator password=rep_password'hot_standby = on
# Create recovery signaltouch /var/lib/postgresql/14/main/standby.signalReplication Monitoring
Section titled “Replication Monitoring”-- On masterSELECT * FROM pg_stat_replication;SELECT * FROM pg_replication_slots;
-- On replicaSELECT * FROM pg_stat_wal_receiver;SELECT * FROM pg_stat_replication;
-- Check lagSELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;Logical Replication
Section titled “Logical Replication”# On master - create publicationCREATE PUBLICATION myapp_pub FOR ALL TABLES;
# On replica - create subscriptionCREATE SUBSCRIPTION myapp_sub CONNECTION 'host=master port=5432 dbname=myapp user=replicator' PUBLICATION myapp_pub;
# MonitorSELECT * FROM pg_stat_subscription;73.3 Interview Questions
Section titled “73.3 Interview Questions”Basic Questions
Section titled “Basic Questions”-
What is database replication?
- Copying data from one database to another
-
What is master-slave replication?
- One master for writes, slaves for reads
-
What is the difference between statement-based and row-based replication?
- Statement: SQL statements replicated; Row: actual row changes
-
What is GTID?
- Global Transaction ID for easier replication management
-
What is streaming replication in PostgreSQL?
- WAL sent to replica in real-time
Common Mistakes & Anti-Patterns
Section titled “Common Mistakes & Anti-Patterns”1. Not Monitoring Replication Lag
Section titled “1. Not Monitoring Replication Lag”❌ WRONG: No visibility into lag
-- Not checking lagSHOW SLAVE STATUS; -- Not monitoring!✅ CORRECT: Monitor lag continuously
-- MySQLSHOW SLAVE STATUS\G-- Check: Seconds_Behind_Master
-- PostgreSQLSELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;2. Using async replication for critical data
Section titled “2. Using async replication for critical data”❌ WRONG: Data loss on primary failure
# my.cnf - async by default# (no sync related settings)✅ CORRECT: Use semi-sync for critical data
# my.cnfplugin_load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"rpl_semi_sync_master_enabled = 1rpl_semi_sync_slave_enabled = 13. Not having replication user security
Section titled “3. Not having replication user security”❌ WRONG: Weak replication credentials
-- DON'TGRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'password';✅ CORRECT: Secure replication user
-- Limit to specific IPsGRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.1.%' IDENTIFIED BY 'strong_password';4. Promoting wrong replica
Section titled “4. Promoting wrong replica”❌ WRONG: Manual promotion without checking position
# Blind promotion - might lose data!service mysql stop# promote replica to master✅ CORRECT: Use proper promotion procedure
# 1. Stop application writes# 2. Check replication position on all replicas# 3. Choose most up-to-date replica# 4. Stop replication and promotemysql> STOP SLAVE;mysql> RESET MASTER;# Update application to point to new master5. No testing of failover
Section titled “5. No testing of failover”❌ WRONG: Never tested failover until real outage
✅ CORRECT: Regular failover drills
# Schedule regular failover tests# 1. Document failover procedure# 2. Practice in staging# 3. Conduct game day exercises# 4. Automate with tools like Orchestrator or PatroniSummary
Section titled “Summary” Quick Reference+------------------------------------------------------------------+| || MySQL Commands: || +----------------------------------------------------------+ || | SHOW MASTER STATUS | Master position | || | SHOW SLAVE STATUS\G | Slave status | || | START/STOP SLAVE | Control replication | || +----------------------------------------------------------+ || || PostgreSQL Commands: || +----------------------------------------------------------+ || | pg_stat_replication | Master status | || | pg_stat_wal_receiver | Replica status | || | pg_basebackup | Create replica | || +----------------------------------------------------------+ || |+------------------------------------------------------------------+