Chapter 65: Distributed Storage Systems

Scaling Storage with GlusterFS, Ceph, and Distributed Solutions

Why This Matters in DevOps/SRE

Distributed storage is essential for modern cloud-native and scale-out infrastructure:
- Scale: Traditional storage can’t handle petabyte-scale workloads
- Availability: Distributed = no single point of failure
- DevOps: You’ll manage distributed storage in production
- Cloud: AWS EFS, EBS, S3 are distributed storage
- On-Call: Respond to storage cluster issues and capacity alerts
Understanding distributed storage is critical for modern infrastructure roles.
65.1 Distributed Storage Fundamentals

Understanding Distributed Storage

Distributed storage spreads data across multiple servers to provide scalability, fault tolerance, and high availability.
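Replication is what turns the fault-tolerance claim into numbers. A back-of-the-envelope sketch (a simplification that assumes independent node failures; the 1% figure is illustrative, not measured):

```python
# Probability of losing a piece of data: every one of its replicas must
# fail during the same window (assumes independent node failures).
def loss_probability(node_failure_prob: float, replicas: int) -> float:
    return node_failure_prob ** replicas

# With a hypothetical 1% per-node failure chance during a rebuild window:
single_copy = loss_probability(0.01, replicas=1)   # 1 in 100
three_copies = loss_probability(0.01, replicas=3)  # 1 in a million
```

This is why a replication factor of 3 is the usual production default: each extra copy multiplies durability, at the cost of raw capacity.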
Distributed storage provides six main benefits:

- Scale out: add nodes to grow capacity, with near-linear scaling
- High availability: multiple copies of data and automatic failover
- Fault tolerance: data replicated across nodes, self-healing
- Performance: parallel I/O across multiple nodes
- Global access: a single namespace from any location
- Cost effective: commodity hardware, no SAN required

Distributed Storage Types
The three storage types serve different workloads:

- Block storage: block devices backed by a distributed block service (Ceph RBD, Sheepdog). Use cases: virtual machines, databases.
- File storage: files and directories served by a distributed file system (GlusterFS, HDFS, CephFS). Use cases: NFS, Samba, shared file storage.
- Object storage: objects stored behind an object service (Ceph RGW, S3). Use cases: cloud storage, backups, media.

65.2 GlusterFS
GlusterFS Architecture
A GlusterFS deployment has two layers:

- Client layer: applications access the volume through the GlusterFS FUSE client, which talks to the storage nodes directly.
- Server layer: each node exports one or more bricks (local directories such as /data1, /data2, /data3), while the glusterd management daemon coordinates peers and volumes.
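How does the client know which brick holds a file? GlusterFS uses elastic hashing (its distributed hash translator) rather than a central metadata server. A toy model of hash-based placement, not the real DHT algorithm:

```python
import hashlib

def pick_brick(filename: str, bricks: list[str]) -> str:
    """Toy hash-based placement. The real GlusterFS DHT assigns hash
    ranges to bricks via extended attributes, but the idea is the same:
    the file name alone determines where the file lives."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

bricks = ["node1:/data1", "node2:/data2", "node3:/data3"]
# Every client computes the same placement, with no lookup service.
placements = {name: pick_brick(name, bricks) for name in ["a.log", "b.log", "c.log"]}
```

Because placement is computed, there is no metadata server to become a bottleneck or a single point of failure.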
GlusterFS Installation and Setup

```bash
# =============================================================================
# INSTALLATION (Ubuntu)
# =============================================================================

# Add the GlusterFS repository
wget -O - https://download.gluster.org/pub/gluster/glusterfs/3.12/3.12.1/rsa.pub | apt-key add -
echo "deb https://download.gluster.org/pub/gluster/glusterfs/3.12/3.12.1/ubuntu/$(lsb_release -cs)/amd64/" \
  | tee /etc/apt/sources.list.d/gluster.list

apt-get update
apt-get install glusterfs-server glusterfs-client glusterfs-common

# Enable and start the management daemon
systemctl enable glusterd
systemctl start glusterd
```
```bash
# =============================================================================
# PROBE NODES
# =============================================================================

# From node1, probe the other nodes
gluster peer probe node2
gluster peer probe node3

# Check peer status
gluster peer status
```
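On call, you rarely stare at raw `gluster peer status` output; a small check script can flag unhealthy peers for you. A hypothetical parser (the sample output is fabricated but follows the command's general format):

```python
# Fabricated sample of `gluster peer status` output for illustration.
SAMPLE_STATUS = """\
Number of Peers: 2

Hostname: node2
State: Peer in Cluster (Connected)

Hostname: node3
State: Peer Rejected (Disconnected)
"""

def unhealthy_peers(status_output: str) -> list[str]:
    """Return hostnames whose state is anything but Connected."""
    bad, host = [], None
    for line in status_output.splitlines():
        if line.startswith("Hostname:"):
            host = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and "(Connected)" not in line:
            bad.append(host)
    return bad

unhealthy_peers(SAMPLE_STATUS)   # ['node3']
```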
```bash
# =============================================================================
# CREATE VOLUME
# =============================================================================

# Create a distributed volume (no replication)
gluster volume create gv0 node1:/data1 node2:/data2 node3:/data3

# Create a replicated volume (3 copies)
gluster volume create gv0 replica 3 \
  node1:/data1 node2:/data2 node3:/data3

# Create a distributed-replicated volume (2x2)
gluster volume create gv0 replica 2 \
  node1:/data1 node2:/data2 \
  node3:/data3 node4:/data4

# Start the volume
gluster volume start gv0

# View volume info
gluster volume info
```
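The three layouts above trade capacity for redundancy differently. A quick sketch of usable capacity, assuming equally sized bricks (the sizes are illustrative):

```python
def usable_capacity(brick_tb: float, bricks: int, replica: int = 1) -> float:
    """Usable TB for a volume of `bricks` equally sized bricks.
    replica=1 models a plain distributed volume (no redundancy)."""
    assert bricks % replica == 0, "bricks must divide evenly into replica sets"
    return brick_tb * bricks / replica

distributed = usable_capacity(10, bricks=3)             # 30.0 TB, no copies
replicated = usable_capacity(10, bricks=3, replica=3)   # 10.0 TB, 3 copies
dist_repl = usable_capacity(10, bricks=4, replica=2)    # 20.0 TB, 2x2 layout
```

The distributed-replicated layout is the common production compromise: it scales out like a distributed volume while each replica set still survives a brick failure.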
GlusterFS Client Mount

```bash
# =============================================================================
# MOUNT GLUSTERFS
# =============================================================================

# Mount using the native GlusterFS client
mount -t glusterfs node1:/gv0 /mnt/glusterfs

# Add to /etc/fstab:
# node1:/gv0 /mnt/glusterfs glusterfs defaults,_netdev 0 0

# Mount using NFS
mount -t nfs node1:/gv0 /mnt/nfs

# Mount using CIFS/Samba
mount -t cifs //server/share /mnt -o username=user

# View mounted volumes
df -h | grep gluster
```

GlusterFS Management
```bash
# =============================================================================
# VOLUME MANAGEMENT
# =============================================================================

# Start/stop a volume
gluster volume start gv0
gluster volume stop gv0

# Add bricks to a volume
gluster volume add-brick gv0 node4:/data4

# Remove a brick (with rebalance)
gluster volume remove-brick gv0 node4:/data4 start
gluster volume remove-brick gv0 node4:/data4 status
gluster volume remove-brick gv0 node4:/data4 commit

# Rebalance after changing the brick layout
gluster volume rebalance gv0 start

# Set volume options
gluster volume set gv0 performance.cache-size 256MB
gluster volume set gv0 network.ping-timeout 10

# View volume status
gluster volume status
gluster volume status gv0 detail
```

65.3 Ceph
Ceph Architecture
A Ceph cluster is built from several cooperating daemon types:

- Clients: access the cluster through the kernel module, the librados library, or the RADOS Gateway (RGW), which exposes S3 and Swift APIs.
- Monitors (MONs): maintain cluster state, maps, and placement groups; run as a quorum (typically three).
- MDS (Metadata Servers): serve CephFS file metadata for POSIX compliance, usually one active with a standby.
- OSDs (Object Storage Daemons): one per disk, store the actual data; the CRUSH algorithm distributes objects across them.
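The CRUSH step is the key idea: clients compute data placement instead of asking a central server. A heavily simplified stand-in (real CRUSH walks a weighted device hierarchy; only the object → PG → OSD indirection is shown):

```python
import hashlib

def object_to_pg(name: str, pg_num: int) -> int:
    """Objects hash to a placement group (PG)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % pg_num

def pg_to_osds(pg: int, osd_ids: list[int], size: int = 3) -> list[int]:
    """Each PG maps to `size` distinct OSDs; here, a simple rotation
    stands in for CRUSH's weighted pseudo-random selection."""
    start = pg % len(osd_ids)
    return [osd_ids[(start + i) % len(osd_ids)] for i in range(size)]

pg = object_to_pg("myobject", pg_num=128)
acting_set = pg_to_osds(pg, osd_ids=[0, 1, 2, 3], size=3)
```

The PG layer of indirection means rebalancing moves whole placement groups, not individual objects, which keeps cluster maps small.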
Ceph Installation

```bash
# =============================================================================
# INSTALLATION (Ubuntu)
# =============================================================================

# Add the Ceph repository
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
echo "deb https://download.ceph.com/debian-pacific/ $(lsb_release -cs) main" | \
  tee /etc/apt/sources.list.d/ceph.list

apt-get update
apt-get install ceph-mon ceph-osd ceph-mds ceph-radosgw
```
```bash
# =============================================================================
# DEPLOY WITH CEPH-DEPLOY
# =============================================================================

# On the admin node, create the cluster
ceph-deploy new node1 node2 node3

# Install Ceph on the nodes
ceph-deploy install node1 node2 node3

# Deploy monitors
ceph-deploy mon create node1 node2 node3

# Deploy OSDs
ceph-deploy osd create --data /dev/sdb node1
ceph-deploy osd create --data /dev/sdb node2
ceph-deploy osd create --data /dev/sdb node3

# Distribute config and admin keys
ceph-deploy admin node1 node2 node3

# Check cluster health
ceph health
ceph -s
```
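`ceph health` prints a status keyword first (HEALTH_OK, HEALTH_WARN, HEALTH_ERR). A hypothetical on-call helper mapping that keyword to an alert action (the severity policy is an example, not a Ceph convention):

```python
def health_severity(health_output: str) -> str:
    """Map the first token of `ceph health` output to an alert level."""
    status = health_output.split()[0]
    return {
        "HEALTH_OK": "none",
        "HEALTH_WARN": "ticket",
        "HEALTH_ERR": "page",
    }.get(status, "unknown")

health_severity("HEALTH_OK")                  # 'none'
health_severity("HEALTH_ERR 1 full osd(s)")   # 'page'
```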
Ceph Pools and RBD

```bash
# =============================================================================
# POOL MANAGEMENT
# =============================================================================

# Create a pool with 128 placement groups
ceph osd pool create mypool 128 128

# Set replica counts
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

# View pools
ceph osd lspools
ceph osd pool stats mypool
```
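The `128` above is the PG count. A common sizing heuristic is (OSDs × 100) / replica count, rounded up to a power of two; modern Ceph can autoscale pg_num, so treat this as a starting point:

```python
def suggested_pg_count(osds: int, pool_size: int = 3, per_osd: int = 100) -> int:
    """Rule of thumb: (osds * per_osd) / pool_size, next power of two."""
    raw = osds * per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power

suggested_pg_count(3)   # 128 -> matches the example pool above
suggested_pg_count(4)   # 256
```

Too few PGs cause uneven data distribution; far too many waste memory and CPU on every OSD.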
```bash
# =============================================================================
# RBD (RADOS BLOCK DEVICE)
# =============================================================================

# Create an RBD image
rbd create mypool/myimage --size 10G

# Map the RBD device (loads the rbd kernel module if needed)
modprobe rbd
rbd map mypool/myimage

# Create a filesystem and mount it
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt/rbd

# Resize the image (then grow the filesystem, e.g. with resize2fs)
rbd resize mypool/myimage --size 20G
```

Ceph Filesystem (CephFS)
```bash
# =============================================================================
# CEPHFS SETUP
# =============================================================================

# Create an MDS
ceph-deploy mds create node1

# Create the filesystem (the metadata and data pools must already exist)
ceph fs new cephfs metadata data

# Mount CephFS
mount -t ceph node1:6789:/ /mnt/cephfs \
  -o name=admin,secret=$(ceph auth get-key client.admin)

# Add to /etc/fstab:
# node1:6789:/ /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime 0 0
```

65.4 Distributed Storage Comparison
Feature Comparison
| Feature         | GlusterFS    | Ceph              | MooseFS      |
|-----------------|--------------|-------------------|--------------|
| Type            | File         | Block/File/Object | File         |
| Min Nodes       | 1            | 3                 | 3            |
| Replication     | Configurable | Configurable      | Configurable |
| Erasure Coding  | Yes          | Yes               | No           |
| Geo-replication | Yes          | Yes               | Limited      |
| Snapshots       | Yes          | Yes               | Yes          |
| Quotas          | Yes          | Yes               | Yes          |
| NFS/SMB         | Native       | Via RGW           | Native       |
| S3/Swift API    | No           | Yes               | No           |
| Learning Curve  | Low          | High              | Medium       |
| Performance     | Good         | Excellent         | Good         |

65.5 Distributed Storage Best Practices
Summary Checklist
Planning

- [ ] Choose appropriate storage type (block/file/object)
- [ ] Plan capacity with growth projections
- [ ] Design for expected failure scenarios
- [ ] Consider network topology

Configuration

- [ ] Use recommended replication factor (minimum 3)
- [ ] Enable data balancing
- [ ] Configure appropriate cache settings
- [ ] Set up monitoring and alerts

Monitoring

- [ ] Monitor disk usage
- [ ] Monitor cluster health
- [ ] Set up performance metrics
- [ ] Monitor network traffic

Maintenance

- [ ] Regular health checks
- [ ] Plan for hardware failures
- [ ] Test disaster recovery procedures
- [ ] Keep documentation updated

Common Mistakes & Anti-Patterns
1. Insufficient Replication
WRONG:

```bash
# Using replica 1 for data
gluster volume create data replica 1 brick1:/brick
```

CORRECT:

```bash
# Use replica 3 minimum for production
gluster volume create data replica 3 \
  brick1:/brick \
  brick2:/brick \
  brick3:/brick
```

Why: a single replica is a single point of failure; a brick failure means data loss.
2. Not Planning Capacity
WRONG:

```bash
# Just add bricks without planning
gluster volume add-brick data brick4:/brick
```

CORRECT:

```bash
# Plan for a 70% capacity threshold
# Add bricks with proper distribution
# Rebalance after adding bricks
gluster volume rebalance data start
```

Why: running out of space causes writes to fail and cluster instability.
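The 70% threshold can be encoded directly in monitoring. A minimal sketch (threshold and sizes are illustrative):

```python
def needs_expansion(used_bytes: int, total_bytes: int, threshold: float = 0.70) -> bool:
    """Flag the cluster before full bricks start failing writes."""
    return used_bytes / total_bytes >= threshold

needs_expansion(6 * 10**12, 10 * 10**12)   # False: 60% used
needs_expansion(7 * 10**12, 10 * 10**12)   # True: at the 70% line
```

Alerting at 70% leaves headroom for both growth and the temporary imbalance a rebalance creates while data migrates to new bricks.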
3. Ignoring Network Requirements
WRONG:

```bash
# Using the same network for storage and client traffic
# All on 1GbE
```

CORRECT:

```bash
# Use a separate storage network
# Use 10GbE or faster for replication traffic
# Isolate storage traffic on a dedicated VLAN
```

Why: the network is the usual bottleneck for distributed storage performance.
Interview Questions
Conceptual Questions
Q: What's the difference between GlusterFS and Ceph?

A: GlusterFS is file-based (POSIX), simpler to set up, and scales to petabytes. Ceph provides block (RBD), file (CephFS), and object (RGW) storage; it is more complex but more versatile.
Q: Explain RAID vs distributed storage.

A: RAID protects against disk failure within a server. Distributed storage protects against server/node failure by spreading data across multiple servers.
Scenario-Based Questions
Q: Your GlusterFS volume shows degraded status. How would you troubleshoot?

A: Check `gluster volume status`, identify offline bricks, check the brick servers, heal the volume with `gluster volume heal`, and replace failed bricks.
End of Chapter 65: Distributed Storage Systems