Chapter 65: Distributed Storage Systems

Scaling Storage with GlusterFS, Ceph, and Distributed Solutions

Why This Matters in DevOps/SRE

Distributed storage is essential for modern cloud-native and scale-out infrastructure:
- Scale: Traditional storage can’t handle petabyte-scale workloads
- Availability: Distributed = no single point of failure
- DevOps: You’ll manage distributed storage in production
- Cloud: AWS EFS, EBS, S3 are distributed storage
- On-Call: Respond to storage cluster issues and capacity alerts
Understanding distributed storage is critical for modern infrastructure roles.
65.1 Distributed Storage Fundamentals

Understanding Distributed Storage

Distributed storage spreads data across multiple servers to provide scalability, fault tolerance, and high availability.
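Replication is what turns the fault-tolerance claim into numbers. A back-of-the-envelope sketch (a simplification that assumes independent node failures; the 1% figure is illustrative, not measured):

```python
# Probability of losing a piece of data: every one of its replicas must
# fail during the same window (assumes independent node failures).
def loss_probability(node_failure_prob: float, replicas: int) -> float:
    return node_failure_prob ** replicas

# With a hypothetical 1% per-node failure chance during a rebuild window:
single_copy = loss_probability(0.01, replicas=1)   # 1 in 100
three_copies = loss_probability(0.01, replicas=3)  # 1 in a million
```

This is why a replication factor of 3 is the usual production default: each extra copy multiplies durability, at the cost of raw capacity.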
Distributed storage provides six main benefits:

- Scale out: add nodes to grow capacity, with near-linear scaling
- High availability: multiple copies of data and automatic failover
- Fault tolerance: data replicated across nodes, self-healing
- Performance: parallel I/O across multiple nodes
- Global access: a single namespace from any location
- Cost effective: commodity hardware, no SAN required

Distributed Storage Types
The three storage types serve different workloads:

- Block storage: block devices backed by a distributed block service (Ceph RBD, Sheepdog). Use cases: virtual machines, databases.
- File storage: files and directories served by a distributed file system (GlusterFS, HDFS, CephFS). Use cases: NFS, Samba, shared file storage.
- Object storage: objects stored behind an object service (Ceph RGW, S3). Use cases: cloud storage, backups, media.

65.2 GlusterFS
GlusterFS Architecture
A GlusterFS deployment has two layers:

- Client layer: applications access the volume through the GlusterFS FUSE client, which talks to the storage nodes directly.
- Server layer: each node exports one or more bricks (local directories such as /data1, /data2, /data3), while the glusterd management daemon coordinates peers and volumes.
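How does the client know which brick holds a file? GlusterFS uses elastic hashing (its distributed hash translator) rather than a central metadata server. A toy model of hash-based placement, not the real DHT algorithm:

```python
import hashlib

def pick_brick(filename: str, bricks: list[str]) -> str:
    """Toy hash-based placement. The real GlusterFS DHT assigns hash
    ranges to bricks via extended attributes, but the idea is the same:
    the file name alone determines where the file lives."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

bricks = ["node1:/data1", "node2:/data2", "node3:/data3"]
# Every client computes the same placement, with no lookup service.
placements = {name: pick_brick(name, bricks) for name in ["a.log", "b.log", "c.log"]}
```

Because placement is computed, there is no metadata server to become a bottleneck or a single point of failure.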
GlusterFS Installation and Setup

```bash
# =============================================================================
# INSTALLATION (Ubuntu)
# =============================================================================

# Add the GlusterFS repository
wget -O - https://download.gluster.org/pub/gluster/glusterfs/3.12/3.12.1/rsa.pub | apt-key add -
echo "deb https://download.gluster.org/pub/gluster/glusterfs/3.12/3.12.1/ubuntu/$(lsb_release -cs)/amd64/" \
  | tee /etc/apt/sources.list.d/gluster.list

apt-get update
apt-get install glusterfs-server glusterfs-client glusterfs-common

# Enable and start the management daemon
systemctl enable glusterd
systemctl start glusterd
```
```bash
# =============================================================================
# PROBE NODES
# =============================================================================

# From node1, probe the other nodes
gluster peer probe node2
gluster peer probe node3

# Check peer status
gluster peer status
```
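On call, you rarely stare at raw `gluster peer status` output; a small check script can flag unhealthy peers for you. A hypothetical parser (the sample output is fabricated but follows the command's general format):

```python
# Fabricated sample of `gluster peer status` output for illustration.
SAMPLE_STATUS = """\
Number of Peers: 2

Hostname: node2
State: Peer in Cluster (Connected)

Hostname: node3
State: Peer Rejected (Disconnected)
"""

def unhealthy_peers(status_output: str) -> list[str]:
    """Return hostnames whose state is anything but Connected."""
    bad, host = [], None
    for line in status_output.splitlines():
        if line.startswith("Hostname:"):
            host = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and "(Connected)" not in line:
            bad.append(host)
    return bad

unhealthy_peers(SAMPLE_STATUS)   # ['node3']
```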
```bash
# =============================================================================
# CREATE VOLUME
# =============================================================================

# Create a distributed volume (no replication)
gluster volume create gv0 node1:/data1 node2:/data2 node3:/data3

# Create a replicated volume (3 copies)
gluster volume create gv0 replica 3 \
  node1:/data1 node2:/data2 node3:/data3

# Create a distributed-replicated volume (2x2)
gluster volume create gv0 replica 2 \
  node1:/data1 node2:/data2 \
  node3:/data3 node4:/data4

# Start the volume
gluster volume start gv0

# View volume info
gluster volume info
```
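The three layouts above trade capacity for redundancy differently. A quick sketch of usable capacity, assuming equally sized bricks (the sizes are illustrative):

```python
def usable_capacity(brick_tb: float, bricks: int, replica: int = 1) -> float:
    """Usable TB for a volume of `bricks` equally sized bricks.
    replica=1 models a plain distributed volume (no redundancy)."""
    assert bricks % replica == 0, "bricks must divide evenly into replica sets"
    return brick_tb * bricks / replica

distributed = usable_capacity(10, bricks=3)             # 30.0 TB, no copies
replicated = usable_capacity(10, bricks=3, replica=3)   # 10.0 TB, 3 copies
dist_repl = usable_capacity(10, bricks=4, replica=2)    # 20.0 TB, 2x2 layout
```

The distributed-replicated layout is the common production compromise: it scales out like a distributed volume while each replica set still survives a brick failure.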
GlusterFS Client Mount

```bash
# =============================================================================
# MOUNT GLUSTERFS
# =============================================================================

# Mount using the native GlusterFS client
mount -t glusterfs node1:/gv0 /mnt/glusterfs

# Add to /etc/fstab:
# node1:/gv0 /mnt/glusterfs glusterfs defaults,_netdev 0 0

# Mount using NFS
mount -t nfs node1:/gv0 /mnt/nfs

# Mount using CIFS/Samba
mount -t cifs //server/share /mnt -o username=user

# View mounted volumes
df -h | grep gluster
```

GlusterFS Management
```bash
# =============================================================================
# VOLUME MANAGEMENT
# =============================================================================

# Start/stop a volume
gluster volume start gv0
gluster volume stop gv0

# Add bricks to a volume
gluster volume add-brick gv0 node4:/data4

# Remove a brick (with rebalance)
gluster volume remove-brick gv0 node4:/data4 start
gluster volume remove-brick gv0 node4:/data4 status
gluster volume remove-brick gv0 node4:/data4 commit

# Rebalance after changing the brick layout
gluster volume rebalance gv0 start

# Set volume options
gluster volume set gv0 performance.cache-size 256MB
gluster volume set gv0 network.ping-timeout 10

# View volume status
gluster volume status
gluster volume status gv0 detail
```

65.3 Ceph
Ceph Architecture
A Ceph cluster is built from several cooperating daemon types:

- Clients: access the cluster through the kernel module, the librados library, or the RADOS Gateway (RGW), which exposes S3 and Swift APIs.
- Monitors (MONs): maintain cluster state, maps, and placement groups; run as a quorum (typically three).
- MDS (Metadata Servers): serve CephFS file metadata for POSIX compliance, usually one active with a standby.
- OSDs (Object Storage Daemons): one per disk, store the actual data; the CRUSH algorithm distributes objects across them.
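The CRUSH step is the key idea: clients compute data placement instead of asking a central server. A heavily simplified stand-in (real CRUSH walks a weighted device hierarchy; only the object → PG → OSD indirection is shown):

```python
import hashlib

def object_to_pg(name: str, pg_num: int) -> int:
    """Objects hash to a placement group (PG)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % pg_num

def pg_to_osds(pg: int, osd_ids: list[int], size: int = 3) -> list[int]:
    """Each PG maps to `size` distinct OSDs; here, a simple rotation
    stands in for CRUSH's weighted pseudo-random selection."""
    start = pg % len(osd_ids)
    return [osd_ids[(start + i) % len(osd_ids)] for i in range(size)]

pg = object_to_pg("myobject", pg_num=128)
acting_set = pg_to_osds(pg, osd_ids=[0, 1, 2, 3], size=3)
```

The PG layer of indirection means rebalancing moves whole placement groups, not individual objects, which keeps cluster maps small.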
Ceph Installation

```bash
# =============================================================================
# INSTALLATION (Ubuntu)
# =============================================================================

# Add the Ceph repository
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
echo "deb https://download.ceph.com/debian-pacific/ $(lsb_release -cs) main" | \
  tee /etc/apt/sources.list.d/ceph.list

apt-get update
apt-get install ceph-mon ceph-osd ceph-mds ceph-radosgw
```
```bash
# =============================================================================
# DEPLOY WITH CEPH-DEPLOY
# =============================================================================

# On the admin node, create the cluster
ceph-deploy new node1 node2 node3

# Install Ceph on the nodes
ceph-deploy install node1 node2 node3

# Deploy monitors
ceph-deploy mon create node1 node2 node3

# Deploy OSDs
ceph-deploy osd create --data /dev/sdb node1
ceph-deploy osd create --data /dev/sdb node2
ceph-deploy osd create --data /dev/sdb node3

# Distribute config and admin keys
ceph-deploy admin node1 node2 node3

# Check cluster health
ceph health
ceph -s
```
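`ceph health` prints a status keyword first (HEALTH_OK, HEALTH_WARN, HEALTH_ERR). A hypothetical on-call helper mapping that keyword to an alert action (the severity policy is an example, not a Ceph convention):

```python
def health_severity(health_output: str) -> str:
    """Map the first token of `ceph health` output to an alert level."""
    status = health_output.split()[0]
    return {
        "HEALTH_OK": "none",
        "HEALTH_WARN": "ticket",
        "HEALTH_ERR": "page",
    }.get(status, "unknown")

health_severity("HEALTH_OK")                  # 'none'
health_severity("HEALTH_ERR 1 full osd(s)")   # 'page'
```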
Ceph Pools and RBD

```bash
# =============================================================================
# POOL MANAGEMENT
# =============================================================================

# Create a pool with 128 placement groups
ceph osd pool create mypool 128 128

# Set replica counts
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

# View pools
ceph osd lspools
ceph osd pool stats mypool
```
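The `128` above is the PG count. A common sizing heuristic is (OSDs × 100) / replica count, rounded up to a power of two; modern Ceph can autoscale pg_num, so treat this as a starting point:

```python
def suggested_pg_count(osds: int, pool_size: int = 3, per_osd: int = 100) -> int:
    """Rule of thumb: (osds * per_osd) / pool_size, next power of two."""
    raw = osds * per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power

suggested_pg_count(3)   # 128 -> matches the example pool above
suggested_pg_count(4)   # 256
```

Too few PGs cause uneven data distribution; far too many waste memory and CPU on every OSD.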
```bash
# =============================================================================
# RBD (RADOS BLOCK DEVICE)
# =============================================================================

# Create an RBD image
rbd create mypool/myimage --size 10G

# Map the RBD device (loads the rbd kernel module if needed)
modprobe rbd
rbd map mypool/myimage

# Create a filesystem and mount it
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt/rbd

# Resize the image (then grow the filesystem, e.g. with resize2fs)
rbd resize mypool/myimage --size 20G
```

Ceph Filesystem (CephFS)
```bash
# =============================================================================
# CEPHFS SETUP
# =============================================================================

# Create an MDS
ceph-deploy mds create node1

# Create the filesystem (the metadata and data pools must already exist)
ceph fs new cephfs metadata data

# Mount CephFS
mount -t ceph node1:6789:/ /mnt/cephfs \
  -o name=admin,secret=$(ceph auth get-key client.admin)

# Add to /etc/fstab:
# node1:6789:/ /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime 0 0
```

65.4 Distributed Storage Comparison
Feature Comparison
| Feature         | GlusterFS    | Ceph              | MooseFS      |
|-----------------|--------------|-------------------|--------------|
| Type            | File         | Block/File/Object | File         |
| Min Nodes       | 1            | 3                 | 3            |
| Replication     | Configurable | Configurable      | Configurable |
| Erasure Coding  | Yes          | Yes               | No           |
| Geo-replication | Yes          | Yes               | Limited      |
| Snapshots       | Yes          | Yes               | Yes          |
| Quotas          | Yes          | Yes               | Yes          |
| NFS/SMB         | Native       | Via RGW           | Native       |
| S3/Swift API    | No           | Yes               | No           |
| Learning Curve  | Low          | High              | Medium       |
| Performance     | Good         | Excellent         | Good         |

65.5 Distributed Storage Best Practices
Summary Checklist
Planning

- [ ] Choose appropriate storage type (block/file/object)
- [ ] Plan capacity with growth projections
- [ ] Design for expected failure scenarios
- [ ] Consider network topology

Configuration

- [ ] Use recommended replication factor (minimum 3)
- [ ] Enable data balancing
- [ ] Configure appropriate cache settings
- [ ] Set up monitoring and alerts

Monitoring

- [ ] Monitor disk usage
- [ ] Monitor cluster health
- [ ] Set up performance metrics
- [ ] Monitor network traffic

Maintenance

- [ ] Regular health checks
- [ ] Plan for hardware failures
- [ ] Test disaster recovery procedures
- [ ] Keep documentation updated

Common Mistakes & Anti-Patterns
1. Insufficient Replication
WRONG:

```bash
# Using replica 1 for data
gluster volume create data replica 1 brick1:/brick
```

CORRECT:

```bash
# Use replica 3 minimum for production
gluster volume create data replica 3 \
  brick1:/brick \
  brick2:/brick \
  brick3:/brick
```

Why: a single replica is a single point of failure; a brick failure means data loss.
2. Not Planning Capacity
WRONG:

```bash
# Just add bricks without planning
gluster volume add-brick data brick4:/brick
```

CORRECT:

```bash
# Plan for a 70% capacity threshold
# Add bricks with proper distribution
# Rebalance after adding bricks
gluster volume rebalance data start
```

Why: running out of space causes writes to fail and cluster instability.
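The 70% threshold can be encoded directly in monitoring. A minimal sketch (threshold and sizes are illustrative):

```python
def needs_expansion(used_bytes: int, total_bytes: int, threshold: float = 0.70) -> bool:
    """Flag the cluster before full bricks start failing writes."""
    return used_bytes / total_bytes >= threshold

needs_expansion(6 * 10**12, 10 * 10**12)   # False: 60% used
needs_expansion(7 * 10**12, 10 * 10**12)   # True: at the 70% line
```

Alerting at 70% leaves headroom for both growth and the temporary imbalance a rebalance creates while data migrates to new bricks.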
3. Ignoring Network Requirements
WRONG:

```bash
# Using the same network for storage and client traffic
# All on 1GbE
```

CORRECT:

```bash
# Use a separate storage network
# Use 10GbE or faster for replication traffic
# Isolate storage traffic on a dedicated VLAN
```

Why: the network is the usual bottleneck for distributed storage performance.
Interview Questions
Conceptual Questions
Q: What's the difference between GlusterFS and Ceph?

A: GlusterFS is file-based (POSIX), simpler to set up, and scales to petabytes. Ceph provides block (RBD), file (CephFS), and object (RGW) storage; it is more complex but more versatile.
Q: Explain RAID vs distributed storage.

A: RAID protects against disk failure within a server. Distributed storage protects against server/node failure by spreading data across multiple servers.
Scenario-Based Questions
Q: Your GlusterFS volume shows degraded status. How would you troubleshoot?

A: Check `gluster volume status`, identify offline bricks, check the brick servers, heal the volume with `gluster volume heal`, and replace failed bricks.
End of Chapter 65: Distributed Storage Systems