# Chapter 9: Database Sharding & Partitioning
Horizontally Scaling Your Database

## 9.1 Introduction to Sharding
Sharding (also called horizontal partitioning) splits a dataset across multiple databases or database instances to distribute load and enable horizontal scaling.
```
Vertical vs Horizontal Partitioning
===================================

Vertical Partitioning:

+----------+  +----------+  +----------+
| Table A  |  | Table B  |  | Table C  |
| (Users)  |  | (Orders) |  |(Products)|
+----------+  +----------+  +----------+

Horizontal Partitioning (Sharding):

+------------------------+
|      Users Table       |
+------------------------+
            |
            v
+----------+----------+
| Shard 1  | Shard 2  |
|   A-M    |   N-Z    |
+----------+----------+
```

## 9.2 Why Sharding?
### Scaling Challenges Without Sharding
```
Monolithic Database Problems
============================

1. Storage Limits
+----------------------------------+
| 1 Billion Users                  |
| Database size: 50TB              |
| Single disk: 10TB                |
| Need 5 disks just for storage    |
+----------------------------------+

2. Write Throughput
+----------------------------------+
| 100,000 writes/second            |
| Single DB: handles 10,000        |
| 90% writes rejected              |
+----------------------------------+

3. Connection Limits
+----------------------------------+
| 10,000 concurrent users          |
| Max connections: 500             |
| Connection pool exhausted        |
+----------------------------------+
```

### Benefits of Sharding
| Benefit | Description |
|---|---|
| Horizontal Scaling | Add more shards to increase capacity |
| Improved Performance | Smaller datasets per shard = faster queries |
| High Availability | Shard failures don’t affect others |
| Geographic Distribution | Shards in different regions |
## 9.3 Sharding Strategies
### 9.3.1 Range-Based Sharding
```
Range-Based Sharding
====================

Shard by ranges of a key:

+------------------------------------------------+
|                    User ID                     |
+------------------------------------------------+
    |          |          |          |
    v          v          v          v
+------+   +------+   +------+   +------+
|Shard |   |Shard |   |Shard |   |Shard |
|  1   |   |  2   |   |  3   |   |  4   |
|  ID  |   |  ID  |   |  ID  |   |  ID  |
| 1-1M |   |1M-2M |   |2M-3M |   |3M-4M |
+------+   +------+   +------+   +------+

Example:
  User ID   500,000 -> Shard 1
  User ID 1,500,000 -> Shard 2
```

Pros:

- Simple to implement
- Easy to find data

Cons:

- Hot spots (popular IDs)
- Uneven distribution
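The range lookup above can be sketched with a sorted list of shard boundaries. The boundary values and 1-based shard numbering below are hypothetical, chosen to match the four-shard diagram:

```python
import bisect

# Upper bound (inclusive) of each shard's ID range -- hypothetical values
# matching the diagram: four shards of one million IDs each.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

def range_shard(user_id):
    """Return the shard number (1-based) that owns this user ID."""
    idx = bisect.bisect_left(SHARD_UPPER_BOUNDS, user_id)
    if idx == len(SHARD_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is outside all shard ranges")
    return idx + 1

print(range_shard(500_000))    # -> 1
print(range_shard(1_500_000))  # -> 2
```

Using `bisect` keeps the lookup O(log n) in the number of shards, and adding a shard is just appending a new boundary.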
### 9.3.2 Hash-Based Sharding

```
Hash-Based Sharding
===================

Apply a hash function to the shard key:

  hash(user_id) = 12345
  shard = hash % num_shards

+------------------------------------------------+
|                 User ID: 12345                 |
+------------------------------------------------+
                       |
                       v
                 hash(user_id)
                       |
                       v
                 12345 % 4 = 1
                       |
                       v
+----------+----------+----------+----------+
| Shard 0  | Shard 1  | Shard 2  | Shard 3  |
| hash%4=0 | hash%4=1 | hash%4=2 | hash%4=3 |
+----------+----------+----------+----------+
```

Pros:

- Even distribution
- Less hot spotting

Cons:

- Can't do range queries efficiently
- Resharding is complex
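A minimal sketch of hash-based routing, assuming a hypothetical four-shard cluster. A stable digest (MD5 here) is used rather than Python's built-in `hash()`, which is randomized per process for strings and would route the same key differently across restarts:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def hash_shard(key):
    """Map a shard key to a shard number using a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard:
print(hash_shard(12345) == hash_shard(12345))  # True
```

This also illustrates the resharding problem from the Cons list: changing `NUM_SHARDS` changes the modulus, remapping almost every key to a different shard.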
### 9.3.3 Directory-Based Sharding

```
Directory-Based Sharding
========================

A lookup table maps keys to shards:

+-------------------+
|  Shard Directory  |
+-------------------+
| user:123 -> Shard1|
| user:456 -> Shard2|
| user:789 -> Shard3|
+-------------------+
```

Flow:

1. Look up user:456 in the directory
2. Directory says Shard 2
3. Connect to Shard 2

Pros:

- Flexible
- Can move data without affecting clients

Cons:

- Directory is a single point of failure
- Extra lookup needed
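The flow above can be sketched with a plain dictionary standing in for the directory. In production the directory would live in a replicated store to avoid the single point of failure noted in the Cons list; the keys and shard names below are the hypothetical ones from the diagram:

```python
# In-memory stand-in for the shard directory service.
shard_directory = {
    "user:123": "shard1",
    "user:456": "shard2",
    "user:789": "shard3",
}

def locate(key):
    """Steps 1-2 of the flow: resolve a key to its shard via the directory."""
    shard = shard_directory.get(key)
    if shard is None:
        raise KeyError(f"no shard assigned for {key!r}")
    return shard

print(locate("user:456"))  # shard2

# Moving a user's data is just a directory update; callers are unaffected:
shard_directory["user:456"] = "shard3"
print(locate("user:456"))  # shard3
```

The last two lines show the key advantage: data can move between shards without any change to client routing logic.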
### 9.3.4 Geographic Sharding

```
Geographic Sharding
===================

+--------------------------------------------------+
|                    World Map                     |
+--------------------------------------------------+

North America        Europe           Asia
      |                 |               |
  +-------+         +-------+       +-------+
  |Shard 1|         |Shard 2|       |Shard 3|
  |Users  |         |Users  |       |Users  |
  |from   |         |from   |       |from   |
  |USA/CA |         |EU     |       |Asia   |
  +-------+         +-------+       +-------+
```

Benefits:

- Low latency (data close to users)
- Data sovereignty compliance
- Regional performance

## 9.4 Choosing a Shard Key
### Ideal Shard Key Properties
| Property | Description |
|---|---|
| High Cardinality | Many distinct values |
| Even Distribution | Data spread evenly |
| Query Locality | Queries hit one shard |
| Stable | Rarely changes |
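The "even distribution" property is easy to measure before committing to a key: hash a sample of real keys and count how many land on each shard. A minimal sketch (the sample data and four-shard count are hypothetical):

```python
from collections import Counter

def shard_distribution(keys, num_shards):
    """Count how many sample keys land on each shard under hash sharding."""
    counts = Counter(hash(k) % num_shards for k in keys)
    return [counts.get(shard, 0) for shard in range(num_shards)]

# High-cardinality key (user ID): spreads evenly.
print(shard_distribution(range(10_000), 4))  # [2500, 2500, 2500, 2500]

# Low-cardinality key (country): 10,000 rows pile onto at most 3 shards.
countries = ["US"] * 7000 + ["DE"] * 2000 + ["JP"] * 1000
print(shard_distribution(countries, 4))
```

A skewed histogram like the second one is exactly the hot-spot problem the next section describes.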
### Good vs Bad Shard Keys
**Good shard keys.** User ID for user data:

- High cardinality (billions of users)
- Even distribution
- Queries by user go to one shard

**Bad shard keys.** Country for user data:

- Low cardinality (200 countries)
- Uneven distribution (US has many more users)
- Hot spots

Zip code for orders:

- Could work, but check the distribution
- May need a composite key

## 9.5 Cross-Shard Operations
### Challenges with Sharded Data
```
Cross-Shard Queries
===================

Problem: "Find all orders for users in shard 1 AND shard 2"

 Shard 1    Shard 2    Shard 3
+------+   +------+   +------+
|Users |   |Users |   |Users |
+------+   +------+   +------+

Must query each shard separately!

Query:
1. Query Shard 1 -> 100 results
2. Query Shard 2 ->  50 results
3. Query Shard 3 ->  75 results
4. Merge and sort -> Final result

This is slow!
```

### Solutions
| Challenge | Solution |
|---|---|
| Joins across shards | Denormalize data, duplicate data |
| Aggregations | Query all shards, merge results |
| Transactions | Use two-phase commit or saga pattern |
| Unique IDs | Use distributed ID generators |
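The scatter-gather flow from 9.5 (query every shard, then merge and sort) can be sketched as an n-way merge of per-shard results. The result sets below are hypothetical, and each shard is assumed to return its rows already sorted by order date:

```python
import heapq

# Hypothetical per-shard query results, each already sorted by order date.
shard_results = {
    "shard1": [("2024-01-03", "order-17"), ("2024-01-09", "order-42")],
    "shard2": [("2024-01-05", "order-23")],
    "shard3": [("2024-01-01", "order-08"), ("2024-01-07", "order-31")],
}

def scatter_gather(results_by_shard):
    """Merge step: combine per-shard sorted results into one sorted list.

    heapq.merge does an n-way merge without re-sorting everything,
    which matters when each shard returns thousands of rows.
    """
    return list(heapq.merge(*results_by_shard.values()))

print(scatter_gather(shard_results)[0])  # ('2024-01-01', 'order-08')
```

In a real system the three lists would come from parallel queries to the shards; the merge step is the same either way, and its cost is why cross-shard queries are worth designing around.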
## 9.6 Rebalancing Shards
### When to Rebalance
```
Rebalancing Triggers
====================

1. Data Growth
+----------------------------------+
| Shard 1: 80GB (over 70%)         |
| Shard 2: 20GB (under 30%)        |
| -> Need to split Shard 1         |
+----------------------------------+

2. Performance
+----------------------------------+
| Shard 1: 5000 queries/sec        |
| Shard 2:  500 queries/sec        |
| -> Rebalance to equalize         |
+----------------------------------+

3. Adding/Removing Nodes
+----------------------------------+
| Add new server                   |
| -> Redistribute data             |
+----------------------------------+
```

### Rebalancing Strategies
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed | Pre-defined shards | Simple | Inflexible |
| Dynamic | Auto-split on threshold | Adaptive | Complex |
| Virtual | Multiple logical shards per physical | Flexible | Overhead |
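The virtual-shard strategy from the table can be sketched as a two-level mapping: keys hash into a fixed set of logical shards, and a separate table assigns logical shards to physical nodes. All counts and node names below are hypothetical, and integer keys are assumed for simplicity:

```python
NUM_VIRTUAL = 64  # fixed for the cluster's lifetime

def virtual_shard(key):
    """A key's logical shard never changes (integer keys assumed)."""
    return key % NUM_VIRTUAL

def build_assignment(nodes):
    """Spread the logical shards round-robin over the physical nodes."""
    return {v: nodes[v % len(nodes)] for v in range(NUM_VIRTUAL)}

def node_for_key(key, assignment):
    return assignment[virtual_shard(key)]

old = build_assignment(["db1", "db2", "db3"])
new = build_assignment(["db1", "db2", "db3", "db4"])

# After adding db4, only the logical->physical table changes; the key's
# logical shard (57) is untouched, so no keys need to be re-hashed.
print(virtual_shard(12345))      # 57
print(node_for_key(12345, old))  # db1
print(node_for_key(12345, new))  # db2
```

Rebalancing then means copying whole logical shards between nodes and updating the table, rather than re-hashing every key. Redis Cluster applies the same idea with 16384 fixed hash slots.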
## 9.7 Sharding in Practice
### Application-Level Sharding
```
Application-Level Sharding
==========================

         Application
              |
              v
       +-------------+
       |    Shard    |
       |   Router    |
       +-------------+
              |
    +-----+---+---+-----+
    |     |       |     |
    v     v       v     v
   DB1   DB2     DB3   DB4
```

Router logic:

```python
# connection_pool is assumed to hold one pooled connection per shard,
# indexed 0-3.
def get_shard(user_id):
    return user_id % 4

def get_connection(user_id):
    shard = get_shard(user_id)
    return connection_pool[shard]
```

### Database-Native Sharding
```
Database-Native Sharding
========================

MongoDB Sharded Cluster:

+---------------------------------------------------+
|                    Application                    |
+---------------------------------------------------+
          |                         |
          v                         v
   +-------------+           +-------------+
   |   Router    |           |   Router    |
   |  (mongos)   |           |  (mongos)   |
   +-------------+           +-------------+
          |                         |
          v                         v
   +-------------+           +-------------+
   |   Config    |           |   Config    |
   |   Server    |           |   Server    |
   +-------------+           +-------------+

Shard A (Primary)
+-------------+   +-------------+
|  Replica 1  |   |  Replica 2  |
+-------------+   +-------------+
```

## 9.8 Sharding vs Replication
| Aspect | Sharding | Replication |
|---|---|---|
| Purpose | Scale writes & storage | Scale reads, HA |
| Data | Different data per node | Same data on all |
| Complexity | High | Medium |
| Queries | May need scatter-gather | Can read from any |
| Failure Impact | Partial data loss | Other replicas available |
### Often Used Together
```
Sharding + Replication
======================

            Application
                 |
                 v
         +----------------+
         | Load Balancer  |
         +----------------+
                 |
     +-----------+-----------+
     |           |           |
     v           v           v
+--------+  +--------+  +--------+
| Shard 1|  | Shard 2|  | Shard 3|
|Primary |  |Primary |  |Primary |
+--------+  +--------+  +--------+
     |           |           |
     v           v           v
+--------+  +--------+  +--------+
|Replica |  |Replica |  |Replica |
+--------+  +--------+  +--------+
```

## 9.9 Best Practices
### Sharding Guidelines
| Best Practice | Description |
|---|---|
| Choose shard key carefully | Test distribution before production |
| Start with fewer shards | Don’t over-shard early |
| Plan for growth | Leave room to add shards |
| Monitor distribution | Check for hot spots |
| Test failover | Ensure HA works with sharding |
| Document routing logic | Make it maintainable |
## Summary

Key sharding concepts:
- Choose the right shard key: cardinality, distribution, query locality
- Hash-based is often best: even distribution
- Cross-shard queries are expensive: denormalize if needed
- Plan for rebalancing: growth will happen
- Combine with replication: sharding for scale, replication for HA
- Start simple: don't over-engineer early