# Chapter 9: Database Sharding & Partitioning
Horizontally Scaling Your Database

## 9.1 Introduction to Sharding
Sharding (also called horizontal partitioning) splits a dataset across multiple databases or database instances to distribute load and enable horizontal scaling.
```
Vertical vs Horizontal Partitioning
===================================

Vertical Partitioning:

+----------+  +----------+  +----------+
| Table A  |  | Table B  |  | Table C  |
| (Users)  |  | (Orders) |  |(Products)|
+----------+  +----------+  +----------+

Horizontal Partitioning (Sharding):

+------------------------+
|      Users Table       |
+------------------------+
            |
            v
+----------+----------+
| Shard 1  | Shard 2  |
|   A-M    |   N-Z    |
+----------+----------+
```

## 9.2 Why Sharding?
### Scaling Challenges Without Sharding
```
Monolithic Database Problems
============================

1. Storage Limits
+----------------------------------+
| 1 Billion Users                  |
| Database size: 50TB              |
| Single disk: 10TB                |
| Need 5 disks just for storage    |
+----------------------------------+

2. Write Throughput
+----------------------------------+
| 100,000 writes/second            |
| Single DB: handles 10,000        |
| 90% writes rejected              |
+----------------------------------+

3. Connection Limits
+----------------------------------+
| 10,000 concurrent users          |
| Max connections: 500             |
| Connection pool exhausted        |
+----------------------------------+
```

### Benefits of Sharding
| Benefit | Description |
|---|---|
| Horizontal Scaling | Add more shards to increase capacity |
| Improved Performance | Smaller datasets per shard = faster queries |
| High Availability | Shard failures don’t affect others |
| Geographic Distribution | Shards in different regions |
## 9.3 Sharding Strategies
### 9.3.1 Range-Based Sharding
```
Range-Based Sharding
====================

Shard by ranges of a key:

+------------------------------------------------+
|                    User ID                     |
+------------------------------------------------+
    |          |          |          |
    v          v          v          v
+------+   +------+   +------+   +------+
|Shard |   |Shard |   |Shard |   |Shard |
|  1   |   |  2   |   |  3   |   |  4   |
|  ID  |   |  ID  |   |  ID  |   |  ID  |
| 1-1M |   |1M-2M |   |2M-3M |   |3M-4M |
+------+   +------+   +------+   +------+

Example:
  User ID   500,000 -> Shard 1
  User ID 1,500,000 -> Shard 2
```

Pros:

- Simple to implement
- Easy to find data

Cons:

- Hot spots (popular IDs)
- Uneven distribution
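The range lookup above can be sketched with a sorted list of shard boundaries. The boundary values and 1-based shard numbering below are hypothetical, chosen to match the four-shard diagram:

```python
import bisect

# Upper bound (inclusive) of each shard's ID range -- hypothetical values
# matching the diagram: four shards of one million IDs each.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

def range_shard(user_id):
    """Return the shard number (1-based) that owns this user ID."""
    idx = bisect.bisect_left(SHARD_UPPER_BOUNDS, user_id)
    if idx == len(SHARD_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is outside all shard ranges")
    return idx + 1

print(range_shard(500_000))    # -> 1
print(range_shard(1_500_000))  # -> 2
```

Using `bisect` keeps the lookup O(log n) in the number of shards, and adding a shard is just appending a new boundary.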
### 9.3.2 Hash-Based Sharding

```
Hash-Based Sharding
===================

Apply a hash function to the shard key:

  hash(user_id) = 12345
  shard = hash % num_shards

+------------------------------------------------+
|                 User ID: 12345                 |
+------------------------------------------------+
                       |
                       v
                 hash(user_id)
                       |
                       v
                 12345 % 4 = 1
                       |
                       v
+----------+----------+----------+----------+
| Shard 0  | Shard 1  | Shard 2  | Shard 3  |
| hash%4=0 | hash%4=1 | hash%4=2 | hash%4=3 |
+----------+----------+----------+----------+
```

Pros:

- Even distribution
- Less hot spotting

Cons:

- Can't do range queries efficiently
- Resharding is complex
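A minimal sketch of hash-based routing, assuming a hypothetical four-shard cluster. A stable digest (MD5 here) is used rather than Python's built-in `hash()`, which is randomized per process for strings and would route the same key differently across restarts:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def hash_shard(key):
    """Map a shard key to a shard number using a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard:
print(hash_shard(12345) == hash_shard(12345))  # True
```

This also illustrates the resharding problem from the Cons list: changing `NUM_SHARDS` changes the modulus, remapping almost every key to a different shard.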
### 9.3.3 Directory-Based Sharding

```
Directory-Based Sharding
========================

A lookup table maps keys to shards:

+-------------------+
|  Shard Directory  |
+-------------------+
| user:123 -> Shard1|
| user:456 -> Shard2|
| user:789 -> Shard3|
+-------------------+
```

Flow:

1. Look up user:456 in the directory
2. Directory says Shard 2
3. Connect to Shard 2

Pros:

- Flexible
- Can move data without affecting clients

Cons:

- Directory is a single point of failure
- Extra lookup needed
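The flow above can be sketched with a plain dictionary standing in for the directory. In production the directory would live in a replicated store to avoid the single point of failure noted in the Cons list; the keys and shard names below are the hypothetical ones from the diagram:

```python
# In-memory stand-in for the shard directory service.
shard_directory = {
    "user:123": "shard1",
    "user:456": "shard2",
    "user:789": "shard3",
}

def locate(key):
    """Steps 1-2 of the flow: resolve a key to its shard via the directory."""
    shard = shard_directory.get(key)
    if shard is None:
        raise KeyError(f"no shard assigned for {key!r}")
    return shard

print(locate("user:456"))  # shard2

# Moving a user's data is just a directory update; callers are unaffected:
shard_directory["user:456"] = "shard3"
print(locate("user:456"))  # shard3
```

The last two lines show the key advantage: data can move between shards without any change to client routing logic.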
### 9.3.4 Geographic Sharding

```
Geographic Sharding
===================

+--------------------------------------------------+
|                    World Map                     |
+--------------------------------------------------+

North America        Europe           Asia
      |                 |               |
  +-------+         +-------+       +-------+
  |Shard 1|         |Shard 2|       |Shard 3|
  |Users  |         |Users  |       |Users  |
  |from   |         |from   |       |from   |
  |USA/CA |         |EU     |       |Asia   |
  +-------+         +-------+       +-------+
```

Benefits:

- Low latency (data close to users)
- Data sovereignty compliance
- Regional performance

## 9.4 Choosing a Shard Key
### Ideal Shard Key Properties
| Property | Description |
|---|---|
| High Cardinality | Many distinct values |
| Even Distribution | Data spread evenly |
| Query Locality | Queries hit one shard |
| Stable | Rarely changes |
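The "even distribution" property is easy to measure before committing to a key: hash a sample of real keys and count how many land on each shard. A minimal sketch (the sample data and four-shard count are hypothetical):

```python
from collections import Counter

def shard_distribution(keys, num_shards):
    """Count how many sample keys land on each shard under hash sharding."""
    counts = Counter(hash(k) % num_shards for k in keys)
    return [counts.get(shard, 0) for shard in range(num_shards)]

# High-cardinality key (user ID): spreads evenly.
print(shard_distribution(range(10_000), 4))  # [2500, 2500, 2500, 2500]

# Low-cardinality key (country): 10,000 rows pile onto at most 3 shards.
countries = ["US"] * 7000 + ["DE"] * 2000 + ["JP"] * 1000
print(shard_distribution(countries, 4))
```

A skewed histogram like the second one is exactly the hot-spot problem the next section describes.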
### Good vs Bad Shard Keys
**Good shard keys.** User ID for user data:

- High cardinality (billions of users)
- Even distribution
- Queries by user go to one shard

**Bad shard keys.** Country for user data:

- Low cardinality (200 countries)
- Uneven distribution (US has many more users)
- Hot spots

Zip code for orders:

- Could work, but check the distribution
- May need a composite key

## 9.5 Cross-Shard Operations
### Challenges with Sharded Data
```
Cross-Shard Queries
===================

Problem: "Find all orders for users in shard 1 AND shard 2"

 Shard 1    Shard 2    Shard 3
+------+   +------+   +------+
|Users |   |Users |   |Users |
+------+   +------+   +------+

Must query each shard separately!

Query:
1. Query Shard 1 -> 100 results
2. Query Shard 2 ->  50 results
3. Query Shard 3 ->  75 results
4. Merge and sort -> Final result

This is slow!
```

### Solutions
| Challenge | Solution |
|---|---|
| Joins across shards | Denormalize data, duplicate data |
| Aggregations | Query all shards, merge results |
| Transactions | Use two-phase commit or saga pattern |
| Unique IDs | Use distributed ID generators |
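The scatter-gather flow from 9.5 (query every shard, then merge and sort) can be sketched as an n-way merge of per-shard results. The result sets below are hypothetical, and each shard is assumed to return its rows already sorted by order date:

```python
import heapq

# Hypothetical per-shard query results, each already sorted by order date.
shard_results = {
    "shard1": [("2024-01-03", "order-17"), ("2024-01-09", "order-42")],
    "shard2": [("2024-01-05", "order-23")],
    "shard3": [("2024-01-01", "order-08"), ("2024-01-07", "order-31")],
}

def scatter_gather(results_by_shard):
    """Merge step: combine per-shard sorted results into one sorted list.

    heapq.merge does an n-way merge without re-sorting everything,
    which matters when each shard returns thousands of rows.
    """
    return list(heapq.merge(*results_by_shard.values()))

print(scatter_gather(shard_results)[0])  # ('2024-01-01', 'order-08')
```

In a real system the three lists would come from parallel queries to the shards; the merge step is the same either way, and its cost is why cross-shard queries are worth designing around.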
## 9.6 Rebalancing Shards
### When to Rebalance
```
Rebalancing Triggers
====================

1. Data Growth
+----------------------------------+
| Shard 1: 80GB (over 70%)         |
| Shard 2: 20GB (under 30%)        |
| -> Need to split Shard 1         |
+----------------------------------+

2. Performance
+----------------------------------+
| Shard 1: 5000 queries/sec        |
| Shard 2:  500 queries/sec        |
| -> Rebalance to equalize         |
+----------------------------------+

3. Adding/Removing Nodes
+----------------------------------+
| Add new server                   |
| -> Redistribute data             |
+----------------------------------+
```

### Rebalancing Strategies
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed | Pre-defined shards | Simple | Inflexible |
| Dynamic | Auto-split on threshold | Adaptive | Complex |
| Virtual | Multiple logical shards per physical | Flexible | Overhead |
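The virtual-shard strategy from the table can be sketched as a two-level mapping: keys hash into a fixed set of logical shards, and a separate table assigns logical shards to physical nodes. All counts and node names below are hypothetical, and integer keys are assumed for simplicity:

```python
NUM_VIRTUAL = 64  # fixed for the cluster's lifetime

def virtual_shard(key):
    """A key's logical shard never changes (integer keys assumed)."""
    return key % NUM_VIRTUAL

def build_assignment(nodes):
    """Spread the logical shards round-robin over the physical nodes."""
    return {v: nodes[v % len(nodes)] for v in range(NUM_VIRTUAL)}

def node_for_key(key, assignment):
    return assignment[virtual_shard(key)]

old = build_assignment(["db1", "db2", "db3"])
new = build_assignment(["db1", "db2", "db3", "db4"])

# After adding db4, only the logical->physical table changes; the key's
# logical shard (57) is untouched, so no keys need to be re-hashed.
print(virtual_shard(12345))      # 57
print(node_for_key(12345, old))  # db1
print(node_for_key(12345, new))  # db2
```

Rebalancing then means copying whole logical shards between nodes and updating the table, rather than re-hashing every key. Redis Cluster applies the same idea with 16384 fixed hash slots.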
## 9.7 Sharding in Practice
### Application-Level Sharding
```
Application-Level Sharding
==========================

         Application
              |
              v
       +-------------+
       |    Shard    |
       |   Router    |
       +-------------+
              |
    +-----+---+---+-----+
    |     |       |     |
    v     v       v     v
   DB1   DB2     DB3   DB4
```

Router logic:

```python
# connection_pool is assumed to hold one pooled connection per shard,
# indexed 0-3.
def get_shard(user_id):
    return user_id % 4

def get_connection(user_id):
    shard = get_shard(user_id)
    return connection_pool[shard]
```

### Database-Native Sharding
```
Database-Native Sharding
========================

MongoDB Sharded Cluster:

+---------------------------------------------------+
|                    Application                    |
+---------------------------------------------------+
          |                         |
          v                         v
   +-------------+           +-------------+
   |   Router    |           |   Router    |
   |  (mongos)   |           |  (mongos)   |
   +-------------+           +-------------+
          |                         |
          v                         v
   +-------------+           +-------------+
   |   Config    |           |   Config    |
   |   Server    |           |   Server    |
   +-------------+           +-------------+

Shard A (Primary)
+-------------+   +-------------+
|  Replica 1  |   |  Replica 2  |
+-------------+   +-------------+
```

## 9.8 Sharding vs Replication
| Aspect | Sharding | Replication |
|---|---|---|
| Purpose | Scale writes & storage | Scale reads, HA |
| Data | Different data per node | Same data on all |
| Complexity | High | Medium |
| Queries | May need scatter-gather | Can read from any |
| Failure Impact | Partial data loss | Other replicas available |
### Often Used Together
```
Sharding + Replication
======================

            Application
                 |
                 v
         +----------------+
         | Load Balancer  |
         +----------------+
                 |
     +-----------+-----------+
     |           |           |
     v           v           v
+--------+  +--------+  +--------+
| Shard 1|  | Shard 2|  | Shard 3|
|Primary |  |Primary |  |Primary |
+--------+  +--------+  +--------+
     |           |           |
     v           v           v
+--------+  +--------+  +--------+
|Replica |  |Replica |  |Replica |
+--------+  +--------+  +--------+
```

## 9.9 Best Practices
### Sharding Guidelines
| Best Practice | Description |
|---|---|
| Choose shard key carefully | Test distribution before production |
| Start with fewer shards | Don’t over-shard early |
| Plan for growth | Leave room to add shards |
| Monitor distribution | Check for hot spots |
| Test failover | Ensure HA works with sharding |
| Document routing logic | Make it maintainable |
## Summary

Key sharding concepts:
- Choose the right shard key: cardinality, distribution, query locality
- Hash-based is often best: even distribution
- Cross-shard queries are expensive: denormalize if needed
- Plan for rebalancing: growth will happen
- Combine with replication: sharding for scale, replication for HA
- Start simple: don't over-engineer early