
Chapter 9: Database Sharding & Partitioning

Sharding (also called horizontal partitioning) is a method of distributing data across multiple databases or database instances, spreading load and enabling horizontal scaling.

Vertical vs Horizontal Partitioning
==================================
Vertical Partitioning:
+----------+ +----------+ +----------+
| Table A | | Table B | | Table C |
| (Users) | |(Orders) | |(Products)|
+----------+ +----------+ +----------+
Horizontal Partitioning (Sharding):
+------------------------+
| Users Table |
+------------------------+
|
v
+----------+----------+
| Shard 1 | Shard 2 |
| A-M | N-Z |
+----------+----------+

Monolithic Database Problems
============================
1. Storage Limits
+----------------------------------+
| 1 Billion Users |
| Database size: 50TB |
| Single disk: 10TB |
| Need 5 disks just for storage |
+----------------------------------+
2. Write Throughput
+----------------------------------+
| 100,000 writes/second |
| Single DB: handles 10,000 |
| 90% writes rejected |
+----------------------------------+
3. Connection Limits
+----------------------------------+
| 10,000 concurrent users |
| Max connections: 500 |
| Connection pool exhausted |
+----------------------------------+
| Benefit | Description |
| --- | --- |
| Horizontal Scaling | Add more shards to increase capacity |
| Improved Performance | Smaller datasets per shard = faster queries |
| High Availability | Shard failures don’t affect others |
| Geographic Distribution | Shards in different regions |

Range-Based Sharding
====================
Shard by ranges of a key:
+------------------------------------------------+
| User ID |
+------------------------------------------------+
| | | |
v v v v
+------+ +------+ +------+ +------+
|Shard | |Shard | |Shard | |Shard |
| 1 | | 2 | | 3 | | 4 |
| ID | | ID | | ID | | ID |
| 1-1M | |1M-2M | |2M-3M | |3M-4M |
+------+ +------+ +------+ +------+
Example:
User ID 500,000 -> Shard 1
User ID 1,500,000 -> Shard 2
Pros:
- Simple to implement
- Easy to find data
Cons:
- Hot spots (popular IDs)
- Uneven distribution
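The diagram above can be sketched as a range router: a sorted list of range boundaries plus a binary search. The 1M-per-shard boundaries are taken from the diagram; everything else is illustrative.

```python
import bisect

# Upper bound (inclusive) of each shard's ID range, per the diagram:
# shard 1 holds IDs 1-1M, shard 2 holds 1M-2M, and so on.
SHARD_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

def get_shard(user_id: int) -> int:
    """Return the 1-indexed shard whose range contains user_id."""
    idx = bisect.bisect_left(SHARD_BOUNDS, user_id)
    if idx == len(SHARD_BOUNDS):
        raise ValueError(f"user_id {user_id} is outside all shard ranges")
    return idx + 1

print(get_shard(500_000))    # 1
print(get_shard(1_500_000))  # 2
```

Finding data is a single O(log n) lookup, which is why range sharding is simple; the hot-spot risk is that all newly created IDs land on the last shard.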
Hash-Based Sharding
===================
Apply hash function to shard key:
hash(user_id) = 12345
shard = hash % num_shards
+------------------------------------------------+
| User ID: 12345 |
+------------------------------------------------+
|
v
hash(user_id)
|
v
12345 % 4 = 1
|
v
+------------------------------------------------+
| Shard 0 | Shard 1 | Shard 2 | Shard 3 |
| | | | |
| hash%4=0 | hash%4=1 | hash%4=2 | hash%4=3 |
+------------------------------------------------+
Pros:
- Even distribution
- Less hot spotting
Cons:
- Can't do range queries efficiently
- Resharding is complex
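A minimal sketch of hash-based routing. Python's built-in `hash()` is randomized per process for strings, so a stable hash such as MD5 is used here; the shard count of 4 matches the diagram but is otherwise an assumption.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size

def get_shard(user_id: int) -> int:
    """Map a user ID to a shard with a stable hash, then modulo."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Note how the cons follow directly from the code: the mapping depends on `NUM_SHARDS`, so changing the shard count remaps almost every key, which is why resharding is complex (consistent hashing mitigates this).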
Directory-Based Sharding
========================
Lookup table maps keys to shards:
+-------------------+
| Shard Directory |
+-------------------+
| user:123 -> Shard1|
| user:456 -> Shard2|
| user:789 -> Shard3|
+-------------------+
Flow:
1. Look up user:456 in directory
2. Directory says Shard 2
3. Connect to Shard 2
Pros:
- Flexible
- Can move data without affecting clients
Cons:
- Directory is single point of failure
- Extra lookup needed
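The flow above can be sketched as an in-memory dictionary; a production directory would live in a replicated store so it is not a single point of failure. The shard names here are illustrative.

```python
# Directory mapping keys to shards (illustrative names).
directory = {
    "user:123": "shard1",
    "user:456": "shard2",
    "user:789": "shard3",
}

def lookup(key: str) -> str:
    """Steps 1-2: ask the directory which shard holds the key."""
    return directory[key]

def move(key: str, new_shard: str) -> None:
    """Relocating data only requires updating the directory entry."""
    directory[key] = new_shard

print(lookup("user:456"))  # shard2
move("user:456", "shard3")
print(lookup("user:456"))  # shard3
```

The `move` function is the flexibility pro in action: clients keep calling `lookup` and never notice that the data relocated.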
Geographic Sharding
====================
+--------------------------------------------------+
| World Map |
+--------------------------------------------------+
North America Europe Asia
| | |
+-------+ +-------+ +-------+
|Shard 1| |Shard 2| |Shard 3|
|Users | |Users | |Users |
|from | |from | |from |
|USA/CA | |EU | |Asia |
+-------+ +-------+ +-------+
Benefits:
- Low latency (data close to users)
- Data sovereignty compliance
- Regional performance

A good shard key has these properties:

| Property | Description |
| --- | --- |
| High Cardinality | Many distinct values |
| Even Distribution | Data spread evenly across shards |
| Query Locality | Queries hit one shard |
| Stable | Rarely changes |
Good Shard Keys
===============
User ID for user data:
- High cardinality (billions of users)
- Even distribution
- Queries by user go to one shard
Bad Shard Keys
==============
Country for user data:
- Low cardinality (200 countries)
- Uneven distribution (US has many more users)
- Hot spots
Zip code for orders:
- Could work but check distribution
- May need composite key
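Whether a candidate key distributes evenly can be checked before production by hashing a sample of keys and counting per-shard totals. This sketch uses synthetic user IDs and an assumed 4 shards; in practice you would sample real key values.

```python
import hashlib
from collections import Counter

def shard_of(key: str, num_shards: int = 4) -> int:
    """Stable hash of the key, reduced to a shard index."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

# Count how 100,000 synthetic user keys land across the shards.
counts = Counter(shard_of(f"user:{i}") for i in range(100_000))
for shard, n in sorted(counts.items()):
    print(f"shard {shard}: {n} keys")
```

A high-cardinality key should produce near-equal counts; a low-cardinality key like country would pile most keys onto a few shards.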

Cross-Shard Queries
===================
Problem:
"Find all orders for users in shard 1 AND shard 2"
Shard 1 Shard 2 Shard 3
+------+ +------+ +------+
|Users | |Users | |Users |
+------+ +------+ +------+
Must query each shard separately!
Query:
1. Query Shard 1 -> 100 results
2. Query Shard 2 -> 50 results
3. Query Shard 3 -> 75 results
4. Merge and sort -> Final result
This is slow!
| Challenge | Solution |
| --- | --- |
| Joins across shards | Denormalize data, duplicate data |
| Aggregations | Query all shards, merge results |
| Transactions | Use two-phase commit or saga pattern |
| Unique IDs | Use distributed ID generators |
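The four-step scatter-gather above can be sketched with `heapq.merge`, assuming each shard returns its rows already sorted (the order data here is made up):

```python
import heapq

# Hypothetical per-shard query results, each already sorted by date.
shard_results = [
    [("2024-01-01", "order-a"), ("2024-01-05", "order-d")],  # shard 1
    [("2024-01-02", "order-b")],                             # shard 2
    [("2024-01-03", "order-c"), ("2024-01-06", "order-e")],  # shard 3
]

def scatter_gather(results):
    """Step 4: merge pre-sorted per-shard lists into one sorted result."""
    return list(heapq.merge(*results))

merged = scatter_gather(shard_results)
print(merged[0])  # ('2024-01-01', 'order-a')
```

The latency cost is visible in the structure: the overall query is only as fast as the slowest shard, and the merge adds work on top.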

Rebalancing Triggers
====================
1. Data Growth
+----------------------------------+
| Shard 1: 80GB (over 70%) |
| Shard 2: 20GB (under 30%) |
| -> Need to split Shard 1 |
+----------------------------------+
2. Performance
+----------------------------------+
| Shard 1: 5000 queries/sec |
| Shard 2: 500 queries/sec |
| -> Rebalance to equalize |
+----------------------------------+
3. Adding/Removing Nodes
+----------------------------------+
| Add new server |
| -> Redistribute data |
+----------------------------------+
| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| Fixed | Pre-defined shards | Simple | Inflexible |
| Dynamic | Auto-split on threshold | Adaptive | Complex |
| Virtual | Multiple logical shards per physical | Flexible | Overhead |
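The virtual strategy can be sketched as many logical shards mapped onto a few physical nodes; rebalancing then means reassigning virtual shards rather than rehashing keys. The node names and shard counts here are illustrative.

```python
import hashlib

NUM_VIRTUAL = 256  # many small logical shards
physical_nodes = ["db1", "db2", "db3", "db4"]  # illustrative node names

# Assign each virtual shard to a physical node (simple round-robin here).
virtual_to_physical = {
    v: physical_nodes[v % len(physical_nodes)] for v in range(NUM_VIRTUAL)
}

def route(key: str) -> str:
    """Key -> virtual shard (stable hash) -> physical node."""
    v = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_VIRTUAL
    return virtual_to_physical[v]
```

Adding a fifth node only requires moving some entries in `virtual_to_physical`; keys keep their virtual shard, so most data stays put. This is the overhead/flexibility trade-off in the table above.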

Application-Level Sharding
==========================
Application
|
v
+-------------+
| Shard |
| Router |
+-------------+
|
+----+----+----+
| | | |
v v v v
DB1 DB2 DB3 DB4
Router Logic:
```python
def get_shard(user_id):
    return user_id % 4

def get_connection(user_id):
    shard = get_shard(user_id)
    return connection_pool[shard]
```
Database-Native Sharding
========================
MongoDB Sharded Cluster:
+---------------------------------------------------+
| Application |
+---------------------------------------------------+
|
+---------+---------+
| |
v v
+-------------+ +-------------+
| Router | | Router |
| (mongos) | | (mongos) |
+-------------+ +-------------+
| |
+---------+---------+
| |
v v
+-------------+ +-------------+
| Config | | Config |
| Server | | Server |
+-------------+ +-------------+
Shard A (Primary)
+-------------+ +-------------+
| Replica 1 | | Replica 2 |
+-------------+ +-------------+

| Aspect | Sharding | Replication |
| --- | --- | --- |
| Purpose | Scale writes & storage | Scale reads, HA |
| Data | Different data per node | Same data on all |
| Complexity | High | Medium |
| Queries | May need scatter-gather | Can read from any |
| Failure Impact | Part of the data unavailable | Other replicas available |
Sharding + Replication
======================
Application
|
v
+----------------+
| Load Balancer |
+----------------+
|
+----------------+----------------+
| | |
v v v
+--------+ +--------+ +--------+
| Shard 1| | Shard 2| | Shard 3|
|Primary | |Primary | |Primary |
+--------+ +--------+ +--------+
| | |
v v v
+--------+ +--------+ +--------+
|Replica | |Replica | |Replica |
+--------+ +--------+ +--------+

| Best Practice | Description |
| --- | --- |
| Choose shard key carefully | Test distribution before production |
| Start with fewer shards | Don’t over-shard early |
| Plan for growth | Leave room to add shards |
| Monitor distribution | Check for hot spots |
| Test failover | Ensure HA works with sharding |
| Document routing logic | Make it maintainable |

Key sharding concepts:

  1. Choose the right shard key - Cardinality, distribution, locality
  2. Hash-based is often best - Even distribution
  3. Cross-shard queries are expensive - Denormalize if needed
  4. Plan for rebalancing - Growth will happen
  5. Combine with replication - Both sharding and HA
  6. Start simple - Don’t over-engineer early

Next: Chapter 10: ACID vs BASE Models