1️⃣ Why Sharding?

Problem

Single DB eventually hits limits:

  • Storage limit
  • Write throughput limit
  • Read throughput limit
  • Slow queries
  • Slow backups

Scaling Options

  1. Vertical Scaling (Scale Up): try this before sharding

    • Bigger CPU
    • More RAM
    • More Storage

  2. Sharding (Scale Out)

    • Split data across multiple DBs
    • Each shard has:
      • Own CPU
      • Own Memory
      • Own Storage

Definition

Sharding = Partitioning data across multiple databases to distribute load and storage.


2️⃣ Sharding Terminology

Term                    Meaning
Shard Key               Field used to partition data
Distribution Strategy   How shard key maps to shards
Shard                   Individual database holding subset of data

Example:

Shard Key = user_id
 
user_id -> hash() -> shard

3️⃣ Good Shard Key Characteristics

✅ High Cardinality

Many unique values.

Good:

user_id
order_id

Bad:

is_premium (true/false)

✅ Even Distribution

Data should spread evenly.

Avoid:

creation_date

because recent data may overload one shard.


✅ Query Alignment

Shard key should match access pattern.

Example:

Get all posts of a user

Shard by:

user_id

Only one shard is queried.


4️⃣ Good vs Bad Shard Keys

Good            Why
user_id         High cardinality, user-centric queries
order_id        Order-centric queries

Bad             Why
is_premium      Only 2 values
creation_date   Hot recent shard

5️⃣ Sharding Strategies

A. Range-Based Sharding

[0, 10M)     -> Shard1
[10M, 20M)   -> Shard2
[20M, 30M)   -> Shard3

Pros

  • Simple
  • Easy to understand

Cons

  • Uneven distribution
  • Hotspots
  • New users may all hit same shard
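
A minimal sketch of range routing, assuming the three ranges above are half-open intervals over an integer user_id. The names UPPER_BOUNDS, SHARDS, and range_shard are illustrative, not from any particular database.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bound of each range, in ascending order.
UPPER_BOUNDS = [10_000_000, 20_000_000, 30_000_000]
SHARDS = ["Shard1", "Shard2", "Shard3"]


def range_shard(user_id: int) -> str:
    """Return the shard whose half-open range [low, high) contains user_id."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # bisect_right counts how many upper bounds are <= user_id,
    # which is exactly the index of the range that holds it.
    idx = bisect_right(UPPER_BOUNDS, user_id)
    if idx == len(SHARDS):
        raise ValueError("user_id is beyond the last configured range")
    return SHARDS[idx]
```

If user_ids are assigned sequentially, every new sign-up falls in the last range, which is the hotspot listed under Cons.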

B. Hash-Based Sharding ⭐ Default Choice

hash(user_id) % N

Pros

  • Even distribution
  • Simple routing

Cons

  • Adding or removing a shard changes N, so most keys remap and massive data movement follows
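
A rough sketch of modulo routing and of why resizing hurts. It assumes SHA-256 as a stable hash (Python's built-in hash() is not stable across processes for strings); stable_hash, hash_shard, and moved_fraction are hypothetical helper names.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash a key to a 64-bit integer that is the same in every process."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_shard(user_id: int, num_shards: int) -> int:
    """Route with hash(user_id) % N."""
    return stable_hash(str(user_id)) % num_shards


def moved_fraction(num_keys: int, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when N goes from old_n to new_n."""
    moved = sum(
        1
        for uid in range(num_keys)
        if hash_shard(uid, old_n) != hash_shard(uid, new_n)
    )
    return moved / num_keys
```

With a uniform hash, going from 4 to 5 shards keeps a key in place only when hash % 4 equals hash % 5, which is about 20% of keys, so roughly 80% of the data would move.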

C. Consistent Hashing ⭐ Industry Standard

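Used together with hash sharding: shards and keys are hashed onto the same ring, and each key belongs to the next shard clockwise from its position.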

Benefits

  • Minimal data movement
  • Easy shard addition/removal
  • Smooth scaling
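
A minimal consistent-hash ring, sketched under a few assumptions: SHA-256 for hashing, 100 virtual nodes per shard, and hypothetical names (HashRing, add_shard, get_shard). Production systems differ in details such as replication and shard removal.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes: int = 100):
        self._vnodes = vnodes   # virtual nodes per shard (assumed value)
        self._points = []       # sorted ring positions
        self._owner = {}        # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        """Place the shard at several positions to smooth the distribution."""
        for i in range(self._vnodes):
            point = _hash(f"{shard}#{i}")
            self._owner[point] = shard
            bisect.insort(self._points, point)

    def get_shard(self, key: str) -> str:
        """Walk clockwise from the key's position to the next shard point."""
        if not self._points:
            raise LookupError("ring has no shards")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

Adding a fifth shard to a four-shard ring should move roughly one fifth of the keys, and every moved key lands on the new shard; the other shards do not exchange data with each other.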

Interview Default

Hash-Based Sharding
+
Consistent Hashing

D. Directory-Based Sharding

Lookup table stores:

user_id -> shard

Pros

  • Maximum flexibility
  • Easy user migration

Cons

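  • Extra network hop on every request (higher latency)
  • Lookup table is a single point of failure unless replicated
  • Lookup table must stay in sync with where data actually lives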

Interview

Usually avoid unless specifically needed.
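
A tiny in-memory sketch of the lookup table. In practice the directory would live in its own replicated store; ShardDirectory and its methods are hypothetical names used only for illustration.

```python
class ShardDirectory:
    """Maps user_id -> shard explicitly instead of computing it."""

    def __init__(self, default_shard: str):
        self._default = default_shard
        self._table = {}  # user_id -> shard name

    def assign(self, user_id: int, shard: str) -> None:
        self._table[user_id] = shard

    def lookup(self, user_id: int) -> str:
        # Every request pays for this lookup: the extra hop listed under Cons.
        return self._table.get(user_id, self._default)

    def migrate(self, user_id: int, new_shard: str) -> None:
        # Only the pointer flip is shown here; a real migration
        # copies the user's rows to new_shard before flipping.
        self._table[user_id] = new_shard
```

Moving one heavy user is a single table update, which is the flexibility that hash-based routing cannot offer.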


6️⃣ Hotspot Problem

Example

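A celebrity account (e.g., Messi) lands on Shard 1.

All of that account's:

  • Likes
  • Comments
  • Views
  • Messages

hit the same shard, overloading it while the other shards stay lightly loaded.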


Solution 1: Compound Shard Key

user_id + suffix

Example:

user_id + post_id

Spreads one user's data across several shards. Trade-off: reading everything for that user now fans out to multiple shards.
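
A sketch of compound-key routing, assuming 8 shards and SHA-256 hashing of the string "user_id:post_id". The function names are hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # assumed shard count for the example


def _hash(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for_post(user_id: int, post_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Route on user_id + post_id, so one user's posts spread out."""
    return _hash(f"{user_id}:{post_id}") % num_shards


def shards_for_user_posts(user_id: int, post_ids, num_shards: int = NUM_SHARDS) -> set:
    """Shards that must be queried to read all of a user's posts."""
    return {shard_for_post(user_id, p, num_shards) for p in post_ids}
```

A single post still maps to exactly one shard, but "all posts of a user" becomes a multi-shard read, so this fits accounts whose write load matters more than that query.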


Solution 2: Celebrity Shard

Normal Users -> Normal Shards
 
Celebrities -> Dedicated Shard

Useful for extreme traffic users.


7️⃣ Cross-Shard Queries

Problem

Need data from multiple shards.

Example:

Top 10 posts globally

Must query all shards.


Cost

  • Fan-out requests
  • Aggregation overhead
  • Increased latency
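
The fan-out and aggregation cost can be seen in a small scatter-gather sketch. It assumes each shard is just a list of (post_id, likes) tuples; in a real system each inner call would be a network request. top_k_global is a hypothetical name.

```python
import heapq


def top_k_global(shards, k: int):
    """Ask every shard for its local top k, then merge into a global top k."""
    # Scatter: one request per shard (the fan-out).
    partials = [
        heapq.nlargest(k, posts, key=lambda post: post[1]) for posts in shards
    ]
    # Gather: merge at most k * len(shards) candidates (the aggregation).
    candidates = (post for partial in partials for post in partial)
    return heapq.nlargest(k, candidates, key=lambda post: post[1])
```

Latency is set by the slowest shard, and the work grows with the number of shards, which is why results like these are usually cached.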

Solutions

Cache Results

Cache expensive global queries in an in-memory store such as Redis.

Examples:

  • Trending Posts
  • Leaderboards
  • Top Content

Denormalization

Duplicate data so a read can be served from a single shard.

Read Faster
Write Harder

Trade-off:

Benefit      Cost
Fast Reads   Multiple Writes

Interview Rule

If frequent cross-shard queries exist:

  1. Re-check the shard key (it probably does not match the access pattern)
  2. Cache
  3. Denormalize

8️⃣ Cross-Shard Transactions

Problem

Transfer money:

Bob -> Shard A
Alice -> Shard B

Need atomic update across shards.


Two-Phase Commit (2PC)

Steps

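  1. Prepare: coordinator asks every shard to vote; each shard locks what it needs and votes yes or no
  2. Commit: if all vote yes, coordinator tells every shard to commit; otherwise every shard aborts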

Pros

  • Strong consistency (all shards commit or all abort)

Cons

  • Slow (extra round trips; locks held until the decision arrives)
  • Coordinator failure can leave prepared shards blocked
  • Operational complexity
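
An in-process sketch of the two phases for the Bob -> Alice transfer. It leaves out everything that makes real 2PC hard (durable logs, timeouts, coordinator recovery); Participant and two_phase_transfer are hypothetical names.

```python
class Participant:
    """One shard holding one account balance."""

    def __init__(self, name: str, balance: int):
        self.name = name
        self.balance = balance
        self._pending = None  # change staged during the prepare phase

    def prepare(self, delta: int) -> bool:
        """Phase 1: vote yes only if the change can be applied."""
        if self.balance + delta < 0:
            return False
        self._pending = delta
        return True

    def commit(self) -> None:
        """Phase 2 (all voted yes): apply the staged change."""
        if self._pending is not None:
            self.balance += self._pending
            self._pending = None

    def abort(self) -> None:
        """Phase 2 (someone voted no): discard the staged change."""
        self._pending = None


def two_phase_transfer(src: Participant, dst: Participant, amount: int) -> bool:
    """Coordinator: collect votes, then commit everywhere or abort everywhere."""
    changes = [(src, -amount), (dst, amount)]
    votes = [shard.prepare(delta) for shard, delta in changes]
    if all(votes):
        for shard, _ in changes:
            shard.commit()
        return True
    for shard, _ in changes:
        shard.abort()
    return False
```

If the coordinator crashed between the two loops, prepared shards would be stuck holding their staged change, which is the coordinator failure risk listed above.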

Saga Pattern ⭐ Preferred

Break the transaction into a sequence of smaller local steps, each committed on a single shard.

Step 1: Deduct from Bob
 
Step 2: Add to Alice

If Step 2 fails:

Refund Bob

Idea

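Every action has a compensating action. Consistency is eventual: other readers can briefly see the intermediate state (Bob debited, Alice not yet credited).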
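
A minimal saga runner for the transfer above, assuming each step is a plain Python callable paired with its compensation. run_saga and transfer_with_saga are hypothetical names, and the fail_credit flag exists only to simulate Step 2 failing.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []  # compensations for the steps that already succeeded
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo in reverse order, newest step first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


def transfer_with_saga(balances: dict, amount: int, fail_credit: bool = False) -> bool:
    def deduct_bob():
        if balances["bob"] < amount:
            raise ValueError("insufficient funds")
        balances["bob"] -= amount

    def refund_bob():
        balances["bob"] += amount

    def credit_alice():
        if fail_credit:
            raise RuntimeError("simulated failure on Shard B")
        balances["alice"] += amount

    def debit_alice():
        balances["alice"] -= amount

    return run_saga([(deduct_bob, refund_bob), (credit_alice, debit_alice)])
```

A real saga also has to persist its progress and retry compensations that fail, since a refund that never runs leaves the money lost.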


9️⃣ Interview Framework for Sharding

Step 1

Justify need for sharding.

Check:

  • Storage
  • Read QPS
  • Write QPS

Step 2

Choose shard key.

Example:

user_id

Step 3

Choose distribution strategy.

Hash-Based Sharding
+
Consistent Hashing

Step 4

Mention trade-offs.

Examples:

  • Cross-shard queries
  • Hotspots
  • Rebalancing

Step 5

Discuss future growth.

Start with 10 shards
 
Add shards using consistent hashing

🚀 Interview Cheat Sheet

Need More Storage/QPS?
    -> Sharding
 
Good Shard Key:
    -> High Cardinality
    -> Even Distribution
    -> Query Alignment
 
Default Strategy:
    -> Hash Sharding
    -> Consistent Hashing
 
Hotspot:
    -> Compound Key
    -> Celebrity Shard
 
Cross-Shard Query:
    -> Cache
    -> Denormalize
 
Cross-Shard Transaction:
    -> Avoid if possible
    -> Saga Pattern
 
Interview Flow:
    Why Shard?
    -> Shard Key
    -> Hash + Consistent Hashing
    -> Trade-offs
    -> Growth Plan