Database Sharding: Horizontal Scaling for Large Databases

Database sharding is a technique for distributing a large dataset across multiple database servers. Each shard contains a subset of the data, and together the shards form the complete dataset. As data grows beyond the capacity of a single server, sharding allows you to add more servers and continue scaling horizontally. Without sharding, you eventually hit the physical limits of a single machine, no matter how powerful.

Sharding is essential for applications that need to handle massive amounts of data or extremely high write throughput. Companies like Facebook, Google, and Uber use sharding to manage petabytes of data and millions of writes per second. To understand sharding properly, it is helpful to be familiar with database replication, database partitioning, and database transactions.

Database sharding overview:
Without Sharding (Single Server):
┌─────────────────────────────────────┐
│           Database Server           │
│         All 100GB of data           │
└─────────────────────────────────────┘

With Sharding (Multiple Servers):
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Shard 1  │  │ Shard 2  │  │ Shard 3  │
│ Users    │  │ Users    │  │ Users    │
│ ID 1-33  │  │ ID 34-66 │  │ ID 67-99 │
└──────────┘  └──────────┘  └──────────┘

What Is Database Sharding

Database sharding is a horizontal partitioning technique that splits a large database into smaller, independent databases called shards. Each shard has the same schema but contains a different subset of the data. The shards are distributed across multiple servers, and a routing layer directs each query to the appropriate shard based on a shard key.

  • Shard: An independent database containing a subset of the overall data.
  • Shard Key: A column (or set of columns) used to determine which shard a row belongs to.
  • Sharding Algorithm: The function that maps a shard key value to a specific shard.
  • Shard Router: The component that directs queries to the correct shard.
  • Resharding: The process of redistributing data when adding or removing shards.

Why Sharding Matters

Sharding addresses the fundamental limitations of single-server databases. When a single server cannot handle your data size, write throughput, or query load, sharding provides a path forward.

  • Write Scalability: Writes are distributed across shards. Each shard handles only a fraction of the total writes.
  • Data Size Scalability: No single server needs to store all data. Add more shards as data grows.
  • Query Parallelism: Queries that span shards can be executed in parallel.
  • Fault Isolation: A failure in one shard affects only that subset of users or data.
  • Geographic Distribution: Shards can be placed closer to users in different regions.
  • Cost Efficiency: Use many smaller, cheaper servers instead of one massive, expensive server.

Sharding Strategies

1. Range-Based Sharding

Data is divided into ranges based on the shard key. For example, users with IDs 1-1000 go to Shard 1, IDs 1001-2000 to Shard 2, and so on. Range sharding is simple to understand but can lead to uneven distribution (hot spots) if data is not evenly distributed across the range.

Shard Key: user_id (integer)
Range 1: user_id 1 - 1,000,000     → Shard 1
Range 2: user_id 1,000,001 - 2,000,000 → Shard 2
Range 3: user_id 2,000,001 - 3,000,000 → Shard 3

Query: SELECT * FROM users WHERE user_id = 500000
Result: Routes to Shard 1

Pros: Simple, good for range queries
Cons: Risk of hot spots, can become unbalanced
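The range table above can be sketched as a routing function. The boundaries mirror the example, and the shard names are illustrative:

```javascript
// Range-based routing: walk the sorted ranges until the key fits.
const SHARD_RANGES = [
  { maxId: 1000000, shard: 'shard-1' },
  { maxId: 2000000, shard: 'shard-2' },
  { maxId: 3000000, shard: 'shard-3' },
];

function shardForUser(userId) {
  for (const { maxId, shard } of SHARD_RANGES) {
    if (userId <= maxId) return shard;
  }
  throw new Error(`user_id ${userId} is outside every shard range`);
}

console.log(shardForUser(500000)); // shard-1
```

Because the table is sorted, a range query like `user_id BETWEEN 900000 AND 1100000` touches only the shards whose ranges overlap it, which is the main advantage of this strategy.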

2. Hash-Based Sharding

A hash function is applied to the shard key, and the hash value determines the shard. Hash sharding distributes data evenly but makes range queries inefficient, because consecutive key values are scattered across all shards.

Shard Key: user_id
Shard = hash(user_id) % number_of_shards

Example: 3 shards (numbered 0-2)
user_id=100 → hash(100) % 3 = 1 → Shard 1
user_id=200 → hash(200) % 3 = 2 → Shard 2
user_id=300 → hash(300) % 3 = 0 → Shard 0

Pros: Even distribution, no hot spots
Cons: Range queries must query every shard; with naive modulo, resharding moves most keys
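A minimal sketch of hash routing. The bit-mixing function below is an illustrative stand-in for whatever stable hash a real router would use (e.g. murmur3); any deterministic hash works:

```javascript
const NUM_SHARDS = 3;

// Illustrative integer hash: spread the bits so sequential IDs
// do not land on sequential shards.
function mixHash(n) {
  let h = n | 0;
  h = Math.imul((h >>> 16) ^ h, 0x45d9f3b);
  h = Math.imul((h >>> 16) ^ h, 0x45d9f3b);
  return ((h >>> 16) ^ h) >>> 0;
}

function hashShard(userId) {
  return mixHash(userId) % NUM_SHARDS; // shard index 0..NUM_SHARDS-1
}
```

The hash must be stable across application versions and machines: if two servers hash the same key to different shards, reads and writes diverge.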

3. Directory-Based Sharding

A lookup table maintains the mapping from shard key values to shards. This provides maximum flexibility but adds a lookup step and a single point of failure.

Lookup Table:
┌─────────────┬──────────┐
│ user_id     │ shard_id │
├─────────────┼──────────┤
│ 1-1000      │ 1        │
│ 1001-5000   │ 2        │
│ 5001-20000  │ 3        │
│ 20001-50000 │ 1        │
└─────────────┴──────────┘

Query: SELECT * FROM users WHERE user_id = 15000
1. Query lookup table → shard 3
2. Route query to shard 3

Pros: Flexible, easy to rebalance
Cons: Every query pays an extra lookup; the lookup table becomes a bottleneck and a single point of failure unless it is cached and replicated
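The lookup-table routing above can be sketched as follows. In production the directory would live in a replicated store and be cached; here it is an in-memory array mirroring the table:

```javascript
// Directory-based routing: each entry maps a key range to a shard.
const DIRECTORY = [
  { min: 1,     max: 1000,  shardId: 1 },
  { min: 1001,  max: 5000,  shardId: 2 },
  { min: 5001,  max: 20000, shardId: 3 },
  { min: 20001, max: 50000, shardId: 1 },
];

function lookupShard(userId) {
  const entry = DIRECTORY.find(e => userId >= e.min && userId <= e.max);
  if (!entry) throw new Error(`no directory entry for user_id ${userId}`);
  return entry.shardId;
}

console.log(lookupShard(15000)); // 3
```

Rebalancing is just an edit to the directory: moving range 20001-50000 from shard 1 to a new shard 4 changes one row, with no change to application code.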

4. Geographic Sharding

Data is sharded based on geographic location. Users in different regions are routed to different shards. This reduces latency and may help with data sovereignty regulations.

Shard Key: region
US users → Shard 1 (US East)
EU users → Shard 2 (EU West)
Asia users → Shard 3 (Singapore)

Pros: Low latency, compliance with data laws
Cons: Uneven distribution, cross-region queries are hard
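A region map makes this routing explicit; the region codes and shard labels are illustrative:

```javascript
// Geographic routing: one shard per region.
const REGION_SHARDS = {
  us:   'shard-us-east',
  eu:   'shard-eu-west',
  asia: 'shard-singapore',
};

function shardForRegion(region) {
  const shard = REGION_SHARDS[region];
  if (!shard) throw new Error(`unknown region: ${region}`);
  return shard;
}

console.log(shardForRegion('eu')); // shard-eu-west
```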

Shard Key Selection

Choosing the right shard key is the most important decision in sharding. A poor shard key leads to hot spots, uneven distribution, and poor performance.

Good shard key characteristics:
✓ High cardinality (many unique values)
✓ Evenly distributes data across shards
✓ Used in most queries (to avoid scatter-gather)
✓ Immutable (never changes)
✓ Examples: user_id, customer_id, tenant_id

Bad shard key characteristics:
✗ Low cardinality (status, boolean, country)
✗ Skewed distribution (most data in one value)
✗ Not used in queries
✗ Can change over time
  ✗ Examples: status, is_active, region, product_category

Shard key examples:
-- Good: user_id (high cardinality, even distribution)
Shard = hash(user_id) % 16

-- Good: tenant_id for multi-tenant SaaS
Shard = tenant_id (one tenant per shard, or range of tenants)

-- Bad: status (only 3 values, 90% are 'active')
Shard = status  → 90% traffic to 'active' shard!

-- Bad: country (some countries have 100x more users)
Shard = country  → a handful of shards hold most of the data!
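One way to test a candidate shard key against real data, as suggested above, is to measure how evenly it spreads rows across shards. The sketch below reports the hottest shard's load relative to a perfectly even split (1.0 is ideal); the numeric encodings of the status values are illustrative:

```javascript
// Returns max shard load divided by the ideal even share.
// 1.0 = perfectly even; far above 1 = hot spot.
function distributionSkew(keys, numShards, hashFn) {
  const counts = new Array(numShards).fill(0);
  for (const key of keys) counts[hashFn(key) % numShards] += 1;
  const idealPerShard = keys.length / numShards;
  return Math.max(...counts) / idealPerShard;
}

// Good key: 10,000 distinct user IDs over 16 shards.
const userIds = Array.from({ length: 10000 }, (_, i) => i);
console.log(distributionSkew(userIds, 16, id => id)); // 1 (perfectly even)

// Bad key: status, encoded 0='active' (90%), 1='inactive', 2='deleted'.
const statuses = [].concat(
  Array(9000).fill(0), Array(500).fill(1), Array(500).fill(2));
console.log(distributionSkew(statuses, 16, s => s)); // 14.4 (severe hot spot)
```

Running this kind of check against a sample of production keys before committing to a shard key is much cheaper than resharding later.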

Sharding Architectures

Application-Level Sharding

The application code contains the sharding logic. Each query includes logic to determine the correct shard. This is simple but couples sharding to the application.

// Application code determines the shard; dbConnections holds one
// connection (or pool) per shard, 16 shards in this example.
const SHARD_COUNT = 16;

function getShard(userId) {
    const shardIndex = userId % SHARD_COUNT;
    return dbConnections[shardIndex];
}

const db = getShard(userId);
await db.query('SELECT * FROM users WHERE user_id = ?', [userId]);

Proxy-Based Sharding

A proxy layer (such as Vitess or ProxySQL) handles sharding transparently. The application connects to the proxy as if it were a single database.

Application
     │
     ▼
┌─────────────┐
│   Proxy     │ (Vitess, ProxySQL)
│  (Shard     │
│   Router)   │
└──────┬──────┘
       │
   ┌───┴───┬───────┐
   ▼       ▼       ▼
┌─────┐ ┌─────┐ ┌─────┐
│Shard│ │Shard│ │Shard│
│  1  │ │  2  │ │  3  │
└─────┘ └─────┘ └─────┘

Database-Native Sharding

Some databases (MongoDB, Cassandra, Citus for PostgreSQL) have built-in sharding support. The database automatically distributes data and routes queries.

-- Citus (PostgreSQL extension)
SELECT create_distributed_table('users', 'user_id');

-- MongoDB
sh.shardCollection("mydb.users", { "user_id": "hashed" });

-- Cassandra (the partition key is the first part of the primary key)
CREATE TABLE users (
    user_id UUID PRIMARY KEY,  -- user_id is the partition key
    name TEXT
);

Challenges of Sharding

  • Cross-Shard Queries: Queries that need data from multiple shards (JOINs, aggregations). Mitigation: avoid cross-shard queries by design, denormalize, or join at the application level.
  • Distributed Transactions: Transactions that span multiple shards. Mitigation: use two-phase commit (rarely worth it), redesign to avoid them, or use the saga pattern.
  • Resharding: Adding or removing shards requires moving data. Mitigation: use consistent hashing, double-write during migration, or a planned maintenance window.
  • Uneven Data Distribution: Some shards become much larger than others. Mitigation: use hash sharding, monitor shard sizes, and rebalance periodically.
  • Schema Changes: Changing the schema across hundreds of shards. Mitigation: use migration tools, apply changes sequentially, and tolerate temporary inconsistency.
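The application-level joining mitigation can be sketched as follows. `queryShard` is a hypothetical helper that runs SQL against one shard and resolves to an array of rows; the sketch assumes users are sharded by user_id while orders are sharded by order_id, so a user's orders must be gathered from every shard:

```javascript
// Application-level join: fetch the user from its shard, then
// scatter the orders query to all shards and gather the results.
async function userWithOrders(userId, shards, queryShard) {
  // The user row lives on exactly one shard (sharded by user_id).
  const userShard = shards[userId % shards.length];
  const [user] = await queryShard(
    userShard, 'SELECT * FROM users WHERE user_id = ?', [userId]);

  // Orders may live on any shard: query all of them in parallel.
  const perShard = await Promise.all(shards.map(shard =>
    queryShard(shard, 'SELECT * FROM orders WHERE user_id = ?', [userId])));

  return { ...user, orders: perShard.flat() };
}
```

The scatter-gather step is why the table above recommends co-locating related rows (for example, sharding orders by user_id instead) whenever the access pattern allows it.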

Resharding (Adding or Removing Shards)

As data grows, you need to add more shards. Resharding redistributes data across the new shard count. This is a complex operation that requires careful planning.

Resharding strategies:
1. Consistent Hashing
   - Each shard covers a range of hash values
   - Adding a shard only redistributes a fraction of data
   - Used by Cassandra, DynamoDB

2. Planned Maintenance Window
   - Stop writes, export data, re-import with new sharding
   - Simple but requires downtime

3. Double-Write Migration
   - Write to both old and new sharding
   - Backfill historical data
   - Switch reads to new sharding
   - No downtime but complex

4. Incremental Migration
   - Move data gradually, shard by shard
   - Long migration window but minimal impact

Consistent hashing example:
Hash ring (0 to 2^32-1):
  0──────────────────────────────────2^32
  │         │               │         │
Shard 1   Shard 2         Shard 3   Shard 1

Key hash falls into shard's range on the ring.
Adding a shard only affects neighboring ranges.
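A minimal hash ring illustrating the idea. FNV-1a is an illustrative hash choice, and real implementations (Cassandra, DynamoDB) place many virtual nodes per shard on the ring to smooth the distribution:

```javascript
// FNV-1a: a simple, stable 32-bit string hash (illustrative choice).
function hash32(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class HashRing {
  constructor(shards) {
    // Each shard owns one point; entries sorted by ring position.
    this.ring = shards
      .map(name => [hash32(name), name])
      .sort((a, b) => a[0] - b[0]);
  }
  shardFor(key) {
    const h = hash32(String(key));
    // First shard at or after the key's position, wrapping around.
    const entry = this.ring.find(([pos]) => pos >= h) || this.ring[0];
    return entry[1];
  }
}

const ring = new HashRing(['shard-1', 'shard-2', 'shard-3']);
```

Because adding a shard inserts only one point on the ring, any key whose mapping changes must move to the new shard; keys outside the arc it takes over keep their shard, which is exactly the property that makes resharding cheap.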

Sharding vs Other Scaling Techniques

  • Vertical Scaling: Add more CPU, RAM, or disk to the existing server. Best for quick fixes and small-to-medium data; limited by hardware ceilings and cost.
  • Replication: Copy the same data to multiple servers. Best for read scaling and high availability; does not scale writes.
  • Partitioning: Split tables within the same database. Best for managing large tables and archiving old data; everything still runs on one server.
  • Sharding: Split data across multiple servers. Best for write scaling and massive datasets; adds complexity and cross-shard operations.

Common Sharding Mistakes to Avoid

  • Sharding Too Early: Sharding adds significant complexity. Only shard when a single server cannot handle your workload.
  • Poor Shard Key Choice: Uneven distribution or frequently changing shard keys cause major problems.
  • Cross-Shard Queries: Designing queries that need to read from many shards becomes slow and complex.
  • Distributed Transactions: Trying to maintain ACID across shards is very hard. Redesign to avoid.
  • No Monitoring: Without monitoring, you cannot detect hot spots or uneven distribution.
  • Ignoring Resharding: Plan for resharding from day one. You will need to add shards as data grows.

Sharding Best Practices

  • Design for Sharding Early: Structure the application so it can shard later (for example, include the shard key in every query), even if you launch with a single shard.
  • Choose Shard Key Carefully: Test shard key distribution with real data patterns.
  • Avoid Cross-Shard Queries: Design data access patterns to stay within one shard per query.
  • Use Hash Sharding for Even Distribution: Range sharding often leads to hot spots.
  • Plan for 2x Growth: When adding shards, double the count to avoid frequent resharding.
  • Monitor Shard Sizes and Load: Track distribution and rebalance when needed.
  • Test Resharding Process: Practice resharding in staging before doing it in production.

Frequently Asked Questions

  1. At what data size should I consider sharding?
    There is no fixed number. Shard when you cannot solve performance problems with vertical scaling, replication, or optimization. For many applications, this is in the hundreds of gigabytes to terabytes range.
  2. Can I shard an existing database?
    Yes, but it is complex. You need to choose a shard key, redistribute data, and update application queries. Plan for downtime or use a double-write migration strategy.
  3. What is the difference between sharding and partitioning?
    Partitioning splits a table within a single database server. Sharding splits data across multiple servers; conceptually, it is horizontal partitioning applied across machines, and each shard can also use partitioning internally.
  4. Does sharding work with joins?
    Joins that span shards are complex and slow. Design your schema and queries so that related data lives on the same shard. Denormalize or use application-level joins when necessary.
  5. What is the difference between sharding and replication?
    Replication copies the same data to multiple servers (read scaling). Sharding distributes different data across servers (write scaling). They are often used together: sharding for writes, replication within each shard for high availability.
  6. What should I learn next after database sharding?
    After mastering sharding, explore database replication, consistent hashing, distributed transactions, and high availability architectures for complete large-scale database mastery.