
System design paradigm: Sharding

Scale out when the data is too big or there are too many writes

The primary-replica pattern and in-memory caches solve the problem on the read side. But what if the total dataset is too big to fit on one host, or the write throughput is too high? The answer is intuitive: we split the data across multiple hosts. This is called sharding.

Notice that write latency is usually not an issue because

  • We expect writes to be slow for most applications

  • A write can return as soon as it is persisted to the commit log; we thereby trade strong consistency for lower latency
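The commit-log idea above can be sketched as follows. This is a minimal, assumed design, not any particular database's implementation: a write is acknowledged once appended to an append-only log, and a background worker applies it to the main table later.

```python
import queue
import threading

commit_log = []        # append-only durable log (in-memory stand-in)
table = {}             # main store, updated asynchronously
pending = queue.Queue()

def write(key, value):
    commit_log.append((key, value))  # persist to the log first
    pending.put((key, value))        # schedule the real update
    return "ack"                     # return before the table is updated

def apply_worker():
    # Drains pending writes into the main table in the background.
    while True:
        k, v = pending.get()
        table[k] = v
        pending.task_done()

threading.Thread(target=apply_worker, daemon=True).start()

write("order:1", {"quantity": 3})
pending.join()  # for the demo we wait; real clients would not
```

The client sees low latency because `write` returns after the log append; a reader hitting `table` immediately afterward may not see the new value yet, which is exactly the consistency tradeoff described above.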

Sharding key

A key decision in sharding is, given a record R, how to determine which shard it belongs to. The sharding design includes a key function that maps R to a key string. We then compute a hash of the key; the hash (an integer) modulo N is the shard ID, where N is the number of hosts.

Shard(R) = Hash(Key(R)) mod N

Notice that Key(R) doesn’t have to be available at query time. For example, suppose R is <user_id, order_id, quantity>. We can use order_id as the sharding key (Key(R) = order_id). We can still serve the query “find all orders for a given user” by running it on every shard.
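The routing rule above can be sketched in a few lines. The record shape and shard count are assumptions for illustration; note that a stable hash (here MD5) is used rather than Python’s built-in `hash()`, which is randomized per process and unsuitable for routing.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size; this is N in the formula above

def key(record: dict) -> str:
    # Sharding key function: Key(R) = order_id (an assumption for this example).
    return str(record["order_id"])

def shard_id(record: dict, n: int = NUM_SHARDS) -> int:
    # Shard(R) = Hash(Key(R)) mod N, using a deterministic hash.
    digest = hashlib.md5(key(record).encode()).hexdigest()
    return int(digest, 16) % n

record = {"user_id": 42, "order_id": 1001, "quantity": 3}
print(shard_id(record))  # a stable shard ID in [0, NUM_SHARDS)
```

Because the key function only looks at order_id, two records for the same user may land on different shards, which is why the “query all shards” fallback in the text is needed.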

When choosing a sharding key, consider:

  • Avoidance of write hotspot

  • Avoidance of query hotspot

Query pattern: one or all

There are two query patterns.

  • Query One: If we query records by a single sharding key, the query can be handled by a single shard.

  • Aggregation: If the query’s predicate can’t be translated to a subset of sharding keys, then the query must be processed on each shard. The result is aggregated before returning to clients.

Each pattern has advantages and weaknesses.
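The two patterns can be sketched with an in-memory stand-in for the shards. The data and shard layout are assumptions for illustration; in the query-one case the shard index would come from the routing function, while the aggregation case scatters to every shard and gathers the results.

```python
# Four toy shards; records are placed by order_id, per the earlier example.
shards = [
    [{"user_id": 1, "order_id": 10, "quantity": 2}],
    [{"user_id": 2, "order_id": 11, "quantity": 1}],
    [{"user_id": 1, "order_id": 12, "quantity": 5}],
    [],
]

def query_one(shard_idx: int, order_id: int):
    # "Query one": the sharding key pins the query to a single shard.
    return [r for r in shards[shard_idx] if r["order_id"] == order_id]

def query_all(user_id: int):
    # "Aggregation": the predicate (user_id) is not the sharding key,
    # so every shard is queried and the results are merged.
    results = []
    for shard in shards:
        results.extend(r for r in shard if r["user_id"] == user_id)
    return results

print(query_one(2, 12))  # single-shard lookup
print(query_all(1))      # scatter-gather across all shards
```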

The weaknesses of each pattern can be mitigated. For example, read hotspots can be mitigated by replication and caching, and read latency by imposing a per-shard timeout on the RPC (at the expense of data completeness).
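The per-shard timeout tradeoff can be sketched with a thread pool standing in for the shard RPCs. The timeout value and the “slow shard” are assumptions for the demo: results from shards that miss the deadline are simply dropped, bounding latency at the cost of completeness.

```python
import concurrent.futures
import time

SHARD_TIMEOUT = 0.1  # seconds; an assumed per-shard RPC budget

def query_shard(shard_idx: int):
    # Simulated per-shard RPC; shard 2 plays the role of a slow replica.
    if shard_idx == 2:
        time.sleep(1.0)
    return [f"row-from-shard-{shard_idx}"]

def fan_out_with_timeout(num_shards: int = 4):
    results, dropped = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_shards) as pool:
        futures = {pool.submit(query_shard, i): i for i in range(num_shards)}
        done, not_done = concurrent.futures.wait(futures, timeout=SHARD_TIMEOUT)
        for f in done:
            results.extend(f.result())
        # Shards that missed the deadline contribute nothing: the response
        # is fast but incomplete.
        dropped = [futures[f] for f in not_done]
        # Note: the context manager still joins the slow worker thread on
        # exit; a real RPC client would cancel the outstanding call instead.
    return results, dropped
```

Returning the dropped shard IDs lets the caller flag the response as partial, which is often preferable to silently serving incomplete data.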

Query hotspots are interesting in that aggregation doesn’t really distribute the query load; it repeats the full load on each shard. As a result, the query load on each shard is the same, which makes it possible to assign the same number of read replicas per shard.

The choice should be determined by each application’s tradeoffs.

Last words: don’t do it unless you must

Sharding is rarely a good idea. The costs include

  • Harder to write application code

  • Harder to reason and debug the system

  • Harder to operate/maintain

  • Higher complexity everywhere

Tremendous architectural work is needed to get sharding right and make it convenient for application developers, so don’t adopt it lightly. The good news is that most projects don’t have to be sharded because their write QPS is low in the first place. Teams that do need sharding should be mature enough to know what they are doing.

Source: Medium

The Tech Platform
