When your software application grows in volume and size, the database will most probably become overloaded and won’t be able to handle the incoming data as effectively as before. A common solution to this problem is sharding – but you need to know all the pros and cons in advance in order not to overcomplicate your app even more.
So, what is sharding and how do you do it right?
Below, we talk about its definition, available alternatives, and ways to implement it in order to maximize the performance and potential of your application.
What is sharding in a database?
Sharding is the distribution of a single dataset across multiple databases (shards) in order to make the load more even and manageable. In other words, it is a form of horizontal scaling: you add extra nodes (shards) to a database to improve its performance.
The most common use case for database sharding is an app that has a high write volume (alongside a high read volume) and processes massive amounts of data. If your app is mostly read-focused, on the other hand, sharding is usually not recommended, and this is where sharding alternatives come in.
Do you really need database sharding? Looking at possible alternatives
So, you are considering sharding to improve the performance of your database. While this strategy can indeed be a silver bullet, sometimes it’s best to use other options. The right choice depends on the database type, the app’s workload, the resources available for database maintenance, and other factors. Here are the available alternatives.
While horizontal scaling implies adding more nodes to the database, vertical scaling implies making the existing node more powerful. This includes upgrading the server hardware, adding RAM, and similar activities. In this way, you don’t have to change the database architecture; you simply make the existing server capable of handling the increasing load.
Replication resembles sharding in the sense that your data ends up on several machines, but instead of splitting the dataset you create multiple full copies of your database. These copies hold exactly the same data as the primary database and are stored on different machines. A great advantage of replication is that it enables load balancing and increases the availability of the data. This approach is most often used for read-focused workloads, so if your application is mostly write-focused, replication may add unnecessary complexity.
Finally, there is always the option of delegating a certain workload to a third-party provider (like Amazon) or to a specialized service. In this way, you offload part of the load to another database and don’t have to worry about maintaining it. On the other hand, such delegation means that you won’t be in full control of the security and maintenance of this external database, which might be an issue in some cases.
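To make the read-focused nature of replication more concrete, here is a minimal in-memory sketch. The `ReplicatedStore` class and its methods are hypothetical names made up for illustration, and the replicas are copied synchronously, which is a simplification of how real databases replicate.

```python
import itertools
from typing import Dict, List, Optional, Tuple


class ReplicatedStore:
    """Toy primary/replica setup: writes go to the primary, reads rotate over replicas."""

    def __init__(self, replica_count: int) -> None:
        self.primary: Dict[str, str] = {}
        self.replicas: List[Dict[str, str]] = [{} for _ in range(replica_count)]
        # Round-robin iterator used to balance reads across the replicas.
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key: str, value: str) -> None:
        # Every write hits the primary and must then reach every copy,
        # which is why replication does not add write capacity.
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value

    def read(self, key: str) -> Tuple[int, Optional[str]]:
        # Each read is served by the next replica in the rotation.
        index = next(self._next_replica)
        return index, self.replicas[index].get(key)


store = ReplicatedStore(replica_count=2)
store.write("user:1", "Alice")
print(store.read("user:1"))  # served by replica 0
print(store.read("user:1"))  # served by replica 1
```

Reads are spread over the copies, while each write still has to be applied to all of them.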
The main benefits of sharding
Getting back to sharding, what makes this approach so popular? Below we list the main advantages that it brings:
- Higher reliability: with a single database, if it fails, the whole app fails too. With a sharded database, if one node fails, the app remains partially functional because the other shards keep serving their data.
- Effective scaling: in principle, you can keep adding shards as your data grows, so capacity is not capped by a single machine and you can allocate resources as you need them.
- Increased read/write capacity: because the dataset is distributed across several shards, the overall read and write capacity increases greatly.
The biggest challenges of database sharding
As mentioned above, sharding is not for everyone – and here are the main considerations and challenges that might occur.
Potential issues with response time and latency
Since the data in a sharded database is distributed across multiple shards, query routing might take longer than usual. In addition, if the required data is spread horizontally over several shards, the router has to query each of these shards and then spend some time merging the results. Such a workflow can slow down the execution of operations and result in slower response times.
It can be quite challenging to properly manage a single database. With sharding, you have to keep an eye on several databases as well as on data integrity and security. Hence, sharding adds complexity to database administration and can complicate tasks such as data analysis. Since the data is distributed across nodes, developers have to query these nodes, merge the information, and only then analyze it.
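The query-then-merge workflow can be sketched as follows. This is only an illustration: `SHARDS` and `find_orders_over` are hypothetical names, and the shards are plain in-memory dictionaries rather than real database nodes.

```python
from typing import Dict, List

# Hypothetical shards: each maps a customer name to an order total.
SHARDS: List[Dict[str, int]] = [
    {"alice": 120, "bob": 40},
    {"irene": 75, "omar": 310},
    {"quinn": 15, "zoe": 220},
]


def find_orders_over(threshold: int) -> List[str]:
    """Query every shard in turn, then merge and sort the partial results."""
    partial_results: List[List[str]] = []
    for shard in SHARDS:  # one round trip per shard
        matches = [name for name, total in shard.items() if total > threshold]
        partial_results.append(matches)
    # The merge step: flatten the per-shard lists into one sorted answer.
    merged = [name for part in partial_results for name in part]
    return sorted(merged)


print(find_orders_over(100))  # ['alice', 'omar', 'zoe']
```

Because the filter is not on the shard key, no shard can be skipped, and the extra round trips plus the merge are where the added latency comes from.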
High infrastructure costs
Every shard runs on a separate machine and requires additional computational resources. This means that sharding results in higher infrastructure costs, since you have to maintain all your shards instead of a single database on just one server. And while sharding indeed brings many benefits, you need to be aware of these costs from the beginning and plan for them accordingly.
The main sharding types
There is a great variety of sharding methods, and the right one depends on the data that you process and store and on your specific business needs. Below, we look at the main sharding types and how they work.
Ranged sharding, also known as dynamic sharding, is highly effective and relatively simple to understand and implement. With ranged sharding, you first predefine the ranges and create a shard key, which is a single indexed field (or several fields covered by a compound index) that defines how the rows are distributed. Next, the shard key field of each record is taken as an input and, based on the predefined ranges, the record is allocated to the appropriate shard.
Though it may sound complicated, it’s actually not at all. Let’s look at an example. Say, you need to partition the data according to the customers’ first names. In this case, your shard keys and predefined range will look something like this:
- From A to H
- From I to P
- From Q to Z
So when a record is written to the database, the workflow is as follows:
- The application determines which range the record’s shard key value falls into;
- The app matches that range to the corresponding node (shard) that stores the data;
- The input (row) is stored on that machine.
Note though that it’s critical to define shard keys and ranges properly, since it’s easy to put too much data on a single node. By this, we mean that one shard can end up with a much larger number of rows than the others, which will obviously overload this node.
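The steps above can be sketched in a few lines of Python. This is a minimal sketch that assumes the three first-name ranges from the example; `UPPER_BOUNDS`, `range_shard_for`, and `store_row` are hypothetical names, and the shards are in-memory dictionaries.

```python
from bisect import bisect_left
from typing import Dict, List

# Upper bound of each range: A-H -> shard 0, I-P -> shard 1, Q-Z -> shard 2.
UPPER_BOUNDS = ["H", "P", "Z"]
shards: List[Dict[str, dict]] = [{} for _ in UPPER_BOUNDS]


def range_shard_for(first_name: str) -> int:
    """Return the index of the shard whose range contains the name's first letter."""
    letter = first_name[:1].upper()
    if not ("A" <= letter <= "Z"):
        raise ValueError(f"no range defined for {first_name!r}")
    # bisect_left finds the first range whose upper bound is >= the letter.
    return bisect_left(UPPER_BOUNDS, letter)


def store_row(first_name: str, row: dict) -> int:
    """Route the row to its shard and store it there."""
    index = range_shard_for(first_name)
    shards[index][first_name] = row
    return index


print(store_row("Alice", {"plan": "basic"}))  # 0
print(store_row("Omar", {"plan": "pro"}))     # 1
print(store_row("Zoe", {"plan": "basic"}))    # 2
```

Keeping the range boundaries in a sorted list means that adding or moving a boundary is a data change, not a code change.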
Hashed sharding is similar to ranged sharding in the sense that a set of fields (the shard key, which is used here, too) determines which node a record is allocated to. The difference is that the shard key value is first passed through a hash function, a mathematical formula that turns its input into a hash value. The hash value is then mapped to a shard and voilà, the record is allocated to the needed physical node.
The main benefit of hashed sharding is the even distribution of data across the nodes. The drawback, however, is that hashed sharding does not distribute the data based on its meaning, so related records rarely end up next to each other. Hence, a query for several records will most probably have to be sent to several shards. This, in turn, results in more broadcast operations.
Geo sharding distributes the data according to geographic location. Since each region (or country) has its own shard, which is typically hosted close to the users in that region, latency decreases significantly because it takes less time to retrieve the data. The user experience improves as well, since users wait less between making a request and receiving the information they need.
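Here is a minimal sketch of the idea, assuming four shards and SHA-256 as the hash function. Both choices, as well as the `hash_shard_for` name, are illustrative assumptions rather than what any particular database does internally.

```python
import hashlib
from collections import Counter

SHARD_COUNT = 4  # assumed number of shards


def hash_shard_for(shard_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Hash the shard key value and map the hash to a shard index."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then take the remainder.
    return int.from_bytes(digest[:8], "big") % shard_count


# The same key always lands on the same shard.
print(hash_shard_for("customer-42") == hash_shard_for("customer-42"))  # True

# Many different keys spread roughly evenly over the shards.
distribution = Counter(hash_shard_for(f"customer-{i}") for i in range(1000))
print(sorted(distribution.items()))
```

A stable hash such as SHA-256 is used here instead of Python’s built-in `hash()`, because the built-in function is randomized per process for strings and would route the same key differently after a restart. Note also that with a plain modulo, changing the number of shards remaps most keys.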
But similar to range-based sharding, geo sharding may result in uneven data distribution, as one shard may contain a much bigger number of rows than others.
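Both the routing and the risk of uneven distribution can be shown in a small sketch. The country-to-shard mapping, the shard names, and the default shard below are all made up for illustration.

```python
from collections import Counter
from typing import Iterable

# Hypothetical mapping from ISO country codes to regional shards.
REGION_TO_SHARD = {
    "DE": "eu-shard",
    "FR": "eu-shard",
    "US": "us-shard",
    "JP": "apac-shard",
}
DEFAULT_SHARD = "us-shard"  # assumed fallback for unmapped countries


def geo_shard_for(country_code: str) -> str:
    """Return the shard that stores data for the given country."""
    return REGION_TO_SHARD.get(country_code.upper(), DEFAULT_SHARD)


def rows_per_shard(country_codes: Iterable[str]) -> Counter:
    """Count how many rows each shard would receive."""
    return Counter(geo_shard_for(code) for code in country_codes)


# Six users in the US and one each in Germany, France, and Japan.
users = ["US"] * 6 + ["DE", "FR", "JP"]
print(rows_per_shard(users))
# Counter({'us-shard': 6, 'eu-shard': 2, 'apac-shard': 1})
```

With this sample, the US shard holds two thirds of the rows, which is exactly the kind of imbalance described above.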
A tip on creating shard keys
By now, we know that sharding is a great way to offload your database and improve its performance, but it also comes with specific bottlenecks and challenges. To achieve even data distribution and smooth database performance, we recommend looking at the two main attributes of an effective shard key: cardinality and frequency.
Cardinality is the number of possible values of a shard key, and it determines the maximum number of possible shards. If your shard key is a yes/no field, for example, you can have at most two shards. Hence, the number of possible shard key values is the upper limit on the number of shards.
When defining a shard key, high cardinality is important as it allows you to increase the number of shards.
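One quick way to compare candidate shard keys is to count their distinct values in a sample of your data. The sketch below uses made-up rows and field names (`customer_id`, `is_premium`) purely as an illustration.

```python
from typing import Dict, List


def cardinality(rows: List[Dict[str, object]], field: str) -> int:
    """Return the number of distinct values the field takes in the sample."""
    return len({row[field] for row in rows})


# Hypothetical sample of customer records.
rows = [
    {"customer_id": 1, "is_premium": True},
    {"customer_id": 2, "is_premium": False},
    {"customer_id": 3, "is_premium": False},
    {"customer_id": 4, "is_premium": True},
]

print(cardinality(rows, "is_premium"))   # 2 -> at most two shards
print(cardinality(rows, "customer_id"))  # 4 -> grows with the data
```

A field such as `customer_id` keeps gaining new values as the dataset grows, whereas a yes/no field stays at two no matter how much data you add.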
Frequency refers to how the data is distributed across the possible values of the shard key, that is, the probability that a record ends up in a given shard. Say you have a fitness app, you shard by user age, and you know that your target audience is mainly 25–30 years old. In that case, most of your records will land on the node that holds this age range, and a data hotspot will occur. It is therefore important to pick a shard key whose values occur with a well-distributed frequency to avoid data overload.
Now that we’ve answered the “what is sharding” question, it’s clear that sharding is an effective way to ramp up the performance of your database, decrease latency, and make your clients happier with faster response times. However, as with any other software solution, database sharding may not be for everyone and calls for thorough preliminary research and work. Without proper preparation, sharding may cause more problems than benefits and slow your app down even more. Thus, consult a knowledgeable database expert to decide which scaling strategy (sharding, replication, etc.) will best meet your needs and how exactly it should be carried out.
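Frequency can be checked in a similar way, by measuring how large a share of the records falls on the most common shard key value. The age buckets and the `hottest_value_share` helper below are hypothetical and only illustrate the fitness app example.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def hottest_value_share(values: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the most frequent value and the fraction of records that carry it."""
    if not values:
        raise ValueError("need at least one value")
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Hypothetical sample: age bucket of ten users of a fitness app.
age_buckets = ["25-30"] * 7 + ["18-24"] * 2 + ["31-40"]

print(hottest_value_share(age_buckets))  # ('25-30', 0.7)
```

If a single value accounts for most of the sample, as here, the shard that holds it is likely to become a hotspot, and a different or compound shard key may be worth considering.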