How does database sharding work in SQL server? Explained with examples
Imagine that you are using a Relational Database like SQL. You are using a single monolithic SQL server. But as more data starts coming in, the SQL queries (like read, write, and update operations) start becoming slow, and, as a result, the user experience takes a hit. You have tried optimizing your queries, and also setting up proper SQL indexes but nothing is helping. At this point, you need to scale up your database, and database sharding is one of the ways to do that.
Let's dive right in -
What is Database Sharding?
The literal meaning of shard means "a small part of something big". Sharding is the process of dividing a whole into smaller parts. Database sharding is the process of distributing data across multiple servers or nodes. Sharding is useful for scaling databases and improving performance.
Why is Database Sharding required?
Database sharding is required to scale up databases and improve their performance.
A single database server has a bottleneck as the data starts to increase. It has a limited amount of storage, RAM, and CPU, hence there is a certain physical limit to how many requests and queries it can handle. If the number of requests goes beyond this limit, you would need to split up this data across various servers or nodes so that the data is highly available. This is what sharding is all about. There are various ways to perform database sharding.
In short, the benefits of database sharding are
Improved performance
Increased capacity
Higher availability
Reduced dependency on individual nodes
How does sharding work in the SQL server?
Database sharding can be performed in a number of ways, which have their own different benefits and drawbacks. Let's look at the popular methods of sharding:
Horizontal or Range Based Sharding
Divide the entire data in a table into small chunks and store them on different database servers. This type of sharding is called Horizontal or Range based Sharding which allows databases to grow as needed by adding more nodes to store data.
The idea here is you divide the data into smaller buckets and keep on adding more buckets or data in each bucket as required.
Let's take an example of an SQL table. Let's say there is an employee's table in a SQL database. The fields in this table are `employee_id`, `employee_name`, `department`, `designation`, `address`, and `qualifications`.
If I want to shard this table horizontally, I have a couple of options.
I shard based on the initials of `employee_name`. For example, in the employees SQL table, all the employees whose name starts with the alphabets between "A"-"E" can lie in one bucket. Similar employees whose names start with "F"-"J" can lie in another bucket and so on.
Another option is to shard the SQL table based on the employee's "department". For example, all employees belonging to the HR department can go into one bucket and all the employees belonging to marketing can go into another bucket, and so on.
Which parameter to select to shard this SQL table depends completely on the use case and business logic. The end goal is that the data should be evenly distributed among each node or bucket.
The main benefit of horizontal sharding is that all the servers will have the SQL schema of the original database and data from one database server won't be related to the data in the other servers. Hence horizontal sharding is easier to implement. Even from the application layer point of view, implementation is simple as you won't be needing to query data from 2 different shards.
Speaking about disadvantages, a major disadvantage of horizontal sharding is that the data may not be evenly distributed across each server.
Vertical Sharding
Vertical Sharding involves splitting up tables into smaller tables by dividing columns based on a certain set of properties. The SQL table is divided such that some columns are in one server and some are in the other server. Each server/bucket has various properties and you can handle and manage different properties in different ways.
We can consider an SQL table named employees with the following column names: `employee_id`, `employee_name`, `department`, `designation`, `address`, `age`, and `qualifications`.
We can vertically shard this SQL table in a couple of ways:
In one server, we can keep columns like `employee_id`, `employee_name`, `department`, and `designation`, and in the other database server, we can keep the `employee_id` and the remaining columns. Yes, the primary key will be present in both of these servers so that we can uniquely identify the rows. Here I have divided the columns based on the use case, company-related information in one server, and personal information in the other server.
An alternative way to share this table would be to shard based on the frequency of usage of columns. For example, the columns that are fetched frequently can be in one server and the remaining columns can be in a separate server.
One of the main benefits of vertical partitioning is that we can separate columns based on a particular use case. For example, if we separate columns based on how frequently a particular set of columns are updated, it would make sense as the operation would be faster as there would be fewer columns.
The disadvantages of vertical sharding include higher complexity in the development process, especially in the application layer as there can be use cases where we would need to fetch data across various shards.
Adding a new column can be tricky as you would need to decide where to place the column and you may also need to create a new shard for it and regroup certain columns.
Key or hash-based sharding
This is similar to hash-based bucketing. In this, each row/entry in a table is put into different buckets based on the hash of their primary or unique keys.
To understand hash-based sharding with SQL, we can consider the above "employees" SQL table, where "employee_id" is the primary key. Suppose we know that we need to spread the data of this SQL table into 4 servers. A simple way to shard the data is -
Take the hash of the primary key, i.e. the "employee id" here
Modulo this hash with the number of database servers, i.e. 4 here
Then place that row in the corresponding server number
While everything looks fine, the main problem comes when you want to add or remove database servers. This problem can be solved by consistent hashing.
To know more about consistent hashing, checkout this blog on Load Balancers and consistent hashing. https://www.thegeekyminds.com/post/how-do-load-balancers-work-what-is-consistent-hashing-system-design
And that's a wrap! Hi, I am Gourav Dhar, a software developer and I write blogs on Backend Development and System Design. Subscribe to my Newsletter and learn something new every week - https://thegeekyminds.com/subscribe
Comments