Apr
15
After understanding how to pick the correct dividing logic we continue our journey into database sharding. Many say that sharding is partitioning and they are right, but keep in mind that it’s the most complex form of all. In order to better grasp the concept, think about a field of flowers.
In a normal situation (database), the flowers are all together.
What if you want to pick only the red flowers?
In this case you would have to check every flower and see which one has the desired color, than pick it up, but that would take to long.
Instead, why not plant all the flowers based on their color. So, if you’d like to get the red ones only, it would be easy as pie.
The only problem which could appear would be if you wanted only the flowers which had 5 petals. That is why you must carefully think things over before starting to split your data.
Alright then, we’ve setup the logic, what next? It’s time to implement it.Now, the implementation is the tricky part.
First order of business is creating some sort of mapping for your data, so the interrogator will know where to search for information. This can be accomplished in 2 ways. You either hard-code it into the query parser, or you setup a mapping table.
Let me give you some examples. Take the posts table from the first part of the guide. The example given yesterday was splitting the posts data in odd and even category ids. This logic is so simple, it can be hard-coded in the query parser (I’ll show you just that in a later post) or in the object (POJO, POPO, you name it) or in the proxy. Here’s a quick preview:
- a call to the category object is made, requesting a page. The object than calls it’s method, getPosts for example. The method than checks for the context id from which it has been called and sees if it’s odd or even then appends the proper suffix to the posts table in the query (”_odd” or “_even”).
- the database abstraction layer receives a query. It parses it to see if it contains references to the posts table. If that’s the case, it continues to search and see if the category id is involved. If affirmative, rewrites the query to point to the correct posts table
- the proxy receives a query. The handling is made the same as above
There might be the case when you start an app from the scratch and don’t have to scale an already existing one. If so, you can just write prepared statements and call them, but then you would be tied to the database.
If you have complex data splitting algorithms, than it might be a good idea to create a mapping table. The only downside is that this would be a major bottleneck if not planned carefully, as all queries will pass through it first and it doesn’t matter if you have the shards spread on hundreds of machines, the mapping table gets nuked by every request.
So, how does a mapping table look like? Check it out bellow:
CREATE TABLE `mydatabase`.`mapper` ( `pid` INT( 10 ) UNSIGNED NOT NULL , `cid` INT( 10 ) UNSIGNED NOT NULL , `table` VARCHAR( 50 ) NOT NULL , PRIMARY KEY ( `pid` ) ) ENGINE = MYISAM
The query flow would look like:
SELECT pid, table FROM mapper WHERE cid=$cid; // Get the post ids and the table from the above query and insert them in the one bellow SELECT * FROM table_from_above WHERE pid IN (pid_list_from_above);
As you can see, the weight now sits in the writing part, as the mapping table must be populated. When reading, the splitting of the data algorithm involved is no longer a concern.
In part three of the series, I’ll talk about consistency of newly created data, as well as sharding models of big players on the web 2.0 scene.
Comments
4 Responses to “Database sharding unraveled - part II”
Leave a Reply








the link to the first part is misspelled.
Looking forward to the continuation of this series
Thanx
Thank you Sofia, I’ve corrected the error.
Pretty soon I’ll post the continuation.
I wonder if DBMSs don’t offer some features like this at a lower level.
The giant expensive ones like Oracle or DB2 indeed offer these features, but there are 2 things to consider:
1. The price -> can’t scale big with prices like that
2. You still have to code your way through, only you wouldn’t be coding in the app, but inside the DB.
MySQL, for example, has some nice partitioning features, but they are extremely limited.
Anyway, I’ll get on this issue in the next part of the series, because there is a reason for using app level partitioning.