After understanding how to pick the correct dividing logic we continue our journey into database sharding. Many say that sharding is partitioning and they are right, but keep in mind that it’s the most complex form of all. In order to better grasp the concept, think about a field of flowers. Unpartitioned dataIn a normal situation (database), the flowers are all together.

What if you want to pick only the red flowers? Partitioned dataIn this case you would have to check every flower and see which one has the desired color, than pick it up, but that would take to long.

Instead, why not plant all the flowers based on their color. So, if you’d like to get the red ones only, it would be easy as pie.
The only problem which could appear would be if you wanted only the flowers which had 5 petals. That is why you must carefully think things over before starting to split your data.

Alright then, we’ve setup the logic, what next? It’s time to implement it.Now, the implementation is the tricky part.

First order of business is creating some sort of mapping for your data, so the interrogator will know where to search for information. This can be accomplished in 2 ways. You either hard-code it into the query parser, or you setup a mapping table.
Let me give you some examples. Take the posts table from the first part of the guide. The example given yesterday was splitting the posts data in odd and even category ids. This logic is so simple, it can be hard-coded in the query parser (I’ll show you just that in a later post) or in the object (POJO, POPO, you name it) or in the proxy. Here’s a quick preview:

  • a call to the category object is made, requesting a page. The object than calls it’s method, getPosts for example. The method than checks for the context id from which it has been called and sees if it’s odd or even then appends the proper suffix to the posts table in the query (”_odd” or “_even”).
  • the database abstraction layer receives a query. It parses it to see if it contains references to the posts table. If that’s the case, it continues to search and see if the category id is involved. If affirmative, rewrites the query to point to the correct posts table
  • the proxy receives a query. The handling is made the same as above

There might be the case when you start an app from the scratch and don’t have to scale an already existing one. If so, you can just write prepared statements and call them, but then you would be tied to the database.
If you have complex data splitting algorithms, than it might be a good idea to create a mapping table. The only downside is that this would be a major bottleneck if not planned carefully, as all queries will pass through it first and it doesn’t matter if you have the shards spread on hundreds of machines, the mapping table gets nuked by every request.
So, how does a mapping table look like? Check it out bellow:

CREATE TABLE `mydatabase`.`mapper` (
`pid` INT( 10 ) UNSIGNED NOT NULL ,
`cid` INT( 10 ) UNSIGNED NOT NULL ,
`table` VARCHAR( 50 ) NOT NULL ,
PRIMARY KEY ( `pid` )
) ENGINE = MYISAM

The query flow would look like:

SELECT pid, table FROM mapper WHERE cid=$cid;
// Get the post ids and the table from the above query and insert them in the one bellow
SELECT * FROM table_from_above WHERE pid IN (pid_list_from_above);

As you can see, the weight now sits in the writing part, as the mapping table must be populated. When reading, the splitting of the data algorithm involved is no longer a concern.
In part three of the series, I’ll talk about consistency of newly created data, as well as sharding models of big players on the web 2.0 scene.

Save and share: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • bodytext
  • Sphinn
  • del.icio.us
  • Mixx
  • Google
  • StumbleUpon
  • Technorati
  • YahooMyWeb

Comments

6 Responses to “Database sharding unraveled - part II”

  1. sofia on May 29th, 2008 5:50 am

    the link to the first part is misspelled.

    Looking forward to the continuation of this series :) Thanx

  2. Bogdan on May 29th, 2008 9:59 am

    Thank you Sofia, I’ve corrected the error.
    Pretty soon I’ll post the continuation.

  3. Piccolo Principe on May 29th, 2008 12:12 pm

    I wonder if DBMSs don’t offer some features like this at a lower level.

  4. Bogdan on May 29th, 2008 1:19 pm

    The giant expensive ones like Oracle or DB2 indeed offer these features, but there are 2 things to consider:
    1. The price -> can’t scale big with prices like that
    2. You still have to code your way through, only you wouldn’t be coding in the app, but inside the DB.
    MySQL, for example, has some nice partitioning features, but they are extremely limited.
    Anyway, I’ll get on this issue in the next part of the series, because there is a reason for using app level partitioning.

  5. Sam on September 19th, 2008 1:22 am

    I am new to this… In fact I am trying to understand how sharding works and this is the first article I came across.

    Scaling by sharding seems complex. E.g the hardcoding in the query parser approach explained above has a limitation; you have to decide beforehand how many shards you are creating (and either stick to it or change the code in application). In this case 2: even & odd. If your posts in each catagory or any specific catagory grow huge, you will end up with large number of rows in posts table on that shard. How do you address this? Am I missing something obvious!

    Nice article though. Did the part III come out?

    Thanks!

  6. Bogdan on September 19th, 2008 10:32 am

    Hi Sam!
    The part III has been out for some time now. You should read it, it will answer your questions.
    http://lifescaler.com/2008/06/database-sharding-unraveled-part-iii

Leave a Reply




Advertisements