Archive for the ‘Database Sharding’ Category
As far as I’m concerned, the first rule of database sharding is “You must have a data access layer“.
If you have architected your system in a “client-server” style were data access code is embedded in the client code then the effort required to shard is going to be huge, partially because you have so many places in your code to address and partially because of the vast amount of testing that is going to be required because you have likely touched every source file in your system.
I hit this problem some time ago. It wasn’t on a sharding project but something similar. The project was writing the next version of a system that had been written in VB6 by some inexperienced developers. Data access code and dynamically built SQL statements were all over the place. Their idea of a data access layer was a “Db” module that had methods for running queries and returning DAO result sets. E.g.,
Set rs = Db.Execute("Select * from Customers where " _ + strCol + " = " + strCustName)
And about 20% of the time they would directly call DAO and not user their own “DAL”. Yuck!
We began moving the functionality over to .NET piece by piece. I created a three-layer style architecture for the .NET pieces: UI Layer, Domain Layer (sometimes called Business Logic Layer) and Data Access Layer. We used COM Interop to call the .NET code from the VB6 exe. (I wish I could have created a .NET exe and converted the VB6 part to a DLL but … well that’s a subject for another post.)
After some successful releases the client decided they wanted to move over to using SQL Server instead of MS Access. We got a SQL Server version of the database created relatively quickly and then we moved on to converting the code to use the new database.
To convert the .NET code took one developer 3 days and that included extensive unit testing. This was because all data access code was in one place and the Domain and UI Layers knew nothing about the database: not its structure, not its constraints, not its table names, not its column names, nothing. It would have taken less time if it were not for the fact that we had to tweek some of the Access SQL so it would play nice with SQL Server.
To convert the VB6 code took three developers over two months to get about 75% done (without all the needed testing) and was then put on hold because there was a deadline to get out some high priority new features to the system users.
Why so long? It wasn’t the developers, these were smart, hardworking guys. No, it was simply the lack of a data access layer.
The same applies to sharding a database. If you don’t have your data access code isolated in one area then I’d suggest you do that first before you even think of sharding.
It is probably possible to shard without a data access layer, but then it’s probably possible to stab yourself with a machete and not bleed to death, but do you really want to risk it?
We hit some tables where some records applied to only one shard but others applied to multiple shards.
As an example, one of the tables was Users. The database is sharded on the Companies table. Many users belong to only one company (e.g. employees of a company) but there were some (such as in-house support people and super-admins) who belong to multiple companies.
The possible solutions we came up with were:
- Split the tables according to the different records (Table per Type or Table per Concrete Class). For example, keep the Users table for people who belong to only one company and create an AdminUsers table for those who belong to multiple companies.
- Add a discriminator column (e.g., UserType with values of ‘Regular’ or ‘Admin’) and handle the records according to the type (Table per Hierarchy).
- Move the split table that applies to multiple shards into a “master” database so it is not on the shards and query it there. E.g., we split Users into Users and AdminUsers and move AdminUsers onto the master database.
- When loading a user do a cross-shard query to get all the companies/shards the entity is on. Then do all further CRUD operations on the entity by hitting the specific shards you found in the first query. E.g., User #5 administers company #7 on shard 1, company #10 on shard 2 and company #15 on shard 3. He does not administer any companies on shards 4 or 5. Further CRUD for that user will hit shards 1, 2 and 3.
My preference is #1, but I think we’ll be going with #4 on all the “special” tables because the amount of extra work in implementing #1 right now is way more than we have the time or resources for. Perhaps once we have the sharded system up and running we may switch to #1, but that is in the future and a lot depends on the performance for #4. Luckily most of the “special” tables (other than User) are not high traffic, so we’re hoping it won’t be a problem.
Of course your own situation will probably be different so you should evaluate the pros and cons of these methods for your particular situation. And you may come up with other ways to do it. If you do, please leave a comment describing your solution.
Good Sharding Article
Here is a good intro article on sharding: Database Sharding.
I’m nearing the end of a database sharding project and I wanted to make some notes about what I’ve observed both for myself, so I can refer back here when I do it again, and for anyone who may be interested in the subject.
To Shard or not to Shard, That is the Question
It’s nice to dream, but do you really want to put in all the extra time and effort it will take to implement sharding before knowing that you will reach the huge volume necessary to require you to chop your database into bits?
I’d say no and the guys over at 37Signals agree: Don’t do it until you have to.
Sharding is an evolutionary step that you should undertake only when you really need to.
Should you shard from the start?
Sitting in the Ivory Tower of design, it is easy to predict which entity you will be partitioning your shards on, but after running your system for a few months, or better still a few years, you are likely to find that your lofty theories don’t stand up to the harsh realities of production.
Example: In the system I’m working on, it was originally thought that the Users table would be the one to partition by because users had Contacts and the ContactActivities table would get huge and this would be best chopped up by User. But as the business progressed and the true needs of the people using it became clear it turned out that a totally different table was the bottleneck and a totally different entity had to be used for partitioning.
If that database had been designed from scratch to shard across the Users table, then the work necessary to “reshard” would have ended up being much more than to shard it across the correct entity after waiting a while to find out what that entity was.
So I’d say, don’t shard from the start. Shard when you need to because of volume and when you have the data in your database that you can analyze to tell you what to shard on.
The one exception to this is where you are building a system that is a copy of an existing system which has already gone through the process and the sharding entity is well known.
For example, if you are building the next Facebook then you can use the same entity they do to shard by.