When IT people discuss Mean Time Between Failure (MTBF), the focus most commonly falls on disk drives. We all know the basics: the more drives you put in an array, the lower the cumulative MTBF, and so on.
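To put a number on the "more drives, lower MTBF" point, here's a minimal sketch in Python, using the usual simplifying assumption of constant, independent failure rates (the drive count and rating are illustrative):

```python
# For components that must all work (a series model), failure rates
# add, so the combined MTBF is the reciprocal of the summed reciprocals.
def combined_mtbf(mtbf_hours):
    """MTBF of a set of components that must all be working."""
    return 1.0 / sum(1.0 / m for m in mtbf_hours)

# Twelve identical drives, each rated at 1,000,000 hours MTBF:
print(combined_mtbf([1_000_000] * 12))  # ~83,333 hours - a twelfth of one drive
```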
What impact does virtualisation have on MTBF though? Are there any published studies? I suspect not yet.
I’ll be clear from the outset: I like virtualisation.
Liking it, though, doesn't stop me from questioning how many sites (particularly smaller ones) implement it, and the risk they carry of effectively decreased MTBF by putting too many eggs in one basket.
Consider for instance a small business that decides, as part of an infrastructure refresh, to replace their current fileserver, directory server, mail server, database server and internet gateway server with a single VMware ESX server. (We’ll assume of course that they do not virtualise their backup server – something you should never do.)
So, instead of having five primary production servers, each of which has some chance of experiencing a catastrophic failure, we now have one primary production server that is every bit as capable of catastrophic failure – only now it takes every workload down with it. I'm not talking about the OS layer here (though that's still relevant), but the hardware layer.
Let’s be honest with ourselves – this is IT, and things can go wrong in IT just as they can anywhere else.
Now, in a small business such as the one above, the loss of any one server is likely to cause moderate to serious inconvenience, but in each case other functions can probably still be carried out while the hardware is being repaired. If people can't email, they may be able to catch up on documentation or file-related work. If people can't access the database, they may be able to process things manually while still emailing, and so on.
If all five servers go down at once, that’s a significantly more challenging proposition.
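A little illustrative arithmetic shows the asymmetry. The failure probability below is an assumption, not a measurement, and repair times are ignored – but the shape of the result is the point:

```python
# Purely illustrative: assume each physical server independently has a
# 2% chance of a hardware failure in a given year.
p = 0.02

# Five separate servers: a *total* outage needs all five to fail in the
# same year (ignoring repair times - a generous upper bound on overlap).
print(p ** 5)  # 3.2e-09 - effectively never

# One consolidated host: any hardware failure is a total outage.
print(p)       # 0.02 - the full host failure risk, every year
```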
Anyone with exposure to virtualisation, high availability/redundancy or data protection should see what is needed here – a second server, shared storage, and the ability to move guest systems from one virtualisation server to the other. (In smaller companies this may instead be achieved with a standby server that can access the primary host's storage if necessary.)
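As a rough sketch of what that second host buys you – the figures are hypothetical (99.5% availability per host, independent failures, idealised instant failover):

```python
host = 0.995                 # assumed availability of a single host
pair = 1 - (1 - host) ** 2   # both hosts must be down to lose service

HOURS_PER_YEAR = 24 * 365
print((1 - host) * HOURS_PER_YEAR)  # ~43.8 hours of outage per year
print((1 - pair) * HOURS_PER_YEAR)  # ~0.2 hours (about 13 minutes)
```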
However, it’s clear there’s more to running a virtualised environment than just whacking a big server in and virtualising the hosts that are already in the computer room.
Companies that are now just starting to adopt virtualisation may feel that it’s a mature enough industry that the time is ripe for jumping in – and they’re right. In fact, it’s been mature enough for long enough that virtualisation is practically old hat.
Regardless of the maturity of virtualisation though, you're still at the mercy of hardware failures (or other critical failures of the virtualisation host), and you still have to design your systems to provide the level of protection that you (a) can afford and (b) actually need. When doing cost comparisons, it's not appropriate to compare, say, the cost of replacing 5 servers with another 5 servers against the cost of replacing 5 servers with 1 beefier server – virtualised services should never be about putting all the eggs in just one basket.
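As a sketch of what a fairer comparison might look like – every figure below is an assumption, chosen only to show the shape of the exercise – price each design on hardware plus expected downtime, not hardware alone:

```python
# All figures are hypothetical. A full outage is costed at more per hour
# than one service alone, reflecting the "all five at once" point above.
FULL_OUTAGE_PER_HOUR = 5_000   # everything down at once
ONE_SERVICE_PER_HOUR = 800     # one of five services down
REPAIR_HOURS_PER_BOX = 9       # assumed annual repair downtime per box

designs = {
    "five separate servers":      (25_000, 5 * REPAIR_HOURS_PER_BOX * ONE_SERVICE_PER_HOUR),
    "one consolidated host":      (15_000, REPAIR_HOURS_PER_BOX * FULL_OUTAGE_PER_HOUR),
    "two hosts + shared storage": (35_000, 0),  # failover absorbs single-box repairs
}

for name, (hardware, downtime_cost) in designs.items():
    print(f"{name}: {hardware + downtime_cost:,} first-year total")
```

On those (invented) numbers, the consolidated host's hardware saving is wiped out by its downtime exposure, while the redundant pair comes out ahead – which is exactly why the 5-vs-1 hardware comparison is the wrong one to make.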
Without that consideration, it's all too easy to watch the MTBF of your computing environment fall through the floor – and to blame virtualisation technology instead of the real culprit: the practical implementation.