May 18, 2009

Recently Australia’s largest grocery chain followed some of the other chains and started offering unit pricing on their products. For example, packaged food includes not only its actual RRP but also the price per 100g. That way, you can look at, say, two blocks of cheese and work out which one is technically the better price, even if one is larger than the other.

This has reminded me of how miserly some companies can be with backup. While it’s something I cover in my book, it’s also something that’s worth explaining in a bit of detail.

To set this up, I want to use LTO-4 media as the backup destination, and look at one of those areas of systems that are frequently skipped from backups by miserly companies looking to save a buck here and there. That, of course, is the operating system. It’s all too common to see backup configurations that back up data areas on servers but leave the operating system unprotected because “that can be rebuilt”. That sort of argument is a penny-wise/pound-foolish approach that fails to take into account the purpose of backup – recovery.

Sure, useless backups are a waste of money. If you back up an Oracle database using the NetWorker module, but also let filesystem backups pick up the datafiles from the running database, then you’re not only backing up the database twice – the filesystem copy is also useless, because it can’t be recovered from.

However, are operating system backups a waste of money or time? My argument is that except in circumstances where they are architecturally illogical or unnecessary, they’re neither a waste of money nor a waste of time. Let’s look at why…

At the time of writing, a casual search for “LTO-4 site:.au best price” in Google yields, within the first 10 results, LTO-4 media as low as $80 RRP. RRP often has little correlation with bulk purchase pricing, but miserly companies don’t make bulk media purchases, so we’ll work off that figure.

Now, LTO-4 media has a native capacity of 800 GB. Rather than go fuzzy with any numbers, we’ll assume native capacity only for this example. So, at $80 for 800 GB, we’re talking about $0.10 per GB – 10c per GB.

So, our $80/800GB cartridge has a “unit cost” of 10c/GB, which sounds pretty cheap. However, that’s probably not entirely accurate. Let’s say that we’ve got a busy site, and in order to facilitate backups of operating systems as well as all the other data, we need another LTO-4 tape drive. Again, looking around at list prices for standalone drives (“LTO-4 drive best price site:.au”) I see prices starting around the $4,500 to $5,000 mark. We should expect the average drive (with warranty) to last for at least 3 years, so that’s $5,000 for 1,095 days, or $4.57 per day of usage. Let’s round that to $5 per day to account for electricity usage.

So, we’re talking about 10c per GB plus $5 per day. Let’s even round that up to $6 per day to account for staff time in dealing with any additional load caused by operational management of operating system backups.
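As a quick sanity check, those unit-cost components can be tallied in a few lines. This is only a sketch of the figures above – the prices are the assumptions stated in this article, not current market rates:

```python
# Unit costs for OS backups, using the article's assumed prices.
MEDIA_PRICE = 80.0          # AUD per LTO-4 cartridge (RRP)
MEDIA_CAPACITY_GB = 800     # native capacity; no compression assumed
DRIVE_PRICE = 5000.0        # AUD for a standalone LTO-4 drive
DRIVE_LIFE_DAYS = 3 * 365   # assume 3 years of service

cost_per_gb = MEDIA_PRICE / MEDIA_CAPACITY_GB    # $0.10/GB
drive_per_day = DRIVE_PRICE / DRIVE_LIFE_DAYS    # ~$4.57/day raw
daily_charge = 6.0   # rounded up to cover power plus staff overhead

print(f"${cost_per_gb:.2f}/GB media, ${drive_per_day:.2f}/day drive, "
      f"${daily_charge:.2f}/day all-in")
```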

I’ll go on the basis that the average operating system install is about 1.5GB, which means we’re talking about 15c to back that up as a base rate, plus our daily charge ($6). If you had, say, 100 servers, that’s 150GB for full backups, or $15.00 for the fulls plus another $6 on that day. Operating system incremental backups tend to be quite small – let’s say a delta of 20% to be really generous. Over the course of a week, then, we have:

  • Full: 150GB at $15 + $6 for daily usage.
  • Incremental: 6 x (30GB at $3 + $6 for daily usage).

In total, I make that out to be $75 a week, or $3,900 a year, for operating system backups to be folded into your current data backups. Does that seem like a lot of money? Think of this: if you’re not backing up operating system data, this usually means that you’re working on the basis that “if it breaks, we’ll rebuild the server”.
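Tallying those line items for the 100-server example gives the weekly and annual figures directly. Again, a sketch only – the sizes, delta and rates are the assumptions above:

```python
# Weekly and annual cost of folding OS backups into existing backups,
# using the article's assumptions: 100 servers, 1.5GB OS installs,
# a 20% incremental delta, $0.10/GB media, and a $6/day charge.
SERVERS = 100
OS_SIZE_GB = 1.5
DELTA = 0.20
COST_PER_GB = 0.10
DAILY_CHARGE = 6.0

full_gb = SERVERS * OS_SIZE_GB    # 150 GB per weekly full
incr_gb = full_gb * DELTA         # 30 GB per daily incremental

weekly = (full_gb * COST_PER_GB + DAILY_CHARGE) \
       + 6 * (incr_gb * COST_PER_GB + DAILY_CHARGE)
annual = weekly * 52
print(f"${weekly:.2f}/week, ${annual:.2f}/year")  # → $75.00/week, $3900.00/year
```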

I’d suggest that in most instances your staff will spend at least 4 hours trying to fix the average problem before the business decision is made to rebuild a server. Even with, say, fast provisioning, we’re probably looking at 1 hour for a full server reinstall/reprovision, current revision patching, etc. So that equals 5 hours of labour. Assuming a fairly low pay rate for Australian system administrators, we’ll assume you’re paying your sysadmins $25 per hour. So a 5 hour attempted fix + rebuild will cost you $125 in labour. Or will it? Servers are servers because they typically provide access or services for more than one person. Let’s assume 50 staff are also unable to work effectively while this is going on, and their average salary is even as low as $20 per hour. That’s $5,000 for their labour; or, to be fairer, we’ll assume they’re only 50% affected, so that’s $2,500 in wasted labour.

How many server rebuilds does it take a year for operating system backups to suddenly be not only cost-effective but also a logically sound business decision? Even when we factor in say, an hour of effort for problem diagnosis plus recovery when actually backing up the operating system regions, there’s still a significant difference in price.
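To put a number on that break-even point, we can compare the per-outage cost of a rebuild against the per-outage cost of a restore, using the rates and head counts from the paragraphs above (all of them the article’s assumptions, not measured figures):

```python
# Break-even: how many rebuilds per year before OS backups pay for
# themselves? All figures are the article's stated assumptions.
SYSADMIN_RATE = 25.0   # AUD/hour
STAFF = 50             # users affected per outage
STAFF_RATE = 20.0      # AUD/hour
IMPACT = 0.5           # staff assumed only 50% affected

# Annual OS backup cost: weekly full + 6 incrementals, over 52 weeks.
ANNUAL_BACKUP_COST = 52 * ((15.0 + 6.0) + 6 * (3.0 + 6.0))  # $3,900

def outage_cost(hours):
    """Sysadmin labour plus lost staff productivity for an outage."""
    return hours * (SYSADMIN_RATE + STAFF * STAFF_RATE * IMPACT)

rebuild = outage_cost(5)   # 4h failed fix + 1h reprovision = $2,625
recover = outage_cost(1)   # 1h diagnosis + restore from backup = $525
breakeven = ANNUAL_BACKUP_COST / (rebuild - recover)
print(f"{breakeven:.1f} rebuilds/year")  # → 1.9 rebuilds/year
```

Under these assumptions, fewer than two rebuilds a year is enough for the operating system backups to pay for themselves.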

Now, I’m not saying that every company that chooses not to back up operating system data is being miserly, but I will confidently assert that most companies who choose not to back up their operating system data are. To be more accurate, I’d suggest that if the sole rationale for not doing such a backup is “to save money” (rather than “from an architectural standpoint it is unnecessary”) then it is likely that a company is wasting money, not saving it.

  4 Responses to “Backups are not about being miserly”

  1. Oddly enough, I had this topic raised the other day. Mind you, the client was also one of the ones that takes backups but never does a restore.
    The business they run requires a response from clients of less than 10 minutes (the client will lose business and money otherwise), and yet the DR plan is: get tapes from offsite -> restore -> back to BAU in 5 days. When I tried to raise the topic of “that’s a little bit mental, isn’t it?” they went silent. Care to comment?

  2. Your experience is, unfortunately, not without precedent. SLAs that have never needed to be enforced, when handled incorrectly, become laughable guidelines that are ignored.

    On the other hand, if the client of your client happened to suddenly require a test recovery, I suspect the proverbial dark matter would hit the fan at an alarming rate.

    It used to be, 10 years ago for instance, that “spending money on backup” was something we were still teaching recalcitrant businesses to do. That challenge has shifted. Now it’s no longer a case of convincing businesses to spend money on backup, but to convince them to spend money on recovery.

    For instance, around 2000-2003 I was aware of a company in Australia that kept a “cold DR site” to meet regulatory requirements. It was, however, established in name only, because the machines on site were ex-production machines, and none of them had the physical RAM or storage required to actually run the business. Yet it fulfilled the legal requirements.

    As I say in my book, it comes down to the distinction between having “backup software” installed and having a “backup system” installed. One doesn’t create the other, no matter how much miserly organisations would like it to.

    After all, I still remember spending 3 hours in a meeting only 4 years ago, called by a miserly customer to discuss “urgent” backup problems. The problem, it turned out, was that after having purchased 50 SDLT-320 tapes a year before, they were facing the need to purchase another ten, and wanted urgent recommendations on how to avoid that.

    Back to your original point – unrealistic SLAs result in ignored SLAs. The best way this can be rectified is to either have the SLAs tested and found lacking, or have the SLAs renegotiated and set to something more realistic. At that point, the business may treat them more seriously…

  3. Oddly enough I’ve only had this brought up a few times, and the one time the case was made with any vigour, I ended up agreeing with the customer.

    However, their case was an odd one. Their production environment was mostly Solaris 9, and all servers were built off a JumpStart server with configuration pushed via cfengine (the infrastructure servers were backed up in full for DR purposes). We directed the backups of their production systems, by group, to a specific set of directories where the user data, work files and status files lived.

    The main point they made, however, was not about the space, but the time spent backing the data up and the time it would take to restore it. They posited that their time would be better spent doing a DR on the key infrastructure systems, then building servers from those key systems and restoring their data from the tapes.

    Their point was that the less OS data on the tapes, the less seek time and the faster they could restore their systems. At the time I thought this was splitting hairs, but also not worth the fight. This was in the LTO-2 era, so it wasn’t totally unreasonable.

    I only mention this because it was such a memorable exception to the rule. They had everything so tightly managed there, it was very impressive. I’ve only had one client so organized that I thought they could safely bypass OS backups without a problem with DR.

    • Hi Michael,

      In such a scenario I’d agree that with a highly locked down configuration system and fully automated rebuild, they may very well have been better suited to not do operating system backups. I think this (when properly managed) would fall into what I say in the article that operating system backups can be skipped when “from an architectural standpoint it is unnecessary”.

      When it comes to OS or application area backups, I mention a “15 minute rule” in my book – if, after reinstalling the operating system, it takes more than 15 minutes to apply the various changes and updates required to integrate it into the environment, then backing up the non-data regions would be a Really Good Idea. If, however, any administrator in the team without specific training on that host could make the requisite changes (at, say, 2am) without assistance from anyone else, then yes, you could successfully argue that the environment has been sufficiently architected.

      It seems to me based on your example that you were lucky enough to encounter a site that was very well focused on achieving that level of setup.

      Cheers,

      Preston.
