Healthy paranoia

Jun 01 2012


Are your backup administrators people who are naturally paranoid?

What about your Data Protection Advocate?

What about the members of your Information Protection Advisory Council?

There’s healthy paranoia, and then there’s crazy paranoia. (Or as is trendy to say these days, “cray cray”.)

Being a facet of Information Lifecycle Protection, backup is about having healthy paranoia. It’s about behaving both as a cynic and a realist:

  • The realist will understand that IT is not immune to failures, and
  • The cynic will expect that cascading or difficult failures will occur.

Driven by a healthy sense of paranoia, part of the challenge of being involved in backup is the ability to plan for bad situations. If you’re involved in backup, you should be used to asking “But what if…?”

As I say in my book, backup is a game of risk vs cost:

  1. What’s the risk of X happening?
  2. What’s the cost of protecting against it?
  3. What’s the cost of not protecting against it?
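The risk vs cost trade-off above can be sketched as a tiny decision helper. This is a minimal illustration, not a real risk model – the probability, loss, and protection figures are all hypothetical inputs you’d source from your own risk register:

```python
def worth_protecting(p_annual, cost_of_loss, cost_of_protection):
    """Crude risk-vs-cost test: protect when the annualised expected
    loss (probability x cost of the event) exceeds the annual cost
    of protecting against it."""
    return p_annual * cost_of_loss > cost_of_protection
```

For example, a 10% annual chance of a $100,000 loss justifies $5,000 of protection; a 1% chance of the same loss does not – which is exactly the business decision questions two and three feed into.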

Paranoia, in the backup game, is being able to quantify the types of risk and exposure the business has – item 1 in the above list. Ultimately, items 2 and 3 become business decisions, but item 1 is almost entirely the domain of the core backup participants.

As such, those involved in backup – the backup administrators, the DPA and the IPAC – need to be responsible for the development and maintenance of a risk register. This should be a compilation of potential data loss (and potentially data availability loss*) situations, along with:

  • Probabilities of the event occurring (potentially just as “High”, “Low”, etc.);
  • Current mitigation techniques;
  • Preferred or optimal mitigation techniques;
  • Whether the risk is a primary risk (i.e., one that can happen in and of itself), or a secondary risk (i.e., can only happen after another failure);
  • Recovery point objective (RPO) and recovery time objective (RTO).
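To make the register concrete, the fields above map naturally onto a simple record structure. This is a hypothetical sketch of one register entry – the field names and the example values are mine, not prescribed anywhere:

```python
from dataclasses import dataclass

@dataclass
class RiskEntry:
    """One row of a data-loss risk register."""
    description: str            # the potential data loss situation
    probability: str            # e.g. "High", "Low"
    current_mitigation: str
    preferred_mitigation: str
    primary: bool               # True = can happen in and of itself;
                                # False = only after another failure
    rpo_hours: float            # recovery point objective
    rto_hours: float            # recovery time objective
```

Whether you keep this in a spreadsheet, a wiki or actual code matters far less than keeping it maintained and reviewed.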

This register is then fed back, first to the broader IT department to answer the second question in the risk vs cost list (“What’s the cost of protecting against it?”), and then to the business as a whole to answer the third (“What’s the cost of not protecting against it?”).

Finally, it’s important to differentiate between healthy paranoia and paranoia:

  • Healthy paranoia comes from acknowledging risks, prioritising their potential, and coming up with mitigation plans before deciding a response;
  • Paranoia (or unhealthy paranoia) happens when risks are identified, but mitigation is attempted before the risk is formally evaluated.

A backup administrator, given carte blanche over the company budget, could spend all of it for 5 years and still not protect against every potential failure the company could ever conceivably have. That’s unhealthy paranoia. Healthy paranoia is correctly identifying and prioritising risk so as to provide maximum appropriate protection for the business within reasonable budgetary bounds.

* Arguably, data availability loss is a broader topic that should also have significant involvement by other technical teams and business groups.

May 18 2009

Recently Australia’s largest grocery chain followed some of the other chains and started offering unit pricing on their products. For example, packaged food includes not only its actual RRP but also the price per 100g. That way, you can look at, say, two blocks of cheese and work out which is technically the better value, even if one is larger than the other.

This has reminded me of how miserly some companies can be with backup. While it’s something I cover in my book, it’s also something that’s worth explaining in a bit of detail.

To set this up, I want to use LTO-4 media as the backup destination, and look at one of those areas frequently skipped from backups by miserly companies looking to save a buck here and there: the operating system. It’s all too common to see backup configurations that back up the data areas on servers but leave the operating system unprotected because “that can be rebuilt”. That argument is often a penny-wise/pound-foolish approach that fails to take into account the purpose of backup – recovery.

Sure, useless backups are a waste of money. For instance, if you back up an Oracle database using the NetWorker module, but also let filesystem backups pick up the datafiles from the running database, you’re not only backing up the database twice – the filesystem copy is useless, because it can’t be recovered from.

However, are operating system backups a waste of money or time? My argument is that, except in circumstances where they are architecturally illogical or unnecessary, they’re neither. Let’s look at why…

At the time of writing, a casual search for “LTO-4 best price” in Google yields, within the first 10 results, LTO-4 media as low as $80 RRP. RRP often bears little relation to bulk pricing, but miserly companies don’t make bulk media purchases, so we’ll work from that figure.

Now, LTO-4 media has a native capacity of 800 GB. Rather than go fuzzy with any numbers, we’ll assume native capacity only for this example. So, at $80 for 800 GB, we’re talking about $0.10 per GB – 10c per GB.

So, our $80/800GB cartridge has a “unit cost” of 10c/GB, which sounds pretty cheap. However, that’s probably not entirely accurate. Let’s say that we’ve got a busy site and in order to facilitate backups of operating systems as well as all the other data, we need another LTO-4 tape drive. Again, looking around at list prices for standalone drives (“LTO-4 drive best price”) I see prices starting around the $4,500 to $5,000 mark. We should expect the average drive (with warranty) to last at least 3 years, so that’s $5,000 over 1,095 days, or $4.57 per day of usage. Let’s round that to $5 per day to account for electricity usage.
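Redoing that unit-pricing arithmetic, under the same assumptions (RRP media pricing, native capacity only, a three-year drive life):

```python
media_cost = 80.0          # assumed RRP per LTO-4 cartridge
native_gb = 800            # LTO-4 native capacity, no compression
cost_per_gb = media_cost / native_gb       # $0.10/GB, i.e. 10c per GB

drive_cost = 5000.0        # assumed standalone drive list price
service_days = 3 * 365     # three-year usable life = 1,095 days
drive_per_day = drive_cost / service_days  # ~$4.57 per day
```

The point of spelling it out is only that both figures are small – the per-GB media cost and the amortised drive cost are rounding errors next to the labour costs discussed below.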

So, we’re talking about 10c per GB plus $5 per day. Let’s even round that up to $6 per day to account for staff time in dealing with any additional load caused by operational management of operating system backups.

I’ll go on the basis that the average operating system install is about 1.5GB, which means we’re talking about 15c to back that up as a base rate, plus our daily charge ($6). If you had, say, 100 servers, that’s 150GB for full backups – $15.00 for the fulls, plus another $6 on that day. Operating system incremental backups tend to be quite small – let’s say a delta of 20% to be really generous. Over the course of a week, then, we have:

  • Full: 150GB at $15 + $6 for daily usage.
  • Incremental: 6 x (30GB at $3 + $6 for daily usage).

In total, I make that out to be $75 a week ($21 on the full day plus $54 across the six incremental days), or $3,900 a year, for operating system backups to be folded into your current data backups. Does that seem a lot of money? Think of this: if you’re not backing up operating system data, you’re usually working on the basis that “if it breaks, we’ll rebuild the server”.
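Redoing the weekly arithmetic under the assumptions above (100 servers, 1.5GB each, 10c/GB media, $6/day overhead, one full and six 20%-delta incrementals per week):

```python
cost_per_gb = 0.10              # LTO-4 media at native capacity
daily_overhead = 6.0            # drive amortisation + power + staff time
full_gb = 100 * 1.5             # 100 servers at ~1.5 GB each = 150 GB
incr_gb = full_gb * 0.20        # generous 20% daily delta = 30 GB

weekly = (full_gb * cost_per_gb + daily_overhead) \
       + 6 * (incr_gb * cost_per_gb + daily_overhead)
annual = weekly * 52            # $75/week -> $3,900/year
```

Note that the $6/day overhead dominates – the media itself contributes only $33 of the $75 per week.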

I’d suggest that in most instances your staff will spend at least 4 hours trying to fix the average problem before the business decision is made to rebuild a server. Even with, say, fast provisioning, we’re probably looking at 1 hour for a full server reinstall/reprovision, current-revision patching, etc. – 5 hours of labour in total. Assuming a fairly low pay rate for Australian system administrators of $25 per hour, a 5 hour attempted fix plus rebuild will cost you $125 in labour. Or will it? Servers are servers because they typically provide access or services for more than one person. Let’s assume 50 staff are also unable to work effectively while this is going on, and that their average salary is as low as $20 per hour. That’s $5,000 for their labour – or, to be fairer, assuming they’re only 50% affected, $2,500 in wasted labour.

How many server rebuilds does it take a year for operating system backups to suddenly be not only cost-effective but also a logically sound business decision? Even when we factor in say, an hour of effort for problem diagnosis plus recovery when actually backing up the operating system regions, there’s still a significant difference in price.
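The break-even point falls straight out of the figures above. This sketch uses the same hypothetical rates ($25/hour admin, 50 staff at $20/hour and 50% affected) and ignores the smaller diagnosis-plus-recovery cost incurred even when backups exist:

```python
import math

annual_backup_cost = 3900.0         # OS backups folded into existing backups
admin_rebuild = 5 * 25.0            # 4h fault-finding + 1h rebuild at $25/h
staff_impact = 50 * 20.0 * 5 * 0.5  # 50 staff, $20/h, 5h, 50% affected
cost_per_rebuild = admin_rebuild + staff_impact   # $2,625 per incident

breakeven = annual_backup_cost / cost_per_rebuild  # ~1.5 rebuilds/year
```

In other words, on these assumptions a mere two unplanned rebuilds a year already cost more than a full year of operating system backups.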

Now, I’m not saying that any company that chooses not to back up operating system data is being miserly, but I will confidently assert that most companies who choose not to back up their operating system data are. To be more accurate, I’d suggest that if the sole rationale for skipping such backups is “to save money” (rather than “from an architectural standpoint it is unnecessary”), then it is likely the company is wasting money, not saving it.