This morning we went to the funeral of our best friends’ father. It was, as funerals go, a lovely service and after the funeral and the burial we headed off to the wake, only to have someone’s hilux slam into the driver’s side of our car on a tight bend. They’d skidded and come onto the wrong side of the road by just enough, given the tight corner, to make the impact. Thankfully speed, alcohol or drugs weren’t in play, just the wet, and even more importantly, no-one was injured. Dignity will be lost any time it’s driven without a passenger though – the driver’s door can’t be opened from the inside:

Alas, poor car, I hardly knew ye

The case is with the insurers, and we’re waiting for an assessment next Tuesday to find out whether the car will be repaired or written off. It would be a shame if it’s written off; it’s a Toyota Avalon, circa 2001, and while those cars were frumpy they were damn good cars. With only around 120,000km on the clock it’s not really all that old. About 3 or 4 years ago it was almost completely totalled in a massive hail storm on the central coast; as I recall the repair was in the order of about $12,000, and it only scraped through for repair on an insurance value of around $14,000. Now, with insurance of $7,500 and the repair estimate saying that it’ll top $5,000, age is against the car and it doesn’t look good.

But, this blog isn’t about my hassles, or my car.

It is however about insurance, and insurance is something I’ll be dealing with quite a bit over the coming days. Or I will be, once we hit next Tuesday and the car gets checked out by the assessors.

When we think of “backup as insurance”, there are some fairly close analogies:

  • Backup is insurance because it’s about having a solution when something goes wrong;
  • Making a claim is performing a recovery;
  • Your excess is how easy (or hard) it is to make a recovery.

Given what’s happened today, it made me wonder what the analogy to “written off” is. That’s a little bit more unpleasant to deal with, but it’s still something that has to be considered.

In this case I’d suggest that the analogy for the insured item being “written off” is one of the following:

  • Having clones – seems simple, but if one recovery fails due to media, having clones that you can recover from instead is the cheapest, most logical solution.
  • Having an alternate recovery strategy – so for items with really high availability requirements or minimal data loss requirements, this would refer to having some other replica system in place.
  • Having insurance that can get you through the worst of events – sometimes no matter what you do to protect yourself, you can have a disaster that exceeds all your preparation. So in the absolute worst case scenario, you need something that will help you pay your bills, or ameliorate your building debt while you get yourself back on-board.

Of course, it remains preferable to not have to rely on any of these options, but the case remains that it’s always important to have an idea what your “worst case scenario” recovery situation will be. If you haven’t prepared for one, I’ll suggest what it’s likely to be: going out of business. Yes, it’s that critical that you have an idea what you’ll do in a worst-case scenario. It’s not called “business continuity” for the heck of it – when that critical situation occurs, not having plans usually results in the worst kind of failure.

Me? I’ll be visiting a few car-yards on the weekend to scope up what options I have in the event the car gets written off on Tuesday.

 

As evidenced by the title of my book (Enterprise Systems Backup and Recovery: A corporate insurance policy), I’m a firm believer that the only way to conceptualise the purpose of backup is to describe it as insurance. The way I describe this is to compare the way in which we take out insurance, but hope not to use it, with the way we make backups, and similarly hope not to use them. This is most easily described through a couple of Venn diagrams.

First, let’s look at insurance:

[Figure: Backup and Insurance: Insurance Venn Diagram]

No-one wants to claim on their insurance. We take it out on a yearly basis, and any year that we don’t have to use it is good. (Particularly in countries where insurance companies run rough-shod over morality, decency and legal restraint.) I personally have home insurance, contents insurance, car insurance, travel insurance (whenever I travel) and health insurance. Any time I don’t have to make a claim on any of these types of insurance is good – because in order to make a claim, something bad needs to have happened. So I’m much happier paying the fees each year and hoping that I don’t have any more involvement than that with my insurance agencies. Do I resent paying these fees? Hell no – because I’m well aware that if I don’t, and something bad happens, I’ll be up the creek without a paddle. (Or to use the Australian vernacular, I’d be up s––t creek.)

So let’s see the Venn diagram for backup:

[Figure: Venn Diagram for Backup]

As you can see, it’s spookily similar to the diagram for insurance. Now, one of the first things that I tend to hear when I roll out my “backup = insurance” argument is that occasionally, people will want to recover from backups – e.g., to migrate between systems, refresh Q/A systems from production, etc. Well, this isn’t really using backup for its primary purpose, recovery, but instead using it as a data migration/retrieval system. It’s a fine distinction, but it’s an important distinction. The primary reason backup systems are deployed is to recover data when there’s been a failure – any secondary benefit from a backup and recovery system is just that – a secondary benefit.

Your next question may be – so what point is there in classifying backup as a type of insurance?

This is the absolute core of why companies need to think of backup as being a type of insurance – it’s all about the budget.

Look at an example company. Let’s say there are 5 departments:

  • IT
  • Finance and Human Resources
  • Sales
  • Warehousing and Operations
  • Solutions Delivery

In a standard company, each department will have its own budget, but there’s also the corporate budget. That’s the budget that covers costs which affect all departments and have to be met regardless of the size or capacity of each department – it’s for the core business costs. One of those “core” costs is usually the various insurance policies that companies take out. This will definitely include some sort of standard business insurance, but will then cover other types of insurance – professional indemnity, building insurance, contents insurance, car insurance, etc. Few businesses would argue that each department needs to individually seek out and/or pay for its own insurance on each of those matters.

The mistake then made by many businesses is to fail to think of backup as insurance, and therefore work on the basis that IT will manage data and systems backup out of its own budget. This sort of thinking leads to the most common disasters where:

  • Backup systems budget is cut to meet the budget requirements of “production” systems. (See my points here about why it’s a fallacy to think of backup systems as anything other than production systems.)
  • “Make do” data protection systems are deployed that require significant time to complete recovery – e.g., to “save” money, some IT departments will decide to only backup actual data, and leave operating systems and applications at the mercy of being re-installed from the ground up.
  • Backup retention is cut to reduce operational expenditure (i.e., limit the purchase of new media).
  • SLAs, if established, are silently ignored – or even railed against by IT.

None of these processes or decisions are conducive to sensible or useful business systems management – yet they’re the inevitable consequence of asking one department to meet costs that are shared between all departments. It would be like demanding that the sales department pay for all company insurance out of their budget: it just doesn’t make sense.

Where does this discussion leave us? There’s a lesson any business can take out of this: backup, being insurance, is something that’s funded by the corporate operational and capital budget, not the budgets of any individual department.

Chances are if your business isn’t thinking of backup as insurance, it’s not handling or funding backup properly either.

 

I thought it about time that I cited the two key reasons why, if faced with a choice between NetWorker and NetBackup, I would choose NetWorker every time.

As you might expect, given my focus on backup as insurance, both of these reasons are firmly focused on recovery. In fact, so much so that I still don’t really understand why EMC doesn’t go to market with these points time and time and time again and just smack Symantec around until it’s blue in the face and begging for mercy.

Reason 1: NetBackup does not implement backup dependencies

I struggle to call NetBackup an “enterprise” backup product because of this simple fact. Honestly, backup dependencies are critically important when it comes to guaranteeing anything but last-backup recoverability.

What does this mean?

In short, as soon as a backup hits its retention period in NetBackup, it’s toast – it’s a goner.

Irrespective of whether there are any backups of the same filesystem/data set that still require the “outside retention” backup for recovery purposes.

I can’t sum this up any other way: in a backup product, I see this as recklessly irresponsible. It provides a focus on media savings that even the most miserly bean cruncher would admire. Well, until the bean cruncher’s system can’t be recovered from 6 weeks ago to fulfil audit requirements.
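To make the difference concrete, here is a minimal Python sketch of the two expiry behaviours described above. It is an illustration only, not how either product is implemented: the `Backup` record, the `depends_on` field and both function names are hypothetical, and the dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Backup:
    name: str
    taken: date
    retention_days: int
    # Hypothetical field: the backup this one needs in order to be recoverable.
    depends_on: Optional[str] = None

    def expires(self) -> date:
        return self.taken + timedelta(days=self.retention_days)


def naive_expired(backups, today):
    """Expire each backup purely on its own retention period."""
    return {b.name for b in backups if b.expires() <= today}


def dependency_aware_expired(backups, today):
    """Expire a backup only when nothing still in retention depends on it."""
    by_name = {b.name: b for b in backups}
    needed = set()
    for b in backups:
        if b.expires() > today:
            # Walk up the chain: everything this backup relies on is still needed.
            cur = b
            while cur is not None and cur.name not in needed:
                needed.add(cur.name)
                cur = by_name.get(cur.depends_on) if cur.depends_on else None
    return {b.name for b in backups if b.name not in needed}


# A full backup followed by two incrementals, all with 6 weeks of retention.
chain = [
    Backup("full", date(2009, 1, 1), 42),
    Backup("incr1", date(2009, 1, 2), 42, depends_on="full"),
    Backup("incr2", date(2009, 1, 3), 42, depends_on="incr1"),
]
today = date(2009, 2, 12)  # the day the full reaches its own retention period

print(naive_expired(chain, today))             # the full is dropped on its own
print(dependency_aware_expired(chain, today))  # nothing is dropped yet
```

In the naive case the two incrementals are still "within retention" but can no longer be recovered to a complete filesystem, because the full they build on is gone. With dependency tracking, the full lingers for two extra days and the whole chain ages out together.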

Reason 2: True Image Recovery is “optional”

If you’ve grown up in a NetWorker world, where the emphasis has always been, and will always continue to be on recovery, this will, like the reason above, make you soil yourself. Imagine having a full backup plus six incremental backups of a directory, and wanting to recover the filesystem from last night. Now imagine just selecting the full plus the incrementals for recovery and getting back everything generated during that time.

Even the files that had been deleted between backups. I.e., you don’t get back what the filesystem looked like at the time of the backup that you’re recovering from, but what it looked like for every backup that you’re recovering from.

NetWorker, once, in the 5.5.x stream implemented this. It was called a BUG. In NetBackup, it’s a “feature”. In order to enable a correct recovery, you have to turn on “true image recovery”, something that takes extra resources, and the typical advice is to keep that data for only a short cycle (e.g., 7 days) rather than for the complete retention time of the backups.
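A small Python sketch shows why the two recoveries differ. The data layout here is an assumption made for illustration (each backup carries the `files` it captured plus a `listing` of every path present at backup time); it is not the index format of either product, and both function names are hypothetical.

```python
def naive_restore(backups):
    """Layer the full and each incremental; deletions are never applied."""
    restored = {}
    for backup in backups:
        restored.update(backup["files"])
    return restored


def true_image_restore(backups):
    """Layer the same backups, then keep only what existed at the last one."""
    restored = naive_restore(backups)
    present = backups[-1]["listing"]
    return {path: data for path, data in restored.items() if path in present}


# A full plus two incrementals; b.txt is deleted before the first incremental.
backups = [
    {"files": {"a.txt": "v1", "b.txt": "v1"}, "listing": {"a.txt", "b.txt"}},
    {"files": {"a.txt": "v2"}, "listing": {"a.txt"}},
    {"files": {"c.txt": "v1"}, "listing": {"a.txt", "c.txt"}},
]

print(sorted(naive_restore(backups)))       # ['a.txt', 'b.txt', 'c.txt']
print(sorted(true_image_restore(backups)))  # ['a.txt', 'c.txt']
```

The extra `listing` kept with every backup is the kind of additional resource a true image recovery needs: without a record of what was present at each backup, there is no way to know that b.txt should stay deleted.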

There’s another word for this: Joke.

On another front…

As recently as December I mentioned that I wished EMC would get their act together and implement inline cloning – one of the few things where I saw that NetBackup had a distinct competitive advantage over NetWorker.

Maybe it was the glow of the cider, but I had an epiphany in Copacabana on a hill watching (probably illegal) fireworks in Avoca and Terrigal on New Year’s Eve. Inline cloning is no longer a compelling factor in a backup product. Why? Media streaming speeds have reached a point where companies with serious amounts of data just should not be implementing direct-to-tape backup solutions any more. Inline cloning was developed at a time when you’d want to generate both sets of tapes as quickly as possible, but only companies with very small data sets will find themselves not backing up to some disk unit first (be it, say, ADV_FILE or VTL in NetWorker), and those companies won’t be constrained on backup/clone windows to a point where they’d need inline cloning anyway.

When not backing up direct-to-tape, there are several factors that mitigate the need to do inline cloning. In organisations with a very strong need for offsiting, there’s replication at a VTL or disk backup unit layer. In organisations that just need a second copy generated “as soon as possible”, doing disk/virtual tape to physical tape cloning following the backup should be fast enough to handle the cloning at appropriate performance levels.

In other words: there’s no need for EMC to implement inline cloning. As a technology, it’s a dead-end from a tape-only time. I feel somewhat silly this didn’t occur to me sooner.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

A is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
B is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
C is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
D is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
E is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
F is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
G is for Growth. Your backup environment should be scoped to handle at least 2 years' growth upon implementation. If it isn't, budgets haven't been established correctly.
H is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
I is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
J is for Jekyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jekyll might – not storm through them like Mr Hyde, flying apart.
K is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
L is for Logs. Treasure your logs. Don't throw them away too quickly, and make sure they're backed up too. With access to your logs, you can answer in 3 years' time why a backup from yesterday is proving problematic to recover from.
M is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
N is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
O is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
P is for Procedures; without them, you don't have a backup system at all.
Q is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
R is for Recovery, the most important facet of data protection.
S is for SLAs (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
T is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
U is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
V is for VTL. Whether you like the need for them or not, they're not going away any time soon.
W is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet your recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
X is for X-Ray. If you can't X-Ray your backup status, drill down and see what happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
Y is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
Z is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.
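The SLA and Query entries above lend themselves to a simple automated check. What follows is a rough Python sketch, assuming you can extract the finish time of each system's last successful backup from your backup product; the function name, the system names and the 24-hour RPOs are all hypothetical.

```python
from datetime import datetime, timedelta


def rpo_breaches(last_good_backup, rpo, now):
    """Return systems whose newest successful backup is older than their RPO."""
    breaches = []
    for system, finished in last_good_backup.items():
        # A system that has never been backed up successfully is always in breach.
        if finished is None or now - finished > rpo[system]:
            breaches.append(system)
    return sorted(breaches)


now = datetime(2009, 1, 15, 9, 0)
last_good_backup = {
    "mail": datetime(2009, 1, 15, 1, 0),         # 8 hours ago
    "finance-db": datetime(2009, 1, 13, 23, 0),  # 34 hours ago
    "new-web": None,                             # never backed up
}
rpo = {name: timedelta(hours=24) for name in last_good_backup}

print(rpo_breaches(last_good_backup, rpo, now))  # ['finance-db', 'new-web']
```

An empty list is the "positive response" a manager or system owner should be able to get at any time. Anything else is a recovery requirement that is currently not being met.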

And there we have it. Maybe neither short nor succinct, yet hopefully useful nonetheless.

 

Over at The Daily WTF, there’s a story at the moment about a company that went out of business due to a developer deleting the company database for which there were no backups. Lamentably, this is still a common story. Oh, in many cases backups may actually be taken, but it’s still the case that we see situations such as:

  • Backups are never taken off-site,

or

  • Backups are never even taken out of a tape drive (i.e., constantly overwritten),

or

  • Backups are never checked.

My book is titled Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. That’s how much backup, to me, represents insurance. It’s the level of insurance necessary for any business to survive a disaster.

Failing to treat backup as insurance is unfortunately still familiar. The ever obvious-stating Gartner is frequently quoted as saying that one in three companies hit by a disaster will be unprepared and lose critical data.

I’d like to hope that within my career we’ll see that percentage shrink considerably – one in three is an unacceptably high number. One in a hundred might be more acceptable, but realistically, one in twenty would be a good number to start aiming for.

How do we aim for such an improvement? It’s remarkably simple, and comes from a few basic rules:

  • Backup is insurance, it’s not an IT process.
  • Backup requires buy-in from all aspects of a company.
  • Backup budget is sourced from the entire company, not the IT budget.
  • Company policies should prohibit deployment of new systems without a backup/recovery policy.

A good backup system comprises no more than 50% IT infrastructure and operations. The rest stems from policies, procedures, planning and awareness. Paraphrasing what I state in the introduction to my book, having backup software does not mean you have a backup system.

© 2012 The NetWorker Blog