This morning, Christopher Biggs, aka @unixbigot, tweeted a truth in IT that has always been a strong personal bugbear for me. He said:

“The reward for successful disaster preparation is always idiots decrying the wasted effort & resources.”

This is so, so very true.

While I’m sure it had been experienced by countless administrators beforehand, my first real experience of this was the year 2000 issue – Y2K. Post-Y2K there were hundreds of opinionated trashbag journalists and management consultants happy to jump up and slam the amount of money invested in addressing the issue. It’s a sad fact of life that there’s always going to be people who want to write negatively. (Those same trashbags would have equally written about the disgusting unprofessional nature of IT people had the skies really fallen in post-Y2K, after all.)

The work of system administrators is largely invisible, but the work of backup administrators is even more so. No-one cares about backup until something goes wrong, so an exceedingly common reaction in IT is for people to jump up and down and decry the amount of money or time spent on such activities as:

  • disaster recovery testing;
  • disaster recovery planning (trust me, they’re often done in this order…);
  • backup duplication;
  • high availability.

And why? Because in each case the end goal or the hope is that they’re not actually required.

It’s a tired, stupid meme that we, as a data protection industry, have to put to rest. It has to become accepted fact that all these activities are required for healthy business function, and you should be grateful that you don’t need to act on those plans and backups, rather than getting upset about the time and money taken.

Will we convince everyone? No. Then again, there’s still flat earthers out there. There’ll always be that small percentage who stubbornly cling to rampant stupidity as a shield against the real world.

Preparedness is not wastefulness.

Make it your mantra.

 

Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I, however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has done any surveys would claim (rightly so) that my leading preface to the survey may have coloured the results, and I’d not disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to back up to disk on the backup server, but not using other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with a question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past; and other than offering better rack density I see them having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to back up to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than, say, a virtual machine accessing a physical tape drive does.

So, the survey showed me fairly much what I was expecting I’d see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

- if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
- the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
- I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of all the things I hadn’t considered, the simple ability to move your backup server between virtual servers was the one that stood out. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to a critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations about running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold:

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. Secondary to this: DR should not be dependent on the original and the destination host having the same names.

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.

 

As a consultant, you get attuned to (or, as some would have it, “cynical” about) certain key phrases and statements when you’re in meetings. Sometimes these statements are innocent and mean exactly what the person says, but usually they set the alarm bells ringing.

As a bit of winding down after a hectic 7 days, I thought I’d share the top 15 statements that cause me to start immediately trying to get deep qualification of what I’ve just been told…

What they say... | What I worry it means...
"Our backup results get filed automatically and someone reviews them." | "We have a server that hasn't successfully backed up for 6 months, but no-one's been checking the notifications."
"All our backups fit on a single tape" | "We upgrade our hardware every time this isn't the case."
"We're very selective about what we back up." | "We have critical production systems we forgot to add to our schedule."
"We don't want to get backup notifications." | "Backup? Meh."
"Our DBAs do their own backups." | "The DBAs don't believe in enterprise backup software and think dumps are better" ... OR ... "The backup administrators have lost control of the system and it's spiralling out of control."
"We don't have SLAs" | "No one wants ownership of establishing SLAs"
"We don't need SLAs" | "We trust in luck, and hope we don't ever need SLAs"
"Our users are responsible for backing up their laptops" | "Every day we're losing critical data that may be legally or fiscally required by the company."
"We don't have to do monthly backups." | "Even though we know we SHOULD do monthly backups, until someone puts it in writing, we're not going to."
"We've been asked to shrink our backup budget..." | "The business has this crazy idea that backup is an IT function and problem."
"Tape is dead" | "Someone with a vested interest in selling lots of HDD storage has visited lately."
"We do per-incident support." | "We have an Icarus support contract."
"It's too busy here to do capacity planning." | "We're wasting money as fast as we can get the budget for it."
"We don't need to {clone or otherwise duplicate} our backups." | "We're going to suffer a critical data loss situation."
"We only back up production data." | "A lot of people's work within the company is unprotected."

 

Ten years ago, you couldn’t do anything in the backup space without having an answer to the question, “How do you achieve BMR?” (bare metal recovery). Nowadays, it’s not a dirty word in backup, but it certainly seems to be somewhat passé.

So what happened? Is BMR now dead? Is it on life support? Did it ascend?

It’s an interesting question. I think that as an independent technology, BMR has become ever more niche, and what we’ve seen is a gradual shift in technology so as to allow BMR to become a silent feature. As such, it doesn’t necessarily get a lot of attention – it just blends into the background.

For the most part, I’d suggest that I found BMR to be more of a focus point in the Windows market, then later in the emerging Linux market, though still with a primary focus on Windows. This wasn’t to say that rapid systems recovery wasn’t important on other platforms, but on those platforms there were frequently technologies built into the OS. AIX could boot from a system image tape. Solaris could be Jumpstarted, etc. Eventually, Linux could be Kickstarted.

In the Legato space, BMR options were pretty challenging for the most part, so 10 years ago I’d regularly recommend customers wanting to BMR their Windows servers to deploy Ghost. It wasn’t perfect, but it did the trick – the goal in my mind was to get a system back to a state of easy recoverability; i.e., BMR was about allowing you to get a system back to the point where you could run a full recovery. Nothing more, nothing less. That was undoubtedly influenced by the lack of integrated BMR within NetWorker, but it worked, and it let each product focus on what it did best.

These days I think BMR is something that’s effectively available in most enterprise spaces without actually needing to reference it as an independent technology. So it comes into play primarily as a result of virtualisation and snapshots.

Within virtualisation, there’s two options that tend to resolve independent BMR requirements – templates, and image level backups, though for slightly different reasons.

Templates are designed to allow a rapid deployment of a new guest – be it just at the operating system level, or a combination operating system and application level; such templates will usually include a certain level of patching – enough to get a host at a secure enough point to connect to a corporate network. But they don’t have to be used just for the deployment of a new guest; instead, if a guest fails or becomes otherwise hopelessly corrupt, there’s nothing stopping the use of a template to rapidly bring the guest “back to life” to allow a regular recovery. If backups are being done at the guest level, then a smart template will also include the backup software so that it’s immediately available on system (re)creation.

On the other hand, image level backups fulfil the old “cold backup” niche. When virtualisation started hitting its stride, image level backups were seen as the future, but then reality struck and it became painfully obvious that recovering a 100GB virtual machine to pull out a 10KB document was wasteful and time consuming. Since then file level recovery from image level backup has improved, but it’s still not an omnipresent technology. That being said, image level backup works perfectly as a rapid BMR mechanism. Even assuming a situation where an image level backup is only taken once a month, recovering a machine from an image backup done 30 days ago puts you in a situation to allow regular host-based recoveries to run with minimum effort.

We frequently look at snapshots as enabling more useful RPOs and RTOs than traditional “once per day” backups. It’s common for instance to see NAS systems with hourly read-only snaps immediately available to end users for self-directed recoveries. They’re also used to facilitate traditional backups by doing quiesced backups with minimum downtime, or less disruptive backups.

However, certainly in the enterprise space, snapshots equally provide an excellent BMR solution. Snapshot, patch, revert to snapshot if patch fails, etc. Array level snapshots (IMHO) provide a significantly greater level of flexibility than a traditional BMR solution where the primary focus is getting a machine back to its most recent usable state. Snapshots are so useful on this front that they’re even used within virtualisation for exactly that reason – why go back to an image level backup, or waste time doing a cold backup of a virtual machine when you can just roll back to a snapshot taken 10 minutes ago?

What I’ve been observing now for a while is that BMR as an independent product gets very little attention these days in enterprises. In small to medium businesses it still gets bandied about – often for desktops as much as for servers, but it increasingly seems that virtualisation and snapshots have gobbled up most of the BMR space in the enterprise.

It seems that over time even that space may become narrowed. Looking at Mac OS X as an example, the ability to do a new system install referencing a Time Machine backup is a perfect example of an operating system integrated approach to BMR. Does it solve all BMR issues, even on the OS X platform? No, but it addresses the 80% rule, I believe. Will it be the only such product? I can’t believe so – I have to believe we’ll eventually see something comparable in other operating systems.

What are your thoughts?

 

Once upon a time, if you said to someone “do you have a test environment?” there was at least a 70 to 80% chance that the answer would be one of the following:

  • Only some very old systems that we decommissioned from production years ago
  • No, management say it’s too expensive

I’d like to suggest that these days, with virtualisation so easy, there are few reasons why the average site can’t have a reasonably well configured backup and recovery test environment. This would allow the following sorts of tests to be readily conducted:

  • Disaster recovery of hosts and databases
  • Disaster recovery of the backup server
  • Testing new versions of operating systems, databases and applications with the backup software
  • Testing new versions of the backup software

Focusing on the Intel/x86/x86_64 world, we see where this is immediately achievable. Remember, for the average set of tests that you run, speed is not necessarily going to be the issue. Let’s focus on non-speed functionality testing, and think of what would be required to have a test environment that would suit many businesses, regardless of size:

  1. Virtualisation server – obviously VMware ESXi springs to mind here, if cost is a driving factor.
  2. Cheap storage – if performance is not an issue for testing (i.e., you’re after functionality not speed testing), there’s no reason why you can’t use cheap storage. A few 2TB SATA drives in a RAID-5 configuration will give you oodles of space if you need any level of redundancy, or just in a RAID-0 stripe will give you capacity and performance. Optionally present storage via iSCSI if it’s available.
  3. Tiny footprint – previously test environments were disqualified in a lot of organisations, particularly those at locations where space was at a premium. Allocating room for say, 15 machines to simulate part of the production network took up tangible space – particularly when it was common for test environments to not be built using rackable equipment.
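
To put rough numbers on the cheap storage option in the list above, here is a minimal sketch of the usable capacity for the two layouts mentioned. The drive count of four is an assumption (the text only says "a few"), and the function name is hypothetical; it also ignores filesystem overhead and hot spares.

```python
def usable_capacity_tb(level: str, drives: int, drive_tb: float) -> float:
    """Return usable capacity in TB for a simple RAID-0 or RAID-5 set."""
    if level == "raid0":
        # Striping only: all capacity is usable, but there is no redundancy.
        if drives < 2:
            raise ValueError("RAID-0 needs at least 2 drives")
        return drives * drive_tb
    if level == "raid5":
        # One drive's worth of capacity is consumed by parity.
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # Assumed example: four 2TB SATA drives.
    for level in ("raid0", "raid5"):
        print(level, usable_capacity_tb(level, 4, 2.0), "TB usable")
```

With those assumed four drives, the RAID-5 set gives up one drive's capacity in exchange for surviving a single drive failure, which is usually a fair trade for a test environment you don't want to rebuild every time a disk dies.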

In the 2000s, there was much excitement over the notion of supercomputers at your desk – for example, remember when Orion released a 96-CPU capable system? The notion of that much CPU horsepower under your desk for single tasks may be appealing to some, but let’s look at more practical applications flowing from multi-core/multi-CPU systems – a mini datacentre under your desk. Or in that spare cubicle. Or just in a 3U rack enclosure somewhere within your datacentre itself.

Gone are the days when backup and recovery test environments were cost prohibitive. You’re from a small organisation? Maybe 10-20 production servers at most? Well that simply means your requirements will be smaller and you can probably get away with just VMware Workstation, VMware Fusion, Parallels or VirtualBox running on a suitably powerful desktop machine.

For companies already running virtualised environments, it’s more than likely the case that you can even use a production virtualisation server due for replacement as a host to the test environment, so long as it can still virtualise a subset of the production systems you’d need to test with. During budgetary planning this can make the process even more painless.

This sort of test environment obviously doesn’t suit every single organisation or every single test requirement – however, no single solution ever does. If it does suit your organisation though, it can remove a lot of the traditional objections to dedicated test environments.

 

This morning we went to the funeral of our best friends’ father. It was, as funerals go, a lovely service and after the funeral and the burial we headed off to the wake, only to have someone’s Hilux slam into the driver’s side of our car on a tight bend. They’d skidded and come onto the wrong side of the road by just enough, given the tight corner, to make the impact. Thankfully speed, alcohol or drugs weren’t in play, just the wet, and even more importantly, no-one was injured. Dignity will be lost any time it’s driven without a passenger though – the driver’s door can’t be opened from the inside:

Alas, poor car, I hardly knew ye

The case is with the insurers, and we’re waiting for an assessment next Tuesday to find out whether the car will be repaired or written off. It would be a shame if it’s written off; it’s a Toyota Avalon, circa 2001, and while those cars were frumpy they were damn good cars. With only around 120,000km on the clock it’s not really all that old. About 3 or 4 years ago it was almost completely totalled in a massive hail storm on the central coast; as I recall the repair was in the order of about $12,000, and it only scraped through for repair on an insurance value of around $14,000. Now, with insurance of $7,500 and the repair estimate saying that it’ll top $5,000, age is against the car and it doesn’t look good.

But, this blog isn’t about my hassles, or my car.

It is however about insurance, and insurance is something I’ll be dealing with quite a bit over the coming days. Or I will be, once we hit next Tuesday and the car gets checked out by the assessors.

When we think of “backup as insurance”, there’s some fairly close analogies:

  • Backup is insurance because it’s about having a solution when something goes wrong;
  • Making a claim is performing a recovery;
  • Your excess is how easy (or hard) it is to make a recovery.

Given what’s happened today, it made me wonder what the analogy to “written off” is. That’s a little bit more unpleasant to deal with, but it’s still something that has to be considered.

In this case I’d suggest that the analogy for the insured item being “written off” is one of the following:

  • Having clones – seems simple, but if one recovery fails due to media, having clones that you can recover from instead is the cheapest, most logical solution.
  • Having an alternate recovery strategy – so for items with really high availability requirements or minimal data loss requirements, this would refer to having some other replica system in place.
  • Having insurance that can get you through the worst of events – sometimes no matter what you do to protect yourself, you can have a disaster that exceeds all your preparation. So in the absolute worst case scenario, you need something that will help you pay your bills, or ameliorate your building debt while you get yourself back on-board.

Of course, it remains preferable to not have to rely on any of these options, but the case remains that it’s always important to have an idea what your “worst case scenario” recovery situation will be. If you haven’t prepared for one, I’ll suggest what it’s likely to be: going out of business. Yes, it’s that critical that you have an idea what you’ll do in a worst-case scenario. It’s not called “business continuity” for the heck of it – when that critical situation occurs, not having plans usually results in the worst kind of failure.

Me? I’ll be visiting a few car-yards on the weekend to scope out what options I have in the event the car gets written off on Tuesday.

 

Over the weekend I wrote up a piece about how snapshots are not a valid replacement for enterprise backup. The timing of this was in response to NetApp recently abandoning development of their VTL systems, and the subsequent discussions this triggered, but it was something that I’d had sitting in the wings for a while.

It’s fair to say that discussions on snapshots and backups polarise a lot of people; I’ll fully admit that I side with the “snapshots can’t replace backups” side of the argument.

I want to go into this in a little more detail. First I’ll point out in fairness that there are people willing to argue the other side that don’t work for NetApp, in the same way that I don’t work for EMC. One of those is the other Preston – W. Curtis Preston, and you can read his articulate case here. I’m not going to spend this article going point for point against Curtis – it’s not the primary point of discussion I want to make in this entry.

Moving away from vendors and consultants, another and very interesting opinion, from the customer perspective, comes from Martin Glassborow’s Storagebod blog. Martin brings up some valid customer points – namely that snapshots and replication represent extreme hardware lock-in. Some would argue that any vendor’s backup product represents vendor lock-in as well, and this is partly right – though remember it’s not so difficult to keep a virtual machine around with the “last state” of the previous backup application available for recovery purposes. Keeping old and potentially obsolete NAS technology running to facilitate older recoveries after a vendor switch can be a little more challenging.

To get onto what I want to raise today, I need to revisit a previous topic as a means of further explaining my position. Let’s look for instance at my previous coverage of Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP). You can read the entire piece here, but the main point I want to focus on is my ILP ‘diagram’:

Components of ILP

One of the first points I want to make from that diagram is that I don’t exclude snapshots (and their subsequent replication) from an overall information lifecycle protection mechanism. Indeed, depending on the SLAs involved, they’re going to be practically mandatory. But, to use the analogy offered by the above diagram, they’re just pieces of the pie rather than the entire pie.

I’m going to extend my argument a little now, and go beyond just snapshots and replication, so I can elucidate the core reasons why I don’t like replicated snapshots as a permanent backup solution. Here’s a few other things I don’t like as a permanent backup solution:

  • VTLs replicated between a primary and disaster recovery site, with no tape out.
  • ADV_FILE (or other products’ disk backup solutions) cloned/duplicated between the primary and disaster recovery site, with no tape out.
  • Source based deduplication products with replication between two locations, with no tape out.

My fundamental objection in all of these solutions is the long term failure caused by keeping everything “online”. Maybe I’m a pessimist, but when I’m considering backup/recovery and disaster recovery solutions, I firmly believe that I’m being paid to consider all likely scenarios. I don’t personally believe in luck, and I won’t trust a backup/disaster recovery solution on luck either. The old Clint Eastwood quote comes to mind here:

You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?

When it comes to your data, no, no I don’t. I don’t feel lucky, I don’t encourage you to feel lucky. Instead I rely on solid, well protected systems with offline capabilities. Thus, I plan for at least some level of cascading failures.

It’s the offline component that’s most critical. Do I want all my backups for a year online, only online, even with replication? Even more importantly – do I want all your backups online, only online, even with replication? The answer remains a big fat no.

The simple problem with any solution that doesn’t provide for offline storage is that (in my opinion), it brings the risk of cascading failures into play too easily. It’s like putting all storage for your company on a single RAID-5 LUN and not having a hot spare. Sure you’re protected against that first failure, but it’s shortly after the first failure that Murphy will make an appearance in your computer room. (And I’ll qualify here: I don’t believe in luck, but I’ve observed over the years on many occasions that Murphy’s Law rules in computer rooms as well as in other places.) Or to put it another way: you may hope for the best, but you should plan for the worst. Let’s imagine a “worst case scenario”: a fire starts in your primary datacentre 10 minutes after upgrade work on the array that receives replicated snapshots at your disaster recovery site has run into problems with firmware, leaving that array inaccessible until vendor upgrades are complete. Or worse again, it leaves storage corrupted.

Or if that seems too extreme, consider a more basic failure: a contractor near to your primary datacentre digs through the cables linking your production and disaster recovery sites, and it’s going to take 3 days to repair. Suddenly you’ve got snapshots and no replication. Just how lucky does that leave you feeling? Personally, I feel slightly naked and vulnerable when I have a single backup that’s not cloned. If suddenly none of my backups were getting duplicated, and I had no easy access to my clones, I’d feel much, much worse. (And that full body shiver I do from time to time would get very pronounced.)

All this talk of a single instance failure frequently leads proponents of snapshots+replication-only solutions to suggest that a good design will see 3-way replication, so there’s always two backup instances. This doubles a lot of costs while merely moving the failure point just a jump to the left. On the other hand, offline backup where there’s the backup from today, the backup from yesterday, the backup from the day before … the backup from last week, the backup from last month, etc., all offline, all likely on different media – now that’s failure mitigation. Even if something happens and I can’t recover the most recent backup, in many recovery scenarios I can go back one day, two days, three days, etc. Oh yes, you can do that with snapshots too, but not if the array is a smoking pile of metal and plastic fused to the floor after a fire. In some senses, it’s similar to the old issue of trying to get away from cloning by backing up from the production site to media on the disaster recovery site. It just doesn’t provide adequate protection. If you’re thinking of using 3-way replication, why not instead have a solution that uses two entirely different types of data protection to mitigate against extreme levels of failure?
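
The "go back one day, two days, three days" idea can be sketched as a simple fallback search over backup generations. This is only an illustration under assumptions: the `BackupCopy` record, its `readable` flag and the sample dates are all hypothetical, and no real backup product's API is being modelled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class BackupCopy:
    taken: date      # day the backup was written
    media: str       # hypothetical label for where the copy lives
    readable: bool   # hypothetical flag: would a recovery from it succeed?


def newest_recoverable(copies: Iterable[BackupCopy]) -> Optional[BackupCopy]:
    """Walk back from the most recent copy until one can be recovered."""
    for copy in sorted(copies, key=lambda c: c.taken, reverse=True):
        if copy.readable:
            return copy
    return None  # nothing left to recover from


if __name__ == "__main__":
    # Online-only: both arrays are lost, so every replica is unreadable.
    online_only = [
        BackupCopy(date(2010, 3, 3), "primary array", False),
        BackupCopy(date(2010, 3, 3), "DR array", False),
    ]
    # Offline generations on separate media: today's copy is bad,
    # but yesterday's is still usable.
    offline = [
        BackupCopy(date(2010, 3, 3), "tape A", False),
        BackupCopy(date(2010, 3, 2), "tape B", True),
        BackupCopy(date(2010, 3, 1), "tape C", True),
    ]
    print(newest_recoverable(online_only))
    print(newest_recoverable(offline))
```

In this sketch the online-only case has nothing to fall back to once both arrays are gone, while the offline set simply steps back a day.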

It’s possible I’ll have more to say on this in the coming weeks, as I think it’s important, regardless of your personal viewpoint, to be aware of all of the arguments on both sides of the fence.

 

A borked LaCie 2TB BigDisk Extreme has reminded me of the role of backup and recovery within disaster recovery itself. By disaster recovery, I mean total “system” failure, whether that system is an entire server, an entire datacentre, or in my case, a large drive.

What is the difference between a regular failure and a disaster? I think it’s one of those things that comes down entirely to the perspective of the organisation or person who experiences it.

As for my current disaster, I’ve got a 2TB drive with just 34GB free. I’ve got up-to-date backups for this drive which I can restore from, and in the event of a catastrophe, I could actually regenerate the data, given that it’s all my media files. It’s also operational, so long as I don’t power it off again. (This time it took more than 30 minutes to become operational after a shutdown. It’s been getting worse and worse.)

So I’ve got a backup, I’ve got a way of regenerating the data if I have to, and my storage is still operational. Why is it a disaster? Here’s a few reasons, which I’ll then use to explain what makes for a disaster more generally, and why backup/recovery is only a small part of disaster recovery:

  1. I don’t have spares. Much as I’d love to have a 10 or 20TB array at home running on RAID-6 or something like that, I don’t have that luxury. For me, if a drive fails, I have to go out and buy a replacement drive. That’s budget – capital expenditure, if you will. What’s more, it’s usually unexpected capital expenditure.
  2. Not all my storage is high speed. Being a home user, a chunk of my storage is either USB-2 or FireWire 400/800. None of these formats offer blistering data transfer speeds. The 2TB drive is hooked up to FireWire 800, and I back up to FireWire 400, which means I’m bound to a maximum of around 30-35MB/s throughput for either running the backup or recovering from it.
  3. The failure constrains me. Until I get the drive replaced, I have to be particularly careful about any situation that would see the drive powered off.
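To put the second point into numbers, here is a rough back-of-the-envelope sketch. It assumes decimal units (1GB = 1000MB), that the roughly 1,966GB in use on the 2TB drive all has to be restored, and that the 30-35MB/s figure is sustained for the whole transfer. The function name is made up for illustration.

```shell
# Estimate restore time for a given data size and sustained throughput.
# Assumptions: decimal units (1GB = 1000MB) and a constant transfer rate.
restore_time() {
    size_gb=$1      # data to restore, in GB
    rate_mbs=$2     # sustained throughput, in MB/s
    secs=$(( size_gb * 1000 / rate_mbs ))
    echo "$(( secs / 3600 ))h $(( secs % 3600 / 60 ))m"
}

restore_time 1966 30    # 2TB drive with 34GB free, at 30MB/s
restore_time 1966 35    # the same data at 35MB/s
```

On those assumptions a full restore takes somewhere between about fifteen and a half and a little over eighteen hours, which is why time to repair matters so much here.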

So there’s three factors there that constitute a “disaster”:

  1. Tangible cost.
  2. Time to repair.
  3. Interruptive.

A regular failure will often have one or two of the above, but all three are needed to turn it into a disaster. This is why a disaster is highly specific to the location where it happens – it’s not any specific thing, but a combination of the situation, the impact locally and the required response that render a disaster from a failure.
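The rule above is simple enough to write down as a tiny sketch. The function name and the yes/no encoding are assumptions made purely for illustration, and the second example is hypothetical.

```shell
# Classify an event using the three factors above.
# Each argument is "yes" or "no", in this order:
#   tangible cost, significant time to repair, interruptive.
classify_event() {
    if [ "$1" = yes ] && [ "$2" = yes ] && [ "$3" = yes ]; then
        echo "disaster"
    else
        echo "failure"
    fi
}

classify_event yes yes yes   # the failing media drive described above
classify_event yes no no     # hypothetical: a failed disk with a spare already on hand
```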

There’s of course varying levels of disasters too, even at an individual level. Having a borked media drive is a disaster, but it’s not a “primary” disaster for me, because the core of what I do on my computer I can still get done. The same applies with corporations – it could be that losing either a primary fileserver or a manually controlled archive fileserver would constitute a “disaster”, but the first is always likely to be a far more serious disaster. That’s because it generates higher spikes in one or more of the factors – cost and interruption.

So, returning to the topic of the post – let’s consider why backup/recovery only forms a fraction of disaster recovery. When we consider a regular failure requiring recovery, it’s clear that the backup/recovery process forms not only the nexus of the activity, but likely the longest or most “costly” component (usually in terms of staff time).

In a disaster recovery situation, that’s no longer guaranteed to be the case. While the actual act of recovery is likely to take some time within a disaster recovery situation, there’s usually going to be a heap of other activities. There’ll be:

  • Personnel issues – getting human resources allocated to fixing the problem, and the impact of the failure on a number of people. Typically you don’t find (in a business world) that a disaster is something that only affects a single user within the organisation. It’s going to impact a significant number of workers – hence the tangible cost and the interruptive nature of it.
  • Fault resolution time – If you can seamlessly failover from an event, it’s unlikely it will be treated as a disaster. Sure, it may be a major issue, but a disaster is something that is going to take real time to fix. A disaster will see staff needing to work nigh-continuously in order to get the system operational. That will include:
    • Time taken to assess the situation,
    • Time taken to get replacement systems ready,
    • Time taken to recover,
    • Time taken to mop up/finalise access,
    • Time taken to repair original failure,
    • Time taken to revert services and
    • Time taken to report.
  • Post recovery exercises – in a good organisation, disaster recovery operations don’t just stop when the last byte of data has been recovered. As alluded to in the above bullet point, there needs to be a formal evaluation of the circumstances that led up to the disaster, the steps required to rectify it, any issues that might have occurred, and plans to avoid it (or mitigate it) in future. For some staff, this exercise may be the longest part of the disaster recovery process.
  • Post disaster upgrades – if, as a result of the disaster and the post recovery exercises it’s determined that new systems must be put into place (e.g., adding a new cluster, or changing the way business continuity is handled), then it can be fairly stated that all of the work involved in such upgrades is still attributed to the original disaster recovery situation.

All of these factors (and many more – it will vary, site by site) lead to the inevitable conclusion that it’s insufficient to consider that disaster recovery is just a logical extension of a regular backup and recovery process. It’s more costly in terms of either direct staff time or a variety of other factors, and it’s far more interruptive – both to individuals within the organisation, and the organisation as a whole.

As such, the response to a disaster recovery situation should not be driven directly by the IT department. IT of course will play a valuable and critical role in the recovery process, but the response must be driven by a team with oversight across all affected areas, and the post-recovery processes must equally be driven by a team whose purview extends beyond just the IT department.

We can’t possibly prepare for every disaster. To do so would require unlimited budget and unlimited resources. (It would also be reminiscent of the Brittas Empire.)

Instead, what we can plan for is that disasters will, inevitably, happen. By acknowledging that there is always a risk of a disaster, organisations can prepare for them by:

  • Determining “levels” of disaster – quantifying what tier of disaster a situation will be by say, percentage of affected employees, loss of ability to perform primary business functions, etc.
  • Determining role based involvement in disaster response teams for each of those levels of disaster.
  • Determining procedures for:
    • Communication throughout the disaster recovery process.
    • Activating disaster response teams.
    • Documenting the disaster.
    • Reporting on the disaster.
    • Post-disaster meetings.

Good preparation of the above will not prevent a disaster, but it’ll at least considerably reduce the risk of a disaster becoming a complete catastrophe.

Don’t just assume that disaster recovery is a standard backup and recovery process. It’s not – not by a long shot. Making this assumption puts the business very much at risk.

 

Do you have a clear picture of everything that you’re not backing up? For many sites, the answer is not as clear cut as they may think.

It’s easy to quantify the simple stuff – QA or test servers/environments that literally aren’t configured within the backup environment.

It’s also relatively easy to quantify the more esoteric things within a datacentre – PABXs, switch configurations, etc. (Though in a well run backup environment, there’s no reason why you can’t configure scripts that, as part of the backup process, log onto such devices and retrieve the configuration, etc.)
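As a minimal sketch of that kind of script, the function below pulls a device’s configuration over SSH into a file that the regular backup then picks up. It assumes the device accepts non-interactive, key-based SSH logins and prints its configuration in response to a single command. The host name, the command and the output path in the usage comment are all hypothetical, and the right command varies by vendor.

```shell
# Fetch a device's configuration over SSH and store it where the backup will find it.
# Assumptions: key-based SSH login works without prompting, and the device
# prints its configuration in response to one command.
fetch_device_config() {
    host=$1       # e.g. backup@switch01 (hypothetical)
    cmd=$2        # e.g. "show running-config" (varies by vendor)
    outfile=$3    # file inside a directory covered by the backup
    tmp="$outfile.tmp.$$"
    if ssh -o BatchMode=yes "$host" "$cmd" > "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$outfile"      # only replace the last good copy on success
    else
        rm -f "$tmp"
        echo "config fetch failed for $host" >&2
        return 1
    fi
}

# Usage (hypothetical names):
#   fetch_device_config backup@switch01 "show running-config" /var/dr-info/switch01.cfg
```

Writing to a temporary file first means a failed or empty fetch never overwrites the last good copy of the configuration.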

It should also be very, very easy to quantify what data on any individual system you’re not backing up – e.g., knowing that for fileservers you may be backing up everything except for files that have a “.mp3” extension.
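One hedged way to keep that kind of exclusion a known quantity is to count what it actually skips. The sketch below assumes the exclusion can be expressed as a simple filename pattern. The function name and the path in the usage comment are hypothetical.

```shell
# Report how many files under a directory match an exclusion pattern,
# so the "not backed up" set is measured rather than guessed.
count_excluded() {
    dir=$1        # top of the tree being backed up
    pattern=$2    # exclusion pattern, e.g. "*.mp3"
    find "$dir" -type f -name "$pattern" | wc -l | tr -d ' '
}

# Usage (hypothetical path):
#   count_excluded /export/home "*.mp3"
```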

What most sites find difficult to quantify are the quasi-backup situations – files and/or data that they are backing up, but which are useless in a recovery scenario. Now, many readers of that last sentence will probably think of one of the more immediate examples: live database files that are being “accidentally” picked up in the filesystem backup (even if they’re being backed up elsewhere, by a module). Yes, such a backup does fall into this category, but there are other types of backups which are even less likely to be considered.

I’m talking about information that you only need during a disaster recovery – or worse, a site disaster recovery. Let’s consider an average Unix (or Linux) system. (Windows is no different, I just want to give some command line details here.) If a physical server goes up in smoke, and a new one has to be built, there’s a couple of things that have to be considered pre-recovery:

  • What was the partition layout?
  • What disks were configured in what styles of RAID layout?

In an average backup environment, this sort of information isn’t preserved. Sure, if you’ve got say, HomeBase licenses (taking the EMC approach), or are using some other sort of bare metal recovery system, and that system supports your exact environment*, then you may find that such information is preserved and is available.

But what about the high percentage of cases where it’s not?

This is where the backup process needs to be configured/extended to support generation of system or disaster recovery information. It’s all very good for instance, for a Linux machine to say that you can just recover “/etc/fstab”, but what if you can’t remember the size of the partitions referenced by that file system table? Or, what if you aren’t there to remember what the sizes of the partitions were? (Memory is a wonderful yet entirely fallible and human-dependent process. Disaster recovery situations shouldn’t be bound by what we can or can’t remember about the systems, and so we have to gather all the information required to support disaster recovery.)

On a running system, there’s all sorts of tools available to gather this sort of information, but when the system isn’t running, we can’t run the tools, so we need to run them in advance, either as part of the backup process or as a scheduled, checked-upon function. (My preference is to incorporate it into the backup process.)

For instance, consider that Linux scenario – we can quickly assemble the details of all partition sizes on a system with one simple command – e.g.:

[root@nox ~]# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sda2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sda3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sda4           19458      121601   820471680    5  Extended
/dev/sda5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sda6           19702      121601   818511718+  fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         250     2008093+  82  Linux swap / Solaris
/dev/sdb2             251      121601   974751907+  83  Linux

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      121601   976760001   83  Linux

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1        2089    16779861   fd  Linux raid autodetect
/dev/sdd2            2090        2220     1052257+  82  Linux swap / Solaris
/dev/sdd3            2221       19457   138456202+  fd  Linux raid autodetect
/dev/sdd4           19458      121601   820471680    5  Extended
/dev/sdd5           19458       19701     1959898+  82  Linux swap / Solaris
/dev/sdd6           19702      121601   818511718+  fd  Linux raid autodetect

That wasn’t entirely hard. Scripting that to occur at the start of the backup process isn’t difficult either. For systems that have RAID, there’s another, equally simple command to extract RAID layouts as well – again, for Linux:

[root@nox ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda3[0] sdd3[1]
 138456128 blocks [2/2] [UU]

md2 : active raid1 sda6[0] sdd6[1]
 818511616 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdd1[1]
 16779776 blocks [2/2] [UU]

unused devices: <none>

I don’t want to consume reams of pages discussing what, for each operating system, you should be gathering. The average system administrator for any individual platform should, with a cup of coffee (or other preferred beverage) in hand, be able to sit down and in under 10 minutes jot down the sorts of information that would need to be gathered in advance of a disaster to assist in the total system rebuild of a machine they administer.

Once these information gathering steps have been determined, they can be inserted into the backup process as a pre-backup command. (In NetWorker parlance, this would be via a savepnpc “pre” script. Other backup products will equally feature such options.) Once the information is gathered, a copy should be kept on the backup server as well as in an offsite location. (I’ll give you a useful cloud backup function now: it’s called Google Mail. Great for offsiting bootstraps and system configuration details.)
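As a minimal sketch of such a pre-backup step, the function below captures the output of the two commands shown earlier, plus the filesystem table and mounted filesystems, into one report file. It assumes a Linux host and root privileges. The function name, the report layout and the directory in the usage comment are assumptions for illustration, and how it gets hooked into a particular backup product’s “pre” mechanism is deliberately left out.

```shell
# Collect rebuild-critical details before the backup runs.
# Assumptions: Linux host, run as root; the output directory should sit
# inside a filesystem that the backup covers.
gather_dr_info() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    report="$outdir/dr-info-$(uname -n).txt"
    {
        echo "### Host: $(uname -n)   Collected: $(date)"
        echo "### Partition tables (fdisk -l)"
        fdisk -l 2>&1 || echo "(fdisk unavailable or not permitted)"
        echo "### Software RAID (/proc/mdstat)"
        if [ -r /proc/mdstat ]; then cat /proc/mdstat; else echo "(no /proc/mdstat)"; fi
        echo "### Filesystem table (/etc/fstab)"
        if [ -r /etc/fstab ]; then cat /etc/fstab; else echo "(no /etc/fstab)"; fi
        echo "### Mounted filesystems (df -k)"
        df -k 2>&1 || echo "(df failed)"
    } > "$report"
    echo "$report"    # print the report path so a wrapper can copy it offsite
}

# Usage (hypothetical directory):
#   gather_dr_info /var/dr-info
```

Each section records a placeholder when its source is unavailable, so a missing tool shows up in the report instead of silently producing nothing.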

When it comes to disaster recovery, such information can take the guesswork or reliance on memory out of the equation, allowing a system or backup administrator in any (potentially sleep-deprived) state, with any level of knowledge about the system in question, to conduct the recovery with a much higher degree of certainty.


* Due to what they offer to do, bare metal recovery (BMR) products tend to be highly specific in which operating system variants, etc., they support. In my experience a significantly higher number of sites don’t use BMR than do.

 

Introduction

Being one of those freaky weird IT people who are passionate about backups*, when Apple first previewed Mac OS X 10.5 (aka Leopard), the number one thing I of course got excited about was Time Machine. Now, before anyone tells me that it’s “just a poor rip-off of VSS”, let me be blunt – analysts who started that talk have no clue what they’re talking about.

Yes, VSS is great on Windows systems – in fact, it’s great to see that standard VSS functionality has reached a point in NetWorker 7.5 that it’s just part of the Windows client for filesystem backups, rather than requiring additional licenses.

But VSS in itself is not in the same league as Time Machine for end user backup – and more importantly, recovery – and quite frankly, that’s more important when we’re talking about non-server backup systems.

Evaluating it as an end-user backup system

If you’re not fully across Time Machine, here’s how it works:

  1. You plug a new or otherwise unused hard drive into your Mac.
  2. The OS asks you if you want to use that drive for Time Machine backups.
  3. You answer Yes**.

That’s all there is to getting basic Time Machine backups running. At that point, Time Machine does a full backup, then from that point onwards does incremental backups making use of hard links, thus making very efficient use of space. Backups are taken every hour, and it manages backups such that:

  • Hourly backups are kept for 24 hours.
  • Daily backups are kept for a month.
  • Weekly backups are kept until the disk becomes full.
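The hard-link technique mentioned above can be approximated with standard tools. To be clear, this is an analogue rather than a description of how Time Machine is implemented internally. The sketch assumes that the snapshot directories live on the same filesystem (hard links can’t cross filesystems), that filenames contain no newlines, and that only regular files and directories matter. The function name and argument order are made up for illustration.

```shell
# Take a snapshot of SRC into NEW, hard-linking files that are unchanged
# since the previous snapshot PREV (pass "" for the first, full backup).
snapshot() {
    src=$1 prev=$2 new=$3
    mkdir -p "$new" || return 1
    # Recreate the directory tree first.
    ( cd "$src" && find . -type d ) | while IFS= read -r d; do
        mkdir -p "$new/$d"
    done
    # Then link or copy each file.
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        if [ -n "$prev" ] && [ -f "$prev/$f" ] && cmp -s "$src/$f" "$prev/$f"; then
            ln "$prev/$f" "$new/$f"      # unchanged: hard link, no extra data stored
        else
            cp -p "$src/$f" "$new/$f"    # new or changed: store a fresh copy
        fi
    done
}

# Usage (hypothetical paths):
#   snapshot /Users/me ""                 /Volumes/TARDIS/2009-01-01-10
#   snapshot /Users/me /Volumes/TARDIS/2009-01-01-10 /Volumes/TARDIS/2009-01-01-11
```

Every snapshot directory looks like a full backup when browsed, yet an unchanged file costs only a directory entry, which is where the space efficiency comes from.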

All pruning of space is automatically handled by the OS. For the system volume at least, Time Machine is an exclusive backup product – it backs up everything by default, and you have to explicitly tell it what you want excluded from the backup. This is a Really Good Thing. However, you can go into preferences and exclude other regions (e.g., I have a “DNB” (Do Not Backup) folder on my desktop that I drop stuff into for temporary storage), or explicitly include other drives attached to the system.

Overall the settings for Time Machine are simple – very simple:

Main preferences for Time Machine


The Options button is what allows you to manage exclusions for your backups:

Options pane for Time Machine


To be honest though, who cares about backup? Desktop backup products abound, and in reality what we care about is whether you can recover. Indeed, for desktop products what we care most about is whether our parents, or our grandparents, or those people down the street who ask us for technical support simply because we’re in IT, can recover. Boy, can you recover.

Time Machine presents a visually beautiful way of browsing the backups. Unfortunately we won’t see it appear in other backup products because, well, according to Steve Jobs when it was first introduced, Apple took out a lot of patents on it***. The standard recovery browser will look like the following:

Time Machine Browsing Files


Equally importantly though, Time Machine isn’t just about facilitating file level recoveries, but also recoveries of other data that it understands – such as say, mail. Now, yes, enlightened readers will point out that Apple’s Mail.app program stores mail in files and thus is easily browseable, but the files aren’t named in such a way that say, my father could work out which file needs to be recovered.

Here’s an example of what Time Machine looks like when browsing for recovery of mail:

Browsing mail with Time Machine


To browse and retrieve email, the user simply browses through the folder structure – and the time of the backups – to pick the email(s) to be recovered. It’s incredibly intuitive, and takes less than 5 minutes to learn for the average user. As an enterprise backup consultant, honestly, I almost cried when I saw this and thought about how much of a pain message level recovery has been for so long. (Yes, getting better now, and has been for a while.)

Browsing back in time is straightforward – just scroll the mouse over the time bar on the right hand side of the screen and select the date you want:

Selecting alternate recovery time


This, quite honestly, is the epitome of simplicity. Going beyond standard backup and recovery operations, Time Machine is also an excellent disaster recovery tool – if you have serious enough issues that you need to rebuild your machine, the Mac OS X installer actually has the option of doing a rebuild and recovery from Time Machine backups.

To be blunt – as a backup utility for end users, Time Machine is an ace in the hole, and one of the most underrated features of Mac OS X.

There are some things that I think are lacking in Time Machine at the moment that will only come in time:

  1. Support for multiple backup destinations – savvy users want to be able to swap out their backup destination periodically to take it off site.
  2. Granular control of timing – some users complain that Time Machine affects the performance of their machine too much. Personally, I consider myself a power user and have not noticed it slowing me down yet, but others feel that it does, and don’t like the frequency at which it backs up. Being able to choose whether you want your most frequent backups done hourly, 2-hourly, 3-hourly, 4-hourly, etc., would be a logical enhancement to Time Machine, and one which I hope does arrive. Personally, if this were available, I’d be looking more to keep daily backups for at least a month.
  3. Better application support – this actually isn’t an Apple issue at all, but one for third party software developers. Over time, I want to see any application that does database style storage, or storage where multiple files must remain consistent, to offer Time Machine integration. (The biggest failure in this respect is Microsoft Entourage – the monolithic database format makes hourly backups via Time Machine not only impractical, but unusable.)

Still, regardless of these deficiencies, Time Machine as it currently stands is a fantastic addition to a robust operating system, one which puts easy recovery in the hands of average users.

(I have no idea what Apple intends to do with Time Machine at the server level – while Time Machine exists on Mac OS X Server, for the most part it’s to back up the server itself plus act as a repository point for machines on the LAN, much in the same way that Apple’s Time Capsule product works. However, if they added a little bit more – say, backing up multiple clients with file level deduplication across the clients, suddenly it would be very interesting.)

Comparing it to enterprise products…

Time Machine is great for providing a backup mechanism for end users, but it pales in comparison to what enterprise backup products such as NetWorker can do for an entire environment. As such, it’s not fair to compare it against those products – it’s not in their league, and it doesn’t pretend to be there. It doesn’t support remote storage, it doesn’t support true centralisation of backups, it doesn’t support removable media, … the list goes on, and on. Most importantly for any enterprise however, it doesn’t really support native backups of other operating systems. (Yes, you can shoe-horn it into say, backing up a SMB or CIFS share, but like any such form of backup, it’s not a true, integrated solution.)

As such, Time Machine isn’t something that’s going to replace your NetWorker environment. Chances are it won’t even replace your Retrospect environment. Used correctly though, it can act as a valuable enhancement in a backup environment, but if you’re a backup administrator, it isn’t going to put you out of a job today, next week, next year, or even in the next 5 years.


* Honestly, tell someone in a different discipline in IT that you specialise in data protection and that you enjoy it, and watch their eyes glaze over…

** Or in my case, since I can never resist the temptation, you answer no, and rename the disk to TARDIS, since if it’s going to be a Time Machine, it may as well be a good one.

*** Good for them. It’s tiresome watching what sometimes seems to be the entire computer industry using Apple as a free R&D centre.

© 2012 The NetWorker Blog