Counting the numbers

Apr 25, 2012
 

How many datacentres do you have across your organisation?

Server Rooms

It’s a simple enough question, but the answer is sometimes not as simple as some organisations think.

The reason for this is that it’s all too easy to assume different physical locations equate to different datacentres.

There are actually two conditions that must be met before a server room can be considered a fully independent datacentre. These are:

  1. Physical separation – The room/building must be sufficiently physically separated from other datacentres. By “sufficiently”, I mean that any disaster situation the company designs into its contingency plans should not be able to take out more than one datacentre on the basis of physical proximity to one another.
  2. Technical separation – The room/building must be able to operate at its full production potential without the direct availability of any other datacentre within the environment.

So what does it mean if you have a datacentre that doesn’t meet both of those requirements? Quite simply, it’s not an independent datacentre at all, and should likely be considered just a remote server room that forms part of a geographically dispersed datacentre.

If you’re wondering what the advantage of making this distinction is, it’s this: unless they’re truly independent, treating geographically dispersed server rooms as datacentres often results in the business making highly incorrect assumptions about the resiliency of its IT systems, and by extension, the business itself.

You might think that we have enough differentiation by referring simply to datacentres and independent datacentres. This, I believe, compounds the problem rather than introducing clarity; many people, particularly those who are budget conscious, will assume the best possible scenario for the least possible price. We all do it – that’s why getting a bargain when shopping can be such a thrill. So as long as a non-independent datacentre is referred to as a datacentre, it’s going to be read by a plethora of people within a business, or the customers of that business, as an independent one. The solution is to take the word away.

So, on that basis, it’s time to recount, and answer: how many datacentres does your business truly have?

10 Things Still Wrong with Data Protection Attitudes

Mar 7, 2012
 

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash in hand jobs, a business not relying on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet there are still improvements to be made to data protection attitudes within many organisations, and I wanted to outline the things that are still commonly done incorrectly in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that doesn’t get claimed on is.

However, what is so often a waste of money is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include the following (a minimal policy sketch appears after the list):

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.
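
To make that alignment concrete, here’s a minimal sketch – in Python, with entirely hypothetical system names, owners and figures – of recording each system’s agreed RPO, RTO and retention alongside its actual schedule, so that a schedule which can never meet its RPO is flagged up front rather than discovered during a recovery:

```python
from dataclasses import dataclass

@dataclass
class ProtectionPolicy:
    """Hypothetical record tying a system's backup schedule to its agreed SLAs."""
    system: str
    owner: str                  # who agreed to the window and retention
    backup_window_start: str    # e.g. "22:00", negotiated with the system owner
    rpo_hours: int              # maximum tolerable data loss
    rto_hours: int              # maximum tolerable recovery time
    retention_days: int         # driven by business practice and legal/audit needs
    backup_interval_hours: int  # how often backups actually run

    def meets_rpo(self) -> bool:
        # A backup interval longer than the RPO means the SLA can never be met.
        return self.backup_interval_hours <= self.rpo_hours

# Purely illustrative entries.
policies = [
    ProtectionPolicy("finance-db", "Finance", "23:00", rpo_hours=4, rto_hours=8,
                     retention_days=2555, backup_interval_hours=24),
    ProtectionPolicy("intranet-web", "IT", "22:00", rpo_hours=24, rto_hours=24,
                     retention_days=90, backup_interval_hours=24),
]

for p in policies:
    if not p.meets_rpo():
        print(f"WARNING: {p.system} backs up every {p.backup_interval_hours}h "
              f"but the agreed RPO is {p.rpo_hours}h")
```

The code matters far less than the discipline it represents: the start times, RPO/RTO and retention are recorded because someone on the business side agreed to them, not because they were convenient defaults.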

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as the following (a simple ordering sketch appears after the list):

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
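
On the first of those points, here’s a minimal sketch of the sort of ordering that avoids the problem. The commands are placeholders – pg_dump stands in for whatever dump tool your DBAs actually use, and run-filesystem-backup stands in for your backup product’s client-initiated backup command:

```python
import subprocess
import sys

# Hypothetical commands - substitute your DBA-approved dump tool and your
# backup product's client-initiated backup command.
DUMP_COMMAND = ["pg_dump", "--format=custom", "--file=/dumps/prod.dump", "proddb"]
BACKUP_COMMAND = ["run-filesystem-backup", "--path=/dumps"]

def main() -> int:
    # Run the database dump first and confirm it succeeded...
    dump = subprocess.run(DUMP_COMMAND)
    if dump.returncode != 0:
        print("Database dump failed; not starting filesystem backup", file=sys.stderr)
        return dump.returncode

    # ...so the filesystem backup only ever captures a complete, consistent dump.
    backup = subprocess.run(BACKUP_COMMAND)
    return backup.returncode

if __name__ == "__main__":
    sys.exit(main())
```

The point is simply that the dump and the filesystem backup are sequenced by the same process, rather than by two independent schedules that drift apart.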

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet in practically no organisation will administrators of those specific systems get such free rein over data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

Some of the worst mistakes I’ve seen in backup systems planning involve assuming that what is bought for backup today will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
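
As a rough sketch of what “forecast and regularly reassess” can look like in practice – all figures here are purely illustrative – a simple projection compounds an organic growth rate and then overlays the known spikes and troughs from projects and decommissions:

```python
def project_backup_capacity(current_tb: float, annual_growth: float,
                            years: int, adjustments: dict[int, float]) -> list[float]:
    """Rough projection of protected capacity: compounding organic growth plus
    known one-off additions (new projects) or reductions (decommissions)."""
    projection = []
    capacity = current_tb
    for year in range(1, years + 1):
        capacity *= (1 + annual_growth)
        capacity += adjustments.get(year, 0.0)   # spikes and troughs, not a straight line
        projection.append(round(capacity, 1))
    return projection

# Entirely illustrative: 40 TB protected today, 25% organic growth,
# a 15 TB project landing in year 2 and an 8 TB decommission in year 4.
print(project_backup_capacity(40, 0.25, 5, {2: 15.0, 4: -8.0}))
```

The numbers themselves matter less than the habit of re-running the projection whenever the business changes its plans.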

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.
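
A minimal sketch of the capture side of such a policy – the result format and file name here are hypothetical, and in practice failures would be parsed from the backup product’s own reports – is an exception register that every failure lands in and stays in until its cause is documented and resolved:

```python
import csv
import datetime

def record_failures(results: list[dict], logbook: str = "backup_exceptions.csv") -> None:
    """Append every failed backup to an exception register so nothing is
    silently ignored; each entry is only closed once the cause is documented."""
    failures = [r for r in results if r["status"] != "succeeded"]
    with open(logbook, "a", newline="") as fh:
        writer = csv.writer(fh)
        for f in failures:
            writer.writerow([datetime.date.today().isoformat(),
                             f["client"], f["saveset"], f["status"],
                             "UNRESOLVED"])   # updated only when resolution is recorded

# Illustrative results only.
record_failures([
    {"client": "fileserver01", "saveset": "/data", "status": "succeeded"},
    {"client": "mailserver01", "saveset": "MAPI", "status": "failed"},
])
```

The register is the policy: a failure that isn’t recorded and chased is, for all practical purposes, a backup that never happened.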

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies, though their processes undoubtedly play into disaster recovery planning – and a fairly important part of it, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this problem is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion in the backup strategy is to have build forms – where staff not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular (an age-tracking sketch appears after the list):

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.
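
As a small sketch of the age-tracking part of such policies – the thresholds and cartridge details below are purely illustrative – each cartridge’s batch, label and write date can be checked against verification and migration deadlines:

```python
import datetime

# Illustrative thresholds - the real values belong in the media ageing policy.
VERIFY_AFTER_DAYS = 365        # periodic backup verification
MIGRATE_AFTER_DAYS = 365 * 4   # migrate to new media before it ages out

def media_actions(cartridges: list[dict], today: datetime.date) -> list[str]:
    """Return the most pressing overdue action for each cartridge."""
    actions = []
    for cart in cartridges:
        age_days = (today - cart["written"]).days
        if age_days >= MIGRATE_AFTER_DAYS:
            actions.append(f"{cart['batch']}/{cart['label']}: migrate to new media")
        elif age_days >= VERIFY_AFTER_DAYS:
            actions.append(f"{cart['batch']}/{cart['label']}: verify readability")
    return actions

print(media_actions(
    [{"batch": "B2007-03", "label": "A00017", "written": datetime.date(2007, 3, 12)},
     {"batch": "B2011-01", "label": "A00541", "written": datetime.date(2011, 1, 4)}],
    datetime.date.today()))
```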

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the tenure of the backup administrator. Even if this is the case, it’s not an excuse – in fact, it points to quite the opposite requirement.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also for continuance and alignment. If this isn’t the case yet, initial steps towards that shift can be achieved by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.


Psst! Want to touch my backup server?

Apr 5, 2011
 

Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I, however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has run surveys would point out (rightly so) that my leading commentary going into the survey may have coloured the results, and I’d not disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to back up to disk on the backup server, but not using other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with a question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past; and other than offering better rack density I see them having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to backup to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than say, a virtual machine accessing a physical tape drive does.

So, the survey showed me pretty much what I was expecting to see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using, virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

– if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
– the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
– I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of all the things I hadn’t considered, the simplest was the ability to move your backup server between virtualisation hosts. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to as critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations of running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold (a small sketch of the second point appears after the list):

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. Secondary to this: DR should not be dependent on the original and the destination host having the same names.
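
On the second of those points, here’s a rough sketch – with entirely hypothetical paths – of the kind of lower-level copying I mean: taking a dated copy of the bootstrap backup area to storage that sits outside the virtualisation layer, so it can be attached to or copied onto another host in an emergency:

```python
import datetime
import shutil
from pathlib import Path

# Hypothetical paths: the disk area the backup server writes its bootstrap
# backups to, and a location that does not depend on the virtualisation layer.
BOOTSTRAP_DEVICE = Path("/backup/bootstrap_device")
SAFE_COPY_ROOT = Path("/mnt/external/bootstrap_copies")

def copy_bootstrap() -> Path:
    """Take a dated copy of the bootstrap backup area so the bootstrap data
    remains reachable even if the virtualisation layer is unavailable."""
    target = SAFE_COPY_ROOT / datetime.date.today().isoformat()
    shutil.copytree(BOOTSTRAP_DEVICE, target)
    return target

if __name__ == "__main__":
    print(f"Bootstrap copy written to {copy_bootstrap()}")
```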

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.

Mar 3, 2011
 

As a consultant, you get attuned to (or, as some would have it, cynical about) certain key phrases and statements when you’re in meetings. Sometimes these statements are innocent and mean exactly what the person says, but usually they set the alarm bells ringing.

As a bit of winding down after a hectic 7 days, I thought I’d share the top 15 statements that cause me to start immediately trying to get deep qualification of what I’ve just been told…


Oct 4, 2010
 

Ten years ago, you couldn’t do anything in the backup space without having an answer to the question, “How do you achieve Bare Metal Recovery (BMR)?” Nowadays it’s not a dirty word in backup, but it certainly seems somewhat passé.

So what happened? Is BMR now dead? Is it on life support? Did it ascend?

It’s an interesting question. I think that as an independent technology, BMR has become ever more niche, and what we’ve seen is a gradual shift in technology so as to allow BMR to become a silent feature. As such, it doesn’t necessarily get a lot of attention – it just blends into the background.

For the most part, I’d suggest that I found BMR to be more of a focus point in the Windows market, then later in the emerging Linux market, though still with a primary focus on Windows. This wasn’t to say that rapid systems recovery wasn’t important on other platforms, but on those platforms there were frequently technologies built into the OS. AIX could boot from a system image tape. Solaris could be Jumpstarted, etc. Eventually, Linux could be Kickstarted.

In the Legato space, BMR options were pretty challenging for the most part, so 10 years ago I’d regularly recommend customers wanting to BMR their Windows servers to deploy Ghost. It wasn’t perfect, but it did the trick – the goal in my mind was to get a system back to a state of easy recoverability; i.e., BMR was about allowing you to get a system back to the point where you could run a full recovery. Nothing more, nothing less. That was undoubtedly influenced by the lack of integrated BMR within NetWorker, but it worked, and it let each product focus on what it did best.

These days I think BMR is something that’s effectively available in most enterprise spaces without actually needing to reference it as an independent technology. So it comes into play primarily as a result of virtualisation and snapshots.

Within virtualisation, there are two options that tend to resolve independent BMR requirements – templates and image level backups – though for slightly different reasons.

Templates are designed to allow a rapid deployment of a new guest – be it just at the operating system level, or a combination operating system and application level; such templates will usually include a certain level of patching – enough to get a host at a secure enough point to connect to a corporate network. But they don’t have to be used just for the deployment of a new guest; instead, if a guest fails or becomes otherwise hopelessly corrupt, there’s nothing stopping the use of a template to rapidly bring the guest “back to life” to allow a regular recovery. If backups are being done at the guest level, then a smart template will also include the backup software so that it’s immediately available on system (re)creation.

On the other hand, image level backups fulfil the old “cold backup” niche. When virtualisation started hitting its stride, image level backups were seen as the future, but then reality struck and it became painfully obvious that recovering a 100GB virtual machine to pull out a 10KB document was wasteful and time consuming. Since then file level recovery from image level backup has improved, but it’s still not an omnipresent technology. That being said, image level backup works perfectly as a rapid BMR mechanism. Even assuming a situation where an image level backup is only taken once a month, recovering a machine from an image backup done 30 days ago puts you in a situation to allow regular host-based recoveries to run with minimum effort.

We frequently look at snapshots as enabling more useful RPOs and RTOs than traditional “once per day” backups. It’s common, for instance, to see NAS systems with hourly read-only snapshots immediately available to end users for self-directed recoveries. They’re also used to facilitate traditional backups, by enabling quiesced backups with minimum downtime, or less disruptive backups.

However, certainly in the enterprise space, snapshots equally provide an excellent BMR solution. Snapshot, patch, revert to snapshot if patch fails, etc. Array level snapshots (IMHO) provide a significantly greater level of flexibility than a traditional BMR solution where the primary focus is getting a machine back to its most recent usable state. Snapshots are so useful on this front that they’re even used within virtualisation for exactly that reason – why go back to an image level backup, or waste time doing a cold backup of a virtual machine when you can just roll back to a snapshot taken 10 minutes ago?

What I’ve been observing now for a while is that BMR as an independent product gets very little attention these days in enterprises. At the small to medium business it still gets bandied about – often for desktops as much as for servers, but it increasingly seems that virtualisation and snapshots have gobbled up most of the BMR space in the enterprise.

It seems that over time even that space may become narrowed. Looking at Mac OS X as an example, the ability to do a new system install referencing a Time Machine backup is a perfect example of an operating system integrated approach to BMR. Does it solve all BMR issues, even on the OS X platform? No, but it addresses the 80% rule, I believe. Will it be the only such product? I can’t believe so – I have to believe we’ll eventually see something comparable in other operating systems.

What are your thoughts?

Virtualisation and testing

Jun 3, 2010
 

Once upon a time, if you said to someone “do you have a test environment?” there was at least a 70 to 80% chance that the answer would be one of the following:

  • Only some very old systems that we decommissioned from production years ago
  • No, management say it’s too expensive

I’d like to suggest that these days, with virtualisation so easy, there are few reasons why the average site can’t have a reasonably well configured backup and recovery test environment. This would allow the following sorts of tests to be readily conducted:

  • Disaster recovery of hosts and databases
  • Disaster recovery of the backup server
  • Testing new versions of operating systems, databases and applications with the backup software
  • Testing new versions of the backup software

Focusing on the Intel/x86/x86_64 world, we can see that this is immediately achievable. Remember, for the average set of tests that you run, speed is not necessarily going to be the issue. Let’s focus on non-speed functionality testing, and think of what would be required to have a test environment that would suit many businesses, regardless of size:

  1. Virtualisation server – obviously VMware ESXi springs to mind here, if cost is a driving factor.
  2. Cheap storage – if performance is not an issue for testing (i.e., you’re after functionality not speed testing), there’s no reason why you can’t use cheap storage. A few 2TB SATA drives in a RAID-5 configuration will give you oodles of space if you need any level of redundancy, or a RAID-0 stripe will give you capacity and performance (a quick usable-capacity calculation appears after this list). Optionally present storage via iSCSI if it’s available.
  3. Tiny footprint – previously test environments were disqualified in a lot of organisations, particularly those at locations where space was at a premium. Allocating room for say, 15 machines to simulate part of the production network took up tangible space – particularly when it was common for test environments to not be built using rackable equipment.
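
To put some rough numbers on the cheap storage point – ignoring formatting and filesystem overheads – the usable capacity trade-off between RAID-0 and RAID-5 for a handful of 2TB drives works out as follows:

```python
def usable_tb(drives: int, size_tb: float, raid: str) -> float:
    """Rough usable capacity, ignoring formatting overheads."""
    if raid == "raid0":
        return drives * size_tb          # full capacity and speed, no redundancy
    if raid == "raid5":
        return (drives - 1) * size_tb    # one drive's worth of capacity goes to parity
    raise ValueError(f"unhandled RAID level: {raid}")

# Four 2 TB SATA drives:
print(usable_tb(4, 2.0, "raid5"))   # 6.0 TB, survives a single drive failure
print(usable_tb(4, 2.0, "raid0"))   # 8.0 TB, no redundancy at all
```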

In the 2000s, much excitement was heralded over the notion of supercomputers at your desk – for example, remember when Orion released a 96-CPU capable system? The notion of that much CPU horsepower under your desk for single tasks may be appealing to some, but let’s look at more practical applications flowing from multi-core/multi-CPU systems – a mini datacentre under your desk. Or in that spare cubicle. Or just in a 3U rack enclosure somewhere within your datacentre itself.

Gone are the days when backup and recovery test environments are cost prohibitive. You’re from a small organisation? Maybe 10-20 production servers at most? Well that simply means your requirements will be smaller and you can probably get away with just VMware Workstation, VMware Fusion, Parallels or VirtualBox running on a suitably powerful desktop machine.

For companies already running virtualised environments, it’s more than likely the case that you can even use a production virtualisation server due for replacement as a host to the test environment, so long as it can still virtualise a subset of the production systems you’d need to test with. During budgetary planning this can make the process even more painless.

This sort of test environment obviously doesn’t suit every single organisation or every single test requirement – however, no single solution ever does. If it does suit your organisation though, it can remove a lot of the traditional objections to dedicated test environments.

What a day!

Mar 30, 2010
 

This morning we went to the funeral of our best friends’ father. It was, as funerals go, a lovely service, and after the funeral and the burial we headed off to the wake, only to have someone’s HiLux slam into the driver’s side of our car on a tight bend. They’d skidded and come onto the wrong side of the road by just enough, given the tight corner, to make the impact. Thankfully speed, alcohol or drugs weren’t in play, just the wet, and even more importantly, no-one was injured. Dignity will be lost any time it’s driven without a passenger though – the driver’s door can’t be opened from the inside:

Alas, poor car, I hardly knew ye

The case is with the insurers, and we’re waiting for an assessment next Tuesday to find out whether the car will be repaired or written off. It would be a shame if it’s written off; it’s a Toyota Avalon, circa 2001, and while those cars were frumpy they were damn good cars. With only around 120,000km on the clock it’s not really all that old. About 3 or 4 years ago it was almost completely totalled in a massive hail storm on the central coast; as I recall the repair was in the order of about $12,000, and it only scraped through for repair on an insurance value of around $14,000. Now, with insurance of $7,500 and the repair estimate saying that it’ll top $5,000, age is against the car and it doesn’t look good.

But, this blog isn’t about my hassles, or my car.

It is however about insurance, and insurance is something I’ll be dealing with quite a bit over the coming days. Or I will be, once we hit next Tuesday and the car gets checked out by the assessors.

When we think of “backup as insurance”, there’s some fairly close analogies:

  • Backup is insurance because it’s about having a solution when something goes wrong;
  • Making a claim is performing a recovery;
  • Your excess is how easy (or hard) it is to make a recovery.

Given what’s happened today, it made me wonder what the analogy to “written off” is. That’s a little bit more unpleasant to deal with, but it’s still something that has to be considered.

In this case I’d suggest that the analogy for the insured item being “written off” is one of the following:

  • Having clones – seems simple, but if one recovery fails due to media, having clones that you can recover from instead is the cheapest, most logical solution.
  • Having an alternate recovery strategy – so for items with really high availability requirements or minimal data loss requirements, this would refer to having some other replica system in place.
  • Having insurance that can get you through the worst of events – sometimes no matter what you do to protect yourself, you can have a disaster that exceeds all your preparation. So in the absolute worst case scenario, you need something that will help you pay your bills, or ameliorate your building debt while you get yourself back on-board.

Of course, it remains preferable to not have to rely on any of these options, but the case remains that it’s always important to have an idea what your “worst case scenario” recovery situation will be. If you haven’t prepared for one, I’ll suggest what it’s likely to be: going out of business. Yes, it’s that critical that you have an idea what you’ll do in a worst-case scenario. It’s not called “business continuity” for the heck of it – when that critical situation occurs, not having plans usually results in the worst kind of failure.

Me? I’ll be visiting a few car-yards on the weekend to scope up what options I have in the event the car gets written off on Tuesday.

Snapshots and Backups, Part 2

Feb 8, 2010
 

Over the weekend I wrote up a piece about how snapshots are not a valid replacement to enterprise backup. The timing of this was in response to NetApp recently abandoning development of their VTL systems, and subsequent discussions this triggered, but it was something that I’d had sitting in the wings for a while.

It’s fair to say that discussions on snapshots and backups polarise a lot of people; I’ll fully admit that I side with the “snapshots can’t replace backups” side of the argument.

I want to go into this in a little more detail. First I’ll point out in fairness that there are people willing to argue the other side that don’t work for NetApp, in the same way that I don’t work for EMC. One of those is the other Preston – W. Curtis Preston, and you can read his articulate case here. I’m not going to spend this article going point for point against Curtis – it’s not the primary point of discussion I want to make in this entry.

Moving away from vendors and consultants, another and very interesting opinion, from the customer perspective, comes from Martin Glassborow’s Storagebod blog. Martin brings up some valid customer points – namely that snapshot and replication represent extreme hardware lock-in. Some would argue that any vendor’s backup product represents vendor lock-in as well, and this is partly right – though remember it’s not so difficult to keep a virtual machine around with the “last state” of the previous backup application available for recovery purposes. Keeping old and potentially obsolete NAS technology running to facilitate older recoveries after a vendor switch can be a little more challenging.

To get onto what I want to raise today, I need to revisit a previous topic as a means of further explaining my position. Let’s look for instance at my previous coverage of Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP). You can read the entire piece here, but the main point I want to focus on is my ILP ‘diagram’:

Components of ILP

One of the first points I want to make from that diagram is that I don’t exclude snapshots (and their subsequent replication) from an overall information lifecycle protection mechanism. Indeed, depending on the SLAs involved, they’re going to be practically mandatory. But, to use the analogy offered by the above diagram, they’re just pieces of the pie rather than the entire pie.

I’m going to extend my argument a little now, and go beyond just snapshots and replication, so I can elucidate the core reasons why I don’t like replicated snapshots as a permanent backup solution. Here are a few other things I don’t like as a permanent backup solution:

  • VTLs replicated between a primary and disaster recovery site, with no tape out.
  • ADV_FILE (or other products’ disk backup solutions) cloned/duplicated between the primary and disaster recovery site, with no tape out.
  • Source based deduplication products with replication between two locations, with no tape out.

My fundamental objection to all of these solutions is the exposure to long-term failure caused by keeping everything “online”. Maybe I’m a pessimist, but when I’m considering backup/recovery and disaster recovery solutions, I firmly believe that I’m being paid to consider all likely scenarios. I don’t personally believe in luck, and I won’t trust a backup/disaster recovery solution to luck either. The old Clint Eastwood quote comes to mind here:

You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?

When it comes to your data, no, no I don’t. I don’t feel lucky, I don’t encourage you to feel lucky. Instead I rely on solid, well protected systems with offline capabilities. Thus, I plan for at least some level of cascading failures.

It’s the offline component that’s most critical. Do I want all my backups for a year online, only online, even with replication? Even more importantly – do I want all your backups online, only online, even with replication? The answer remains a big fat no.

The simple problem with any solution that doesn’t provide for offline storage is that (in my opinion) it brings the risk of cascading failures into play too easily. It’s like putting all storage for your company on a single RAID-5 LUN and not having a hot spare. Sure, you’re protected against that first failure, but it’s shortly after the first failure that Murphy will make an appearance in your computer room. (And I’ll qualify here: I don’t believe in luck, but I’ve observed on many occasions over the years that Murphy’s Law rules in computer rooms as well as in other places.) Or to put it another way: you may hope for the best, but you should plan for the worst. Let’s imagine a “worst case scenario”: a fire starts in your primary datacentre 10 minutes after upgrade work on the array that receives replicated snapshots at your disaster recovery site runs into firmware problems, leaving that array inaccessible until the vendor upgrades are complete – or worse again, leaving its storage corrupted.

Or if that seems too extreme, consider a more basic failure: a contractor near to your primary datacentre digs through the cables linking your production and disaster recovery sites, and it’s going to take 3 days to repair. Suddenly you’ve got snapshots and no replication. Just how lucky does that leave you feeling? Personally, I feel slightly naked and vulnerable when I have a single backup that’s not cloned. If suddenly none of my backups were getting duplicated, and I had no easy access to my clones, I’d feel much, much worse. (And that full body shiver I do from time to time would get very pronounced.)

All this talk of single instance failures frequently leads proponents of snapshots+replication-only to suggest that a good design will see 3-way replication, so there are always two backup instances. This doubles a lot of costs while merely moving the failure point just a jump to the left. On the other hand, offline backup where there’s the backup from today, the backup from yesterday, the backup from the day before … the backup from last week, the backup from last month, etc., all offline, all likely on different media – now that’s failure mitigation. Even if something happens and I can’t recover the most recent backup, in many recovery scenarios I can go back one day, two days, three days, etc. Oh yes, you can do that with snapshots too, but not if the array is a smoking pile of metal and plastic fused to the floor after a fire. In some senses, it’s similar to the old issue of trying to get away from cloning by backing up from the production site to media on the disaster recovery site. It just doesn’t provide adequate protection. If you’re thinking of using 3-way replication, why not instead have a solution that uses two entirely different types of data protection to mitigate against extreme levels of failure?

It’s possible I’ll have more to say on this in the coming weeks, as I think it’s important, regardless of your personal view point, to be aware of all of the arguments on both sides of the fence.

Backup forms only a fraction of disaster recovery

Jan 26, 2010
 

A borked LaCie 2TB BigDisk Extreme has reminded me of the role of backup and recovery within disaster recovery itself. By disaster recovery, I mean total “system” failure, whether that system is an entire server, an entire datacentre, or in my case, a large drive.

What is the difference between a regular failure and a disaster? I think it’s one of those things that’s entirely down to the perspective of the organisation or person who experiences it.

As for my current disaster, I’ve got a 2TB drive with just 34GB free. I’ve got up-to-date backups for this drive which I can restore from, and in the event of a catastrophe, I could actually regenerate the data, given that it’s all my media files. It’s also operational, so long as I don’t power it off again. (This time it took more than 30 minutes to become operational after a shutdown. It’s been getting worse and worse.)

So I’ve got a backup, I’ve got a way of regenerating the data if I have to, and my storage is still operational. Why is it a disaster? Here are a few reasons, which I’ll then use to explain what makes for a disaster more generally, and why backup/recovery is only a small part of disaster recovery:

  1. I don’t have spares. Much as I’d love to have a 10 or 20TB array at home running on RAID-6 or something like that, I don’t have that luxury. For me, if a drive fails, I have to go out and buy a replacement drive. That’s budget – capital expenditure, if you will. What’s more, it’s usually unexpected capital expenditure.
  2. Not all my storage is high speed. Being a home user, a chunk of my storage is either USB-2 or FireWire 400/800. None of these formats offer blistering data transfer speeds. The 2TB drive is hooked up to FireWire 800, and I back up to FireWire 400, which means I’m bound to a maximum of around 30-35MB/s throughput for either running the backup or recovering from it (a rough transfer-time calculation appears after this list).
  3. The failure constrains me. Until I get the drive replaced, I have to be particularly careful about any situation that would see the drive powered off.
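
To put a rough number on that throughput constraint, the best case for moving roughly 2TB over a 30-35MB/s link works out to the better part of a day:

```python
def transfer_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move data_gb through a link at the given MB/s."""
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600

# Roughly 2 TB of media at the FireWire-bound 30-35 MB/s:
print(round(transfer_hours(2000, 35), 1))   # about 16.3 hours at best
print(round(transfer_hours(2000, 30), 1))   # about 19.0 hours at worst
```

So a full backup or a full restore of the drive is an overnight-plus exercise, which is exactly why the failure is constraining.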

So there’s three factors there that constitute a “disaster”:

  1. Tangible cost.
  2. Time to repair.
  3. Interruptive.

A regular failure will often have one or two of the above, but all three are needed to turn it into a disaster. This is why a disaster is highly specific to the location where it happens – it’s not any specific thing, but a combination of the situation, the local impact and the required response that turns a failure into a disaster.

There’s of course varying levels of disasters too, even at an individual level. Having a borked media drive is a disaster, but it’s not a “primary” disaster for me, because the core of what I do on my computer I can still get done. The same applies with corporations – it could be that losing both a primary fileserver and a manually controlled archive fileserver would constitute a “disaster”, but the first is always likely to be a far more serious disaster. That’s because it generates higher spikes in one or more of the factors – cost and interruption.

So, returning to the topic of the post – let’s consider why backup/recovery only forms a fraction of disaster recovery. When we consider a regular failure requiring recovery, it’s clear that the backup/recovery process forms not only the nexus to the activity, but likely the longest or most “costly” component (usually in terms of staff time).

In a disaster recovery situation, that’s no longer guaranteed to be the case. While the actual act of recovery is likely to take some time within a disaster recovery situation, there’s usually going to be a heap of other activities. There’ll be:

  • Personnel issues – getting human resources allocated to fixing the problem, and the impact of the failure on a number of people. Typically you don’t find (in a business world) that a disaster is something that only affects a single user within the organisation. It’s going to impact a significant number of workers – hence the tangible cost and the interruptive nature of them.
  • Fault resolution time – If you can seamlessly failover from an event, it’s unlikely it will be treated as a disaster. Sure, it may be a major issue, but a disaster is something that is going to take real time to fix. A disaster will see staff needing to work nigh-continuously in order to get the system operational. That will include:
    • Time taken to assess the situation,
    • Time taken to get replacement systems ready,
    • Time taken to recover,
    • Time taken to mop up/finalise access,
    • Time taken to repair original failure,
    • Time taken to revert services and
    • Time taken to report.
  • Post recovery exercises – in a good organisation, disaster recovery operations don’t just stop when the last byte of data has been recovered. As alluded to in the above bullet point, there needs to be a formal evaluation of the circumstances that led up to the disaster, the steps required to rectify it, any issues that might have occurred, and plans to avoid it (or mitigate it) in future. For some staff, this exercise may be the longest part of the disaster recovery process.
  • Post disaster upgrades – if, as a result of the disaster and the post recovery exercises it’s determined that new systems must be put into place (e.g., adding a new cluster, or changing the way business continuity is handled), then it can be fairly stated that all of the work involved in such upgrades is still attributed to the original disaster recovery situation.

All of these factors (and many more – it will vary, site by site) lead to the inevitable conclusion that it’s insufficient to consider disaster recovery as just a logical extension of a regular backup and recovery process. It’s more costly in terms of direct staff time and a variety of other factors, and it’s far more interruptive – both to individuals within the organisation and to the organisation as a whole.

As such, the response to a disaster recovery situation should not be driven directly by the IT department. IT will of course play a valuable and critical role in the recovery process, but the response must be driven by a team with oversight across all affected areas, and the post-recovery processes must equally be driven by a team whose purview extends beyond just the IT department.

We can’t possibly prepare for every disaster. To do so would require unlimited budget and unlimited resources. (It would also be reminiscent of the Brittas Empire.)

Instead, what we can plan for is that disasters will, inevitably, happen. By acknowledging that there is always a risk of a disaster, organisations can prepare for them by:

  • Determining “levels” of disaster – quantifying what tier of disaster a situation will be by say, percentage of affected employees, loss of ability to perform primary business functions, etc.
  • Determining role based involvement in disaster response teams for each of those levels of disaster.
  • Determining procedures for:
    • Communication throughout the disaster recovery process.
    • Activating disaster response teams.
    • Documenting the disaster.
    • Reporting on the disaster.
    • Post-disaster meetings.

Good preparation of the above will not prevent a disaster, but it’ll at least considerably reduce the risk of a disaster becoming a complete catastrophe.

Don’t just assume that disaster recovery is a standard backup and recovery process. It’s not – not by a long shot. Making this assumption puts the business very much at risk.
