Normally you don’t want to be in this position, but sometimes you’ll strike a situation where the only possible location of data that you need to get back is in a saveset that aborted (i.e., failed) during the backup process. Now, if the saveset/media is almost completely hosed, you’re probably going to need to recover using the scanner|uasm process, but if it was just a case of a failed backup, you can direct a partial saveset recovery using the recover command.

When you’re at this point the first thing you need to do is find the saveset ID of the aborted saveset, but I’ll leave that as an exercise to the reader. Now, once you’ve got the aborted saveset ID, it’s as simple as running a saveset recovery. The basic command might look like this:

C:\> recover -d path -s buServer -iN -S ssid

Where:

  • ‘path’ is the path that you want to recover to. Note that in these situations, it’s usually a very, very good idea to make sure you recover to somewhere new, rather than overwriting any existing files.
  • ‘buServer’ is the backup server that you want to recover from.
  • ‘ssid’ is the saveset ID for the aborted saveset that you want to recover from.

Depending on whether you’re doing a directed recovery, etc., you may end up with a few additional arguments, but the above is fairly much what you need in this situation. (If you’re confident that a specific path or file you want back is going to be in the part of the saveset backed up, you can always add that path at the end of the recovery command, too.)

Once the recovery runs, you’ll get a standard file-by-file listing of what is being recovered, but the recovery will end with what looks like an error – it’s effectively though just a notification that NetWorker has hit the data that was ‘in transit’, so to speak, when the saveset was aborted. This error will look similar to the following:

5041:recover: Unable to read checksum from save stream

16294:recover: Encountered an error recovering C:\temp2\Temp\744\win_x86\networkr\hba\emc-homebase-agent-6.1.2-win-x86.exe

53363:recover: Recover of rsid 851692923 failed: Error receiving files from NSR server `tara'

The process cannot access the file because it is being used by another process.

Received 231 matching file(s) from NSR server `tara'

Recover errors with 1 file(s)

Recover completion time: 4/20/2010 3:41:12 PM

At that point, you know that you’ve got back all the data you’re going to get back, and you can search through the recovered files for the data you want.

(As an aside, don’t forget to join the forums if you’ve got questions that aren’t answered in this blog.)

 

Over the weekend I wrote up a piece about how snapshots are not a valid replacement to enterprise backup. The timing of this was in response to NetApp recently abandoning development of their VTL systems, and subsequent discussions this triggered, but it was something that I’d had sitting in the wings for a while.

It’s fair to say that discussions on snapshots and backups polarise a lot of people; I’ll fully admit that I side with the “snapshots can’t replace backups” side of the argument.

I want to go into this in a little more detail. First I’ll point out in fairness that there are people willing to argue the other side that don’t work for NetApp, in the same way that I don’t work for EMC. One of those is the other Preston – W. Curtis Preston, and you can read his articulate case here. I’m not going to spend this article going point for point against Curtis – it’s not the primary point of discussion I want to make in this entry.

Moving away from vendors and consultants, another and very interesting opinion, from the customer perspective, comes from Martin Glassborow’s Storagebod blog. Martin brings up some valid customer points – that being snapshot and replication represents extreme hardware lock-in. Some would argue that any vendor’s backup product represents vendor lock in as well, and this is partly right – though remember it’s not so difficult to keep a virtual machine around with the “last state” of the previous backup application available for recovery purposes. Keeping old and potentially obsolete NAS technology running to facilitate older recoveries after a vendor switch can be a little more challenging.

To get onto what I want to raise today, I need to revisit a previous topic as a means of further explaining my position. Let’s look for instance at my previous coverage of Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP). You can read the entire piece here, but the main point I want to focus on is my ILP ‘diagram’:

Components of ILP

One of the first points I want to make from that diagram is that I don’t exclude snapshots (and their subsequent replication) from an overall information lifecycle protection mechanism. Indeed, depending on the SLAs involved, they’re going to be practically mandatory. But, to use the analogy offered by the above diagram, they’re just pieces of the pie rather than the entire pie.

I’m going to extend my argument a little now, and go beyond just snapshots and replication, so I can elucidate the core reasons why I don’t like replicated snapshots as a permanent backup solution. Here’s a few other things I don’t like as a permanent backup solution:

  • VTLs replicated between a primary and disaster recovery site, with no tape out.
  • ADV_FILE (or other products disk backup solutions) cloned/duplicated between the primary and disaster recovery site, with no tape out.
  • Source based deduplication products with replication between two locations, with no tape out.

My fundamental objection in all of these solutions is the long term failure caused by keeping everything “online”. Maybe I’m a pessimist, but when I’m considering backup/recovery and disaster recovery solutions, I firmly believe that I’m being paid to consider all likely scenarios. I don’t personally believe in luck, and I won’t trust a backup/disaster recovery solution on luck either. The old Clint Eastwood quote comes to mind here:

You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?

When it comes to your data, no, no I don’t. I don’t feel lucky, I don’t encourage you to feel lucky. Instead I rely on solid, well protected systems with offline capabilities. Thus, I plan for at least some level of cascading failures.

It’s the offline component that’s most critical. Do I want all my backups for a year online, only online, even with replication? Even more importantly – do I want all your backups online, only online, even with replication? The answer remains a big fat no.

The simple problem with any solution that doesn’t provide for offline storage is that (in my opinion), it brings the risk of cascading failures into play too easily. It’s like putting all storage for your company on a single RAID-5 LUN and not having a hot spare. Sure you’re protected against that first failure, but it’s shortly after the first failure that Murphy will make an appearance in your computer room. (And I’ll qualify here: I don’t believe in luck, but I’ve observed over the years in many occasions that Murphy’s Law rules in computer rooms as well as in other places.) Or to put it another way: you may hope for the best, but you should plan for the worst. Let’s imagine a “worst case scenario”: a fire starts in your primary datacentre 10 minutes after upgrade work has commenced on the array that receives replicated snapshots in your disaster recovery runs into problems with firmware, leaving that array inaccessible until vendor upgrades are complete. Or worse again, it leaves storage corrupted.

Or if that seems too extreme, consider a more basic failure: a contractor near to your primary datacentre digs through the cables linking your production and disaster recovery sites, and it’s going to take 3 days to repair. Suddenly you’ve got snapshots and no replication. Just how lucky does that leave you feeling? Personally, I feel slightly naked and vulnerable when I have a single backup that’s not cloned. If suddenly none of my backups were getting duplicated, and I had no easy access to my clones, I’d feel much, much worse. (And that full body shiver I do from time to time would get very pronounced.)

Usually all this talk of a single instance failure frequently leads proponents of snapshots+replication only to suggest that a good design will see 3-way replication, so there’s always two backup instances. This doubles a lot of costs while merely moving the failure point just a jump to the left. On the other hand, offline backup where there’s the backup from today, the backup from yesterday, the backup from the day before … the backup from last week, the backup from last month, etc., all offline, all likely on different media – now that’s failure mitigation. Even if something happens and I can’t recover the most recent backup, in many recovery scenarios I can go back one day, two days, three days, etc. Oh yes, you can do that with snapshots too, but not if the array is a smoking pile of metal and plastic fused to the floor after a fire. In some senses, it’s similar to the old issue of trying to get away from cloning by backing up from the production site to media on the disaster recovery site. It just doesn’t provide adequate protection. If you’re thinking of using 3-way replication, why not instead have a solution that uses two entirely different types of data protection to mitigate against extreme levels of failure?

It’s possible I’ll have more to say on this in the coming weeks, as I think it’s important, regardless of your personal view point, to be aware of all of the arguments on both sides of the fence.

 

Every now and then the topic arises over whether snapshots are backups.

This is going through a resurgence at the moment, as NetApp has dropped development of their VTL systems, with some indications being that they’re going to revert to recommending people use snapshots and replication for backup.

So this raises the question again – is a snapshot a backup? I’ll start by quoting from my book here:

A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications, or operating systems that can be used for the purposes of recovery.

On the face of this definition, a snapshot is indeed a backup, and I’d agree that on a per-instance basis snapshots can act as backups. However, I’d equally argue that building your entire backup and recovery system on the basis of snapshots and replication is like building a house of cards on shifting sand in the face of an oncoming storm. In short, I don’t believe that snapshots and replication alone provide:

  1. Sufficient long-term protection.
  2. Sufficient long-term management.
  3. Sufficient long-term performance.

I’ll be the first to argue that in a system with high SLAs, having snapshots and/or replication is going to be almost a 100% requirement. You can’t meet a 1 hour data loss deadline if you only backup once every 24 hours – and backing up every hour using conventional backup systems is rarely appropriate (or rarely even works). So I’m not dismissing snapshots at all.

It’s easy to discuss the theoretical merits of using snapshots in lieu of backup/recovery software as a total backup system, but I think that the practical considerations quickly overcome any theoretical discussion. So let’s consider a situation though where you want to keep your backups for 6 months. (These days that’s a fairly short period.) Do you really want to keep 6 months of snapshots around? Let’s assume we keep hourly snapshots for 2 weeks, then one snapshot per day for the rest of the time. That’s 504 snapshots per system – in fact, normally per NAS filesystem. Say you’ve got 4 NAS units and 30 filesystems on each one – that’s around 60,000 snapshots over a course of 6 months.

What’s 60,000+ snapshots going to do to:

  • Primary production storage performance?
  • Storage and backup administrator management?
  • Storage costs?
  • Indexing costs?

The argument that snapshots and replication alone can replace a healthy enterprise backup system (or act in lieu of it) just doesn’t wash as far as I’m concerned. It looks good on paper to some, but on closer inspection, it’s a paper tiger. By all means within environments with heavy SLAs they’re likely to form part of the data protection solution, but they shouldn’t be the only solution.

 

A borked LaCie 2TB BigDisk Extreme has reminded me of the role of backup and recovery within disaster recovery itself. By disaster recovery, I mean total “system” failure, whether that system is an entire server, an entire datacentre, or in my case, a large drive.

What is the difference between a regular failure and a disaster? I think it’s one of those things that’s entirely the perspective of organisation or person who experiences it.

As for my current disaster, I’ve got a 2TB drive with just 34GB free. I’ve got up-to-date backups for this drive which I can restore from, and in the event of a catastrophe, I could actually regenerate the data, given that it’s all my media files. It’s also operational, so long as I don’t power it off again. (This time it took more than 30 minutes to become operational after a shutdown. It’s been getting worse and worse.)

So I’ve got a backup, I’ve got a way of regenerating the data if I have to, and my storage is still operational. Why is it a disaster? Here’s a few reasons, which I’ll then use to explain what makes for a disaster more generally, and why backup/recovery is only a small part of disaster recovery:

  1. I don’t have spares. Much as I’d love to have a 10 or 20TB array at home running on RAID-6 or something like that, I don’t have that luxury. For me, if a drive fails, I have to go out and buy a replacement drive. That’s budget – capital expenditure, if you will. What’s more, it’s usually unexpected capital expenditure.
  2. Not all my storage is high speed. Being a home user, a chunk of my storage is either USB-2 or FireWire 400/800. None of these formats offer blistering data transfer speeds. The 2TB drive is hooked up to Firewire 800, and I backup to Firewire 400, which means I’m bound to a maximum of around 30-35MB/s throughput for either running the backup or recovering from it.
  3. The failure constrains me. Until I get the drive replaced, I have to be particularly careful about any situation that would see the drive powered off.

So there’s three factors there that constitute a “disaster”:

  1. Tangible cost.
  2. Time to repair.
  3. Interruptive.

A regular failure will often have one or two of the above, but all three are needed to turn it into a disaster. This is why a disaster is highly specific to the location where it happens – it’s not any specific thing, but a combination of the situation, the impact locally and the required response that render a disaster from a failure.

There’s of course varying levels of disasters too, even at an individual level. Having a borked media drive is a disaster, but it’s not a “primary” disaster for me, because the core of what I do on my computer I can still get done. The same applies with corporations – it could be that losing both a primary fileserver and a manually controlled archive fileserver would constitute a “disaster”, but the first is always likely to be a far more serious disaster. That’s because it generates higher spikes in one or more of the factors – cost and interruption.

So, returning to the topic of the post – let’s consider why backup/recovery only forms a fraction of disaster recovery. When we consider a regular failure requiring recovery, it’s clear that the backup/recovery process forms not only the nexus to the activity, but likely the longest or most “costly” component (usually in terms of staff time).

In a disaster recovery situation, that’s no longer guaranteed to be the case. While the actual act of recovery is likely to take some time within a disaster recovery situation, there’s usually going to be a heap of other activities. There’ll be:

  • Personnel issues – getting human resources allocated to fixing the problem, and the impact of the failure on a number of people. Typically you don’t find (in a business world) that a disaster is something that only affects a single user within the organisation. It’s going to impact a significant number of workers – hence the tangible cost and the interruptive nature of them.
  • Fault resolution time – If you can seamlessly failover from an event, it’s unlikely it will be treated as a disaster. Sure, it may be a major issue, but a disaster is something that is going to take real time to fix. A disaster will see staff needing to work nigh-continuously in order to get the system operational. That will include:
    • Time taken to assess the situation,
    • Time taken to get replacement systems ready,
    • Time taken to recover,
    • Time taken to mop up/finalise access,
    • Time taken to repair original failure,
    • Time taken to revert services and
    • Time taken to report.
  • Post recovery exercises – in a good organisation, disaster recovery operations don’t just stop when the last byte of data has been recovered. As alluded to in the above bullet point, there needs to be a formal evaluation of the circumstances that lead up to the disaster, the steps required to rectify it, any issues that might have occurred, and plans to avoid it (or mitigate it) in future. For some staff, this exercise may be the longest part of the disaster recovery process.
  • Post disaster upgrades – if, as a result of the disaster and the post recovery exercises it’s determined that new systems must be put into place (e.g., adding a new cluster, or changing the way business continuity is handled), then it can be fairly stated that all of the work involved in such upgrades is still attributed to the original disaster recovery situation.

All of these factors (and many more – it will vary, site by site) lead to the inevitable conclusion that it’s insufficient to consider that disaster recovery is just a logical extension of a regular backup and recovery process. It’s far more interruptive. It’s more costly in terms of either direct staff time or a variety of other factors, and it’s far more interruptive – both to individuals within the organisation, and the organisation as a whole.

As such, the response to a disaster recovery situation should not be driven directly by the IT department. IT of course will play a valuable and critical role in the recovery process, but the response must be driven by a team with oversight against all affected areas, and the post-recovery processes must equally be driven by a team whose purdue extends beyond just the IT department.

We can’t possibly prepare for every disaster. To do so would require unlimited budget and unlimited resources. (It would also be reminiscent of the Brittas Empire.)

Instead, what we can plan for is that disasters will, inevitably happen. By acknowledging that there is always a risk of a disaster, organisations can prepare for them by:

  • Determining “levels” of disaster – quantifying what tier of disaster a situation will be by say, percentage of affected employees, loss of ability to perform primary business functions, etc.
  • Determining role based involvement in disaster response teams for each of those levels of disaster.
  • Determining procedures for:
    • Communication throughout the disaster recovery process.
    • Activating disaster response teams.
    • Documenting the disaster.
    • Reporting on the disaster.
    • Post-disaster meetings.

Good preparation of the above will not mitigate a disaster, but it’ll at least considerably reduce the risk of a disaster becoming a complete catastrophe.

Don’t just assume that disaster recovery is a standard backup and recovery process. It’s not – not by a long shot. Making this assumption puts the business very much at risk.

 

I thought it about time that I cited the two key reasons why, if faced with a choice between NetWorker and NetBackup, I would choose NetWorker every time.

As you might expect, given my focus on backup as insurance, both of these reasons are firmly focused on recovery. In fact, so much so that I still don’t really understand why EMC doesn’t go to market with these points time and time and time again and just smack Symantec around until it’s blue in the face and begging for mercy.

Reason 1: NetBackup does not implement backup dependencies

I struggle to call NetBackup an “enterprise” backup product because of this simple fact. Honestly, backup dependencies are critically important when it comes to guaranteeing anything but last-backup recoverability.

What does this mean?

In short, as soon as a backup hits its retention period in NetBackup, it’s toast – it’s a goner.

Irrespective of whether there are any backups of the same filesystem/data set that requires the “outside retention” backup for recovery purposes.

I can’t sum this up any other way: in a backup product, I see this as recklessly irresponsible. It provides a focus on media savings that even the most miserly bean cruncher would admire. Well, until the bean cruncher’s system can’t be recovered from 6 weeks ago to fulfil audit requirements.

Reason 2: True Image Recovery is “optional”

If you’ve grown up in a NetWorker world, where the emphasis has always been, and will always continue to be on recovery, this will, like the reason above, make you soil yourself. Imagine having a full backup plus six incremental backups of a directory, and wanting to recover the filesystem from last night. Now imagine just selecting the full plus the incrementals for recovery and getting back everything generated during that time.

Even the files that had been deleted between backups. I.e., you don’t get back what the filesystem looked like at the time of the backup that you’re recovering from, but what it looked like for every backup that you’re recovering from.

NetWorker, once, in the 5.5.x stream implemented this. It was called a BUG. In NetBackup, it’s a “feature”. In order to enable a correct recovery, you have to turn on “true image recovery”, something that takes extra resources, and is typically advised  that you keep the data just for a small cycle (e.g., 7 days) rather than the complete retention time for the backups.

There’s another word for this: Joke.

On another front…

As recently as December I mentioned that I wished EMC would get their act together and implement inline cloning – one of the few things where I saw that NetBackup had a distinct competitive advantage over NetWorker.

Maybe it was the glow of the cider, but I had an epiphany in Copacabana on a hill watching (probably illegal) fireworks in Avoca and Terrigal on new years eve. Inline cloning is no longer a compelling factor in a backup product. Why? Media streaming speeds have reached a point where companies with serious amounts of data just should not be implementing direct-to-tape backup solutions any more. Inline cloning was developed at a time when you’d want to generate both sets of tapes as quickly as possible, but only companies with very small data sets will find themselves not backing up to some disk unit first (be it say, ADV_FILE, or VTL, in NetWorker), and those companies won’t be constrained on backup/clone windows to a point where they’d need inline cloning anyway.

When not backing up direct-to-tape, there are several factors that mitigate the need to do inline cloning. In organisations with a very strong need for offsiting, there’s replication at a VTL or disk backup unit layer. In organisations that just need a second copy generated “as soon as possible”, doing disk/virtual tape to physical tape cloning following the backup should be fast enough to handle the cloning at appropriate performance levels.

In other words: there’s no need for EMC to implement inline cloning. As a technology, it’s a dead-end from a tape-only time. I feel somewhat silly this didn’t occur to me sooner.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

AA is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
BB is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
CC is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
DD is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
EE is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
FF is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
GG is for Growth. Your backup environment should be scoped to handle at least 2 years growth upon implementation. If it isn't, budgets haven't been established correctly.
HH is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
II is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
JJ is for Jeckyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jeckyll might – not storm through them like Mr Hyde, flying apart.
KK is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
LL is for Logs. Treasure your logs. Don't throw them away too quickly, make sure they're backed up too. With access to your logs, you can answer in 3 years time why a backup from yesterday is proving problematic to recover from.
MM is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
NN is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
OO is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
PP is for Procedures; without them, you don't have a backup system at all.
QQ is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
RR is for Recovery, the most important facet of data protection.
SS is for SLAs. (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
TT is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
UU is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
VV is for VTL. Whether you like the need for them or not, they're not going away any time soon.
WW is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet you recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
XX is for X-Ray. If you can't X-Ray your backup status, drill down and see how happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
YY is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
ZZ is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short, nor succinct, yet hopefully useful none-the-less.

 

One of the areas where administrators have been rightly able to criticise NetWorker has been the lack of reporting or auditing options to do with recoveries. While some information has always been retrievable from the daemon logs, it’s been only basic and depends on keeping the logs. (Which you should of course always do.)

NetWorker 7.6 however does bring in recovery reporting, which starts to rectify those criticisms. Now in the enterprise reporting section, you’ll find the following section:

  • NetWorker Recover
    • Server Summary
    • Client Summary
    • Recover Details
    • Recover Summary over Time

Of these reporting options, I think the average administrator will want the bottom two the most, unless they operate in an environment where clients are billed for recoveries.

Let’s look at the Recover Summary over Time report:

Recover summary over time

This presents a fairly simple summary of the recoveries that have been done on a per-client basis, including the number of files recovered, the amount of data recovered and the breakdown of successful vs failed recovery actions.

I particularly like the Recover Details report though:

Recover Details report

(Click the picture to see the entire width.)

As you can see there, we get a per user breakdown of recovery activities, when they were started, how long they took, how much data was recovered, etc.

These reports are a brilliant and much needed addition to NetWorker reporting capabilities, and I’m pleased to see EMC has finally put them into the product.

There’s probably one thing still missing that I can see administrators wanting to see – file lists of recovery sessions. Hopefully 7.(6+x) would see that report option though.

 

You might think, given that I wrote an article awhile ago about the Procedural Obligations of Backup Administrators that it wouldn’t be necessary to explicitly spell out any recovery rules – but this isn’t quite the case. It’s handy to have a “must follow” list of rules for recovery as well.

In their simplest form, these rules are:

  1. How
  2. Why
  3. Where
  4. When
  5. Who

Let’s look at each one in more detail:

  1. How – Know how to do a recovery, before you need to do it. The worst forms of data loss typically occur when a backup routine is put in place that is untried on the assumption that it will work. If a new type of backup is added to an environment, it must be tested before it is relied on. In testing, it must be documented by those doing the recovery. In being documented, it must be referenced by operational procedures*.
  2. Why – Know why you are doing a recovery. This directly affects the required resources. Are you recovering a production system, or a test system? Is it for the purposes of legal discovery, or because a database collapsed?
  3. Where – Know where you are recovering from and to. If you don’t know this, don’t do the recovery. You do not make assumptions about data locality in recovery situations. Trust me, I know from personal experience.
  4. When – Know when the recovery needs to be completed by. This isn’t always answered by the why factor – you actually need to know both in order to fully schedule and prioritise recoveries.
  5. Who – Know who requested the recovery is authorised to do so. (In order to know this, there should be operational recovery procedures – forms and company policies – that indicate authorisation.)

If you know the how, why, where, when and who, you’re following the golden rules of recovery.


* Or to put it another way – documentation is useless if you don’t know it exists, or you can’t find it!

 

There’s a Top 10 Reasons marketing document from EMC now explaining why you should be using NetWorker.

While obviously this is a marketing document, it has one serious flaw above and beyond any issues that people may have over standard marketing documents. It puts recovery at the wrong numbered item.

The first reason, according to the document, is about driving costs out of the environment. Sure, that’s a reason for deployment of an integrated solution, but it’s not the number 1 reason. The second reason is apparently due to closed backup windows. That is a good second reason, but it doesn’t have priority over the real number one reason.

The third reason they cite on the document is:

If your backups don’t recover, neither will your business.

NetWorker is the leader in recovery performance – up to 100 percent faster than the alternatives. It delivers fast, secure, and reliable data and server recovery to ensure you meet your required service levels.

Who in their right mind would put this at #3 on a list of reasons? Backup is recovery – or rather, it’s all about recovery. Whoever put this marketing document together was wrong on one very, key point:

Recovery is and has always been Number One.

(If you think that’s wrong, just go talk to all the Sidekick users…)

 

In environments with satellite offices, a common “backup” technique described in a lot of situations is what I’d generically call a “trickle backup” technique. This uses either some form of asynchronous replication (either block or file), or some other state of file/block deduplication, etc., to achieve very small backups (after the first) back to a central site.

These are inevitably done for one or both of the following two reasons:

  1. Staff at the satellite office do not have the technical skills to manage local media or backup storage nodes/media servers.
  2. The WAN bandwidth is too little (or too costly) to do full-scale backups.

Remembering my definition of recoverable in The 7 Procedural Obligations of Backup Administrators, I’d like to suggest that for most situations, there’s no such thing as a trickle recovery. To reiterate, my definition of recoverable is:

  1. The item that was backed up can be retrieved from the backup media.
  2. The item that is retrieved from the backup media is usable as a replacement to the data that was backed up.
  3. The item can be retrieved within the required window.

Trickle backups, if not considered properly, cease to be valid backups if they violate item 3 above.

Whenever trickle backups are considered, there must be rigorous planning conducted (including discussions with appropriate stakeholders) to determine what recovery methods will be considered valid. Let’s discuss briefly what might need to be considered:

  1. Will individual file level recovery back across the WAN connection be possible?
  2. What will be the maximum amount of data that can be recovered across the WAN connection?
  3. How will larger data recoveries be facilitated?
  4. How will complete system recoveries be facilitated?
  5. Can all recoveries be completed within SLAs?
  6. Are HR and IT policies in place to prevent situations where satellite office recovery requirements may be abandoned or delayed due to staff shortages or local workloads?

If all 6 of those questions don’t have answers that are compatible with SLAs and business requirements, then there is no valid satellite backup system in place.

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha