A borked LaCie 2TB BigDisk Extreme has reminded me of the role of backup and recovery within disaster recovery itself. By disaster recovery, I mean total “system” failure, whether that system is an entire server, an entire datacentre, or in my case, a large drive.
What is the difference between a regular failure and a disaster? I think it’s one of those things that’s entirely down to the perspective of the organisation or person who experiences it.
As for my current disaster, I’ve got a 2TB drive with just 34GB free. I’ve got up-to-date backups for this drive which I can restore from, and in the event of a catastrophe, I could actually regenerate the data, given that it’s all my media files. It’s also operational, so long as I don’t power it off again. (This time it took more than 30 minutes to become operational after a shutdown. It’s been getting worse and worse.)
So I’ve got a backup, I’ve got a way of regenerating the data if I have to, and my storage is still operational. Why is it a disaster? Here are a few reasons, which I’ll then use to explain what makes for a disaster more generally, and why backup/recovery is only a small part of disaster recovery:
- I don’t have spares. Much as I’d love to have a 10 or 20TB array at home running on RAID-6 or something like that, I don’t have that luxury. For me, if a drive fails, I have to go out and buy a replacement drive. That’s budget – capital expenditure, if you will. What’s more, it’s usually unexpected capital expenditure.
- Not all my storage is high speed. Being a home user, a chunk of my storage is either USB 2.0 or FireWire 400/800. None of these interfaces offer blistering data transfer speeds. The 2TB drive is hooked up to FireWire 800, and I back up to FireWire 400, which means I’m bound to a maximum of around 30-35MB/s throughput for either running the backup or recovering from it (see the back-of-envelope numbers after this list).
- The failure constrains me. Until I get the drive replaced, I have to be particularly careful about any situation that would see the drive powered off.
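To put those transfer speeds into perspective, here’s a quick back-of-envelope sketch of how long a full restore of the media drive could take over FireWire 400. The capacity and throughput figures are just the approximate numbers mentioned above, not measurements:

```python
# Rough restore-time estimate for a near-full 2TB drive over FireWire 400.
# Both the data size and the 30-35MB/s sustained throughput are the
# approximate figures assumed above, not measured values.

data_mb = 2_000_000  # ~2TB of media files, expressed in MB

for throughput_mb_s in (30, 35):
    hours = data_mb / throughput_mb_s / 3600
    print(f"At {throughput_mb_s}MB/s: roughly {hours:.0f} hours to restore")
```

In other words, even with a perfectly good, up-to-date backup, simply pulling the data back is somewhere in the order of 16-19 hours of continuous transfer.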
So there are three factors there that constitute a “disaster”:
- Tangible cost.
- Time to repair.
- Interruption.
A regular failure will often have one or two of the above, but all three are needed to turn it into a disaster. This is why a disaster is highly specific to the location where it happens – it’s not any single thing, but the combination of the situation, the local impact and the required response that turns a failure into a disaster.
There are, of course, varying levels of disaster too, even at an individual level. Having a borked media drive is a disaster, but it’s not a “primary” disaster for me, because I can still get the core of what I do on my computer done. The same applies to corporations – losing either a primary fileserver or a manually controlled archive fileserver might constitute a “disaster”, but the first is always likely to be the far more serious one. That’s because it generates higher spikes in one or more of the factors – particularly cost and interruption.
So, returning to the topic of the post – let’s consider why backup/recovery only forms a fraction of disaster recovery. When we consider a regular failure requiring recovery, it’s clear that the backup/recovery process forms not only the nexus of the activity, but likely also the longest or most “costly” component (usually in terms of staff time).
In a disaster recovery situation, that’s no longer guaranteed to be the case. While the actual act of recovery is likely to take some time within a disaster recovery situation, there’s usually going to be a heap of other activities. There’ll be:
- Personnel issues – getting human resources allocated to fixing the problem, and dealing with the impact of the failure on a number of people. Typically you don’t find (in the business world) that a disaster is something that only affects a single user within the organisation. It’s going to impact a significant number of workers – hence the tangible cost and the interruptive nature of these events.
- Fault resolution time – if you can seamlessly fail over from an event, it’s unlikely it will be treated as a disaster. Sure, it may be a major issue, but a disaster is something that is going to take real time to fix. A disaster will see staff needing to work nigh-continuously in order to get the system operational. That will include the following (there’s a rough tally of how these can add up after this list):
- Time taken to assess the situation,
- Time taken to get replacement systems ready,
- Time taken to recover,
- Time taken to mop up/finalise access,
- Time taken to repair original failure,
- Time taken to revert services and
- Time taken to report.
- Post recovery exercises – in a good organisation, disaster recovery operations don’t just stop when the last byte of data has been recovered. As alluded to in the above bullet point, there needs to be a formal evaluation of the circumstances that led up to the disaster, the steps required to rectify it, any issues that might have occurred, and plans to avoid it (or mitigate it) in future. For some staff, this exercise may be the longest part of the disaster recovery process.
- Post disaster upgrades – if, as a result of the disaster and the post recovery exercises, it’s determined that new systems must be put into place (e.g., adding a new cluster, or changing the way business continuity is handled), then it can be fairly stated that all of the work involved in such upgrades is still attributable to the original disaster recovery situation.
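To illustrate just how small a slice of the overall effort the actual restore can be, here’s a hypothetical tally of the phases listed above. Every duration below is an invented, illustrative figure; the point is the proportion, not the numbers themselves:

```python
# Hypothetical elapsed-time breakdown for a disaster recovery exercise.
# All figures are illustrative assumptions; the restore itself is only
# one component among many.

phases_hours = {
    "assess the situation": 4,
    "prepare replacement systems": 12,
    "recover from backup": 16,       # the actual backup/recovery step
    "mop up / finalise access": 4,
    "repair the original failure": 24,
    "revert services": 6,
    "report and review": 8,
}

total_hours = sum(phases_hours.values())
restore_share = phases_hours["recover from backup"] / total_hours

print(f"Total effort: {total_hours} hours")
print(f"Recovery itself: {restore_share:.0%} of the total")
```

With numbers like these (and they will vary wildly from site to site), the restore is only around a fifth of the total effort, before any post-disaster upgrades are even counted.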
All of these factors (and many more – it will vary, site by site) lead to the inevitable conclusion that it’s insufficient to consider disaster recovery as just a logical extension of a regular backup and recovery process. It’s more costly in terms of direct staff time and a variety of other factors, and it’s far more interruptive – both to individuals within the organisation and to the organisation as a whole.
As such, the response to a disaster recovery situation should not be driven directly by the IT department. IT will of course play a valuable and critical role in the recovery process, but the response must be driven by a team with oversight across all affected areas, and the post-recovery processes must equally be driven by a team whose purview extends beyond just the IT department.
We can’t possibly prepare for every disaster. To do so would require unlimited budget and unlimited resources. (It would also be reminiscent of the Brittas Empire.)
Instead, what we can plan for is that disasters will inevitably happen. By acknowledging that there is always a risk of a disaster, organisations can prepare for them by:
- Determining “levels” of disaster – quantifying what tier of disaster a situation represents by, say, the percentage of affected employees, the loss of ability to perform primary business functions, etc. (there’s a minimal sketch of what quantified tiers might look like after this list).
- Determining role based involvement in disaster response teams for each of those levels of disaster.
- Determining procedures for:
- Communication throughout the disaster recovery process.
- Activating disaster response teams.
- Documenting the disaster.
- Reporting on the disaster.
- Post-disaster meetings.
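As a minimal sketch of what quantified disaster tiers and role-based response teams might look like in practice (the tier thresholds and role names below are entirely hypothetical examples, not recommendations):

```python
# Hypothetical disaster-tier definitions: each tier pairs trigger criteria
# with the roles that make up its response team. Thresholds and role names
# are invented examples only.

DISASTER_TIERS = {
    "tier 1": {
        "criteria": {
            "min_affected_staff_pct": 50,    # at least 50% of staff impacted
            "primary_functions_lost": True,  # core business functions down
        },
        "response_team": [
            "executive sponsor", "IT operations lead",
            "communications lead", "affected business unit heads",
        ],
    },
    "tier 2": {
        "criteria": {
            "min_affected_staff_pct": 10,
            "primary_functions_lost": False,
        },
        "response_team": ["IT operations lead", "affected team leads"],
    },
}

def classify(affected_staff_pct: float, primary_functions_lost: bool) -> str:
    """Return the first matching tier, or 'regular failure' if none applies."""
    for name, tier in DISASTER_TIERS.items():
        criteria = tier["criteria"]
        if (affected_staff_pct >= criteria["min_affected_staff_pct"]
                and (primary_functions_lost or not criteria["primary_functions_lost"])):
            return name
    return "regular failure"

print(classify(affected_staff_pct=60, primary_functions_lost=True))   # tier 1
print(classify(affected_staff_pct=15, primary_functions_lost=False))  # tier 2
```

The value of writing tiers down like this isn’t the code; it’s that the thresholds, and the roles activated at each tier, are agreed before a disaster happens rather than argued about during one.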
Good preparation of the above will not prevent a disaster, but it’ll at least considerably reduce the risk of a disaster becoming a complete catastrophe.
Don’t just assume that disaster recovery is a standard backup and recovery process. It’s not – not by a long shot. Making this assumption puts the business very much at risk.