micromanual for nsradmin

Join hundreds of others and download the NetWorker Power User's micromanual for nsradmin. Check this blog article for details.

Enterprise Systems Backup and Recovery

If you have an interest in, or work in data protection/backup and recovery environments, you should check out my book, Enterprise Systems Backup and Recovery: A Corporate Insurance Policy. Designed for system administrators and managers alike, it focuses on features, policies, procedures and the human element to ensuring that your company has a suitable and working backup system.

 

July 2009
M T W T F S S
« Jun   Aug »
 12345
6789101112
13141516171819
20212223242526
2728293031  

Archives

Of cascading failures and the need to test

Over at Daily WTF, there’s a new story that has two facets of relevance to any backup administrator. Titled “Bourne Into Oblivion“, the key points for a backup administrator are:

  • Cascading failures.
  • Test, test, test.

In my book, I discuss the both the implications of cascading failures, and the need to test within a backup environment. Indeed, my ongoing attitude is that if you want to assume something about an untested backup, assume it’s failed. (Similarly, if you want to make an assumption about an unchecked backup, assume it failed too.)

While normally in backup, cascading failures come down to situations such as “the original failed, and the clone failed too”, this article points out a more common form of data loss through cascading failures –  the original failure coupled with backup failure.

In the article, a shell script includes the line:

rm -rf $var1/$var2

Any long-term Unix user will shudder to think of what can happen with the above script. (I’d hazard a guess that a lot of Unix users have themselves written scripts such as the above, and suffered the consequences. What we can hope for in most situations is that we do it on well backed up personal systems rather than corporate systems with inadequate data protection!)

Something I’ve seen in several sites however is the unfortunate coupling of the above shell script with the execution of said script on a host that has read/write network mounted a host of other filesystems across the corporate network. (Indeed, the first system administration group I ever worked with told me a horror story about a script with a similar command run from a system with automounts enabled under /a.)

The net result in the story at Daily WTF? Most of a corporate network wiped out by a script run with the above command where a new user hadn’t populated either $var1 or $var2, making the script instead:

rm -rf /

You could almost argue that there’s already been a cascading failure in the above – allowing scripts to be written that have the potential for that much data loss and allowing said scripts to be run on systems that mount many other systems.

The true cascading failure however was that the backup media was unusable, having been repeatedly overwritten rather than replaced. Whether this meant that the backups ran after the above incident, or that the backups couldn’t recover all required data (e.g., running an incremental on top of a tape with a previous incremental on top of a tape with a previous full, each time overwriting the previous data), or that the tapes were literally unusable due to high overuse (or indeed, all 3), the results were the same – data loss coupled with recovery loss.

With backups not being tested periodically, such errors (in some form) can creep into any environment. Obviously in the case in this article, there’s also the problem that either (a) procedures were not established regarding rotation of media or (b) procedures were not followed.

The short of it: any business that collectively thinks that either formalisation of backup processes or the rigorous checking of backups is unnecessary is just asking for data loss.

No related posts.

Related posts brought to you by Yet Another Related Posts Plugin.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>