Of cascading failures and the need to test

Over at Daily WTF, there’s a new story with two facets of relevance to any backup administrator. Titled “Bourne Into Oblivion”, its key points are:

  • Cascading failures.
  • Test, test, test.

In my book, I discuss both the implications of cascading failures and the need to test within a backup environment. Indeed, my ongoing attitude is that if you want to assume something about an untested backup, assume it’s failed. (Similarly, if you want to make an assumption about an unchecked backup, assume it failed too.)

While cascading failures in backup normally come down to situations such as “the original failed, and the clone failed too”, this article points out a more common form of data loss through cascading failures – the original failure coupled with a backup failure.

In the article, a shell script includes the line:

rm -rf $var1/$var2

Any long-term Unix user will shudder to think of what can happen with the above script. (I’d hazard a guess that a lot of Unix users have themselves written scripts such as the above, and suffered the consequences. The best we can hope for in most situations is that we do it on well-backed-up personal systems rather than on corporate systems with inadequate data protection!)

Something I’ve seen at several sites, however, is the unfortunate coupling of the above sort of shell script with its execution on a host that has read/write network mounts of many other filesystems across the corporate network. (Indeed, the first system administration group I ever worked with told me a horror story about a script with a similar command run from a system with automounts enabled under /a.)

The net result in the story at Daily WTF? Most of a corporate network wiped out by a script run with the above command, where a new user hadn’t populated either $var1 or $var2, turning the command into:

rm -rf /
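
That expansion is easy to demonstrate harmlessly by echoing the command rather than running it. The sketch below is purely illustrative: the variable names match the story, and the empty values are the assumed failure mode.

#!/bin/sh
# Illustrative only: print, rather than run, the dangerous command,
# with both variables left empty (the assumed worst case).
var1=""
var2=""
echo "Would run: rm -rf $var1/$var2"
# Output: Would run: rm -rf /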

You could almost argue that there’s already been a cascading failure in this scenario – allowing scripts to be written that have the potential for that much data loss, and allowing said scripts to be run on systems that mount filesystems from many other hosts.
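
None of this excuses the script itself, of course. A few standard shell defences would have stopped the expansion dead; the following is a minimal sketch, assuming the same hypothetical variable names from the story:

#!/bin/sh
# Minimal defensive sketch (hypothetical variable names).
set -u                               # error out on any reference to an unset variable
: "${var1:?var1 is unset or empty}"  # abort with a message if var1 is unset or empty
: "${var2:?var2 is unset or empty}"
target="$var1/$var2"
if [ "$target" = "/" ]; then         # belt-and-braces: never operate on the root filesystem
    echo "refusing to run rm -rf on /" >&2
    exit 1
fi
rm -rf "$target"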

The true cascading failure, however, was that the backup media was unusable, having been repeatedly overwritten rather than replaced. Whether this meant that the backups ran again after the above incident, or that the backups couldn’t recover all required data (e.g., an incremental written over a tape holding the previous incremental, which in turn had been written over the previous full, each pass overwriting the data before it), or that the tapes were literally worn out from overuse (or indeed, all three), the result was the same – data loss coupled with recovery loss.

When backups aren’t tested periodically, such errors (in some form) can creep into any environment. Obviously, in the case in this article, there’s also the problem that either (a) procedures regarding media rotation were never established, or (b) the procedures were not followed.
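
Periodic testing doesn’t have to be elaborate to be worthwhile. As a purely illustrative sketch – the paths, the tar format and the report location are all assumptions, not a description of any particular backup product – even restoring the most recent backup into scratch space and comparing it against the source will catch unreadable media and silently failing jobs:

#!/bin/sh
# Illustrative restore-test sketch; paths and tar usage are assumptions.
BACKUP=/backups/home-latest.tar.gz   # hypothetical most recent backup
SOURCE=/export/home                  # hypothetical data it protects
SCRATCH=$(mktemp -d /var/tmp/restoretest.XXXXXX)

# Restore the backup into scratch space, never over live data.
tar -xzf "$BACKUP" -C "$SCRATCH" || { echo "restore failed" >&2; exit 1; }

# Compare the restored copy against the live data; some differences are expected
# from normal change, but the test proves the media can be read and restored.
diff -r "$SOURCE" "$SCRATCH" > /var/tmp/restoretest.report 2>&1
echo "restore test complete; see /var/tmp/restoretest.report"

rm -rf "$SCRATCH"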

The short of it: any business that collectively thinks that either formalisation of backup processes or the rigorous checking of backups is unnecessary is just asking for data loss.
