Disaster Movies: A Model for Data Protection

So, the other night, my husband and I finally got around to watching San Andreas, the 2015 movie featuring Dwayne Johnson. I’d heard middling reviews for it at best, but it had two things going for it: Dwayne Johnson, and it was there on Netflix when I couldn’t think of anything else to watch.

San Andreas is, without a doubt, a typical disaster movie in the same vein as Armageddon (1998), 2012 (2009), Volcano (1997), Deep Impact (1998), Poseidon (2006), Sunshine (2007), and a myriad of other disaster movies, new and old.

Disaster strikes!

Disaster movies follow a classic convention where:

  • Something bad is going to happen
  • Something bad happens
  • People who are reeling from the disaster have something else bad happen to them
  • When they’re recovering from that they have something else bad happen to them
  • (Ad infinitum.)
  • Eventually they escape the disaster. (Usually.)

It struck me when watching San Andreas the other night that disaster movies have a strong parallel to data protection. Why? Because it’s not the initial kick that does you in, it’s the cascading problems.

Cascading problems. If the universe were fair we’d only have to deal with one problem at a time, but the universe runs to its own schedule, and there’s a reason why Murphy’s Law resonates with so many people. (Or maybe if the universe were fair, we’d have no problems at all, but that’s another story for another time.)

While it’s easy to forget, we’ve often recognised the risks of cascading problems. Take RAID-6 for example: Someone (or perhaps many someones), at some point, looked at RAID-5 and said, “You know what would really suck? Having a drive fail when we’re trying to rebuild from a drive failure.” Then, Shazam! RAID-6 was born. (Fast voiceover: Actual real sequence of events may differ from description. Always consult your data protection professional before switching from RAID-6 to RAID-5.)
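To make the RAID example concrete, here’s a back-of-envelope sketch of the risk RAID-6 was built to address: the chance that a second drive fails while a RAID-5 rebuild is still running. All the figures (group size, rebuild window, failure rate) are illustrative assumptions, not vendor data.

```python
# Rough estimate of the probability that a second drive fails during
# a RAID-5 rebuild window. Uses a simple small-probability approximation:
# P ≈ (surviving drives) × (rebuild hours) × (hourly failure rate).

def p_failure_during_rebuild(surviving_drives: int,
                             rebuild_hours: float,
                             annual_failure_rate: float) -> float:
    """Approximate probability that any surviving drive fails
    while the rebuild is still in progress."""
    hourly_rate = annual_failure_rate / (365 * 24)
    return surviving_drives * rebuild_hours * hourly_rate

# Assumed figures: 8-drive RAID-5 group (7 survivors after one failure),
# a 24-hour rebuild, and a 2% annual failure rate per drive.
p = p_failure_during_rebuild(surviving_drives=7,
                             rebuild_hours=24,
                             annual_failure_rate=0.02)
print(f"P(second failure during rebuild) ~ {p:.4%}")
```

A fraction of a percent per rebuild sounds small, but multiplied across hundreds of RAID groups and years of operation, it’s exactly the kind of cascading failure you don’t want to discover you hadn’t planned for.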

One of the fundamental aspects of data protection planning (not just disaster recovery planning, although it’s critical there too) is to seriously think about the impact of cascading failures.

Effectively, by considering the risk of cascading failures, you’re opening your data protection process to contingency planning. There’s a limit to the number of cascading failures you can plan for — and by extension, the amount of contingency planning you can do. I’d suggest, for instance, that if you’re at the point of trying to plan for recovering your SharePoint farm after an extinction-level event, you’re probably over-thinking it.

But there’s a level to which you have to think it through, and I’d suggest keeping two numbers in mind as a starting point: 2 and 3.

For non-critical data — anything that the business can afford to lose, but would rather not — you can probably focus on planning for 2 cascaded failures. For example: a file needs to be recovered. That’s not a failure in itself — we’d think of that as the trigger event.

If we’re talking non-critical data, then you want your data protection to at least handle cascading failures such as:

  • The primary backup target is inaccessible, and
    • Redundancy has been activated for the secondary backup target

So if we’re talking say, NetWorker and cloned backups, this means that you want to be able to recover data even when:

  • Your original copy is offline (you might be doing a DD OS upgrade just at the time the recovery request comes through), and
    • A drive has failed in your DD at the remote site.

If you’re still dealing with tape, it would equally apply:

  • The tape library at your primary datacenter is experiencing a fault, and
    • There’s a failed tape drive in the secondary datacenter.

But what about 3? Well, that’s what you need to think of for your production, or critical data:

  • Your primary site is inaccessible because someone pressed the “computer room shutdown” button instead of the “computer room exit” button.
    • So, you need to run the recovery from a failover backup server, and a failover backup/clone target — i.e., you have two components that have ‘failed’ straight away
    • Your secondary site Data Domain experienced a drive failure overnight (your third cascaded failure) and there’s a Dell EMC technician signing themselves in at security to replace the drive.
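The counting exercise above can be sketched as a toy contingency check: list each copy of the data and the components it depends on, then verify that at least one copy stays recoverable for every possible combination of simultaneous component failures. The copy and component names below are invented for illustration, not drawn from any particular product.

```python
# Toy contingency check: given backup copies and the components each
# depends on, test whether the data stays recoverable when any N
# components fail at once. Names are hypothetical.
from itertools import combinations

# A copy is recoverable only if all of its components are still up.
copies = {
    "primary-backup": {"primary-site", "backup-server", "primary-dd"},
    "clone-copy":     {"secondary-site", "failover-backup-server", "secondary-dd"},
    "cyber-vault":    {"vault-site", "vault-appliance"},
}

all_components = set().union(*copies.values())

def survives(n_failures: int) -> bool:
    """True if at least one copy remains recoverable for every
    possible combination of n_failures simultaneous failures."""
    for failed in combinations(all_components, n_failures):
        failed = set(failed)
        if not any(deps.isdisjoint(failed) for deps in copies.values()):
            return False
    return True

print("survives any 2 failures:", survives(2))
print("survives any 3 failures:", survives(3))
```

With three fully independent copies, any two simultaneous failures are survivable, but a worst-case three (one component from each copy) still takes everything out — which is why real contingency planning weighs which specific cascades are credible, not just the raw count of failures.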

As I mention in my books, this is all risk vs cost, in the same way that an x-9s availability strategy for business-critical systems is. Sure, you might want to achieve nine-9s availability (99.9999999% available), but beyond five-9s, the cost of each additional 9 can increase well beyond what the conditions it protects against justify — and well beyond what you’re willing to pay.
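The arithmetic behind the “nines” makes the cost curve obvious: each extra 9 cuts the allowed downtime by a factor of ten. A quick sketch:

```python
# Maximum yearly downtime for a given number of "nines" of availability.
# E.g. 3 nines = 99.9% available, 5 nines = 99.999%, and so on.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(nines: int) -> float:
    """Maximum downtime per year (in minutes) at `nines` nines."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 5, 9):
    print(f"{n}-9s: {downtime_minutes(n):.4f} minutes/year")
```

Three-9s allows about 8.8 hours of downtime a year; five-9s allows just over five minutes; nine-9s allows roughly three hundredths of a second — and chasing that last number is where the cost curve goes vertical.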

2 and 3 are good starting points, but they’re not the end of it. Consider banks and other financial institutions — there’s increasing focus on making sure they have a cyber-recovery solution: an airgapped (physical or electronic) tertiary copy of their critical data. For that sort of data, there’s a recognised additional layer of contingency planning, and in some instances cascading failures are catered for by cyber-recovery in ways you just won’t get in a classic two-datacentre configuration.

So here’s my tip: if you want to get your infrastructure team, your disaster recovery team, or your business continuity team to effectively think about contingency planning, plan for a long meeting, order some popcorn and snacks, and instead of doing a boring “This.Is.Our.Agenda.Today”, play a disaster movie to get everyone thinking about cascading failures.

When the end-credits roll, you’re ready to talk about the contingency planning that’ll help you survive multiple failures at once.


Hey, if you’re reading this before 31 July 2019, there’s a book giveaway competition here — check it out!
