Start with “Why?”

Introduction

I admit, I’m riffing on Simon Sinek’s Start with Why for the title of this blog post, but I’m going to be talking about the different services we often lump together as ‘backup and recovery’. Lumping these services together makes them convenient to discuss, but inconvenient to architect and expensive to implement.

Why do we do data protection?

Collectively we like to compartmentalise problems, reducing them to the fewest problems possible. I get that; I do it myself as much as possible. But lumping different things under “backup and recovery” represents too much consolidation of our purposes, so I want to step back and revisit a question I’ve asked in the past:

Why do we do data protection?

or, more specifically,

Why do we back up?

Particularly for the “why do we back up?” question, I bet most people will answer something along the lines of “so we can recover when necessary”. I know I would have in the past. But this isn’t really the right answer. It’s a good one, but it’s not the right one. To answer it correctly, we need to split backup and recovery into the three things it usually forms part of, which lets us better plan and architect our data protection. Those three things are:

  • Operational Recovery
  • Disaster Recovery
  • Compliance

A backup and recovery system may functionally provide those services (or elements thereof), but they are all different services. And for as long as we treat them as a single service, we get a Frankenstein’s monster as our solution. Now, in this, I’m talking exclusively of primary use cases. We can do all sorts of other things with the data we’re backing up, like populating data lakes, doing testing against instant access images, and so on, but they’re all secondary use cases. I want to tackle the primary use cases, because (surprise, surprise) they’re what get the budget. So, let’s step back and ask why. Specifically:

  • Why do we do operational recovery?
  • Why do we do disaster recovery?
  • Why do we do long term retention?

Spoiler alert: none of them are for the same reason.

Why do we do operational recovery?

We do operational recovery because something bad has happened. We don’t actually know the precise nature of what that something is, but here’s what we do know: when you get to the point of doing an operational recovery, there’s an operational problem which requires data recovery to fix. Remember the components of the FARR model for data protection: fault tolerance, availability, redundancy and recoverability. We’re up to our second R when we’re doing operational recovery.

Why do we do disaster recovery?

To put it succinctly, we do disaster recovery because something really bad has happened. We’re not just talking “Phil in finance has lost his Excel spreadsheet”, or “Alice the CIO needs to recover that email from a month ago”, we’re talking about an if-this-doesn’t-come-back-the-company-is-paddling-against-the-current-of-an-excrement-filled-river scenario.

Why do we do long term retention?

Compliance.

Why do we care?

Operational recovery is about the data. Disaster recovery is about the service. Long term retention is about neither.

That’s your TL;DR takeaway: operational recovery is about the data, disaster recovery is about the data and the service, and long term retention is about neither. Let’s break it down though.

Operational recovery is for when we’ve lost data and we need to get it back. When I first started in backup and recovery, that was most likely to be something induced by a system fault (hardware failure, firmware fault, OS or application crash, etc.), and only a little bit to do with end users making a mistake. These days, those proportions tend to have reversed, though ‘end user mistake’ can encompass a lot of scenarios, including letting in ransomware, viruses, etc.

Operational recovery is not about the service. And there’s a simple reason here: as soon as you get into recovering services, you’re doing disaster recovery.

You see, disaster recovery isn’t about the data. Or rather, the data is one aspect of the bigger picture: the service. You don’t invoke disaster recovery procedures or processes because Phil-in-finance-lost-his-Excel-spreadsheet, you invoke disaster recovery because The-billing-system-is-down. Disaster recovery, of course, comes in different levels of complexity and seriousness. You can perform disaster recovery against a single service, a collection of services, or the entire business.

Here’s the funky thing though: operational recovery is an IT function, disaster recovery is not.

Bob-the-builder-needs-his-invoice-recovered is an IT issue. We-can’t-bill-customers is a business issue. That’s where you get into disaster recovery being a part of business continuity, of course. And yes, when you do disaster recovery, you usually want to get the data back, but what you always want to get back is the service.

Jenny-the-lawyer-needs-all-the-emails-John-sent-six-years-ago is related to neither services nor data. At least, not directly related to data. You may be recovering data at that point, but you’re doing it for compliance reasons. There’s a legal obligation to get that data back, or to prove that there was no data. Previously I said long term retention was about “compliance” and left it at that. It’s a little more complex, of course, since there are different types of compliance. I usually break them down into three categories: legal, fiduciary, and operational. (Whoa! Operational compliance? That’s the compliance you have when you’re not having compliance. More specifically, it’s compliance that’s been handed down by the business that may not have legal or fiduciary backing, but has nonetheless been designated as required.)

They all have different requirements

Each of those functions (operational recovery, disaster recovery, and compliance) has a different why, so as you might imagine, they have different requirements, too.

Operational recovery is about speed and granularity. You don’t want to recover a 30TB filesystem because Phil-in-finance-lost-his-Excel-spreadsheet, and nor do you want to have to tell Phil he can only have yesterday’s version of the spreadsheet. You want to be able to get the right data back as quickly and efficiently as possible.

Disaster recovery is really all about resiliency. Disaster recovery should be resilient against the sorts of things that can take down your environment. There’s always a “risk vs cost” discussion here, but the general rule of thumb is that the most obvious things that can affect your primary systems should not impact your ability to perform disaster recovery. So if your biggest concern is flooding, you wouldn’t build your disaster recovery processes around a basement datacenter near the edge of a river.

(Here’s where some will say “but sometimes I just want to DR the data, not the service”. I’ve said this myself a few times, and this is a true requirement: sometimes you may not care about getting the service back up and running so long as you can be assured the data is safely off-site. Well, that’s actually a function of applying the FARR model to your data protection services themselves: i.e., your data protection services should feature fault tolerance, availability, redundancy and recoverability in themselves. Having your data at an alternate location without the services is not disaster recovery, it’s just a best-practices approach to data protection.)

So, operational recovery is about speed and granularity. Disaster recovery is about resiliency. What about compliance? Compliance is about accuracy and durability. You’re not so concerned with how long it takes to get it back, so long as you can get back what you need and you can be certain that what you saved 2, 5, 10 or 50 years ago will come back exactly as you saved it.

Why is any of this important?

The simple reason why all of this is important is: it guides the architecture. In a typical environment, you have three relatively disparate services warring for architectural control over this thing we call ‘backup and recovery’, and each of them has different requirements. Take long term retention, for instance: if you have a compliance requirement to keep your monthly backups for 7 years, then they will likely consume 90-95% of your data storage requirements within the backup and recovery system.
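To see how quickly that adds up, here’s a back-of-the-envelope sketch in Python. The numbers are assumptions picked purely for illustration, not measurements from any real environment: a 10 TB full backup, weekly fulls plus daily incrementals (each 5% the size of a full) kept for six weeks, monthly fulls kept for 7 years, and no deduplication or compression.

```python
# Back-of-the-envelope estimate; every number below is an illustrative assumption.

def long_term_share(full_tb, weekly_fulls, daily_incrementals,
                    incremental_ratio, monthly_fulls):
    """Return the fraction of backup storage held by long term retention copies."""
    # Short-term copies kept for operational recovery.
    operational_tb = full_tb * (weekly_fulls + daily_incrementals * incremental_ratio)
    # Monthly fulls kept for compliance.
    long_term_tb = full_tb * monthly_fulls
    return long_term_tb / (operational_tb + long_term_tb)


share = long_term_share(
    full_tb=10.0,            # size of one full backup (assumed)
    weekly_fulls=6,          # six weeks of weekly fulls
    daily_incrementals=36,   # six incrementals a week for six weeks
    incremental_ratio=0.05,  # each incremental is 5% of a full (assumed)
    monthly_fulls=7 * 12,    # monthly fulls kept for 7 years
)
print(f"Long term retention share of storage: {share:.1%}")
```

With those assumed inputs the monthly copies come to 840 TB out of 918 TB, a little over 91% of the total. Deduplication, compression and different schedules will all move that figure, so treat it as a way of reasoning about the split rather than a sizing tool.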

But your long term retention requirements can’t drive your performance decisions around average recoveries for the simple reason that 90-95% of your recoveries fall into that operational recovery window, usually 1 day – 6 weeks.

But just as much, your backup and recovery system can’t drive your disaster recovery decisions because (for most companies at least), 90-95% of your workforce aren’t IT people, and the services your business offers aren’t IT services, they just use IT services.

All of this is important because for most businesses, backup and recovery services offer some or most of all three of these functions: operational recovery, disaster recovery, and long term retention. If you architect it wrong, it’s a “one size fits all” approach that tries to shoehorn everything into a single atomic function. If you architect it right, it’s because you’ve asked your why’s, and recognised that the different answers drive different functionality within the overall system. The classic of course is long term retention. Speed isn’t important there, but it is with operational recovery. If you need to ensure your operational recovery is fast but most of your data will be stored without regard to recovery speed, that implies you need to tier your data somehow. Likewise, when you provide disaster recovery, you have to stop thinking about the data and start thinking about the service. In a DR situation, getting the Oracle database back, and the blob storage filesystem it was using recovered, is of limited utility if the service they underpin won’t start because the two systems are inconsistent with one another.
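One way to picture that tiering is as a placement rule keyed on how old a backup is and whether compliance still needs it. This is a minimal sketch, not a description of how any particular product works: the tier names, the `choose_tier` function and the six-week cut-off (borrowed from the operational recovery window mentioned above) are all assumptions.

```python
from datetime import date, timedelta

# Assumed cut-off: most recoveries fall within roughly six weeks.
OPERATIONAL_WINDOW = timedelta(weeks=6)


def choose_tier(backup_date: date, today: date, needed_for_compliance: bool) -> str:
    """Pick a hypothetical storage tier for one backup copy."""
    age = today - backup_date
    if age <= OPERATIONAL_WINDOW:
        # Recent copies serve operational recovery: speed and granularity matter.
        return "fast"
    if needed_for_compliance:
        # Older copies kept for compliance: accuracy and durability matter, speed does not.
        return "durable"
    # Outside the operational window with no compliance need: candidate for expiry.
    return "expire"


today = date(2024, 7, 1)
print(choose_tier(date(2024, 6, 20), today, needed_for_compliance=False))  # fast
print(choose_tier(date(2019, 1, 1), today, needed_for_compliance=True))    # durable
print(choose_tier(date(2023, 1, 1), today, needed_for_compliance=False))   # expire
```

Notice that disaster recovery doesn’t appear in this rule at all. That’s deliberate: it’s about getting a service back, so it can’t be reduced to where an individual backup copy lives.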

Yes You Can Teach Old Dogs New Tricks

All this is easy if you’ve got a greenfields environment where you’re planning out backup and recovery services for the first time, but it’s important to recognise you can still evaluate all of this against an existing backup and recovery service.

“It’s always been done this way” doesn’t have to be an operational mantra. Nor does “if it’s not broken, don’t fix it”: just because something is limping along doing all three services OK doesn’t mean you can’t step back and evaluate what might be changed to enhance its capability. For example, maybe all your backups get stored on the same medium today: that’s OK, but it doesn’t mean you can’t change it tomorrow, or next financial year, or start to introduce change. Redirecting a backup and recovery system to something bigger and better can be done in incremental changes rather than thinking that it’s an all-or-nothing endeavour. It’s quite common to find it easier to make 10 x 5% improvements than 1 x 20% improvement, and you’ll get further along by doing it that way, too.

So if the backup and recovery service your business is running today provides operational recovery, disaster recovery and long term retention services, do yourself a favour and step back to evaluate how the broader service might be better tweaked to save it from being a jack-of-all-trades-master-of-none.
