I still remember a conversation I had with a customer circa 2001 when we were planning their updated backup environment.
Customer: “For those hosts we only need to back up one file?”
Me: “One file?”
Customer: “Just /etc/hosts”
Me: “… /etc/hosts?”
Customer: “Yeah, we’ll rebuild the host if it fails and just recover /etc/hosts”
Memories.
Back then, NetWorker only had traditional client licensing (unless you went for a magical mystical enterprise license), so it seemed crazy to waste dozens and dozens of client connection licenses just to back up a single file on each system.
We don’t always get to protect everything we want to. Budgets aren’t always what we want. So a perennial conversation I have with customers when planning data protection environments is the reduction of the overall backup volume. Now don’t get me wrong, I’m all for making sure you don’t waste money in protecting things you don’t need to – but how do you determine ‘what we don’t need to protect?’
We start with the inverse – mapping out what we definitely have to protect. That’s always going to include things like:
- Production workloads
- Systems of record
- Unique instances of data
Beyond that, though, there are a lot of grey areas for many people. For workloads and components that sit outside of the above list, how then do you determine what you ‘don’t need’ to protect? For that, I look at the clock.
You’ve probably heard of the 5 second rule, I’m guessing: if food falls on the floor it’s safe to eat it if you grab it within 5 seconds. Or so the saying goes. (When you live in a house with two Burmese cats who are perpetually dropping fur everywhere, you give up on the 5 second rule.)
With data protection, we have the 15 minute rule, and it works like this:
If you can’t initiate automated recreation of the workload within 15 minutes of its failure, you should be protecting it.
Note the keyword automated. If you can’t automate the recreation of the workload you’ve got your answer: you should be protecting it. If someone has to manually build it and manually remember to apply the appropriate settings, you should be protecting it.
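If it helps to see the rule written down as a decision, here's a minimal sketch in Python. The `Workload` class, its field names, and `should_protect` are all made up for illustration; the only things taken from the rule itself are the 15 minute threshold and the fact that a non-automated rebuild answers the question on its own.

```python
from dataclasses import dataclass

RULE_MINUTES = 15  # the threshold from the 15 minute rule


@dataclass
class Workload:
    # All fields are hypothetical, for illustration only.
    name: str
    automated_rebuild: bool     # can recreation be kicked off without manual build steps?
    minutes_to_initiate: float  # estimated time from failure to the rebuild being started


def should_protect(workload: Workload) -> bool:
    """Return True when the 15 minute rule says the workload should be protected."""
    # No automation means the question is already answered: protect it.
    if not workload.automated_rebuild:
        return True
    # Automated, but too slow to start: protect it.
    return workload.minutes_to_initiate > RULE_MINUTES


# A hand-built host fails the rule even if someone thinks the rebuild is quick.
print(should_protect(Workload("hand-built-app-server", False, 5)))
# A templated dev VM that can be redeployed in 10 minutes passes.
print(should_protect(Workload("dev-vm-from-template", True, 10)))
```

The estimate you feed into `minutes_to_initiate` is the hard part, of course, and that's a conversation rather than a calculation.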
So you’ve got a virtual machine farm that’s hosting 500 virtual machines. You want to keep your data protection services ‘focused’ on production systems so you only target the ~200 production virtual machines for data protection – let’s look just at backup here. 200 production virtual machines get backup, but 300 virtual machines used for dev/test don’t get backups at all.
You’ve got virtual machine templates for your dev/test environment. So in theory, if one of those machines fails, you could delete it and start deploying it from the template again in a relatively short time: probably less than 15 minutes (let’s ignore any ticket handling processes here). Maybe there’s a test SharePoint farm of 5 hosts. Can you initiate the building of 5 hosts in 15 minutes? Maybe a developer has a collection of 10 virtual machines and someone accidentally deletes those. Can you start the recreation in 15 minutes for all of them?
15 minutes is the tipping point.
So what happens if the entire test/dev storage system has a failure?
In data protection systems we have to think about cascading failures, and we also have to think about the classic risk vs costs decisions. You have to think about the different sorts of failure scenarios that can and can’t be tolerated, and take the appropriate steps. That’s why production storage arrays tend to be replicated: not because we question their reliability, but because there’s risk, and the cost to the business of a failure that comes from not replicating is higher than the cost to the business of replicating.
The same goes for working out what you don’t need to protect. The 15 minute rule doesn’t just apply to singular systems, but to failure zones within your business. Can you initiate recreation of all 300 virtual machines within 15 minutes? Maybe some of those don’t need to be recreated immediately, but there’s a fair chance that if you’ve taken the time to spin up a test/dev environment in the first place, those systems matter to someone, and collectively the loss of those systems represents a loss of productivity to groups within the business – and by extension, a loss to the business, too. If you do have that sort of storage level failure, you’re going to have to rebuild the storage before you can start recreating, so users and the business will already be champing at the bit for the fast recreation of their services.
15 minutes. You’ve got 15 minutes.
Because here’s the thing: you can kick off an awful lot of recoveries in 15 minutes.
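To put rough numbers on the failure zone question, here's a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the 30 seconds it takes to kick off each redeployment, and the number of redeployments you can start in parallel. Swap in figures from your own environment.

```python
import math

RULE_MINUTES = 15  # the threshold from the 15 minute rule


def minutes_to_initiate_zone(vm_count: int, seconds_per_kickoff: float, parallel_jobs: int) -> float:
    """Estimate the minutes needed to *start* recreating every VM in a failure zone.

    Assumes kickoffs run in batches of `parallel_jobs`, each batch taking
    `seconds_per_kickoff`. Both figures are hypothetical inputs.
    """
    batches = math.ceil(vm_count / parallel_jobs)
    return batches * seconds_per_kickoff / 60


def zone_meets_rule(vm_count: int, seconds_per_kickoff: float, parallel_jobs: int) -> bool:
    """True if recreation of the whole zone can be initiated within 15 minutes."""
    return minutes_to_initiate_zone(vm_count, seconds_per_kickoff, parallel_jobs) <= RULE_MINUTES


# One developer's 10 VMs, started one at a time at 30 seconds each.
print(minutes_to_initiate_zone(10, 30, 1))   # 5.0
# The whole 300 VM dev/test estate, 4 kickoffs at a time.
print(minutes_to_initiate_zone(300, 30, 4))  # 37.5
print(zone_meets_rule(300, 30, 4))           # False
```

Under those made-up numbers, a single developer's machines sit comfortably inside the rule and the full dev/test estate doesn't. The estimate also leaves out any time spent rebuilding the underlying storage first.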
If there’s one thing that’s consistent in business it’s that no-one remembers with gratitude that you saved 1TB of backup or data protection storage by excluding systems from recovery services to meet the budget. Yes, business can be fickle: you might have got a tick for meeting the budget requirements of the solution, but in a failure situation, you can bet someone will turn and say “but why didn’t you say this would be a consequence later?”
If you can meet the 15 minute rule and you can document that in the decision process, you’ll have outlined exactly what can and can’t be done. After all, disaster recovery (regardless of whether it’s an atomic workload, a collection of workloads, or all workloads) isn’t an IT function. IT helps, but it’s business driven. Your 15 minute rule isn’t just for your IT budget, it’s for the business as a whole.
One might suggest that the easy part of planning for a data protection environment is working out the various policies, data retention timings, protection frequencies, and SLAs.
The hard part – the really hard part – is getting agreement from the business that the 15 minute rule is OK. If you think a conversation with the business about needing to delete data is hard, wait until you start having a frank conversation with them about what you can or can’t recreate within 15 minutes when it’s not being protected.
So when you’re next working out what you should or shouldn’t protect, plan out how you’d recreate the workload and start the stopwatch running. You’ve got 15 minutes.