When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?
Bearing in mind I preach at the alter of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not however a blanket requirement that no error happens.
So what metrics do I recommend? They’re pretty simple:
- Recoveries – 100% of recoveries should succeed.
- Backups – 95-98% of backups should succeed.
That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently) successful your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.
Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.
For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.
This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll around 102 savesets on a nightly basis:
- 20 x 4 filesystems = 80 savesets
- 20 index savesets
- 1 bootstrap saveset
- 1 NMC database saveset
98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
You might wonder why I don’t call for a 100% success rate with backups – quite frankly much as it may be highly desirable, given the nature of a backup system – to touch on so many parts of an operating IT environment, it’s also one of the most vulnerable systems to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually asserting with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.
Those are metrics you not only can depend on, but you should depend on, too.