Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error happens.

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently successful) your backups are; it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
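The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names, the seven-night window and the nightly success counts are my own assumptions rather than anything NetWorker reports, and the check treats "hits the high mark" as the stricter of the two readings given above.

```python
def nightly_savesets(hosts, filesystems_per_host):
    """Count the savesets expected each night in the example environment."""
    filesystem_savesets = hosts * filesystems_per_host
    index_savesets = hosts   # one index saveset per host, as in the list above
    bootstrap_savesets = 1
    nmc_savesets = 1
    return filesystem_savesets + index_savesets + bootstrap_savesets + nmc_savesets


def savesets_needed(total, rate):
    """Savesets that must succeed to reach a success rate, rounded to nearest."""
    return round(total * rate)


def window_ok(nightly_successes, total, low=0.95, high=0.98):
    """True if every night reaches the low mark and the window average reaches the high mark."""
    every_night_ok = all(s / total >= low for s in nightly_successes)
    rolling_rate = sum(nightly_successes) / (total * len(nightly_successes))
    return every_night_ok and rolling_rate >= high


total = nightly_savesets(hosts=20, filesystems_per_host=4)
week = [100, 97, 101, 102, 100, 100, 102]   # made-up nightly success counts
print(total, savesets_needed(total, 0.98), savesets_needed(total, 0.95))
print(window_ok(week, total))
```

A week in which one night dips to 97 successes still passes, provided the other nights pull the average back up to the high mark. A week that sits at 97 every night does not.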

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as it may be highly desirable, a backup system touches so many parts of an operating IT environment that it’s also one of the systems most vulnerable to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually allowing for with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

 

In “Distribute.IT reveals shared server data loss – News – iTnews Mobile Edition” (June 21, 2011), we’re told:

Distribute.IT has revealed that production data and backups for four of its shared servers were erased in a debilitating hack on its systems over a week ago.

“In assessing the situation, our greatest fears have been confirmed that not only was the production data erased during the attack, but also key backups, snapshots and other information that would allow us to reconstruct these Servers from the remaining data,” the company reported.

You may think that I’m saying the hack is wrong – and anyone conducting such a malicious attack is certainly being particularly unpleasant. But the simple truth is that such an attack should not be capable of rendering a company unable to recover its data.

It suggests multiple design failures on behalf of Distribute.IT:

  • Backups were not physically isolated; regardless of whether you can erase the current backup, or all the backups on nearline storage, there should be backup copies that are sent off-site and removed from such attack;
  • Alternatively, if there were offsite backups – if they were physically isolated, they were not sufficiently secured;
  • Retention policies seem inappropriately small; why could they not recover from say, a week ago, or two weeks ago? The loss of some data even under a sustained hack should be somewhat reversible if longer-term backups can be recovered from. Instead, we’re told: “we have been advised by the recovery teams that the chances for recovery beyond the data and files so far retrieved are slim”.

It’s also worth noting that this goes to demonstrate a worst case scenario about snapshots – they’re typically reliant on some preservation of original data (either running disks, or ensuring that the amount of data deleted/corrupted doesn’t exceed snapshot capacity).

I’m not crowing about data loss – I completely sympathise with Distribute.IT on this incident. However, it is undoubtedly the case that with an appropriately designed backup system, this level of data destruction should not have happened to them.

 

Martin Glassborow, aka @storagebod, and I had a bit of a discussion via Twitter, which came down to the following:

  • Martin feels the default backup policy within an environment should be to backup nothing;
  • I feel the default backup policy within an environment should be to backup everything.

Now the interesting thing is, we both actually meet in the middle, but just start from different points.

Martin has discussed his reasoning behind his default policy here, in “Don’t BackUp“, which I encourage you to read before continuing. There is, indeed, as Martin suggested in a tweet to me last night, a nice absolutism in either approach – don’t backup, or backup everything. Yet, neither is really the case.

My approach – that being to start with “backup everything”, starts with the following assumptions:

  1. Hardware can fail.
  2. Software can fail.
  3. Humans can make errors.
  4. Processes can fail.

By my very nature I think I’m perfectly suited to working in the backup space. I’ve always been into backup. On the Vic-20, when I was learning to program, I’d always save my programs onto two different tapes. On the Commodore 64, I’d always save my programs and documents onto two different disks. When I went to the PC, I’d always have a copy on a hard drive, and a copy on a floppy drive.

Martin’s approach is this:

Making it policy that nothing gets backed-up unless requested takes out all ambiguity. There can be no assumptions about what is being backed-up, it makes it someone’s responsibility as opposed to an assumed default.

There is, undoubtedly, logic in what Martin suggests, but it’s not a logical starting point I can personally reconcile myself with, for the fundamental reason that it (IMHO) assumes that everyone who interacts with the system understands the system and the nature of their interaction.

It in fact runs completely contrary to an axiom in user desktop/laptop backup approaches – if you leave backups up to the users, nothing will get backed up. That holds true for pretty much every business I’ve ever interacted with, from the most, to the least technical.

It’s for that reason, that lack of total systems awareness and data responsibility from all users of any environment, that my approach starts from the other end. Backup everything.

But I don’t really mean it. I abhor wastage. Recently, I’ve learnt that wastage comes in many forms, which is why the decision to move interstate and re-evaluate what I/we own has been cleansing. (See the article “deconstruction of falling stars” over at my personal blog for a bit more on that front.)

As I abhor wastage, I don’t actually believe you should backup everything within your environment. Sure, some vendors might like that notion – infinite tapes, disk, storage, snapshots, you name it. But it’s neither practical nor commercial reality to do this.

No, there is a middle ground. For me, the sweet spot is what I always come back to:

It is always better to backup a little more than you need, and waste some storage media, than it is to not backup quite enough, and be unable to recover.

So if your tape usage is say, 5-10% higher than it should be, or your VTL/B2D environment is 5-10% bigger than it really needs to be, I’m not concerned. (If it’s a crazy amount, like 100% more, then there’s a problem – a serious problem that has arisen from a lack of capacity planning, etc.)

I’ve seen IT sites where NetWorker agents have been deployed on every server within the environment, and when I’ve done a coverage analysis, I’ve seen servers that have this as the saveset:

/etc/hosts

Just that. Nothing more, nothing less. (You couldn’t get much less anyway.) I’ve equally seen sites where not only was a hot backup done of the production Oracle database via a module, but the database files were backed up as part of the filesystem backup, and then export/dumps were generated and backed up as well. Overkill? Yes. Were some backups unrecoverable? Yes.

Both are very clear examples of wastage, but I’ll tell you the difference.

The latter one – backing up too much – is time and money wastage. Neither is pleasant, and both can hurt the bottom line of a company, yet that’s where it stops.

The former – backing up only what is explicitly requested, nothing more – is corporate wastage. There’s a little bit of monetary wastage involved (why spend the money on an agent to backup a single file?) – the real wastage though is that it could waste the company. Unable to recover legally required files because someone forgot to request them to be backed up? Hello, lawsuit loss. Unable to recover financial data that proves your company has correctly paid its taxes because someone forgot to request it to be backed up? Hello, double tax payments. For me it triggers thoughts of every possible nightmare scenario a company might experience, right through to total dissolution and loss of the company itself.

In my book, I make the differentiation between what I call inclusive and exclusive backup products. I define:

  • An inclusive backup product is one where you have to explicitly specify what gets backed up. By default, nothing is backed up unless you specify it.
  • An exclusive backup product is one where you have to explicitly specify what doesn’t get backed up. By default, everything is selected and you have to winnow that selection down yourself.
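To illustrate the difference between those two definitions, here is a small Python sketch. The function names and filesystem paths are hypothetical and do not come from any real backup product. It simply shows what each approach selects when a new filesystem appears and nobody updates the configuration.

```python
def inclusive_selection(all_paths, include):
    """Inclusive product: nothing is backed up unless it is explicitly listed."""
    return {p for p in all_paths if p in include}


def exclusive_selection(all_paths, exclude):
    """Exclusive product: everything is backed up unless it is explicitly excluded."""
    return {p for p in all_paths if p not in exclude}


# Hypothetical host: /finance was added after the backup was configured.
filesystems = {"/", "/home", "/var", "/tmp", "/finance"}

inclusive = inclusive_selection(filesystems, include={"/", "/home", "/var"})
exclusive = exclusive_selection(filesystems, exclude={"/tmp"})

print("/finance" in inclusive)   # the inclusive list was never updated
print("/finance" in exclusive)   # the exclusive default picks it up automatically
```

With the inclusive approach the new filesystem is silently unprotected until someone remembers to request it. With the exclusive approach the worst case is that something unnecessary gets backed up.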

The first, I consider to be the hallmark of a workgroup backup product approach. Cost reduction is the primary focus of this approach. The second, I consider to be a fundamental requirement for a product to earn the “enterprise backup product” badge of honour. Without this, there is a distinct lack of trust.

While I can understand Martin’s starting point, and that he moves more to the middle to make sure the right things are backed up, I can’t agree that this is the best approach.

I’ve seen, heard of, and witnessed too many IT war stories.

 

In the past I’ve talked about the importance of having zero error policies.

In “What is a zero error policy?“, I said:

Having a zero error policy requires the following three rules:

1. All errors shall be known.

2. All errors shall be resolved.

3. No error shall be allowed to continue to occur indefinitely.

If you’ve not read that article, I suggest you go read it, as well as the follow-up article, “Zero error policy management“.
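As a rough illustration of how the three rules might be checked against a list of backup errors, here is a minimal Python sketch. The record fields, the sample errors and the three-night threshold for "indefinitely" are all assumptions I've made for the example. A real site would set its own limit and pull the data from its own reporting.

```python
from dataclasses import dataclass


@dataclass
class BackupError:
    signature: str         # hypothetical identifier, e.g. client, saveset and error text
    known: bool            # someone has seen and recorded the error
    resolved: bool         # the cause has been fixed or formally explained
    nights_recurring: int  # consecutive nights the same error has appeared


def policy_violations(errors, max_nights=3):
    """Return (signature, rule) pairs for every rule an error currently breaks."""
    violations = []
    for e in errors:
        if not e.known:
            violations.append((e.signature, "rule 1: error is not known"))
        if not e.resolved:
            violations.append((e.signature, "rule 2: error is not resolved"))
        if not e.resolved and e.nights_recurring > max_nights:
            violations.append((e.signature, "rule 3: error keeps recurring"))
    return violations


errors = [
    BackupError("clientA:/var timeout", known=True, resolved=True, nights_recurring=1),
    BackupError("clientB:/ index failure", known=False, resolved=False, nights_recurring=1),
    BackupError("clientC:/data drive error", known=True, resolved=False, nights_recurring=5),
]

for signature, rule in policy_violations(errors):
    print(signature, "->", rule)
```

An empty result is what a zero error policy is aiming for: not that nothing ever failed, but that nothing is unknown, unresolved, or repeating unchecked.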

I’m going to make, and stand by, with fervid determination, the following assertion:

If you do not have a zero-error policy for your backups, you do not have a valid backup system.

No ifs, no buts, no maybes, no exceptions.

Why? Because why. Because across all the sites I’ve seen, regardless of size, regardless of complexity, the only ones that actually work properly are those where every error is captured, identified, and dealt with. Only those sites would I point at and say “They have every chance of meeting their SLAs”.

In my book, I introduce the notion that just deploying software and thinking you have a backup system is like making a sacrifice to a volcano. So, without a zero error policy, what does a network diagram of your IT environment look like?

It looks like this:

Network diagram of backup environment without zero error policies

 

The folks over at 37 Signals published a little piece of what I would have to describe as crazy fiction, about how the combination of cloud and more technically savvy users means that we’re now seeing the end of the IT department.

I thought long and hard about writing a rebuttal here, but quite frankly, their lack of logic made me too mad to publish the article on my main blog, where I try to be a little more polite.

So, if you don’t mind a few strong words and want to read a rebuttal to 37 Signals, check out my response here.

 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abaddon was at a University that instituted filesystem quotas, but due to their interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory and email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abaddon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that is deleted but known to be required later builds to the point where it’s not physically possible to have all of it residing on primary storage any longer should it be required. All we do is create a new headache while implementing at best a crude workaround.

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart way of avoiding archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive :-)

 

I’m not a storage geek – storage to me is a means to an end, almost irrelevant to the final goal.

I’m passionate about backup though, because backup is about making people happy.

Backup is about recovery, you see.

Recovery is about making sure people can go home on time rather than re-entering lost data all night.

Recovery is about knowing someone can turn up for a flight they booked six weeks earlier and know the airline still knows they booked the ticket.

Recovery is about knowing someone’s pay deposit isn’t lost after a brief systems hiccup.

Recovery is about a student saving a 50,000 word thesis on a server and knowing it will still be there next morning.

Recovery is about being able to look at digital photos of a loved one ten years after they’re gone.

I have the best job in the world.

If you work in backup and recovery, so do you.

 

I don’t like having to do this, particularly since I’m on holidays and only logged into my work email to send one, rather than read, but I noticed an email come in on a support case that I’ve been keenly dealing with, and wanted to check what the latest update from EMC on it was.

But on this case, I’ve been passed a response from EMC NetWorker engineering which is so boneheaded and stupid that I can’t help but have a short rant about it.

(I’ll qualify one thing here: I’m talking EMC NetWorker engineering – the back-end people, not the support people.)

In short, as of 7.6, there’s a new media database field called ‘validcopies’, which, according to the man page is:

The number of successful copies (instances or clones) of the save set, all with the same save time and save set identifier.

Now, digging a little bit further, we’ve got the release notes for 7.6, which state:

mminfo changed to allow query for valid save set copies in order to prevent data loss

There was no convenient method to query for save sets with valid clone copies on other volumes using mminfo. This made certain tasks more difficult to perform, such as determining if space could be cleared on the EDLs.

(Italicised emphasis mine, bold from the release notes.)

Now, in addition to validcopies initially being entirely FUBAR as a reporting mechanism (I’m happy with the patch I’ve been testing, and I’m hoping it will get into the first service pack for 7.6), I noted in the support case that I didn’t think it was appropriate for NetWorker to return 2 ‘validcopies’ for savesets on ADV_FILE devices. (I.e., one for the read-only volume, one for the read-write volume.) Sure, in the classic use of the ‘copies’ flag, we’re used to this, but ‘validcopies’, being something new, and being about preventing data loss, should have only reported 1 valid copy per entire disk backup unit, not 2.

Instead, EMC NetWorker engineering have adamantly said that it will report 2 valid copies per disk backup unit, 1 per read-only device, one per read-write device.

This is boneheaded. If the validcopies flag is all about preventing data loss, then it must be accurate as to the number of distinct, usable copies.
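To show what I mean by distinct, usable copies, here is a small Python sketch of the counting I would expect. It is purely illustrative and is not how NetWorker computes ‘validcopies’. The volume names are made up, and the sketch assumes the read-only side of a disk backup unit carries the read-write volume’s name with a “.RO” suffix.

```python
def distinct_copies(volume_names, ro_suffix=".RO"):
    """Count copies on physically distinct volumes.

    The read-only and read-write views of one disk backup unit are
    treated as the same physical copy (assumed naming: NAME and NAME.RO).
    """
    physical = set()
    for name in volume_names:
        if name.endswith(ro_suffix):
            name = name[: -len(ro_suffix)]
        physical.add(name)
    return len(physical)


uncloned = ["DiskBackup01", "DiskBackup01.RO"]            # one disk backup unit only
cloned = ["DiskBackup01", "DiskBackup01.RO", "Tape0042"]  # same saveset, cloned to tape

print(distinct_copies(uncloned))
print(distinct_copies(cloned))
```

Counted this way, an uncloned saveset on a disk backup unit has one copy, and it only reaches two once it has been cloned somewhere else.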

If engineering is so confident that a backup to ADV_FILE represents two distinct valid copies for the purposes of preventing data loss if a copy is lost, let’s see them delete a whole bunch of uncloned savesets from the read-write ADV_FILE devices on EMC’s production backups and then recover. What? You can’t do that? But you said you had two valid copies, and you only deleted one of them? Boo-hoo to you too.

I’ll end my grumpy rant with the following advice: don’t say or do something stupid that might allow a customer to do something stupid that might result in data loss. Haven’t you read this, after all?

 

Having recently encountered a situation where a NetWorker client on a customer site repeatedly failed its full backup, I wanted to take a few moments to stress the absolute importance – no, extreme criticality – of always being on top of your full backups.

Specifically:

  • You should always know whether your full backups have succeeded or not for each and every client of your backup system.
  • Unless there are specific management directives to the contrary, you should always re-run full backups in the event of failure as soon as possible.

To put it another way – a set of backups without a full, when it comes to performing a complete filesystem or system recovery, is about as useful as a chocolate teapot. Perhaps even less so.

I’ve described previously the importance of having a zero error policy, and always knowing if failures occur. So this topic could be summarised as being a subset of the zero error policy. However, if I were to be asked what backup I could “afford to lose” in terms of complete system recoverability, I’d pick an incremental any day over a full. (It’s actually a fine line, but it’s still an important differentiation.)

Without a full backup, at best you can pull back bits and pieces of a filesystem. Sure, they might be the most recently modified bits, which in themselves are important, but they’re not the entire filesystem. For most organisations, they barely touch the surface of the filesystem. Incrementals (and for that matter, differentials) are like the proverbial tip of the iceberg – perhaps without the penguins though*. The real monstrosity in a backup environment – the rest of the iceberg – are the fulls.

Let’s consider it this way – in most environments (discounting say, backups of database dump regions) you’ll find that an incremental backup covers somewhere between 5% and 10% of the filesystem. Not only that, the delta change on a day to day basis will also be quite small. That is, in many situations the files that are backed up each day in incremental backup regimes are the same files, modified day after day for working purposes. So while you may have incrementals of even up to 10% per day of your fulls, in turn 90% or more of those files may be the same files each day that are getting backed up in incrementals.

If we look at a 200GB filesystem though, even 10% of that filesystem is just 20GB. So if your full is somehow lost, that’s 180GB that you can’t readily recover. Additionally, the 20GB or so that you can recover is going to be a pig’s breakfast as far as getting it back in any consistent state.
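The numbers for the 200GB example can be worked through with a short Python sketch. The function names are mine, and the second function rests on an assumption: that each night after the first contributes only the part of the incremental that wasn’t already caught on earlier nights (10% of it, if 90% of the files are the same ones each day).

```python
def split_without_full(filesystem_gb, incremental_pct):
    """Return (recoverable_gb, lost_gb) if only one incremental survives."""
    recoverable = filesystem_gb * incremental_pct / 100
    return recoverable, filesystem_gb - recoverable


def unique_incremental_gb(filesystem_gb, incremental_pct, repeat_pct, nights):
    """Distinct data held across several nights of incrementals.

    Assumes the first night is entirely new data and every later night
    adds only the share of its incremental that is not repeated.
    """
    nightly = filesystem_gb * incremental_pct / 100
    new_each_night = nightly * (100 - repeat_pct) / 100
    return nightly + new_each_night * (nights - 1)


print(split_without_full(200, 10))
print(unique_incremental_gb(200, 10, 90, nights=6))
```

Under those assumptions, even six nights of incrementals only hold about 30GB of distinct data from the 200GB filesystem, which is why losing the full hurts so much.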

NetWorker, through its use of saveset dependency chains, will do its utmost to protect you from regular saveset failures. If a full filesystem backup fails, subsequent incrementals will be chained onto the previous dependency set, retaining the previous full backup for a longer period of time.

It’s important we don’t let those dependency chains just keep building and building. They need to be broken and restarted so that we don’t get into messy situations or use up too much media. That’s why you should have a policy to rerun a full backup as soon as possible if it fails, rather than just waiting for the next one. (Further, I’ve far too often seen that sites with a “just wait until the next full backup runs” policy continually miss full backup failures, often for months at a time, because that sort of attitude also seems to be accompanied by informal record keeping.)

The next thing to consider is that we mustn’t just arbitrarily break dependency chains ourselves. By this, I’m referring to manually recycling media without regards to what may depend on that media, just because we need to free up volumes or have policies that media should be recycled after a certain length of time.
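The dependency chain idea can be sketched as a simplified model in Python. This is not NetWorker’s actual media database logic. The `Saveset` record, the volume names and the `safe_to_recycle` helper are hypothetical, and the model assumes a complete recovery needs the most recent successful full plus every successful backup after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Saveset:
    day: int
    level: str             # "full" or "incr"
    volume: str            # hypothetical volume label
    succeeded: bool = True


def recovery_chain(savesets):
    """Savesets needed for a complete recovery as of the latest backup."""
    good = sorted((s for s in savesets if s.succeeded), key=lambda s: s.day)
    last_full = max((i for i, s in enumerate(good) if s.level == "full"), default=None)
    if last_full is None:
        return []          # no successful full, so no complete recovery
    return good[last_full:]


def safe_to_recycle(volume, savesets):
    """A volume is only safe to recycle if nothing in the chain lives on it."""
    return all(s.volume != volume for s in recovery_chain(savesets))


history = [Saveset(1, "full", "Full.001")]
history += [Saveset(day, "incr", "Incr.001") for day in range(2, 8)]
history += [
    Saveset(8, "full", "Full.002", succeeded=False),  # the failed full
    Saveset(9, "incr", "Incr.002"),                   # chains back to the day 1 full
]

print(len(recovery_chain(history)))
print(safe_to_recycle("Full.001", history))
```

Because the day 8 full failed, the day 9 incremental still depends on the day 1 full. Recycling that older volume because it “looks old enough” is exactly what leaves a recovery asking for a volume that no longer exists.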

More than anything else, I see this as the reason companies find themselves in situations where NetWorker returns an “Unknown” volume being required for recovery. In this situation, NetWorker knows there should be a full backup, but it doesn’t have access to it, and therefore it can’t do anything to get the complete filesystem (or other type of data) recovered. Or, if there’s going to be a significant recovery error

Your full backups are like gold. No, gold isn’t special enough. Platinum, maybe. Or some combination of gold, platinum and saffron. They’re not to be cavalierly deleted, they’re not to be ignored, and they’re not to be left unchecked. (They’re not to be uncloned, either.)

In actual fact, it really doesn’t matter what your backup product is. What always matters is that your full backups are done, they’re done as soon as possible around the scheduled time, they’re successful, they’re known to be successful, and they’re successfully cloned. If any of those factors aren’t in play, you’ve got to get it fixed straight away.


* Unless they’re incrementals from a Linux system, of course.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

A is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
B is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
C is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
D is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
E is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
F is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
G is for Growth. Your backup environment should be scoped to handle at least 2 years' growth upon implementation. If it isn't, budgets haven't been established correctly.
H is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
I is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
J is for Jekyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jekyll might – not storm through them like Mr Hyde, flying apart.
K is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
L is for Logs. Treasure your logs. Don't throw them away too quickly, and make sure they're backed up too. With access to your logs, you can answer in 3 years' time why a backup from yesterday is proving problematic to recover from.
M is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
N is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
O is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
P is for Procedures; without them, you don't have a backup system at all.
Q is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
R is for Recovery, the most important facet of data protection.
S is for SLAs (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
T is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
U is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
V is for VTL. Whether you like the need for them or not, they're not going away any time soon.
W is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet your recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
X is for X-Ray. If you can't X-Ray your backup status, drill down and see what happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
Y is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
Z is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short, nor succinct, yet hopefully useful none-the-less.

© 2012 The NetWorker Blog