Periodically, I talk about backup being just a part of a broader set of strategies that I refer to as Information Lifecycle Protection (ILP). This is distinct from Information Lifecycle Management (ILM), and has components as follows:

Components of ILP

A common mistake within an organisation, sometimes triggered by not having merged Backup, Storage and Virtualisation administration, is to approach all backup requirements and challenges only from a backup perspective. When approached from just a backup technology perspective, sometimes it doesn’t matter how elegant your solution is – it just may not be optimal.

Optimal solutions sometimes require extending the umbrella. A classic example of this is NAS. Consider for instance an enterprise environment that has a NAS in the production datacentre, replicating to a disaster recovery datacentre:

Replicated NAS

This is a fairly standard strategy, yet NAS often presents significant challenges to backup environments. Even with NDMP in place, coming up with a nightly data protection strategy for fileservers presenting tens of millions of files is not easy. Various NDMP techniques may allow for speeding up the backup process via block level strategies, but file level recovery from these styles of backups tends to be challenging at best, and impossible at worst.

As is always the case, whether you can even get a backup done is irrelevant if you can’t recover the data in an appropriately usable way.

What’s more, unstructured data doesn’t really lend itself well to more frequent backups than every 24 hours. While database logs can be captured on an almost continual basis, if it takes 8 hours to do an incremental walk of a highly dense filesystem for traditional backup, but the business requires a Recovery Point Objective (RPO) of just 1 hour, your traditional nightly-incremental strategy just doesn’t cut it.
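As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function names are hypothetical, and it assumes that worst-case data loss is approximately the gap between consecutive protection points, and that a new protection point cannot be produced more often than the job takes to run.

```python
from datetime import timedelta

def shortest_protection_interval(schedule_interval, job_duration):
    """A new protection point cannot be produced more often than the job takes to run."""
    return max(schedule_interval, job_duration)

def meets_rpo(schedule_interval, job_duration, rpo):
    """Assumption: worst-case data loss is roughly the gap between protection points."""
    return shortest_protection_interval(schedule_interval, job_duration) <= rpo

rpo = timedelta(hours=1)
walk = timedelta(hours=8)  # incremental walk of a dense filesystem

# Nightly incrementals: a 24 hour gap between protection points.
nightly = meets_rpo(timedelta(hours=24), walk, rpo)

# Even if the incremental were scheduled hourly, each run still takes 8 hours.
back_to_back = meets_rpo(timedelta(hours=1), walk, rpo)

print(nightly, back_to_back)  # False False
```

Under these assumptions, scheduling the traditional backup more often does not help; only a protection method whose run time is well under the RPO can meet it.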

So, we turn to other aspects in ILP.

The first step is to start using snapshots:

NAS and snapshots

Once configured at the storage layer, NAS snapshots happen pretty much automatically. If the business requires an RPO of 1 hour, then the most obvious protection strategy is to have the NAS take a snapshot every hour. These copy-on-write style snapshots are typically browsable by end-users, and in that situation they have an added advantage – if users can browse a snapshot and find the file they want, they don’t need to ask the backup team to recover the file(s) they need.

However – snapshots on their own represent a poor data protection strategy, since they’re only as safe as the array they’re sitting on, and relying solely on snapshots to protect data on an array, when the snapshots are also on that array, is … well, insane.

So, we have to make use of that replication strategy, and ensure that the snapshots are replicated as well:

Replicated Snapshots

So at this point, we’ve got:

  • Snapshots providing an hourly RPO;
  • Snapshots providing a user-directed recovery process;
  • Replication providing protection for snapshots in case of total array failure.

Now, some storage manufacturers would like to suggest that at this point you’ve got a valid backup solution. Not so fast, though! It’s only a valid backup solution if you’re prepared to burn through money to buy enough storage to provide long-term recoverability from snapshot. It’s around this point that you’ll want a backup product inserted into the protection strategy.

However, we don’t just insert a daily backup and leave it at that; if the NAS snapshots are configured correctly we can extend that convenience factor for end-users whilst still getting a copy out to off-line storage. In this scenario, we might end up with a solution such as the following:

Snapshots with Daily Backup

In this scenario, hourly snapshots are kept for 24 hours, with the final snapshot of each day kept in turn as the “daily” backup for n days. In many businesses this will extend to more than a week – e.g., 28 or 31 days. In the above example, those “daily” snapshots are each written out to tape. Keep in mind that we’re still replicating the NAS and its snapshots from one site to another, so we hit a new benefit of combining snapshot, replication and backup into a comprehensive ILP strategy – when the traditional backup is run, it can be run from the replicated data, offloading the impact of the backup from the production NAS:

Replica Snapshot Backups
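The retention scheme described above (hourly snapshots kept for 24 hours, with the final snapshot of each day kept for n days) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the idea of working from a plain list of timestamps are hypothetical, and a real NAS would apply its own snapshot scheduling and expiry rules.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snapshots, now, daily_retention_days):
    """Return the snapshot timestamps a policy like the one above would retain.

    - every snapshot taken within the last 24 hours is kept;
    - the final snapshot of each calendar day is kept as the "daily"
      for `daily_retention_days` days.
    """
    keep = set()
    last_of_day = {}
    for ts in snapshots:
        age = now - ts
        if age < timedelta(0):
            continue  # ignore timestamps in the future
        if age <= timedelta(hours=24):
            keep.add(ts)
        # Track the final snapshot seen for each calendar day.
        day = ts.date()
        if day not in last_of_day or ts > last_of_day[day]:
            last_of_day[day] = ts
    for ts in last_of_day.values():
        if now - ts <= timedelta(days=daily_retention_days):
            keep.add(ts)
    return sorted(keep)

# Three days of hourly snapshots, evaluated at midnight on the fourth day.
start = datetime(2024, 1, 1, 0, 0)
hourly = [start + timedelta(hours=h) for h in range(72)]
now = datetime(2024, 1, 4, 0, 0)

kept = snapshots_to_keep(hourly, now, daily_retention_days=28)
print(len(kept))  # 26: the last 24 hourlies plus two older "daily" snapshots
```

The "daily" snapshots selected this way are the ones that would be written out to tape, whether from the production NAS or from its replica.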

Of course, this isn’t the only way the backup strategy can work. If sufficient protection is available on both the production and replica NAS units, and the filesystems are large enough, only weekly backups might get output to tape:

Snapshots with Weekly Backups

With that strategy, no incremental backups of the NAS are ever written to tape – just weekly fulls.

Nothing in the above data protection strategy is particularly complex – but equally, none of it is really all that possible when considering backups in isolation. As soon as backups are considered alongside the other activities in ILP (RAID, Replication and Snapshots), advanced and flexible strategies such as the above become available.

So before you approach your next data protection challenge, ask yourself the following question:

Does this need a backup strategy, or does it need an Information Lifecycle Protection strategy?

 

RIP Old Backup Software

Much of what I deal with relates to active backup systems, but sometimes a backup system will reach an end-point in its lifecycle. To be fair, this isn’t something that should necessarily happen regularly. If chosen correctly, a backup system (particularly an enterprise one) should evolve with the needs of business. Indeed, it could be argued that in order to even be classified as an enterprise backup product, software must feature both growth and scalability so it can remain useful and relevant in a deployment.

That being said, there are still times when a company will decide to decommission a backup system. Reasons I’ve seen in the past include:

  1. Business is purchased by another company that has a backup software standard;
  2. Critical feature set<->requirements gap develops, necessitating re-evaluation;
  3. Backup product is discontinued (or subsumed by another product);
  4. OS platform shift necessitates a product change;
  5. New manager has a beef against existing product or vendor (sadly, while this shouldn’t come into play, it really does sometimes).

There are going to be other reasons from time to time, of course, but those represent the most common reasons I’ve seen (not in any real particular order, I should note).

These days it’s actually extremely rare to encounter a business that doesn’t have any long-term recovery requirements. (Indeed, typically businesses that believe they don’t have any long-term recovery requirements are mistaken.) Out of all my current customers, there’s only one that I can immediately think of that has short-term retention policies only and proof that’s all they need.

It’s the transitioning between backup products that sees us lose the insurance policy analogy. We can compare a lot of backup and recovery system operations to insurance policies – backing up is taking out the policy, recovery is making a claim, cloning your backups is like ensuring your policy is up to date and your insurer is liquid, and having a support contract is like making sure your insurer has an underwriter.

Switching backup products? You might say that it’s like switching insurance companies, except when you switch insurance companies you don’t have to keep your old policy around “just in case”. It’s a very rare situation to be able to switch without any legacy considerations.

And so, the net result when it comes time to decommission a backup product is that a full decommissioning may in fact take months, or even years, to complete, depending on the retention requirements on the backups.

When a backup environment is due to be decommissioned, you can typically choose one or more of the following actions:

  1. Migrate all, or the critical long-term backups to the new product. This typically is a costly and fairly manual process involving recoveries and new backups, typically requiring third party certification that no data was changed during the process, etc.;
  2. Maintain the old backup environment ‘as-is’, with appropriate support contracts, which may be costly;
  3. Maintain the old backup environment ‘as-is’, without support contracts (i.e., an Icarus support contract process), which will be risky;
  4. Virtualise the essential components of the backup environment, and reduce to a bare minimum the hardware requirements necessary for a recovery (e.g., replace a large tape library with just one or two standalone drives, etc.);
  5. Decommission the environment, archiving the requisite hardware and systems to facilitate a “cold” startup and recovery (possibly exporting the meta-data necessary for long-term backup tracking before hand to facilitate those recoveries).

To be perfectly honest, none of these options is inherently ideal, and each carries its own risks, costs and compromises. (I believe the most flexible choice, if it’s available to the business, is virtualisation.)

If migration isn’t performed, then there’s another aspect to decommissioning which needs to be considered. Like everything to do with backups, the technology isn’t likely to be the biggest challenge; in this case, the challenge will centre around staff knowledge.

At the best of times, backup product expertise is best acquired by regular use of the product, and moving to a new product will obviously draw attention away from the old product. If a recovery needs to be performed three months after decommissioning, a backup administrator will likely have no issue performing that recovery. But after six months? Twelve months? Three years? People who are rusty with the product will work slower and are more likely to make mistakes.

The simple fact is that there’s no really easy way to decommission a backup system in favour of a new one. That lack of simplicity should, by rights, factor into any decision process relating to the decommissioning itself; namely:

  1. Will we migrate, decommission or retain a reduced, active form of the old system?
  2. What will be the costs associated with each option?
  3. What will be the risks associated with each option?
  4. What are the benefits (both direct and indirect) from the transition?
  5. Do the costs and risks of the transition outweigh the benefits?

The last question is not flippant – any decision to change a backup product must be closely and carefully weighed up. (This is why the “new manager hates vendor X/product Y and insists on change” transition reason is particularly challenging and unpleasant to deal with – there’ll likely be few, if any benefits to that transition.)

Make sure that all of the above questions can be answered clearly and accurately; if they can’t, then in all likelihood the decommissioning will get very messy.

 

New Year’s resolutions for backup

I’d like to suggest that companies be prepared to make (and keep!) seven New Year’s resolutions when it comes to the field of backup and recovery:

  1. We will test our backups: If you don’t have a testing regime in place, you don’t have a backup system at all.
  2. We will duplicate our backups: Your backup system should not be a single point of failure. If you’re not cloning, replicating or duplicating your backups in some form, your backup system could be the straw that breaks the camel’s back when a major issue occurs.
  3. We will document our backups: As with testing, if your backup environment is undocumented, it’s not a system. All you’ve got is a collection of backups from which you might get a recovery, if the right people are around at the right time and in the right frame of mind. If you want a backup system in place, you not only have to test your backups, you also have to keep them well documented.
  4. We will train our administrators and operators: It never ceases to amaze me the number of companies that deploy enterprise backup software and then insist that administrators and operators just learn how to use it themselves. While the concept of backup is actually pretty simple (“hey, you, back it up or you’ll lose it!”), the practicality of it can be a little more complex, particularly given that as an environment grows in size, so does the scope and the complexity of a backup system. If you don’t have some form of training (whether it’s internal, by an existing employed expert, or external), you’re at the edge of the event horizon, peering over into the abyss.
  5. We will implement a zero error policy: Again, there’s no such thing as a backup system when there’s no zero error policy. No ifs, no buts, no maybes. If you don’t rigorously implement a zero error policy, you’re flipping a coin every time you do a recovery, regardless of what backup product you use. (To learn more about a zero error policy, check out the trial podcast I did where that was the topic.)
  6. We will appoint a Data Protection Advocate: There’s a lot of data “out there” within a company, not necessarily under central IT control. Someone needs to be thinking about it. That someone should be the Data Protection Advocate (DPA). This person should be tasked with being the somewhat annoying person who is present at every change control meeting, raising her or his hand and saying “But wait, how will this affect our ability to protect our data?” That person should also be someone who wanders around the office(s) looking under desks for those pesky departmental servers and “test” boxes that are deployed, the extra hard drives attached to research machines, etc. If you have multiple offices, you should have a DPA per office. (The role of the DPA is outlined in this post, “What don’t you backup?“)
  7. We will assemble an Information Protection Advisory Council (IPAC): Sitting at an equal tier to the change control board, and reporting directly to the CTO/CIO/CFO, the IPAC will liaise with the DPA(s) and the business to make sure that everyone is across the contingencies that are in place for data protection, and be the “go-to” point for the business when it comes to putting new functions in place. They should be the group that sees a request for a new system or service and collectively liaises with the business and IT to ensure that the information generated by that system/service is protected. (If you want to know more about an IPAC and its role in the business, check out “But where does the DPA fit in?“)

And there you have it – the New Year’s resolutions for your company. You may be surprised – while there’ll be a little effort getting these in place, once they’re there, you’re going to find backup, recovery, and the entire information protection process a lot easier to manage, and a lot more reliable.

 

Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error happens.

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently successful) your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
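The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, the 7-night window for the rolling average is an assumption of mine rather than something stated above, and the nightly figures in the example are made up purely for illustration.

```python
def saveset_targets(hosts, filesystems_per_host, low=0.95, high=0.98):
    """Return (total savesets, low mark, high mark) for a nightly backup run."""
    filesystem_savesets = hosts * filesystems_per_host
    index_savesets = hosts   # one index saveset per host
    server_savesets = 2      # bootstrap + NMC database
    total = filesystem_savesets + index_savesets + server_savesets
    return total, round(total * low), round(total * high)

def rolling_success_rate(nightly_results, window=7):
    """nightly_results: list of (succeeded, attempted) tuples, oldest first."""
    recent = nightly_results[-window:]
    succeeded = sum(ok for ok, _ in recent)
    attempted = sum(tried for _, tried in recent)
    return succeeded / attempted

total, low_mark, high_mark = saveset_targets(hosts=20, filesystems_per_host=4)
print(total, low_mark, high_mark)  # 102 97 100

# Illustrative week: one night only hits the low mark, but the average holds up.
week = [(100, 102), (97, 102), (101, 102), (100, 102),
        (102, 102), (99, 102), (101, 102)]
rate = rolling_success_rate(week)
print(rate >= 0.98)  # True (700 of 714 savesets succeeded)
```

A success rate on its own says nothing about whether the failures were repeats, so a check like this complements a zero error policy rather than replacing it.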

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as that may be highly desirable, a backup system by its nature touches on so many parts of an operating IT environment that it’s also one of the systems most vulnerable to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually allowing for with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

 

I think this is a question that the average company wholly, inadequately, fails to understand. You see, when it’s asked, people start thinking about their servers – “data X is backed up, data Y can be reconstructed, so we don’t backup that…”

At the end of this article though, I hope you’ll want to take a walk.

At this point, the average backup administrator is responsible for just the backups of servers and storage servers to which discrete agents can be connected. Yet this is woefully inadequate and demonstrates a wholly inappropriate level of planning within a company. That is, the person or people responsible for core data protection don’t get buy-in or oversight on all data protection.

What else is there within an environment? Well, quite a lot, potentially.

You’ve got the obvious things of course – end user desktops and laptops. Is there potential for local data storage on those machines? If there is, is that data protected?

You’ve got the slightly less obvious things – smart phones with critical business contacts, memos, etc., on them. Is that data being routinely synced? What is it being synced to? Is that synced data accessible if, say, the person leaves? Is that synced data backed up?

Moving right along past the “easy” questions, we’ve got the start of the really tricky questions – look at all the appliances within the organisation. No, I’m not talking about microwaves and toaster ovens in the kitchenettes on each floor. I’m talking about those boxes in racks that don’t have either a traditional operating system or an NDMP agent on them.

The network switches.

The fibre-channel switches.

The PABXs.

The encryption routers.

The encryption FC routers.

And so on.

All of these sorts of devices have configuration/state data on them. A month or so ago, I was talking to another third party consultant at a site, and that person whispered to me, with a slightly deer-in-the-headlights facial expression, “Their SAN FC zoning hasn’t even been saved to the switches, because they’re older and they can’t schedule the outage to save the config.”

And I thought, what sort of bizarro world have I entered? Because I’d bet money that if the running state wasn’t committed, it certainly wasn’t backed up either.

So, here’s my challenge to you, as a backup administrator – take ownership and become a Data Protection Advocate. I know, EMC have a product called DPA, but IT is rife with overloaded TLAs, so this is just another one. You need to stop being just the backup administrator, and start being the company’s Data Protection Advocate (DPA).

And how do you do that? You take a walk:

  1. Grab a notepad or an iPad and a suitable writing implement, be that pen or finger.
  2. Go into the server room.
  3. Note every bit of non-server equipment in that room.
  4. Next, start wandering around the offices.
  5. Note the electronic devices people are using. Smartphones? Tablets? PDAs? (Don’t laugh – I actually saw someone still using a Palm V just three weeks ago.)
  6. Ask at least two or three random people in each workgroup where they save their files to.
  7. Now go to your manager’s office.
  8. Tell your manager you want to have the title of DPA, and explain why.

I would suggest to you that very few organisations, if any, have actually formalised and thought through the process of just how much data goes unprotected on a daily basis. As such, it’s time for a new breed of backup administrators. Why? Because it’s damn unlikely that anyone else in the organisation will have anywhere near the level of appreciation for data protection that you have – because it’s part of your job.

Do you want to be a Backup Administrator, or do you want to be a Data Protection Advocate?

I previously said that backup administrators should be part of the change control process, but realistically this isn’t the case. In fact, the DPA for the organisation should be part of the change control process. That person should be tasked with speaking out on behalf of the data – how will it be protected? How will it be recovered? If it can’t be protected, how can the risk be ameliorated?

What don’t you backup?

Are you ready to be a DPA?

If you are, read on at “But where does the DPA fit in?”

 

Consider the following two questions:

  1. Do you manage your backups, or do your backups manage you?
  2. Does your organisation decide how backups should be done based on SLAs, etc., or do the backups dictate how production operates?

As you can well imagine, the answers to the above questions will very quickly tell you whether you’ve got a healthy, or a sick backup environment.

While it’s obvious how both questions should be answered, I’d wager that at least some readers will be getting that little twinge reading the above knowing that I’ve just described their backup environment as sick. And I don’t mean sick as in Gen-Y “fully sick”, I mean unwell.

If your backup environment manages you (most specifically your time and the amount of hair you’ve got left), or your backup environment dictates how production works, then you’ve got some problems you need to address. Now.

A lazy backup admin is a healthy backup admin

In 1996, I joined a system administration team that had one guiding motto: be lazy. Their attitude towards work was without a doubt the most influential one I’ve ever encountered, and it still guides my work life to this day.

I don’t mean lazy as in “avoid work”.

I mean lazy as in “automate! automate! automate!”

As far as they were concerned, the goal of the system administrator should be to automate all regular activities to the point that they should only ever be doing one of four activities:

  1. Automating processes.
  2. Checking results of automated processes.
  3. Waiting for something to go wrong/intervention to be required.
  4. Working on a project.

The same approach should be taken in backups. You should not be, say, mindlessly doing repetitive tasks that could be automated – you should be automating them and then checking the automation results. You shouldn’t be fixing errors on a daily basis; you should have a zero error policy, with error processing as an exceptional rather than an everyday task. Or you should be working on the next phase of expanding or updating the backup environment.
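As one hedged example of “checking the results of automated processes”, the Python sketch below flags failures that repeat across nights, which is the kind of error a zero error policy aims to eliminate. The function name, the data layout and the client names are all hypothetical; in practice the failure lists would come from whatever reporting your backup product provides.

```python
from collections import Counter

def repeated_failures(nightly_failures, min_nights=2):
    """Return the (client, saveset) pairs that failed on at least `min_nights` nights.

    nightly_failures: one set of (client, saveset) pairs per night.
    """
    counts = Counter()
    for failures in nightly_failures:
        counts.update(failures)
    return {item: n for item, n in counts.items() if n >= min_nights}

# Hypothetical results from three consecutive nights.
nights = [
    {("client-a", "/home"), ("client-b", "/var")},
    {("client-b", "/var")},
    {("client-c", "/")},
]

repeats = repeated_failures(nights)
print(repeats)  # {('client-b', '/var'): 2}
```

One-off failures still need an explanation, but anything this check reports is a repeating error and should be treated as a priority.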

Et tu, defendo?

The backup system shouldn’t be ambushing primary production. It should be there as a guardian, a defender – not the system that stabs from the shadows, or hogs the limelight.

Every backup product, and every backup system, will of course have limitations. But these limitations should not prevent critical activities in production from being undertaken. Instead, limitations should be ameliorated such that what needs to be done in production can still be done, with appropriate workarounds in place. If the limitations are hard ones which require a rethink of how production is done, it should not be at the expense of the business functions or the end users. This may require mitigation with other technologies – for instance, a classic scenario in situations where the backup product can’t run backups as frequently as SLAs require is to mix traditional backups and snapshots.

Some SLAs, in the light of the available budget and technology, should be reassessed. However, that’s not to say all of them should be in such situations. A sick backup system is one where any SLA, no matter how justified, that can’t be immediately met by the backup system “as is”, is abandoned.

You’re not the boss of me

So, are you in charge of your backup system, or is your backup system in charge of you?

If you can’t answer that question the right way, it’s time to seize control and make sure next time someone asks you, you can.

 

Martin Glassborow, aka @storagebod, and I had a bit of a discussion via Twitter, which came down to the following:

  • Martin feels the default backup policy within an environment should be to backup nothing;
  • I feel the default backup policy within an environment should be to backup everything.

Now the interesting thing is, we both actually meet in the middle, but just start from different points.

Martin has discussed his reasoning behind his default policy here, in “Don’t BackUp“, which I encourage you to read before continuing. There is, indeed, as Martin suggested in a tweet to me last night, a nice absolutism in either approach – don’t backup, or backup everything. Yet, neither is really the case.

My approach – that being to start with “backup everything”, starts with the following assumptions:

  1. Hardware can fail.
  2. Software can fail.
  3. Humans can make errors.
  4. Processes can fail.

By my very nature I think I’m perfectly suited to working in the backup space. I’ve always been into backup. On the Vic-20, when I was learning to program, I’d always save my programs onto two different tapes. On the Commodore 64, I’d always save my programs and documents onto two different disks. When I went to the PC, I’d always have a copy on a hard drive, and a copy on a floppy drive.

Martin’s approach is this:

Making it policy that nothing gets backed-up unless requested takes out all ambiguity. There can be no assumptions about what is being backed-up, it makes it someone’s responsibility as opposed to an assumed default.

There is, undoubtedly, logic in what Martin suggests, but it’s not a logical starting point I can personally reconcile myself with, for the fundamental reason that it (IMHO) assumes that everyone who interacts with the system understands the system and the nature of their interaction.

It in fact runs completely contrary to an axiom in user desktop/laptop backup approaches – if you leave backups up to the users, nothing will get backed up. That holds true for pretty much every business I’ve ever interacted with, from the most, to the least technical.

It’s for that reason, that lack of total systems awareness and data responsibility from all users of any environment, that my approach starts from the other end. Backup everything.

But I don’t really mean it. I abhor wastage. Recently, I’ve learnt that wastage comes in many forms, which is why the decision to move interstate and re-evaluate what I/we own has been cleansing. (See the article “deconstruction of falling stars” over at my personal blog for a bit more on that front.)

As I abhor wastage, I don’t actually believe you should backup everything within your environment. Sure, some vendors might like that notion – infinite tapes, disk, storage, snapshots, you name it. But it’s neither practical nor commercial reality to do this.

No, there is a middle ground. For me, the sweet spot is what I always come back to:

It is always better to backup a little more than you need, and waste some storage media, than it is to not backup quite enough, and be unable to recover.

So if your tape usage is say, 5-10% higher than it should be, or your VTL/B2D environment is 5-10% bigger than it really needs to be, I’m not concerned. (If it’s a crazy amount, like 100% more, then there’s a problem – a serious problem that has arisen from a lack of capacity planning, etc.)

I’ve seen IT sites where NetWorker agents have been deployed on every server within the environment, and when I’ve done a coverage analysis, I’ve seen servers that have this as the saveset:

/etc/hosts

Just that. Nothing more, nothing less. (You couldn’t get much less anyway.) I’ve equally seen sites where not only was a hot backup done of the production Oracle database via a module, but the database files were backed up as part of the filesystem backup, and then export/dumps were generated and backed up as well. Overkill? Yes. Were some backups unrecoverable? Yes.

Both are very clear examples of wastage, but I’ll tell you the difference.

The latter – backing up too much – is time and money wastage. Neither is pleasant, and both can hurt the bottom line of a company, yet that’s where it stops.

The former – backing up only what is explicitly requested, nothing more – is corporate wastage. There’s a little bit of monetary wastage involved (why spend the money on an agent to backup a single file?) – the real wastage though is that it could waste the company. Unable to recover legally required files because someone forgot to request them to be backed up? Hello, lawsuit loss. Unable to recover financial data that proves your company has correctly paid its taxes because someone forgot to request them to be backed up? Hello, double tax payments. For me it triggers thoughts of every possible nightmare scenario a company might experience, right through to total dissolution and loss of the company itself.

In my book, I make the differentiation between what I call inclusive and exclusive backup products. I define:

  • An inclusive backup product is one where you have to explicitly specify what gets backed up. By default, nothing is backed up unless you specify it.
  • An exclusive backup product is one where you have to explicitly specify what doesn’t get backed up. By default, everything is selected and you have to winnow that selection down yourself.

The first, I consider to be the hallmark of a workgroup backup product approach. Cost reduction is the primary focus of this approach. The second, I consider to be a fundamental requirement for a product to earn the “enterprise backup product” badge of honour. Without this, there is a distinct lack of trust.
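The practical difference between the two definitions shows up when something new appears on a host. The Python sketch below is only an illustration: the function names and filesystem paths are hypothetical, and it does not reflect how any particular backup product implements its selection.

```python
def inclusive_selection(filesystems, include):
    """Inclusive product: nothing is backed up unless explicitly listed."""
    return [fs for fs in filesystems if fs in include]

def exclusive_selection(filesystems, exclude):
    """Exclusive product: everything is backed up unless explicitly excluded."""
    return [fs for fs in filesystems if fs not in exclude]

# Hypothetical configuration, written when the host was first deployed.
include = {"/", "/home", "/var"}
exclude = {"/scratch"}

# Later, someone adds /data to the host and nobody updates the backup config.
filesystems = ["/", "/home", "/var", "/scratch", "/data"]

print(inclusive_selection(filesystems, include))  # ['/', '/home', '/var']
print(exclusive_selection(filesystems, exclude))  # ['/', '/home', '/var', '/data']
```

Under the inclusive model the new filesystem is silently unprotected until someone requests it; under the exclusive model it is protected by default, at the cost of possibly backing up a little more than needed.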

While I can understand Martin’s starting point, and that he moves more to the middle of making sure the right things are backed up, I can’t agree with this logic that this is the best approach.

I’ve seen, heard of, and witnessed too many IT war stories.

 

As a consultant, you get attuned to (or as some would have it, “cynical” about) certain key phrases and statements when you’re in meetings. Sometimes these statements are innocent and exactly what the person says, but usually they set the alarm bells ringing.

As a bit of winding down after a hectic 7 days, I thought I’d share the top 15 statements that cause me to start immediately trying to get deep qualification of what I’ve just been told…

What they say... | What I worry it means...
"Our backup results get filed automatically and someone reviews them." | "We have a server that hasn't successfully backed up for 6 months, but no-one's been checking the notifications."
"All our backups fit on a single tape" | "We upgrade our hardware every time this isn't the case."
"We're very selective about what we back up." | "We have critical production systems we forgot to add to our schedule."
"We don't want to get backup notifications." | "Backup? Meh."
"Our DBAs do their own backups." | "The DBAs don't believe in enterprise backup software and think dumps are better" ... OR ... "The backup administrators have lost control of the system and it's spiralling out of control."
"We don't have SLAs" | "No one wants ownership of establishing SLAs"
"We don't need SLAs" | "We trust in luck, and hope we don't ever need SLAs"
"Our users are responsible for backing up their laptops" | "Every day we're losing critical data that may be legally or fiscally required by the company."
"We don't have to do monthly backups." | "Even though we know we SHOULD do monthly backups, until someone puts it in writing, we're not going to."
"We've been asked to shrink our backup budget..." | "The business has this crazy idea that backup is an IT function and problem."
"Tape is dead" | "Someone with a vested interest in selling lots of HDD storage has visited lately."
"We do per-incident support." | "We have an Icarus support contract."
"It's too busy here to do capacity planning." | "We're wasting money as fast as we can get the budget for it."
"We don't need to {clone or otherwise duplicate} our backups." | "We're going to suffer a critical data loss situation."
"We only back up production data." | "A lot of people's work within the company is unprotected."

 

Pumping data

The age-old consideration in backup is the most simple one: how to pump the required data through in the required time frame in such a way that it can be readily recovered. This challenges us to constantly find the best way to achieve the data throughput required. What worked 10 years ago was not always applicable 5 years ago; what worked 5 years ago is not always applicable now. Consider for instance the adage:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

(Andrew Tanenbaum, 1996.)

What surprises me, to a degree, is that still, in 2011, we’re having discussions about data throughput where people focus on the wrong thing. I would humbly suggest that you shouldn’t give a flying fracas about how fast you can back your data up when compared to how fast you can recover it.

That’s right: when talking feeds and speeds, the only one to give a damn about in backup is how quickly you can recover the data once it’s been captured.

This is, in fact, why the terms RPO and RTO were invented. In particular for the topic of “pumping data”, RTO – Recovery Time Objective – is most important. How quickly do you need to get the data back?

In this scenario, Andrew Tanenbaum’s caution about a station wagon full of tapes hurtling down the highway is entirely appropriate. In fact, so much so that when companies start talking about how fast they need to back up (or how fast they can back up) without reference to recovery, I unfortunately go into this loop:

Why? Because it’s like when my grandmother wants to tell me a story about how she bumped into someone she hadn’t seen for 57 years in the supermarket, but gets stuck on an irrelevant detail. “Peaches or pears!” I used to say to her as a kid, perhaps a little disrespectfully – it didn’t matter whether she was out shopping for peaches or pears before the important thing happened! Same here – it doesn’t matter how fast you can pump data into the backup system – it’s how fast you can pump data out of it that is the only number worth focusing on.

We have to, as storage industry insiders, experts, advisors, consultants – whatever we want to call ourselves – keep vendors and customers focused on the really important metric: how fast they can recover. We have a duty of care to stand between the FUD and the hype and steer companies on a safe trajectory. The safe trajectory in this case is talking about recovery speeds rather than backup speeds.

This is why I rarely get excited about remote office backup strategies. For instance, a current meme in remote office backup strategy is the use of deduplication, most likely source based. The goal? Reduce the amount of data you have to transfer from the remote office to the head office to a small trickle, and all your problems are solved … until, of course, you need to recover that data.
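
A rough back-of-the-envelope sketch shows how lopsided this can get. Every figure here is hypothetical, and the sketch assumes the worst case for recovery: the remote site has lost everything, so the full data set has to come back across the same WAN link with no help from deduplication.

```python
def transfer_hours(data_gb, link_mbps):
    """Hours to move data_gb gigabytes over a link of link_mbps megabits/second."""
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / link_mbps / 3600


# All values below are assumptions chosen purely for illustration.
remote_data_gb = 500       # data held at the remote office
daily_change_rate = 0.02   # fraction of that data changing each day
dedup_reduction = 0.90     # fraction of changed data deduplication avoids sending
wan_mbps = 10              # link between remote office and head office

# Nightly backup only ships the changed, non-duplicate data.
nightly_gb = remote_data_gb * daily_change_rate * (1 - dedup_reduction)
backup_hours = transfer_hours(nightly_gb, wan_mbps)

# Full recovery has to ship everything back.
full_recovery_hours = transfer_hours(remote_data_gb, wan_mbps)

print(f"nightly backup: {backup_hours:.1f} hours")
print(f"full recovery:  {full_recovery_hours:.1f} hours")
```

With these made-up numbers the nightly backup finishes in minutes while a full recovery over the same link runs to several days. The backup figure looks great on a slide; the recovery figure is the one the business will actually care about.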

Don’t get me wrong, I’m not against remote office backups – I’m also not against centralised remote office backups, regardless of whether they’re achieved by deduplication, compression, magic pixies or faerie dust. In this example though there’s a simple fact: to talk about remote office backup without discussing remote office recovery is reprehensible.

Yes, reprehensible. I’ll use that term. It’s not a nice term, I know, but nor is the practice of ignoring the elephant in the room – recovery.

Look folks, do you really want me to prance around a stage doing the monkey dance shouting “Recovery! Recovery! Recovery!”? Is that what it has to take? Because, if it is, I’ll do it. (I might, if you don’t mind, try to avoid the flop sweat though.)

What am I asking for? Maybe it’s this simple thought:

Starting this year, let no company (vendor or otherwise) talk about a product’s backup performance without citing real world recovery scenarios and performance in those scenarios.

There is not a guaranteed 1:1 mapping between backup and recovery performance, and to imply there is, either by obfuscation or omission, is disrespectful to the data protection industry.

 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abaddon was at a University that instituted filesystem quotas, but due to their interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory and email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abaddon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that is deleted but known to be required later builds to the point where it’s not physically possible to have all of it residing on primary storage any longer should it be required. All we do is create a new headache while implementing at best a crude workaround.
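
The same pattern can be sketched as a toy simulation. The capacity, growth rate and delete size are invented for illustration; the point is only the shape of the result, not the specific numbers.

```python
# All figures are hypothetical.
capacity_tb = 10.0          # primary storage capacity
growth_tb_per_month = 1.5   # monthly growth in data the business still needs
free_on_delete_tb = 4.0     # amount deleted each time storage fills up

on_primary = 6.0            # required data currently on primary storage
deleted_but_required = 0.0  # required data that now exists only in backups
history = []

for month in range(1, 13):
    on_primary += growth_tb_per_month
    if on_primary >= capacity_tb:
        # "Delete when it's full": the data is still needed, but it now
        # lives only in the backup system.
        on_primary -= free_on_delete_tb
        deleted_but_required += free_on_delete_tb
    real_requirement = on_primary + deleted_but_required
    history.append((month, on_primary, deleted_but_required, real_requirement))

for month, primary, deleted, real in history:
    print(f"month {month:2d}: primary {primary:4.1f} TB, "
          f"deleted-but-required {deleted:4.1f} TB, real need {real:4.1f} TB")
```

At every month end, primary storage looks like it has headroom. Meanwhile the deleted-but-required total quietly climbs past the entire capacity of the array, which is exactly the requirement that nobody looking at a capacity report will ever see.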

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart way of avoiding archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive :-)

© 2012 The NetWorker Blog