New year’s resolutions for backup

I’d like to suggest that companies be prepared to make (and keep!) 7 new year’s resolutions when it comes to the field of backup and recovery:

  1. We will test our backups: If you don’t have a testing regime in place, you don’t have a backup system at all.
  2. We will duplicate our backups: Your backup system should not be a single point of failure. If you’re not cloning, replicating or duplicating your backups in some form, your backup system could be the straw that breaks the camel’s back when a major issue occurs.
  3. We will document our backups: As for testing, if your backup environment is undocumented, it’s not a system. All you’ve got is a collection of backups from which, if the right people are around at the right time and in the right frame of mind, you could get a recovery. If you want a backup system in place, you not only have to test your backups, you also have to keep them well documented.
  4. We will train our administrators and operators: It never ceases to amaze me the number of companies that deploy enterprise backup software and then insist that administrators and operators just learn how to use it themselves. While the concept of backup is actually pretty simple (“hey, you, back it up or you’ll lose it!”), the practicality of it can be a little more complex, particularly given that as an environment grows in size, so does the scope and the complexity of a backup system. If you don’t have some form of training (whether it’s internal, by an existing employed expert, or external), you’re at the edge of the event horizon, peering over into the abyss.
  5. We will implement a zero error policy: Again, there’s no such thing as a backup system when there’s no zero error policy. No ifs, no buts, no maybes. If you don’t rigorously implement a zero error policy, you’re flipping a coin every time you do a recovery, regardless of what backup product you use. (To learn more about a zero error policy, check out the trial podcast I did where that was the topic.)
  6. We will appoint a Data Protection Advocate: There’s a lot of data “out there” within a company, not necessarily under central IT control. Someone needs to be thinking about it. That someone should be the Data Protection Advocate (DPA). This person should be tasked with being the somewhat annoying person who is present at every change control meeting, raising her or his hand and saying “But wait, how will this affect our ability to protect our data?” That person should also be someone who wanders around the office(s) looking under desks for those pesky departmental servers and “test” boxes that are deployed, the extra hard drives attached to research machines, etc. If you have multiple offices, you should have a DPA per office. (The role of the DPA is outlined in this post, “What don’t you backup?”)
  7. We will assemble an Information Protection Advisory Council (IPAC): Sitting at an equal tier to the change control board, and reporting directly to the CTO/CIO/CFO, the IPAC will liaise with the DPA(s) and the business to make sure that everyone is across the contingencies that are in place for data protection, and be the “go-to” point for the business when it comes to putting new functions in place. They should be the group that sees a request for a new system or service and collectively liaises with the business and IT to ensure that the information generated by that system/service is protected. (If you want to know more about an IPAC and its role in the business, check out “But where does the DPA fit in?”)
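Resolution 1 is the one that lends itself most readily to automation. As a minimal sketch, assuming a test recovery has already restored a file to a scratch location and the original has not changed since the backup was taken, a checksum comparison can confirm that the restored copy matches. The function names here are hypothetical and are not tied to any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """True when the restored file is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

A scheduled job could run a check like this against a handful of randomly chosen files after each test recovery and raise an alert on any mismatch.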

And there you have it – the new year’s resolutions for your company. You may be surprised – while there’ll be a little effort getting these in place, once they’re there, you’re going to find backup, recovery, and the entire information protection process a lot easier to manage, and a lot more reliable.

 

Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error happens.
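As a rough illustration of that distinction, the sketch below separates repeating errors from one-off ones and lists the errors nobody has explained yet. The record layout (client, message, explanation) is an assumption made for this example; it is not a format produced by any particular backup product.

```python
from collections import Counter
from typing import NamedTuple, Optional


class BackupError(NamedTuple):
    client: str
    message: str
    explanation: Optional[str]  # None means nobody has explained it yet


def repeating_errors(errors: list[BackupError]) -> set[tuple[str, str]]:
    """Return the (client, message) pairs that occurred more than once."""
    counts = Counter((e.client, e.message) for e in errors)
    return {key for key, count in counts.items() if count > 1}


def unexplained_errors(errors: list[BackupError]) -> list[BackupError]:
    """Return the errors that have no recorded explanation."""
    return [e for e in errors if not e.explanation]
```

Under a zero error policy both results should trend towards empty: repeats get fixed at the root, and every remaining error gets an explanation.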

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently successful) your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
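The arithmetic above can be reproduced in a few lines. This is only a sketch: the host and filesystem counts come from the example, while the function names and the idea of summing per-night counts for the rolling average are assumptions.

```python
def nightly_savesets(hosts: int, filesystems_per_host: int) -> int:
    """Filesystem savesets, one index per host, one bootstrap, one NMC database."""
    return hosts * filesystems_per_host + hosts + 1 + 1


def target_count(total: int, rate: float) -> int:
    """Number of savesets that must succeed to meet a success rate, rounded."""
    return round(total * rate)


def rolling_success_rate(succeeded: list[int], attempted: list[int]) -> float:
    """Success rate across several nights, given per-night counts."""
    return sum(succeeded) / sum(attempted)


total = nightly_savesets(20, 4)
print(total, target_count(total, 0.98), target_count(total, 0.95))  # 102 100 97
```

Feeding the last week of nightly counts into `rolling_success_rate` gives the figure to compare against the 95-98% band.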

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as that may be highly desirable, a backup system touches so many parts of an operating IT environment that it’s also one of the systems most vulnerable to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually allowing for with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

 

I think this is a question that the average company wholly, inadequately, fails to understand. You see, when it’s asked, people start thinking about their servers – “data X is backed up, data Y can be reconstructed, so we don’t backup that…”

At the end of this article though, I hope you’ll want to take a walk.

At this point, the average backup administrator is responsible for just the backups of servers and storage servers to which discrete agents can be connected. Yet this is woefully inadequate and demonstrates a wholly inappropriate level of planning within a company. That is, the person or people responsible for core data protection don’t get buy-in or oversight on all data protection.

What else is there within an environment? Well, quite a lot, potentially.

You’ve got the obvious things of course – end user desktops and laptops. Is there potential for local data storage on those machines? If there is, is that data protected?

You’ve got the slightly less obvious things – smart phones with critical business contacts, memos, etc., on them. Is that data being routinely synced? What is it being synced to? Is that synced data accessible if, say, the person leaves? Is that synced data backed up?

Moving right along past the “easy” questions, we’ve got the start of the really tricky questions – look at all the appliances within the organisation. No, I’m not talking about microwaves and toaster ovens in the kitchenettes on each floor. I’m talking about those boxes in racks that don’t have either a traditional operating system or an NDMP agent on them.

The network switches.

The fibre-channel switches.

The PABXs.

The encryption routers.

The encryption FC routers.

And so on.

All of these sorts of devices have configuration/state data on them. A month or so ago, I was talking to another third party consultant at a site, and that person whispered to me, with a slightly deer-in-the-headlights facial expression, “Their SAN FC zoning hasn’t even been saved to the switches, because they’re older and they can’t schedule the outage to save the config.”

And I thought, what sort of bizarro world have I entered? Because I’d bet money that if the running state wasn’t committed, it certainly wasn’t backed up either.

So, here’s my challenge to you, as a backup administrator – take ownership and become a Data Protection Advocate. I know, EMC have a product called DPA, but IT is rife with overloaded TLAs, so this is just another one. You need to stop being just the backup administrator, and start being the company’s Data Protection Advocate (DPA).

And how do you do that? You take a walk:

  1. Grab a notepad or an iPad and a suitable writing implement, be that pen or finger.
  2. Go into the server room.
  3. Note every bit of non-server equipment in that room.
  4. Next, start wandering around the offices.
  5. Note the electronic devices people are using. Smartphones? Tablets? PDAs? (Don’t laugh – I actually saw someone still using a Palm V just three weeks ago.)
  6. Ask at least two or three random people in each workgroup where they save their files to.
  7. Now go to your manager’s office.
  8. Tell your manager you want to have the title of DPA, and explain why.

I would suggest to you that very few organisations, if any, have actually formalised and thought through the process of just how much data goes unprotected on a daily basis. As such, it’s time for a new breed of backup administrators. Why? Because it’s damn unlikely that anyone else in the organisation will have anywhere near the level of appreciation for data protection that you do – because it’s part of your job.

Do you want to be a Backup Administrator, or do you want to be a Data Protection Advocate?

I previously said that backup administrators should be part of the change control process, but realistically this isn’t the case. In fact, the DPA for the organisation should be part of the change control process. That person should be tasked with speaking out on behalf of the data – how will it be protected? How will it be recovered? If it can’t be protected, how can the risk be ameliorated?

What don’t you backup?

Are you ready to be a DPA?

If you are, read on at “But where does the DPA fit in?”

 

Consider the following two questions:

  1. Do you manage your backups, or do your backups manage you?
  2. Does your organisation decide how backups should be done based on SLAs, etc., or do the backups dictate how production operates?

As you can well imagine, the answers to the above questions will very quickly tell you whether you’ve got a healthy, or a sick backup environment.

While it’s obvious how both questions should be answered, I’d wager that at least some readers will be getting that little twinge reading the above knowing that I’ve just described their backup environment as sick. And I don’t mean sick as in Gen-Y “fully sick”, I mean unwell.

If your backup environment manages you (most specifically your time and the amount of hair you’ve got left), or your backup environment dictates how production works, then you’ve got some problems you need to address. Now.

A lazy backup admin is a healthy backup admin

In 1996, I joined a system administration team that had one guiding motto: be lazy. Their attitude towards work was without a doubt the most influential one I’ve ever encountered, and it still guides my work life to this day.

I don’t mean lazy as in “avoid work”.

I mean lazy as in “automate! automate! automate!”

As far as they were concerned, the goal of the system administrator should be to automate all regular activities to the point that they should only ever be doing one of four activities:

  1. Automating processes.
  2. Checking results of automated processes.
  3. Waiting for something to go wrong/intervention to be required.
  4. Working on a project.

The same approach should be taken in backups. You should not be, say, mindlessly doing repetitive tasks that could be automated – you should be automating them and then checking the automation results. You shouldn’t be fixing errors on a daily basis; you should have a zero error policy, with error processing as an exceptional rather than an everyday task. Or you should be working on the next phase of expanding or updating the backup environment.

Et tu, defendo?

The backup system shouldn’t be ambushing primary production. It should be there as a guardian, a defender – not the system that stabs from the shadows, or hogs the limelight.

Every backup product, and every backup system, will of course have limitations. But these limitations should not prevent critical activities in production from being undertaken. Instead, limitations should be ameliorated such that what needs to be done in production can still be done, with appropriate workarounds in place. If the limitations are hard ones which require a rethink of how production is done, it should not be at the expense of the business functions or the end users. This may require mitigation with other technologies – for instance, a classic scenario in situations where the backup product can’t run backups as frequently as SLAs require is to mix traditional backups and snapshots.

Some SLAs, in the light of the available budget and technology, should be reassessed. However, that’s not to say all of them should in such situations. A sick backup system is one where any SLA that can’t be immediately met by the backup system “as is” is abandoned, no matter how justified.

You’re not the boss of me

So, are you in charge of your backup system, or is your backup system in charge of you?

If you can’t answer that question the right way, it’s time to seize control and make sure next time someone asks you, you can.

 

Martin Glassborow, aka @storagebod, and I had a bit of a discussion via Twitter, which came down to the following:

  • Martin feels the default backup policy within an environment should be to backup nothing;
  • I feel the default backup policy within an environment should be to backup everything.

Now the interesting thing is, we both actually meet in the middle, but just start from different points.

Martin has discussed his reasoning behind his default policy in “Don’t BackUp”, which I encourage you to read before continuing. There is, indeed, as Martin suggested in a tweet to me last night, a nice absolutism in either approach – don’t backup, or backup everything. Yet, neither is really the case.

My approach – that being to start with “backup everything”, starts with the following assumptions:

  1. Hardware can fail.
  2. Software can fail.
  3. Humans can make errors.
  4. Processes can fail.

By my very nature I think I’m perfectly suited to working in the backup space. I’ve always been into backup. On the Vic-20, when I was learning to program, I’d always save my programs onto two different tapes. On the Commodore 64, I’d always save my programs and documents onto two different disks. When I went to the PC, I’d always have a copy on a hard drive, and a copy on a floppy drive.

Martin’s approach is this:

Making it policy that nothing gets backed-up unless requested takes out all ambiguity. There can be no assumptions about what is being backed-up, it makes it someone’s responsibility as opposed to an assumed default.

There is, undoubtedly, logic in what Martin suggests, but it’s not a logical starting point I can personally reconcile myself with, for the fundamental reason that it (IMHO) assumes that everyone who interacts with the system understands the system and the nature of their interaction.

It in fact runs completely contrary to an axiom in user desktop/laptop backup approaches – if you leave backups up to the users, nothing will get backed up. That holds true for pretty much every business I’ve ever interacted with, from the most, to the least technical.

It’s for that reason, that lack of total systems awareness and data responsibility from all users of any environment, that my approach starts from the other end. Backup everything.

But I don’t really mean it. I abhor wastage. Recently, I’ve learnt that wastage comes in many forms, which is why the decision to move interstate and re-evaluate what I/we own has been cleansing. (See the article “deconstruction of falling stars” over at my personal blog for a bit more on that front.)

As I abhor wastage, I don’t actually believe you should backup everything within your environment. Sure, some vendors might like that notion – infinite tapes, disk, storage, snapshots, you name it. But it’s neither practical nor commercial reality to do this.

No, there is a middle ground. For me, the sweet spot is what I always come back to:

It is always better to backup a little more than you need, and waste some storage media, than it is to not backup quite enough, and be unable to recover.

So if your tape usage is say, 5-10% higher than it should be, or your VTL/B2D environment is 5-10% bigger than it really needs to be, I’m not concerned. (If it’s a crazy amount, like 100% more, then there’s a problem – a serious problem that has arisen from a lack of capacity planning, etc.)

I’ve seen IT sites where NetWorker agents have been deployed on every server within the environment, and when I’ve done a coverage analysis, I’ve seen servers that have this as the saveset:

/etc/hosts

Just that. Nothing more, nothing less. (You couldn’t get much less anyway.) I’ve equally seen sites where not only was a hot backup done of the production Oracle database via a module, but the database files were backed up as part of the filesystem backup, and then export/dumps were generated and backed up as well. Overkill? Yes. Were some backups unrecoverable? Yes.

Both are very clear examples of wastage, but I’ll tell you the difference.

The latter one – backing up too much – is time and money wastage. Neither is pleasant, both can hurt the bottom line of a company, yet that’s where it stops.

The former – backing up only what is explicitly requested, nothing more, is corporate wastage. There’s a little bit of monetary wastage involved (why spend the money on an agent to backup a single file?) – the real wastage though is that it could waste the company. Unable to recover legally required files because someone forgot to request them to be backed up? Hello, lawsuit loss. Unable to recover financial data that proves your company has correctly paid its taxes because someone forgot to request them to be backed up? Hello, double tax payments. For me it triggers thought of every possible nightmare scenario a company might experience, right through to total dissolution and loss of the company itself.

In my book, I make the differentiation between what I call inclusive and exclusive backup products. I define:

  • An inclusive backup product is one where you have to explicitly specify what gets backed up. By default, nothing is backed up unless you specify it.
  • An exclusive backup product is one where you have to explicitly specify what doesn’t get backed up. By default, everything is selected and you have to winnow that selection down yourself.
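The difference is easiest to see when a new filesystem appears that nobody has told the backup administrator about. The sketch below is illustrative only; the function names and paths are hypothetical and do not reflect how any particular product implements selection.

```python
def inclusive_selection(present: set[str], include: set[str]) -> set[str]:
    """Inclusive product: only what has been explicitly listed is backed up."""
    return present & include


def exclusive_selection(present: set[str], exclude: set[str]) -> set[str]:
    """Exclusive product: everything is backed up except explicit exclusions."""
    return present - exclude


# "/data/newproject" was created last week and nobody requested a backup.
filesystems = {"/", "/home", "/scratch", "/data/newproject"}

print(sorted(inclusive_selection(filesystems, {"/", "/home"})))
# ['/', '/home']
print(sorted(exclusive_selection(filesystems, {"/scratch"})))
# ['/', '/data/newproject', '/home']
```

With the inclusive approach the new filesystem is silently skipped until someone remembers to add it; with the exclusive approach it is protected from the first night.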

The first, I consider to be the hallmark of a workgroup backup product approach. Cost reduction is the primary focus of this approach. The second, I consider to be a fundamental requirement for a product to earn the “enterprise backup product” badge of honour. Without this, there is a distinct lack of trust.

While I can understand Martin’s starting point, and that he moves more to the middle of making sure the right things are backed up, I can’t agree with the logic that this is the best approach.

I’ve seen, heard of, and witnessed too many IT war stories.

 

As a consultant, you get attuned to (or as some would have it, “cynical” about) certain key phrases and statements when you’re in meetings. Sometimes these statements are innocent and mean exactly what the person says, but usually they set the alarm bells ringing.

As a bit of winding down after a hectic 7 days, I thought I’d share the top 15 statements that cause me to start immediately trying to get deep qualification of what I’ve just been told…

What they say... | What I worry it means...
"Our backup results get filed automatically and someone reviews them." | "We have a server that hasn't successfully backed up for 6 months, but no-one's been checking the notifications."
"All our backups fit on a single tape" | "We upgrade our hardware every time this isn't the case."
"We're very selective about what we backup." | "We have critical production systems we forgot to add to our schedule."
"We don't want to get backup notifications." | "Backup? Meh."
"Our DBAs do their own backups." | "The DBAs don't believe in enterprise backup software and think dumps are better" ... OR ... "The backup administrators have lost control of the system and it's spiralling out of control."
"We don't have SLAs" | "No one wants ownership of establishing SLAs"
"We don't need SLAs" | "We trust in luck, and hope we don't ever need SLAs"
"Our users are responsible for backing up their laptops" | "Every day we're losing critical data that may be legally or fiscally required by the company."
"We don't have to do monthly backups." | "Even though we know we SHOULD do monthly backups, until someone puts it in writing, we're not going to."
"We've been asked to shrink our backup budget..." | "The business has this crazy idea that backup is an IT function and problem."
"Tape is dead" | "Someone with a vested interest in selling lots of HDD storage has visited lately."
"We do per-incident support." | "We have an Icarus support contract."
"It's too busy here to do capacity planning." | "We're wasting money as fast as we can get the budget for it."
"We don't need to {clone or otherwise duplicate} our backups." | "We're going to suffer a critical data loss situation."
"We only backup production data." | "A lot of people's work within the company is unprotected."

 

Pumping data

The age-old consideration in backup is the most simple one: how to pump the required data through in the required time frame in such a way that it can be readily recovered. This challenges us to constantly find the best way to achieve the data throughput required. What worked 10 years ago was not always applicable 5 years ago; what worked 5 years ago is not always applicable now. Consider for instance the adage:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

(Andrew Tanenbaum, 1996.)

What surprises me, to a degree, is that still, in 2011, we’re having discussions about data throughput where people focus on the wrong thing. I would humbly suggest that you shouldn’t give a flying fracas about how fast you can back your data up when compared to how fast you can recover it.

That’s right: when talking feeds and speeds, the only one to give a damn about in backup is how quickly you can recover the data once it’s been captured.

This is, in fact, why the terms RPO and RTO were invented. In particular for the topic of “pumping data”, RTO – Recovery Time Objective – is most important. How quickly do you need to get the data back?
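A back-of-the-envelope check makes the point concrete. In this sketch the data size, restore throughput and RTO are made-up figures and the function names are hypothetical; a real restore throughput figure should come from an actual test recovery, not from the backup speed.

```python
def recovery_hours(data_gb: float, restore_mb_per_s: float) -> float:
    """Hours needed to restore data_gb at a sustained restore throughput."""
    seconds = (data_gb * 1024) / restore_mb_per_s
    return seconds / 3600


def meets_rto(data_gb: float, restore_mb_per_s: float, rto_hours: float) -> bool:
    """True when the estimated recovery time fits inside the RTO."""
    return recovery_hours(data_gb, restore_mb_per_s) <= rto_hours


# Assumed figures: 2 TB to recover, 100 MB/s sustained restore, 4 hour RTO.
print(f"{recovery_hours(2048, 100):.1f} hours")  # 5.8 hours
print(meets_rto(2048, 100, rto_hours=4))         # False
```

However quickly that 2 TB was backed up, the environment in this example misses its RTO by nearly two hours.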

In this scenario, Andrew Tanenbaum’s caution about a station wagon full of tapes hurtling down the highway is entirely appropriate. In fact, so much so that when companies start talking about how fast they need to backup (or how fast they can backup) without reference to recovery, I unfortunately go into this loop:

Why? Because it’s like when my grandmother wants to tell me a story about how she bumped into someone she hadn’t seen for 57 years in the supermarket, but gets stuck on an irrelevant detail. “Peaches or pears!” I used to say to her as a kid, perhaps a little disrespectfully – it didn’t matter whether she was out shopping for peaches or pears before the important thing happened! Same here – it doesn’t matter how fast you can pump data into the backup system – it’s how fast you can pump data out of it that is the only number worth focusing on.

We have to, as storage industry insiders, experts, advisors, consultants – whatever we want to call ourselves – keep vendors and customers focused on the real important metric: how fast they can recover. We have a duty of care to stand between the FUD and the hype and steer companies on a safe trajectory. The safe trajectory in this case is talking about recovery speeds rather than backup speeds.

This is, for instance, why I rarely get excited about remote office backup strategies. For instance, a current meme in remote office backup strategy is the use of deduplication – most likely source based. The goal? Reduce the amount of data you have to transfer from the remote office to the head office to a small trickle, and all your problems are solved … until, of course, you need to recover that data.
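To put rough numbers on that, the sketch below compares the nightly transfer with a full recovery over the same link. Every figure (link speed, data size, post-deduplication change) is an assumption chosen for illustration, and it assumes the whole dataset has to cross the WAN on recovery, which will not be true of every design.

```python
def transfer_hours(data_gb: float, link_mbit_per_s: float) -> float:
    """Hours to move data_gb across a link, ignoring protocol overheads."""
    megabits = data_gb * 1024 * 8
    return megabits / link_mbit_per_s / 3600


full_gb = 500.0         # assumed size of the remote office data
daily_unique_gb = 5.0   # assumed unique data sent nightly after deduplication
link = 10.0             # assumed WAN speed in Mbit/s

print(f"nightly backup: {transfer_hours(daily_unique_gb, link):.1f} h")
print(f"full recovery:  {transfer_hours(full_gb, link):.1f} h")
```

With these assumed figures the nightly backup takes a little over an hour, while pulling everything back would take the better part of five days.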

Don’t get me wrong, I’m not against remote office backups – I’m also not against centralised remote office backups, regardless of whether they’re achieved by deduplication, compression, magic pixies or faerie dust. In this example though there’s a simple fact: to talk about remote office backup without discussing remote office recovery is reprehensible.

Yes, reprehensible. I’ll use that term. It’s not a nice term, I know, but nor is the practice of ignoring the elephant in the room – recovery.

Look folks, do you really want me to prance around a stage doing the monkey dance shouting “Recovery! Recovery! Recovery!”? Is that what it has to take? Because, if it is, I’ll do it. (I might, if you don’t mind, try to avoid the flop sweat though.)

What am I asking for? Maybe it’s this simple thought:

Starting this year, let no company (vendor or otherwise) talk about a product’s backup performance without citing real world recovery scenarios and performance in those scenarios.

There is not a guaranteed 1:1 mapping between backup and recovery performance, and to imply there is, either by obfuscation or omission, is disrespectful to the data protection industry.

 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abaddon was at a University that instituted filesystem quotas, but due to their interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory and email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

[Figure: Deleting with reckless abaddon]

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that is deleted but known to be required later builds to the point where it’s not physically possible to have all of it residing on primary storage any longer should it be required. All we do is create a new headache while implementing at best a crude workaround.
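A small simulation shows how the gap opens up. The capacity, growth and deletion figures used here are invented for illustration, and the model (delete a fixed amount of still-required data whenever primary storage overflows) is a deliberate simplification.

```python
from typing import Iterator


def simulate(capacity_tb: float, growth_tb: float, delete_tb: float,
             months: int) -> Iterator[tuple[int, float, float]]:
    """Yield (month, apparent_tb, real_requirement_tb) for each month.

    apparent: what is actually sitting on primary storage
    real: apparent plus data deleted from primary but still required
    """
    on_primary = 0.0
    deleted_but_required = 0.0
    for month in range(1, months + 1):
        on_primary += growth_tb
        if on_primary > capacity_tb:
            # "Fix" the space problem by deleting data and trusting the backups.
            on_primary -= delete_tb
            deleted_but_required += delete_tb
        yield month, on_primary, on_primary + deleted_but_required


for month, apparent, real in simulate(10, 2, 4, 12):
    print(f"month {month:2d}: apparent {apparent:4.1f} TB, real {real:4.1f} TB")
```

With these assumed numbers, primary storage never shows more than 10 TB in use, yet by month 12 the real requirement is 24 TB, more than double what the array could hold.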

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart way to avoid archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

[Figure: Data Lifecycle]

Let me know when you’re all done with it and I’ll archive :-)

 

This is the third post in the four part series, “Data lifecycle management”. The series started with “A basic lifecycle”, and continued with “The importance of being archived (and deleted)”. (An aside, “Stub vs Process Archive” is nominally part of the series.)

Legend has it that the Greek king Sisyphus was a crafty old bloke who managed to elude death several times through all manner of tricks – including chaining up Death when he came to visit.

As punishment, when Sisyphus finally died, he was sent to Hades, where he was given an eternal punishment of trying to roll a rock up over a hill. Only the rock was too heavy (probably thanks to a little hellish mystical magic), and every time he got to the top of the hill, the rock would fall, forcing him to start again.

Homer in the Odyssey described the fate of Sisyphus thusly:

“And I saw Sisyphus at his endless task raising his prodigious stone with both his hands. With hands and feet he tried to roll it up to the top of the hill, but always, just before he could roll it over on to the other side, its weight would be too much for him, and the pitiless stone would come thundering down again on to the plain.”

Companies that don’t delete unnecessary, stagnant data share the same fate as Sisyphus. When you think about it, the parallels are actually quite strong. They set themselves an impossible task every day – to keep all data generated by the company. That ignores the obvious truth that data sizes have exploded and will continue to grow. It also ignores the obvious truth that some data doesn’t need to be remembered for all time.

A company that consigns itself to the fate of Sisyphus will typically be a heavy investor in archive technology. So we come to the third post in the data lifecycle management series – the challenge of only archiving/never deleting data.

The common answer again to this is that “storage is cheap”, but there’s nothing cheap about paying to store data that you don’t need. There’s a basic, common logic to use here – what do you personally keep, and what do you personally throw away? Do you keep every letter you’ve ever received, every newspaper you’ve ever read, every book you’ve ever bought, every item of clothing you’ve ever worn, etc.?

The answer (for the vast majority of people) is no: there’s a useful lifespan of an item, and once that useful lifespan has elapsed, we have to make a decision on whether to keep it or not. I mentioned my own personal experience when I introduced the data lifecycle thread; preparing to move interstate I have to evaluate everything I own and decide whether I need to keep it or ditch it. Similarly, when I moved from Google Mail to MobileMe mail, I finally stopped to think about all the email I’d been storing over the years. Old Uni emails (I finished Uni in 1995/graduated in 1996), trivial email about times for movies, etc. Deleting all the email I’d needlessly kept because “storage is cheap” saved me almost 10GB of storage.

Saying “storage is cheap” is like closing your eyes and hoping the freight train barrelling towards you is an optical illusion. In the end, it’s just going to hurt.

This is not, by any means, an argument that you must only delete/never archive. (Indeed, the next article in this series will be about the perils of taking that route.) However, archive must be tempered with deletion or else it becomes the stone, and the storage administrators become Sisyphus.

Consider a sample enterprise archive arrangement whereby:

  • Servers and NAS use primary storage.
  • Data is archived from NAS to single-instance WORM storage.
  • The single-instance WORM storage is replicated.

Like it or not, there is a real, tangible cost to the storage of data at each of those steps. There is, undoubtedly, some data that must be stored on primary storage, and there’s undoubtedly some data that is legitimately required and can be moved to archive storage.

Yet equally, keeping data in such an environment that is totally irrelevant, with no ongoing purpose and no legal/fiscal reason for retention, will just cost money. If you extend that to the point of always keeping data, your company will need awfully deep pockets. Sure, some vendors will love you for wanting to keep everything forever, but in Shakespeare’s immortal words, “the truth will out”.
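As a back-of-the-envelope illustration of that cost, here is a small sketch of the sample arrangement above. The per-terabyte prices, the tier names and both function names are invented for the example; substitute whatever your own environment actually costs.

```python
# Hypothetical yearly cost per TB for each step of the sample arrangement.
COST_PER_TB = {
    "primary": 3000.0,
    "archive_worm": 800.0,
    "archive_replica": 800.0,
}


def yearly_cost(primary_tb, archived_tb, cost=COST_PER_TB):
    """Archived data is paid for twice: once on WORM, once on the replica."""
    return (primary_tb * cost["primary"]
            + archived_tb * cost["archive_worm"]
            + archived_tb * cost["archive_replica"])


def saving_from_deletion(archived_tb, unneeded_fraction, cost=COST_PER_TB):
    """Yearly saving if the unneeded share of the archive were deleted."""
    unneeded_tb = archived_tb * unneeded_fraction
    return unneeded_tb * (cost["archive_worm"] + cost["archive_replica"])


print(yearly_cost(primary_tb=20, archived_tb=100))
print(saving_from_deletion(archived_tb=100, unneeded_fraction=0.25))
```

The point of the sketch is only that every terabyte kept in the archive is charged at each step it passes through, so irrelevant data is paid for more than once.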

Mark Twomey (aka Storagezilla), an EMC employee, wrote on his blog when discussing backup, archive and deletion:

“If you don’t need to hold onto data delete it. You don’t hold onto all the mail and fliers that come through your letterbox so why would you hold on to all files that land on your storage? Deletion is as valid a data management policy as retention.”

For proper data lifecycle management, we have to be able to obey the simplest of rules: sometimes, things should be forgotten.

 

This is an adjunct post to the current series, “Data lifecycle management”, and is intended to provide a little more information about types of archiving that can be done.

When we literally talk about archiving (rather than tiering), there are two distinctly different processes in archival operations:

  • Stub based archive – transparent to the end user
  • Process archive – requires access changes by the end user

Stub based archive is an interesting beast. The entire notion is to effectively present a unified, unmodified view of the filesystem(s) to the end user such that data access continues as always, regardless of whether the file currently exists on primary storage, or has been archived. Conceptually, it resembles the following:

[Diagram: Stub based archives]

With a stub-based archive system, there is no apparent difference to the end user in accessing a file regardless of whether it still exists on primary storage or whether it’s been archived. When a file is archived, a stub, with the same name and extension, is left behind. The archive system sits between end-user processes and filesystem processes, and detects accesses to stubs. When a user accesses a stub, the archive process intercepts that read and returns the real file. At most, a user will notice a delay in the file access, depending on the speed of the archive storage. If the user subsequently writes to the file, the stub is replaced with the new version of the file, restarting the file usage process. Backup systems, when properly integrated with stub based archive, will back up the stub, rather than retrieve the entire file from archive.
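The behaviour described above can be sketched with a toy in-memory model. This is not how any particular archive product is implemented; the class and method names are hypothetical, and two dictionaries stand in for primary and archive storage.

```python
class StubArchive:
    """Toy in-memory model of a stub based archive (illustrative only)."""

    _STUB = object()  # marker left on primary storage in place of the file

    def __init__(self):
        self.primary = {}  # file name -> contents, or the stub marker
        self.archive = {}  # file name -> archived contents

    def write(self, name, data):
        # A write always lands on primary storage, replacing any stub.
        self.primary[name] = data

    def is_stub(self, name):
        return self.primary[name] is self._STUB

    def archive_file(self, name):
        # Move the contents to archive storage and leave a stub behind.
        if self.is_stub(name):
            return  # already archived, nothing to do
        self.archive[name] = self.primary[name]
        self.primary[name] = self._STUB

    def read(self, name):
        # Intercept reads of stubs and return the real file from archive.
        if self.is_stub(name):
            return self.archive[name]
        return self.primary[name]

    def backup_view(self):
        # A well-integrated backup copies the stub, not the archived file.
        return {name: ("<stub>" if data is self._STUB else data)
                for name, data in self.primary.items()}


fs = StubArchive()
fs.write("budget.xls", b"2009 figures")
fs.archive_file("budget.xls")
print(fs.read("budget.xls"))   # the user still gets the real contents
print(fs.backup_view())        # the backup only sees the stub
```

What happens to the archived copy once a user overwrites the stub is deliberately left out of the model; that depends on the archive system and the storage behind it.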

Archive systems such as those described above allow for highly configurable archive policies – simple rules such as “files not accessed in 180 days will be archived”, as well as more complex rules, e.g., “Excel files not accessed in 365 days from finance users AND 180 days by management users will be archived”.
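To make those two example rules concrete, here is a minimal sketch of how such policies might be evaluated. The `FileInfo` and `Rule` types, the `"*"` wildcard for “any user” and the group names are all assumptions made for illustration, not the configuration syntax of any real archive product. I’ve also assumed “not accessed in N days” means the last access was more than N days ago.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Optional


@dataclass
class FileInfo:
    name: str
    last_access_by_group: Dict[str, date]  # group -> most recent access


@dataclass
class Rule:
    idle_days_by_group: Dict[str, int]  # group ("*" = anyone) -> idle days
    extension: Optional[str] = None     # e.g. ".xls"; None matches any file

    def matches(self, f: FileInfo, today: date) -> bool:
        if self.extension and not f.name.lower().endswith(self.extension):
            return False
        for group, days in self.idle_days_by_group.items():
            if group == "*":
                last = max(f.last_access_by_group.values(), default=None)
            else:
                last = f.last_access_by_group.get(group)
            # A group that never accessed the file counts as idle.
            if last is not None and (today - last).days <= days:
                return False
        return True  # every condition held, so the file can be archived


today = date(2011, 1, 1)
simple = Rule({"*": 180})
excel = Rule({"finance": 365, "management": 180}, extension=".xls")

report = FileInfo("budget.xls", {
    "finance": today - timedelta(days=400),
    "management": today - timedelta(days=90),
})
print(simple.matches(report, today), excel.matches(report, today))
```

In this example the file is not archived under either rule, because management users touched it 90 days ago even though finance users have left it alone for over a year.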

Stub based archiving is paradoxically best suited to large environments. Paradoxically because it has the potential to introduce a new headache for backup administrators: massively dense filesystems. For more information on dense filesystems, read “In-lab review of the impact of dense filesystems”. The stub issue is something I’ve touched on previously in “HSM implications for backup”.

The other archive method is what I’d refer to as “process based archive”. This is used in a lot of smaller businesses, and centres around very simple archive policies where entire collections of data are stored in a formal hierarchy, and periodically archived – for instance:

[Diagram: Process archive]

In this scenario, filesystems are configured and data access rules are established such that users know data will either be in location A, or location B, based on a simple rule – e.g., the date of the file. In this sense, data written to primary storage is written in a structure that allows whole-scale relocation of large portions of it as required. Using the example above, user data structures might be configured to be broken down by year. So rather than a single “human resources” directory on the fileserver, for instance, there would be one under a parent directory of 2010, one under a parent directory of 2009, etc. As data access becomes less common, the older year parent directories (with all their hierarchies) are either taken offline entirely or moved to slower storage – but regardless, receive “final” multiple archive style backups before being taken out of the backup regime entirely.
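A minimal sketch of the year-based rule might look like the following. The function names, the `keep_years` parameter and the “primary”/“archive” location labels are assumptions for illustration; the actual relocation and final backups are whatever your own process dictates.

```python
def years_to_archive(dir_names, current_year, keep_years=2):
    """Pick the year-named parent directories old enough to be relocated.

    With keep_years=2, the current year and the one before it stay on
    primary storage; anything older is a candidate for archive.
    """
    cutoff = current_year - keep_years
    return sorted(
        name for name in dir_names
        if len(name) == 4 and name.isdigit() and int(name) <= cutoff
    )


def location_for(year, current_year, keep_years=2):
    """The simple rule users rely on: the file's year decides its location."""
    return "archive" if year <= current_year - keep_years else "primary"


top_level = ["2008", "2009", "2010", "2011", "shared"]
print(years_to_archive(top_level, current_year=2011))
print(location_for(2009, current_year=2011), location_for(2010, current_year=2011))
```

Directories that are not named for a year (such as the hypothetical “shared” above) are simply ignored by the rule, which keeps the policy easy for users to reason about.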

Irrespective of which archive process is used, the net result should be the same for backup operations – removing stagnant data from the daily backup cycle.

One thing you might want to ponder: is data storage tiering capable of fulfilling archive requirements? I would suggest at the moment that the jury is still out on this one. The primary purpose of data storage tiering is to move less frequently accessed data to slower and cheaper storage. That’s akin to archival operations, but unless it’s very closely integrated with the backup software and processes involved, it may not necessarily remove that lower-tiered data from the actual primary backup cycle. Unless the tiering integrates to that point, my personal opinion is that it is not really archive.

© 2012 The NetWorker Blog