Basics: Planning A Recovery Service

Jan 30, 2018
 

Introduction

In Data Protection: Ensuring Data Availability, I talk quite a lot about what you need to understand and plan as part of a data protection environment. I’m often reminded of the old saying from clothing and carpentry – “measure twice, cut once”. The lesson in that statement, of course, is that rushing into something headlong may make your work more problematic. Taking the time to properly plan what you’re doing, though, can in a lot of instances (and data protection is one such instance) make the entire process easier. This post isn’t meant to be a replacement for the various planning chapters in my book – but I’m sure it’ll have some useful tips regardless.

We don’t back up just as something to do; in fact, we don’t protect data just as something to do, either. We protect data to shield our applications and services (and therefore our businesses) from failures, and to ensure we can recover it if necessary. So with that in mind, what are some essential activities in planning a recovery service?


First: Do you know what the data is?

Data classification isn’t something that gets done automatically as part of a data protection cycle. Maybe one day it will be, when AI and machine learning are sufficiently advanced; in the interim though it requires input from people – IT, the business, and so on. Of course, there’s nothing physically preventing you from planning and implementing a recovery service without performing data classification; I’d go so far as to suggest that an easy majority of businesses do exactly that. That doesn’t mean it’s an ideal approach though.

Data classification is all about understanding the purpose of the data, who cares about it, how it is used, and so on. It’s a collection of seemingly innocuous yet actually highly important questions. It’s something I cover quite a bit in my book, and for the very good reason that I honestly believe a recovery service can be made simpler, cheaper and more efficient if it’s complemented by a data classification process within the organisation.

Second: Does the data need to exist?

That’s right – does it need to exist? This is another essential but oft-overlooked part of achieving a cheaper, simpler and more efficient recovery service: data lifecycle management. Consider this: every 1TB you can eliminate from your primary storage systems will, for the average business at least, yield anywhere between 10TB and 30TB of savings in protection storage (RAID, replication, snapshots, backup and recovery, long term retention, etc.). For some businesses that number may be smaller, but for the majority of mid-sized and larger businesses that 10-30TB saving is likely to go much, much higher – particularly as the criticality of the data increases.
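
If you want to see how that multiplier comes about, here’s a rough back-of-the-envelope sketch. The copy counts and overheads below are purely illustrative assumptions – substitute your own environment’s figures:

    # Rough sketch: how 1TB of primary data fans out into protection storage.
    # Every multiplier below is an illustrative assumption, not a recommendation.
    def protection_footprint_tb(primary_tb):
        raid_overhead = 0.25        # e.g. parity overhead on primary storage
        replica_copies = 1          # one replicated copy of primary storage
        snapshot_overhead = 0.3     # snapshots as a fraction of primary
        backup_fulls = 6            # e.g. six weekly fulls retained
        backup_incrementals = 1.0   # incrementals roughly equal one extra full per cycle
        long_term_copies = 4        # e.g. monthly/yearly compliance copies

        return primary_tb * (raid_overhead + replica_copies + snapshot_overhead
                             + backup_fulls + backup_incrementals + long_term_copies)

    print(f"1TB of primary data -> ~{protection_footprint_tb(1.0):.1f}TB of protection storage")

With those (fairly conservative) assumptions, each 1TB of primary data drags along roughly 12-13TB of protection storage; longer retention or more copies pushes the figure towards the top of that 10-30TB range.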

Without a data lifecycle policy, bad things happen over time:

  • Keeping data becomes habitual rather than based on actual need
  • As ‘owners’ of data disappear (e.g., change roles, leave the company, etc.), reluctance to delete, prune or manage the data tends to increase
  • Apathy or intransigence towards developing a data lifecycle programme increases.

Businesses that avoid data classification and data lifecycle condemn themselves to the torment of Sisyphus – constantly trying to roll a boulder up a hill only to have it fall back down again before they get to the top. This manifests in many ways, of course, but in designing, acquiring and managing a data recovery service it usually hits the hardest.

Third: Does the data need to be protected?

I remain a firm believer that it’s always better to backup too much data than not enough. But that’s a default, catchall position rather than one which should be the blanket rule within the business. Part of data classification and data lifecycle will help you determine whether you need to enact specific (or any) data protection models for a dataset. It may be test database instances that can be recovered at any point from production systems; it might be randomly generated data that has no meaning outside of a very specific use case, or it might be transient data merely flowing from one location to another that does not need to be captured and stored.

Remember the lesson from data lifecycle – every 1TB eliminated from primary storage can eliminate 10-30TB of data from protection storage. The next logical step after that is to be able to accurately answer the question, “do we even need to protect this?”

Fourth: What recovery models are required?

At this point, we’ve not talked about technology. This question gets us a little closer to working out what sort of technology we need, because once we have a fair understanding of the data we need to offer recovery services for, we can start thinking about what types of recovery models will be required.

This will essentially involve determining how recoveries are done for the data, such as:

  • Full or image level recoveries?
  • Granular recoveries?
  • Point in time recoveries?

Some data may not need every type of recovery model deployed for it. For some data, granular recoverability is just as important as complete recoverability; for other types of data, it could be that the only way to recover it is image/full – where granular recoveries would simply leave the data corrupted or useless. Does all data require point in time recovery? Much will, but some may not.

You should also consider how much users will be involved in recoveries. Self-service for admins? Self-service for end-users? All operator-run? Chances are it’ll be a mix, depending on those previous recovery model questions (e.g., you might allow self-service individual email recovery, but full Exchange recovery is not going to be an end-user initiated task).
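
Before any technology discussion, it can help to capture those decisions as a simple requirements matrix per dataset. A minimal sketch – the dataset names and choices are hypothetical examples only:

    # Hypothetical recovery-model matrix captured during planning.
    recovery_requirements = {
        "exchange-mailboxes": {
            "image_level": True,     # full recovery of the mail service
            "granular": True,        # individual mailbox/email recovery
            "point_in_time": True,
            "self_service": "end-users for granular only; full recovery is operator-run",
        },
        "corporate-tax-db": {
            "image_level": True,
            "granular": False,       # granular restores would leave the database inconsistent
            "point_in_time": True,
            "self_service": "none - operator-run",
        },
        "test-db-instances": {
            "image_level": False,    # rebuilt on demand from production copies
            "granular": False,
            "point_in_time": False,
            "self_service": "n/a - not protected",
        },
    }

    for dataset, req in recovery_requirements.items():
        models = [m for m in ("image_level", "granular", "point_in_time") if req[m]]
        print(f"{dataset}: models={models or ['none']}, self-service={req['self_service']}")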

Fifth: What SLOs/SLAs are required?

Regardless of whether your business has Service Level Objectives (SLOs) or Service Level Agreements (SLAs), there’ll be the potential you have to meet a variety of them depending on the nature of the failure, the criticality and age of the data, and so on. (For the rest of this section, I’ll use ‘SLA’ as a generic term for both SLA and SLO). In fact, there’ll be up to three different categories of SLAs you have to meet:

  • Online: These types of SLAs are for immediate or near-immediate recoverability from failure; they’re meant to keep the data online rather than having to seek to retrieve it from a copy. This will cover options such as continuous replication (e.g., fully mirrored storage arrays), continuous data protection (CDP), as well as more regular replication and snapshot options.
  • Nearline: This is where backup and recovery, archive, and long term retention (e.g., compliance retention of backups/archives) comes into play. Systems in this area are designed to retrieve the data from a copy (or in the case of archive, a tiered, alternate platform) when required, as opposed to ensuring the original copy remains continuously, or near to continuously available.
  • Disaster: These are your “the chips are down” SLAs, which’ll fall into business continuity and/or isolated recovery. Particularly in the event of business continuity, they may overlap with either online or nearline SLAs – but they can also diverge quite a lot. (For instance, in a business continuity situation, data and systems for ‘tier 3’ and ‘tier 4’ services, which may otherwise require a particular level of online or nearline recoverability during normal operations, might be disregarded entirely until full service levels are restored.)

Not all data may require all three of the above, and even if it does, unless you’re in a hyperconverged or converged environment it’s quite possible that, as a backup administrator, you only need to consider some of the above, with other aspects being undertaken by storage teams, etc.
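
As a sketch of how those categories might be recorded against service tiers – the RTO/RPO figures here are entirely assumed, purely to show the shape of the exercise:

    from dataclasses import dataclass

    @dataclass
    class SLA:
        category: str     # "online", "nearline" or "disaster"
        rto_hours: float  # recovery time objective
        rpo_hours: float  # recovery point objective

    # Illustrative tiers only - the figures are assumptions, not recommendations.
    service_slas = {
        "tier-1 (core database, email)": [
            SLA("online", rto_hours=0.25, rpo_hours=0.0),  # CDP / mirrored storage
            SLA("nearline", rto_hours=4, rpo_hours=24),    # backup and recovery
            SLA("disaster", rto_hours=24, rpo_hours=24),   # business continuity / isolated recovery
        ],
        "tier-4 (test/dev)": [
            SLA("nearline", rto_hours=72, rpo_hours=168),  # weekly protection only
            # no online tier; explicitly deferred during a disaster
        ],
    }

    for service, slas in service_slas.items():
        for sla in slas:
            print(f"{service}: {sla.category} RTO={sla.rto_hours}h RPO={sla.rpo_hours}h")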

Now you can plan the recovery service (and conclusion)

And because you’ve gathered the answers to the above, planning and implementing the recovery service is now the easy bit! Trust me on this – working out what a recovery service should look like for the business when you’ve gathered the above information is a fraction of the effort compared to when you haven’t. Again: “Measure twice, cut once.”

If you want more in-depth information on the above, check out chapters in my book such as “Contextualizing Data Protection”, “Data Life Cycle”, “Business Continuity”, and “Data Discovery” – not to mention the specific chapters on protection methods such as backup and recovery, replication, snapshots, continuous data protection, etc.

Data protection lessons from a tomato

Jul 25, 2013
 


Data protection lessons from a tomato? Have I gone mad?

Bear with me.

DIKW Model

If you’ve done any ITIL training, the above diagram will look familiar to you. Rather unimaginatively, it’s called the DIKW model:

Data > Information > Knowledge > Wisdom

A simple, practical example of what this diagram/model means is the following:

  • Data – Something is red, and round.
  • Information – It’s a tomato.
  • Knowledge – Tomato is a fruit.
  • Wisdom – You don’t put tomato in a fruit salad.

That’s about as complex as DIKW gets. However, being a rather simple concept, it means it can be used in quite a few areas.

When it comes to data protection, the model’s relevance is obvious: the criticality of the data to business wisdom will have a direct impact on the level of protection you need to apply to it.

In this case, I’m expanding the definition of wisdom a little. According to my Apple dashboard dictionary, wisdom is:

the quality of having experience, knowledge, and good judgement; the quality of being wise

Further, we can talk about wisdom in terms of accumulated experience:

the body of knowledge and experience that develops within a specified society or period.

So corporate wisdom is about having the experience and knowledge required to act with good judgement, and represents the sum of the knowledge and experience a corporation has built up over time.

If you think about wisdom in terms of corporate wisdom, then you’ll understand my point. For instance, a key database for a company – or the email system – represents a tangible chunk of corporate wisdom. Core fileservers will also be pretty far up the scale. It’s unlikely, on the other hand (in a business with appropriate storage policies) that the files on a regular end-user’s desktop or laptop will go much beyond information in the DIKW scale.

Of course, there are always exceptions. I’ll get to that in a moment.

What this comes back to pretty quickly is the need for Information Lifecycle Protection. End users and the business overall are typically not interested in data – they’re interested in information. They don’t care, as such, about the backup of /u01/app/oracle/data/CORPTAX/data01.dbf – they care about the corporate tax database. That, of course, means that the IT group and the business need to build service level agreements around business functions, not servers and storage. As ITIL teaches, the agreements about networks, storage, servers, etc., come in the form of operational level agreements between the segments of IT.

Ironically, it’s something I covered in my book years before studying ITIL, in the notion of establishing system dependency maps:

System Maps

(In the diagram, the number in parentheses beside a server or function is its reference number; D:X means that it depends on the nominated referenced server/function X.)
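
A system dependency map doesn’t have to stay in a diagram, either – it can be captured as data and queried to work out recovery ordering. A minimal sketch, with hypothetical server/function names:

    # Hypothetical system dependency map: each service lists what it depends on.
    # Assumes there are no circular dependencies.
    dependencies = {
        "corporate-tax-app": ["corporate-tax-db", "auth-service"],
        "corporate-tax-db": ["san-storage"],
        "auth-service": ["directory-server"],
        "directory-server": [],
        "san-storage": [],
    }

    def recovery_order(service, seen=None):
        """Return the dependencies-first order in which systems must be recovered."""
        if seen is None:
            seen = []
        for dep in dependencies.get(service, []):
            recovery_order(dep, seen)
        if service not in seen:
            seen.append(service)
        return seen

    print(recovery_order("corporate-tax-app"))
    # ['san-storage', 'corporate-tax-db', 'directory-server', 'auth-service', 'corporate-tax-app']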

What all this boils down to is the criticality of one particular activity when preparing an Information Lifecycle Protection system within an organisation: Data classification. (That of course is where you should catch any of those exceptions I was talking about before.)

In order to properly back something up with the appropriate level of protection and urgency, you need to know what it is.

Or, as Stephen Manley said the other day:

OH at Starbucks – 3 page essay ending with ‘I now have 5 pairs of pants, not 2. That’s 3 more.’ Some data may not need to be protected.

Some data may not need to be protected. Couldn’t have said it better myself. Of course, I do also say that it’s better to backup a little bit too much data than not enough, but that’s not something you should see as carte blanche to just backup everything in your environment at all times, regardless of what it is.

The thing about data classification is that most companies attempt it without first finding all their data. The first step – possibly the hardest step – is becoming aware of the data distribution within the enterprise. If you want to skip reading the post linked to in the previous sentence, here’s the key information from it:

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote [a] previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data which is subsequently classified out of the grey data state into a state where the data is known to have value, but is neither managed nor protected, because it can be recreated. It could be that the decision is made that the cost (in time) of recreating the data is lower than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but rather all the “bits” (no pun intended) that are left over, being neither data, grey data nor utility data. In essence, this is irrelevant data, which someone or some group may be keeping for unnecessary reasons, and which in actual fact should be considered eligible for either deletion or archival and deletion.

Once you’ve found your data, you can classify it. What’s structured and unstructured? What’s the criticality of the data? (I.e., what level of business wisdom does it relate to?)
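
If it helps to see those discovery states side by side, here’s a toy sketch of tagging discovered data – the paths and classifications are hypothetical:

    from enum import Enum

    class DiscoveryState(Enum):
        DATA = "known, managed and protected"
        DARK = "not yet accounted for"
        GREY = "discovered, but no ILM/ILP decision made yet"
        UTILITY = "valuable but recreatable - not managed or protected"
        NOISE = "irrelevant - candidate for deletion (or archive then deletion)"

    # Hypothetical discovery results.
    inventory = {
        "/u01/app/oracle/data/CORPTAX/data01.dbf": DiscoveryState.DATA,
        r"\\desktop-113\c$\old_projects": DiscoveryState.GREY,
        "/scratch/simulation-output": DiscoveryState.UTILITY,
        "/fileserver/tmp/scanned-faxes-2006": DiscoveryState.NOISE,
    }

    for path, state in inventory.items():
        print(f"{path}: {state.name} - {state.value}")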

But even then, you’re not quite ready to determine what your information lifecycle protection policy will be for the data – well, not until you have a data lifecycle policy, which at its simplest, looks something like this:

Data Lifecycle

 

Of course, there’s a lot of time and a lot of decisions bunched up in that diagram, but the lifecycle of data within an organisation is actually that simple at the conceptual level. Or rather, it should be. If you want to read more about data lifecycle, click here for the intro piece – there’s several accompanying pieces listed at the bottom of the article.

When considered from a backup perspective, the end goal of a data lifecycle policy though is simple:

Backup only that which needs to be backed up.

If data can be deleted, delete it.

If data can be archived, archive it.

The logical implication of course is that if you can’t classify it, if you can’t determine its criticality, then the core backup mantra, “always better to backup a little bit more than not enough”, takes precedence, and you should be working out how to back it up. Obviously, as a fallback rule, it works, but it’s best to design your overall environment and data policies to avoid relying on it.

So to summarise:

  1. Following the DIKW model, the closer data is to representing corporate wisdom, the more critical its information lifecycle protection requirements will be.
  2. In order to determine that criticality you first have to find the data within your environment.
  3. Once you’ve found the data in your environment, you have to classify it.
  4. Once you’ve classified it, you can build a data lifecycle policy for it.
  5. And then you can configure the appropriate information lifecycle protection for it.

If you think back to EMC’s work towards mitigating the effects of accidental architectures, you’ll see where I was coming from in talking about the importance of procedural change to arrest further accidental architectures. It’s a classic ER technique – identify, triage and heal.

And we can learn all this from a tomato, sliced and salted with the DIKW model.

Jan 14, 2011
 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in the data lifecycle management series I’ve covered the earlier stages of the lifecycle; now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abaddon was at a University that instituted filesystem quotas, but due to their interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory, email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly mail storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-committing storage in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abaddon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that is deleted but known to be required later builds to the point where it’s not physically possible to have all of it residing on primary storage any longer, should it be required. All we do is create a new headache while implementing, at best, a crude workaround.

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart way to avoid archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive 🙂

Jan 13, 2011
 

This is the third post in the four part series, “Data lifecycle management”. The series started with “A basic lifecycle“, and continued with “The importance of being archived (and deleted)“. (An aside, “Stub vs Process Archive” is nominally part of the series.)

Legend has it that the Greek king Sisyphus was a crafty old bloke who managed to elude death several times through all manner of tricks – including chaining up Death when he came to visit.

As punishment, when Sisyphus finally died, he was sent to Hades, where he was given an eternal punishment of trying to roll a rock up over a hill. Only the rock was too heavy (probably thanks to a little hellish mystical magic), and every time he got to the top of the hill, the rock would fall, forcing him to start again.

Homer in the Odyssey described the fate of Sisyphus thusly:

“And I saw Sisyphus at his endless task raising his prodigious stone with both his hands. With hands and feet he tried to roll it up to the top of the hill, but always, just before he could roll it over on to the other side, its weight would be too much for him, and the pitiless stone would come thundering down again on to the plain.”

Companies that don’t delete unnecessary, stagnant data share the same fate as Sisyphus. When you think about it, the parallels are actually quite strong. They set themselves an impossible daily task – to keep all data generated by the company. This ignores the obvious truth that data sizes have exploded and will continue to grow. It also ignores the equally obvious truth that some data doesn’t need to be remembered for all time.

A company that consigns itself to the fate of Sisyphus will typically be a heavy investor in archive technology. So we come to the third post in the data lifecycle management series – the challenge of only archiving/never deleting data.

The common answer again to this is that “storage is cheap”, but there’s nothing cheap about paying to store data that you don’t need. There’s a basic, common logic to use here – what do you personally keep, and what do you personally throw away? Do you keep every letter you’ve ever received, every newspaper you’ve ever read, every book you’ve ever bought, every item of clothing you’ve ever worn, etc.?

The answer (for the vast majority of people) is no: there’s a useful lifespan of an item, and once that useful lifespan has elapsed, we have to make a decision on whether to keep it or not. I mentioned my own personal experience when I introduced the data lifecycle thread; preparing to move interstate I have to evaluate everything I own and decide whether I need to keep it or ditch it. Similarly, when I moved from Google Mail to MobileMe mail, I finally stopped to think about all the email I’d been storing over the years. Old Uni emails (I finished Uni in 1995/graduated in 1996), trivial email about times for movies, etc. Deleting all the email I’d needlessly kept because “storage is cheap” saved me almost 10GB of storage.

Saying “storage is cheap” is like closing your eyes and hoping the freight train barrelling towards you is an optical illusion. In the end, it’s just going to hurt.

This is not, by any means, an argument that you must only delete/never archive. (Indeed, the next article in this series will be about the perils of taking that route.) However, archive must be tempered with deletion or else it becomes the stone, and the storage administrators become Sisyphus.

Consider a sample enterprise archive arrangement whereby:

  • Servers and NAS use primary storage;
  • Data is archived from NAS to single-instance WORM storage;
  • The single-instance WORM storage is replicated.

Like it or not, there is a real, tangible cost to the storage of data at each of those steps. There is, undoubtedly, some data that must be stored on primary storage, and there’s undoubtedly some data that is legitimately required and can be moved to archive storage.

Yet equally, keeping data in such an environment that is totally irrelevant – that has no ongoing purpose and no legal or fiscal retention requirement – will just cost money. If you extend that to the point of always keeping data, your company will need awfully deep pockets. Sure, some vendors will love you for wanting to keep everything forever, but in Shakespeare’s immortal words, “the truth will out”.

Mark Twomey (aka Storagezilla), an EMC employee, wrote on his blog when discussing backup, archive and deletion:

“If you don’t need to hold onto data delete it. You don’t hold onto all the mail and fliers that come through your letterbox so why would you hold on to all files that land on your storage? Deletion is as valid a data management policy as retention.”

For proper data lifecycle management, we have to be able to obey the simplest of rules: sometimes, things should be forgotten.

Jan 10, 2011
 

This is part 2 in the series, “Data Lifecycle Management“.

Penny-wise data lifecycle management refers to a situation where companies take the attitude that spending time and/or money on data lifecycle ageing is too costly. It’s the old problem – penny-wise, pound-foolish; losing sight of long-term real cost savings by focusing on avoiding short-term expenditure.

Traditional backup techniques centre around periodic full backups with incrementals and/or differentials in-between the fulls. If we evaluate a 6 week retention strategy, it’s easy to see where the majority of the backup space goes. Let’s consider weekly fulls and daily incrementals, with a 3% daily change rate and around 4TB of actual data.

  • Week 1 Full – 4TB.
  • Week 1 Day 1 Incr – 123 GB
  • Week 1 Day 2 Incr – 123 GB
  • Week 1 Day 3 Incr – 123 GB
  • Week 1 Day 4 Incr – 123 GB
  • Week 1 Day 5 Incr – 123 GB
  • Week 1 Day 6 Incr – 123 GB

Repeat that over 6 weeks, you have:

  • 6 x 4 TB of fulls – 24 TB.
  • 6 x 6 x incrs – 4.3TB.

Now, let’s assume that 30% of the data in the full backups represents stagnant data – data which is no longer being modified. It may be periodically accessed, but it’s certainly not being modified any longer. At just 30%, that’s 1.2TB of a 4TB full, or 7.2TB of the total 24TB stored in full backups across the 6 week cycle.

Now, since this is a relatively small amount of data, we’ll assume the backup speed is a sustained maximum throughput of 80MB/s. A 4TB backup at 80MB/s will take 14.56 hours to complete. On the other hand, a 2.8TB backup at 80MB/s will take 10.19 hours to complete.

On any single full backup then, not backing up the stagnant data would save 1.2TB of space and 4.37 hours of time. Over that six week cycle though, it’s a saving of 7.2 TB, and 26.22 hours of backup time. This is not insubstantial.
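
Those figures are easy to reproduce. A quick sketch of the arithmetic, using the same assumptions as above (weekly fulls, daily incrementals, a flat 3% daily change rate and 80MB/s sustained throughput):

    GB_PER_TB = 1024
    MB_PER_GB = 1024

    full_gb = 4 * GB_PER_TB            # 4TB full backup
    incr_gb = full_gb * 0.03           # 3% daily change rate, ~123GB per incremental
    weeks, incrs_per_week = 6, 6

    total_fulls_gb = weeks * full_gb                   # 24TB of fulls
    total_incrs_gb = weeks * incrs_per_week * incr_gb  # ~4.3TB of incrementals

    stagnant_per_full_gb = full_gb * 0.30              # 1.2TB of stagnant data per full

    def backup_hours(size_gb, mb_per_sec=80):
        return size_gb * MB_PER_GB / mb_per_sec / 3600

    saving_per_full_h = backup_hours(full_gb) - backup_hours(full_gb - stagnant_per_full_gb)
    print(f"Fulls: {total_fulls_gb / GB_PER_TB:.1f}TB, incrementals: {total_incrs_gb / GB_PER_TB:.1f}TB")
    print(f"Full backup window: {backup_hours(full_gb):.2f}h, "
          f"without stagnant data: {backup_hours(full_gb - stagnant_per_full_gb):.2f}h")
    print(f"Across the 6 week cycle: {weeks * stagnant_per_full_gb / GB_PER_TB:.1f}TB "
          f"and {weeks * saving_per_full_h:.2f}h saved")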

There are two ways we can deal with the stagnant data:

  • Delete it or
  • Archive it

Contrary to popular opinion, before we look at archiving data, we actually should evaluate what can be deleted. That is – totally irrelevant data should not be archived. What data is relevant for archiving and what data is irrelevant will be a site-by-site decision. Some examples you might want to consider would include:

  • Temporary files;
  • Installers for applications whose data is past long-term and archive retention;
  • Installers for operating systems whose required applications (and associated data) are past long-term archive;
  • Personal correspondence that’s “crept into” a system;
  • Unnecessary correspondence (e.g., scanned faxes confirming purchase orders for stationery from 5 years ago).

The notion of deleting stagnant, irrelevant data may seem controversial to some, but only because of the “storage is cheap” notion. When companies paid significant amounts of money for physical document management, with that physical occupied space costing real money (rather than just being a facet in the IT budget), deleting was most certainly a standard business practice.

While data deletion is controversial in many companies, consideration of archive can also cause challenges. The core problem with archive is that when evaluated from the perspective of a bunch of individual fileservers, it doesn’t necessarily seem like a lot of space saving. A few hundred GB here, maybe a TB there, with the savings largely dependent on the size of each fileserver and age of the data on it.

Therefore, when we start talking to businesses about archive, we often start talking about fileserver consolidation – either to fewer traditional OS fileservers, or to NAS units. At this point, a common reason to balk is the perceived cost of such consolidation – so we end up with the perceptions that:

  • Deleting is “fiddly” or “risky”, and
  • Archive is expensive.

Either way, it effectively comes down to a perceived cost, whether that’s a literal capital investment or time taken by staff.

Yet we can still talk about this from a cost perspective and show savings for eliminating stagnant data from the backup cycle. To do so we need to talk about human resources – the hidden cost of backing up data.

You see, your backup administrators and backup operators cost your company money. Of course, they draw a salary regardless of what they’re doing, but you ultimately want them to be working on activities of maximum importance. Yes, keeping the backup system running by feeding it media is important, but a backup system is there to provide recoveries, and if your recovery queue has more items in it than the number of staff you have allocated to backup operations, it’s too long.

To calculate the human cost of backing up stagnant data, we have to start categorising the activities that backup administrators do. Let’s assume (based on the above small amounts of data), that it’s a one-stop shop where the backup administrator is also the backup operator. That’s fairly common in a lot of situations anyway. We’ll designate the following categories of tasks:

  • Platinum – Recovery operations.
  • Gold – Configuration and interoperability operations.
  • Silver – Backup operations.
  • Bronze – Media management operations.

About the only thing that’s debatable there is the order in which configuration/interoperability and backup operations should be ranked. My personal preference is the above, for the simple reason that backup operations should be self-managing once configured, but periodic configuration adjustments will be required, as will ongoing consideration of interoperability requirements with the rest of the environment.

What is not debatable is that recovery operations should always be seen to be the highest priority activity within a backup system, and media management should be considered the lowest priority activity. That’s not to say that media management is unimportant, it’s just that people should be doing more important things than acting as protein based autoloaders.

The task categorisation allows us to rank the efficiency and cost-effectiveness of the work done by a backup administrator. I’d propose the following rankings:

  • Platinum – 100% efficiency, salary-weight of 1.
  • Gold – 90% efficiency, salary-weight of 1.25.
  • Silver – 75% efficiency, salary-weight of 1.5.
  • Bronze – 50% efficiency, salary-weight of 3.

What this allows us to do is calculate the “cost” (in terms of effectiveness, and impact on other potential activities) of the backup administrator spending time on the various tasks within the environment. So, this means:

  • Platinum activities represent maximised efficiency of job function, and should not incur a cost.
  • Gold activities represent reasonably efficient activities that only incur a small cost.
  • Silver activities are still mostly efficient, with a slightly increased cost.
  • Bronze activities are at best a 50/50 split between being inefficient or efficient, and have a much higher cost.

So, if a backup administrator is being paid $30 per hour, and does 1 hour each of the above tasks, we can assign hidden/human resource costs as follows:

  • Platinum – $30 per hour.
  • Gold – 1.1 * 1.25 * $30 = $41.25 per hour.
  • Silver – 1.25 * 1.5 * $30 = $56.25 per hour.
  • Bronze – 1.5 * 3 * $30 = $135 per hour.

Some might argue that the above is not a “literal” cost, and sure, you don’t pay a backup administrator $30 for recoveries and $135 for media management. However, what I’m trying to convey is that not all activities performed by a backup administrator are created equal. Some represent best bang for buck, while others progressively represent less palatable activities for the backup administrator (and for the company to pay the backup administrator to do).

You might consider it thusly – if a backup administrator can’t work on a platinum task because a bronze task is “taking priority”, then that’s the penalty – $105 per hour of the person’s time. Of course though, that’s just the penalty for paying the person to do a less important activity. Additional penalties come into play when we consider that other people may not be able to complete work because they can’t get access to the data they need, etc. (E.g., consider the cost of a situation where 3 people can’t work because they need data to be recovered, but the backup administrator is currently swapping media in the tape library to ensure the weekend’s backups run…)

Once we know the penalty though, we can start to factor in additional costs of having a sub-optimal environment. Assume for instance, a backup administrator spends 1 hour on media management tasks per TB backed up per week. If 1.2TB of data doesn’t need to be backed up each week, that’s 1.2 hours of wasted activity by the backup administrator. With a $105 per hour penalty, that’s $126 per week wasted, or over $6,552 per year.
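
Here’s a small sketch of that penalty calculation, using the figures above (remembering that the efficiency and salary-weight factors are illustrative rankings, not a formal costing model):

    BASE_RATE = 30.0  # dollars per hour

    # (efficiency multiplier, salary weight) - the illustrative factors from above.
    activity_factors = {
        "platinum (recovery)":        (1.0, 1.0),
        "gold (config/interop)":      (1.1, 1.25),
        "silver (backup operations)": (1.25, 1.5),
        "bronze (media management)":  (1.5, 3.0),
    }

    hourly_cost = {name: eff * weight * BASE_RATE
                   for name, (eff, weight) in activity_factors.items()}
    for name, cost in hourly_cost.items():
        print(f"{name}: ${cost:.2f}/hour")

    # Penalty of doing bronze work instead of platinum work:
    penalty = hourly_cost["bronze (media management)"] - hourly_cost["platinum (recovery)"]
    wasted_hours_per_week = 1.2  # 1 hour per TB per week x 1.2TB of unnecessary backup
    print(f"Penalty: ${penalty:.2f}/hour; "
          f"yearly waste: ${penalty * wasted_hours_per_week * 52:,.2f}")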

So far then, we have the following costs of not deleting/archiving:

  • Impact on backup window;
  • Impact on media usage requirements (i.e., what you’re backing up to);
  • Immediate penalty of excessive media management by backup administrator;
  • Potential penalty of backup administrator managing media instead of higher priority tasks.

The ironic thing is that deleting and archiving are things that smaller businesses seem to get better than larger businesses. For smaller, workgroup-style businesses, where there’s no dedicated IT staff, the people who do handle the backups don’t have the luxury of tape changers, large capacity disk backup or cloud (ha!) – every GB of backup space has to be carefully apportioned, and therefore the notion of data deletion and archive is well entrenched. Yearly projects are closed off, multiple duplicates are written, but then those chunks of data are removed from the backup pool.

When we start evaluating the real cost, in terms of time and money, of continually backing up stagnant data, the reasons against deleting or archiving data seem far less compelling. Ultimately, for safe and healthy IT operations, the entire data lifecycle must be followed.

In the next posts, we’ll consider the risks and challenges created by only archiving, or only deleting.

Jan 4, 2011
 

I’m going to run a few posts about overall data management, and central to the notion of data management is the data lifecycle. While this is a relatively simple concept, it’s one that a lot of businesses actually lose sight of.

Here’s the lifecycle of data, expressed as plainly as possible:

Data Lifecycle

Data, once created, is used for a specific period of time (the length will depend on the purpose of the data, and is not necessary for consideration in this discussion), and once primary usage is done, the future of the data must be considered.

Once the primary use for data is complete, there are two potential options for it – and the order of those options is important:

  • The data is deleted; or
  • The data is archived.

Last year my partner and I decided that it was time to uproot and move cities. Not just a small move, but to go from Gosford to Melbourne. That’s around a 1000km relocation, scheduled for June 2011, and with it comes some big decisions. You see, we’ve had 7 years where we’re currently living, and having been together for 14 years so far, we’ve accumulated a lot of stuff. I inherited strong hoarder tendencies from my father, and Darren has certainly had some strong hoarding tendencies himself in the past. Up until now, storage has been cheap (sound familiar?), but that’s no longer the case – we’ll be renting in Melbourne, and the removalists will charge us by the cubic metre, so all those belongings need to be evaluated. Do we still use them? If not, what do we do with them?

Taking the decision that we’d commence a major purge of material possessions led me to the next unpleasant realisation: I’m a data-hoarder too. Give me a choice between keeping data and deleting it, or even archiving it, and I’d always keep it. However, having decided at the start of the year to transition from Google Mail to MobileMe, I started to look at all the email I’d kept over the years. Storage is cheap, you know. But that mentality led to me accumulating over 10GB of email, going back to 1992. For what purpose? Why did I still need emails about University assignments? Why did I still need emails about price inquiries on PC133 RAM for a SunBlade 100? Why did I still need … well, you get the picture.

In short, I’ve realised that I’ve been failing data management #101 at a personal level, keeping everything I ever created or received in primary storage rather than seriously evaluating it based on the following criteria:

  • Am I still accessing this regularly?
  • Do I have a financial or legal reason to keep the data?
  • Do I have a sufficient emotional reason to keep the data?
  • Do I need to archive the data, or can it be deleted?

The third question is not the sort that a business should be evaluating on, but the other reasons are the same for any enterprise, of any size, as they were for me.

The net result, when I looked at those considerations was that I transferred around 1GB of email into MobileMe. I archived less than 500MB of email, and then I deleted the rest. That’s right – I, a professional data hoarder, did the unthinkable and deleted all those emails about university assignments, PC133 RAM price inquiries, discussions with friends about movie times for Lord of the Rings in 2001, etc.

Data hoarding is an insidious problem well entrenched in many enterprises. Since “storage is cheap” has been a defining mentality, online storage and storage management costs have skyrocketed within businesses. As a result, we’ve now got complex technologies to provide footprint minimisation (e.g., data deduplication) and single-instance archive. Neither of these options are cheap.

That’s not to say those options are wrong; but the most obvious fact is that money is spent on a daily basis within a significant number of organisations retaining or archiving data that is no longer required.

There are three key ways that businesses can fail to understand the data lifecycle process. These are:

  • Get stuck in the “Use” cycle for all data. (The “penny-wise” problem.)
  • Archive, but never delete data. (The “hoarder” problem.)
  • Delete, rather than archive data. (The “reckless” problem.)

Any of these three failures can prove significantly challenging to a business, and in upcoming articles I’ll discuss each one in more detail.

The remaining articles in the series appear above in this collection; there’s also an aside article that discusses Stub vs Process Archives.

Basics – NetWorker Data Lifecycle

Oct 26, 2009
 

Within NetWorker, data (savesets) can go through several stages in its lifecycle. Here’s a simple overview of those stages:

Basic data lifecycle

The first stage, obviously, is when data is initially being written – the “in progress” stage.

After the backup completes, data enters two stages – a browsable period and a retention period. These periods may have 100% overlap, or they may be distinctly different. For instance, the “standard” browse/retention policies chosen by NetWorker when you create a new client are:

  • Browse period – 1 month
  • Retention period – 1 year

A common mistake people make with NetWorker is to assume that the retention period starts when the browse period finishes; in actual fact, the retention and browse period start at the same time, but the browse period can finish before the retention period. So using that standard setting as an example, the saveset is browsable for the first 1 month of the 12 months that it is retained – it is not the case that the saveset is browsable for 1 month, then retained for another 12.
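
To illustrate the overlap, here’s a quick sketch of the date arithmetic (generic Python only – not a NetWorker command, and using approximate 30/365 day periods for the standard policies):

    from datetime import date, timedelta

    def browse_and_retention(backup_date, browse_days=30, retention_days=365):
        """Both periods start at backup time; the browse period simply ends earlier."""
        return (backup_date + timedelta(days=browse_days),
                backup_date + timedelta(days=retention_days))

    backup = date(2009, 10, 26)
    browsable_until, retained_until = browse_and_retention(backup)
    print(f"Backed up {backup}: browsable until {browsable_until}, retained until {retained_until}")
    # The saveset is NOT browsable for a month and then retained for a further 12 months -
    # both clocks start on the day of the backup.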

Once data is no longer within the retention period, and there are no backups that depend on it still within the retention period, data is considered to be recyclable.

When data is recyclable:

  • If it is on tape:
    • The data will remain available until the media is recycled. This will only happen once all the backups on the media are also recyclable, and either the administrator manually recycles the media or NetWorker re-uses it.
  • If it is on a disk backup unit (ADV_FILE) device:
    • The data will be erased from the disk backup unit the next time a volume clean operation is run, or nsrim is run (either as a standard overnight event by NetWorker, or manually via nsrim -X).

This isn’t the “whole picture” for data lifecycle within NetWorker, but it is a good brief overview to give you an idea of how data is managed within the environment.
