Basics: Planning A Recovery Service

Jan 30, 2018


In Data Protection: Ensuring Data Availability, I talk quite a lot about what you need to understand and plan as part of a data protection environment. I’m often reminded of the old saying from clothing and carpentry – “measure twice, cut once”. The lesson in that statement of course is that rushing into something headlong may make your work more problematic. Taking the time to properly plan what you’re doing though can in a lot of instances (and data protection is one such instance) make the entire process easier. This post isn’t meant to be a replacement for the various planning chapters in my book – but I’m sure it’ll have some useful tips regardless.

We don’t back up just as something to do; in fact, we don’t protect data just as something to do, either. We protect data both to shield our applications and services (and therefore our businesses) from failures, and to ensure we can recover it if necessary. So with that in mind, what are some essential activities in planning a recovery service?


First: Do you know what the data is?

Data classification isn’t something done during a data protection cycle. Maybe one day it will be, when AI and machine learning are sufficiently advanced; in the interim, though, it requires input from people – IT, the business, and so on. Of course, there’s nothing physically preventing you from planning and implementing a recovery service without performing data classification; I’d go so far as to suggest that an easy majority of businesses do exactly that. That doesn’t mean it’s an ideal approach, though.

Data classification is all about understanding the purpose of the data, who cares about it, how it is used, and so on. It’s a collection of seemingly innocuous yet actually highly important questions. It’s something I cover quite a bit in my book, and for the very good reason that I honestly believe a recovery service can be made simpler, cheaper and more efficient if it’s complemented by a data classification process within the organisation.

Second: Does the data need to exist?

That’s right – does it need to exist? This is another essential but oft-overlooked part of achieving a cheaper, simpler and more efficient recovery service: data lifecycle management. Every 1TB you can eliminate from your primary storage systems, for the average business at least, is going to yield anywhere between 10 and 30TB of savings in protection storage (RAID, replication, snapshots, backup and recovery, long term retention, etc.). While for some businesses that number may be smaller, for the majority of mid-sized and larger businesses, that 10-30TB saving is likely to go much, much higher – particularly as the criticality of the data increases.
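As a back-of-envelope illustration of where that multiplier comes from (the per-layer factors below are assumptions purely for the sake of the sketch, not measured values from any product):

```python
# Back-of-envelope sketch: how 1 TB of primary data multiplies across
# protection layers. The per-layer multipliers below are illustrative
# assumptions, not measurements.
def protection_overhead_tb(primary_tb, multipliers):
    """Return protection storage consumed per layer for the given primary size."""
    return {layer: primary_tb * factor for layer, factor in multipliers.items()}

multipliers = {
    "raid_mirror": 1,       # a mirrored second copy on primary storage
    "snapshots": 2,         # changed-block copies across snapshot retention
    "replication": 2,       # a replica array, with its own protection
    "backup_cycle": 7,      # e.g. several fulls plus incrementals on disk
    "long_term": 5,         # monthly/yearly compliance copies
}

overhead = protection_overhead_tb(1, multipliers)
print(sum(overhead.values()))  # 17 TB of protection storage for 1 TB primary
```

Eliminating a single TB of primary data therefore removes every downstream copy of it as well – which is where the 10-30TB figure comes from.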

Without a data lifecycle policy, bad things happen over time:

  • Keeping data becomes habitual rather than based on actual need
  • As ‘owners’ of data disappear (e.g., change roles, leave the company, etc.), reluctance to delete, prune or manage the data tends to increase
  • Apathy or intransigence towards developing a data lifecycle programme increases.

Businesses that avoid data classification and data lifecycle condemn themselves to the torment of Sisyphus – constantly trying to roll a boulder up a hill only to have it fall back down again before they get to the top. This manifests in many ways, of course, but in designing, acquiring and managing a data recovery service it usually hits the hardest.

Third: Does the data need to be protected?

I remain a firm believer that it’s always better to backup too much data than not enough. But that’s a default, catchall position rather than one which should be the blanket rule within the business. Part of data classification and data lifecycle will help you determine whether you need to enact specific (or any) data protection models for a dataset. It may be test database instances that can be recovered at any point from production systems; it might be randomly generated data that has no meaning outside of a very specific use case, or it might be transient data merely flowing from one location to another that does not need to be captured and stored.

Remember the lesson from data lifecycle – every 1TB eliminated from primary storage can eliminate 10-30TB of data from protection storage. The next logical step after that is to be able to accurately answer the question, “do we even need to protect this?”

Fourth: What recovery models are required?

At this point, we’ve not talked about technology. This question gets us a little closer to working out what sort of technology we need, because once we have a fair understanding of the data we need to offer recovery services for, we can start thinking about what types of recovery models will be required.

This will essentially involve determining how recoveries are done for the data, such as:

  • Full or image level recoveries?
  • Granular recoveries?
  • Point in time recoveries?

Some data may not need every type of recovery model deployed for it. For some data, granular recoverability is just as important as complete recoverability; for other types of data, the only viable recovery may be image/full – where a granular recovery would simply leave the data corrupted or useless. Does all data require point in time recovery? Much will, but some may not.

You should also consider how much users will be involved in recoveries. Self-service for admins? Self-service for end-users? All operator-run? Chances are it’ll be a mix, depending on those previous recovery model questions (e.g., you might allow self-service individual email recovery, but a full Exchange recovery is not going to be an end-user initiated task).
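These decisions – which recovery models apply to which data, and who may initiate them – can be captured in something as simple as a matrix. A hypothetical sketch, with all dataset names and policy values invented for illustration:

```python
# Hypothetical sketch: recording, per dataset, which recovery models
# apply and who may initiate them. All names and values are invented.
recovery_matrix = {
    "exchange-mailboxes": {
        "models": {"granular", "full", "point_in_time"},
        "self_service": {"granular": "end_user", "full": "operator"},
    },
    "vm-images": {
        "models": {"full"},  # granular recovery would be meaningless here
        "self_service": {"full": "admin"},
    },
}

def who_can_recover(dataset, model):
    """Return the least-privileged initiator allowed for this recovery model."""
    entry = recovery_matrix[dataset]
    if model not in entry["models"]:
        raise ValueError(f"{model} recovery not offered for {dataset}")
    return entry["self_service"].get(model, "operator")

print(who_can_recover("exchange-mailboxes", "granular"))  # end_user
```

The point isn’t the data structure itself, but that the answers exist in written, queryable form before any technology is selected.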

Fifth: What SLOs/SLAs are required?

Regardless of whether your business has Service Level Objectives (SLOs) or Service Level Agreements (SLAs), there’ll be the potential you have to meet a variety of them depending on the nature of the failure, the criticality and age of the data, and so on. (For the rest of this section, I’ll use ‘SLA’ as a generic term for both SLA and SLO). In fact, there’ll be up to three different categories of SLAs you have to meet:

  • Online: These types of SLAs are for immediate or near-immediate recoverability from failure; they’re meant to keep the data online rather than having to seek to retrieve it from a copy. This will cover options such as continuous replication (e.g., fully mirrored storage arrays), continuous data protection (CDP), as well as more regular replication and snapshot options.
  • Nearline: This is where backup and recovery, archive, and long term retention (e.g., compliance retention of backups/archives) comes into play. Systems in this area are designed to retrieve the data from a copy (or in the case of archive, a tiered, alternate platform) when required, as opposed to ensuring the original copy remains continuously, or near to continuously available.
  • Disaster: These are your “the chips are down” SLAs, which’ll fall into business continuity and/or isolated recovery. Particularly in the event of business continuity, they may overlap with either online or nearline SLAs – but they can also diverge quite a lot. (For instance, in a business continuity situation, data and systems for ‘tier 3’ and ‘tier 4’ services, which may otherwise require a particular level of online or nearline recoverability during normal operations, might be disregarded entirely until full service levels are restored.)

Not all data may require all three of the above, and even if data does, unless you’re in a hyperconverged or converged environment, it’s quite possible that as a backup administrator you only need to consider some of the above, with other aspects being undertaken by storage teams, etc.

Now you can plan the recovery service (and conclusion)

And because you’ve gathered the answers to the above, planning and implementing the recovery service is now the easy bit! Trust me on this – working out what a recovery service should look like for the business, once you’ve gathered the above information, takes a fraction of the effort compared to when you haven’t. Again: “Measure twice, cut once.”

If you want more in-depth information on the above, check out chapters in my book such as “Contextualizing Data Protection”, “Data Life Cycle”, “Business Continuity”, and “Data Discovery” – not to mention the specific chapters on protection methods such as backup and recovery, replication, snapshots, continuous data protection, etc.

10 Things Still Wrong with Data Protection Attitudes

Mar 7, 2012

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash-in-hand jobs, a business that doesn’t rely on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet there are improvements that can be made to data protection attitudes within many organisations, and I want to outline things that are still commonly done incorrectly in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that doesn’t get claimed on is.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.
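To make the contrast concrete, here’s a hypothetical sketch of what capturing business-aligned requirements can look like – the system names, values and structure are all invented for illustration, not drawn from any particular product:

```python
# Hypothetical sketch: capturing per-system protection requirements so
# that schedules and retention are driven by agreed SLAs rather than
# arbitrary defaults. All system names and values are invented.
from dataclasses import dataclass

@dataclass
class ProtectionSLA:
    system: str
    rto_hours: float      # Recovery Time Objective
    rpo_hours: float      # Recovery Point Objective
    retention_weeks: int  # set by business practice and legal/audit need
    backup_window: str    # agreed with system owners, not assigned arbitrarily

slas = [
    ProtectionSLA("finance-db", rto_hours=4, rpo_hours=1,
                  retention_weeks=364, backup_window="22:00-02:00"),
    ProtectionSLA("intranet-web", rto_hours=48, rpo_hours=24,
                  retention_weeks=6, backup_window="00:00-06:00"),
]

# An RPO under 24 hours can't be met by nightly backups alone:
needs_intraday = [s.system for s in slas if s.rpo_hours < 24]
print(needs_intraday)  # ['finance-db']
```

Even a table this simple forces the conversations – with system owners, with legal, with the business – that the bullet points above say are so often skipped.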

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet in practically no organisation will administrators of those specific systems get such free rein over data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

Some of the worst mistakes I’ve seen made in backup systems planning come from assuming that what is bought today for backup will be equally suitable for next year, or for a period of 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
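As a minimal sketch of why regular reassessment matters (the 25% annual growth rate here is a purely illustrative assumption), even a simple compound projection shows how quickly a point-in-time purchase falls behind:

```python
# Illustrative sketch: compound growth of primary data. The 25% annual
# growth rate is an assumption for the example; real growth will have
# spikes and troughs from projects and decommissioning.
def project_growth(start_tb, annual_rate, years):
    """Return projected primary data size per year, compounding annually."""
    return [round(start_tb * (1 + annual_rate) ** y, 1) for y in range(years + 1)]

sizes = project_growth(100, 0.25, 5)
print(sizes)  # [100.0, 125.0, 156.2, 195.3, 244.1, 305.2]
```

A backup environment sized for 100 TB today would need to protect roughly three times that within five years at such a rate – and since real growth isn’t smooth, the projection itself must be revisited regularly.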

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies, though their processes without a doubt play into disaster recovery planning – and a fairly important part of it, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – where staff would not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse, and in fact should point to a requirement quite the opposite.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also for continuance and alignment. If this isn’t the case yet, initial steps towards that shift can be achieved by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.

Jan 14, 2011

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abandon was at a university that instituted filesystem quotas but, due to its interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in it, email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abandon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of deleted data that is known to be required later builds to the point where it’s no longer physically possible to have all of it residing on primary storage should it be required. All we do is create a new headache while implementing at best a crude workaround.

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart way to avoid archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive 🙂

Jan 13, 2011

This is the third post in the four part series, “Data lifecycle management”. The series started with “A basic lifecycle”, and continued with “The importance of being archived (and deleted)”. (An aside, “Stub vs Process Archive”, is nominally part of the series.)

Legend has it that the Greek king Sisyphus was a crafty old bloke who managed to elude death several times through all manner of tricks – including chaining up Death when he came to visit.

As punishment, when Sisyphus finally died, he was sent to Hades, where he was given an eternal punishment of trying to roll a rock up over a hill. Only the rock was too heavy (probably thanks to a little hellish mystical magic), and every time he got to the top of the hill, the rock would fall, forcing him to start again.

Homer in the Odyssey described the fate of Sisyphus thusly:

“And I saw Sisyphus at his endless task raising his prodigious stone with both his hands. With hands and feet he tried to roll it up to the top of the hill, but always, just before he could roll it over on to the other side, its weight would be too much for him, and the pitiless stone would come thundering down again on to the plain.”

Companies that don’t delete unnecessary, stagnant data share the same fate as Sisyphus. When you think about it, the parallels are actually quite strong. Such companies set themselves a daily, impossible task – keeping all data generated by the company. That ignores the obvious truth that data sizes have exploded and will continue to grow. It also ignores the equally obvious truth that some data doesn’t need to be remembered for all time.

A company that consigns itself to the fate of Sisyphus will typically be a heavy investor in archive technology. So we come to the third post in the data lifecycle management series – the challenge of only archiving/never deleting data.

The common answer again to this is that “storage is cheap”, but there’s nothing cheap about paying to store data that you don’t need. There’s a basic, common logic to use here – what do you personally keep, and what do you personally throw away? Do you keep every letter you’ve ever received, every newspaper you’ve ever read, every book you’ve ever bought, every item of clothing you’ve ever worn, etc.?

The answer (for the vast majority of people) is no: there’s a useful lifespan of an item, and once that useful lifespan has elapsed, we have to make a decision on whether to keep it or not. I mentioned my own personal experience when I introduced the data lifecycle thread; preparing to move interstate I have to evaluate everything I own and decide whether I need to keep it or ditch it. Similarly, when I moved from Google Mail to MobileMe mail, I finally stopped to think about all the email I’d been storing over the years. Old Uni emails (I finished Uni in 1995/graduated in 1996), trivial email about times for movies, etc. Deleting all the email I’d needlessly kept because “storage is cheap” saved me almost 10GB of storage.

Saying “storage is cheap” is like closing your eyes and hoping the freight train barrelling towards you is an optical illusion. In the end, it’s just going to hurt.

This is not, by any means, an argument that you must only delete/never archive. (Indeed, the next article in this series will be about the perils of taking that route.) However, archive must be tempered with deletion or else it becomes the stone, and the storage administrators become Sisyphus.

Consider a sample enterprise archive arrangement whereby:

  • Servers and NAS use primary storage;
  • Data is archived from NAS to single-instance WORM storage;
  • The single-instance WORM storage is replicated.

Like it or not, there is a real, tangible cost to the storage of data at each of those steps. There is, undoubtedly, some data that must be stored on primary storage, and there’s undoubtedly some data that is legitimately required and can be moved to archive storage.

Yet equally, keeping data in such an environment that is totally irrelevant – that has no ongoing purpose and no legal or fiscal retention requirement – will just cost money. If you extend that to the point of always keeping data, your company will need awfully deep pockets. Sure, some vendors will love you for wanting to keep everything forever, but in Shakespeare’s immortal words, “the truth will out”.

Mark Twomey (aka Storagezilla), an EMC employee, wrote on his blog when discussing backup, archive and deletion:

“If you don’t need to hold onto data delete it. You don’t hold onto all the mail and fliers that come through your letterbox so why would you hold on to all files that land on your storage? Deletion is as valid a data management policy as retention.”

For proper data lifecycle management, we have to be able to obey the simplest of rules: sometimes, things should be forgotten.

Jan 11, 2011

This is an adjunct post to the current series, “Data lifecycle management”, and is intended to provide a little more information about the types of archiving that can be done.

When we literally talk about archiving (rather than tiering), there are two distinctly different processes in archival operations:

  • Stub based archive – transparent to the end user
  • Process archive – requires access changes by the end user

Stub based archive is an interesting beast. The entire notion is to effectively present a unified, unmodified view of the filesystem(s) to the end user such that data access continues as always, regardless of whether the file currently exists on primary storage, or has been archived. Conceptually, it resembles the following:

Stub based archives

With a stub-based archive system, there is no apparent difference to the end user in accessing a file, regardless of whether it still exists on primary storage or has been archived. When a file is archived, a stub with the same name and extension is left behind. The archive system sits between end-user processes and filesystem processes, and detects accesses to stubs. When a user accesses a stub, the archive process intercepts that read and returns the real file. At most, a user will notice a delay in file access, depending on the speed of the archive storage. If the user subsequently writes to the file, the stub is replaced with the new version of the file, restarting the file usage process. Backup systems, when properly integrated with stub based archive, will back up the stub rather than retrieve the entire file from archive.

Archive systems such as those described above allow for highly configurable archive policies – simple rules such as “files not accessed in 180 days will be archived”, as well as more complex rules, e.g., “Excel files not accessed in 365 days from finance users AND 180 days by management users will be archived”.
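The selection side of such a policy can be sketched in a few lines. This is purely illustrative – the function and parameter names are hypothetical, and a real archive product does far more than select candidates (stub creation, read interception, and so on):

```python
# Illustrative sketch of the policy side of stub-based archiving:
# selecting files whose last access predates a threshold, such as
# "not accessed in 180 days". Names here are hypothetical.
import os
import time

def archive_candidates(root, max_idle_days=180, suffixes=None):
    """Yield files under root whose last access is older than max_idle_days."""
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if suffixes and not name.endswith(tuple(suffixes)):
                continue
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:
                yield path

# e.g. the "Excel files not accessed in 365 days" style of rule:
# for path in archive_candidates("/shares/finance", 365, (".xls", ".xlsx")):
#     ...  # hand the file to the archive platform, leaving a stub behind
```

(Note that in practice, access-time reliability depends on mount options such as `noatime`/`relatime` – another reason real archive products hook the filesystem rather than scan it.)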

Stub based archiving is, paradoxically, best suited to large environments – paradoxically because it has the potential to introduce a new headache for backup administrators: massively dense filesystems. For more information on dense filesystems, read “In-lab review of the impact of dense filesystems”. The stub issue is something I’ve touched on previously in “HSM implications for backup”.

The other archive method is what I’d refer to as “process based archive”. This is used in a lot of smaller businesses, and centres around very simple archive policies where entire collections of data are stored in a formal hierarchy, and periodically archived – for instance:

Process archive

In this scenario, filesystems are configured and data access rules are established such that users know data will either be in location A or location B, based on a simple rule – e.g., the date of the file. In this sense, data written to primary storage is written in a structure that allows wholesale relocation of large portions of it as required. Using the example above, user data structures might be configured to be broken down by year. So rather than a single “human resources” directory on the fileserver, for instance, there would be one under a parent directory of 2010, one under a parent directory of 2009, etc. As data access becomes less common, the older year parent directories (with all their hierarchies) are either taken offline entirely or moved to slower storage – but regardless, they receive “final” multiple archive-style backups before being taken out of the backup regime entirely.
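A minimal sketch of that relocation step, with invented names and paths (a real process would take the final archive-style backups before any move, and handle permissions, open files, and so on):

```python
# Hypothetical sketch of a "process archive" relocation step: whole
# year-named hierarchies are moved off primary storage once they fall
# outside the active window. Paths and names are invented; each tree
# would receive its final archive-style backups before being moved.
import os
import shutil

def relocate_old_years(primary_root, archive_root, keep_years):
    """Move year-named top-level directories out of primary storage."""
    moved = []
    for entry in sorted(os.listdir(primary_root)):
        src = os.path.join(primary_root, entry)
        if os.path.isdir(src) and entry.isdigit() and int(entry) not in keep_years:
            shutil.move(src, os.path.join(archive_root, entry))
            moved.append(entry)
    return moved
```

The simplicity is the point: because the hierarchy encodes the rule, no per-file policy engine is needed – which is exactly why this approach suits smaller businesses.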

Irrespective of which archive process is used, the net result should be the same for backup operations – removing stagnant data from the daily backup cycle.

One thing you might want to ponder: is data storage tiering capable of fulfilling archive requirements? I would suggest at the moment that the jury is still out on this one. The primary purpose of data storage tiering is to move less frequently accessed data to slower and cheaper storage. That’s akin to archival operations, but unless it’s very closely integrated with the backup software and processes involved, it may not necessarily remove that lower-tiered data from the actual primary backup cycle. Unless the tiering integrates to that point, my personal opinion is that it is not really archive.

Jan 10, 2011

This is part 2 in the series, “Data Lifecycle Management”.

Penny-wise data lifecycle management refers to a situation where companies take the attitude that spending time and/or money on data lifecycle ageing is costly. It’s the old problem – penny-wise, pound-foolish: losing sight of long-term real cost savings by focusing on avoiding short-term expenditure.

Traditional backup techniques centre around periodic full backups with incrementals and/or differentials in-between the fulls. If we evaluate a 6 week retention strategy, it’s easy to see where the majority of the backup space goes. Let’s consider weekly fulls, daily incrementals, a 3% daily change rate, and around 4TB of actual data.

  • Week 1 Full – 4TB.
  • Week 1 Day 1 Incr – 123 GB
  • Week 1 Day 2 Incr – 123 GB
  • Week 1 Day 3 Incr – 123 GB
  • Week 1 Day 4 Incr – 123 GB
  • Week 1 Day 5 Incr – 123 GB
  • Week 1 Day 6 Incr – 123 GB

Repeat that over 6 weeks and you have:

  • 6 x 4 TB of fulls – 24 TB.
  • 36 incrementals (6 x 6) – ~4.3 TB.

Now, let’s assume that 30% of the data in the full backups represents stagnant data – data which is no longer being modified. It may be periodically accessed, but it’s certainly not being modified any longer. At just 30%, that’s 1.2TB of a 4TB full, or 7.2TB of the total 24 TB saved in full backups across the 6 week cycle.

Now, since this is a relatively small amount of data, we’ll assume the backup speed is a sustained maximum throughput of 80MB/s. A 4 TB backup at 80MB/s will take 14.56 hours to complete. On the other hand, a 2.8 TB backup at 80MB/s will take 10.19 hours to complete.

On any single full backup then, not backing up the stagnant data would save 1.2TB of space and 4.37 hours of time. Over that six week cycle though, it’s a saving of 7.2 TB, and 26.22 hours of backup time. This is not insubstantial.
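The arithmetic above can be reproduced directly (assuming 1 TB = 1024 GB, which matches the figures quoted):

```python
# Reproducing the arithmetic above: weekly 4 TB fulls with daily
# incrementals at a 3% change rate, retained for 6 weeks, backed up
# at a sustained 80 MB/s. Uses 1 TB = 1024 GB.
GB_PER_TB = 1024

full_gb = 4 * GB_PER_TB
incr_gb = full_gb * 0.03                 # ~123 GB per daily incremental

fulls_tb = 6 * 4                         # 24 TB in full backups
incrs_tb = 6 * 6 * incr_gb / GB_PER_TB   # ~4.3 TB in incrementals

stagnant_tb = 4 * 0.30                   # 1.2 TB stagnant per full
cycle_saving_tb = 6 * stagnant_tb        # 7.2 TB over the 6 week cycle

def backup_hours(tb, mb_per_s=80):
    """Hours to back up `tb` terabytes at the given sustained rate."""
    return tb * 1024 * 1024 / mb_per_s / 3600

print(round(incr_gb))                              # 123
print(round(backup_hours(4), 2))                   # 14.56 (full 4 TB)
print(round(backup_hours(2.8), 2))                 # 10.19 (stagnant data removed)
print(round(backup_hours(4) - backup_hours(2.8), 2))  # 4.37 hours saved per full
```

Multiplying that per-full saving across the six fulls in the cycle gives the roughly 26 hours of backup time quoted above.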

There are two ways we can deal with the stagnant data:

  • Delete it or
  • Archive it

Contrary to popular opinion, before we look at archiving data we should first evaluate what can be deleted – totally irrelevant data should not be archived. What data is relevant for archiving and what is irrelevant will be a site-by-site decision. Some examples you might want to consider include:

  • Temporary files;
  • Installers for applications whose data is past long-term and archive retention;
  • Installers for operating systems whose required applications (and associated data) are past long-term archive;
  • Personal correspondence that’s “crept into” a system;
  • Unnecessary correspondence (e.g., scanned faxes confirming purchase orders for stationery from 5 years ago).

The notion of deleting stagnant, irrelevant data may seem controversial to some, but only because of the “storage is cheap” notion. When companies paid significant amounts of money for physical document management, with that physical occupied space costing real money (rather than just being a facet in the IT budget), deleting was most certainly a standard business practice.

While data deletion is controversial in many companies, consideration of archive can also cause challenges. The core problem with archive is that when evaluated from the perspective of a bunch of individual fileservers, it doesn’t necessarily seem like a lot of space saving. A few hundred GB here, maybe a TB there, with the savings largely dependent on the size of each fileserver and age of the data on it.

Therefore, when we start talking to businesses about archive, we often start talking about fileserver consolidation – either to fewer traditional OS fileservers, or to NAS units. At this point, a common reason to balk is the perceived cost of such consolidation – so we end up with two perceptions:

  • Deleting is “fiddly” or “risky”, and
  • Archive is expensive.

Either way, it effectively comes down to a perceived cost, whether that’s a literal capital investment or time taken by staff.

Yet we can still talk about this from a cost perspective and show savings for eliminating stagnant data from the backup cycle. To do so we need to talk about human resources – the hidden cost of backing up data.

You see, your backup administrators and backup operators cost your company money. Of course, they draw a salary regardless of what they’re doing, but you ultimately want them to be working on activities of maximum importance. Yes, keeping the backup system running by feeding it media is important, but a backup system is there to provide recoveries, and if your recovery queue has more items in it than the number of staff you have allocated to backup operations, it’s too long.

To calculate the human cost of backing up stagnant data, we have to start categorising the activities that backup administrators do. Let’s assume (based on the above small amounts of data), that it’s a one-stop shop where the backup administrator is also the backup operator. That’s fairly common in a lot of situations anyway. We’ll designate the following categories of tasks:

  • Platinum – Recovery operations.
  • Gold – Configuration and interoperability operations.
  • Silver – Backup operations.
  • Bronze – Media management operations.

About the only thing that’s debatable there is the order in which configuration/interoperability and backup operations should be ranked. My personal preference is the above, for the simple reason that backup operations should be self-managing once configured, whereas periodic configuration adjustments will always be required, as will ongoing consideration of interoperability requirements with the rest of the environment.

What is not debatable is that recovery operations should always be seen to be the highest priority activity within a backup system, and media management should be considered the lowest priority activity. That’s not to say that media management is unimportant, it’s just that people should be doing more important things than acting as protein-based autoloaders.

The task categorisation allows us to rank the efficiency and cost-effectiveness of the work done by a backup administrator. I’d propose the following rankings:

  • Platinum – 100% efficiency, salary-weight of 1.
  • Gold – 90% efficiency, salary-weight of 1.25.
  • Silver – 75% efficiency, salary-weight of 1.5.
  • Bronze – 50% efficiency, salary-weight of 3.

What this allows us to do is calculate the “cost” (in terms of effectiveness, and impact on other potential activities) of the backup administrator spending time on the various tasks within the environment. So, this means:

  • Platinum activities represent maximised efficiency of job function, and should not incur a cost.
  • Gold activities represent reasonably efficient activities that incur only a small cost.
  • Silver activities are still mostly efficient, with a slightly increased cost.
  • Bronze activities are at best a 50/50 split between being inefficient or efficient, and have a much higher cost.

So, if a backup administrator is being paid $30 per hour, and does 1 hour each of the above tasks, we can assign hidden/human resource costs as follows:

  • Platinum – $30 per hour.
  • Gold – 1.1 * 1.25 * $30 = $41.25 per hour.
  • Silver – 1.25 * 1.5 * $30 = $56.25 per hour.
  • Bronze – 1.5 * 3 * $30 = $135 per hour.

Some might argue that the above is not a “literal” cost, and sure, you don’t pay a backup administrator $30 for recoveries and $135 for media management. However, what I’m trying to convey is that not all activities performed by a backup administrator are created equal. Some represent best bang for buck, while others progressively represent less palatable activities for the backup administrator (and for the company to pay the backup administrator to do).

You might consider it thusly – if a backup administrator can’t work on a platinum task because a bronze task is “taking priority”, then that’s the penalty – $105 per hour of the person’s time. Of course though, that’s just the penalty for paying the person to do a less important activity. Additional penalties come into play when we consider that other people may not be able to complete work because they can’t get access to the data they need, etc. (E.g., consider the cost of a situation where 3 people can’t work because they need data to be recovered, but the backup administrator is currently swapping media in the tape library to ensure the weekend’s backups run…)

Once we know the penalty though, we can start to factor in additional costs of having a sub-optimal environment. Assume, for instance, a backup administrator spends 1 hour on media management tasks per TB backed up per week. If 1.2 TB of data doesn’t need to be backed up each week, that’s 1.2 hours of wasted activity by the backup administrator. With a $105 per hour penalty, that’s $126 per week wasted, or $6,552 per year.
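The figures above can be reproduced with a short sketch. Note the per-hour multiplier isn’t stated explicitly in the text; 1 + (1 − efficiency) is my own inference, used here because it reproduces each dollar figure, and the salary and hours are the worked examples only, not real payroll data.

```python
rate = 30.0  # backup administrator's hourly rate, $

tasks = {
    # task: (efficiency, salary-weight)
    "platinum": (1.00, 1.0),   # recovery operations
    "gold":     (0.90, 1.25),  # configuration/interoperability
    "silver":   (0.75, 1.5),   # backup operations
    "bronze":   (0.50, 3.0),   # media management
}

def hourly_cost(task):
    """Hidden cost of an hour spent on a task: inefficiency loading
    times salary-weight times the base hourly rate."""
    efficiency, weight = tasks[task]
    return (1 + (1 - efficiency)) * weight * rate

# Penalty for doing bronze work instead of platinum work
penalty = hourly_cost("bronze") - hourly_cost("platinum")  # $105/hour

# 1.2 TB of stagnant data -> 1.2 wasted media-management hours per week
weekly_waste = 1.2 * penalty
annual_waste = weekly_waste * 52

print(round(hourly_cost("gold"), 2))     # 41.25
print(round(hourly_cost("bronze"), 2))   # 135.0
print(round(weekly_waste, 2))            # 126.0
print(round(annual_waste, 2))            # 6552.0
```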

So far then, we have the following costs of not deleting/archiving:

  • Impact on backup window;
  • Impact on media usage requirements (i.e., what you’re backing up to);
  • Immediate penalty of excessive media management by backup administrator;
  • Potential penalty of backup administrator managing media instead of higher priority tasks.

The ironic thing is that deleting and archiving is something that smaller businesses seem to get better than larger businesses. For smaller, workgroup style businesses, where there’s no dedicated IT staff, the people who do handle the backups don’t have the luxury of tape changers, large capacity disk backup or cloud (ha!) – every GB of backup space has to be carefully apportioned, and therefore the notion of data deletion and archive is well entrenched. Yearly projects are closed off, multiple duplicates are written, but then those chunks of data are removed from the backup pool.

When we start evaluating the real cost, in terms of time and money, of continually backing up stagnant data, the reasons against deleting or archiving data seem far less compelling. Ultimately, for safe and healthy IT operations, the entire data lifecycle must be followed.

In the next posts, we’ll consider the risks and challenges created by only archiving, or only deleting.

Jan 04 2011

I’m going to run a few posts about overall data management, and central to the notion of data management is the data lifecycle. While this is a relatively simple concept, it’s one that a lot of businesses actually lose sight of.

Here’s the lifecycle of data, expressed as plainly as possible:

Data Lifecycle

Data, once created, is used for a specific period of time (the length will depend on the purpose of the data, and is not necessary for consideration in this discussion), and once primary usage is done, the future of the data must be considered.

Once the primary use for data is complete, there are two potential options for it – and the order of those options is important:

  • The data is deleted; or
  • The data is archived.

Last year my partner and I decided that it was time to uproot and move cities. Not just a small move, but to go from Gosford to Melbourne. That’s around a 1000km relocation, scheduled for June 2011, and with it comes some big decisions. You see, we’ve had 7 years where we’re currently living, and having been together for 14 years so far, we’ve accumulated a lot of stuff. I inherited strong hoarder tendencies from my father, and Darren has certainly had some strong hoarding tendencies himself in the past. Up until now, storage has been cheap (sound familiar?), but that’s no longer the case – we’ll be renting in Melbourne, and the removalists will charge us by the cubic metre, so all those belongings need to be evaluated. Do we still use them? If not, what do we do with them?

Taking the decision that we’d commence a major purge of material possessions led me to the next unpleasant realisation: I’m a data-hoarder too. Give me a choice between keeping data and deleting it, or even archiving it, and I’d always keep it. However, having decided at the start of the year to transition from Google Mail to MobileMe, I started to look at all the email I’d kept over the years. Storage is cheap, you know. But that mentality led to me accumulating over 10GB of email, going back to 1992. For what purpose? Why did I still need emails about University assignments? Why did I still need emails about price inquiries on PC133 RAM for a SunBlade 100? Why did I still need … well, you get the picture.

In short, I’ve realised that I’ve been failing data management #101 at a personal level, keeping everything I ever created or received in primary storage rather than seriously evaluating it based on the following criteria:

  • Am I still accessing this regularly?
  • Do I have a financial or legal reason to keep the data?
  • Do I have a sufficient emotional reason to keep the data?
  • Do I need to archive the data, or can it be deleted?

The third question is not the sort a business should evaluate against, but the other criteria are the same for any enterprise, of any size, as they were for me.

The net result, when I looked at those considerations was that I transferred around 1GB of email into MobileMe. I archived less than 500MB of email, and then I deleted the rest. That’s right – I, a professional data hoarder, did the unthinkable and deleted all those emails about university assignments, PC133 RAM price inquiries, discussions with friends about movie times for Lord of the Rings in 2001, etc.

Data hoarding is an insidious problem well entrenched in many enterprises. Since “storage is cheap” has been a defining mentality, online storage and storage management costs have skyrocketed within businesses. As a result, we’ve now got complex technologies to provide footprint minimisation (e.g., data deduplication) and single-instance archive. Neither of these options is cheap.

That’s not to say those options are wrong; but the most obvious fact is that money is spent on a daily basis within a significant number of organisations retaining or archiving data that is no longer required.

There are three key ways that businesses can fail to understand the data lifecycle process. These are:

  • Get stuck in the “Use” cycle for all data. (The “penny-wise” problem.)
  • Archive, but never delete data. (The “hoarder” problem.)
  • Delete, rather than archive data. (The “reckless” problem.)

Any of these three failures can prove significantly challenging to a business, and in upcoming articles I’ll discuss each one in more detail.

The articles in the series are:

There’s also an aside article, that discusses Stub vs Process Archives.

Archive is not Backup

 Architecture, Backup theory  Comments Off on Archive is not Backup
Sep 09 2010

Periodically there’ll be a post about storage that counsels the more obvious fact that “Backup is not Archive”. Less frequently discussed, but perhaps more important, is the fact that archive is not backup. To focus on why, and how this is the case, I want to look at email archive.

If we look at a standard email archive model – say, something like SourceOne – then it can, if you squint a bit, look a little like an email backup product, but it’s not really. SourceOne can not only discover and handle archive storage for existing email when it’s installed, but can also automatically ingest email into the archive as soon as it’s received. Users can then, if they want to, retrieve email directly from the archive rather than asking for a “brick level” recovery.

But is the email archive a backup?

While the short answer is “no”, the long answer is a little more complex than you might think.

Consider the definition of a backup:

A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications, or operating systems that can be used for the purposes of recovery.

(From “Enterprise Systems Backup and Recovery: A corporate insurance policy“)

Now, if we consider an email system from the perspective of end user requests for item level recovery, then in that narrow instance, we would be forced to declare the archive to indeed be a backup. However, if the email archive system is unable to restore the entire system state of the email server – from the OS right through to the email database – then from a broader, disaster recovery and system recovery perspective, archive is not backup.

As archive systems grow in complexity and offer more rich feature sets, there’s a blurry line where some people struggle to understand why they’d backup and archive the same system(s). So we provide the litmus test:

Regardless of what the archive system allows recovery of, if it does not allow recovery of the entire system, it’s not a backup.

So in that sense, an email archive system that allows brick level recovery, but can’t facilitate reconstructing the entire email server functionality is not a backup.

Who is your backup administrator, and who is your archive administrator?

 Backup theory, General Technology, General thoughts, Policies  Comments Off on Who is your backup administrator, and who is your archive administrator?
Mar 02 2010

My boss, on his blog, has raised a pertinent question – if it’s so important, according to some vendors, that backup and archive are all achieved through the same product interface, then how many companies out there assign the role of archive administrator to the backup administrator? (Or vice versa).

I like this question; it’s kind of like the old conundrum of whether the dog wags the tail, or whether the tail wags the dog. That is, are companies that heavily push an integrated backup and archive interface:

  • Responding to the needs of IT to meet current desired business functionality, or,
  • Are they trying to drive IT in a way that perhaps doesn’t meet desired business functionality?

(Or indeed, something else entirely).

[Edit, further thoughts, 2010-03-03] I’ve been thinking more about this, and I have to say I can’t think of a single customer environment off-hand where the backup administrator is also responsible for archiving. Archiving seems to remain primarily the purview of the storage administration teams in sites that I’m aware of, so it does beg the question – how beneficial is an integrated backup and archive administration process?

[Original wrap-up] So if you’ve got any thoughts on the integration of backup and archive administration, either at the software or the human resources layer, I’d encourage you to jump across to Mike’s blog and make your voice heard.

(As a first, I’ve disabled comments on this blog posting, so as to encourage discussion to remain in one location – the source article.)

Sep 22 2009

When it comes to backup and data protection, I like to think of myself as being somewhat of a stickler for accuracy. After all, without accuracy, you don’t have specificity, and without specificity, you can’t reliably say that you have what you think you have.

So on the basis of wanting vendors to be more accurate, I really do wish vendors would stop talking about archive when they actually mean hierarchical storage management (HSM). It confuses journalists, technologists, managers and storage administrators, and (I must admit to some level of cynicism here) appears to be mainly driven by some thinking that “HSM” sounds either too scary or too complex.

HSM is neither scary nor complex – it’s just a variant of tiered storage, which is something that any site with 3+ TB of presented primary production data should be at least aware of, if not actively implementing and using. (Indeed, one might argue that HSM is the original form of tiered storage.)

By “presented primary production”, I’m referring to available-to-the-OS high speed, high cost storage presented in high performance LUN configurations. At this point, storage costs are high enough that tiered storage solutions start to make sense. (Bear in mind that 3+ TB of presented storage in such configurations may represent between 6 and 10TB of raw high speed, high cost storage. Thus, while it may not sound all that expensive initially, the disk-to-data ratio increases the cost substantially.) It should be noted that whether that tiering is done with a combination of different speeds of disks and levels of RAID, or with disk vs tape, or some combination of the two, is largely irrelevant to the notion of HSM.

Not only is HSM easy to understand and shouldn’t have any fear associated with it, the difference between HSM and archive is also equally easy to understand. It can even be explained with diagrams.

Here’s what archive looks like:

The archive process and subsequent data access

So, when we archive files, we first copy them out to archive media, then delete them from the source. Thus, if we need to access the archived data, we must read it back directly from the archive media. There is no reference left to the archived data on the filesystem, and data access must be managed independently from previous access methods.

On the other hand, here’s what the HSM process looks like:

The HSM process and subsequent data access

So when we use HSM on files, we first copy them out to HSM media, then delete (or truncate) the original file but put in its place a stub file. This stub file has the same file name as the original file, and should a user attempt to access the stub, the HSM system silently and invisibly retrieves the original file from the HSM media, providing it back to the end user. If the user saves the file back to the same source, the stub is replaced with the original+updated data; if the user doesn’t save the file, the stub is left in place.

Or if you’re looking for an even simpler distinction: archive deletes, HSM leaves a stub. If a vendor talks to you about archive, but their product leaves a stub, you can know for sure that they actually mean HSM.
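If a sketch helps, here’s a deliberately toy model of the two behaviours described above. Everything here is hypothetical – no real archive or HSM product works on Python dictionaries – but it captures the distinction: archive removes the source entirely, while HSM leaves a stub that recalls the data transparently.

```python
filesystem = {"report.doc": "full contents"}
secondary = {}  # stand-in for archive or HSM media

def archive(name):
    """Archive: copy to secondary media, then delete from the filesystem."""
    secondary[name] = filesystem.pop(name)

def hsm_migrate(name):
    """HSM: copy to secondary media, leave a stub in the filesystem."""
    secondary[name] = filesystem[name]
    filesystem[name] = ("STUB", name)

def read(name):
    """User access: a stub triggers a silent recall; archived data must
    instead be retrieved explicitly from the secondary media."""
    entry = filesystem.get(name)
    if entry is None:
        raise FileNotFoundError(f"{name}: archived; retrieve from media")
    if isinstance(entry, tuple) and entry[0] == "STUB":
        filesystem[name] = secondary[name]   # transparent recall
    return filesystem[name]

hsm_migrate("report.doc")
print(read("report.doc"))   # "full contents" – the recall is invisible to the user
```

Reading an archived file, by contrast, raises an error at the filesystem level: the data can only come back via the archive system itself.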

Honestly, these two concepts aren’t difficult, and they aren’t the same. In the never-ending quest to save bytes, you’d think vendors would appreciate that it’s cheaper to refer to HSM as HSM rather than archive – that’s a 4 byte saving every time the correct term is used!

[Edit – 2009-09-23]

OK, so it’s been pointed out by Scott Waterhouse that the official SNIA definition for archive doesn’t mention having to delete the source files, so I’ll accept that I was being stubbornly NetWorker-centric on this blog article. So I’ll accept that I’m wrong and (grudgingly yes) be prepared to refer to HSM as archive. But I won’t like it. Is that a fair compromise? 🙂

I won’t give up on ILP though!
