One of the core concepts I try to drive home in my book is that you don’t get a backup system by installing enterprise backup software.

Here’s a diagram to help explain what really goes into making a backup system:

Backup system

In short, you can have as much technology as you want, but without the rest of those pieces all you’ve got is a budget sink-hole.

If you want to understand how all these concepts fit together, you really should take the time to invest in my book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”.

 

When IDATA was beta testing NetWorker 7.6 SP1, my colleagues in New Zealand were responsible for testing the DD/Boost functionality. This, as you may have heard, allows for tighter integration between NetWorker and Data Domain systems, in much the same way that Data Domain has previously integrated with NetBackup.

I’m now doing a DD/Boost implementation, and I’ve got to say, I’m pretty impressed at the integration. At the moment, this is a standalone Data Domain 670, which we’ll be cloning out to physical tape from, so my satisfaction with the integration level has nothing to do with replication. I’ll cover that off when I implement Boost replication.

The first thing that impressed me was that under Boost, the Data Domain device types can be configured with parallelism greater than 1 without affecting the deduplication ratio. That means that a datazone won’t end up with as many devices as it would need under a normal VTL or ADV_FILE dedupe configuration, which is a nice bonus. (It’s better for licensing, too.)

The thing that really gave me a head spin though was the reporting integration. Having done some target based dedupe work before this in NetWorker, I’d been finding it frustrating that I couldn’t drill down and find out what sort of dedupe ratios clients and filesystems were getting. Boost is the answer:

Dedupe ratio

This, as you can imagine, is pretty cool reporting. Not only can you see what systems are getting great deduplication ratios, it’ll make it easy as pie to find the ones that aren’t.

Going up a level, the same applies to clients, too:

Client dedupe summary

When you can isolate clients, filesystems and data that don’t deduplicate well, you can do any or all of the following:

  • Send data direct to physical tape if necessary;
  • Send data to slower, non-deduplicating disk backup;
  • Send data to the deduplication device, but immediately clone and stage out as a priority.

I think I’m going to have a long and productive affair with Boost.

 

I’m pleased to say that IDATA NZ is also looking for new staff. Here’s the advert, for anyone interested in working out of New Zealand’s capital city:

IDATA is a leading data protection, storage, virtualisation and high availability solutions provider.  Our customer base includes many high profile companies throughout Australasia.

Due to a high demand for our services and solutions IDATA are seeking an additional senior consultant to work from our Wellington office.  You will have extensive experience in data storage and protection solutions in a predominantly enterprise environment.

To be successful in this role you will also have experience in Virtualisation, Storage, Backup, and Archiving products

Candidates with experience in the above technologies with one or more of the following technology providers: EMC, VMware, IBM, Symantec, NetApp can apply here: careers@idataresolutions.com

 

There was an interesting discussion being held on Twitter a day or so ago – the question of designing systems that are “good enough”, or systems that are “perfect”.

I used to believe in the pursuit of perfection in systems, until I realised it was trying to mould engineering concepts to reality, and was therefore often a misguided approach. I also think trying to mould cruel business concepts to reality (i.e., “make it as cheap as possible and hope for the best”) is just as misguided as a meaningless quest for perfection.

Now, undoubtedly that’ll upset a few people, so I want to lay some ground-work here. When referring to a built system architecture, or even a proposed system architecture:

  • “Good enough” should mean “within design specifications”;
  • “Good enough” should mean “within acceptable failure tolerances”.

The term “good enough” suffers from, well, a distinct lack of good press, and rightly so. If anything, what I’m suggesting is that we need to reform the usage of the term. This was perhaps best summarised in a discussion I briefly jumped into on Twitter. Phil Jaenke (aka @rootwyrm) said:

“Good enough” is rarely “good enough.” Instead, it’s “meh, I don’t feel like trying harder.”

That’s actually the core of the problem for me: “good enough” is being abused. Phil’s attitude comes, rightly so, from undoubtedly seeing a long set of examples of something termed “good enough” that actually wasn’t. It’s also somewhat reminiscent of that leaked Sun marketing video from a few years ago criticising a competitor’s systems, with a customer of the competitor repeatedly saying “But good enough is still ‘good’, right?”

I think we need to classify “good enough” solutions that currently exist into three separate categories:

  1. Those that are genuinely “good enough”.
  2. Those that are cobbled together/work so long as nothing goes wrong because the person implementing couldn’t do a better job (either out of apathy or time constraints – the reason is irrelevant).
  3. Those that are not properly specified in the first place.

There’s a lot of systems out there that are genuinely good enough. They should not be disparaged or lumped in with the other two categories described above.

In my experience, the above 3 categories share about a 1/3 split each of the entire pie. So that means there’s a lot of inadequate systems that are classified erroneously as “good enough”.

So, of the above list of three categories, only one is “good enough”. The second category should be rightly described as “not conforming to specifications”, or “incomplete”, if you want to make it short. The third category is “inadequately scoped”.

In reality, no-one should ever attempt to describe something that is incomplete/fails to conform to specifications, or something that was inadequately scoped as being “good enough”.

Good enough should not be a dirty expression.

Now, returning to something I said earlier, about a “meaningless quest for perfection”, I want to qualify that: I don’t mean that a pursuit of perfection is meaningless. However, “perfect” is a nebulous term that can, if misapplied, result in significant over-engineering, well beyond the desired or required scope. Is this bad? Well, yes, it actually is if it hampers the completion of a project, or causes a significant blow-out in costs.

Misapplied, a quest for perfection can result in a worse system or environment than “good enough” would have delivered. Take Toyota as an example. Toyota have a general rule that if you can see a way to improve efficiency by just a few percent, you should get it implemented. Other car companies won’t shift on anything unless they can see it resulting in a 15 or 20% improvement in the bottom line. Cumulatively though, Toyota builds a better environment by allowing “good enough” changes to their system.

Perfection is nice; perfection is sometimes even required, and to be admired when it can be achieved, but if “good enough” is all that is required, then “perfect” may actually be inappropriate.

Next time you see the term “good enough” being bandied around, ask yourself this: is it really “good enough”, or is it incomplete, or inadequately scoped?

 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abaddon was at a university that instituted filesystem quotas but, due to its interpretation of academic freedom, could not institute mail quotas. Unfortunately, one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory and email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abaddon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that is deleted but known to be required later builds to the point where it’s not physically possible to have all of it residing on primary storage any longer should it be required. All we do is create a new headache while implementing at best a crude workaround.
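To make that build-up concrete, here is a minimal Python sketch of the pattern. Every number in it (a 10 TB array, 1 TB of growth a month, 3 TB purged whenever the array fills) is hypothetical and chosen only for illustration, as are the function and variable names.

```python
def simulate(capacity_tb, monthly_growth_tb, purge_tb, months):
    """Track what primary storage reports versus what is actually required."""
    on_disk = 0.0          # data currently sitting on primary storage
    deleted_needed = 0.0   # required data that now exists only in backups
    history = []
    for month in range(1, months + 1):
        on_disk += monthly_growth_tb
        if on_disk >= capacity_tb:
            # "Delete when it's full": required data is removed to make room.
            on_disk -= purge_tb
            deleted_needed += purge_tb
        history.append((month, on_disk, on_disk + deleted_needed))
    return history

# Hypothetical environment, purely for illustration.
history = simulate(capacity_tb=10.0, monthly_growth_tb=1.0, purge_tb=3.0, months=18)
month, reported_tb, required_tb = history[-1]
print(f"Month {month}: array reports {reported_tb} TB, real requirement {required_tb} TB")
```

In this made-up run the array never reports being full at the end of a month, yet the real requirement ends up well beyond what the array could ever hold.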

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart way to avoid archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive :-)

 

This is the third post in the four part series, “Data lifecycle management”. The series started with “A basic lifecycle”, and continued with “The importance of being archived (and deleted)”. (An aside, “Stub vs Process Archive”, is nominally part of the series.)

Legend has it that the Greek king Sisyphus was a crafty old bloke who managed to elude death several times through all manner of tricks – including chaining up Death when he came to visit.

As punishment, when Sisyphus finally died, he was sent to Hades, where he was given an eternal punishment of trying to roll a rock up over a hill. Only the rock was too heavy (probably thanks to a little hellish mystical magic), and every time he got to the top of the hill, the rock would fall, forcing him to start again.

Homer in the Odyssey described the fate of Sisyphus thusly:

“And I saw Sisyphus at his endless task raising his prodigious stone with both his hands. With hands and feet he tried to roll it up to the top of the hill, but always, just before he could roll it over on to the other side, its weight would be too much for him, and the pitiless stone would come thundering down again on to the plain.”

Companies that don’t delete unnecessary, stagnant data share the same fate as Sisyphus. When you think about it, the parallels are actually quite strong. They set themselves an impossible daily task – to keep all data generated by the company. That ignores the obvious truth that data sizes have exploded and will continue to grow. It also ignores the obvious truth that some data doesn’t need to be remembered for all time.

A company that consigns itself to the fate of Sisyphus will typically be a heavy investor in archive technology. So we come to the third post in the data lifecycle management series – the challenge of only archiving/never deleting data.

The common answer again to this is that “storage is cheap”, but there’s nothing cheap about paying to store data that you don’t need. There’s a basic, common logic to use here – what do you personally keep, and what do you personally throw away? Do you keep every letter you’ve ever received, every newspaper you’ve ever read, every book you’ve ever bought, every item of clothing you’ve ever worn, etc.?

The answer (for the vast majority of people) is no: there’s a useful lifespan of an item, and once that useful lifespan has elapsed, we have to make a decision on whether to keep it or not. I mentioned my own personal experience when I introduced the data lifecycle thread; preparing to move interstate I have to evaluate everything I own and decide whether I need to keep it or ditch it. Similarly, when I moved from Google Mail to MobileMe mail, I finally stopped to think about all the email I’d been storing over the years. Old Uni emails (I finished Uni in 1995/graduated in 1996), trivial email about times for movies, etc. Deleting all the email I’d needlessly kept because “storage is cheap” saved me almost 10GB of storage.

Saying “storage is cheap” is like closing your eyes and hoping the freight train barrelling towards you is an optical illusion. In the end, it’s just going to hurt.

This is not, by any means, an argument that you must only delete/never archive. (Indeed, the next article in this series will be about the perils of taking that route.) However, archive must be tempered with deletion or else it becomes the stone, and the storage administrators become Sisyphus.

Consider a sample enterprise archive arrangement whereby:

  • Servers and NAS use primary storage;
  • Archive from NAS to single-instance WORM storage;
  • Replicate single-instance WORM storage.

Like it or not, there is a real, tangible cost to the storage of data at each of those steps. There is, undoubtedly, some data that must be stored on primary storage, and there’s undoubtedly some data that is legitimately required and can be moved to archive storage.

Yet equally, keeping data in such an environment that is totally irrelevant, with no ongoing purpose and no legal or fiscal reason for retention, will just cost money. If you extend that to the point of always keeping data, your company will need awfully deep pockets. Sure, some vendors will love you for wanting to keep everything forever, but in Shakespeare’s immortal words, “the truth will out”.

Mark Twomey (aka Storagezilla), an EMC employee, wrote on his blog when discussing backup, archive and deletion:

“If you don’t need to hold onto data delete it. You don’t hold onto all the mail and fliers that come through your letterbox so why would you hold on to all files that land on your storage? Deletion is as valid a data management policy as retention.”

For proper data lifecycle management, we have to be able to obey the simplest of rules: sometimes, things should be forgotten.

 

This is an adjunct post to the current series, “Data lifecycle management”, and is intended to provide a little more information about types of archiving that can be done.

When we literally talk about archiving (rather than tiering), there are two distinctly different processes in archival operations:

  • Stub based archive – transparent to the end user
  • Process archive – requires access changes by the end user

Stub based archive is an interesting beast. The entire notion is to effectively present a unified, unmodified view of the filesystem(s) to the end user such that data access continues as always, regardless of whether the file currently exists on primary storage, or has been archived. Conceptually, it resembles the following:

Stub based archives

With a stub-based archive system, there is no apparent difference to the end user in accessing a file regardless of whether it still exists on primary storage or whether it’s been archived. When a file is archived, a stub, with the same name and extension, is left behind. The archive system sits between end-user processes and filesystem processes, and detects accesses to stubs. When a user accesses a stub, the archive process intercepts that read and returns the real file. At most, a user will notice a delay in the file access, depending on the speed of the archive storage. If the user subsequently writes to the file, the stub is replaced with the new version of the file, restarting the file usage process. Backup systems, when properly integrated with stub based archive, will back up the stub, rather than retrieve the entire file from archive.

Archive systems such as those described above allow for highly configurable archive policies – simple rules such as “files not accessed in 180 days will be archived”, as well as more complex rules, e.g., “Excel files not accessed in 365 days from finance users AND 180 days by management users will be archived”.
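As a rough illustration of how such policies might be expressed, here is a minimal Python sketch. It is not how any particular archive product works: the `FileInfo` and `Rule` types, their fields and the group names are all hypothetical, and the compound Excel rule is read here as two separate per-group thresholds, which is only one possible interpretation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FileInfo:
    path: str
    owner_group: str
    idle_days: int  # days since the file was last accessed

@dataclass(frozen=True)
class Rule:
    min_idle_days: int
    suffix: Optional[str] = None       # e.g. ".xlsx"; None matches any file
    owner_group: Optional[str] = None  # None matches any group

    def matches(self, f: FileInfo) -> bool:
        if self.suffix is not None and not f.path.lower().endswith(self.suffix):
            return False
        if self.owner_group is not None and f.owner_group != self.owner_group:
            return False
        return f.idle_days >= self.min_idle_days

def should_archive(f: FileInfo, rules: list) -> bool:
    """A file is an archive candidate if any configured rule matches it."""
    return any(rule.matches(f) for rule in rules)

# Simple policy: anything not accessed in 180 days.
simple_policy = [Rule(min_idle_days=180)]

# Compound policy (assumed reading): Excel files idle for 365 days when owned
# by finance, or idle for 180 days when owned by management.
excel_policy = [
    Rule(min_idle_days=365, suffix=".xlsx", owner_group="finance"),
    Rule(min_idle_days=180, suffix=".xlsx", owner_group="management"),
]

print(should_archive(FileInfo("plan.xlsx", "management", 200), excel_policy))
```

Keeping the rules as plain data means a policy can be reviewed and changed without touching the matching logic.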

Stub based archiving is paradoxically best suited to large environments. Paradoxically because it has the potential to introduce a new headache for backup administrators: massively dense filesystems. For more information on dense filesystems, read “In-lab review of the impact of dense filesystems”. The stub issue is something I’ve touched on previously in “HSM implications for backup”.

The other archive method is what I’d refer to as “process based archive”. This is used in a lot of smaller businesses, and centres around very simple archive policies where entire collections of data are stored in a formal hierarchy, and periodically archived – for instance:

Process archive

In this scenario, filesystems are configured and data access rules are established such that users know data will either be in location A, or location B, based on a simple rule – e.g., the date of the file. In this sense, data written to primary storage is written in a structure that allows wholesale relocation of large portions of it as required. Using the example above, user data structures might be configured to be broken down by year. So rather than a single “human resources” directory on the fileserver, for instance, there would be one under a parent directory of 2010, one under a parent directory of 2009, etc. As data access becomes less common, the older year parent directories (with all their hierarchies) are either taken offline entirely or moved to slower storage – but regardless, receive “final” multiple archive style backups before being taken out of the backup regime entirely.
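The year-based layout makes the archive decision almost trivial, which a small Python sketch can show. The function name and the `keep_years` threshold are hypothetical; a real site would pick its own rule for when a year directory leaves primary storage.

```python
def split_year_dirs(year_dirs, current_year, keep_years=2):
    """Split year-named parent directories into (stay online, archive)."""
    cutoff = current_year - keep_years + 1  # oldest year that stays online
    online = sorted(d for d in year_dirs if int(d) >= cutoff)
    archive = sorted(d for d in year_dirs if int(d) < cutoff)
    return online, archive

online, archive = split_year_dirs(["2008", "2009", "2010"], current_year=2010)
print("Stay on primary storage:", online)
print("Final backups, then archive:", archive)
```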

Irrespective of which archive process is used, the net result should be the same for backup operations – removing stagnant data from the daily backup cycle.

One thing you might want to ponder: is data storage tiering capable of fulfilling archive requirements? I would suggest at the moment that the jury is still out on this one. The primary purpose of data storage tiering is to move less frequently accessed data to slower and cheaper storage. That’s akin to archival operations, but unless it’s very closely integrated with the backup software and processes involved, it may not necessarily remove that lower-tiered data from the actual primary backup cycle. Unless the tiering integrates to that point, my personal opinion is that it is not really archive.

 

This is part 2 in the series, “Data Lifecycle Management”.

Penny-wise data lifecycle management refers to a situation where companies take the attitude that spending time and/or money on data lifecycle ageing is costly. It’s the old problem – penny-wise, pound-foolish; losing sight of long-term real cost savings by focusing on avoiding short term expenditure.

Traditional backup techniques centre around periodic full backups with incrementals and/or differentials in-between the fulls. If we evaluate a 6 week retention strategy, it’s easy to see where the majority of the backup space goes. Let’s consider weekly fulls, daily incrementals, with a 3% daily change rate, and around 4TB of actual data.

  • Week 1 Full – 4TB.
  • Week 1 Day 1 Incr – 123 GB
  • Week 1 Day 2 Incr – 123 GB
  • Week 1 Day 3 Incr – 123 GB
  • Week 1 Day 4 Incr – 123 GB
  • Week 1 Day 5 Incr – 123 GB
  • Week 1 Day 6 Incr – 123 GB

Repeat that over 6 weeks, you have:

  • 6 x 4 TB of fulls – 24 TB.
  • 6 x 6 incrs of 123 GB – 4.3TB.
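For anyone who wants to plug in their own figures, the arithmetic above can be sketched in a few lines of Python. The sketch assumes binary units (1 TB = 1024 GB), which is what makes a 3% daily change on 4 TB come out at roughly 123 GB; the function and variable names are illustrative only.

```python
GB_PER_TB = 1024  # assumption: binary units, matching the 123 GB figure

def retention_footprint(data_tb, change_rate, weeks, incrs_per_week=6):
    """Return (full_tb, incr_tb) held across the whole retention window."""
    incr_gb = round(data_tb * GB_PER_TB * change_rate)  # one daily incremental
    full_tb = weeks * data_tb
    incr_tb = weeks * incrs_per_week * incr_gb / GB_PER_TB
    return full_tb, incr_tb

full_tb, incr_tb = retention_footprint(data_tb=4, change_rate=0.03, weeks=6)
print(f"Fulls: {full_tb} TB, incrementals: {incr_tb:.1f} TB")
```

The fulls account for roughly 85% of the space held, which is why stagnant data inside them matters so much.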

Now, let’s assume that 30% of the data in the full backups represents stagnant data – data which is no longer being modified. It may be periodically accessed, but it’s certainly not being modified any longer. At just 30%, that’s 1.2TB of a 4TB full, or 7.2TB of the total 24 TB saved in full backups across the 6 week cycle.

Now, since this is a relatively small amount of data, we’ll assume the backup speed is a sustained maximum throughput of 80MB/s. A 4 TB backup at 80MB/s will take 14.56 hours to complete. On the other hand, a 2.8 TB backup at 80MB/s will take 10.19 hours to complete.

On any single full backup then, not backing up the stagnant data would save 1.2TB of space and 4.37 hours of time. Over that six week cycle though, it’s a saving of 7.2 TB, and 26.22 hours of backup time. This is not insubstantial.
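The time figures can be reproduced with a short Python sketch, again assuming binary units (1 TB = 1024 x 1024 MB) and a perfectly sustained 80MB/s; the names are illustrative. Carrying the unrounded per-backup saving through six weeks gives about 26.2 hours, while the 26.22 quoted above comes from multiplying the rounded 4.37.

```python
MB_PER_TB = 1024 * 1024  # assumption: binary units

def backup_hours(size_tb, throughput_mb_s=80):
    """Hours needed to stream size_tb at a sustained throughput."""
    return size_tb * MB_PER_TB / throughput_mb_s / 3600

full_tb, stagnant_fraction, weeks = 4.0, 0.30, 6
stagnant_tb = full_tb * stagnant_fraction
saved_hours = backup_hours(full_tb) - backup_hours(full_tb - stagnant_tb)

print(f"Full backup:       {backup_hours(full_tb):.2f} hours")
print(f"Without stagnant:  {backup_hours(full_tb - stagnant_tb):.2f} hours")
print(f"Saved per full:    {stagnant_tb:.1f} TB, {saved_hours:.2f} hours")
print(f"Saved over cycle:  {stagnant_tb * weeks:.1f} TB, {saved_hours * weeks:.1f} hours")
```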

There are two ways we can deal with the stagnant data:

  • Delete it or
  • Archive it

Contrary to popular opinion, before we look at archiving data, we actually should evaluate what can be deleted. That is – totally irrelevant data should not be archived. As to what data is relevant for archiving and what data is irrelevant will be a site-by-site decision. Some examples you might want to consider would include:

  • Temporary files;
  • Installers for applications whose data is past long-term and archive retention;
  • Installers for operating systems whose required applications (and associated data) are past long-term archive;
  • Personal correspondence that’s “crept into” a system;
  • Unnecessary correspondence (e.g., scanned faxes confirming purchase orders for stationery from 5 years ago).

The notion of deleting stagnant, irrelevant data may seem controversial to some, but only because of the “storage is cheap” notion. When companies paid significant amounts of money for physical document management, with that physical occupied space costing real money (rather than just being a facet in the IT budget), deleting was most certainly a standard business practice.

While data deletion is controversial in many companies, consideration of archive can also cause challenges. The core problem with archive is that when evaluated from the perspective of a bunch of individual fileservers, it doesn’t necessarily seem like a lot of space saving. A few hundred GB here, maybe a TB there, with the savings largely dependent on the size of each fileserver and age of the data on it.

Therefore, when we start talking to businesses about archive, we often start talking about fileserver consolidation – either to fewer traditional OS fileservers, or to NAS units. At this point, a common reason to balk is the perceived cost of such consolidation – so we end up with the perception that:

  • Deleting is “fiddly” or “risky”, and
  • Archive is expensive.

Either way, it effectively comes down to a perceived cost, whether that’s a literal capital investment or time taken by staff.

Yet we can still talk about this from a cost perspective and show savings for eliminating stagnant data from the backup cycle. To do so we need to talk about human resources – the hidden cost of backing up data.

You see, your backup administrators and backup operators cost your company money. Of course, they draw a salary regardless of what they’re doing, but you ultimately want them to be working on activities of maximum importance. Yes, keeping the backup system running by feeding it media is important, but a backup system is there to provide recoveries, and if your recovery queue has more items in it than the number of staff you have allocated to backup operations, it’s too long.

To calculate the human cost of backing up stagnant data, we have to start categorising the activities that backup administrators do. Let’s assume (based on the above small amounts of data), that it’s a one-stop shop where the backup administrator is also the backup operator. That’s fairly common in a lot of situations anyway. We’ll designate the following categories of tasks:

  • Platinum – Recovery operations.
  • Gold – Configuration and interoperability operations.
  • Silver – Backup operations.
  • Bronze – Media management operations.

About the only thing that’s debatable there is the order in which configuration/interoperability and backup operations should be ranked. My personal preference is the above, for the simple reason that backup operations should be self-managing once configured, but periodic configuration adjustments will be required, as will ongoing consideration of interoperability requirements with the rest of the environment.

What is not debatable is that recovery operations should always be seen to be the highest priority activity within a backup system, and media management should be considered the lowest priority activity. That’s not to say that media management is unimportant, it’s just that people should be doing more important things than acting as protein based autoloaders.

The task categorisation allows us to rank the efficiency and cost-effectiveness of the work done by a backup administrator. I’d propose the following rankings:

  • Platinum – 100% efficiency, salary-weight of 1.
  • Gold – 90% efficiency, salary-weight of 1.25.
  • Silver – 75% efficiency, salary-weight of 1.5.
  • Bronze – 50% efficiency, salary-weight of 3.

What this allows us to do is calculate the “cost” (in terms of effectiveness, and impact on other potential activities) of the backup administrator spending time on the various tasks within the environment. So, this means:

  • Platinum activities represent maximised efficiency of job function, and should not incur a cost.
  • Gold activities represent reasonably efficient activities that only incur a small cost.
  • Silver activities are still mostly efficient, with a slightly increased cost.
  • Bronze activities are at best a 50/50 split between being inefficient or efficient, and have a much higher cost.

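So, if a backup administrator is being paid $30 per hour, and does 1 hour each of the above tasks, we can assign hidden/human resource costs as follows. Each figure is the hourly rate multiplied by the salary-weight and by 1 plus the efficiency shortfall (so 90% efficiency gives a factor of 1.1):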

  • Platinum – $30 per hour.
  • Gold – 1.1 * 1.25 * $30 = $41.25 per hour.
  • Silver – 1.25 * 1.5 * $30 = $56.25 per hour.
  • Bronze – 1.5 * 3 * $30 = $135 per hour.
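The multipliers in that list follow from the rankings given earlier: the efficiency shortfall is added to 1 (90% efficiency becomes 1.1, 50% becomes 1.5) and the result is multiplied by the salary-weight and the hourly rate. A minimal Python sketch of that calculation, with illustrative names, looks like this:

```python
TASKS = {
    # category: (efficiency, salary_weight), taken from the rankings above
    "Platinum": (1.00, 1.0),
    "Gold": (0.90, 1.25),
    "Silver": (0.75, 1.5),
    "Bronze": (0.50, 3.0),
}

def hidden_cost(hourly_rate, efficiency, salary_weight):
    """Hourly rate scaled by the salary-weight and the efficiency shortfall."""
    inefficiency = 1 + (1 - efficiency)  # 90% efficient -> factor of 1.1
    return round(inefficiency * salary_weight * hourly_rate, 2)

costs = {name: hidden_cost(30, eff, weight) for name, (eff, weight) in TASKS.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per hour")
```

Changing the hourly rate or the weights in one place makes it easy to see how sensitive the bronze figure is to the assumptions.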

Some might argue that the above is not a “literal” cost, and sure, you don’t pay a backup administrator $30 for recoveries and $135 for media management. However, what I’m trying to convey is that not all activities performed by a backup administrator are created equal. Some represent best bang for buck, while others progressively represent less palatable activities for the backup administrator (and for the company to pay the backup administrator to do).

You might consider it thusly – if a backup administrator can’t work on a platinum task because a bronze task is “taking priority”, then that’s the penalty – $105 per hour of the person’s time. Of course though, that’s just the penalty for paying the person to do a less important activity. Additional penalties come into play when we consider that other people may not be able to complete work because they can’t get access to the data they need, etc. (E.g., consider the cost of a situation where 3 people can’t work because they need data to be recovered, but the backup administrator is currently swapping media in the tape library to ensure the weekend’s backups run…)

Once we know the penalty though, we can start to factor in additional costs of having a sub-optimal environment. Assume, for instance, that a backup administrator spends 1 hour on media management tasks per TB backed up per week. If 1.2TB of data doesn’t need to be backed up each week, that’s 1.2 hours of wasted activity by the backup administrator. With a $105 per hour penalty, that’s $126 per week wasted, or $6,552 per year.
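That penalty calculation is easy to rerun for other environments. The Python sketch below uses the figures from this example; the function name and the 52-week year are my own assumptions.

```python
def wasted_media_cost(stagnant_tb, hours_per_tb, bronze_rate, platinum_rate, weeks=52):
    """Return (weekly, yearly) penalty for media handling of stagnant data."""
    penalty_per_hour = bronze_rate - platinum_rate  # $135 - $30 = $105
    weekly = stagnant_tb * hours_per_tb * penalty_per_hour
    return round(weekly, 2), round(weekly * weeks, 2)

weekly, yearly = wasted_media_cost(
    stagnant_tb=1.2, hours_per_tb=1, bronze_rate=135, platinum_rate=30
)
print(f"${weekly:,.2f} per week, ${yearly:,.2f} per year")
```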

So far then, we have the following costs of not deleting/archiving:

  • Impact on backup window;
  • Impact on media usage requirements (i.e., what you’re backing up to);
  • Immediate penalty of excessive media management by backup administrator;
  • Potential penalty of backup administrator managing media instead of higher priority tasks.

The ironic thing is that deleting and archiving is something that smaller businesses seem to get better than larger businesses. For smaller, workgroup style businesses, where there’s no dedicated IT staff, the people who do handle the backups don’t have the luxury of tape changers, large capacity disk backup or cloud (ha!) – every GB of backup space has to be carefully apportioned, and therefore the notion of data deletion and archive is well entrenched. Yearly projects are closed off, multiple duplicates are written, but then those chunks of data are removed from the backup pool.

When we start evaluating the real cost, in terms of time and money, of continually backing up stagnant data, the reasons against deleting or archiving data seem far less compelling. Ultimately, for safe and healthy IT operations, the entire data lifecycle must be followed.

In the next posts, we’ll consider the risks and challenges created by only archiving, or only deleting.

 

I’m going to run a few posts about overall data management, and central to the notion of data management is the data lifecycle. While this is a relatively simple concept, it’s one that a lot of businesses actually lose sight of.

Here’s the lifecycle of data, expressed as plainly as possible:

Data Lifecycle

Data, once created, is used for a specific period of time (the length will depend on the purpose of the data, and is not necessary for consideration in this discussion), and once primary usage is done, the future of the data must be considered.

Once the primary use for data is complete, there are two potential options for it – and the order of those options is important:

  • The data is deleted; or
  • The data is archived.

Last year my partner and I decided that it was time to uproot and move cities. Not just a small move, but to go from Gosford to Melbourne. That’s around a 1000km relocation, scheduled for June 2011, and with it comes some big decisions. You see, we’ve had 7 years where we’re currently living, and having been together for 14 years so far, we’ve accumulated a lot of stuff. I inherited strong hoarder tendencies from my father, and Darren has certainly had some strong hoarding tendencies himself in the past. Up until now, storage has been cheap (sound familiar?), but that’s no longer the case – we’ll be renting in Melbourne, and the removalists will charge us by the cubic metre, so all those belongings need to be evaluated. Do we still use them? If not, what do we do with them?

Taking the decision that we’d commence a major purge of material possessions led me to the next unpleasant realisation: I’m a data-hoarder too. Give me a choice between keeping data and deleting it, or even archiving it, and I’d always keep it. However, having decided at the start of the year to transition from Google Mail to MobileMe, I started to look at all the email I’d kept over the years. Storage is cheap, you know. But that mentality led to me accumulating over 10GB of email, going back to 1992. For what purpose? Why did I still need emails about University assignments? Why did I still need emails about price inquiries on PC133 RAM for a SunBlade 100? Why did I still need … well, you get the picture.

In short, I’ve realised that I’ve been failing data management #101 at a personal level, keeping everything I ever created or received in primary storage rather than seriously evaluating it based on the following criteria:

  • Am I still accessing this regularly?
  • Do I have a financial or legal reason to keep the data?
  • Do I have a sufficient emotional reason to keep the data?
  • Do I need to archive the data, or can it be deleted?

The third question is not the sort that a business should be evaluating on, but the other reasons are the same for any enterprise, of any size, as they were for me.

The net result, when I looked at those considerations was that I transferred around 1GB of email into MobileMe. I archived less than 500MB of email, and then I deleted the rest. That’s right – I, a professional data hoarder, did the unthinkable and deleted all those emails about university assignments, PC133 RAM price inquiries, discussions with friends about movie times for Lord of the Rings in 2001, etc.

Data hoarding is an insidious problem well entrenched in many enterprises. Since “storage is cheap” has been a defining mentality, online storage and storage management costs have skyrocketed within businesses. As a result, we’ve now got complex technologies to provide footprint minimisation (e.g., data deduplication) and single-instance archive. Neither of these options is cheap.

That’s not to say those options are wrong; but the most obvious fact is that money is spent on a daily basis within a significant number of organisations retaining or archiving data that is no longer required.

There are three key ways that businesses can fail to understand the data lifecycle process. These are:

  • Get stuck in the “Use” cycle for all data. (The “penny-wise” problem.)
  • Archive, but never delete data. (The “hoarder” problem.)
  • Delete, rather than archive data. (The “reckless” problem.)

Any of these three failures can prove significantly challenging to a business, and in upcoming articles I’ll discuss each one in more detail.

The articles in the series are:

There’s also an aside article, that discusses Stub vs Process Archives.

© 2012 The NetWorker Blog