The hard questions

Jul 31 2012

There are three hard questions that every company must be prepared to ask when it comes to data:

  1. Why do you care about your data?
  2. When do you care about your data?
  3. Who cares most about your data?

Sometimes these are not pleasant questions, and the answers may be very unpleasant. If they are, it’s time to revisit how you deal with data at your company.

Why do you care about your data?

…Do you care about your data because you’re tasked to care about it?

…Do you care about your data because you’re legally required to care about it?

…Or do you care about your data because it’s the right thing to do?

There’s no doubt that the first two reasons – being tasked with, and being legally required to, care about data – are compelling and valid. Chances are, if you’re in IT, then at some layer being tasked with data protection, or being legally required to ensure it, will play some factor in your job.

Yet neither reason is actually sufficiently compelling at all times. If everything we did in IT came down to job description or legal requirements, every job would be just as ‘glamorous’ as every other, and as many people would be eager to work in data protection as are in say, security, or application development.

Ultimately, people will care the most about data when they feel it’s the right thing to do. That is, when there’s an intrinsically felt moral obligation to care about it.

When do you care about your data?

…Do you care about your data when it is in transit within the network?

…Do you care about your data when it is at rest on your storage systems?

…Or do you care about your data when it’s been compromised?

The answer of course, should be always. At every part of the data lifecycle – at every location data can be found, it should have a custodian, and a custodian who cares because it’s the right thing to do. Yet, depressingly, we see clear examples time and time again where companies apparently only care about data when it’s been compromised.

(In this scenario, by compromise, I’m not referring solely to the classic security usage of the word, but to any situation where data is in some way lost or inappropriately modified.)

Who cares most about your data?

…Your management team?

…Your technical staff?

…Your users?

…Or external consultants?

For all intents and purposes, I’ve been an external consultant for the last 12+ years of my career. Ever since I left standard system administration behind, I’ve been working for system integrators, and as such when I walk into a business I’ve got that C-word title: consultant.

However, on several occasions over the course of my career, one thing has been abundantly, terrifyingly clear to me: I’ve cared more about the customer data than their own staff. Not all the staff, but typically more than two of the sub-groups mentioned above. This should not – this should never be the case. Now, I’m not saying I shouldn’t have to care about customer data: far from it. Anyone who calls themselves a consultant should have a deep and profound respect and care about the data of each customer he or she deals with. Yet, the users, management and technical staff at a company should always care more about their data than someone external to that customer.

Back to the hard questions

So let’s revisit those hard questions:

  1. Why do you care about your data?
  2. When do you care about your data?
  3. Who cares most about your data?

If your business has not asked those questions before, the key stakeholders may not like the answers, but I promise this: not asking them doesn’t change those answers. Until they’re answered, and addressed, the business will carry a higher level of risk than it should.

Jan 27 2012

Continuing on my post relating to dark data last week, I want to spend a little more time on data awareness classification and distribution within an enterprise environment.

Dark data isn’t the end of the story, and it’s time to introduce the entire family of data-awareness concepts. These are:

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote the previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data which is subsequently classified out of grey data state into a state where the data is known to have value, but is not either managed or protected, because it can be recreated. It could be that the decision is made that the cost (in time) of recreating the data is less expensive than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but all the “bits” (no pun intended) that are left over and are neither grey data, regular data, nor utility data. In essence, this is irrelevant data, which someone or some group may be keeping for no good reason, and which should be considered eligible either for deletion, or for archival and deletion.
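These states and the transitions between them can be sketched as a tiny state machine. This is an illustrative Python sketch; the state names follow the article, but the transition rules are my own reading of it:

```python
# Awareness states and transitions, as described above (illustrative).
# Dark data is discovered into grey; grey must be classified promptly
# into data, utility data, or noise; noise is eligible for deletion.

ALLOWED_TRANSITIONS = {
    "dark": {"grey"},                      # discovery
    "grey": {"data", "utility", "noise"},  # classification decision
    "noise": {"deleted"},                  # or archive-then-delete
}

def classify(state, new_state):
    """Move a piece of data to a new awareness state, or fail loudly."""
    if new_state not in ALLOWED_TRANSITIONS.get(state, set()):
        raise ValueError(f"invalid transition: {state} -> {new_state}")
    return new_state

print(classify("dark", "grey"))   # grey
print(classify("grey", "data"))   # data
```

Modelling the transitions explicitly makes the key rule visible: nothing goes straight from dark to managed data without passing through a deliberate classification step.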

The distribution of data by awareness within the enterprise may resemble something along the following lines:

[Figure: Data Awareness Percentage Distribution]

That is, ideally the largest percentage of data should be regular data which is known, managed and protected. In all likelihood for most organisations, the next biggest percentage of data is going to be dark data – the data that hasn’t been discovered yet. Ideally however, after regular and dark data have been removed from the distribution, there should be at most 20% of data left, and this should be broken up such that at least half of that remaining data is utility data, with the last 10% split evenly between grey data and noise.

The logical implications of this layout should be reasonably straight forward:

  1. At all times the majority of data within an organisation should be known, managed and protected.
  2. It should be expected that at least 20% of the data within an organisation is undiscovered, or decentralised.
  3. Once data is discovered, it should exist in a ‘grey’ state for a very short period of time; ideally it should be reclassified as soon as possible into data, utility data or noise. In particular, data left in a grey state for an extended period of time represents just as dangerous a potential data loss situation as dark data.
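As a rough sanity check, the target distribution can be encoded and compared against an observed one. The percentages below are illustrative assumptions derived from the guideline above, not prescribed figures:

```python
# Illustrative target distribution: data largest, dark data next,
# then at most 20% remaining, split ~10% utility, ~5% grey, ~5% noise.
TARGET = {"data": 60, "dark_data": 20, "utility_data": 10,
          "grey_data": 5, "noise": 5}

def check_distribution(observed):
    """Return warnings where an observed distribution breaches the targets."""
    warnings = []
    if observed.get("data", 0) < 50:
        warnings.append("managed data is not the majority")
    if observed.get("dark_data", 0) > TARGET["dark_data"]:
        warnings.append("too much dark data: discovery is lagging")
    if observed.get("grey_data", 0) > TARGET["grey_data"]:
        warnings.append("too much grey data: classify it promptly")
    return warnings

print(check_distribution(TARGET))  # []
print(check_distribution({"data": 40, "dark_data": 35, "grey_data": 15,
                          "utility_data": 5, "noise": 5}))
```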

It should be noted that regular data, even in this awareness classification scheme, will still be subject to regular data lifecycle decisions (archive, tiering, deletion, etc.). In that sense, primary data eligible for deletion isn’t really noise, because it has previously been managed and protected; noise really is ex-dark-data that will end up being deleted – either as an explicit decision, or due to a failure at some point after it was classified as ‘noise’ – having never been managed or protected in a centralised, coordinated manner.

Equally, utility data won’t refer to say, Q/A or test databases that replicate the content of production databases. These types of databases will again have fallen under the standard data umbrella in that there will have been information lifecycle management and protection policies established for them, regardless of what those policies actually were.

If we bring this back to roles, then it’s clear that a pivotal role of both the DPAs (Data Protection Advocates) and the IPAC (Information Protection Advisory Council) within an organisation should be the rapid coordination of classification of dark data as it is discovered into one of the data, utility data or noise states.

Jan 22 2012

Dark Data

We’ve all heard the term Big Data – it’s something the vendors have been ramming down our throats with the same level of enthusiasm as Cloud. Personally, I think Big Data is a problem that shouldn’t exist: it serves for me as a stark criticism of OS, application, storage and software companies for failing to anticipate the high end of data growth and to develop suitable mechanisms for dealing with it as part of their regular tool sets. After all, why should the end user have to ask: “Hmmm, do I have data or big data?”

Moving right along: recently another term has started to pop up, and it’s a far more interesting – and legitimate – problem.

It’s dark data.

If you haven’t heard of the term, I’m betting that you’ve either guessed the meaning or have a bit of an idea about it.

Dark data refers to all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for. Such as:

  • All those user PST files on desktops and notebooks;
  • That server a small workgroup deployed for testing purposes that’s not centrally managed or officially known about;
  • That research data an academic is storing on a 2TB USB drive connected to her laptop;
  • That offline copy of a chunk of the fileserver someone grabbed before going overseas, which has now drifted significantly from the real content of the fileserver;
  • and so on.

Dark data is a real issue within the business environment, because there’s potentially a large amount of critical information “out there” in the business but not necessarily under the control of the IT department.

You might call it decentralised data.

As we know from data protection, decentralised backups are particularly dangerous; they increase the cost of control and maintenance, they decrease the reliability of the process, and they can be a security nightmare. It’s exactly the same for dark data – in fact, worse, because by the very nature of the definition, it’s also data that’s unlikely to be backed up.

To try to control the spread of dark data, some companies institute rigorous local storage policies, but these often present bigger headaches than they’re worth. For instance, locking down user desktops so that local storage isn’t writeable isn’t always successful, and the added network load from shifting user profiles across to fileservers can be painful. Further, pushing these files across to centralised storage can make for extremely dense filesystems (or at least contribute towards them), trading one problem for another. Finally, it introduces new risk to the business, making users extremely unproductive if there are network or central storage issues.

There are a few things a business can do in relation to dark data to decrease the headaches and challenges it creates. These are acceptance, anticipation, and discovery.

  1. Acceptance – Acknowledge that dark data will find its way into the organisation. Keeping the corporate head in the sand over the existence of dark data, or blindly adhering to the (false) notion that rigorous security policies will prevent storage of data anywhere in the organisation except centrally, is foolish. Now, this doesn’t mean that you have to accept that data will become dark. Instead, acknowledging that there will be dark data out there will keep it as a known issue. What’s more, because it’s actually acknowledged by the business, it can be discussed by the business. Discussion will facilitate two key factors: keeping users aware of the dangers of dark data, and encouraging users to report dark data.
  2. Anticipation – Accepting that dark data exists is one thing; anticipating what can be done about it, and how it might be found allows a company to actually start dealing with dark data. Anticipating dark data can’t happen unless someone is responsible for it. Now, I’m not suggesting that being responsible for dark data means getting in trouble if there are issues with unprotected dark data going missing – if that were the case, not a single person in a company would want to be responsible for it. (And any person who did want to be responsible under those circumstances would likely not understand the scope of the issue.) The obvious person for this responsibility is the Data Protection Advisor. (See here and here.) You might argue that the dark data problem explicitly points out the need for one or more DPAs at every business.
  3. Discovery – No discovery process for dark data will be fully automated. There will be a level of automation that can be achieved via indexing and search engines deployed from central IT, but given dark data may be on systems which are only intermittently connected, or outside of the domain authority of IT, there will be a human element as well. This will consist of the DPA(s), end users, and team leaders, viz:
    • The DPA will be tasked with not only periodic visual inspections of his/her area of responsibility, but will also be responsible for issuing periodic reminders to staff, requesting notification of any local data storage.
    • End users should be aware (via induction, and company policies) of the need to avoid, as much as possible, the creation of data outside of the control and management of central IT. But they should equally be aware that in situations where this happens, a policy can be followed to notify IT to ensure that the data is protected or reviewed.
    • Team leaders should equally be aware of the potential for dark data creation, as per end users, but should also be tasked with liaising with IT to ensure dark data, once discovered, is appropriately classified, managed and protected. This may sometimes necessitate moving the data under IT control, but it may also at times be an acknowledgement that the data is best left local, with appropriate protection measures implemented and agreed upon.
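The automatable slice of the discovery step can be sketched as a filesystem sweep for file types that commonly live outside central management. The extension list here is an illustrative assumption, not a standard:

```python
# Hypothetical dark-data sweep: walk a directory tree and report files
# whose extensions suggest locally held data (PSTs, databases, VM images).
import os

SUSPECT_EXTENSIONS = {".pst", ".ost", ".mdb", ".sqlite", ".vmdk", ".bak"}

def find_dark_data_candidates(root):
    """Yield (path, size_in_bytes) for files matching suspect extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in SUSPECT_EXTENSIONS:
                full = os.path.join(dirpath, name)
                try:
                    yield full, os.path.getsize(full)
                except OSError:
                    pass  # intermittently connected storage may disappear

# Example: flag candidates over 100 MB for manual classification.
# for path, size in find_dark_data_candidates("/home"):
#     if size > 100 * 2**20:
#         print(f"{path}: {size / 2**30:.2f} GiB")
```

A sweep like this only covers machines central IT can actually reach; as the post notes, the human element (DPAs, end users, team leaders) has to cover the rest.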

Dark data is a real problem that will exist in practically every business; however, it doesn’t have to be a serious problem when carefully dealt with. The above three rules – acceptance, anticipation, and discovery – will ensure it stays managed.

[2012-01-27 Addendum]

There’s now a followup to this article – “Data Awareness Distribution in the Enterprise”.

Jan 13 2011

This is the third post in the four part series, “Data lifecycle management”. The series started with “A basic lifecycle”, and continued with “The importance of being archived (and deleted)”. (An aside, “Stub vs Process Archive”, is nominally part of the series.)

Legend has it that the Greek king Sisyphus was a crafty old bloke who managed to elude death several times through all manner of tricks – including chaining up Death when he came to visit.

As punishment, when Sisyphus finally died, he was sent to Hades, where he was given an eternal punishment of trying to roll a rock up over a hill. Only the rock was too heavy (probably thanks to a little hellish mystical magic), and every time he got to the top of the hill, the rock would fall, forcing him to start again.

Homer in the Odyssey described the fate of Sisyphus thusly:

“And I saw Sisyphus at his endless task raising his prodigious stone with both his hands. With hands and feet he tried to roll it up to the top of the hill, but always, just before he could roll it over on to the other side, its weight would be too much for him, and the pitiless stone would come thundering down again on to the plain.”

Companies that don’t delete unnecessary, stagnant data share the fate of Sisyphus. When you think about it, the parallels are actually quite strong: such companies set themselves an impossible daily task – keeping all the data the company has ever generated. This ignores the obvious truth that data sizes have exploded and will continue to grow, and the equally obvious truth that some data doesn’t need to be remembered for all time.

A company that consigns itself to the fate of Sisyphus will typically be a heavy investor in archive technology. So we come to the third post in the data lifecycle management series – the challenge of only archiving, and never deleting, data.

The common answer again to this is that “storage is cheap”, but there’s nothing cheap about paying to store data that you don’t need. There’s a basic, common logic to use here – what do you personally keep, and what do you personally throw away? Do you keep every letter you’ve ever received, every newspaper you’ve ever read, every book you’ve ever bought, every item of clothing you’ve ever worn, etc.?

The answer (for the vast majority of people) is no: there’s a useful lifespan of an item, and once that useful lifespan has elapsed, we have to make a decision on whether to keep it or not. I mentioned my own personal experience when I introduced the data lifecycle thread; preparing to move interstate I have to evaluate everything I own and decide whether I need to keep it or ditch it. Similarly, when I moved from Google Mail to MobileMe mail, I finally stopped to think about all the email I’d been storing over the years. Old Uni emails (I finished Uni in 1995/graduated in 1996), trivial email about times for movies, etc. Deleting all the email I’d needlessly kept because “storage is cheap” saved me almost 10GB of storage.

Saying “storage is cheap” is like closing your eyes and hoping the freight train barrelling towards you is an optical illusion. In the end, it’s just going to hurt.

This is not, by any means, an argument that you must only delete/never archive. (Indeed, the next article in this series will be about the perils of taking that route.) However, archive must be tempered with deletion or else it becomes the stone, and the storage administrators become Sisyphus.

Consider a sample enterprise archive arrangement whereby:

  • Servers and NAS use primary storage;
  • Archive from NAS to single-instance WORM storage;
  • Replicate the single-instance WORM storage.

Like it or not, there is a real, tangible cost to storing data at each of those steps. There is, undoubtedly, some data that must be stored on primary storage, and there’s undoubtedly some data that is legitimately required and can be moved to archive storage.

Yet equally, keeping data in such an environment when it is totally irrelevant – when it has no ongoing purpose and no legal or fiscal retention requirement – will just cost money. If you extend that to the point of always keeping everything, your company will need awfully deep pockets. Sure, some vendors will love you for wanting to keep everything forever, but in Shakespeare’s immortal words, “the truth will out”.

Mark Twomey (aka Storagezilla), an EMC employee wrote on his blog when discussing backup, archive and deletion:

“If you don’t need to hold onto data delete it. You don’t hold onto all the mail and fliers that come through your letterbox so why would you hold on to all files that land on your storage? Deletion is as valid a data management policy as retention.”

For proper data lifecycle management, we have to be able to obey the simplest of rules: sometimes, things should be forgotten.

Jan 10 2011

This is part 2 in the series, “Data Lifecycle Management“.

Penny-wise data lifecycle management refers to a situation where companies take the attitude that spending time and/or money on data lifecycle ageing is too costly. It’s the old problem – penny-wise, pound-foolish: losing sight of long-term real cost savings by focusing on avoiding short-term expenditure.

Traditional backup techniques centre around periodic full backups with incrementals and/or differentials in between the fulls. If we evaluate a 6 week retention strategy, it’s easy to see where the majority of the backup space goes. Let’s consider weekly fulls and daily incrementals, with a 3% daily change rate and around 4TB of actual data.

  • Week 1 Full – 4TB.
  • Week 1 Day 1 Incr – 123 GB
  • Week 1 Day 2 Incr – 123 GB
  • Week 1 Day 3 Incr – 123 GB
  • Week 1 Day 4 Incr – 123 GB
  • Week 1 Day 5 Incr – 123 GB
  • Week 1 Day 6 Incr – 123 GB

Repeat that over 6 weeks and you have:

  • 6 × 4 TB fulls – 24 TB;
  • 36 (6 × 6) incrementals – approximately 4.3 TB.

Now, let’s assume that 30% of the data in the full backups represents stagnant data – data which is no longer being modified. It may be periodically accessed, but it’s certainly not being modified any longer. At just 30%, that’s 1.2TB of a 4TB full, or 7.2TB of the total 24 TB saved in full backups across the 6 week cycle.

Now, since this is a relatively small amount of data, we’ll assume the backup speed is a sustained maximum throughput of 80MB/s. A 4 TB backup at 80MB/s will take 14.56 hours to complete. On the other hand, a 2.8 TB backup at 80MB/s will take 10.19 hours to complete.

On any single full backup then, not backing up the stagnant data would save 1.2TB of space and 4.37 hours of time. Over that six week cycle though, it’s a saving of 7.2 TB, and 26.22 hours of backup time. This is not insubstantial.
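The arithmetic above is easy to reproduce. This sketch assumes binary units (1 TB = 1024 GB), which is what the 123 GB incremental figure implies:

```python
# Six-week retention model: weekly fulls, daily incrementals at a 3%
# change rate, 80 MB/s sustained throughput, 30% stagnant data.
GB_PER_TB = 1024
full_gb = 4 * GB_PER_TB             # 4 TB full
incr_gb = full_gb * 0.03            # ~123 GB per incremental
weeks = 6

def backup_hours(size_gb, mb_per_sec=80):
    """Hours needed to stream size_gb at a sustained mb_per_sec."""
    return size_gb * 1024 / mb_per_sec / 3600

stagnant_gb = full_gb * 0.30        # 1.2 TB of each full
saved_per_full = backup_hours(full_gb) - backup_hours(full_gb - stagnant_gb)

print(f"fulls over cycle: {weeks * full_gb / GB_PER_TB:.0f} TB")       # 24 TB
print(f"incrementals: {weeks * 6 * incr_gb / GB_PER_TB:.1f} TB")       # ~4.3 TB
print(f"full window: {backup_hours(full_gb):.2f} h")                   # 14.56 h
print(f"trimmed window: {backup_hours(full_gb - stagnant_gb):.2f} h")  # 10.19 h
print(f"cycle saving: {weeks * stagnant_gb / GB_PER_TB:.1f} TB, "
      f"{weeks * saved_per_full:.1f} h")                               # 7.2 TB, ~26.2 h
```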

There are two ways we can deal with the stagnant data:

  • Delete it or
  • Archive it

Contrary to popular opinion, before we look at archiving data, we should actually evaluate what can be deleted. That is – totally irrelevant data should not be archived. What data is relevant for archiving and what data is irrelevant will be a site-by-site decision. Some examples you might want to consider include:

  • Temporary files;
  • Installers for applications whose data is past long-term and archive retention;
  • Installers for operating systems whose required applications (and associated data) are past long-term archive;
  • Personal correspondence that’s “crept into” a system;
  • Unnecessary correspondence (e.g., scanned faxes confirming purchase orders for stationery from 5 years ago).

The notion of deleting stagnant, irrelevant data may seem controversial to some, but only because of the “storage is cheap” notion. When companies paid significant amounts of money for physical document management, with that physical occupied space costing real money (rather than just being a facet in the IT budget), deleting was most certainly a standard business practice.

While data deletion is controversial in many companies, consideration of archive can also cause challenges. The core problem with archive is that when evaluated from the perspective of a bunch of individual fileservers, it doesn’t necessarily seem like a lot of space saving. A few hundred GB here, maybe a TB there, with the savings largely dependent on the size of each fileserver and age of the data on it.

Therefore, when we start talking to businesses about archive, we often start talking about fileserver consolidation – either to fewer traditional OS fileservers, or to NAS units. At this point, a common reason to balk is the perceived cost of such consolidation – so we end up with the perceptions that:

  • Deleting is “fiddly” or “risky”, and
  • Archive is expensive.

Either way, it effectively comes down to a perceived cost, whether that’s a literal capital investment or time taken by staff.

Yet we can still talk about this from a cost perspective and show savings for eliminating stagnant data from the backup cycle. To do so we need to talk about human resources – the hidden cost of backing up data.

You see, your backup administrators and backup operators cost your company money. Of course, they draw a salary regardless of what they’re doing, but you ultimately want them to be working on activities of maximum importance. Yes, keeping the backup system running by feeding it media is important, but a backup system is there to provide recoveries, and if your recovery queue has more items in it than the number of staff you have allocated to backup operations, it’s too long.

To calculate the human cost of backing up stagnant data, we have to start categorising the activities that backup administrators do. Let’s assume (based on the above small amounts of data), that it’s a one-stop shop where the backup administrator is also the backup operator. That’s fairly common in a lot of situations anyway. We’ll designate the following categories of tasks:

  • Platinum – Recovery operations.
  • Gold – Configuration and interoperability operations.
  • Silver – Backup operations.
  • Bronze – Media management operations.

About the only thing that’s debatable there is the order in which configuration/interoperability and backup operations should be ordered. My personal preference is the above, for the simple reason that backup operations should be self-managing once configured, but periodic configuration adjustments will be required, as will be ongoing consideration of interoperability requirements with the rest of the environment.

What is not debatable is that recovery operations should always be seen as the highest priority activity within a backup system, and media management should be considered the lowest priority activity. That’s not to say that media management is unimportant; it’s just that people should be doing more important things than acting as protein-based autoloaders.

The task categorisation allows us to rank the efficiency and cost-effectiveness of the work done by a backup administrator. I’d propose the following rankings:

  • Platinum – 100% efficiency, salary-weight of 1.
  • Gold – 90% efficiency, salary-weight of 1.25.
  • Silver – 75% efficiency, salary-weight of 1.5.
  • Bronze – 50% efficiency, salary-weight of 3.

What this allows us to do is calculate the “cost” (in terms of effectiveness, and impact on other potential activities) of the backup administrator spending time on the various tasks within the environment. So, this means:

  • Platinum activities represent maximised efficiency of job function, and should not incur a cost.
  • Gold activities represent reasonably efficient activities that incur only a small cost.
  • Silver activities are still mostly efficient, with a slightly increased cost.
  • Bronze activities are at best a 50/50 split between being inefficient or efficient, and have a much higher cost.

So, if a backup administrator is being paid $30 per hour, and does 1 hour each of the above tasks, we can assign hidden/human resource costs as follows:

  • Platinum – 1.0 × 1.0 × $30 = $30 per hour.
  • Gold – 1.1 × 1.25 × $30 = $41.25 per hour.
  • Silver – 1.25 × 1.5 × $30 = $56.25 per hour.
  • Bronze – 1.5 × 3 × $30 = $135 per hour.

Some might argue that the above is not a “literal” cost, and sure, you don’t pay a backup administrator $30 for recoveries and $135 for media management. However, what I’m trying to convey is that not all activities performed by a backup administrator are created equal. Some represent best bang for buck, while others progressively represent less palatable activities for the backup administrator (and for the company to pay the backup administrator to do).

You might consider it thusly – if a backup administrator can’t work on a platinum task because a bronze task is “taking priority”, then that’s the penalty – $105 per hour of the person’s time. Of course though, that’s just the penalty for paying the person to do a less important activity. Additional penalties come into play when we consider that other people may not be able to complete work because they can’t get access to the data they need, etc. (E.g., consider the cost of a situation where 3 people can’t work because they need data to be recovered, but the backup administrator is currently swapping media in the tape library to ensure the weekend’s backups run…)

Once we know the penalty, we can start to factor in the additional costs of a sub-optimal environment. Assume, for instance, a backup administrator spends 1 hour on media management tasks per TB backed up per week. If 1.2TB of data doesn’t need to be backed up each week, that’s 1.2 hours of wasted activity by the backup administrator. With a $105 per hour penalty, that’s $126 per week wasted, or $6,552 per year.
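Pulling the whole model together in a few lines (the inefficiency factors, salary weights and hours are this post’s illustrative values, not measured figures):

```python
# Hidden-cost model: hourly "cost" per task tier, plus the annual penalty
# of media management driven by stagnant data.
BASE_RATE = 30.0  # $/hour

TIERS = {                      # (inefficiency factor, salary weight)
    "platinum": (1.0, 1.0),
    "gold":     (1.1, 1.25),
    "silver":   (1.25, 1.5),
    "bronze":   (1.5, 3.0),
}

def hourly_cost(tier, rate=BASE_RATE):
    ineff, weight = TIERS[tier]
    return ineff * weight * rate

penalty = hourly_cost("bronze") - hourly_cost("platinum")  # $105/hour
wasted_hours_per_week = 1.2   # 1 h media management per TB x 1.2 TB stagnant

print(hourly_cost("bronze"))                           # 135.0
print(round(penalty * wasted_hours_per_week, 2))       # 126.0 per week
print(round(penalty * wasted_hours_per_week * 52, 2))  # 6552.0 per year
```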

So far then, we have the following costs of not deleting/archiving:

  • Impact on backup window;
  • Impact on media usage requirements (i.e., what you’re backing up to);
  • Immediate penalty of excessive media management by backup administrator;
  • Potential penalty of backup administrator managing media instead of higher priority tasks.

The ironic thing is that deleting and archiving are things that smaller businesses seem to get better than larger ones. For smaller, workgroup-style businesses with no dedicated IT staff, the people who handle the backups don’t have the luxury of tape changers, large-capacity disk backup or cloud (ha!) – every GB of backup space has to be carefully apportioned, so the notions of data deletion and archive are well entrenched. Yearly projects are closed off, multiple duplicates are written, and then those chunks of data are removed from the backup pool.

When we start evaluating the real cost, in terms of time and money, of continually backing up stagnant data, the reasons against deleting or archiving data seem far less compelling. Ultimately, for safe and healthy IT operations, the entire data lifecycle must be followed.

In the next posts, we’ll consider the risks and challenges created by only archiving, or only deleting.

Jan 04 2011

I’m going to run a few posts about overall data management, and central to the notion of data management is the data lifecycle. While this is a relatively simple concept, it’s one that a lot of businesses actually lose sight of.

Here’s the lifecycle of data, expressed as plainly as possible:

[Figure: Data Lifecycle]

Data, once created, is used for a specific period of time (the length will depend on the purpose of the data, and doesn’t need to be considered in this discussion), and once primary usage is done, the future of the data must be considered.

Once the primary use for data is complete, there are two potential options for it – and the order of those options is important:

  • The data is deleted; or
  • The data is archived.

Last year my partner and I decided that it was time to uproot and move cities. Not just a small move, but to go from Gosford to Melbourne. That’s around a 1000km relocation, scheduled for June 2011, and with it come some big decisions. You see, we’ve spent 7 years where we’re currently living, and having been together for 14 years so far, we’ve accumulated a lot of stuff. I inherited strong hoarder tendencies from my father, and Darren has certainly had some strong hoarding tendencies himself in the past. Up until now, storage has been cheap (sound familiar?), but that’s no longer the case – we’ll be renting in Melbourne, and the removalists will charge us by the cubic metre, so all those belongings need to be evaluated. Do we still use them? If not, what do we do with them?

Taking the decision to commence a major purge of material possessions led me to the next unpleasant realisation: I’m a data-hoarder too. Give me a choice between keeping data and deleting it, or even archiving it, and I’d always keep it. However, having decided at the start of the year to transition from Google Mail to MobileMe, I started to look at all the email I’d kept over the years. Storage is cheap, you know. But that mentality had led to me accumulating over 10GB of email, going back to 1992. For what purpose? Why did I still need emails about University assignments? Why did I still need emails about price inquiries on PC133 RAM for a SunBlade 100? Why did I still need … well, you get the picture.

In short, I’ve realised that I’ve been failing data management #101 at a personal level, keeping everything I ever created or received in primary storage rather than seriously evaluating it based on the following criteria:

  • Am I still accessing this regularly?
  • Do I have a financial or legal reason to keep the data?
  • Do I have a sufficient emotional reason to keep the data?
  • Do I need to archive the data, or can it be deleted?
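Those four questions reduce to a simple triage, sketched here (the parameter names are my own; “must_retain” covers financial/legal reasons – or, at a personal level, emotional ones):

```python
def retention_decision(accessed_regularly, must_retain, worth_archiving):
    """Decide the fate of an item once its primary usage is questioned."""
    if accessed_regularly:
        return "keep"      # still in primary use
    if must_retain or worth_archiving:
        return "archive"   # out of primary use, but still required
    return "delete"        # stagnant and irrelevant

print(retention_decision(False, False, False))  # delete
print(retention_decision(False, True, False))   # archive
```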

The third question is not the sort a business should be evaluating against, but the other criteria are the same for any enterprise, of any size, as they were for me.

The net result, when I looked at those considerations was that I transferred around 1GB of email into MobileMe. I archived less than 500MB of email, and then I deleted the rest. That’s right – I, a professional data hoarder, did the unthinkable and deleted all those emails about university assignments, PC133 RAM price inquiries, discussions with friends about movie times for Lord of the Rings in 2001, etc.

Data hoarding is an insidious problem, well entrenched in many enterprises. Since “storage is cheap” has been a defining mentality, online storage and storage management costs have skyrocketed within businesses. As a result, we’ve now got complex technologies to provide footprint minimisation (e.g., data deduplication) and single-instance archive. Neither of these options is cheap.

That’s not to say those options are wrong; but the most obvious fact is that money is spent on a daily basis within a significant number of organisations retaining or archiving data that is no longer required.

There are three key ways that businesses can fail to understand the data lifecycle process. These are:

  • Get stuck in the “Use” cycle for all data. (The “penny-wise” problem.)
  • Archive, but never delete data. (The “hoarder” problem.)
  • Delete, rather than archive data. (The “reckless” problem.)

Any of these three failures can prove significantly challenging to a business, and in upcoming articles I’ll discuss each one in more detail.

The articles in the series are linked from this post as they’re published. There’s also an aside article, which discusses Stub vs Process Archives.
