Continuing on from last week’s post about dark data, I want to spend a little more time on data awareness classification and distribution within an enterprise environment.

Dark data isn’t the end of the story, and it’s time to introduce the entire family of data-awareness concepts. These are:

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote the previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data which has been classified out of the grey data state into a state where it is known to have value, but is neither managed nor protected, because it can be recreated. The decision may be that the cost (in time) of recreating the data is lower than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but rather all the “bits” (no pun intended) that are left over, being neither data, grey data nor utility data. In essence, this is irrelevant data which someone or some group may be keeping for unnecessary reasons, and which should be considered eligible for either deletion or archival and deletion.
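To make the definitions above concrete, here is a minimal Python sketch of the classification. The function name, its parameters and the handling of the last case (valuable data that can't be recreated but has no policy yet stays grey) are illustrative assumptions, not part of any product or standard.

```python
from enum import Enum
from typing import Optional


class Awareness(Enum):
    DATA = "data"
    DARK = "dark data"
    GREY = "grey data"
    UTILITY = "utility data"
    NOISE = "noise"


def classify(known: bool, managed: bool, protected: bool,
             has_value: Optional[bool] = None,
             recreatable: bool = False) -> Awareness:
    """Map a data set's attributes to an awareness state (hypothetical helper)."""
    if not known:
        return Awareness.DARK            # not yet discovered
    if managed and protected:
        return Awareness.DATA            # covered by both ILM and ILP
    if has_value is None:
        return Awareness.GREY            # discovered, but no decision made yet
    if not has_value:
        return Awareness.NOISE           # irrelevant: candidate for deletion
    if recreatable:
        return Awareness.UTILITY         # valuable, but cheaper to recreate
    # Assumption: valuable, not recreatable, no policy assigned yet.
    # It stays grey until it is brought under ILM and ILP.
    return Awareness.GREY


print(classify(known=True, managed=False, protected=False,
               has_value=True, recreatable=True))  # Awareness.UTILITY
```

The ordering of the checks matters: anything unknown is dark regardless of its other attributes, and anything already managed and protected is regular data regardless of whether it could be recreated.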

The distribution of data by awareness within the enterprise may resemble something along the following lines:

[Figure: Data Awareness Percentage Distribution]

That is, ideally the largest percentage of data should be regular data which is known, managed and protected. In all likelihood for most organisations, the next biggest percentage of data is going to be dark data – the data that hasn’t been discovered yet. Ideally, however, once regular and dark data have been removed from the distribution, there should be at most 20% of data left. At least half of that remainder should be utility data, with the last 10% split evenly between grey data and noise.
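As a worked example of that arithmetic, the sketch below checks a percentage breakdown against the guidelines in the paragraph above. The function name and the sample figures (60% regular data, 20% dark data) are assumptions for illustration only; the text fixes just the remaining 20% (10% utility, 5% grey, 5% noise).

```python
from typing import Dict, List


def check_distribution(pct: Dict[str, int]) -> List[str]:
    """Return the guidelines a percentage breakdown violates (hypothetical helper)."""
    problems = []
    if sum(pct.values()) != 100:
        problems.append("percentages do not sum to 100")
    if pct["data"] != max(pct.values()):
        problems.append("regular data is not the largest share")
    # What is left once regular and dark data are taken out.
    remainder = pct["utility"] + pct["grey"] + pct["noise"]
    if remainder > 20:
        problems.append("more than 20% left after regular and dark data")
    if 2 * pct["utility"] < remainder:
        problems.append("utility data is less than half of the remainder")
    return problems


# Assumed split: only the utility/grey/noise figures come from the text.
ideal = {"data": 60, "dark": 20, "utility": 10, "grey": 5, "noise": 5}
print(check_distribution(ideal))  # []

skewed = {"data": 40, "dark": 30, "utility": 5, "grey": 15, "noise": 10}
print(len(check_distribution(skewed)))  # 2
```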

The logical implications of this layout should be reasonably straightforward:

  1. At all times the majority of data within an organisation should be known, managed and protected.
  2. It should be expected that at least 20% of the data within an organisation is undiscovered, or decentralised.
  3. Once data is discovered, it should exist in a ‘grey’ state for a very short period of time; ideally it should be reclassified as soon as possible into data, utility data or noise. In particular, data left in a grey state for an extended period of time represents just as dangerous a potential data loss situation as dark data.

It should be noted that regular data, even in this awareness classification scheme, will still be subject to regular data lifecycle decisions (archive, tiering, deletion, etc.). In that sense, primary data eligible for deletion isn’t really noise, because it has previously been managed and protected. Noise is former dark data that will end up being deleted, either as an explicit decision, or through a failure at some point after it was classified as ‘noise’, having never been managed or protected in a centralised, coordinated manner.

Equally, utility data doesn’t refer to, say, QA or test databases that replicate the content of production databases. These types of databases will again have fallen under the standard data umbrella, in that information lifecycle management and protection policies will have been established for them, regardless of what those policies actually were.

If we bring this back to roles, then it’s clear that a pivotal role of both the DPAs (Data Protection Advocates) and the IPAC (Information Protection Advisory Council) within an organisation should be the rapid coordination of classification of dark data as it is discovered into one of the data, utility data or noise states.

 

There’s a report over at iTWire that has two highly pertinent details. (iTWire – Aussie storage growth above average: Gartner.)

The article is about how Australian spending on storage is growing faster than the rest of the world (IMHO that’s just further proof of how helpful the government stimulus package was), and has two particular points of interest.

First:

The big winner was EMC, which saw its revenue from the region grow from $US533.9 million to $US716.0 million. Most other vendors also saw improved revenues…

That doesn’t surprise me. As an employee of an EMC partner, I know EMC has been pushing very strongly in the Australian market over the last 12 months. I fully believe that other vendors have been pushing hard and (for the most part) achieving good results, but EMC has had a really solid story during this spending cycle, and it’s been paying off – time and time again.

What really didn’t surprise me though was the “but” following that above quote:

…but the biggest loser was Oracle. In 2009, Sun had $US134.4 million revenue in 2009. Now part of Oracle, it only recorded $US82.1 million revenue in 2010

Since the Oracle acquisition of Sun, every single one of my customers who had previously been a large Sun customer has either been resolutely turning away from the vendor, or eyeing them with firm displeasure. Why? Oracle’s higher prices for maintenance and product have had a significant impact on the budgetary options available to one of Sun’s biggest previous customer bases – the educational market. (This, for what it’s worth, is why I penned the article last year, “RIP Solaris“.)

While I’m not normally one to put much stock in analyst reports, this one seems to gel with what I’ve been seeing for the past 12 months.

 

I’m stepping out of my normal NetWorker zone here to briefly discuss what I think is a fundamental flaw with the current state of thin provisioning.

The notion of thin provisioning has been around for ages, dating back to the mainframe era, but we started to see it come back into focus a while ago with the notion of “expanding disks” for virtualisation products. Ironically these started in the workstation products (VMware Workstation, Parallels Desktop, etc.) before gaining popularity at the enterprise virtualisation layer.

Yet thin provisioning doesn’t stop there – it’s also available at the array level, particularly in NAS devices. So what happens when you mix guest thin provisioning in a hypervisor with thin provisioning at the array/NAS level providing storage to the hypervisor?

Chaos.

Multiple layers of thin provisioning are potentially a major management headache in systems storage allocation. Why? They make it practically impossible to determine what storage you have available and allocated when looking at any one layer. vSphere, for instance, may see that you’ve got 2TB of free, currently unallocated storage, and your NAS may be telling it there’s 2TB of free space, but the NAS may actually only have 500GB free. Compounding the issue, the individual operating systems leveraging that storage as guests will also each have their own ideas about how much storage is available for use. One system suffering unexpected data growth (e.g., a patch provided by a vendor without warning that it’ll generate thousands of log messages a minute) might cause the entire thin provisioning sand castle to collapse around you.
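The gap between apparent and real free space comes down to simple arithmetic. The sketch below is a minimal illustration with made-up volume names and sizes (chosen so the result matches the 2TB versus 500GB example above, counting 1TB as 1000GB); it is not a model of any particular array or hypervisor.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ThinVolume:
    name: str
    provisioned_gb: int  # size promised to the consumer
    written_gb: int      # blocks actually written so far


def pool_report(physical_gb: int, volumes: List[ThinVolume]) -> Dict[str, float]:
    """Compare what consumers think is free with what the pool really has."""
    provisioned = sum(v.provisioned_gb for v in volumes)
    written = sum(v.written_gb for v in volumes)
    return {
        "apparent_free_gb": provisioned - written,   # what the upper layer sees
        "real_free_gb": physical_gb - written,       # what the disks can still hold
        "oversubscription": provisioned / physical_gb,
    }


# Hypothetical pool: 4000GB of physical disk behind two thin datastores.
volumes = [
    ThinVolume("datastore1", provisioned_gb=3000, written_gb=2000),
    ThinVolume("datastore2", provisioned_gb=2500, written_gb=1500),
]
print(pool_report(4000, volumes))
# {'apparent_free_gb': 2000, 'real_free_gb': 500, 'oversubscription': 1.375}
```

Stacking a second thin layer on top (guest disks inside those datastores) repeats the same calculation one level up, which is why no single layer's view can be trusted on its own.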

This leads me to my concern about what’s missing in thin provisioning: a consolidated dashboard. That is, a cross-platform, cross-vendor dashboard where every product that advertises “thin provisioning” can share information in the storage realm so that you, the storage administrator, can instantly see an exact display of allocated vs available real capacity.

This isn’t something that’s going to appear tomorrow, but I’d suggest that if all the vendors currently running around shouting about “thin provisioning” are really serious about it, they’d come up with a common, published API that can be used by any product to query through the entire storage-access vertical. I regret to say the C-word, but it’s clear there needs to be an inter-vendor Committee to discuss this requirement. That’s right, NetApp and EMC, HDS and HP, VMware and Microsoft (just to name a few) all need to sit at the same table and agree on a common framework that can be leveraged.
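Purely to illustrate what such a common API might look like, here is a hypothetical Python sketch. Nothing in it corresponds to an existing vendor interface: `CapacityReporter`, `Layer` and `usable_free_gb` are invented names, and the sizes are made up. The idea is simply that each layer reports its own free space and points at the layer backing it, so a dashboard can walk the whole storage-access vertical.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class CapacityReporter(Protocol):
    """Hypothetical common interface; not an existing vendor API."""

    def advertised_free_gb(self) -> int: ...

    def backing(self) -> Optional["CapacityReporter"]: ...


@dataclass
class Layer:
    """Toy implementation standing in for a guest, hypervisor or array."""
    name: str
    free_gb: int
    below: Optional["Layer"] = None

    def advertised_free_gb(self) -> int:
        return self.free_gb

    def backing(self) -> Optional["Layer"]:
        return self.below


def usable_free_gb(top: CapacityReporter) -> int:
    """Upper bound on writable space: the tightest layer in the stack wins."""
    free = top.advertised_free_gb()
    layer = top.backing()
    while layer is not None:
        free = min(free, layer.advertised_free_gb())
        layer = layer.backing()
    return free


array = Layer("NAS pool", 500)
datastore = Layer("vSphere datastore", 2000, below=array)
guest = Layer("guest filesystem", 800, below=datastore)
print(usable_free_gb(guest))  # 500
```

Taking the minimum is a deliberate simplification: it ignores other consumers competing for the same pool, so a real framework would need to expose allocation as well as free space.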

Without this, we’ll just keep going down the current rather chaotic and hazardous thin provisioning pathway. It’s like an uncleared minefield – you may manage to stagger through it without being blown up, but the odds are against you.

Surely even the vendors can see the logical imperative to reduce those odds.

Disclaimer: I’m prepared to admit that I’m completely wrong, and that vendors have already tackled this and I missed the announcement. Someone, please prove me wrong.

 

Over at Xiotech’s blog, there’s an interesting piece about the evolution of 2.5″ drives in enterprise storage titled The Great Shrinking Disk Drive.

I’m not 100% convinced of Xiotech’s argument, but over the years I’ve seen increasing use of 2.5″ drives in enterprise computing – particularly to decrease the footprint and power requirements for DAS in rack-mount servers, etc.

© 2012 The NetWorker Blog