Dark Data

We’ve all heard the term Big Data – it’s something the vendors have been ramming down our throats with the same level of enthusiasm as Cloud. Personally, I think Big Data is a problem that shouldn’t exist: for me, it serves as a stark criticism of OS, Application, Storage and Software companies for failing to anticipate the high end of the data growth arena and to develop suitable mechanisms for dealing with it as part of the regular tool sets. After all, why should the end user have to ask him/herself: “Hmmm, do I have data or big data?”

Moving right along, recently another term has been starting to pop up, and it’s a far more interesting – and legitimate – problem.

It’s dark data.

If you haven’t heard of the term, I’m betting that you’ve either guessed the meaning or have a bit of an idea about it.

Dark data refers to all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for. Such as:

  • All those user PST files on desktops and notebooks;
  • That server a small workgroup deployed for testing purposes that’s not centrally managed or officially known about;
  • That research data an academic is storing on a 2TB USB drive connected to her laptop;
  • That offline copy of a chunk of the fileserver someone grabbed before going overseas that’s now sufficiently different from the real content of the fileserver;
  • and so on.

Dark data is a real issue within the business environment, because there’s potentially a large amount of critical information “out there” in the business but not necessarily under the control of the IT department.

You might call it decentralised data.

As we know from data protection, decentralised backups are particularly dangerous; they increase the cost of control and maintenance, they decrease the reliability of the process, and they can be a security nightmare. It’s exactly the same for dark data – in fact, worse, because by the very nature of the definition, it’s also data that’s unlikely to be backed up.

To try to control the spread of dark data, some companies will institute rigorous local storage policies, but these often present bigger headaches than they’re worth. For instance, locking down user desktops to make local storage read-only isn’t always successful, and the network load added by shifting user profiles across to fileservers can be painful. Further, pushing these files across to centralised storage can make for extremely dense filesystems (or at least contribute towards them), trading one problem for another. Finally, it introduces new risk to the business, making users extremely unproductive if there are network or central storage issues.

There are a few things a business can do about dark data to reduce the headaches and challenges it creates. These are acceptance, anticipation, and discovery.

  1. Acceptance – Acknowledge that dark data will find its way into the organisation. Keeping the corporate head in the sand over the existence of dark data, or blindly adhering to the (false) notion that rigorous security policies will prevent storage of data anywhere in the organisation except centrally, is foolish. Now, this doesn’t mean that you have to accept that data will become dark. Instead, acknowledging that there will be dark data out there will keep it as a known issue. What’s more, because it’s actually acknowledged by the business, it can be discussed by the business. Discussion will facilitate two key factors: keeping users aware of the dangers of dark data, and encouraging users to report dark data.
  2. Anticipation – Accepting that dark data exists is one thing; anticipating what can be done about it, and how it might be found, allows a company to actually start dealing with dark data. Anticipating dark data can’t happen unless someone is responsible for it. Now, I’m not suggesting that being responsible for dark data means getting in trouble if there are issues with unprotected dark data going missing – if that were the case, not a single person in a company would want to be responsible for it. (And any person who did want to be responsible under those circumstances would likely not understand the scope of the issue.) The obvious person for this responsibility is the Data Protection Advisor. (See here and here.) You might argue that the dark data problem explicitly points out the need for one or more DPAs at every business.
  3. Discovery – No discovery process for dark data will be fully automated. There will be a level of automation that can be achieved via indexing and search engines deployed from central IT, but given dark data may be on systems which are only intermittently connected, or outside of the domain authority of IT, there will be a human element as well. This will consist of the DPA(s), end users, and team leaders, viz:
    • The DPA will not only be tasked with periodic visual inspections of his/her area of responsibility, but will also be responsible for issuing periodic reminders to staff, requesting notification of any local data storage.
    • End users should be aware (via induction, and company policies) of the need to avoid, as much as possible, the creation of data outside of the control and management of central IT. But they should equally be aware that in situations where this happens, a policy can be followed to notify IT to ensure that the data is protected or reviewed.
    • Team leaders should equally be aware of the potential for dark data creation, as per end users, but should also be tasked with liaising with IT to ensure dark data, once discovered, is appropriately classified, managed and protected. This may sometimes necessitate moving the data under IT control, but it may also at times be an acknowledgement that the data is best left local, with appropriate protection measures implemented and agreed upon.
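
As a rough illustration of the automated part of discovery, here is a minimal Python sketch that walks a directory tree and flags files that might be dark data. Everything in it is an assumption for the sake of the example, not a recommendation from any particular product: the suffix list, the size threshold, and the names `Finding` and `find_dark_data_candidates` are all hypothetical. It also only sees storage that is mounted and readable at the time it runs, which is exactly why the human element above is still needed.

```python
import os
from dataclasses import dataclass
from pathlib import Path

# Hypothetical defaults: file types that often hold unmanaged business data,
# and a size above which any file is worth a second look.
CANDIDATE_SUFFIXES = {".pst", ".ost", ".mdb", ".accdb"}
LARGE_FILE_BYTES = 1024 ** 3  # 1 GiB


@dataclass
class Finding:
    path: Path
    size_bytes: int
    reason: str  # "suffix" or "large"


def find_dark_data_candidates(root, suffixes=CANDIDATE_SUFFIXES,
                              large_file_bytes=LARGE_FILE_BYTES):
    """Walk `root` and return possible dark data files, largest first."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                size = path.stat().st_size
            except OSError:
                # Unreadable or vanished file: skip it rather than abort the scan.
                continue
            if path.suffix.lower() in suffixes:
                findings.append(Finding(path, size, "suffix"))
            elif size >= large_file_bytes:
                findings.append(Finding(path, size, "large"))
    return sorted(findings, key=lambda f: f.size_bytes, reverse=True)


if __name__ == "__main__":
    # Scan the current user's home directory and print a simple report.
    for finding in find_dark_data_candidates(Path.home()):
        print(f"{finding.size_bytes:>15,d}  {finding.reason:<6}  {finding.path}")
```

A report like this is only a starting point for the conversation between the DPA, the end user and the team leader: it says where to look, not whether the data should be moved under IT control or protected where it sits.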

Dark data is a real problem that will exist in practically every business; however, it doesn’t have to be a serious problem when carefully dealt with. The above three rules – acceptance, anticipation, and discovery – will ensure it stays managed.

[2012-01-27 Addendum]

There’s now a follow-up to this article – “Data Awareness Distribution in the Enterprise”.
