Jan 24 2017

In 2013 I undertook to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand on them based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was increasingly common, it was (for the most part) used as a staging activity (“disk to disk to tape”), and backup-to-disk targets were either dumb filesystems or Virtual Tape Libraries (VTLs).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, the core tenets of Cloud computing that made it so popular (e.g., agility and scalability) have been well and truly adopted as essential tenets of the modern datacentre, too. Indeed, to compete against Cloud, on-premises IT has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to the business.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book, the pinnacle of storage performance was the 15,000 RPM drive, and flash storage was something you (primarily) used in digital cameras, with capacities measured in the hundreds of megabytes rather than gigabytes (or, now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendancy – virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as separate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process, and needed things done faster, cheaper and more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and, for many businesses, profit too. Flash systems now offer significantly more IOPS than a traditional array ever could – Dell EMC, for instance, can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPS. To achieve ten million IOPS on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability was born. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for good reason: you must have a grasp of those other aspects before you can start considering everything else. Otherwise you’re just deploying point solutions, and eventually point solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability will be released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

Jan 22 2012

Dark Data

We’ve all heard the term Big Data – it’s something the vendors have been ramming down our throats with the same level of enthusiasm as Cloud. Personally, I think Big Data is a problem that shouldn’t exist: it serves, for me, as a stark criticism of OS, application, storage and software companies for failing to anticipate the high end of data growth and to develop suitable mechanisms for dealing with it as part of their regular tool sets. After all, why should the end user have to ask him/herself: “Hmmm, do I have data or big data?”

Moving right along: recently another term has started to pop up, and it’s a far more interesting – and legitimate – problem.

It’s dark data.

If you haven’t heard of the term, I’m betting that you’ve either guessed the meaning or have a bit of an idea about it.

Dark data refers to all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for. Such as:

  • All those user PST files on desktops and notebooks;
  • That server a small workgroup deployed for testing purposes that’s not centrally managed or officially known about;
  • That research data an academic is storing on a 2TB USB drive connected to her laptop;
  • That offline copy of a chunk of the fileserver someone grabbed before going overseas that’s now sufficiently different from the real content of the fileserver;
  • and so on.

Dark data is a real issue within the business environment, because there’s potentially a large amount of critical information “out there” in the business but not necessarily under the control of the IT department.

You might call it decentralised data.

As we know from data protection, decentralised backups are particularly dangerous; they increase the cost of control and maintenance, they decrease the reliability of the process, and they can be a security nightmare. It’s exactly the same for dark data – in fact, worse, because by the very nature of the definition, it’s also data that’s unlikely to be backed up.

To try to control the spread of dark data, some companies will institute rigorous local storage policies, but these often present bigger headaches than they’re worth. For instance, locking down user desktops to make local storage not writeable isn’t always successful, and the added network load by shifting user profiles across to fileservers can be painful. Further, pushing these files across to centralised storage can make for extremely dense filesystems (or at least contribute towards them), trading one problem for another. Finally, it introduces new risk to the business, making users extremely unproductive if there are network or central storage issues.

There are a few things a business can do in relation to dark data to decrease the headaches and challenges it creates. These are acceptance, anticipation, and discovery.

  1. Acceptance – Acknowledge that dark data will find its way into the organisation. Keeping the corporate head in the sand over the existence of dark data, or blindly adhering to the (false) notion that rigorous security policies will prevent storage of data anywhere in the organisation except centrally, is foolish. Now, this doesn’t mean that you have to accept that data will become dark. Instead, acknowledging that there will be dark data out there will keep it as a known issue. What’s more, because it’s actually acknowledged by the business, it can be discussed by the business. Discussion will facilitate two key factors: keeping users aware of the dangers of dark data, and encouraging users to report dark data.
  2. Anticipation – Accepting that dark data exists is one thing; anticipating what can be done about it, and how it might be found allows a company to actually start dealing with dark data. Anticipating dark data can’t happen unless someone is responsible for it. Now, I’m not suggesting that being responsible for dark data means getting in trouble if there are issues with unprotected dark data going missing – if that were the case, not a single person in a company would want to be responsible for it. (And any person who did want to be responsible under those circumstances would likely not understand the scope of the issue.) The obvious person for this responsibility is the Data Protection Advisor. (See here and here.) You might argue that the dark data problem explicitly points out the need for one or more DPAs at every business.
  3. Discovery – No discovery process for dark data will be fully automated. There will be a level of automation that can be achieved via indexing and search engines deployed from central IT, but given dark data may be on systems which are only intermittently connected, or outside of the domain authority of IT, there will be a human element as well. This will consist of the DPA(s), end users, and team leaders, viz:
    • The DPA will be tasked with not only periodic visual inspections of his/her area of responsibility, but will also be responsible for issuing periodic reminders to staff, requesting notification of any local data storage.
    • End users should be aware (via induction, and company policies) of the need to avoid, as much as possible, the creation of data outside of the control and management of central IT. But they should equally be aware that in situations where this happens, a policy can be followed to notify IT to ensure that the data is protected or reviewed.
    • Team leaders should equally be aware of the potential for dark data creation, as per end users, but should also be tasked with liaising with IT to ensure dark data, once discovered, is appropriately classified, managed and protected. This may sometimes necessitate moving the data under IT control, but it may also at times be an acknowledgement that the data is best left local, with appropriate protection measures implemented and agreed upon.

Dark data is a real problem that will exist in practically every business; however, it doesn’t have to be a serious problem when carefully dealt with. The above three rules – acceptance, anticipation, and discovery – will ensure it stays managed.

[2012-01-27 Addendum]

There’s now a followup to this article – “Data Awareness Distribution in the Enterprise”.

Questions about “big data”

Nov 06 2011

I’ve been watching the “big data” discussion happen in a variety of circles, with a slightly cynical concern that this may be like Cloud 2.0 – another sad meme for technology that’s already been in use for some time, but with an excuse to slap a 30% mark-up on it.

So, the simple question really is this – is “big data” a legitimate or an illegitimate problem?

By legitimate – is it a problem which truly exists in and of itself? Has data growth in places hit a sufficiently exponential curve that existing technology and approaches can’t keep up …


… is it an illegitimate problem, in that it speaks of (a) a dumbing down of computer science which has resulted in a lack of developmental foresight into problems we’ve seen coming for some time, and/or (b) a failure of IT companies (from base component manufacturers through to vendors across the board) to sufficiently innovate?

For me, the jury is still out, and I’ll use a simple example as to why. I deal with big data regularly – since “big data” is defined as anything outside a normal technical scope, if I get, say, a 20 GB log file from a customer to analyse, none of my standard tools can help with it. So instead, I start with pattern analysis – rather than trying to extract key terms or manually read the file, I skim it: I’ll literally start by “cat”-ing the file and letting it stream past me. At that level, if the software has been written correctly, you’ll notice oddities in the logs that point you to the areas you have to delve into. You can then refine the skimming, and eventually drill down to the point where you only need to analyse a very small fragment of the file.
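That skim-then-drill-down workflow can itself be partly automated. Here’s a minimal Python sketch of the idea: collapse each line to a “template” by masking the variable fields, then surface lines whose template is rare – those are the oddities worth delving into. The regex and the rarity threshold are illustrative assumptions, not a general-purpose log analyser:

```python
import re
from collections import Counter

def template_of(line):
    """Collapse variable fields (numbers, and hence timestamps and
    counters) so lines produced by the same log statement share
    one template."""
    return re.sub(r"\d+", "<N>", line).strip()

def rare_lines(log_lines, threshold=2):
    """Skim a log by frequency: lines whose template appears fewer
    than `threshold` times are the anomaly candidates."""
    counts = Counter(template_of(line) for line in log_lines)
    return [line for line in log_lines
            if counts[template_of(line)] < threshold]
```

For example, in a log full of repeated heartbeat messages, a one-off error line falls out immediately – which is roughly what the eye does during a manual “cat” skim, just without the eyestrain.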

So I look at big data and think – is this a problem caused by a lack of AI being applied to standard data processing techniques? Of admitting – we need to build a level of heuristic decision making into standard products so they can scale up to deal with ever increasing data sets? That the solution is more intelligence and self-management capabilities in the software and hardware? And equally, of developers failing to produce systems that generate data in such a way that it’s susceptible to automated types of pattern analysis?

Of course, this is, to a good degree, what people are talking about when they’re talking about big data.

But why? Do we gain any better management and analysis by cleaving “data” and “big data” into two separate categories?

Or is this a self-fulfilling meme that came out as a result of poor approaches to information science?