Data protection lessons from a tomato

Jul 25, 2013


Data protection lessons from a tomato? Have I gone mad?

Bear with me.

DIKW Model

If you’ve done any ITIL training, the above diagram will look familiar to you. Rather unimaginatively, it’s called the DIKW model:

Data > Information > Knowledge > Wisdom

A simple, practical example of what this diagram/model means is the following:

  • Data – Something is red and round.
  • Information – It’s a tomato.
  • Knowledge – A tomato is a fruit.
  • Wisdom – You don’t put tomato in a fruit salad.

That’s about as complex as DIKW gets. Being such a simple concept, though, it can be applied in quite a few areas.

When it comes to data protection, the model’s relevance is obvious: how critical the data is to business wisdom has a direct impact on the level of protection you need to apply to it.

In this case, I’m expanding the definition of wisdom a little. According to my Apple dashboard dictionary, wisdom is:

the quality of having experience, knowledge, and good judgement; the quality of being wise

Further, we can talk about wisdom in terms of accumulated experience:

the body of knowledge and experience that develops within a specified society or period.

So corporate wisdom is about having the experience and knowledge required to act with good judgement, and represents the sum of the knowledge and experience a corporation has built up over time.

If you think about wisdom in terms of corporate wisdom, then you’ll understand my point. For instance, a key database for a company – or the email system – represents a tangible chunk of corporate wisdom. Core fileservers will also be pretty far up the scale. On the other hand, it’s unlikely (in a business with appropriate storage policies) that the files on a regular end-user’s desktop or laptop will go much beyond information on the DIKW scale.

Of course, there are always exceptions. I’ll get to that in a moment.

What this comes back to pretty quickly is the need for Information Lifecycle Protection. End users and the business overall are typically not interested in data – they’re interested in information. They don’t care, as such, about the backup of /u01/app/oracle/data/CORPTAX/data01.dbf – they care about the corporate tax database. That, of course, means that the IT group and the business need to build service level agreements around business functions, not servers and storage. As ITIL teaches, the agreements about networks, storage, servers, etc., come in the form of operational level agreements between the segments of IT.
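
A minimal sketch of that idea follows, assuming entirely invented hostnames and SLA figures (only the corporate tax example and its datafile path come from the discussion above): the business signs off on the function-level SLAs, while the infrastructure detail sits in operational-level mappings between IT teams.

```python
# Hypothetical sketch: SLAs are expressed against business functions,
# while the infrastructure detail sits in operational-level mappings.
business_slas = {
    "Corporate Tax": {"rpo_hours": 1, "rto_hours": 4},
    "Email":         {"rpo_hours": 4, "rto_hours": 8},
}

# Operational-level view: which infrastructure underpins each function.
# Hostnames below are invented; the datafile path is the one cited above.
operational_map = {
    "Corporate Tax": ["oradb01", "/u01/app/oracle/data/CORPTAX/data01.dbf"],
    "Email":         ["mail01", "mail02"],
}

def protection_targets(function):
    """Resolve a business-function SLA down to the systems IT must protect."""
    return business_slas[function], operational_map[function]

print(protection_targets("Corporate Tax"))
```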

Ironically, it’s something I covered in my book years before studying ITIL, in the notion of establishing system dependency maps:

System Maps

(In the diagram, the number in parentheses beside a server or function is its reference number; D:X means that it depends on the nominated referenced server/function X.)
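
To make the D:X notation concrete, here’s a rough sketch (the reference numbers and system names are invented) of how such a dependency map might be recorded and then walked to produce a recovery order:

```python
# Hypothetical system dependency map: reference number -> (name, depends-on refs).
# Mirrors the D:X notation: each entry lists the reference numbers it depends on.
systems = {
    1: ("DNS/AD", []),
    2: ("Storage array", []),
    3: ("Oracle DB server", [1, 2]),
    4: ("Corporate tax application", [3]),
}

def recovery_order(systems):
    """Return system names ordered so dependencies are recovered first."""
    ordered, seen = [], set()

    def visit(ref):
        if ref in seen:
            return
        seen.add(ref)
        for dep in systems[ref][1]:
            visit(dep)
        ordered.append(ref)

    for ref in systems:
        visit(ref)
    return [systems[ref][0] for ref in ordered]

print(recovery_order(systems))
# ['DNS/AD', 'Storage array', 'Oracle DB server', 'Corporate tax application']
```

Anything a function depends on comes back first – which is precisely what the map is for when you’re planning a recovery sequence.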

What all this boils down to is the criticality of one particular activity when preparing an Information Lifecycle Protection system within an organisation: Data classification. (That of course is where you should catch any of those exceptions I was talking about before.)

In order to properly back something up with the appropriate level of protection and urgency, you need to know what it is.

Or, as Stephen Manley said the other day:

OH at Starbucks – 3 page essay ending with ‘I now have 5 pairs of pants, not 2. That’s 3 more.’ Some data may not need to be protected.

Some data may not need to be protected. Couldn’t have said it better myself. Of course, I do also say that it’s better to back up a little bit too much data than not enough, but that’s not something you should see as carte blanche to just back up everything in your environment at all times, regardless of what it is.

The thing about data classification is that most companies do it without first finding all their data. The first step, and possibly the hardest, is becoming aware of the data distribution within the enterprise. If you want to skip reading the post linked to in the previous sentence, here’s the key information from it (with a rough sketch of how the categories might be encoded following the list):

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote [a] previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data subsequently classified out of the grey data state into a state where it is known to have value, but is neither managed nor protected, because it can be recreated. The decision may be made that the cost (in time) of recreating the data is less than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but all the “bits” (no pun intended) that are left over which are neither data, grey data nor utility data. In essence, this is irrelevant data, which someone or some group may be keeping for unnecessary reasons, and which in actual fact should be considered eligible for either deletion, or archival and deletion.
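
As flagged above, here’s one way those categories might be encoded during a discovery exercise. The enum and the decision rules are purely illustrative, not a prescription – real classification needs business input.

```python
from enum import Enum

class DataClass(Enum):
    DATA = "managed and protected"        # known, under both ILM and ILP
    DARK = "not yet accounted for"
    GREY = "discovered, no decision yet"
    UTILITY = "valued but recreatable"    # neither managed nor protected
    NOISE = "eligible for deletion/archival"

def classify(known, decided, has_value, recreatable):
    """Illustrative decision rules only; thresholds belong to the business."""
    if not known:
        return DataClass.DARK
    if not decided:
        return DataClass.GREY
    if not has_value:
        return DataClass.NOISE
    if recreatable:
        return DataClass.UTILITY
    return DataClass.DATA

print(classify(known=True, decided=True, has_value=True, recreatable=True))
# DataClass.UTILITY
```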

Once you’ve found your data, you can classify it. What’s structured and unstructured? What’s the criticality of the data? (I.e., what level of business wisdom does it relate to?)

But even then, you’re not quite ready to determine what your information lifecycle protection policy will be for the data – well, not until you have a data lifecycle policy, which, at its simplest, looks something like this:

Data Lifecycle

 

Of course, there’s a lot of time and a lot of decisions bunched up in that diagram, but the lifecycle of data within an organisation is actually that simple at the conceptual level. Or rather, it should be. If you want to read more about data lifecycle, click here for the intro piece – there are several accompanying pieces listed at the bottom of the article.

When considered from a backup perspective, though, the end goal of a data lifecycle policy is simple:

Back up only that which needs to be backed up.

If data can be deleted, delete it.

If data can be archived, archive it.

The logical implication, of course, is that if you can’t classify it – if you can’t determine its criticality – then the core backup mantra, “always better to back up a little bit more than not enough”, takes precedence, and you should be working out how to back it up. Obviously, as a fallback rule it works, but it’s best to design your overall environment and data policies to avoid relying on it.
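
Boiled down to something mechanical, that fallback logic might look like the following sketch – the predicates are hypothetical stand-ins for whatever your classification process actually produces:

```python
def lifecycle_action(record):
    """Sketch of the 'delete, archive, else back up' decision for one data set.

    'record' is assumed to carry the results of classification; if it was
    never classified, the fallback mantra applies and we back it up anyway.
    """
    if not record.get("classified"):
        return "backup"            # can't determine criticality: back it up
    if record.get("deletable"):
        return "delete"
    if record.get("archivable"):
        return "archive"
    return "backup"

print(lifecycle_action({"classified": True, "archivable": True}))   # archive
print(lifecycle_action({"classified": False}))                      # backup
```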

So to summarise:

  1. Following the DIKW model, the closer data is to representing corporate wisdom, the more critical its information lifecycle protection requirements will be.
  2. In order to determine that criticality you first have to find the data within your environment.
  3. Once you’ve found the data in your environment, you have to classify it.
  4. Once you’ve classified it, you can build a data lifecycle policy for it.
  5. And then you can configure the appropriate information lifecycle protection for it.

If you think back to EMC’s work towards mitigating the effects of accidental architectures, you’ll see where I was coming from in talking about the importance of procedural change to arrest further accidental architectures. It’s a classic ER technique – identify, triage and heal.

And we can learn all this from a tomato, sliced and salted with the DIKW model.

Snapshots and Backups, Part 2

Feb 08, 2010

Over the weekend I wrote up a piece about how snapshots are not a valid replacement for enterprise backup. The timing was in response to NetApp recently abandoning development of their VTL systems, and the subsequent discussions this triggered, but it was something I’d had sitting in the wings for a while.

It’s fair to say that discussions on snapshots and backups polarise a lot of people; I’ll fully admit that I side with the “snapshots can’t replace backups” side of the argument.

I want to go into this in a little more detail. First, in fairness, I’ll point out that there are people willing to argue the other side who don’t work for NetApp, in the same way that I don’t work for EMC. One of those is the other Preston – W. Curtis Preston – and you can read his articulate case here. I’m not going to spend this article going point for point against Curtis; it’s not the primary point of discussion I want to make in this entry.

Moving away from vendors and consultants, another very interesting opinion, from the customer perspective, comes from Martin Glassborow’s Storagebod blog. Martin brings up some valid customer points – chief among them that a snapshot-and-replication approach represents extreme hardware lock-in. Some would argue that any vendor’s backup product represents vendor lock-in as well, and this is partly right – though remember it’s not so difficult to keep a virtual machine around with the “last state” of the previous backup application available for recovery purposes. Keeping old and potentially obsolete NAS technology running to facilitate older recoveries after a vendor switch can be a little more challenging.

To get onto what I want to raise today, I need to revisit a previous topic as a means of further explaining my position. Let’s look for instance at my previous coverage of Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP). You can read the entire piece here, but the main point I want to focus on is my ILP ‘diagram’:

Components of ILP

One of the first points I want to make from that diagram is that I don’t exclude snapshots (and their subsequent replication) from an overall information lifecycle protection mechanism. Indeed, depending on the SLAs involved, they’re going to be practically mandatory. But, to use the analogy offered by the above diagram, they’re just pieces of the pie rather than the entire pie.

I’m going to extend my argument a little now, and go beyond just snapshots and replication, so I can elucidate the core reasons why I don’t like replicated snapshots as a permanent backup solution. Here’s a few other things I don’t like as a permanent backup solution:

  • VTLs replicated between a primary and disaster recovery site, with no tape out.
  • ADV_FILE (or other products’ disk backup solutions) cloned/duplicated between the primary and disaster recovery site, with no tape out.
  • Source based deduplication products with replication between two locations, with no tape out.

My fundamental objection to all of these solutions is the exposure to long-term failure that comes from keeping everything “online”. Maybe I’m a pessimist, but when I’m considering backup/recovery and disaster recovery solutions, I firmly believe I’m being paid to consider all likely scenarios. I don’t personally believe in luck, and I won’t trust a backup/disaster recovery solution to luck either. The old Clint Eastwood quote comes to mind here:

You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?

When it comes to your data, no, no I don’t. I don’t feel lucky, I don’t encourage you to feel lucky. Instead I rely on solid, well protected systems with offline capabilities. Thus, I plan for at least some level of cascading failures.

It’s the offline component that’s most critical. Do I want all my backups for a year online, only online, even with replication? Even more importantly – do I want all your backups online, only online, even with replication? The answer remains a big fat no.

The simple problem with any solution that doesn’t provide for offline storage is that, in my opinion, it brings the risk of cascading failures into play too easily. It’s like putting all storage for your company on a single RAID-5 LUN and not having a hot spare. Sure, you’re protected against that first failure, but it’s shortly after the first failure that Murphy will make an appearance in your computer room. (And I’ll qualify here: I don’t believe in luck, but I’ve observed on many occasions over the years that Murphy’s Law rules in computer rooms as well as in other places.) Or to put it another way: you may hope for the best, but you should plan for the worst. Let’s imagine a “worst case scenario”: a fire starts in your primary datacentre 10 minutes after upgrade work has commenced on the array at your disaster recovery site that receives the replicated snapshots, and that upgrade runs into firmware problems, leaving the array inaccessible until vendor repairs are complete. Or worse again, it leaves the storage corrupted.

Or if that seems too extreme, consider a more basic failure: a contractor near to your primary datacentre digs through the cables linking your production and disaster recovery sites, and it’s going to take 3 days to repair. Suddenly you’ve got snapshots and no replication. Just how lucky does that leave you feeling? Personally, I feel slightly naked and vulnerable when I have a single backup that’s not cloned. If suddenly none of my backups were getting duplicated, and I had no easy access to my clones, I’d feel much, much worse. (And that full body shiver I do from time to time would get very pronounced.)

All this talk of single-instance failure frequently leads proponents of snapshots+replication-only designs to suggest that a good design will see 3-way replication, so there are always two backup instances. This doubles a lot of costs while merely moving the failure point just a jump to the left. On the other hand, offline backup where there’s the backup from today, the backup from yesterday, the backup from the day before … the backup from last week, the backup from last month, etc. – all offline, all likely on different media – now that’s failure mitigation. Even if something happens and I can’t recover the most recent backup, in many recovery scenarios I can go back one day, two days, three days, etc. Oh yes, you can do that with snapshots too, but not if the array is a smoking pile of metal and plastic fused to the floor after a fire. In some senses, it’s similar to the old issue of trying to get away from cloning by backing up from the production site to media on the disaster recovery site. It just doesn’t provide adequate protection. If you’re thinking of using 3-way replication, why not instead have a solution that uses two entirely different types of data protection to mitigate against extreme levels of failure?

It’s possible I’ll have more to say on this in the coming weeks, as I think it’s important, regardless of your personal viewpoint, to be aware of all the arguments on both sides of the fence.

Nov 24, 2009

Over at StorageNerve, and on Twitter, Devang Panchigar has been asking “Is Storage Tiering ILM or a subset of ILM, but where is ILM?” I think it’s an important question with some interesting answers.

Devang starts with defining ILM from a storage perspective:

1) A user or an application creates data and possibly over time that data is modified.
2) The data needs to be stored and possibly be protected through RAID, snaps, clones, replication and backups.
3) The data now needs to be archived as it gets old, and retention policies & laws kick in.
4) The data needs to be searchable and retrievable NOW.
5) Finally the data needs to be deleted.

I agree with items 1, 3, 4 and 5 – as per previous posts, for what it’s worth, I believe that 2 belongs to a sister activity which I define as Information Lifecycle Protection (ILP) – something Devang acknowledges as an alternative theory. (I liken the logic of the separation between ILM and ILP to that between operational production servers and support production servers.)

The above list, for what it’s worth, is actually a fairly astute/accurate summary of the involvement of the storage industry thus far in ILM. Devang rightly points out that Storage Tiering (migrating data between different speed/capacity/cost storage based on usage, etc.), doesn’t address all of the above points – in particular, data creation and data deletion. That’s certainly true.

What’s missing from ILM, from a storage perspective, are the components that storage can only peripherally control. Perhaps that’s not entirely accurate – the storage industry can certainly participate in the remaining components (indeed, in NAS systems, as a prime example, its participation is absolutely necessary) – but it’s more than just the storage industry. It’s operating system vendors. It’s application vendors. It’s database vendors. It is, quite frankly, the whole kit and caboodle.

What’s missing in the storage-centric approach to ILM is identity management – or to be more accurate in this context, identity management systems. The brief outline of identity management is that it’s about moving access control and content control out of the hands of the system, application and database administrators, and into the hands of human resources/corporate management. So a system administrator could have total systems access over an entire host and all its data but not be able to open files that (from a corporate management perspective) they have no right to access. A database administrator can fully control the corporate database, but can’t access commercially sensitive or staff salary details, etc.

Most typically though, it’s about corporate roles, as defined in human resources, being reflected from the ground up in system access options. That is, when human resources set up a new employee as having a particular role within the organisation (e.g., “personal assistant”), that should trigger the appropriate workflows to set up the person’s accounts and access privileges for IT systems as well.
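
In practice this is workflow tooling inside an identity management system rather than a script, but as a toy sketch of the idea (the roles and entitlements here are invented):

```python
# Invented role -> entitlement mapping; in a real deployment this lives in an
# identity management system and is driven from the HR record, not a script.
role_entitlements = {
    "personal assistant": ["email", "calendar-delegation", "team-fileshare"],
    "dba":                ["email", "oracle:dba"],
}

def provision(employee, role):
    """Toy provisioning workflow triggered when HR assigns a role."""
    grants = role_entitlements.get(role, [])
    for grant in grants:
        # Each of these would really be a workflow step with approvals
        # and audit logging behind it.
        print(f"grant {grant} to {employee}")
    return grants

provision("j.smith", "personal assistant")
```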

If you think that’s insane, you probably don’t appreciate the purpose of it. System/app/database administrators I talk to about identity management frequently raise the issue of trust (or the perceived lack thereof) in such systems. That is, they think that if the company they work for wants to implement identity management, it doesn’t trust the people who are tasked with protecting the systems. I won’t lie – I think in a very small number of instances this may be the case. Maybe 1%, maybe as high as 2%. But let’s look at the bigger picture: we, as system/application/database administrators, currently have access to such data not because we should have access to it, but because until recently there have been very few options for limiting data access to only those who, from a corporate governance perspective, should have it. Even so, most system/app/database administrators are highly ethical – they know that being able to access data doesn’t equate to actually accessing that data. (Case in point: as the engineering manager and sysadmin at my last job, if I’d been less ethical, I would have seen the writing on the wall long before the company fell down around my ears under financial stresses!)

Trust doesn’t wash in legal proceedings. Trust doesn’t wash in financial auditing. That’s particularly so in situations where accurate logs aren’t maintained in an appropriately secured manner to prove that person A didn’t access data X. The fact that the system was designed to permit A to access X (even as part of A’s job) is, in some financial, legal and data-sensitivity areas, significant cause for concern.

Returning to the primary point though, it’s about ensuring that the people who have authority over someone’s role within a company (human resources/management) have control over the processes that configure the access permissions that person has. It’s also about making sure that those workflows are properly configured and automated so there’s no room for error.

So what’s missing – or what’s only at the barest starting point, is the integration of identity/access control with ILM (including storage tiering) and ILP. This, as you can imagine, is not an easy task. Hell, it’s not even a hard task – it’s a monumentally difficult task. It involves a level of cooperation and coordination between different technical tiers (storage, backup, operating systems, applications) that we rarely, if ever see beyond the basic “must all work together or else it will just spend all the time crashing” perspective.

That’s the bit that gives the extra components – control over content creation and destruction. The storage industry on its own does not have the correct levels of exposure to an organisation in order to provide this functionality of ILM. Nor do the operating system vendors. Nor do the database vendors or the application vendors – they all have to work together to provide a total solution on this front.

I think this answers (indirectly) Devang’s question/comment on why storage vendors – and indeed most of the storage industry – have stopped talking about ILM: the easy parts are well established, but the hard parts are only in their infancy. We are, after all, seeing some very early processes around integrating identity management with ILM/ILP. For instance, key management on backups, if handled correctly, can allow for situations where backup administrators can’t by themselves perform the recovery of sensitive systems or data – it requires corporate permission (e.g., the input of a data access key by someone in HR). Various operating systems and databases/applications are now providing hooks for identity management (to name just one, here are Oracle’s details on it).
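
As a toy illustration of that kind of split authority – no real product’s key management works exactly like this, it’s just the shape of the idea:

```python
def authorise_recovery(request, backup_admin_approved, hr_key_supplied):
    """Sketch of split-authority recovery: sensitive recoveries need both the
    backup administrator and a corporate key holder (e.g. HR) to proceed."""
    if not request.get("sensitive"):
        return backup_admin_approved
    return backup_admin_approved and hr_key_supplied

# The backup admin alone isn't enough for a sensitive data set:
print(authorise_recovery({"sensitive": True},
                         backup_admin_approved=True,
                         hr_key_supplied=False))   # False
```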

So no, I think we can confidently say that storage tiering in and of itself is not the answer to ILM. As to why the storage industry has for the most part stopped talking about ILM, we’re left with one of two choices – it’s hard enough that they don’t want to progress it further, or it’s sufficiently commercially sensitive that it’s not something discussed without the strongest of NDAs.

We’ve seen in the past that the storage industry can cooperate on shared formats and standards. We wouldn’t be in the era of pervasive storage we currently are without that cooperation. Fibre-channel, SCSI, iSCSI, FCoE, NDMP, etc., are proof positive that cooperation is possible. What’s different this time is the cooperation extends over a much larger realm to also encompass operating systems, applications, databases, etc., as well as all the storage components in ILM and ILP. (It makes backups seem to have a small footprint, and backups are amongst the most pervasive of technologies you can deploy within an enterprise environment.)

So we can hope that the reason we’re not hearing a lot of talk about ILM any more is that all the interested parties are either working on this level of integration, or even making the appropriate preparations themselves in order to start working together on this level of integration.

Fingers crossed people, but don’t hold your breath – no matter how closely they’re talking, it’s a long way off.

Sep 12, 2009

In my opinion (and after all, this is my blog), there’s a fundamental misconception in the storage industry that backup is a part of Information Lifecycle Management (ILM).

My take is that backup has nothing to do with ILM. Backup instead belongs to a sister (or shadow) activity, Information Lifecycle Protection – ILP. The comparison between the two is somewhat analogous to the comparison I made in “Backup is a Production Activity” between operational production systems and infrastructure support production systems; that is, one is directly related to the operational aspects of the data, and the other exists to support the data.

Here’s an example of what Information Lifecycle Protection would look like:

Information Lifecycle Protection

Obviously there’s some simplification going on in the above diagram – for instance, I’ve encapsulated any online, storage-based fault protection into “RAID” – but it does serve to get the basic message across.

If we look at say, Wikipedia’s entry on Information Lifecycle Management, backup is mentioned as being part of the operational aspects of ILM – this is actually a fairly standard definition of the perceived position of backup within ILM; however, standard definition or not, I have to disagree.

At its heart, ILM is about ensuring correct access and lifecycle retention policies for data: neither of these core principles encapsulate the activities in information lifecycle protection. ILP on the other hand is about making sure the data remains available to meet the ILM policies. If you think this is a fine distinction to make, you’re not necessarily wrong. My point is not that there’s a huge difference, but there’s an important difference.

To me, it all boils down to a fundamental need to separate access from protection/availability, and the reason I like to maintain this separation is how it affects end users, and the level of awareness they need to have for it. In their day-to-day activities, users should have an awareness of ILM – they should know what they can and can’t access, they should know what they can and can’t delete, and they should know where they will need to access data from. They shouldn’t however need to concern themselves with RAID, they shouldn’t need to concern themselves with snapshots, they shouldn’t need to concern themselves with replication, and they shouldn’t need to concern themselves with backup.

NOTE: I do, in my book, make it quite clear that end users have a role in backup in that they must know that backup doesn’t represent a blank cheque for them to delete data willy-nilly, and that they should know how to request a recovery; however, in their day to day job activities, backups should not play a part in what they do.

Ultimately, that’s my distinction: ILM is about activities that end-users do, and ILP is about activities that are done for end-users.