Snapshots and Backups, Part 2

 Backup theory, General thoughts  Comments Off on Snapshots and Backups, Part 2
Feb 082010

Over the weekend I wrote up a piece about how snapshots are not a valid replacement to enterprise backup. The timing of this was in response to NetApp recently abandoning development of their VTL systems, and subsequent discussions this triggered, but it was something that I’d had sitting in the wings for a while.

It’s fair to say that discussions on snapshots and backups polarise a lot of people; I’ll fully admit that I side with the “snapshots can’t replace backups” side of the argument.

I want to go into this in a little more detail. First I’ll point out in fairness that there are people willing to argue the other side that don’t work for NetApp, in the same way that I don’t work for EMC. One of those is the other Preston – W. Curtis Preston, and you can read his articulate case here. I’m not going to spend this article going point for point against Curtis – it’s not the primary point of discussion I want to make in this entry.

Moving away from vendors and consultants, another and very interesting opinion, from the customer perspective, comes from Martin Glassborow’s Storagebod blog. Martin brings up some valid customer points – that being snapshot and replication represents extreme hardware lock-in. Some would argue that any vendor’s backup product represents vendor lock in as well, and this is partly right – though remember it’s not so difficult to keep a virtual machine around with the “last state” of the previous backup application available for recovery purposes. Keeping old and potentially obsolete NAS technology running to facilitate older recoveries after a vendor switch can be a little more challenging.

To get onto what I want to raise today, I need to revisit a previous topic as a means of further explaining my position. Let’s look for instance at my previous coverage of Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP). You can read the entire piece here, but the main point I want to focus on is my ILP ‘diagram’:

Components of ILP

One of the first points I want to make from that diagram is that I don’t exclude snapshots (and their subsequent replication) from an overall information lifecycle protection mechanism. Indeed, depending on the SLAs involved, they’re going to be practically mandatory. But, to use the analogy offered by the above diagram, they’re just pieces of the pie rather than the entire pie.

I’m going to extend my argument a little now, and go beyond just snapshots and replication, so I can elucidate the core reasons why I don’t like replicated snapshots as a permanent backup solution. Here’s a few other things I don’t like as a permanent backup solution:

  • VTLs replicated between a primary and disaster recovery site, with no tape out.
  • ADV_FILE (or other products disk backup solutions) cloned/duplicated between the primary and disaster recovery site, with no tape out.
  • Source based deduplication products with replication between two locations, with no tape out.

My fundamental objection in all of these solutions is the long term failure caused by keeping everything “online”. Maybe I’m a pessimist, but when I’m considering backup/recovery and disaster recovery solutions, I firmly believe that I’m being paid to consider all likely scenarios. I don’t personally believe in luck, and I won’t trust a backup/disaster recovery solution on luck either. The old Clint Eastwood quote comes to mind here:

You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?

When it comes to your data, no, no I don’t. I don’t feel lucky, I don’t encourage you to feel lucky. Instead I rely on solid, well protected systems with offline capabilities. Thus, I plan for at least some level of cascading failures.

It’s the offline component that’s most critical. Do I want all my backups for a year online, only online, even with replication? Even more importantly – do I want all your backups online, only online, even with replication? The answer remains a big fat no.

The simple problem with any solution that doesn’t provide for offline storage is that (in my opinion), it brings the risk of cascading failures into play too easily. It’s like putting all storage for your company on a single RAID-5 LUN and not having a hot spare. Sure you’re protected against that first failure, but it’s shortly after the first failure that Murphy will make an appearance in your computer room. (And I’ll qualify here: I don’t believe in luck, but I’ve observed over the years in many occasions that Murphy’s Law rules in computer rooms as well as in other places.) Or to put it another way: you may hope for the best, but you should plan for the worst. Let’s imagine a “worst case scenario”: a fire starts in your primary datacentre 10 minutes after upgrade work has commenced on the array that receives replicated snapshots in your disaster recovery runs into problems with firmware, leaving that array inaccessible until vendor upgrades are complete. Or worse again, it leaves storage corrupted.

Or if that seems too extreme, consider a more basic failure: a contractor near to your primary datacentre digs through the cables linking your production and disaster recovery sites, and it’s going to take 3 days to repair. Suddenly you’ve got snapshots and no replication. Just how lucky does that leave you feeling? Personally, I feel slightly naked and vulnerable when I have a single backup that’s not cloned. If suddenly none of my backups were getting duplicated, and I had no easy access to my clones, I’d feel much, much worse. (And that full body shiver I do from time to time would get very pronounced.)

Usually all this talk of a single instance failure frequently leads proponents of snapshots+replication only to suggest that a good design will see 3-way replication, so there’s always two backup instances. This doubles a lot of costs while merely moving the failure point just a jump to the left. On the other hand, offline backup where there’s the backup from today, the backup from yesterday, the backup from the day before … the backup from last week, the backup from last month, etc., all offline, all likely on different media – now that’s failure mitigation. Even if something happens and I can’t recover the most recent backup, in many recovery scenarios I can go back one day, two days, three days, etc. Oh yes, you can do that with snapshots too, but not if the array is a smoking pile of metal and plastic fused to the floor after a fire. In some senses, it’s similar to the old issue of trying to get away from cloning by backing up from the production site to media on the disaster recovery site. It just doesn’t provide adequate protection. If you’re thinking of using 3-way replication, why not instead have a solution that uses two entirely different types of data protection to mitigate against extreme levels of failure?

It’s possible I’ll have more to say on this in the coming weeks, as I think it’s important, regardless of your personal view point, to be aware of all of the arguments on both sides of the fence.

Nov 242009

Over at StorageNerve, and on Twitter, Devang Panchigar has been asking Is Storage Tiering ILM or a subset of ILM, but where is ILM? I think it’s an important question with some interesting answers.

Devang starts with defining ILM from a storage perspective:

1) A user or an application creates data and possibly over time that data is modified.
2) The data needs to be stored and possibly be protected through RAID, snaps, clones, replication and backups.
3) The data now needs to be archived as it gets old, and retention policies & laws kick in.
4) The data needs to be search-able and retrievable NOW.
5) Finally the data needs to be deleted.

I agree with items 1, 3, 4 and 5 – as per previous posts, for what it’s worth, I believe that 2 belongs to a sister activity which I define as Information Lifecycle Protection (ILP) – something that Devang acknowledges as an alternative theory. (I liken the logic to separation between ILM and ILP to that between operational production servers and support production servers.)

The above list, for what it’s worth, is actually a fairly astute/accurate summary of the involvement of the storage industry thus far in ILM. Devang rightly points out that Storage Tiering (migrating data between different speed/capacity/cost storage based on usage, etc.), doesn’t address all of the above points – in particular, data creation and data deletion. That’s certainly true.

What’s missing from ILM from a storage perspective are the components that storage can only peripherally control. Perhaps that’s not entirely accurate – the storage industry can certainly participate in the remaining components (indeed, particularly in NAS systems it’s absolutely necessary, as a prime example) – but it’s more than just the storage industry. It’s operating system vendors. It’s application vendors. It’s database vendors. It is, quite frankly, the whole kit and caboodle.

What’s missing in the storage-centric approach to ILM is identity management – or to be more accurate in this context, identity management systems. The brief outline of identity management is that it’s about moving access control and content control out of the hands of the system, application and database administrators, and into the hands of human resources/corporate management. So a system administrator could have total systems access over an entire host and all its data but not be able to open files that (from a corporate management perspective) they have no right to access. A database administrator can fully control the corporate database, but can’t access commercially sensitive or staff salary details, etc.

Most typically though, it’s about corporate roles, as defined in human resources, being reflected from the ground up in system access options. That is, human resources, when they setup a new employee as having a particular role within the organisation (e.g., “personal assistant”), triggering the appropriate workflows to setup that person’s accounts and access privileges for IT systems as well.

If you think that’s insane, you probably don’t appreciate the purpose of it. System/app/database administrators I talk to about identity management frequently raise trust (or the perceived lack thereof) involved in such systems. I.e., they think that if the company they work for wants to implement identity management they don’t trust the people who are tasked with protecting the systems. I won’t lie, I think in a very small number of instances, this may be the case. Maybe 1%, maybe as high as 2%. But let’s look at the bigger picture here – we, as system/application/database administrators currently have access to such data not because we should have access to such data but because until recently there’s been very few options in place to limit data access to only those who, from a corporate governance perspective, should have access to that data. As such, most system/app/database administrators are highly ethical – they know that being able to access data doesn’t equate to actually accessing that data. (Case in point: as the engineering manager and sysadmin at my last job, if I’d been less ethical, I would have seen the writing on the wall long before the company fell down under financial stresses around my ears!)

Trust doesn’t wash in legal proceedings. Trust doesn’t wash in financial auditing. Particularly in situations where accurate logs aren’t maintained in an appropriately secured manner to prove that person A didn’t access data X. The fact that the system was designed to permit A to access X (even as part of A’s job) is in some financial, legal and data sensitivity areas, significant cause for concern.

Returning to the primary point though, it’s about ensuring that the people who have authority over someone’s role within a company (human resources/management) having control over the the processes that configure the access permissions that person has. It’s also about making sure that those work flows are properly configured and automated so there’s no room for error.

So what’s missing – or what’s only at the barest starting point, is the integration of identity/access control with ILM (including storage tiering) and ILP. This, as you can imagine, is not an easy task. Hell, it’s not even a hard task – it’s a monumentally difficult task. It involves a level of cooperation and coordination between different technical tiers (storage, backup, operating systems, applications) that we rarely, if ever see beyond the basic “must all work together or else it will just spend all the time crashing” perspective.

That’s the bit that gives the extra components – control over content creation and destruction. The storage industry on its own does not have the correct levels of exposure to an organisation in order to provide this functionality of ILM. Nor do the operating system vendors. Nor do the database vendors or the application vendors – they all have to work together to provide a total solution on this front.

I think this answers (indirectly) Devang’s question/comment on why storage vendors, and indeed, most of the storage industry, has stopped talking about ILM – the easy parts are well established, but the hard parts are only in their infancy. We are after all seeing some very early processes around integrating identity management and ILM/ILP. For instance, key management on backups, if handled correctly, can allow for situations where backup administrators can’t by themselves perform the recovery of sensitive systems or data – it requires corporate permissions (e.g., the input of a data access key by someone in HR, etc.) Various operating systems and databases/applications are now providing hooks for identity management (to name just one, here’s Oracle’s details on it.)

So no, I think we can confidently say that storage tiering in and of itself is not the answer to ILM. As to why the storage industry has for the most part stopped talking about ILM, we’re left with one of two choices – it’s hard enough that they don’t want to progress it further, or it’s sufficiently commercially sensitive that it’s not something discussed without the strongest of NDAs.

We’ve seen in the past that the storage industry can cooperate on shared formats and standards. We wouldn’t be in the era of pervasive storage we currently are without that cooperation. Fibre-channel, SCSI, iSCSI, FCoE, NDMP, etc., are proof positive that cooperation is possible. What’s different this time is the cooperation extends over a much larger realm to also encompass operating systems, applications, databases, etc., as well as all the storage components in ILM and ILP. (It makes backups seem to have a small footprint, and backups are amongst the most pervasive of technologies you can deploy within an enterprise environment.)

So we can hope that the reason we’re not hearing a lot of talk about ILM any more is that all the interested parties are either working on this level of integration, or even making the appropriate preparations themselves in order to start working together on this level of integration.

Fingers crossed people, but don’t hold your breath – no matter how closely they’re talking, it’s a long way off.

Sep 122009

In my opinion (and after all, this is my blog), there’s a fundamental misconception in the storage industry that backup is a part of Information Lifecycle Management (ILM).

My take is that backup has nothing to do with ILM. Backup instead belongs to a sister (or shadow) activity, Information Lifecycle Protection – ILP. The comparison between the two is somewhat analogous to the comparison I made in “Backup is a Production Activity” between operational production systems and infrastructure support production systems; that is, one is directly related to the operational aspects of the data, and the other exists to support the data.

Here’s an example of what Information Lifecycle Protection would look like:

Information Lifecycle Protection

Information Lifecycle Protection

Obviously there’s some simplification going on in the above diagram – for instance, I’ve encapsulated any online storage based fault-protection into “RAID”, but it does serve to get the basic message across.

If we look at say, Wikipedia’s entry on Information Lifecycle Management, backup is mentioned as being part of the operational aspects of ILM – this is actually a fairly standard definition of the perceived position of backup within ILM; however, standard definition or not, I have to disagree.

At its heart, ILM is about ensuring correct access and lifecycle retention policies for data: neither of these core principles encapsulate the activities in information lifecycle protection. ILP on the other hand is about making sure the data remains available to meet the ILM policies. If you think this is a fine distinction to make, you’re not necessarily wrong. My point is not that there’s a huge difference, but there’s an important difference.

To me, it all boils down to a fundamental need to separate access from protection/availability, and the reason I like to maintain this separation is how it affects end users, and the level of awareness they need to have for it. In their day-to-day activities, users should have an awareness of ILM – they should know what they can and can’t access, they should know what they can and can’t delete, and they should know where they will need to access data from. They shouldn’t however need to concern themselves with RAID, they shouldn’t need to concern themselves with snapshots, they shouldn’t need to concern themselves with replication, and they shouldn’t need to concern themselves with backup.

NOTE: I do, in my book, make it quite clear that end users have a role in backup in that they must know that backup doesn’t represent a blank cheque for them to delete data willy-nilly, and that they should know how to request a recovery; however, in their day to day job activities, backups should not play a part in what they do.

Ultimately, that’s my distinction: ILM is about activities that end-users do, and ILP is about activities that are done for end-users.