Jan 24 2017

In 2013 I undertook the endeavour to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand on them based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was an increasingly common scenario, it was (for the most part) used as a staging activity (“disk to disk to tape”), and the backup-to-disk targets were either dumb filesystems or Virtual Tape Libraries (VTLs).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, core tenets of Cloud computing that made it so popular (e.g., agility and scalability) have been well and truly adopted as essential tenets of the modern datacentre as well. Indeed, for on-premises IT to compete against Cloud, it has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to the business.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book the pinnacle of storage performance was the 15,000 RPM drive, and flash memory storage was something you (primarily) used in digital cameras only, with storage capacities measured in the hundreds of megabytes more than gigabytes (or now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendency – with virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as separate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process and needed things done faster, cheaper, more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and, for many businesses, profit too. Flash systems are now offering significantly more IOPS than a traditional array could – Dell EMC for instance can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPS. To achieve ten million IOPS on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability is born from. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for a good reason: you must have a grasp on the other bits before you can start considering everything else. Otherwise you’re just doing point-solutions, and eventually those point-solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability is released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

The divisibility of eggs

 Backup theory, Recovery
Nov 27 2012

The caution about keeping all of one’s eggs in the one basket is a fairly common one.

It’s also a fairly sensible one; after all, eggs are fragile things and putting all of them into a single basket without protection is not necessarily a good thing.

Yet, there’s an area of backup where many smaller companies easily forget the lesson of eggs-in-baskets, and that area is deduplication.

The mistake made is assuming there’s no need for replication. After all, no matter what the deduplication system, there’s RAID protection, right? Looking just at EMC, with either Avamar or Data Domain, you can’t deploy the systems without RAID*.

As we all know, RAID doesn’t protect you from accidental deletion of data – in mirrored terms, deleting a file isn’t even committed on one side of the mirror until it’s been completed on the other side. The same holds for every other RAID level.

Yet deduplication is potentially very much like putting all one’s eggs in one basket when compared to conventional storage of backups. Consider the following scenario in a non-deduplication environment:

Backup without deduplication

In this scenario, imagine you’re doing a full backup once a week of 1.1TB, and incrementals all other days, with each incremental averaging around 0.1TB. So at the end of each week you’ll have backed up 1.7TB. However, cumulatively you keep multiple backups over the retention period, so those backups will add up, week after week, until after just 3 weeks you’re storing 5.1TB of backup.

Now, again keeping the model, imagine a similar scenario but with deduplication involved (and not accounting for any deduplication occurring within any individual backup):

Backup with deduplication


Now, again, I’m keeping things really simple and not necessarily corresponding to a real-world model. However, while each week may see 1.7TB backed up, cumulatively, week after week, the amount of data stored by the deduplication system will be much lower; 1.7TB at the end of the first week, 2.3TB at the end of the second, 2.9TB at the end of the third.

Cumulatively, where do those savings come from? By not storing extra copies of data. Deduplication is about eliminating redundancy.
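The two scenarios above can be sketched as a few lines of arithmetic. This is only an illustration of the simplified model, not of any real product: the function and constant names are made up for the example, and it assumes (as the text does) that each repeat full backup deduplicates completely against what is already stored, so only the incrementals add new data after the first week.

```python
# Sizes are kept in whole gigabytes so the arithmetic stays exact.
FULL_GB = 1100              # weekly full backup (1.1TB)
INCREMENTAL_GB = 100        # each daily incremental (0.1TB)
INCREMENTALS_PER_WEEK = 6   # one incremental on every non-full day


def stored_without_dedup(weeks):
    """Every full and every incremental is stored in its entirety."""
    weekly = FULL_GB + INCREMENTALS_PER_WEEK * INCREMENTAL_GB
    return weeks * weekly


def stored_with_dedup(weeks):
    """Assumption: repeat fulls add nothing new; only incrementals do."""
    if weeks < 1:
        return 0
    first_week = FULL_GB + INCREMENTALS_PER_WEEK * INCREMENTAL_GB
    later_weeks = (weeks - 1) * INCREMENTALS_PER_WEEK * INCREMENTAL_GB
    return first_week + later_weeks


for week in (1, 2, 3):
    without = stored_without_dedup(week) / 1000
    with_dedup = stored_with_dedup(week) / 1000
    print(f"week {week}: {without}TB without dedup, {with_dedup}TB with dedup")
```

The gap between the two columns is exactly the redundant data that no longer exists as a second copy anywhere.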

On a single system, deduplication is putting all your eggs in one basket. If you accidentally delete a backup (and it gets scrubbed in a housekeeping operation), or if the entire unit fails, it’s like dropping the basket. It’s not just one backup you lose, but all backups that referred to the specific data lost. It’s something that you’ve got to be much more careful about. Don’t treat RAID as a blank cheque.
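To make the "all backups that referred to the specific data lost" point concrete, here is a toy model. It is a sketch under loud assumptions: backups are represented as plain lists of hypothetical chunk identifiers, which is not how any particular deduplication product stores its data, and the names are invented for the example.

```python
def backups_affected(backups, lost_chunks):
    """Return the names of backups that reference at least one lost chunk."""
    lost = set(lost_chunks)
    return sorted(
        name for name, chunks in backups.items() if lost & set(chunks)
    )


# Hypothetical catalogue: each backup is a list of references to shared chunks.
backups = {
    "week1_full": ["c1", "c2", "c3"],
    "week2_full": ["c1", "c2", "c4"],
    "week3_full": ["c1", "c5"],
}

# Losing a chunk that only one backup uses damages that one backup.
print(backups_affected(backups, ["c4"]))

# Losing a chunk that every backup shares damages every backup at once.
print(backups_affected(backups, ["c1"]))
```

In a non-deduplicated world each backup would carry its own copy of the shared data, so the second case would cost you one backup rather than three.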

The solution?

It’s trivially simple, and it’s something every vendor and system integrator worth their salt will tell you: when you’re deduplicating, you must replicate (or clone, in a worst-case scenario), so you’re protected. You’ve got to start storing those twin eggs in another basket.

Cloning of course is important in non-deduplicated backups, but if you’ve come from a non-deduplicated backup world, you’re used to having at least a patchy safety net involved – with multiple copies of most data generated, even in an uncloned situation if a recovery from week 2 fails, you might be able to go back to the week 3 backup and recover what you need, or at least enough to save the day.

The message is simple:

Deduplication = Replication

If you’re not replicating or otherwise similarly protecting your deduplication environment, you’re doing it wrong. You’ve put your eggs all in one basket, and forgotten that you can’t unbreak an egg.

* Well, technically, you could probably sneak in an AVE deployment without RAID, but you’d be getting fairly desperate.


ILP policies vs Backup Policies

 Architecture, Backup theory
May 15 2012

This post has moved to the Enterprise Systems Backup blog, and can be accessed here.

Snapshots and Backups, Part 2

 Backup theory, General thoughts
Feb 08 2010

Over the weekend I wrote up a piece about how snapshots are not a valid replacement for enterprise backup. The timing of this was in response to NetApp recently abandoning development of their VTL systems, and the subsequent discussions this triggered, but it was something that I’d had sitting in the wings for a while.

It’s fair to say that discussions on snapshots and backups polarise a lot of people; I’ll fully admit that I side with the “snapshots can’t replace backups” side of the argument.

I want to go into this in a little more detail. First I’ll point out in fairness that there are people willing to argue the other side that don’t work for NetApp, in the same way that I don’t work for EMC. One of those is the other Preston – W. Curtis Preston, and you can read his articulate case here. I’m not going to spend this article going point for point against Curtis – it’s not the primary point of discussion I want to make in this entry.

Moving away from vendors and consultants, another very interesting opinion, from the customer perspective, comes from Martin Glassborow’s Storagebod blog. Martin brings up some valid customer points – chiefly that snapshots and replication represent extreme hardware lock-in. Some would argue that any vendor’s backup product represents vendor lock-in as well, and this is partly right – though remember it’s not so difficult to keep a virtual machine around with the “last state” of the previous backup application available for recovery purposes. Keeping old and potentially obsolete NAS technology running to facilitate older recoveries after a vendor switch can be a little more challenging.

To get onto what I want to raise today, I need to revisit a previous topic as a means of further explaining my position. Let’s look for instance at my previous coverage of Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP). You can read the entire piece here, but the main point I want to focus on is my ILP ‘diagram’:

Components of ILP

One of the first points I want to make from that diagram is that I don’t exclude snapshots (and their subsequent replication) from an overall information lifecycle protection mechanism. Indeed, depending on the SLAs involved, they’re going to be practically mandatory. But, to use the analogy offered by the above diagram, they’re just pieces of the pie rather than the entire pie.

I’m going to extend my argument a little now, and go beyond just snapshots and replication, so I can elucidate the core reasons why I don’t like replicated snapshots as a permanent backup solution. Here’s a few other things I don’t like as a permanent backup solution:

  • VTLs replicated between a primary and disaster recovery site, with no tape out.
  • ADV_FILE (or other products’ disk backup solutions) cloned/duplicated between the primary and disaster recovery site, with no tape out.
  • Source based deduplication products with replication between two locations, with no tape out.

My fundamental objection to all of these solutions is the long-term risk of failure created by keeping everything “online”. Maybe I’m a pessimist, but when I’m considering backup/recovery and disaster recovery solutions, I firmly believe that I’m being paid to consider all likely scenarios. I don’t personally believe in luck, and I won’t trust a backup/disaster recovery solution to luck either. The old Clint Eastwood quote comes to mind here:

You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?

When it comes to your data, no, no I don’t. I don’t feel lucky, I don’t encourage you to feel lucky. Instead I rely on solid, well protected systems with offline capabilities. Thus, I plan for at least some level of cascading failures.

It’s the offline component that’s most critical. Do I want all my backups for a year online, only online, even with replication? Even more importantly – do I want all your backups online, only online, even with replication? The answer remains a big fat no.

The simple problem with any solution that doesn’t provide for offline storage is that (in my opinion) it brings the risk of cascading failures into play too easily. It’s like putting all storage for your company on a single RAID-5 LUN and not having a hot spare. Sure you’re protected against that first failure, but it’s shortly after the first failure that Murphy will make an appearance in your computer room. (And I’ll qualify here: I don’t believe in luck, but I’ve observed over the years on many occasions that Murphy’s Law rules in computer rooms as well as in other places.) Or to put it another way: you may hope for the best, but you should plan for the worst. Let’s imagine a “worst case scenario”: a fire starts in your primary datacentre 10 minutes after upgrade work on the array that receives replicated snapshots at your disaster recovery site runs into problems with firmware, leaving that array inaccessible until vendor upgrades are complete. Or worse again, it leaves storage corrupted.

Or if that seems too extreme, consider a more basic failure: a contractor near to your primary datacentre digs through the cables linking your production and disaster recovery sites, and it’s going to take 3 days to repair. Suddenly you’ve got snapshots and no replication. Just how lucky does that leave you feeling? Personally, I feel slightly naked and vulnerable when I have a single backup that’s not cloned. If suddenly none of my backups were getting duplicated, and I had no easy access to my clones, I’d feel much, much worse. (And that full body shiver I do from time to time would get very pronounced.)

All this talk of a single-instance failure frequently leads proponents of a snapshots-and-replication-only approach to suggest that a good design will see 3-way replication, so there are always two backup instances. This doubles a lot of costs while merely moving the failure point just a jump to the left. On the other hand, offline backup where there’s the backup from today, the backup from yesterday, the backup from the day before … the backup from last week, the backup from last month, etc., all offline, all likely on different media – now that’s failure mitigation. Even if something happens and I can’t recover the most recent backup, in many recovery scenarios I can go back one day, two days, three days, etc. Oh yes, you can do that with snapshots too, but not if the array is a smoking pile of metal and plastic fused to the floor after a fire. In some senses, it’s similar to the old issue of trying to get away from cloning by backing up from the production site to media on the disaster recovery site. It just doesn’t provide adequate protection. If you’re thinking of using 3-way replication, why not instead have a solution that uses two entirely different types of data protection to mitigate against extreme levels of failure?

It’s possible I’ll have more to say on this in the coming weeks, as I think it’s important, regardless of your personal view point, to be aware of all of the arguments on both sides of the fence.

Feb 06 2010

Every now and then the topic arises over whether snapshots are backups.

This is going through a resurgence at the moment, as NetApp has dropped development of their VTL systems, with some indications being that they’re going to revert to recommending people use snapshots and replication for backup.

So this raises the question again – is a snapshot a backup? I’ll start by quoting from my book here:

A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications, or operating systems that can be used for the purposes of recovery.

On the face of this definition, a snapshot is indeed a backup, and I’d agree that on a per-instance basis snapshots can act as backups. However, I’d equally argue that building your entire backup and recovery system on the basis of snapshots and replication is like building a house of cards on shifting sand in the face of an oncoming storm. In short, I don’t believe that snapshots and replication alone provide:

  1. Sufficient long-term protection.
  2. Sufficient long-term management.
  3. Sufficient long-term performance.

I’ll be the first to argue that in a system with high SLAs, having snapshots and/or replication is going to be almost a 100% requirement. You can’t meet a 1-hour data loss deadline if you only back up once every 24 hours – and backing up every hour using conventional backup systems is rarely appropriate (or rarely even works). So I’m not dismissing snapshots at all.

It’s easy to discuss the theoretical merits of using snapshots in lieu of backup/recovery software as a total backup system, but I think that the practical considerations quickly overcome any theoretical discussion. So let’s consider a situation where you want to keep your backups for 6 months. (These days that’s a fairly short period.) Do you really want to keep 6 months of snapshots around? Let’s assume we keep hourly snapshots for 2 weeks, then one snapshot per day for the rest of the time. That’s 504 snapshots per system (336 hourly plus 168 daily, taking 6 months as 26 weeks) – in fact, normally per NAS filesystem. Say you’ve got 4 NAS units and 30 filesystems on each one – that’s around 60,000 snapshots over the course of 6 months.
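The snapshot count is easy to check with a few lines of arithmetic. This is only a sketch: the function name and parameters are invented for the example, and it assumes 6 months is treated as 26 weeks (182 days), which is the reading that reproduces the 504 figure.

```python
def snapshots_per_filesystem(retention_days, hourly_days):
    """Hourly snapshots for the first period, then one per day for the rest."""
    hourly = 24 * hourly_days
    daily = retention_days - hourly_days
    return hourly + daily


RETENTION_DAYS = 26 * 7   # assumption: 6 months taken as 26 weeks
HOURLY_DAYS = 14          # hourly snapshots kept for 2 weeks
NAS_UNITS = 4
FILESYSTEMS_PER_NAS = 30

per_filesystem = snapshots_per_filesystem(RETENTION_DAYS, HOURLY_DAYS)
total = per_filesystem * NAS_UNITS * FILESYSTEMS_PER_NAS

print(f"{per_filesystem} snapshots per filesystem")
print(f"{total} snapshots across {NAS_UNITS * FILESYSTEMS_PER_NAS} filesystems")
```

Changing the retention to a year, or the hourly window to a month, shows how quickly the total climbs well beyond the "around 60,000" in this example.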

What’s 60,000+ snapshots going to do to:

  • Primary production storage performance?
  • Storage and backup administrator management?
  • Storage costs?
  • Indexing costs?

The argument that snapshots and replication alone can replace a healthy enterprise backup system (or act in lieu of it) just doesn’t wash as far as I’m concerned. It looks good on paper to some, but on closer inspection, it’s a paper tiger. By all means within environments with heavy SLAs they’re likely to form part of the data protection solution, but they shouldn’t be the only solution.

Sep 12 2009

In my opinion (and after all, this is my blog), there’s a fundamental misconception in the storage industry that backup is a part of Information Lifecycle Management (ILM).

My take is that backup has nothing to do with ILM. Backup instead belongs to a sister (or shadow) activity, Information Lifecycle Protection – ILP. The comparison between the two is somewhat analogous to the comparison I made in “Backup is a Production Activity” between operational production systems and infrastructure support production systems; that is, one is directly related to the operational aspects of the data, and the other exists to support the data.

Here’s an example of what Information Lifecycle Protection would look like:

Information Lifecycle Protection

Obviously there’s some simplification going on in the above diagram – for instance, I’ve encapsulated any online storage based fault-protection into “RAID”, but it does serve to get the basic message across.

If we look at, say, Wikipedia’s entry on Information Lifecycle Management, backup is mentioned as being part of the operational aspects of ILM – this is actually a fairly standard definition of the perceived position of backup within ILM; however, standard definition or not, I have to disagree.

At its heart, ILM is about ensuring correct access and lifecycle retention policies for data: neither of these core principles encapsulates the activities in information lifecycle protection. ILP on the other hand is about making sure the data remains available to meet the ILM policies. If you think this is a fine distinction to make, you’re not necessarily wrong. My point is not that there’s a huge difference, but that there’s an important one.

To me, it all boils down to a fundamental need to separate access from protection/availability, and the reason I like to maintain this separation is how it affects end users, and the level of awareness they need to have for it. In their day-to-day activities, users should have an awareness of ILM – they should know what they can and can’t access, they should know what they can and can’t delete, and they should know where they will need to access data from. They shouldn’t however need to concern themselves with RAID, they shouldn’t need to concern themselves with snapshots, they shouldn’t need to concern themselves with replication, and they shouldn’t need to concern themselves with backup.

NOTE: I do, in my book, make it quite clear that end users have a role in backup in that they must know that backup doesn’t represent a blank cheque for them to delete data willy-nilly, and that they should know how to request a recovery; however, in their day to day job activities, backups should not play a part in what they do.

Ultimately, that’s my distinction: ILM is about activities that end-users do, and ILP is about activities that are done for end-users.