As evidenced by the title of my book (Enterprise Systems Backup and Recovery: A corporate insurance policy), I’m a firm believer that the only way to conceptualise the purpose of backup is to describe it as insurance. The way I describe this is to compare how we take out insurance but hope not to use it with how we make backups and similarly hope not to use them. This is most easily described through a couple of Venn diagrams.

First, let’s look at insurance:

Backup and Insurance: Insurance Venn Diagram

No-one wants to claim on their insurance. We take it out on a yearly basis, and any year that we don’t have to use it is good. (Particularly in countries where insurance companies run rough-shod over morality, decency and legal restraint.) I personally have home insurance, contents insurance, car insurance, travel insurance (whenever I travel) and health insurance. Any time I don’t have to make a claim on any of these types of insurance is good – because in order to make a claim, something bad needs to have happened. So I’m much happier paying the fees each year and hoping that I don’t have any more involvement than that with my insurance agencies. Do I resent paying these fees? Hell no – because I’m well aware that if I don’t, and something bad happens, I’ll be up the creek without a paddle. (Or to use the Australian vernacular, I’d be up s––t creek.)

So let’s see the Venn diagram for backup:

Venn Diagram for Backup

As you can see, it’s spookily similar to the diagram for insurance. Now, one of the first things that I tend to hear when I roll out my “backup = insurance” argument is that occasionally, people will want to recover from backups – e.g., to migrate between systems, refresh Q/A systems from production, etc. Well, this isn’t really using backup for its primary purpose (recovery), but instead using it as a data migration/retrieval system. It’s a fine distinction, but it’s an important distinction. The primary reason backup systems are deployed is to recover data when there’s been a failure – any secondary benefit from a backup and recovery system is just that – a secondary benefit.

Your next question may be – so what point is there in classifying backup as a type of insurance?

This is the absolute core of why companies need to think of backup as being a type of insurance – it’s all about the budget.

Look at an example company. Let’s say there’s 5 departments:

  • IT
  • Finance and Human Resources
  • Sales
  • Warehousing and Operations
  • Solutions Delivery

In a standard company, each department will have its own budget, but there’s also the corporate budget. That’s the budget that covers costs which affect all departments and have to be met regardless of the size or capacity of each department – it’s for the core business costs. One of those “core” costs is usually the various insurance policies that companies take out. This will definitely include some sort of standard business insurance, but will then cover other types of insurance – professional indemnity, building insurance, contents insurance, car insurance, etc. Few businesses would argue that each department needs to individually seek out and/or pay for its own insurance on each of those matters.

The mistake then made by many businesses is to fail to think of backup as insurance, and therefore work on the basis that IT will manage data and systems backup out of its own budget. This sort of thinking leads to the most common disasters where:

  • Backup systems budget is cut to meet the budget requirements of “production” systems. (See my points here about why it’s a fallacy to think of backup systems as anything other than production systems.)
  • “Make do” data protection systems are deployed that require significant time to complete recovery – e.g., to “save” money, some IT departments will decide to only back up actual data, and leave operating systems and applications at the mercy of being re-installed from the ground up.
  • Backup retention is cut to reduce operational expenditure (i.e., limit the purchase of new media).
  • SLAs, if established, are silently ignored – or even railed against by IT.

None of these processes or decisions are conducive to sensible or useful business systems management – yet they’re the inevitable consequence of asking one department to meet costs that are shared between all departments. It would be like demanding that the sales department pay for all company insurance out of their budget: it just doesn’t make sense.

Where does this discussion leave us? There’s a lesson any business can take out of this: backup, being insurance, is something that’s funded by the corporate operational and capital budget, not the budgets of any individual department.

Chances are if your business isn’t thinking of backup as insurance, it’s not handling or funding backup properly either.

 

I want to spend a few minutes discussing something that drives me nuts. It’s something I see quite regularly on technical websites that discuss data protection, and it’s about time I make my opinion clear on it.

The latest instance comes from an article at SearchStorage called “How tiering can improve your backup strategies”. Marc Staimer wrote:

In one example, all data is commonly backed up once a day, put on tape, then shipped offsite. This methodology means that the RPO is 24 hours, and the RTO is a few days or longer. This is not a good idea for an organization’s mission-critical data. First, the process in recovering the data takes much too long, bringing all of the correct tapes back from offsite, and then recovering them in order, (which is subject to common human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion. Second, it assumes all data on all tapes are recoverable. In the end, both introduce unacceptable risks to mission-critical data.

Now, I’m not going to dispute the fact that daily backups to tape can give RPOs of 24 hours or more, and can result in RTOs of more than 24 hours. However, I don’t agree that an RPO of 24 hours is always the case, and I certainly don’t agree that an RTO of 24 hours (or more) is a 100% inevitability. Instead, I want to spend some time picking apart the rest of this junk statement.

Let’s first consider:

[T]he process in recovering the data takes much too long, bringing back all of the correct tapes from offsite, and then recovering them in order, (which is subject to human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion.

This would be true if we were using archaic backup scripts (perhaps in a completely decentralised environment) with no automation. On the other hand, if you’re using decent, enterprise backup software there are absolutely no reasons why this should be the case. Enterprise class backup software will:


  • Identify which media is required for a recovery.
  • Read only from the media required for a recovery.
  • Seek to a position as close as possible to the recovery point, so as to avoid reading redundant data.
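
To make those three behaviours concrete, here is a minimal sketch of an index lookup. It is not how NetWorker or any other product implements its media database; the `FileRecord` structure, the volume labels and the `plan_recovery` function are all hypothetical names invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """One backed-up copy of a file (hypothetical index entry)."""
    path: str
    backup_time: int   # when the copy was written, e.g. a Unix timestamp
    volume: str        # label of the tape holding this copy
    file_mark: int     # position on that tape to seek to


def plan_recovery(index, paths, as_of):
    """Map each required volume to the sorted positions to seek to.

    For every requested path, only the newest copy taken at or before
    `as_of` is selected, so volumes holding nothing relevant never appear.
    """
    plan = defaultdict(list)
    for path in paths:
        candidates = [r for r in index
                      if r.path == path and r.backup_time <= as_of]
        if not candidates:
            raise LookupError(f"no backup of {path} at or before {as_of}")
        newest = max(candidates, key=lambda r: r.backup_time)
        plan[newest.volume].append(newest.file_mark)
    # Sorting positions lets a drive move forward through the tape once.
    return {volume: sorted(marks) for volume, marks in plan.items()}


# Illustrative index: three tapes, only one of which holds the wanted copy.
index = [
    FileRecord("/home/a/report.doc", 100, "TAPE001", 12),
    FileRecord("/home/a/report.doc", 200, "TAPE002", 3),
    FileRecord("/home/b/notes.txt", 200, "TAPE002", 7),
    FileRecord("/var/db/big.dat", 200, "TAPE003", 1),
]

print(plan_recovery(index, ["/home/a/report.doc"], as_of=250))
```

In this toy example a single-file recovery asks for one tape and one seek position, rather than every tape from offsite read in order.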

If we look at NetWorker for instance, we know it’s no slouch when it comes to seeking to the right spot on media for rapid single-file recovery. Between file records and media record markers, NetWorker can very quickly direct a tape drive to seek to the optimum location to commence recovery.

So my first thought is – if that’s the sort of experience that Marc Staimer has with tape based backup and recovery systems, he’s using the wrong ones, and shouldn’t blame that on tape.

Now let’s cover the second point:

[I]t assumes all data on all tapes are recoverable.

This can only be interpreted to mean one thing: the old “tape is unreliable” mantra. If tape were half as unreliable as every second article on tape makes it out to be, there wouldn’t be a single tape vendor left in the market – they’d have all been sued out of business for deceptive trading and terribly unreliable products.

I’m not claiming that tape is fault free – if I did, I’d have a heck of a lot less cause to do the Ballmer Monkey Dance shouting “Cloning! Cloning! Cloning!” than I do. Tapes aren’t infallible, but I’ve not seen a single published paper citing extreme fault rates of enterprise class media*. On a yearly basis, the number of cases I see at customer sites of tape failure could be counted on a butcher’s right hand**. And you know what? Those instances are almost always at the backup point, not the recovery point.

So where does this leave us? At FUD central.

I’m the first to admit that the role of tape is changing within backup environments – I stated my thoughts on this previously in the article “Direct to Tape is Dead, Long Live Tape”, and I stand by this; so any overall discussion about backup media tiering with a model along the lines of disk->disk->tape or disk->vtl->tape will be the sort of thing I’ll usually heartily agree with.

If someone can point out independent studies showing high tape failure rates for enterprise class tapes – I’d like to know. Until then, let’s talk about valid, non-FUD reasons for pulling tape out of the immediate backup path. These include (but are not limited to):


  • Inability of most environments to stream tape.
  • SLAs requiring faster recovery starts, which in turn necessitate recovery from disk.
  • To allow for more streamlined backup cloning operations.
  • To support target deduplication for nearline backup storage.

Tape “unreliability” is not in that list. Maybe it is in limited environments that are currently using non-enterprise tape.

* On the other hand, the easiest way of storing DAT media after generating your backup is to throw it into the bin. I might trust a DAT with a backup a little more than I’d trust a monkey with a pen to take notes in a court case, but not by much.

** I’m talking an old-style butcher. Before they had to start wearing chain mail gloves.

 

Over the weekend I wrote up a piece about how snapshots are not a valid replacement to enterprise backup. The timing of this was in response to NetApp recently abandoning development of their VTL systems, and subsequent discussions this triggered, but it was something that I’d had sitting in the wings for a while.

It’s fair to say that discussions on snapshots and backups polarise a lot of people; I’ll fully admit that I side with the “snapshots can’t replace backups” side of the argument.

I want to go into this in a little more detail. First I’ll point out in fairness that there are people willing to argue the other side that don’t work for NetApp, in the same way that I don’t work for EMC. One of those is the other Preston – W. Curtis Preston, and you can read his articulate case here. I’m not going to spend this article going point for point against Curtis – it’s not the primary point of discussion I want to make in this entry.

Moving away from vendors and consultants, another very interesting opinion, from the customer perspective, comes from Martin Glassborow’s Storagebod blog. Martin brings up some valid customer points, namely that snapshots and replication represent extreme hardware lock-in. Some would argue that any vendor’s backup product represents vendor lock-in as well, and this is partly right – though remember it’s not so difficult to keep a virtual machine around with the “last state” of the previous backup application available for recovery purposes. Keeping old and potentially obsolete NAS technology running to facilitate older recoveries after a vendor switch can be a little more challenging.

To get onto what I want to raise today, I need to revisit a previous topic as a means of further explaining my position. Let’s look for instance at my previous coverage of Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP). You can read the entire piece here, but the main point I want to focus on is my ILP ‘diagram’:

Components of ILP

One of the first points I want to make from that diagram is that I don’t exclude snapshots (and their subsequent replication) from an overall information lifecycle protection mechanism. Indeed, depending on the SLAs involved, they’re going to be practically mandatory. But, to use the analogy offered by the above diagram, they’re just pieces of the pie rather than the entire pie.

I’m going to extend my argument a little now, and go beyond just snapshots and replication, so I can elucidate the core reasons why I don’t like replicated snapshots as a permanent backup solution. Here’s a few other things I don’t like as a permanent backup solution:

  • VTLs replicated between a primary and disaster recovery site, with no tape out.
  • ADV_FILE (or other products’ disk backup solutions) cloned/duplicated between the primary and disaster recovery site, with no tape out.
  • Source based deduplication products with replication between two locations, with no tape out.

My fundamental objection in all of these solutions is the long term failure caused by keeping everything “online”. Maybe I’m a pessimist, but when I’m considering backup/recovery and disaster recovery solutions, I firmly believe that I’m being paid to consider all likely scenarios. I don’t personally believe in luck, and I won’t trust a backup/disaster recovery solution on luck either. The old Clint Eastwood quote comes to mind here:

You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?

When it comes to your data, no, no I don’t. I don’t feel lucky, I don’t encourage you to feel lucky. Instead I rely on solid, well protected systems with offline capabilities. Thus, I plan for at least some level of cascading failures.

It’s the offline component that’s most critical. Do I want all my backups for a year online, only online, even with replication? Even more importantly – do I want all your backups online, only online, even with replication? The answer remains a big fat no.

The simple problem with any solution that doesn’t provide for offline storage is that (in my opinion), it brings the risk of cascading failures into play too easily. It’s like putting all storage for your company on a single RAID-5 LUN and not having a hot spare. Sure you’re protected against that first failure, but it’s shortly after the first failure that Murphy will make an appearance in your computer room. (And I’ll qualify here: I don’t believe in luck, but I’ve observed over the years on many occasions that Murphy’s Law rules in computer rooms as well as in other places.) Or to put it another way: you may hope for the best, but you should plan for the worst. Let’s imagine a “worst case scenario”: a fire starts in your primary datacentre 10 minutes after upgrade work commenced on the array that receives replicated snapshots at your disaster recovery site, and that upgrade runs into problems with firmware, leaving that array inaccessible until vendor upgrades are complete. Or worse again, it leaves storage corrupted.

Or if that seems too extreme, consider a more basic failure: a contractor near to your primary datacentre digs through the cables linking your production and disaster recovery sites, and it’s going to take 3 days to repair. Suddenly you’ve got snapshots and no replication. Just how lucky does that leave you feeling? Personally, I feel slightly naked and vulnerable when I have a single backup that’s not cloned. If suddenly none of my backups were getting duplicated, and I had no easy access to my clones, I’d feel much, much worse. (And that full body shiver I do from time to time would get very pronounced.)

All this talk of a single instance failure frequently leads proponents of a snapshots+replication-only approach to suggest that a good design will see 3-way replication, so there’s always two backup instances. This doubles a lot of costs while merely moving the failure point just a jump to the left. On the other hand, offline backup where there’s the backup from today, the backup from yesterday, the backup from the day before … the backup from last week, the backup from last month, etc., all offline, all likely on different media – now that’s failure mitigation. Even if something happens and I can’t recover the most recent backup, in many recovery scenarios I can go back one day, two days, three days, etc. Oh yes, you can do that with snapshots too, but not if the array is a smoking pile of metal and plastic fused to the floor after a fire. In some senses, it’s similar to the old issue of trying to get away from cloning by backing up from the production site to media on the disaster recovery site. It just doesn’t provide adequate protection. If you’re thinking of using 3-way replication, why not instead have a solution that uses two entirely different types of data protection to mitigate against extreme levels of failure?

It’s possible I’ll have more to say on this in the coming weeks, as I think it’s important, regardless of your personal view point, to be aware of all of the arguments on both sides of the fence.

 

Every now and then the topic arises over whether snapshots are backups.

This is going through a resurgence at the moment, as NetApp has dropped development of their VTL systems, with some indications being that they’re going to revert to recommending people use snapshots and replication for backup.

So this raises the question again – is a snapshot a backup? I’ll start by quoting from my book here:

A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications, or operating systems that can be used for the purposes of recovery.

On the face of this definition, a snapshot is indeed a backup, and I’d agree that on a per-instance basis snapshots can act as backups. However, I’d equally argue that building your entire backup and recovery system on the basis of snapshots and replication is like building a house of cards on shifting sand in the face of an oncoming storm. In short, I don’t believe that snapshots and replication alone provide:

  1. Sufficient long-term protection.
  2. Sufficient long-term management.
  3. Sufficient long-term performance.

I’ll be the first to argue that in a system with high SLAs, having snapshots and/or replication is going to be almost a 100% requirement. You can’t meet a 1 hour data loss deadline if you only back up once every 24 hours – and backing up every hour using conventional backup systems is rarely appropriate (or rarely even works). So I’m not dismissing snapshots at all.

It’s easy to discuss the theoretical merits of using snapshots in lieu of backup/recovery software as a total backup system, but I think that the practical considerations quickly overcome any theoretical discussion. So let’s consider a situation where you want to keep your backups for 6 months. (These days that’s a fairly short period.) Do you really want to keep 6 months of snapshots around? Let’s assume we keep hourly snapshots for 2 weeks, then one snapshot per day for the rest of the time. That’s 504 snapshots per system – in fact, normally per NAS filesystem. Say you’ve got 4 NAS units and 30 filesystems on each one – that’s around 60,000 snapshots over the course of 6 months.
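
For anyone who wants to check or vary that arithmetic, here is a quick sketch. The retention figures come straight from the scenario above; treating 6 months as 182 days is my own approximation, and the function and constant names are made up for illustration.

```python
# Figures from the scenario above. DAYS_IN_SIX_MONTHS is an assumed
# approximation; change it (or any other value) to model your own site.
HOURLY_WINDOW_DAYS = 14      # keep hourly snapshots for 2 weeks
DAYS_IN_SIX_MONTHS = 182     # assumption: 6 months is roughly 182 days
NAS_UNITS = 4
FILESYSTEMS_PER_NAS = 30


def snapshots_per_filesystem(hourly_days=HOURLY_WINDOW_DAYS,
                             total_days=DAYS_IN_SIX_MONTHS):
    """Hourly snapshots for the first window, then one per day after it."""
    hourly = 24 * hourly_days
    daily = total_days - hourly_days
    return hourly + daily


def total_snapshots(nas_units=NAS_UNITS,
                    filesystems_per_nas=FILESYSTEMS_PER_NAS):
    """Snapshots retained across every filesystem on every NAS unit."""
    return snapshots_per_filesystem() * nas_units * filesystems_per_nas


print(snapshots_per_filesystem())  # 336 hourly + 168 daily = 504
print(total_snapshots())           # 504 * 120 filesystems = 60480
```

Stretching the retention period to a year, or adding a few more filesystems, pushes the total up quickly.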

What’s 60,000+ snapshots going to do to:

  • Primary production storage performance?
  • Storage and backup administrator management?
  • Storage costs?
  • Indexing costs?

The argument that snapshots and replication alone can replace a healthy enterprise backup system (or act in lieu of it) just doesn’t wash as far as I’m concerned. It looks good on paper to some, but on closer inspection, it’s a paper tiger. By all means within environments with heavy SLAs they’re likely to form part of the data protection solution, but they shouldn’t be the only solution.

 

(This is a local mirror posting of the guest blog piece I wrote for Parallels Consumer Tech Blog.)

I made a fortuitous discovery with Parallels Desktop v5 for Mac overnight. I had been patching my Mac Pro, and thought one of the patches was going to need a reboot, so of course I shut down Parallels. After the patching was completed, it turned out I didn’t need to reboot, and I got distracted so I never got back around to launching Parallels.

Last night I needed to check something on one of my Linux virtual machines that I run in Parallels Desktop, and rather than use screen sharing to my Mac Pro, I pulled out my handy iPhone application for Parallels, jumped into the virtual machine list and turned the Linux guest on. 10 seconds or so later I was able to ssh into it, do what I needed to do, then didn’t think about it again.

I came back to my Mac Pro this morning and again logged onto the Linux virtual machine via ssh, and ran a bunch of tests without once noticing: Parallels Desktop for Mac was not running. I’m not saying that the virtual machine window wasn’t visible – the application itself wasn’t running, the console for my virtual machine wasn’t running, and the virtual machine was happily chugging away.

Here you can see my dock showing that Parallels isn’t running:

Parallels Desktop

So, is this good or bad? I have to say it falls into the category of sheer awesome.

If you run multiple virtual machines in Parallels, particularly a bunch of Linux virtual machines, being able to go headless is really useful. You don’t end up with so many windows (minimised or otherwise) cluttering up the desktop, and you can still access the virtual machines just how you want.

So in order to run virtual machines headless, you’re going to need the iPhone Parallels application, which means you’ll need an iPhone. (But we all have one of those, right? :-) )

Once you’ve got the iPhone application for Parallels installed, and connected to your Parallels Desktop system, you can quit Parallels and use the iPhone application to start and stop virtual machines:

Parallels Desktop iPhone Application

Obviously this doesn’t give you the full flexibility of running the Parallels application completely, but if you’re only wanting to run virtual machines without a console, or you just need to quickly fire up a single virtual machine regardless of whether Parallels is running or not, using the Parallels iPhone application can be a real time saver.

 

There’s a simple rule to remember when it comes to removable media handling (both within backups, and generally within IT) – if you don’t know where your media is, you can’t be certain someone hasn’t misappropriated it.

Taking this further, if you can’t be sure of the security of your backup media, you can’t be sure of the security of your backups; and if you can’t be sure of the security of your backups, you can’t be sure of the security of your data.

So, how can you be certain of the security of your media, and therefore your backups and data?

Here’s a few guidelines:

  • Always use reputable media handling companies. The requirement here is two-fold. First, you want to make sure that the company that handles and stores your media knows how to treat it carefully. That means correct handling procedures, storage in appropriate environmental conditions, and storage in a location that is unlikely to be affected by disasters that could affect your datacentre. The second part of the requirement is knowing that the media is always secure. This means signed, authorised access, a known reputation for security, audited processes and (preferably) premises that you can periodically visit to confirm security levels.
  • Store media securely on-site too. It is far from the case that media can only be stolen when off-site or travelling to/from site. Indeed, some of Australia’s biggest media losses have occurred on-site due to poor media handling security. (I seriously doubt Australia is unique in this). Tapes shouldn’t be kept insecurely anywhere on-site. When being transported from the computer room to on-site storage, they should be securely monitored at all times. When readying for transport off-site, they should be kept under lock and key, or kept in a secure location. And when at-rest on-site, they should also be kept under lock and key.
  • Media encryption. For a long time media encryption has been available only to the high end of enterprise backup. However, with tape technologies such as LTO-4 incorporating hardware encryption, any company using removable media in their backup environment should either:
    • Already be using media encryption, or
    • Be actively planning a move to media encryption, or
    • If nothing else, use NetWorker’s software encryption on critically sensitive data if the business is too small to afford hardware-encryption devices. This means taking a hit on backup performance, but as the old saying goes, you can’t have your cake and eat it too. I.e., there’s always a cost to encryption.
  • Secure key management. Media encryption doesn’t mean a thing if you’re not using some form of secure key management. Discuss and plan backup key management with your corporate security policy makers.
  • Have established, immutable processes for the recall of media. Media that has been sent to offsite storage should only be returned under specific, agreed circumstances. That may be a fixed rotation policy normally, with provisions for recall for recoveries with specific authorisation. Make sure that authorisation process is locked down with your media offsite vendor so that social engineering attacks can’t be employed (particularly when it comes to ex-employees).
  • Use strong password management for backup server access. As I’ve previously discussed, your entire backup environment is only as secure as your backup server. This places a special responsibility on backup and system administrators to ensure that the backup environment is highly secure.

Of course, there’s more to backup systems security than the above, but I wanted to focus primarily on physical security considerations for removable media, which for a lot of sites represent the weakest link in the security of the backup environment (and by extension, a significantly weak link in the security of the company’s IT systems and data as well).

If you fail to focus on removable media security, you potentially leave your company open to data loss.

 

I had hoped that the NetWorker Power User’s Guide to nsradmin micromanual might be popular enough to get say, at least 50 or 100 downloads, but I’ve been overwhelmed by the hundreds and hundreds of downloads.

That high number of downloads has well and truly been reflected in the fact that the article introducing the micromanual was the top viewed article for January.

If you’ve not already checked out the micromanual, please feel free to download it. Don’t be afraid of the request for a name and email address – I’m not harvesting this information for any nefarious purposes. As I state quite clearly on the download page, it’s only to let you know if there are any updates to the manual. Any person who has already downloaded the manual will attest to the fact that I’ve not contacted them – and that’s because there’s been no updates yet.

As a side note, this blog is now officially a year old, and the readership continues to grow – a big thank-you to everyone for taking the time to read what I have to say!

© 2012 The NetWorker Blog