RIP Old Backup Software

Much of what I deal with relates to active backup systems, but sometimes a backup system will reach an end-point in its lifecycle. To be fair, this isn’t something that should necessarily happen regularly. If chosen correctly, a backup system (particularly an enterprise one) should evolve with the needs of business. Indeed, it could be argued that in order to even be classified as an enterprise backup product, software must feature both growth and scalability so it can remain useful and relevant in a deployment.

That being said, there are still times when a company will decide to decommission a backup system. Reasons I’ve seen in the past include:

  1. Business is purchased by another company that has a backup software standard;
  2. Critical feature set<->requirements gap develops, necessitating re-evaluation;
  3. Backup product is discontinued (or subsumed by another product);
  4. OS platform shift necessitates a product change;
  5. New manager has a beef against existing product or vendor (sadly, while this shouldn’t come into play, it really does sometimes).

There are going to be other reasons from time to time, of course, but those represent the most common reasons I’ve seen (not in any real particular order, I should note).

These days it’s actually extremely rare to encounter a business that doesn’t have any long-term recovery requirements. (Indeed, typically businesses that believe they don’t have any long-term recovery requirements are mistaken.) Out of all my current customers, there’s only one that I can immediately think of that has short-term retention policies only and proof that’s all they need.

It’s the transitioning between backup products that sees us lose the insurance policy analogy. We can compare a lot of backup and recovery system operations to insurance policies – backing up is taking out the policy, recovery is making a claim, cloning your backups is like ensuring your policy is up to date and your insurer is liquid, and having a support contract is like making sure your insurer has an underwriter.

Switching backup products? You might say that it’s like switching insurance companies, except when you switch insurance companies you don’t have to keep your old policy around “just in case”. It’s a very rare situation to be able to switch without any legacy considerations.

And so, the net result when it comes time to decommission a backup product is that a full decommissioning may in fact take months, or even years, to complete, depending on the retention requirements on the backups.
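
To put a rough number on that, here is a minimal sketch of how the earliest possible "full decommission" date falls out of retention periods. The retention classes, dates and day counts are entirely hypothetical; substitute whatever your own legacy product is holding.

```python
from datetime import date, timedelta

# Hypothetical retention classes still held by the old product:
# (label, date the last backup was written, retention in days).
legacy_backups = [
    ("daily", date(2011, 1, 31), 42),
    ("monthly", date(2011, 1, 31), 395),
    ("yearly", date(2010, 12, 31), 7 * 365),
]


def expiry(last_written: date, retention_days: int) -> date:
    """Date on which the newest backup in a retention class expires."""
    return last_written + timedelta(days=retention_days)


def earliest_full_decommission(backups) -> date:
    """The old system can only go once its longest-lived backup has expired."""
    return max(expiry(written, days) for _, written, days in backups)


print(earliest_full_decommission(legacy_backups))
```

With a seven-year class in the mix, the old environment (or some recoverable form of it) has to survive well after the new product has taken over day-to-day backups.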

When a backup environment is due to be decommissioned, you can typically choose one or more of the following actions:

  1. Migrate all, or just the critical, long-term backups to the new product. This is typically a costly and fairly manual process involving recoveries and new backups, often requiring third party certification that no data was changed during the process, etc.;
  2. Maintain the old backup environment ‘as-is’, with appropriate support contracts, which may be costly;
  3. Maintain the old backup environment ‘as-is’, without support contracts (i.e., an Icarus support contract process), which will be risky;
  4. Virtualise the essential components of the backup environment, and reduce to a bare minimum the hardware requirements necessary for a recovery (e.g., replace a large tape library with just one or two standalone drives, etc.);
  5. Decommission the environment, archiving the requisite hardware and systems to facilitate a “cold” startup and recovery (possibly exporting the meta-data necessary for long-term backup tracking beforehand to facilitate those recoveries).

To be perfectly honest, none of these options is inherently ideal, and each carries its own risks, costs and compromises. (I believe the most flexible choice, if it’s available to the business, is virtualisation.)

If migration isn’t performed, then there’s another aspect to decommissioning which needs to be considered. Like everything to do with backups, the technology isn’t likely to be the biggest challenge; in this case, the challenge will centre around staff knowledge.

At the best of times, backup product expertise is best acquired by regular use of the product, and moving to a new product will obviously draw attention away from the old product. If a recovery needs to be performed three months after decommissioning, a backup administrator will likely have no issue performing that recovery. But after six months? Twelve months? Three years? People who are rusty with the product will work slower and are more likely to make mistakes.

The simple fact is that there’s no really easy way to decommission a backup system in favour of a new one. That lack of simplicity should, by rights, factor into any decision process relating to the decommissioning itself; namely:

  1. Will we migrate, decommission or retain a reduced, active form of the old system?
  2. What will be the costs associated with each option?
  3. What will be the risks associated with each option?
  4. What are the benefits (both direct and indirect) from the transition?
  5. Do the costs and risks of the transition outweigh the benefits?

The last question is not flippant – any decision to change a backup product must be closely and carefully weighed up. (This is why the “new manager hates vendor X/product Y and insists on change” transition reason is particularly challenging and unpleasant to deal with – there’ll likely be few, if any, benefits to that transition.)

Make sure that all of the above questions can be answered clearly and accurately; if they can’t, then in all likelihood the decommissioning will get very messy.

 

New year’s resolutions for backup

I’d like to suggest that companies be prepared to make (and keep!) 7 new year’s resolutions when it comes to the field of backup and recovery:

  1. We will test our backups: If you don’t have a testing regime in place, you don’t have a backup system at all.
  2. We will duplicate our backups: Your backup system should not be a single point of failure. If you’re not cloning, replicating or duplicating your backups in some form, your backup system could be the straw that breaks the camel’s back when a major issue occurs.
  3. We will document our backups: As for testing, if your backup environment is undocumented, it’s not a system. All you’ve got is a collection of backups from which you might get a recovery, if the right people are around at the right time and in the right frame of mind. If you want a backup system in place, you not only have to test your backups, you also have to keep them well documented.
  4. We will train our administrators and operators: It never ceases to amaze me the number of companies that deploy enterprise backup software and then insist that administrators and operators just learn how to use it themselves. While the concept of backup is actually pretty simple (“hey, you, back it up or you’ll lose it!”), the practicality of it can be a little more complex, particularly given that as an environment grows in size, so does the scope and the complexity of a backup system. If you don’t have some form of training (whether it’s internal, by an existing employed expert, or external), you’re at the edge of the event horizon, peering over into the abyss.
  5. We will implement a zero error policy: Again, there’s no such thing as a backup system when there’s no zero error policy. No ifs, no buts, no maybes. If you don’t rigorously implement a zero error policy, you’re flipping a coin every time you do a recovery, regardless of what backup product you use. (To learn more about a zero error policy, check out the trial podcast I did where that was the topic.)
  6. We will appoint a Data Protection Advocate: There’s a lot of data “out there” within a company, not necessarily under central IT control. Someone needs to be thinking about it. That someone should be the Data Protection Advocate (DPA). This person should be tasked with being the somewhat annoying person who is present at every change control meeting, raising her or his hand and saying “But wait, how will this affect our ability to protect our data?” That person should also be someone who wanders around the office(s) looking under desks for those pesky departmental servers and “test” boxes that are deployed, the extra hard drives attached to research machines, etc. If you have multiple offices, you should have a DPA per office. (The role of the DPA is outlined in this post, “What don’t you backup?“)
  7. We will assemble an Information Protection Advisory Council (IPAC): Sitting at an equal tier to the change control board, and reporting directly to the CTO/CIO/CFO, the IPAC will liaise with the DPA(s) and the business to make sure that everyone is across the contingencies that are in place for data protection, and be the “go-to” point for the business when it comes to putting new functions in place. They should be the group that sees a request for a new system or service and collectively liaises with the business and IT to ensure that the information generated by that system/service is protected. (If you want to know more about an IPAC and its role in the business, check out “But where does the DPA fit in?“)

And there you have it – the new year’s resolutions for your company. You may be surprised – while there’ll be a little effort getting these in place, once they’re there, you’re going to find backup, recovery, and the entire information protection process a lot easier to manage, and a lot more reliable.

 

Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error happens.
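
As a rough illustration of those two goals, here is a minimal sketch that takes a list of failures and pulls out the ones that repeat and the ones nobody has explained yet. The `Failure` record, its fields and the sample data are all hypothetical; no backup product reports failures in exactly this shape.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Failure(NamedTuple):
    day: str                    # e.g. "2011-02-01"
    client: str
    saveset: str
    explanation: Optional[str]  # None means nobody has explained it yet


def review(failures):
    """Return (repeating, unexplained) failures that need follow-up."""
    counts = Counter((f.client, f.saveset) for f in failures)
    repeating = sorted(key for key, n in counts.items() if n > 1)
    unexplained = [f for f in failures if f.explanation is None]
    return repeating, unexplained


# Hypothetical sample data.
failures = [
    Failure("2011-02-01", "db01", "/data", "client rebooted mid-backup"),
    Failure("2011-02-02", "web03", "/var", None),
    Failure("2011-02-03", "web03", "/var", None),
]

repeating, unexplained = review(failures)
print(repeating)         # [('web03', '/var')]
print(len(unexplained))  # 2
```

Under a zero error policy, both lists should be empty by the time the next backup window opens: the one-off failure on db01 is acceptable because it is explained and has not recurred.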

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently successful) your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
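
The arithmetic above can be sketched in a few lines. This is only an illustration: the function names and the seven-night window are my own assumptions, and the nightly rates are made-up sample data.

```python
def nightly_savesets(hosts: int, filesystems_per_host: int) -> int:
    """Filesystem savesets, plus one index saveset per host,
    plus one bootstrap and one NMC database saveset."""
    return hosts * filesystems_per_host + hosts + 2


def target_range(total: int, low: float = 0.95, high: float = 0.98):
    """Number of savesets that must succeed at the low and high marks."""
    return round(total * low), round(total * high)


def on_target(nightly_rates, low=0.95, high=0.98, window=7):
    """Every recent night reaches the low mark, and the rolling
    average over the window reaches the high mark."""
    recent = nightly_rates[-window:]
    rolling = sum(recent) / len(recent)
    return min(recent) >= low and rolling >= high


total = nightly_savesets(hosts=20, filesystems_per_host=4)
print(total)                # 102
print(target_range(total))  # (97, 100)

# Hypothetical success rates for the last seven nights.
rates = [0.96, 0.99, 1.0, 0.98, 0.97, 0.99, 0.99]
print(on_target(rates))     # True
```

A week that sits at exactly 95% every night would fail this check, which matches the intent: the low mark is tolerable on any given day, not as a steady state.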

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as that may be highly desirable, a backup system by its nature touches on so many parts of an operating IT environment that it’s also one of the systems most vulnerable to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually allowing for with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

 

The backup world is, to use a quaint colloquialism, “arse about face“. So much of the talk in the backup world revolves around backup, when it actually should revolve around recovery.

It’s not that we don’t care about recovery – it’s just we often consider data protection in terms of “backup”, when in actual fact that’s just the means, not the end.

For this reason, I periodically see sites where backup failures are allowed to go on for sometimes weeks – sometimes, even, months – because “it’s just a backup”, or “it’s probably not important”, or some other such reason. Now, long-term readers would know that I prescribe zero error policies, but sometimes getting traction for zero error policies is difficult. So, we need to rename “backup failure”.

You see, in reality, it’s not really a “backup failure”, and while we call it a “backup failure”, anyone who is busy will think that it just means we wait for the next backup, or assume that it’s a minor problem. Let’s instead call it by what it really is:

an unrecoverable backup

I’ve been testing this out recently, and when I’ve talked to people in the past about “backup failures”, they’ll admit there’s risk involved but tend to shrug them off, hoping for the best. However, when I’ve started talking about “unrecoverable backups”, the squirming starts. It creates a slight sense of panic in the eyes, or a heightened need to check status emails, etc.

So, that’s my tip for today – there’s no such thing as a “backup failure”, it’s an “unrecoverable backup”. And if you think about it as an “unrecoverable backup”, and you talk about it within the company as an “unrecoverable backup”, it’ll get the attention, priority and budget it requires.

 

Martin Glassborow, aka @storagebod, and I had a bit of a discussion via Twitter, which came down to the following:

  • Martin feels the default backup policy within an environment should be to backup nothing;
  • I feel the default backup policy within an environment should be to backup everything.

Now the interesting thing is, we both actually meet in the middle, but just start from different points.

Martin has discussed his reasoning behind his default policy here, in “Don’t BackUp“, which I encourage you to read before continuing. There is, indeed, as Martin suggested in a tweet to me last night, a nice absolutism in either approach – don’t backup, or backup everything. Yet, neither is really the case.

My approach – that being to start with “backup everything” – rests on the following assumptions:

  1. Hardware can fail.
  2. Software can fail.
  3. Humans can make errors.
  4. Processes can fail.

By my very nature I think I’m perfectly suited to working in the backup space. I’ve always been into backup. On the Vic-20, when I was learning to program, I’d always save my programs onto two different tapes. On the Commodore 64, I’d always save my programs and documents onto two different disks. When I went to the PC, I’d always have a copy on a hard drive, and a copy on a floppy drive.

Martin’s approach is this:

Making it policy that nothing gets backed-up unless requested takes out all ambiguity. There can be no assumptions about what is being backed-up, it makes it someone’s responsibility as opposed to an assumed default.

There is, undoubtedly, logic in what Martin suggests, but it’s not a logical starting point I can personally reconcile myself with, for the fundamental reason that it (IMHO) assumes that everyone who interacts with the system understands the system and the nature of their interaction.

It in fact runs completely contrary to an axiom in user desktop/laptop backup approaches – if you leave backups up to the users, nothing will get backed up. That holds true for pretty much every business I’ve ever interacted with, from the most, to the least technical.

It’s for that reason, that lack of total systems awareness and data responsibility from all users of any environment, that my approach starts from the other end. Backup everything.

But I don’t really mean it. I abhor wastage. Recently, I’ve learnt that wastage comes in many forms, which is why the decision to move interstate and re-evaluate what I/we own has been cleansing. (See the article “deconstruction of falling stars” over at my personal blog for a bit more on that front.)

As I abhor wastage, I don’t actually believe you should backup everything within your environment. Sure, some vendors might like that notion – infinite tapes, disk, storage, snapshots, you name it. But it’s neither practical nor commercial reality to do this.

No, there is a middle ground. For me, the sweet spot is what I always come back to:

It is always better to backup a little more than you need, and waste some storage media, than it is to not backup quite enough, and be unable to recover.

So if your tape usage is say, 5-10% higher than it should be, or your VTL/B2D environment is 5-10% bigger than it really needs to be, I’m not concerned. (If it’s a crazy amount, like 100% more, then there’s a problem – a serious problem that has arisen from a lack of capacity planning, etc.)

I’ve seen IT sites where NetWorker agents have been deployed on every server within the environment, and when I’ve done a coverage analysis, I’ve seen servers that have this as the saveset:

/etc/hosts

Just that. Nothing more, nothing less. (You couldn’t get much less anyway.) I’ve equally seen sites where not only was a hot backup done of the production Oracle database via a module, but the database files were backed up as part of the filesystem backup, and then export/dumps were generated and backed up as well. Overkill? Yes. Were some backups unrecoverable? Yes.

Both are very clear examples of wastage, but I’ll tell you the difference.

The latter – backing up too much – is time and money wastage. Neither is pleasant, and both can hurt the bottom line of a company, yet that’s where it stops.

The former – backing up only what is explicitly requested, nothing more – is corporate wastage. There’s a little bit of monetary wastage involved (why spend the money on an agent to backup a single file?), but the real wastage is that it could waste the company. Unable to recover legally required files because someone forgot to request them to be backed up? Hello, lawsuit loss. Unable to recover financial data that proves your company has correctly paid its taxes because someone forgot to request it to be backed up? Hello, double tax payments. For me it triggers thoughts of every possible nightmare scenario a company might experience, right through to total dissolution and loss of the company itself.

In my book, I make the differentiation between what I call inclusive and exclusive backup products. I define:

  • An inclusive backup product is one where you have to explicitly specify what gets backed up. By default, nothing is backed up unless you specify it.
  • An exclusive backup product is one where you have to explicitly specify what doesn’t get backed up. By default, everything is selected and you have to winnow that selection down yourself.
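
The practical difference between the two shows up when the environment changes after the policy is written. Here is a minimal sketch, assuming simple glob-style patterns and a made-up list of filesystems; it is not how any particular product implements selection.

```python
from fnmatch import fnmatch


def inclusive_selection(filesystems, include):
    """Inclusive product: nothing is backed up unless it matches an include pattern."""
    return [fs for fs in filesystems if any(fnmatch(fs, pat) for pat in include)]


def exclusive_selection(filesystems, exclude):
    """Exclusive product: everything is backed up unless it matches an exclude pattern."""
    return [fs for fs in filesystems if not any(fnmatch(fs, pat) for pat in exclude)]


# Hypothetical host: /data was added after both policies were written.
filesystems = ["/", "/home", "/var", "/data", "/scratch"]

print(inclusive_selection(filesystems, include=["/", "/home", "/var"]))
# ['/', '/home', '/var']

print(exclusive_selection(filesystems, exclude=["/scratch"]))
# ['/', '/home', '/var', '/data']
```

In the inclusive case the new filesystem is silently unprotected until someone remembers to request it. In the exclusive case it is protected by default, at the cost of some extra media if it turns out not to matter.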

The first, I consider to be the hallmark of a workgroup backup product approach. Cost reduction is the primary focus of this approach. The second, I consider to be a fundamental requirement for a product to earn the “enterprise backup product” badge of honour. Without this, there is a distinct lack of trust.

While I can understand Martin’s starting point, and that he moves more to the middle of making sure the right things are backed up, I can’t agree with this logic that this is the best approach.

I’ve seen, heard of, and witnessed too many IT war stories.

 

Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has done any surveys would claim (rightly so) that my leading questions into the survey may have coloured the results of the survey, and I’d not disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to backup to disk to the backup server, but not using other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with a question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past; and other than offering better rack density I see them having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to backup to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than say, a virtual machine accessing a physical tape drive does.

So, the survey showed me fairly much what I was expecting I’d see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

- if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
- the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
- I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of everything Alex raised, the simple ability to move your backup server between virtual servers was the thing I hadn’t considered. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to a critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations of running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold:

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. Secondary to this: DR should not be dependent on the original and the destination host having the same names.

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.

 

Pumping data

The age-old consideration in backup is the most simple one: how to pump the required data through in the required time frame in such a way that it can be readily recovered. This challenges us to constantly find the best way to achieve the data throughput required. What worked 10 years ago was not always applicable 5 years ago; what worked 5 years ago is not always applicable now. Consider for instance the adage:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

(Andrew Tanenbaum, 1996.)

What surprises me, to a degree, is that still, in 2011, we’re having discussions about data throughput where people focus on the wrong thing. I would humbly suggest that you shouldn’t give a flying fracas about how fast you can back your data up when compared to how fast you can recover it.

That’s right: when talking feeds and speeds, the only one to give a damn about in backup is how quickly you can recover the data once it’s been captured.

This is, in fact, why the terms RPO and RTO were invented. In particular for the topic of “pumping data”, RTO – Recovery Time Objective – is most important. How quickly do you need to get the data back?

In this scenario, Andrew Tanenbaum’s caution about a station wagon full of tapes hurtling down the highway is entirely appropriate. In fact, so much so that when companies start talking about how fast they need to backup (or how fast they can backup) without reference to recovery, I unfortunately go into this loop:

Why? Because it’s like when my grandmother wants to tell me a story about how she bumped into someone she hadn’t seen for 57 years in the supermarket, but gets stuck on an irrelevant detail. “Peaches or pears!” I used to say to her as a kid, perhaps a little disrespectfully – it didn’t matter whether she was out shopping for peaches or pears before the important thing happened! Same here – it doesn’t matter how fast you can pump data into the backup system – it’s how fast you can pump data out of it that is the only number worth focusing on.

We have to, as storage industry insiders, experts, advisors, consultants – whatever we want to call ourselves – keep vendors and customers focused on the real important metric: how fast they can recover. We have a duty of care to stand between the FUD and the hype and steer companies on a safe trajectory. The safe trajectory in this case is talking about recovery speeds rather than backup speeds.

This is, for instance, why I rarely get excited about remote office backup strategies. For instance, a current meme in remote office backup strategy is the use of deduplication – most likely source based. The goal? Reduce the amount of data you have to transfer from the remote office to the head office to a small trickle, and all your problems are solved … until, of course, you need to recover that data.
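
To make that asymmetry concrete, here is a back-of-the-envelope sketch. Every figure is an assumption picked for illustration (data size, change rate, link speed, recovery time objective), and it assumes a full recovery has to pull all of the data back across the same link with nothing cached at the remote office.

```python
def transfer_hours(gigabytes: float, link_mbit_per_s: float) -> float:
    """Hours to move the given volume over a link, ignoring protocol overhead."""
    megabits = gigabytes * 8 * 1000  # 1 GB taken as 8000 megabits
    return megabits / link_mbit_per_s / 3600


# Hypothetical remote office.
full_data_gb = 500      # data that must come back in a full recovery
nightly_change_gb = 5   # unique data sent nightly after source deduplication
wan_mbit_per_s = 20     # link between remote office and head office
rto_hours = 8           # recovery time objective agreed with the business

backup_hours = transfer_hours(nightly_change_gb, wan_mbit_per_s)
recovery_hours = transfer_hours(full_data_gb, wan_mbit_per_s)

print(f"nightly backup: {backup_hours:.1f} h")
print(f"full recovery:  {recovery_hours:.1f} h (RTO is {rto_hours} h)")
print("meets RTO" if recovery_hours <= rto_hours else "misses RTO")
```

With these made-up numbers the nightly backup takes well under an hour, while the full recovery takes more than two days. The backup figure looks great on a slide; the recovery figure is the one the business will actually experience.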

Don’t get me wrong, I’m not against remote office backups – I’m also not against centralised remote office backups, regardless of whether they’re achieved by deduplication, compression, magic pixies or faerie dust. In this example though there’s a simple fact: to talk about remote office backup without discussing remote office recovery is reprehensible.

Yes, reprehensible. I’ll use that term. It’s not a nice term, I know, but nor is the practice of ignoring the elephant in the room – recovery.

Look folks, do you really want me to prance around a stage doing the monkey dance shouting “Recovery! Recovery! Recovery!”? Is that what it has to take? Because, if it is, I’ll do it. (I might, if you don’t mind, try to avoid the flop sweat though.)

What am I asking for? Maybe it’s this simple thought:

Starting this year, let no company (vendor or otherwise) talk about a product’s backup performance without citing real world recovery scenarios and performance in those scenarios.

There is not a guaranteed 1:1 mapping between backup and recovery performance, and to imply there is, either by obfuscation or omission, is disrespectful to the data protection industry.

 

The holiday season is upon many of us – whether you celebrate Xmas or Christmas, or just the new year according to the Gregorian calendar, we’re approaching that point where things start to ease off for a lot of people and we spend more time with our families and friends.

Before I wrap up for the year, I wanted to spend a few minutes reintroducing some of the most popular topics of the year on the blog – the top ten articles based on directly linked accesses. Going in reverse order, they are:

  • Number 10 – “Why I’d choose NetWorker over NetBackup every time”. I was basically called an idiot by someone in the storage community for writing this, but the fact remains for me that any backup product that fails to support backup dependencies is not one that I would personally choose. Given that a top search that leads people to the blog is of the kind, “netbackup vs networker” or “networker vs netbackup”, clearly people are out there comparing the two products, and I stand by my support of the primacy of backup dependency tracking.
  • Number 9 – “A tale of 4 vendors”. A couple of months ago I attended SNIA’s first Australian storage blogger event, touring EMC, IBM, HDS and NetApp. Initially I’d planned to blog a fairly literal dump of the information I jotted down during the event, but I realised instead I was more drawn to the total solution stories being told by the 4 vendors.
  • Number 8 – “NetWorker 7.5.2 – What’s it got?”. NetWorker 7.5 represented a big upgrade mark for a lot of sites, particularly those that wanted to jump the v7.3 and v7.4 release trees. I still get a lot of searches coming to the blog based on NetWorker 7.5 features and upgrades.
  • Number 7 – “Using NetWorker Client with Opensolaris”. This was written by guest blogger Ronny Egner, and has seen more interest over the last few months as Oracle’s acquisition continues to grind down paid Sun customers. If you’re interested in writing guest blog pieces for the NetWorker Blog in 2011, let me know!
  • Number 6 – “Basics – Fixing ‘NSR peer information’ errors”. I’ve said it before, and I’ll say it again: there is no valid reason why the resolution for this hasn’t been built into NMC!
  • Number 5 – “NetWorker and linuxvtl, Redux”. The open source LinuxVTL project continues to grow and develop. While it’s not suited for production environments, LinuxVTL is certainly a handy VTL to plug into a NetWorker/Linux system for testing purposes. I know – I use it almost every single day.
  • Number 4 and Number 3 – “NetWorker 7.6 SP1”. Interest in NetWorker 7.6 SP1 has been huge, and I had two blog postings about it – a preview posting based on publicly shared information from EMC, and the actual post-release article that covered some key features more in-depth.
  • Number 2 – “Carry a Jukebox with you (if you’re using Linux)”. The first article I wrote about the LinuxVTL project.
  • Number 1 – “micromanual: NetWorker Power User Guide to nsradmin”. The Power User guide to nsradmin has been downloaded well over a thousand times. I’ve been a fan of nsradmin ever since I started using NetWorker and had to administer a few NetWorker servers over extremely slow links (think dial-up speeds). It’s been very gratifying to be able to introduce so many people to such a useful and powerful tool.

Personally this year has been a pretty big one for me. Probably the biggest single event was that my partner and I made the decision to move from central coast NSW to Melbourne, Victoria during the year. We haven’t moved yet; it’s due for June 2011, but it’s going to necessitate a lot of action and work on our part to get there. It’ll be well worth the effort though, and I’ve already reached that odd point where I no longer think of the place I’m living as “home”. The reasons that led us to that decision are covered on my personal blog here. Continuing the personal front, I was extremely pleased to be able to say goodbye to the mobile “netwont” that is Vodafone in Australia. I’ve been using my personal blog to talk about a lot of varied topics running from internet censorship to invasive information requests to more mundane things, such as what makes a good consultant.

Technically I think the coming few years are going to be fascinating. Deduplication has only just started to make a splash; I think it’ll be a while before it becomes as pervasive as, say, plain old disk backup, but it will have a continued and growing effect in the enterprise backup market. I predict that another bevy of dopey analysts will insist that tape is dead, just like they have every year for the last two decades, and at the end of the year I predict the majority of companies they interface with will still be using tape in some form or another. However, the use of tape will continue to evolve in the marketplace; as nearline disk storage becomes more common and cheaper for backup solutions, we’ll see tape continue to be pushed out to longer term retention systems and safety nets – i.e., tape is certainly sliding away from being the primary source for recoveries in an enterprise backup environment.

One last thing – I want to thank the readers of this blog. To those people who subscribe to the mailing list, and those who subscribe to the RSS feed, to those who have the site bookmarked and to those who just randomly stumble across the site – I hope in each case you’re finding something useful, and I’m grateful for your readership.

Happy holidays to those of you celebrating or relaxing over the coming weeks, and peaceful times to those working through.

 

I’m not a storage geek – storage to me is a means to an end, almost irrelevant to the final goal.

I’m passionate about backup though, because backup is about making people happy.

Backup is about recovery, you see.

Recovery is about making sure people can go home on time rather than re-entering lost data all night.

Recovery is about knowing someone can turn up for a flight they booked six weeks earlier and know the airline still knows they booked the ticket.

Recovery is about knowing someone’s pay deposit isn’t lost after a brief systems hiccup.

Recovery is about a student saving a 50,000 word thesis on a server and knowing it will still be there next morning.

Recovery is about being able to look at digital photos of a loved one ten years after they’re gone.

I have the best job in the world.

If you work in backup and recovery, so do you.

 

Once upon a time, if you said to someone “do you have a test environment?” there was at least a 70 to 80% chance that the answer would be one of the following:

  • Only some very old systems that we decommissioned from production years ago
  • No, management say it’s too expensive

I’d like to suggest that these days, with virtualisation so easy, there are few reasons why the average site can’t have a reasonably well configured backup and recovery test environment. This would allow the following sorts of tests to be readily conducted:

  • Disaster recovery of hosts and databases
  • Disaster recovery of the backup server
  • Testing new versions of operating systems, databases and applications with the backup software
  • Testing new versions of the backup software

Focusing on the Intel/x86/x86_64 world, we see where this is immediately achievable. Remember, for the average set of tests that you run, speed is not necessarily going to be the issue. Let’s focus on non-speed functionality testing, and think of what would be required to have a test environment that would suit many businesses, regardless of size:

  1. Virtualisation server – obviously VMware ESXi springs to mind here, if cost is a driving factor.
  2. Cheap storage – if performance is not an issue for testing (i.e., you’re after functionality not speed testing), there’s no reason why you can’t use cheap storage. A few 2TB SATA drives in a RAID-5 configuration will give you oodles of space if you need any level of redundancy, while the same drives in a RAID-0 stripe will give you more capacity and performance at the cost of that redundancy. Optionally present storage via iSCSI if it’s available.
  3. Tiny footprint – previously test environments were disqualified in a lot of organisations, particularly those at locations where space was at a premium. Allocating room for say, 15 machines to simulate part of the production network took up tangible space – particularly when it was common for test environments to not be built using rackable equipment.
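As a rough illustration of the storage trade-off in item 2, the sketch below compares usable capacity for the two layouts. The drive count of four is an assumption (the text only says "a few"), and the helper `usable_tb` is hypothetical; it applies the standard rules that RAID-0 uses every drive for data while RAID-5 gives up one drive's worth of space to parity.

```python
def usable_tb(drives: int, drive_tb: float, level: str) -> float:
    """Return usable capacity in TB for a simple RAID-0 or RAID-5 set."""
    if level == "raid0":
        if drives < 2:
            raise ValueError("RAID-0 needs at least 2 drives")
        return drives * drive_tb          # all space usable, no redundancy
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * drive_tb    # one drive's worth goes to parity
    raise ValueError(f"unsupported RAID level: {level}")


# Assumed test rig: four 2TB SATA drives.
drives, drive_tb = 4, 2.0
raid5_tb = usable_tb(drives, drive_tb, "raid5")
raid0_tb = usable_tb(drives, drive_tb, "raid0")

print(f"RAID-5: {raid5_tb:.0f} TB usable, survives one drive failure")
print(f"RAID-0: {raid0_tb:.0f} TB usable, no redundancy")
```

For a functionality-only test environment, either figure is likely to be plenty; the choice mostly comes down to whether rebuilding the test rig after a drive failure is an acceptable nuisance.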

In the 2000s, there was much excitement over the notion of supercomputers at your desk – for example, remember when Orion released a 96-CPU capable system? The notion of that much CPU horsepower under your desk for single tasks may be appealing to some, but let’s look at more practical applications flowing from multi-core/multi-CPU systems – a mini datacentre under your desk. Or in that spare cubicle. Or just in a 3U rack enclosure somewhere within your datacentre itself.

Gone are the days when backup and recovery test environments were cost-prohibitive. You’re from a small organisation? Maybe 10-20 production servers at most? Well that simply means your requirements will be smaller and you can probably get away with just VMware Workstation, VMware Fusion, Parallels or VirtualBox running on a suitably powerful desktop machine.

For companies already running virtualised environments, it’s more than likely the case that you can even use a production virtualisation server due for replacement as a host to the test environment, so long as it can still virtualise a subset of the production systems you’d need to test with. During budgetary planning this can make the process even more painless.

This sort of test environment obviously doesn’t suit every single organisation or every single test requirement – however, no single solution ever does. If it does suit your organisation though, it can remove a lot of the traditional objections to dedicated test environments.

© 2012 The NetWorker Blog