Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I, however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has done any surveys would claim (rightly so) that my leading questions into the survey may have coloured the results of the survey, and I’d not disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to back up to disk on the backup server, but not to other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with a question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past; and other than offering better rack density I see them having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to back up to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than, say, a virtual machine accessing a physical tape drive does.

So, the survey showed me fairly much what I was expecting I’d see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

- if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
- the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
- I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of all the things I hadn’t considered, the simple ability to move your backup server between virtual servers was the one that stood out. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to a critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations of running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold:

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. Secondary to this: DR should not be dependent on the original and the destination host having the same names.

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.

 

It used to be 10 years ago that you couldn’t do anything in the backup space without having an answer to the question, “How do you achieve BMR?” Nowadays, it’s not a dirty word in backup, but it certainly seems to be somewhat passé.

So what happened? Is BMR now dead? Is it on life support? Did it ascend?

It’s an interesting question. I think that as an independent technology, BMR has become ever more niche, and what we’ve seen is a gradual shift in technology so as to allow BMR to become a silent feature. As such, it doesn’t necessarily get a lot of attention – it just blends into the background.

For the most part, I’d suggest that I found BMR to be more of a focus point in the Windows market, then later in the emerging Linux market, though still with a primary focus on Windows. This wasn’t to say that rapid systems recovery wasn’t important on other platforms, but on those platforms there were frequently technologies built into the OS. AIX could boot from a system image tape. Solaris could be Jumpstarted, etc. Eventually, Linux could be Kickstarted.

In the Legato space, BMR options were pretty challenging for the most part, so 10 years ago I’d regularly recommend customers wanting to BMR their Windows servers to deploy Ghost. It wasn’t perfect, but it did the trick – the goal in my mind was to get a system back to a state of easy recoverability; i.e., BMR was about allowing you to get a system back to the point where you could run a full recovery. Nothing more, nothing less. That was undoubtedly influenced by the lack of integrated BMR within NetWorker, but it worked, and it let each product focus on what it did best.

These days I think BMR is something that’s effectively available in most enterprise spaces without actually needing to reference it as an independent technology. So it comes into play primarily as a result of virtualisation and snapshots.

Within virtualisation, there are two options that tend to resolve independent BMR requirements – templates, and image level backups, though for slightly different reasons.

Templates are designed to allow a rapid deployment of a new guest – be it just at the operating system level, or a combination operating system and application level; such templates will usually include a certain level of patching – enough to get a host at a secure enough point to connect to a corporate network. But they don’t have to be used just for the deployment of a new guest; instead, if a guest fails or becomes otherwise hopelessly corrupt, there’s nothing stopping the use of a template to rapidly bring the guest “back to life” to allow a regular recovery. If backups are being done at the guest level, then a smart template will also include the backup software so that it’s immediately available on system (re)creation.

On the other hand, image level backups fulfil the old “cold backup” niche. When virtualisation started hitting its stride, image level backups were seen as the future, but then reality struck and it became painfully obvious that recovering a 100GB virtual machine to pull out a 10KB document was wasteful and time consuming. Since then file level recovery from image level backup has improved, but it’s still not an omnipresent technology. That being said, image level backup works perfectly as a rapid BMR mechanism. Even assuming a situation where an image level backup is only taken once a month, recovering a machine from an image backup done 30 days ago puts you in a situation to allow regular host-based recoveries to run with minimum effort.

We frequently look at snapshots as enabling more useful RPOs and RTOs than traditional “once per day” backups. It’s common for instance to see NAS systems with hourly read-only snaps immediately available to end users for self-directed recoveries. They’re also used to facilitate traditional backups by doing quiesced backups with minimum downtime, or less disruptive backups.

However, certainly in the enterprise space, snapshots equally provide an excellent BMR solution. Snapshot, patch, revert to snapshot if patch fails, etc. Array level snapshots (IMHO) provide a significantly greater level of flexibility than a traditional BMR solution where the primary focus is getting a machine back to its most recent usable state. Snapshots are so useful on this front that they’re even used within virtualisation for exactly that reason – why go back to an image level backup, or waste time doing a cold backup of a virtual machine when you can just roll back to a snapshot taken 10 minutes ago?
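The “snapshot, patch, revert if the patch fails” loop can be sketched in a few lines of Python. This is only an illustration of the control flow: the four callables (`take_snapshot`, `apply_patch`, `verify`, `revert`) are hypothetical placeholders for whatever your array or hypervisor tooling actually provides, not real APIs.

```python
def patch_with_rollback(take_snapshot, apply_patch, verify, revert):
    """Snapshot first, patch, and roll back if the patch fails or verification fails.

    All four arguments are hypothetical callables supplied by the caller.
    Returns "patched" on success or "reverted" if the snapshot was restored.
    """
    snapshot_id = take_snapshot()  # always capture state before touching the system
    try:
        apply_patch()
        healthy = verify()  # e.g. services up, application responding
    except Exception:
        healthy = False  # a failed patch is treated the same as a failed check
    if not healthy:
        revert(snapshot_id)  # return to the pre-patch state
        return "reverted"
    return "patched"
```

The point of the structure is simply that the snapshot is taken unconditionally before the change, so the revert path never depends on anything the patch may have broken.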

What I’ve been observing now for a while is that BMR as an independent product gets very little attention these days in enterprises. At the small to medium business level it still gets bandied about – often for desktops as much as for servers, but it increasingly seems that virtualisation and snapshots have gobbled up most of the BMR space in the enterprise.

It seems that over time even that space may become narrowed. Looking at Mac OS X as an example, the ability to do a new system install referencing a Time Machine backup is a perfect example of an operating system integrated approach to BMR. Does it solve all BMR issues, even on the OS X platform? No, but it addresses the 80% rule, I believe. Will it be the only such product? I can’t believe so – I have to believe we’ll eventually see something comparable in other operating systems.

What are your thoughts?

 

Everyone makes mistakes. That’s part of being human. Indeed, I’d suggest that anyone who expects you to never make mistakes in your job may perhaps be either demented or live at right angles to reality.

I believe the best, the most realistic and useful thing we can aim for is to never make the same mistake twice.

Thus, below I present my “worst recovery ever” as an example of a mistake that I certainly don’t intend to ever have happen to me again. It happened in my last job, and since then I’ve changed the way I work when it comes to recoveries.

It was Friday afternoon – about 3pm in fact. There was a training course running, and for once the training network was behaving itself. As our management had (once) bought into the notion of network computing, our training environment was sufficiently convoluted such that Sun-Rays referred to our production backup server+fileserver+SunRay server, then ran RDP to a VMware server.

One of our engineers who had little to do with NetWorker (ironically, he was hired to be my field replacement for NetWorker, but circumstances changed when he was hired) ran one of those notoriously bad RedHat updates that for a while killed glibc and a bunch of other system files if you didn’t happen to be in North America. So after much diagnosing and discussing, it was necessary to do an OS recovery. The only problem was that the OS on his laptop was so hosed that you couldn’t start any more login sessions, so you were stuck with what was there.

I started the recovery, selected files and viewed volumes. However, some of the files we needed were on media that was outside of the library, and because we were backing up to disk, then cloning, then staging, NetWorker wanted the (offsite) clones, not the onsite originals. His laptop didn’t have administration rights on the backup server, so I ssh’d across to the backup server, set the appropriate tapes to have a ‘suspect’ status, then ran up the recovery again. I selected the root filesystem (/) and kicked off the recovery.

About 2-3 minutes later someone came to ask me about why they couldn’t access the fileserver any more. Checking, I couldn’t log in either. It was odd – the training course was still working, but nothing new would work.

And then it hit me. I never logged out of the Sun system before I kicked off the recovery. Not only that, when I kicked off the recovery, I’d run “recover -c linuxClient” on the Sun server.

That’s right, I was recovering a Linux filesystem on top of a Solaris system – including /dev, including all the base binaries. And because it was a recovery designed to overwrite a clobbered operating system, I’d told it to force-overwrite everything it came across.

I was …unhappy… with myself. I aborted the recovery, obviously, but the damage had already been done. No-one else could do anything, and I couldn’t start the recovery of the backup server because it was also the Sun-Ray server, and there were a bunch of paying students wrapping up their training course. So I had to wait for the training course to complete before I could even start the recovery. Practically everyone else got an early mark, since there wasn’t much they could do.

The recovery turned out to be somewhat problematic. Because the Solaris OS was hosed, no new processes could be started on that – the only solution, after much consideration, was an OS reinstall. At some point though the OS disks for that server had been taken out to a customer site because that customer had lost their disks, or something along those lines, so an earlier revision Solaris installer disk was used. However, it turned out that earlier revision disk wasn’t compatible with that hardware, and these were the days when Solaris took forever to install when you didn’t have tonnes of RAM. So by 10pm that night, I’d given up on being able to install the OS that night. Murphy’s law strikes as much with recoveries as it does with anything else in IT, so of course I’d not slept a wink the night before, and I lived one and a half hours away from the office by train at the best of times – at that time of the night, 3 hours at least.

I fell into bed around 2am, but fretting over the recovery, didn’t sleep much either. I’d recently purchased a Sun workstation myself, so I knew I had an install disk that would work, which was why I’d chosen to come home rather than download Solaris from the office. Later that morning saw me heading back into the office to keep going, and of course, my installer of Solaris was now faulty so the OS install hit bad sectors on the disk during the process and I hit my head repeatedly against the column in my office.

Lady luck then struck. Or at least wandered by and waved. There was a detached mirror left over from some disk swapping from about 2-3 months ago.

Finally I managed to boot the system from the previously attached mirror, get that mirror re-syncing, and installed NetWorker. Once the mirrors had resynced the recovery started in earnest, and the system finally came back.

By that stage though it was about 5pm on Saturday afternoon. I was wrecked from having not slept for a couple of days, not to mention still bloody angry with myself for the entire chain of events.

So I vowed I’d never make the same mistake again.

How, I hear you ask, will I prevent myself from ever making this mistake again? I check, check, check. In real recovery situations, I never run a recovery command any longer without first checking what host I’m logged into. It takes me an extra 10 seconds, maybe even fewer, but it guarantees that I don’t get that sick feeling in the pit of my stomach that comes with FUBAR’ing one system when trying to recover another.
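For anyone who would rather not rely on memory at 3pm on a Friday, the same check can be wrapped in a small guard. What follows is a minimal sketch in Python, not something I used at the time: the function names and the example host names are made up for illustration, and the guard only hands back the command rather than running it.

```python
import socket


def on_expected_host(expected_host, current_host=None):
    """Return True only if we are logged into the host the recovery is meant for."""
    if current_host is None:
        current_host = socket.gethostname()  # the host this session is really on

    def short(name):
        # Compare short names, case-insensitively, so "host.example.com" matches "host".
        return name.split(".")[0].lower()

    return short(current_host) == short(expected_host)


def guarded_command(expected_host, command, current_host=None):
    """Return the command unchanged, or raise if this is the wrong host."""
    actual = current_host if current_host is not None else socket.gethostname()
    if not on_expected_host(expected_host, actual):
        raise RuntimeError(
            f"Refusing to run {command!r}: logged into {actual!r}, expected {expected_host!r}"
        )
    return command
```

Used before a recovery, something like `guarded_command("linuxclient", ["recover"])` would raise on the Sun server instead of letting the command through. It is no substitute for looking at the prompt, but it makes the check hard to skip.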

If I had done that simple check all those years ago, I would have had a lovely, quiet weekend.

© 2012 The NetWorker Blog