Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I, however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has run surveys would (rightly) point out that my preface to the survey may have coloured the results, and I wouldn’t disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to back up to disk on the backup server, but not to other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with a question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past; and other than offering better rack density I see them having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to back up to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than, say, a virtual machine accessing a physical tape drive does.

So, the survey showed me fairly much what I was expecting I’d see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

- if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
- the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
- I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of all the things I hadn’t considered, the simple ability to move your backup server between virtual servers was the big one. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to a critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations of running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold:

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. (Secondary to this: DR should not be dependent on the original and the destination host having the same names.)

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.

 

For a while now I’ve been working with EMC support on an issue that’s only likely to strike sites that have intermittent connectivity between the server and storage nodes and that stage from ADV_FILE on the storage node to ADV_FILE on the server.

The crux of the problem is that if you’re staging from storage node to server and comms between the sites are lost for long enough that NetWorker:

  • Detects the storage node nsrmmd processes have failed, and
  • Attempts to restart the storage node nsrmmd processes, and
  • Fails to restart the storage node nsrmmd processes

Then you can end up in a situation where the staging aborts in an ‘interesting’ way. The first hint of the problem is that you’ll see a message such as the following in your daemon.raw:

68975 10/15/2009 09:59:05 AM  2 0 0 526402000 4495 0 tara.pmdg.lab nsrmmd filesys_nuke_ssid: unable to unlink /backup/84/05/notes/c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 on device `/backup’: No such file or directory

(The above was rendered for your convenience.)
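If you want to check how often a site has hit this, one option is to scan the rendered log for that message and pull out the saveset ID. What follows is only a minimal Python sketch under stated assumptions, not an EMC-supplied tool: the function name `find_failed_unlinks` is hypothetical, the log is assumed to have already been rendered to text, and the last component of the path in the message is assumed to be the saveset ID (which is how it is used in the mminfo query in this post).

```python
import re

# Matches the rendered filesys_nuke_ssid message. The quote characters around
# the device name differ between renderings, so they are matched loosely.
UNLINK_RE = re.compile(
    r"filesys_nuke_ssid: unable to unlink (?P<path>\S+) on device "
    r"[`'‘’](?P<device>[^`'‘’]+)[`'‘’]"
)


def find_failed_unlinks(lines):
    """Return a (saveset_id, path, device) tuple for each matching line."""
    hits = []
    for line in lines:
        match = UNLINK_RE.search(line)
        if match is None:
            continue
        path = match.group("path")
        # Assumption: the final path component is the saveset ID.
        saveset_id = path.rsplit("/", 1)[-1]
        hits.append((saveset_id, path, match.group("device")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_unlinks.py rendered_daemon_log.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for saveset_id, path, device in find_failed_unlinks(handle):
            print(f"{saveset_id}\t{device}\t{path}")
```

Each saveset ID it prints is a candidate for the media database check described next.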

However, if you look for the cited file, you’ll find that it doesn’t exist. That’s not quite the end of the matter though. Unfortunately, while the saveset file that was being staged didn’t stay on disk, its media database details did. So in order to restart staging, it becomes necessary to first locate the saveset in question and delete the media database entry for the (failed) server disk backup unit copy. Interestingly, this is only ever to be found on the RW device, not the RO device:

[root@tara ~]# mminfo -q "ssid=c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342"
 volume        client       date      size   level  name
Tara.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001.RO    fawn      10/15/2009 1287 MB manual  /usr/share
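To remove only the failed copy on the server's volume, you need that instance's clone ID as well as the saveset ID. The sketch below is hedged accordingly: it assumes you have re-run the query with a report specification along the lines of `-r "volume,ssid,cloneid"`, and that the instance is then removed with `nsrmm -d -S ssid/cloneid`; check both against the man pages for your NetWorker version before relying on them. The helper names and the ssid and clone ID values in the sample are hypothetical placeholders, and the script only prints the command rather than running it.

```python
# Illustrative report text only. The ssid and clone id values are made-up
# placeholders, laid out the way a volume/ssid/cloneid report is assumed to look.
SAMPLE_REPORT = """\
 volume        ssid          clone id
Tara.001       1234567890    1255000001
Fawn.001       1234567890    1255000002
Fawn.001.RO    1234567890    1255000003
"""


def parse_mminfo_report(text):
    """Parse 'volume ssid cloneid' rows, skipping headers and blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three fields, the last two being numeric.
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            rows.append(
                {"volume": fields[0], "ssid": fields[1], "cloneid": fields[2]}
            )
    return rows


def delete_commands(rows, server_volume):
    """Build one nsrmm command per saveset instance on the server volume."""
    return [
        f"nsrmm -d -S {row['ssid']}/{row['cloneid']}"
        for row in rows
        if row["volume"] == server_volume
    ]


if __name__ == "__main__":
    # Print the commands for review; nothing is executed here.
    for command in delete_commands(parse_mminfo_report(SAMPLE_REPORT), "Tara.001"):
        print(command)
```

Printing rather than executing is deliberate: deleting media database entries is not something to automate blindly, and the storage node copies (Fawn.001 and Fawn.001.RO here) must be left alone so the staging can be restarted.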

We had hoped that this problem was fixed in 7.5.1.5, but my tests aren’t showing that to be the case. Regardless, it’s certainly around in 7.4.x as well and (given the nature of it) has quite possibly been around for a while longer than that.

As I said at the outset, this isn’t likely to affect many sites, but it is something to be aware of.

 

For some time I’ve wished NetWorker would support both storage node and server functions on Mac OS X. When I had a PPC 17″ PowerBook, this mostly came from the glacially slow performance of running Linux within VirtualPC so as to run up a NetWorker server for testing. (Windows-within-Virtual PC was a dead-loss: the then-current version of NetWorker would not even start within VirtualPC.)

Since Apple made the jump to Intel machines, running a NetWorker server for lab work within a virtual machine has been far more efficient, given that now it’s just virtualisation rather than emulation. However, I’ve been thinking for a while that given the performance options available on Mac OS X, and the amount of data frequently stored on Mac OS X machines, not supporting at least a storage node is foolish.

Now that I have a Mac Pro, my personal belief is that it’s crazy not to support Mac OS X both as server and storage node.

Why, you may ask, would I think this? Is it just some weird combination of the “Mac Fan Boy” and “NetWorker Fan Boy” that I want them joined at the hip like some bizarre Doctor Frankenstein experiment?

[Here's an aside. Why is it that people who defend Apple, and Macs, are immediately declared to be Apple Fan Boys, when PC/Windows users who just as vehemently defend their own platforms declare themselves 'realistic'? There's only one answer: sad hypocrisy. Defending one platform is "hysterical frothing at the mouth buy-in to the reality distortion effect of Steve Jobs", whereas equally defending another platform is "logical". Please, spare me.]

So, jumping off that little soap box, I do actually have a method to my madness here. I honestly think, bang for buck, that the Mac Pro (using Apple’s high end machines as a reference point) represents the sort of significant processing and expansion capability often sought in backup servers. I snapped up a bargain previous generation Mac Pro that features the Intel Xeon 5400 CPUs rather than the current top-of-the-line Nehalem based processors, but this machine has serious processing power. The reason it’s called a workstation by Apple is because of its ability to handle complex graphics – but in reality it’s basically a server in a nice shell. With 8 x 3.2GHz cores and (currently) 12GB of RAM, this is a machine that just absolutely flies at data throughput. With expansion of up to 32GB of RAM, Mac Pros represent in one shiny shell more than enough processing power to run a backup server/storage node for any sized business*.

For companies that are space-conscious, there’s the “server” version of the Mac Pro, the Xserve, which is quite a powerful host in a 1RU enclosure.

Given the client software has already been ported to Mac OS X, the hard work has effectively already been done; server and storage node options are not going to take a significant amount of development effort.

Is there justification in porting server and storage node to Mac OS X? The cynical part of me wants to answer that there’s a hell of a lot better justification in porting server/storage node to Mac OS X than there was in porting the client to Linux PPC, but undoubtedly that would have been done to service some large-scale deal for EMC – i.e., there would have been significant business-incentive to do so.

Is there a business incentive for supporting more than client capabilities on Mac OS X? Well, market share is continuing to grow, as evidenced by Microsoft breaking what is almost universally acknowledged as the golden rule of advertising**. Then there’s the high frequency of use of Mac OS X systems in academia. This may not seem a compelling business case for EMC now, but let’s think a little longer-term – as more and more people become exposed to Macs again during their education (either secondary or tertiary), that exposure is going to influence them in their buying decisions as they move into employment. I.e., short of some catastrophic collapse***, Apple is going to see market share continue to increase – not flatten, not drop, but continue to increase.

In the short term though, another compelling reason is where Apple’s market share is at its highest – multimedia: graphic design, advertising, etc., all feature large amounts of data storage. While there’s some support for client software within Mac OS X, the backup server market in that arena is owned almost exclusively by Retrospect. (Retrospect is a good product, but it is still reasonably limited – definitely a workgroup, rather than an enterprise product.) In short, it seems mad that machines that routinely store tens of terabytes or more of data are denied storage node/dedicated storage node capabilities.

Now, some might argue that the lack of support for Sybase DBAnywhere (which powers GST, the back-end to NMC) would be sufficient cause to stop at the client (or at most, the storage node); after all, if you can’t run the GST/NMC server on the backup server, what’s the point? I have two (I believe valid) responses to this: first, it’s reasonably common to see separation between NMC/GST services and the NetWorker server, not only in environments that have multiple backup servers, but also just for reducing the potential for one service impacting the other. Secondly, there are already examples of NetWorker server platforms that don’t have an accompanying NMC/GST server option – Solaris/AMD springs to mind immediately, and I know there are other examples as well.

I do honestly think the point is rapidly approaching (if it is not already here) where there are more compelling reasons to port NetWorker server and storage node to Mac OS X than there are for not doing so. Architecture, data storage volumes, increasing market share and an existing client all point in that direction.


* Note that I’m not saying that they are capable of being the sole backup server for a massive company; just like any other platform, in larger environments, the three-tier approach is always required.

** That rule is that the number one company in an industry should never refer to the number two company in the industry in their advertising.

*** With a market cap that now periodically bounces above Google’s, this seems somewhat unlikely now. (Apple’s market cap now exceeds the combined market cap of HP and Dell.)

© 2012 The NetWorker Blog