Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I, however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has done any surveys would claim (rightly so) that my leading questions into the survey may have coloured the results of the survey, and I’d not disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to back up to disk on the backup server, but not using other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with a question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past; and other than offering better rack density I see them having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to back up to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than, say, a virtual machine accessing a physical tape drive does.

So, the survey showed me fairly much what I was expecting I’d see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

- if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
- the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
- I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of all the things I hadn’t considered, the simple ability to move your backup server between virtual servers was the most striking. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to a critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations of running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold:

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. Secondary to this: DR should not be dependent on the original and the destination host having the same names.

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.

 

I want to spend a few minutes discussing something that drives me nuts. It’s something I see quite regularly on technical websites that discuss data protection, and it’s about time I make my opinion clear on it.

The latest instance comes from an article at SearchStorage called “How tiering can improve your backup strategies”. Marc Staimer wrote:

In one example, all data is commonly backed up once a day, put on tape, then shipped offsite. This methodology means that the RPO is 24 hours, and the RTO is a few days or longer. This is not a good idea for an organization’s mission-critical data. First, the process in recovering the data takes much too long, bringing all of the correct tapes back from offsite, and then recovering them in order, (which is subject to common human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion. Second, it assumes all data on all tapes are recoverable. In the end, both introduce unacceptable risks to mission-critical data.

Now, I’m not going to dispute the fact that daily backups to tape can give RPOs of 24 hours or more, and can result in RTOs of more than 24 hours. However, I don’t agree that an RPO of 24 hours is always the case, and I certainly don’t agree that an RTO of 24 hours (or more) is a 100% inevitability. Instead, I want to spend some time picking apart the rest of this junk statement.

Let’s first consider:

[T]he process in recovering the data takes much too long, bringing all of the correct tapes back from offsite, and then recovering them in order, (which is subject to common human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion.

This would be true if we were using archaic backup scripts (perhaps in a completely decentralised environment) with no automation. On the other hand, if you’re using decent, enterprise backup software there are absolutely no reasons why this should be the case. Enterprise class backup software will:


  • Identify which media is required for a recovery.
  • Read only from the media required for a recovery.
  • Seek to positions as close as possible to the recovery point, so as to avoid reading redundant data.

If we look at NetWorker for instance, we know it’s no slouch when it comes to seeking to the right spot on media for rapid single-file recovery. Between file records and media record markers, NetWorker can very quickly direct a tape drive to seek to the optimum location to commence recovery.
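To make the three points above concrete, here is a rough Python sketch of how a backup index lets software pick exactly one piece of media and one seek position for a single-file recovery. It is a toy model only: the `IndexEntry` fields, the `locate` function, the volume labels and the positions are all made up for illustration, and none of it reflects how NetWorker actually stores its file and media records.

```python
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    path: str         # file that was backed up
    backup_time: int  # when it was backed up (e.g. day number)
    volume: str       # hypothetical tape label holding this copy
    position: int     # hypothetical marker on the tape to seek to


def locate(index, path, as_of) -> Optional[IndexEntry]:
    """Return the newest copy of `path` taken at or before `as_of`, or None."""
    candidates = [e for e in index if e.path == path and e.backup_time <= as_of]
    return max(candidates, key=lambda e: e.backup_time, default=None)


# A tiny made-up index: one full (day 1) and one incremental (day 3).
index = [
    IndexEntry("/etc/hosts", 1, "FULL.001", 3),
    IndexEntry("/home/a/report.doc", 1, "FULL.001", 120),
    IndexEntry("/home/a/report.doc", 3, "INCR.004", 7),
]

hit = locate(index, "/home/a/report.doc", as_of=5)
print(f"Recall volume {hit.volume} only, seek to marker {hit.position}")
```

In this model, recovering the accidentally deleted file means recalling a single tape and seeking once, rather than bringing back every tape and reading them in order.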

So my first thought is – if that’s the sort of experience that Marc Staimer has with tape based backup and recovery systems, he’s using the wrong ones, and shouldn’t blame that on tape.

Now let’s cover the second point:

[I]t assumes all data on all tapes are recoverable.

This can only be interpreted to mean one thing: the old “tape is unreliable” mantra. If tape were half as unreliable as every second article on tape makes it out to be, there wouldn’t be a single tape vendor left in the market – they’d have all been sued out of business for deceptive trading and terribly unreliable products.

I’m not claiming that tape is fault free – if I did, I’d have a heck of a lot less cause to do the Ballmer Monkey Dance shouting “Cloning! Cloning! Cloning!” than I do. Tapes aren’t infallible, but I’ve not seen a single published paper citing extreme fault rates of enterprise class media*. On a yearly basis, the number of cases I see at customer sites of tape failure could be counted on a butcher’s right hand**. And you know what? Those instances are almost always at the backup point, not the recovery point.

So where does this leave us? At FUD central.

I’m the first to admit that the role of tape is changing within backup environments – I stated my thoughts on this previously in the article “Direct to Tape is Dead, Long Live Tape”, and I stand by this; so any overall discussion about backup media tiering with a model along the lines of disk->disk->tape or disk->vtl->tape will be the sort of thing I’ll usually heartily agree with.

If someone can point out independent studies showing high tape failure rates for enterprise class tapes – I’d like to know. Until then, let’s talk about valid, non-FUD reasons for pulling tape out of the immediate backup path. These include (but are not limited to):


  • Inability of most environments to stream tape.
  • SLAs requiring faster recovery starts, which in turn necessitate recovery from disk.
  • To allow for more streamlined backup cloning operations.
  • To support target deduplication for nearline backup storage.

Tape “unreliability” is not in that list. Maybe it is in limited environments that are currently using non-enterprise tape.

* On the other hand, the easiest way of storing DAT media after generating your backup is to throw it into the bin. I might trust a DAT with a backup a little more than I’d trust a monkey with a pen to take notes in a court case, but not by much.

** I’m talking an old-style butcher. Before they had to start wearing chain mail gloves.

 

Having recently encountered a situation where a NetWorker client on a customer site repeatedly failed its full backup, I wanted to take a few moments to stress the absolute importance – no, extreme criticality – of always being on top of your full backups.

Specifically:

  • You should always know whether your full backups have succeeded or not for each and every client of your backup system.
  • Unless there are specific management directives to the contrary, you should always re-run full backups in the event of failure as soon as possible.

To put it another way – a set of backups without a full, when it comes to performing a complete filesystem or system recovery, is about as useful as a chocolate teapot. Perhaps even less so.

I’ve described previously the importance of having a zero error policy, and always knowing if failures occur. So this topic could be summarised as being a subset of the zero error policy. However, if I were to be asked what backup I could “afford to lose” in terms of complete system recoverability, I’d pick an incremental any day over a full. (It’s actually a fine line, but it’s still an important differentiation.)

Without a full backup, at best you can pull back bits and pieces of a filesystem. Sure, they might be the most recently modified bits, which in themselves are important, but they’re not the entire filesystem. For most organisations, they barely touch the surface of the filesystem. Incrementals (and for that matter, differentials) are like the proverbial tip of the iceberg – perhaps without the penguins though*. The real monstrosity in a backup environment – the rest of the iceberg – are the fulls.

Let’s consider it this way – in most environments (discounting say, backups of database dump regions) you’ll find that an incremental backup covers somewhere between 5% and 10% of the filesystem. Not only that, the delta change on a day to day basis will also be quite small. That is, in many situations the files that are backed up each day in incremental backup regimes are the same files, modified day after day for working purposes. So while you may have incrementals of even up to 10% per day of your fulls, in turn 90% or more of those files may be the same files each day that are getting backed up in incrementals.

If we look at a 200GB filesystem though, even 10% of that filesystem is just 20GB. So if your full is somehow lost, that’s 180GB that you can’t readily recover. Additionally, the 20GB or so that you can recover is going to be a pig’s breakfast as far as getting it back in any consistent state.
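The arithmetic is trivial, but it is worth plugging your own numbers in. A minimal sketch in Python, using the 200GB and 10% figures from the paragraph above (the function name is my own, and the 10% is the upper-end assumption, not a measured value):

```python
def recoverable_without_full(filesystem_gb, incremental_percent):
    """Return (recoverable_gb, lost_gb) when only the incrementals survive."""
    recoverable = filesystem_gb * incremental_percent / 100
    return recoverable, filesystem_gb - recoverable


# Figures from the example: a 200GB filesystem, incrementals covering 10% of it.
recoverable_gb, lost_gb = recoverable_without_full(200, 10)
print(f"Recoverable from incrementals alone: {recoverable_gb:.0f}GB")
print(f"Not readily recoverable without the full: {lost_gb:.0f}GB")
```

At the lower end of the range (5%), the same filesystem leaves only 10GB recoverable and 190GB gone.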

NetWorker, through its use of saveset dependency chains, will do its utmost to protect you from regular saveset failures. If a full filesystem backup fails, subsequent incrementals will be chained onto the previous dependency set, retaining the previous full backup for a longer period of time.
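As an illustration of that chaining behaviour, here is a small Python sketch. It is a simplified model under my own assumptions (a `Backup` record with a day, a level and a success flag, and a `build_chains` helper), not NetWorker's actual implementation: a successful full starts a new chain, and anything that follows a failed full simply extends the previous chain.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    day: int
    level: str       # "full" or "incr"
    ok: bool = True  # did the backup succeed?


def build_chains(backups):
    """Group successful backups into chains, each starting at a successful full."""
    chains = []
    for b in sorted(backups, key=lambda b: b.day):
        if not b.ok:
            continue              # a failed backup contributes nothing
        if b.level == "full":
            chains.append([b])    # a good full starts a fresh chain
        elif chains:
            chains[-1].append(b)  # incrementals hang off the latest good full
    return chains


# Full on day 1, incrementals on days 2-7, the day 8 full fails, then more incrementals.
history = [Backup(1, "full")] + [Backup(d, "incr") for d in range(2, 8)]
history += [Backup(8, "full", ok=False), Backup(9, "incr"), Backup(10, "incr")]

chains = build_chains(history)
print(f"{len(chains)} chain(s); oldest full from day {chains[0][0].day}; "
      f"{len(chains[0])} backups depend on it")
```

In this model the day 1 full is still holding up everything taken since, which is exactly why the chain keeps growing until a full finally succeeds.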

It’s important we don’t let those dependency chains just keep building and building. They need to be broken and restarted so that we don’t get into messy situations or use up too much media. That’s why you should have a policy to rerun a full backup as soon as possible if it fails, rather than just waiting for the next one. (Further, I’ve far too often seen that sites with a “just wait until the next full backup runs” policy continually miss full backup failures, often for months at a time, because that sort of attitude also seems to be accompanied by informal record keeping.)

The next thing to consider is that we mustn’t just arbitrarily break dependency chains ourselves. By this, I’m referring to manually recycling media without regard to what may depend on that media, just because we need to free up volumes or have policies that media should be recycled after a certain length of time.

More than anything else, I see this as the reason companies find themselves in situations where NetWorker returns an “Unknown” volume being required for recovery. In this situation, NetWorker knows there should be a full backup, but it doesn’t have access to it, and therefore it can’t do anything to get the complete filesystem (or other type of data) recovered. Or, if there’s going to be a significant recovery error

Your full backups are like gold. No, gold isn’t special enough. Platinum, maybe. Or some combination of gold, platinum and saffron. They’re not to be cavalierly deleted, they’re not to be ignored, and they’re not to be left unchecked. (They’re not to be uncloned, either.)

In actual fact, it really doesn’t matter what your backup product is. What always matters is that your full backups are done, they’re done as soon as possible around the scheduled time, they’re successful, they’re known to be successful, and they’re successfully cloned. If any of those factors aren’t in play, you’ve got to get it fixed straight away.


* Unless they’re incrementals from a Linux system, of course.

 

I thought it about time that I cited the two key reasons why, if faced with a choice between NetWorker and NetBackup, I would choose NetWorker every time.

As you might expect, given my focus on backup as insurance, both of these reasons are firmly focused on recovery. In fact, so much so that I still don’t really understand why EMC doesn’t go to market with these points time and time and time again and just smack Symantec around until it’s blue in the face and begging for mercy.

Reason 1: NetBackup does not implement backup dependencies

I struggle to call NetBackup an “enterprise” backup product because of this simple fact. Honestly, backup dependencies are critically important when it comes to guaranteeing anything but last-backup recoverability.

What does this mean?

In short, as soon as a backup hits its retention period in NetBackup, it’s toast – it’s a goner.

Irrespective of whether there are any backups of the same filesystem/data set that requires the “outside retention” backup for recovery purposes.

I can’t sum this up any other way: in a backup product, I see this as recklessly irresponsible. It provides a focus on media savings that even the most miserly bean cruncher would admire. Well, until the bean cruncher’s system can’t be recovered from 6 weeks ago to fulfil audit requirements.
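To show what that difference means in practice, here is a toy Python model of the two expiry behaviours. It is my own simplification, not the algorithm either product uses: backups are represented only by the day they were taken, one chain is a full plus its incrementals, and the function names are hypothetical.

```python
def expired(day, today, retention_days):
    """A backup has aged out once it is at least `retention_days` old."""
    return today - day >= retention_days


def removable_without_dependencies(chain_days, today, retention_days):
    """Each backup is judged purely on its own age."""
    return [d for d in chain_days if expired(d, today, retention_days)]


def removable_with_dependencies(chain_days, today, retention_days):
    """Nothing in the chain goes until everything that depends on it has expired."""
    if all(expired(d, today, retention_days) for d in chain_days):
        return list(chain_days)
    return []


# Full on day 1, incrementals on days 2-7, six weeks of retention, today is day 45.
chain = list(range(1, 8))
print("Without dependencies:", removable_without_dependencies(chain, 45, 42))
print("With dependencies:   ", removable_with_dependencies(chain, 45, 42))
```

Without dependency tracking, the full from day 1 is gone while the incrementals from days 4 to 7 are still inside their retention period, so they have nothing left to be applied on top of.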

Reason 2: True Image Recovery is “optional”

If you’ve grown up in a NetWorker world, where the emphasis has always been, and will always continue to be on recovery, this will, like the reason above, make you soil yourself. Imagine having a full backup plus six incremental backups of a directory, and wanting to recover the filesystem from last night. Now imagine just selecting the full plus the incrementals for recovery and getting back everything generated during that time.

Even the files that had been deleted between backups. I.e., you don’t get back what the filesystem looked like at the time of the backup that you’re recovering from, but what it looked like for every backup that you’re recovering from.

NetWorker, once, in the 5.5.x stream implemented this. It was called a BUG. In NetBackup, it’s a “feature”. In order to enable a correct recovery, you have to turn on “true image recovery”, something that takes extra resources, and the typical advice is to keep that data for just a small cycle (e.g., 7 days) rather than for the complete retention time of the backups.

There’s another word for this: Joke.
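For anyone who hasn't hit this before, the difference between the two recovery behaviours can be sketched in a few lines of Python. This is an illustrative model under my own assumptions (each backup is a dict holding the files it saved plus a listing of what existed on the filesystem at that time); it is not how either product stores or restores data.

```python
def overlay_restore(backups):
    """Lay every backup over the previous one; deleted files come back."""
    restored = {}
    for backup in backups:
        restored.update(backup["files"])
    return restored


def true_image_restore(backups):
    """Overlay, then keep only what existed at the time of the last backup."""
    restored = overlay_restore(backups)
    present = backups[-1]["listing"]
    return {name: data for name, data in restored.items() if name in present}


backups = [
    # Full: three files exist.
    {"files": {"a.txt": "v1", "b.txt": "v1", "c.txt": "v1"},
     "listing": {"a.txt", "b.txt", "c.txt"}},
    # Incremental 1: b.txt modified, c.txt deleted by the user.
    {"files": {"b.txt": "v2"}, "listing": {"a.txt", "b.txt"}},
    # Incremental 2: d.txt created.
    {"files": {"d.txt": "v1"}, "listing": {"a.txt", "b.txt", "d.txt"}},
]

print("Overlay restore:   ", sorted(overlay_restore(backups)))
print("True image restore:", sorted(true_image_restore(backups)))
```

The overlay restore resurrects c.txt even though it was deleted before the last backup; the true image restore gives back the filesystem as it actually looked at that point.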

On another front…

As recently as December I mentioned that I wished EMC would get their act together and implement inline cloning – one of the few things where I saw that NetBackup had a distinct competitive advantage over NetWorker.

Maybe it was the glow of the cider, but I had an epiphany in Copacabana on a hill watching (probably illegal) fireworks in Avoca and Terrigal on New Year’s Eve. Inline cloning is no longer a compelling factor in a backup product. Why? Media streaming speeds have reached a point where companies with serious amounts of data just should not be implementing direct-to-tape backup solutions any more. Inline cloning was developed at a time when you’d want to generate both sets of tapes as quickly as possible, but only companies with very small data sets will find themselves not backing up to some disk unit first (be it say, ADV_FILE, or VTL, in NetWorker), and those companies won’t be constrained on backup/clone windows to a point where they’d need inline cloning anyway.

When not backing up direct-to-tape, there are several factors that mitigate the need to do inline cloning. In organisations with a very strong need for offsiting, there’s replication at a VTL or disk backup unit layer. In organisations that just need a second copy generated “as soon as possible”, doing disk/virtual tape to physical tape cloning following the backup should be fast enough to handle the cloning at appropriate performance levels.

In other words: there’s no need for EMC to implement inline cloning. As a technology, it’s a dead-end from a tape-only time. I feel somewhat silly this didn’t occur to me sooner.

 

Are you still backing up Novell NetWare hosts? If you are, I hope you’re actively considering what you’re going to do in relation to NetWare recoveries in March 2010, when NetWare support ceases from both Novell and EMC.

I still have a lot of customers backing up NetWare hosts, and I’m sure my customer set isn’t unique. While Novell still tries to convince customers to switch from traditional NetWare services to NetWare on OES/SLES, a lot of companies are continuing to use NetWare until “the last minute”.

The “last minute” is of course, March 2010, when standard support for NetWare finishes.

Originally, NetWare support in NetWorker was scheduled to finish in March 2009, but partners and customers managed to convince EMC to extend the support to March 2010, to match Symantec and co-terminate with Novell’s end of standard support for NetWare as well.

Now it’s time we start considering what happens when that support finishes. Namely:

  1. How will you recover long term NetWare backups?
  2. How will you still run NetWare systems?
  3. How will you manage NetWorker upgrades?

These are all fairly important questions. While we’re hopeful we might get some options for recovering NetWare backups on OES systems (i.e., pseudo cross-platform recoveries), there are obviously no guarantees of that as yet.

So the question is – if you’re still using NetWare, how do you go about guaranteeing you can recover NetWare backups once NetWare has been phased out of existence?

The initial recommendation from Novell on this topic is: keep a NetWare box around.

I think this is a short-sighted recommendation on their part, and shows that they haven’t properly managed (internally) the transition from traditional NetWare to NetWare on OES/SLES. This is perhaps why there isn’t a 100% transition from one NetWare platform to the other. Being faced with unpalatable transition options, some Novell customers are instead considering alternate transitionary options.

Unfortunately, in the short term, I don’t see there being many options. I’m therefore inclined to recommend that:

  1. Companies backing up traditional NetWare who only need to continue to recover a very small number of backups consider performing an old-school migration – recover the data to a host, and back it up on an operating system that will continue to enjoy OS vendor and EMC support moving forward.
  2. Companies backing up larger amounts of traditional NetWare should consider virtualising at least one, preferably a few more NetWare systems before end of support, and keeping good archival VM backups (to avoid having to do a reinstall), using those systems as recovery points for older NetWare data.

The longer-term concern is that the NetWare client in NetWorker has always been … interesting. Once NetWare support vanishes, the primary consideration for newer versions of NetWorker will be whether those newer versions actually support the old 7.2 NetWare client for recovery purposes.

With this in mind, it will become even more important to carefully review release notes and conduct test upgrades when new releases of NetWorker come out to confirm whether newer versions of the server software actually support communicating with the increasingly older NetWare client until such time as recovery from those NetWare backups is no longer required.

You may think this is a bit extreme, but bear in mind we don’t often see entire operating systems get phased out of existence, so it’s not a common problem. To be sure, individual iterations or releases may drop out of support (e.g., Solaris 6), but the entire operating system platform (e.g., Solaris, or even more generally, Unix) tends to stay in some level of support. In fact, the last time I think I recall an entire OS platform slipping out of NetWorker support was Banyan Vines, and the last client version released for that was 3 point something. (Data General Unix (DGUX) may have ceased being supported more recently, but overall the Unix platform has remained in support.)

If you’re still backing up NetWare servers and you’re not yet considering how you’re going to recover NetWare backups post March 2010, it’s time to give serious consideration to it.

© 2012 The NetWorker Blog