Are your service level agreements and your backup software support contracts in alignment?

A lot of companies will make the decision to run with “business hours” backup support – 9 to 5, or some variant like that, Monday to Friday. This is seen as a cheaper option, and for some companies, depending on their requirements, it can be a perfectly acceptable arrangement too. That’s usually the case where there are no SLAs, or smaller environments where the business is geared to being able to operate for protracted periods with minimal IT.

What can sometimes be forgotten in attempts to restrain budgets is whether reduced support for production support systems has any impact on meeting business requirements relating to service level agreements. If for instance, you have to start getting data flowing back within 2 hours of a failure, a system fails at midnight and the subsequent recovery has issues, your chances of being able to hit your service level agreement start to plummet if you don’t have a support contract that guarantees you access to help at this point in time.

A common response to this from management – be it IT, or financial – is “we’ll buy per-incident support if we need to“. In other words, the service level agreements the business has established necessitates a better support contract than is budgeted for, so it is ‘officially’ planned to “wing it” in the event of a serious issue.

I describe that as an Icarus Support Contract.

Icarus, as you may remember, is from Greek mythology. His father Daedalus fashioned wings out of feathers and wax so that he and Icarus could escape from prison. They escaped, but Icarus, enjoying the sensation of flight so much, disregarded his father’s warnings about flying too high. The higher he got, the closer he was to the sun. Then, eventually, the sun melted the wax, his wings fell off, and he fell to his death into the sea.

Planning to buy per-incident support is effectively building a contingency plan based on unbooked, unallocated resources.

It’s also about as safe as relying on wings held together by wax when flying high. Sure, if you’re lucky, you’ll sneak through it; but is do you really want to trust data recovery and SLAs to luck? What if those unbooked resources are already working on something for someone who does have a 24×7 contract? There’s a creek for that – and a paddle too.

In a previous job, I once discussed disaster recovery preparedness with an IT manager at a financial institution. Their primary site and their DR site were approximately 150 metres away from one other, leaving them with very little wiggle room in the event of a major catastrophe in the city. (Remember, the site being inaccessible can be just as deadly to business as the site being destroyed – and while there’s a lot less things that may destroy two city blocks, there’s plenty more things that might cut off two city blocks from human access for days.)

When questioned about the proximity of the two sites, he wasn’t concerned. Why? They were a big financial institution, they had emergency budget, and they were a valued customer of a particular server/storage manufacturer. Quite simply, if something happened and they lost both sites, they’d just go and buy or rent a truckload of new equipment and get themselves back operational again via backups. I always found this a somewhat dubious preparedness strategy – it’s definitely an example of an Icarus support contract.

I’ve since talked to account managers at multiple server/storage vendors, including the one used in this scenario, and all of them, in this era of shortened inventory streams, have scoffed at the notion of being able to instantly drop in 200+ servers and appropriate storage at the drop of a hat – especially in a situation where there’s a disaster and there’s a run on such equipment. (In Australia for instance, a lot of high end storage kit usually takes 3-6 weeks to arrive since it’s normally shipped in from overseas.)

Icarus was a naïve fool who got lost in the excitement of the moment. The fable of Icarus teaches us the perils of ignoring danger and enjoying the short-term too much. In this case, relying on future unbooked resources in the event of an issue in order to save a few dollars here and there in the now isn’t all that reliable. It’s like the age-old tape cost-cutting: if you manage to shave 10% off the backup media budget by deciding not to backup certain files or certain machines, you may very well get thanked for it. However, no-one will remember congratulating you when there’s butt-kicking to be done if it turns out that data no longer being backed up actually needed recovery.

So what is an Icarus support contract? Well, it’s a contract where you rely on luck. It’s a gamble – that in the event of a serious problem, you can buy immediate assistance at the drop of a hat. Just how bad can planning on being lucky get? Well, consider that over the last 18 months the entire world has been dealing with Icarus financial contracts – they were officially called Sub-Prime Mortgages, but the net result was the same – they were contracts and financial agreements built around the principle of luck.

Do your business a favor, and avoid Icarus support contracts. That’s the real way to get lucky in business – to not factor luck into your equations.

 

While initially I had some success with Snow Leopard and Mac OS X, I’m increasingly finding that it’s just boiling down to being too random for reliable backups. So far problems mainly seem to occur after a machine has gone to sleep and woken up multiple times – or had its network location changed multiple times. Thus it mainly seems (for the moment) to affect laptops or machines that frequently sleep.

The net result is that you’ll get into situations where several errors will start to happen and you’ll need to eventually reinstall the NetWorker client, reboot, and then potentially reinstall the NetWorker client another time. Note that complete cold restarts do not seem to as reliably fix (or temporarily offer a workaround to the) issues as does the reinstall/reboot/reinstall method.

Error 1

Attempts to connect from the server to the client will fail – e.g.,

[root@nox ~]# nsradmin -p 390113 -s archon
39078:nsradmin: RPC error: Remote system error
There does not appear to be a NetWorker nsrexecd server running on archon.

Error 2

Stopping and restarting the NetWorker services on the client fails:

root@archon ~
$ SystemStarter stop NetWorker
Stopping NetWorker Client.
root@archon ~
$ ps -eaf | grep nsr
0  5381  5230   0   0:00.00 ttys001    0:00.00 grep nsr
root@archon ~
$ SystemStarter start NetWorker
Starting NetWorker Client.
/Library/StartupItems/NetWorker/NetWorker: line 10:  5389 Illegal instruction     /usr/sbin/nsrexecd

Error 3

I’m finding that directives are getting confused over directories and paths too:

* archon:/ 70340:savepnpc: ignoring directory specification for `/Users/preston/Library/Application Support/Yojimbo/' in
* archon:/ `/Users/preston/Library/Application Support/Yojimbo/.nsr' - not contained within directory `/users/preston/Library/Application Support/Yojimbo/'
* archon:/ 70340:savepnpc: ignoring directory specification for `/Users/preston/Library/Parallels/' in
* archon:/ `/Users/preston/Library/Parallels/.nsr' - not contained within directory `/users/preston/Library/Parallels/'

It seems to be a spurious error – usually when this happens the directives are still processed.

Error 4

On some backups – usually full, I get hundreds of malloc errors in the savegroup completion – e.g.,

* archon:/ savepnpc(668,0xa0a01500) malloc: *** error for object 0x20: pointer being freed was not allocated
* archon:/ *** set a breakpoint in malloc_error_break to debug
* archon:/ savepnpc(668,0xa0a01500) malloc: *** error for object 0x20: pointer being freed was not allocated
* archon:/ *** set a breakpoint in malloc_error_break to debug

What I’m doing

I’ve currently got a question case open with EMC asking when we’ll get official support for Snow Leopard. I’ll update this blog with details when I can.

[Update] There are existing escalations to get Snow Leopard support. The current tentative schedule, I’m told, is for support in NetWorker 7.6 SP1. There’s apparently escalations against 7.5.x as well – personally, if I were a betting person, I’d be betting we’ll more likely get support in just 7.6 via SP1 rather than both 7.6 and the 7.5 tree.

 

Yesterday I experienced one of those weird NetWorker issues that is such an odd combination of factors that I felt it had to be discussed.

Here’s the scenario. A customer was:

  • Previously running NetWorker 7.4.2 on their backup server.
  • Upgraded the server to 7.5.1.
  • Had a bunch of Windows clients and one Unix client.
  • The Unix client was configured for filesystem backups and Oracle backups.
  • All clients were running 7.4.2(ish). The Oracle module was 4.5.
  • Once the upgrade was done, Unix filesystem backups continued to work but the Oracle backups would fail with:
client:RMAN:/path/to/script.rman 1 retry attempted
client:RMAN:/path/to/script.rman off
client:RMAN:/path/to/script.rman /path/to/nsrnmo[291]: -l:  not found
client:RMAN:/path/to/script.rman nsrnmostart returned status of 127
client:RMAN:/path/to/script.rman /path/to/nsrnmo exiting.

My first thought when a colleague asked me to have a look at it was that somehow there was enough of a slight enough incompatibility between 7.5.x and NMO 4.5 that some argument carried over from an earlier version of NMO was causing problems with talking to a 7.5.x server. This wasn’t the case. (Yes, I knew that the two versions are meant to be compatible, and when I’ve installed and used them they have been, but that doesn’t mean you can’t have one single setting somewhere that tickles a coding error across versions.)

I went back and forth with a few other checks with the customer, noting that there were various issues reported in the NMO applogs, but none specific enough to nail the problem. So since everything looked OK I agreed with the customer that a WebEx would probably help us solve the issue faster.

Even though the customer had given me the client resource, I hadn’t found anything wrong with the backup command or the save set name, so out of curiosity I’d asked the customer when we started the WebEx to show me the client details. The saveset looked fine, so we jumped across to the backup command, and that also looked fine. But then, underneath the backup command, there was the “save operations” field, and in that save operations field held:

VSS:*=off

It hadn’t been recently added. It had been there since before the upgrade, and before the upgrade the backups had been working. But as we know, on pre-VSS Windows systems invoking that will cause backup failures, so I asked the customer to remove that entry and start the backup. Neither of us really thought that this would solve the problem, given the filesystem backups were still working, but lo and behold, with that removed the Oracle RMAN backups started properly working.

In retrospect, this of course was definitely the problem, but working it out was a bit more challenging. The reason was that the configuration shouldn’t have worked under a NetWorker 7.4.x server either, but for some reason it did. The 7.4.x NetWorker server was likely not sending through the VSS directive to the Unix client and the Unix Oracle module, but having upgraded to 7.5.x, the new install stopped “filtering the error” and started causing the problem to manifest. Or alternatively, 7.4.x and 7.5.x both send the save operations setting, but just differently enough to be dangerous.

I wouldn’t exactly say this was NetWorker’s fault – those VSS options are only designed for use with Windows 2003 and higher clients, and I’d guess that the VSS:*=off was just applied to every single client on the customer site without considering the 1 x Unix client.

In retrospect, the following line now completely makes sense:

client:RMAN:/path/to/script.rman off

That was our only “hint” as to the cause of the problem in the savegroup completion. It wasn’t enough by a long stretch. Sometimes, and this is the challenging bit – sometimes you can have configuration errors even if you haven’t changed the actual resource configuration. Different versions of NetWorker will react differently to an incorrect configuration – so the upgrade didn’t cause the problem, it just allowed the problem to appear.

 

On the NetWorker Mailing List, I still frequently see a lot of posts from people who are having various problems with their NetWorker 7.2.x servers.

It’s time to move away from 7.2. I know, it was the last version before nsrjobd; the move to nsrjobd in 7.3, then raw daemon logs in 7.4 can both be a bit shocking, but 7.2 is now critically old and critically out of support. Equally, there’s still a lot of people out there running 7.3 releases of NetWorker. That, too, exited support some time ago, and it’s time to move on from it too.

I’ll agree that within backup, there is a strong logic to the statement “if it ain’t broke, don’t fix it”, but, you have to weigh up that against the simple fact that 7.2.x releases in particular are very old, and 7.3.x releases are fairly aged as well.

Since I’ve been watching more and more of Top Gear, I’ll use a car analogy. Let’s say you’ve got a brand new, top of the line Ferrari. When it needs servicing, do you take it to the official Ferrari shop that provides a 100% warranty on all repairs and whose repairs keep the original vehicle warranty intact, or do you take it to Bill & Joes Motor Fixits ‘R’ Us, who not only might leave you with a car in a worse condition than when you drove it in, but who aren’t certified by Ferrari and thus lose you your new car warranty?

Continuing to backup your environment with a backup product which is long out of support is like outsourcing to Bill & Joes Motor Fixits ‘R’ Us IT Service.

I’ll be the first to admit that even on simple updates you can run into a few hassles. Particularly as you move up the NetWorker version chain you’ll find changes to authentication and name resolution requirements alone that may necessitate some additional work around the time of the update. If your clients are old you’ll also be needing to plan an update for them as soon as possible too, and in some cases, you may find yourself definitely having to update clients if there turns out to be some particularly odd issue.

But I’ll be honest: that little bit of up-front pain is much, much better than hitting a critical backup or recovery problem that can’t be solved without upgrading (or worse, can’t be solved due to incompatibilities between ancient NetWorker versions and modern operating system versions). Planning and implementing a controlled upgrade, even if it does end up having a few hassles, is infinitely better than doing an emergency upgrade without any planning to facilitate a recovery or a backup that has to be done.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

AA is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
BB is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
CC is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
DD is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
EE is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
FF is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
GG is for Growth. Your backup environment should be scoped to handle at least 2 years growth upon implementation. If it isn't, budgets haven't been established correctly.
HH is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
II is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
JJ is for Jeckyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jeckyll might – not storm through them like Mr Hyde, flying apart.
KK is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
LL is for Logs. Treasure your logs. Don't throw them away too quickly, make sure they're backed up too. With access to your logs, you can answer in 3 years time why a backup from yesterday is proving problematic to recover from.
MM is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
NN is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
OO is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
PP is for Procedures; without them, you don't have a backup system at all.
QQ is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
RR is for Recovery, the most important facet of data protection.
SS is for SLAs. (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
TT is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
UU is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
VV is for VTL. Whether you like the need for them or not, they're not going away any time soon.
WW is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet you recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
XX is for X-Ray. If you can't X-Ray your backup status, drill down and see how happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
YY is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
ZZ is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short, nor succinct, yet hopefully useful none-the-less.

 

Looking at the stats both for this new site and the previous site, I’ve compiled a list of the top 10 read articles on The NetWorker Blog for 2009. The top 3 of course match the three articles that routinely turn out to be the most popular on any given month, which speaks something of their relevance to the average NetWorker administrator.

(Note: I’ve excluded non-article pages from the top 10.)

Number 10 – Instantiating Savesets

The very first article on the blog, Instantiating Savesets detailed the importance of distinguishing between all instances of a saveset and a specific instance of a saveset.

This distinction between using just the saveset ID, and using a saveset ID/clone ID combination becomes particularly important when staging from disk backup units. If clones exist and you stage using just the saveset ID, when NetWorker cleans up at the end of the staging operation it will remove reference to the clones as well as deleting the original from the disk backup unit. (Something you really don’t want to have happen.)

Recommendation to EMC: Perhaps it would be worthwhile requiring a “-y” argument to nsrstage if staging savesets from disk backup units and specifying only the saveset ID.

Recommendation to NetWorker administrators: Always be careful when staging that you specify both the saveset and the clone ID.

Number 9 – Basics – Important mminfo fields

In May I wrote about a few key mminfo fields – notably:

  • savetime
  • sscreate
  • ssinsert
  • sscomp
  • ssaccess

Sadly, I didn’t get the result I wanted with EMC on ssaccess. Documented as being updated whenever a saveset fragment is accessed for backup and recovery, the most I could get was an acknowledgement that it was currently broken and to lodge an RFE to get it fixed. (The alternative was to have the documentation changed to take out reference to read operations – something I didn’t want to have happen!)

Recommendation to EMC: ssaccess would be a particularly useful mminfo field, particularly when analysing recovery statistics for NetWorker. Please fix it.

Number 8 – Basics – Listing files in a backup

Want to know what files were backed up as part of the creation of a saveset? If you do, you’re not unique – this has remained a very popular article since it was written in January.

Recommendation to EMC: This information can be retrieved via a combination of mminfo/nsrinfo, but it would be handy if NMC supported drilling down into a saveset to provide a file listing.

Number 7 – Using yum to install NetWorker on Linux

NetWorker’s need for dependency resolution on Linux for installation of the client packages in particular drew a lot of people to this article.

Number 6 – Basics – mminfo, savetime, and greater than/less than

This article explained why NetWorker uses the greater than and less than signs in mminfo in a way that newcomers to the product might find backwards. If you’re not aware of why mminfo works the way it does for specifying savetimes, you should be.

Number 5 – 7.5(.1) changed behaviour – deleting savesets from adv_file devices

This was a particularly unpleasant bug introduced into NetWorker 7.5, thankfully resolved now in the cumulative service releases and NetWorker 7.6

The gist of it is that in NetWorker 7.5/7.5.1 (aka 7.5 SP1), if you deleted a saveset on a disk backup unit, NetWorker would suffer a serious failure where it would from that point have issues cleaning regular expired savesets from the disk backup unit and insist that the disk backup unit had major issues. The primary error would manifest as:

nsrd adv_file warning: Failed to fetch the saveset(ss_t) structure for ssid 1890993582

This was fixed in 7.5.1.2, thankfully.

Recommendation to EMC: Never let this bug see the light of day again, please. (So far you’re doing an excellent job, by the way.)

Number 4 – NetWorker 7.5.1 Released

I’ve recently noticed a disturbing trend among many vendors, EMC included, where once a new release is made of a product, sales and account staff become overly enthusiastic about recommending new releases. This comes on top of not really having any technical expertise. (Please be patient, I’m trying to put this as diplomatically as possible.)

One of the worst instances I’ve seen of this in the last few years was the near-hysterical pumping of 7.5 thanks to some useful features to do with virtualisation in particular. I’ll admit that my articles on the integration between Oracle Module 5 and NetWorker 7.5, as well as Probe Based Backups may have added to this. However, there was somewhat of a stampede to 7.5 when it came out, and consequently, when it had some issues, there was strong enthusiasm for the release of 7.5.1.

This is why, by the way, that IDATA maintains for its support customers a recommended versions list that is not automatically updated when new versions of products come out.

Recommendation to EMC: Remind your sales staff that existing users already have the product, and not to just go blindly convincing them to upgrade. Otherwise you’ll eventually start sounding like this.

Number 3 – Carry a jukebox with you (if you’re using Linux)

During 2009, Mark Harvey’s LinuxVTL project first got the open source LinuxVTL working with NetWorker in a single drive configuration, then eventually, in multi-drive configurations. (Mark assures me, by the way, that patches are coming real soon to allow multiple robots on the same storage node/server.)

Lesson for me: With the LinuxVTL configured on multiple lab servers in my environment, I’ve really taken to VTLs this year, and considerably changed my attitude on using them. (I’ll say again: I still resent that they’re needed, but I now respect them a lot more than I previously did.)

Lesson for others: Even Mark himself says that the open source VTL shouldn’t be used for production backups. Don’t be cheap with your backup system, this is an excellent tool for lab setups, training, diagnostics, etc., but it is not a replacement to a production-ready VTL system. If you want a VTL, buy a VTL.

Number 2 – Basics – Parallelism in NetWorker

Some would say that the high popularity of an article about parallelism in NetWorker indicates that it’s not sufficiently documented.

I’m not entirely convinced that’s the case. But it does go to show that it’s an important topic when it comes to performance tuning, and summary articles about how the various types of parallelism interact are obviously popular.

Lesson for everyone: Now that the performance tuning guide has been updated and made more relevant in NetWorker 7.6, I’d recommend people wanting an official overview of some of the parallelism options checking that out in addition to the article above.

Number 1 – Basics – Fixing “NSR peer information” errors

Goodness this was a popular article in 2009 – detailing how to fix the “NSR peer information” errors that can come up from time to time in the NetWorker logs. If you’re not familiar with this error yet, it’s likely you will eventually as a NetWorker administrator see an error such as:

39078 02/02/2009 09:45:13 PM  0 0 2 1152952640 5095 0 nox nsrexecd SYSTEM error: There is already a machine using the name: “faero”. Either choose a different name for your machine, or delete the “NSR peer information” entry for “faero” on host: “nox”

Recommendation for EMC: Users shouldn’t really need to be Googling for a solution to this problem. Let’s see an update to NetWorker Management Console where these errors/warnings are reported in the monitoring log, with the administrator being able to right click on them and choose to clear the peer information after confirming that they’re confident no nefarious activity is happening.

Wrapping Up

I have to say, it was a fantastically satisfying year writing the blog, and I’m looking forward to seeing what 2010 brings in terms of most useful articles.

 

There was a recent discussion on the NetWorker mailing list as to whether some additional logging information that appeared in 7.4.x was worthwhile or whether it was worthless to the point of getting in the way of an administrator.

So that everyone is across what I’m talking about, the messages that started in 7.4.x are along the lines of:

nsrim: Only one browsable Full exists for saveset X. Its browse period is equal to retention period.

So here’s my take on the discussion: log files aren’t to be resented.

I recognise there’s a point where log files become either useless or waste people’s time. However, there’s really only one time for this – when the exact same information is needlessly repeated. In the case of these log messages though, it’s not the exact same information needlessly repeated. It’s different information – it’s going to be about a different saveset each time.

What is the message about, you may be wondering? Well, I actually don’t 100% know for sure. My suspicion is that it’s a message introduced to deal with processing saveset retention following changes introduced for pool based retention policies. But it doesn’t matter.

One thing that will drive me nuts with just about any product is encountering an issue where there’s insufficient logs to actually work out what is going on. Obviously, there’s a fine line to walk – log too much and you waste space and potentially reveal too much about the IP of the package. However, don’t do enough and it becomes extremely challenging for the people doing support (or the people who write the patches (or the people who wrote the software)) to resolve an issue. I don’t believe that having accurate logs guarantees quickly resolving an issue, but they certainly help – and not having them certainly hinders.

So my point is – don’t resent your log files. The amount of space they generally take up in NetWorker is quite minimal (compared to say, the index region), and so you shouldn’t be concerned about space. Nor, I’ll insist, should you be concerned about how to go about stripping out messages you don’t need to review when scanning log files. Backup administrators of enterprise products in particular should be quite conversant with log analysis and text extraction.

If those extra logged entries allow me to quickly find something in a Knowledge Base, or similarly allows support to find something quickly in an engineering database, or allows a patch developer to isolate the section of code that causes the problem, or allows the core developer to target the section of code to write an enhancement, it’s fantastic, and well worth the extra few bytes here and there that occupy my filesystems.

 

Sometimes I feel like a NetWorker old-timer. (When I don’t feel like a NetWorker old-timer, it doesn’t change the fact that I am.) These days, given the huge architectural gulf between them, I’d suggest that anyone who has been using NetWorker since v4.x or v5.x days is a NetWorker “old timer”. Since I’ve been using it with a trailing edge of v3.x days and heavily from v4.x, that puts me well into that territory.

One of the things NetWorker old timers will remember is that for a lot of its history, it was impossible to change anything to do with pools whenever a backup was running. If for instance, you had a backup going to a Monthly pool, and you wanted to configure a new group that would go to the Daily pool, you had to wait for all backups to complete, even though there was no overlap in pool interests, before you could add that group to the Daily pool.

When the restriction was first relaxed, the NetWorker GUIs would prompt with a warning when pool changes were being made during backup activities to indicate that it wasn’t recommended, but giving a proceed/OK button to force the change. These days, NetWorker is more permissive, but not always more forgiving.

A customer experienced a problem recently where he’d configured a new group, and started the group, only to realise that he’d not configured it to go to the correct pool. Rather than stopping the group and making the pool change, he hoped that he could change the pool settings and see the group start requesting media from the correct pool instead of the Default pool. Once the change was made though, the group kept on asking for media in the Default pool, so he stopped the group, waited a few minutes, and restarted.

NetWorker kept asking for media in the Default pool. The NMC pool configuration pane clearly showed that the group was now configured for the correct pool, but plain as day, the group wanted to still write to the Default pool.

Stopping and starting NetWorker didn’t seem to help either.

When he logged the case, and explained about the pool-change-during-backup, I immediately thought back to how NetWorker previously wouldn’t have allowed such a change to happen, and how it in an interim period used to allow the change to happen after issuing a warning. But what if, I thought, there’s still some locking that can happen which would cause a screw-up if the pool were changed for a group while the group was already requesting media?

So I suggested two courses of action to the customer:

  • EMC engineering’s hated solution: Stop NetWorker, clean out /nsr/tmp, restart, and see if that fixes it.
  • Stop the backup, take the group back out of the pool, restart and allow it to write to Default, then put the group back in the correct pool and run the backup again.

In this case, the customer chose the first option – cleaning out /nsr/tmp. While it wasn’t tested, I equally suspect that the second option would have worked too.

There is a lesson with this: avoid making changes to pools for data which is already actively trying to be written to media. Even though it’s technically supported, operationally it can still cause issues.

 

A few days ago a customer was having a rather odd problem. They’re currently running NetWorker 7.3.3 and getting ready to jump directly to NetWorker 7.5.1, but to do so they wanted to first run up a NetWorker 7.5.1 server and confirm current client types, databases, etc., will backup without issue*.

So the customer installed NetWorker 7.5.1 on a new Linux host, created some devices and pools, but then encountered a particularly odd problem when they went to create the clients. NMC would allow them to fill in all the properties for the client, but when they clicked OK in the new client dialog box, nothing would happen. No errors were produced, but nor were any clients actually created.

When they raised this with me I was a little puzzled for a few minutes, then asked if they were using the NMC that comes with NetWorker 7.5.1, or the NMC that comes with 7.3.3 and had just added the new server to the control zone.

The answer was that the control zone for the existing NMC that came with NetWorker 7.3.3 was just extended to include the 7.5.1 server.

For pools, devices and groups this was not a problem – these were all successfully created on the 7.5.1 server using the 7.3.3 NMC. However, when it came to clients, it wouldn’t work.

The reason is quite simple – as new features and functions are added to NetWorker over time, different fields within a configuration resource may or may not become mandatory. Some of the time this is obvious, because we’re required to fill in certain fields – e.g., client names, schedules, etc. However, in other instances, NetWorker has predefined defaults that it slots into place if a value isn’t entered – e.g., parallelism, priority, browse/retention time, etc. Just because defaults are put into place however doesn’t mean that fields are any less mandatory – it’s all about allowing you to create resources quicker.

So, what’s all this got to do with differing NMC/NSR versions? In short, everything!

You see, what happened for this customer is that between NetWorker 7.3 and 7.5, there has been a raft of client based functionality added – e.g., data deduplication, support for defining a client as being virtual, etc.

Undoubtedly some of these new features have mandatory values – so that if the server is probing details for the clients, it can safely request say, dedupe status or virtual status without worrying about getting an (undefined!) style response. Each version of NetWorker is “aware”, via base configuration, what fields must be supplied when creating a new resource, and thus, the scenario for this customer would have been:

  1. Fill in client properties in NMC 7.3.3
  2. Attempt “client create” 7.3.3 -> 7.5.1.
  3. The 7.5.1 server reviews the proposed client resource and,
  4. The 7.5.1 server rejects the proposed client resource as not having all the mandatory fields filled in.

Should NetWorker/NMC have provided an error to explain what was going wrong? Undoubtedly that would have been good, and I’d suggest that NMC/NSR should be able to better communicate resource creation/update failure in these circumstances. However, that being said, the fundamental problem remained the same – the version of NMC in use couldn’t create new clients because it wasn’t supplying all the mandatory details to the more recent version of NetWorker.

In many small sites, the NMC server and the NetWorker server are on the same host, and are thus upgraded in lock-step. However, for sites where the NMC server is installed on another host, this is a valuable lesson – unless you have a very valid reason, don’t run a version of NMC that matches an earlier version of NetWorker than the current server version. It may work (mostly), but if it does fail, it’s unlikely to be immediately obvious why it’s failing.


* This is what I’d call an excellent upgrade policy – you can read the release notes until they’re 100% memorised, but nothing quite beats actually running up your own test server.

 

It’s easy to get confused on ‘supported’. That is, when EMC (or any other vendor) publishes a guide on say, what operating systems are supported, many will ask whether that means if some operating system X that does not appear in the list will work.

The terms ‘work’ and ‘supported’ are not synonymous, and should not be confused.

I’ll be the first to point out that I routinely use CentOS in my lab – a Linux distribution that is most definitely not on the supported operating systems list. It’s a repackaged RedHat Enterprise Server, and I can install it as many times as I want at zero cost. On the other hand, if I needed to actually buy a RedHat Enterprise Server license for every Linux test VM, I’d be very, very poor.

So clearly, CentOS works with NetWorker, even though it’s not supported.

Would I recommend it being used at a customer site in a full production environment? Not without rigorous caveats.

You see, backup is one of those fundamentally low-level scenarios where taking risks is just plain wrong. It’s like the difference between leading edge and bleeding edge. There’s nothing wrong with being leading edge in the backup environment; many companies depend on being leading edge so they can meet their backup and recovery windows. Bleeding edge though – going out and using untested or uncertified configurations, just asks for trouble. Indeed, the term says it all – bleeding.

There are typically two key reasons why something may ‘work’ but be ‘unsupported’. These are:

  • The vendor has not had a chance to test that particular configuration. I.e., it’s unqualified. For example, a Widgets Inc. Tape Library with LTO-5 drives and four robot heads may just not have made it to the vendor labs for qualification; so, while it may technically work, it’s never been tested.
  • The vendor is not comfortable with the supplier support for the product.

Now, in the case of a solution or a configuration option being unqualified, there’s a solution. EMC for instance will work with customers and partners to determine whether a particular configuration can be qualified – indeed, most vendors have a similar process. While everyone would undoubtedly prefer that they get all the qualification done in their labs, we must also accept that it’s practically impossible to achieve, so some level of on-site qualification must be accepted as required from time to time.

In the second instance though, things are a little more difficult. If a vendor isn’t comfortable that the supplier of a product will be able to suitably support that product at an enterprise level, then getting it qualified is unlikely at best.

In these instances, if you want to deploy unsupported components in your system, ask yourself these questions:

  1. Is there a supported option available?
  2. What are the pros and cons of the supported option vs the unsupported option?
  3. What is the risk to the business if the unsupported option has issues and the vendor refuses to support it?
  4. If the unsupported option is chosen, can a test lab be setup using the supported option so as to prove, at any point, that the use of the unsupported product does not contribute to an issue?

The last point may seem a little odd – after all, if you can afford the supported option for a lab, why wouldn’t you deploy in production? I’ve actually seen this scenario with CentOS – a company couldn’t afford RedHat Enterprise Server licenses for all their production machines, so they deployed CentOS, but they also did buy a RedHat Enterprise Server license for a lab machine. Whenever an issue occurred that required escalation to the vendor, they’d first reproduce it on the RedHat Enterprise Server. That way, when it went to the vendor, they could (rightly) claim an issue on a supported operating system.

Even so, this isn’t necessarily ideal. What was obviously not accounted for here was the potential for a high severity issue occurring. E.g., if a severity-1 fault occurred on a system, where data recovery was imperative, but recreating the configuration would take a long period of time, the risk remained that either (a) an escalation based on an unsupported operating system would be rejected or (b) the SLAs might be blown out of the water recreating the issue on a supported platform in order to get a successful escalation.

In short – the decision to use unsupported software/hardware is not the decision of IT staff. It must be the decision of senior management. It must be signed off, and stakeholders of affected systems and processes must be aware of the potential consequences.

While unsupported does not necessarily imply doesn’t work, it’s important to remember that unsupported can most definitely mean unsupported when it stops working.

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha