Yesterday I experienced one of those weird NetWorker issues where the combination of factors was so odd that I felt it had to be discussed.

Here’s the scenario. A customer:

  • Had previously been running NetWorker 7.4.2 on their backup server.
  • Upgraded the server to 7.5.1.
  • Had a bunch of Windows clients and one Unix client.
  • Had the Unix client configured for filesystem backups and Oracle backups.
  • Had all clients running 7.4.2(ish), with version 4.5 of the Oracle module.
  • Found that once the upgrade was done, Unix filesystem backups continued to work but the Oracle backups would fail with:
client:RMAN:/path/to/script.rman 1 retry attempted
client:RMAN:/path/to/script.rman off
client:RMAN:/path/to/script.rman /path/to/nsrnmo[291]: -l:  not found
client:RMAN:/path/to/script.rman nsrnmostart returned status of 127
client:RMAN:/path/to/script.rman /path/to/nsrnmo exiting.
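
A side note on reading that output: a status of 127 is the conventional POSIX shell exit code for "command not found", and a message of the form "-l: not found" is what a shell prints when it tries to run -l itself as a command. One way that can happen is a variable that was meant to hold a command name expanding to nothing. That is only an assumption about the mechanism, not something confirmed from the nsrnmo script. A minimal Python sketch that reproduces the same symptom, where WRAPPER is a hypothetical variable name used purely for illustration:

```python
import os
import subprocess


def run_with_empty_command_variable():
    """Run a one-line shell snippet whose command variable is unset.

    WRAPPER is a hypothetical variable name used only for this demonstration;
    it is not taken from the nsrnmo script.
    """
    # Pass only PATH so that WRAPPER is guaranteed to be unset in the child shell.
    env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}
    # With WRAPPER unset, "$WRAPPER -l" expands to just "-l",
    # so the shell tries to execute "-l" as the command name.
    return subprocess.run(
        ["sh", "-c", "$WRAPPER -l"],
        env=env,
        capture_output=True,
        text=True,
    )


result = run_with_empty_command_variable()
print(result.returncode)       # 127 on a POSIX shell
print(result.stderr.strip())   # the shell's own "not found" message for -l
```

The exact wording of the error message differs between shells, but the exit status and the stray -l are the parts that match the savegroup output.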

My first thought when a colleague asked me to have a look at it was that there was just enough of an incompatibility between 7.5.x and NMO 4.5 that some argument carried over from an earlier version of NMO was causing problems talking to a 7.5.x server. This wasn’t the case. (Yes, I knew that the two versions are meant to be compatible, and when I’ve installed and used them they have been, but that doesn’t mean you can’t have one single setting somewhere that tickles a coding error across versions.)

I went back and forth with a few other checks with the customer, noting that there were various issues reported in the NMO applogs, but none specific enough to nail the problem. So since everything looked OK I agreed with the customer that a WebEx would probably help us solve the issue faster.

The customer had already sent me the client resource, and I hadn’t found anything wrong with the backup command or the save set name, but out of curiosity I asked the customer to show me the client details when we started the WebEx. The saveset looked fine, so we jumped across to the backup command, and that also looked fine. But then, underneath the backup command, there was the “save operations” field, and that field held:

VSS:*=off

It hadn’t been recently added. It had been there since before the upgrade, and before the upgrade the backups had been working. But as we know, invoking that on pre-VSS Windows systems will cause backup failures, so I asked the customer to remove that entry and start the backup. Neither of us really thought that this would solve the problem, given the filesystem backups were still working, but lo and behold, with that removed the Oracle RMAN backups started working properly.

In retrospect, this was of course the problem, but working it out was a bit more challenging. The reason was that the configuration shouldn’t have worked under a NetWorker 7.4.x server either, but for some reason it did. The 7.4.x NetWorker server was likely not sending through the VSS directive to the Unix client and the Unix Oracle module, but after the upgrade to 7.5.x, the new install stopped “filtering the error” and the problem started to manifest. Or alternatively, 7.4.x and 7.5.x both send the save operations setting, but just differently enough to be dangerous.

I wouldn’t exactly say this was NetWorker’s fault: those VSS options are only designed for use with Windows 2003 and higher clients, and I’d guess that the VSS:*=off was just applied to every single client on the customer site without considering the one Unix client.
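
A blanket setting like that is easy to audit for once you know to look. The following is only a rough Python sketch of the check, not a NetWorker tool: it assumes the client details have already been exported into simple dictionaries, and the keys name, os and save_operations are hypothetical names chosen for illustration rather than real nsradmin attribute names.

```python
def find_misapplied_vss(clients):
    """Return the names of non-Windows clients carrying a VSS save operations entry.

    Each client is a dict with the hypothetical keys "name", "os" and
    "save_operations"; this shape is an assumption made for the sketch.
    """
    suspects = []
    for client in clients:
        os_type = client.get("os", "").lower()
        save_ops = client.get("save_operations", "").lower()
        # A VSS directive only makes sense on a Windows client.
        if "vss:" in save_ops and not os_type.startswith("windows"):
            suspects.append(client["name"])
    return suspects


# Made-up sample data mirroring the site described above.
clients = [
    {"name": "winclient1", "os": "Windows", "save_operations": "VSS:*=off"},
    {"name": "unixclient", "os": "Unix", "save_operations": "VSS:*=off"},
    {"name": "winclient2", "os": "Windows", "save_operations": ""},
]

print(find_misapplied_vss(clients))  # ['unixclient']
```

Running something along these lines before an upgrade would flag the one Unix client while leaving the Windows clients alone.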

In retrospect, the following line now completely makes sense:

client:RMAN:/path/to/script.rman off

That was our only “hint” as to the cause of the problem in the savegroup completion. It wasn’t enough by a long stretch. Sometimes, and this is the challenging bit, you can have configuration errors even if you haven’t changed the actual resource configuration. Different versions of NetWorker will react differently to an incorrect configuration, so the upgrade didn’t cause the problem, it just allowed the problem to appear.

 

Needing a few interesting things to read at the end of the week?

Here are a few things I’ve found fascinating this week:

  • Why do IT operations suck? An insightful article by Steve O’Donnell. Steve asks why our staff who have primary involvement with systems 24×7 (operators) are often the least skilled, least trained and least paid. (As a consultant, I’ve frequently experienced companies who consider it a waste of time to properly train operators, and as a result their systems usually suffer for it.)
  • Over at Daring Fireball, John Gruber has an article called The Original Tablet. (It’s a great historical perspective on why Microsoft can’t exclusively claim ownership of the tablet idea.)
  • Like many others, I found Google’s slap in the face to China’s net censorship and cyber-warfare activities well timed and highly appropriate. On the other hand, others such as John Obeto over at Absolutely Windows found it not much more than petty PR. Somewhere in the middle is probably the whole story…
  • Over at IT Depends, I found Terri McClure’s views on Microsoft’s requirements for accessing their Azure SLAs to be the same as mine – staggeringly stupid. (According to Microsoft Fanboy site The Register, Microsoft are reviewing their decision on that one.)
  • Storagebod got me thinking again about Availability and Uptime with his article about how availability is measured.
  • Not technically reading, but I’ve finally jumped on board the growing number of listeners to Infosmack. This podcast is run by Greg Knieriemen and Marc Farley, and frequently has guests from many of the storage vendors and other storage bloggers. I’m really regretting that I haven’t been listening to it for longer. It’s definitely going to be a regular podcast for me from now on.
  • Over at Storage Monkeys, Sunshine Mugrabi’s article on EMC’s heavy involvement in social networking is definitely worth reviewing. (For what it’s worth, if you haven’t ever read it, you need to read The Cluetrain Manifesto if you think that all this social networking stuff is rubbish or just a passing fad. It isn’t. Written years before its time, The Cluetrain Manifesto is a clear and articulate series of essays about exactly how important social networking is.)
  • Finally, there have been some interesting discussions on VMware and application level VSS backups through VCB/vSphere. Check my posting here for the summary of the important links to be following about it.

Finishing up, a little about what you’ve been reading: the NetWorker Power Users Guide to nsradmin. The number of downloads has been staggering, far more than I hoped for, and I hope that, like the main blog, the guide proves useful to many a NetWorker administrator.

 

Over at Backup Central, Curtis Preston has written a couple of excellent blog posts to do with VSS.

The first, What is Windows VSS and why should you care? is an excellent overview of how the VSS process works within Windows. Even if you’ve been using VSS within your environment, if you’re not quite sure how it works, this is a great piece to read.

The second delves into issues relating to VMware VCB’s (in)ability to perform consistent application backups, i.e., via VSS for, say, an Exchange or Microsoft SQL guest. Titled Hyper-V ahead of VMware in the Backup Race, it’s a justifiable kick in the pants to VMware, and a pointed warning regarding VMware/VCB backups of applications.

(These two articles, Curtis mentions, came about from some posts by Scott Waterhouse on The Backup Blog, which talked about vSphere backups.)

© 2012 The NetWorker Blog