Yesterday I experienced one of those weird NetWorker issues that is such an odd combination of factors that I felt it had to be discussed.

Here’s the scenario. A customer:

  • Had previously been running NetWorker 7.4.2 on their backup server.
  • Had upgraded the server to 7.5.1.
  • Had a bunch of Windows clients and one Unix client.
  • Had the Unix client configured for filesystem backups and Oracle backups.
  • Had all clients running 7.4.2(ish), with version 4.5 of the Oracle module.
  • Found that once the upgrade was done, Unix filesystem backups continued to work but the Oracle backups would fail with:
client:RMAN:/path/to/script.rman 1 retry attempted
client:RMAN:/path/to/script.rman off
client:RMAN:/path/to/script.rman /path/to/nsrnmo[291]: -l:  not found
client:RMAN:/path/to/script.rman nsrnmostart returned status of 127
client:RMAN:/path/to/script.rman /path/to/nsrnmo exiting.

My first thought when a colleague asked me to have a look at it was that there was just enough of an incompatibility between 7.5.x and NMO 4.5 that some argument carried over from an earlier version of NMO was causing problems when talking to a 7.5.x server. This wasn’t the case. (Yes, I knew that the two versions are meant to be compatible, and when I’ve installed and used them they have been, but that doesn’t mean you can’t have one single setting somewhere that tickles a coding error across versions.)

I went back and forth with a few other checks with the customer, noting that there were various issues reported in the NMO applogs, but none specific enough to nail the problem. So since everything looked OK I agreed with the customer that a WebEx would probably help us solve the issue faster.

Even though the customer had given me the client resource and I hadn’t found anything wrong with the backup command or the save set name, out of curiosity I asked the customer at the start of the WebEx to show me the client details. The saveset looked fine, so we jumped across to the backup command, and that also looked fine. But then, underneath the backup command, there was the “save operations” field, and that field held:

VSS:*=off

It hadn’t been recently added. It had been there since before the upgrade, and before the upgrade the backups had been working. But as we know, on pre-VSS Windows systems that setting will cause backup failures, so I asked the customer to remove that entry and start the backup. Neither of us really thought that this would solve the problem, given the filesystem backups were still working, but lo and behold, with that removed the Oracle RMAN backups started working properly.

In retrospect, this of course was definitely the problem, but working it out was a bit more challenging. The reason was that the configuration shouldn’t have worked under a NetWorker 7.4.x server either, but for some reason it did. The 7.4.x NetWorker server was likely not sending through the VSS directive to the Unix client and the Unix Oracle module, but having upgraded to 7.5.x, the new install stopped “filtering the error” and started causing the problem to manifest. Or alternatively, 7.4.x and 7.5.x both send the save operations setting, but just differently enough to be dangerous.

I wouldn’t exactly say this was NetWorker’s fault – those VSS options are only designed for use with Windows 2003 and higher clients, and I’d guess that the VSS:*=off was just applied to every single client on the customer site without considering the 1 x Unix client.
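A quick audit for this kind of blanket setting could be sketched as follows. This is only an illustration, not a NetWorker API: it assumes the client resources have already been exported into plain Python dictionaries, and the field names (name, os, save_operations) and the client names are hypothetical.

```python
def find_misapplied_vss(clients):
    """Return the names of non-Windows clients whose save operations mention VSS.

    Each client is a dict with hypothetical keys: "name", "os", "save_operations".
    """
    flagged = []
    for client in clients:
        is_windows = client["os"].lower().startswith("windows")
        has_vss = "vss:" in client.get("save_operations", "").lower()
        if has_vss and not is_windows:
            flagged.append(client["name"])
    return flagged


# Hypothetical inventory resembling the site described above.
clients = [
    {"name": "winapp01", "os": "Windows 2003", "save_operations": "VSS:*=off"},
    {"name": "oradb01", "os": "Solaris", "save_operations": "VSS:*=off"},
    {"name": "winapp02", "os": "Windows 2003", "save_operations": ""},
]

print(find_misapplied_vss(clients))  # ['oradb01']
```

Only the Unix client is flagged, which is exactly the one client where the setting should never have been applied.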

In retrospect, the following line now completely makes sense:

client:RMAN:/path/to/script.rman off

That was our only “hint” as to the cause of the problem in the savegroup completion. It wasn’t enough by a long stretch. Sometimes, and this is the challenging bit – sometimes you can have configuration errors even if you haven’t changed the actual resource configuration. Different versions of NetWorker will react differently to an incorrect configuration – so the upgrade didn’t cause the problem, it just allowed the problem to appear.

 

If you’re backing up Oracle with the NetWorker module/RMAN, there are an extremely large number of options you can choose from. RMAN, after all, is a complete backup/recovery system in and of itself, and so when you combine RMAN and NetWorker you, well, find yourself swimming in options.

One such option is the allocate channel command within RMAN. If you’ve not seen a basic RMAN script before, I should put one here for your reference:

connect target rman/supersecretpassword@DB10;

run {
 allocate channel t1 type 'SBT_TAPE';
 send 'NSR_ENV=(NSR_SAVESET_EXPIRATION=14 days,
       NSR_SERVER=nox,NSR_DATA_VOLUME_POOL=Daily)';

 backup format '/%d_%p_%t.%s/'
 (database);

 backup format '/%d_%p_%t_al.%s/'
 (archivelog from time 'SYSDATE-2');

 release channel t1;
}

You’ll note that one of the first commands used in the script is the allocate channel command. This effectively tells RMAN to open up a line of communication with NetWorker. Now, you can consider an RMAN channel to be a unit of parallelism in NetWorker parlance. Thus, if you want to back up (larger) databases with higher levels of parallelism, you need to allocate more channels.

In many NetWorker/Oracle scenarios, the NetWorker administrator has very little, if any, control over the construction and the configuration of the RMAN script. (The introduction of v5 of the module may change this.)

As a consequence, there’s often a reduced level of communication between the NetWorker administrator and the Oracle DBA, which can result in reduced performance or scheduling conflicts. One particular issue that can occur, though, is that the Oracle DBA, eager to have the database backed up as quickly as possible, will throw in a lot of allocate channel commands. That little script above may become something like:

connect target rman/supersecretpassword@DB10;

run {

 allocate channel t1 type 'SBT_TAPE';
 allocate channel t2 type 'SBT_TAPE';
 allocate channel t3 type 'SBT_TAPE';
 allocate channel t4 type 'SBT_TAPE';
 allocate channel t5 type 'SBT_TAPE';
 allocate channel t6 type 'SBT_TAPE';
 allocate channel t7 type 'SBT_TAPE';
 allocate channel t8 type 'SBT_TAPE';

 send 'NSR_ENV=(NSR_SAVESET_EXPIRATION=14 days,
       NSR_SERVER=nox,NSR_DATA_VOLUME_POOL=Daily)';

 backup filesperset 4
 format '/%d_%p_%t.%s/'
 (database);

 backup format '/%d_%p_%t_al.%s/'
 (archivelog from time 'SYSDATE-2');

 release channel t1;
 release channel t2;
 release channel t3;
 release channel t4;
 release channel t5;
 release channel t6;
 release channel t7;
 release channel t8;
}

However, there’s a catch to lots of channels being allocated: channel allocation neither affects nor is affected by NetWorker client parallelism. You see, the NetWorker client instance has a single saveset, namely the RMAN script name (or equivalent thereof, when using the Wizard in v5). Thus, to NetWorker, any Oracle client instance only has one saveset. Client parallelism therefore does not limit the number of channels that can be allocated, but instead the number of simultaneous instances of the client that can be initiated.

The net result? Consider a client with parallelism of 4 that has 6 databases to be backed up. This would have 6 client instances, one per database. Assuming they’re all in the same group*, then at any one time NetWorker will only allow the backup for 4 of those instances to be running. However, each instance, or each Oracle RMAN script, can start as many channels as it wants. If each RMAN script has been “tweaked” to allocate, say, 8 channels like the above script example, this would mean that backing up 4 instances simultaneously would potentially see the client trying to send 32 savesets simultaneously to NetWorker.
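The arithmetic above can be sketched in a few lines of Python. This is a rough planning aid under stated assumptions, not anything NetWorker or RMAN provides: it assumes every allocated channel can turn into one simultaneous saveset, and the function names and the sample script are made up for illustration.

```python
import re


def count_channels(rman_script):
    """Count the 'allocate channel' commands in the text of an RMAN script."""
    pattern = r"^[ \t]*allocate[ \t]+channel\b"
    return len(re.findall(pattern, rman_script, re.IGNORECASE | re.MULTILINE))


def worst_case_streams(client_parallelism, channels_per_script):
    """Worst-case simultaneous savesets from one client.

    Client parallelism caps how many scripts run at once; the worst case is
    that the scripts with the most channels are the ones running together.
    """
    busiest = sorted(channels_per_script, reverse=True)[:client_parallelism]
    return sum(busiest)


# Hypothetical two-channel script, shortened from the examples above.
sample_script = """
run {
 allocate channel t1 type 'SBT_TAPE';
 allocate channel t2 type 'SBT_TAPE';
 backup format '/%d_%p_%t.%s/' (database);
 release channel t1;
 release channel t2;
}
"""

print(count_channels(sample_script))   # 2
print(worst_case_streams(4, [8] * 6))  # 32
```

With a parallelism of 4 and six scripts of 8 channels each, the result is the same 32 savesets described above, regardless of which four instances happen to be running.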

Thus, if using multiple Oracle channels in RMAN backups with NetWorker, and particularly if backing up multiple Oracle databases simultaneously, it’s very important that the NetWorker administrator and the DBA responsible for the RMAN scripts communicate effectively and plan overall levels of parallelism and the number of channels, to avoid swamping the NetWorker server, the network, or the Oracle server.


* There are other considerations for starting multiple Oracle backups on the same machine and at the same time. In other words I’m not necessarily calling this best practice, just using an example.

© 2012 The NetWorker Blog