Sometimes it’s helpful to run NetWorker in debug mode – but sometimes, you just want to throw the nsrmmd processes into debug mode, and depending on your site, there may be a lot of them.

So, I finally got around to writing a “script” to throw all nsrmmd processes into debug mode. It hardly warrants being a script, but it may be helpful to others. Of course, this is Unix only – I’ll leave it as an exercise to the reader to generate the equivalent Windows script.

The entire script is as follows:

#!/bin/sh

PLATFORM=`uname`

if [ "$PLATFORM" = "Linux" ]
then
	PROCLIST=`ps -C nsrmmd -o pid | grep -v PID`
elif [ "$PLATFORM" = "SunOS" ]
then
	PROCLIST=`ps -ea -o pid,comm | grep 'nsrmmd$' | awk '{print $1}'`
fi

DBG=$1

for pid in $PROCLIST
do
	echo dbgcommand -p $pid Debug=$DBG
	dbgcommand -p $pid Debug=$DBG
done

The above is applicable only to Solaris and Linux so far – I’ve not customised for say, HPUX or AIX simply because I don’t have either of those platforms hanging around in my lab. To invoke, you’d simply run:

# dbgnsrmmd.sh level

Where level is a number between 0 (for off) and 99 (for … “are you insane???”). Running it on one of my lab servers, it works as follows:

[root@nox bin]# dbgnsrmmd.sh 9
dbgcommand -p 4972 Debug=9
dbgcommand -p 4977 Debug=9
dbgcommand -p 4979 Debug=9
dbgcommand -p 4982 Debug=9
dbgcommand -p 4991 Debug=9
dbgcommand -p 4999 Debug=9
Note that when you invoke dbgcommand against a sub-daemon such as nsrmmd (as opposed to nsrd itself), you won’t get an alert in the daemon.{raw|log} file to indicate the debug level has changed.
 

Or, remember quantum physics.

There’s a basic rule of quantum physics, that being the observer effect – the more you watch something, the more likely you are to impact that thing you are watching*.

The observer effect plays out quite a lot in IT, and I like to remind people that it also comes into effect when using NetWorker. However, that’s not just a “Murphy’s law” style scenario where you if spend 3 hours trying to debug a problem and then ask someone to have a look at it, it magically disappears.

No, it’s also the scenario of trying to watch an error condition in debug mode only to find it vanish. Now, this doesn’t happen all the time, but it is likely to happen in certain scenarios, such as race conditions. When you have a race condition, the observer effect can definitely come into play.

Consider it this way – when NetWorker is running without any debug mode enabled, it’s pumping through its tasks as fast as the overall environment (including its own coding) will allow it to complete.

However, when you put NetWorker into debug mode, it has to interrupt its normal flow more often to spit out additional information about what it is doing, what feedback it is getting, and so on. That’s feedback and output that it ordinarily wouldn’t have to deal with. That is, you’re artificially slowing down the run-time of the product. If NetWorker is experiencing a problem that is timing or sequence related, the slowing down of NetWorker may result in the condition being avoided.

Thus, when debugging NetWorker (i.e., putting it into debug mode for extended analysis), it’s reasonably important that you:

  • Start at a low number (e.g., 3, or 9 at the most);
  • Determine whether the problem still occurs;
  • If the problem does still occur but you don’t get enough information, increase debug levels incrementally;
  • If the problem does not occur, decrease the debug levels by “half at a time” (or as close to as possible) and note when the error reappears.

Or more precisely – if planning on using debug mode, don’t just jump in and set it to a ridiculously high number. Not only will it produce vast amounts of logs for you to wade through, it also has the potential of preventing the error from happening at all, which just puts you back to square 1.


* Strangely enough, owners of cats may also be well aware of this. The more you stare at a cat, the more likely it is to stop what it’s doing and react.

 

Those new to NetWorker may wonder what the purpose of the “/nsr/debug” directory is. When working in Windows environments, the path to the directory will be different, but it’s still in the same relative location – a child directory to the “nsr” directory on the backup server.

This is a somewhat special directory that the NetWorker server monitors for files whose presence will change the behaviour of the server on the fly. To my knowledge, there’s never been a definitive list published of files that can be created in the debug directory (the same as there’s never been a definitive list published externally to EMC of all the allowed NSR_ environment variables).

While this is good, in that it protects novices from shooting themselves in the foot, power users such as myself and many others have long been curious about what other delights are to be found in the /nsr/debug directory.

To change the behaviour of the NetWorker server, one would simply create a (optionally empty) file with the correct filename (under Unix, this would be as simple as “touch filename”). Files that still serve some purpose are:

  • no_striped_recover – By default NetWorker will attempt to read from as many pieces of media as it is physically able to simultaneously to faciltiate a recovery. If you’ve got ten tape drives and you need to read from ten tapes, NetWorker can mount all ten and start the recovery in parallel. However, in some circumstances (normally, a Linux only problem), if NetWorker would need to mount more media than can be physically supported by the number of tape drives, the entire recovery may hang. If this is happening, touch no_striped_recover in the /nsr/debug directory to have NetWorker process the recovery sequentially.
  • no_immediate_save – These days the main differences between Power and Network edition in NetWorker are the number of devices and units of parallelism per storage node. This wasn’t always the case – at some point in the past, another difference was that a storage node’s client process could communicate with local nsrmmds via shared memory, rather than through the TCP/IP stack. However, this was eventually dropped, and all storage nodes, regardless of whether you’re on Power or Network edition, have their own client<->nsrmmd communications go through shared memory instead of TCP/IP. The no_immediate_save option is there to turn this feature off if you want. For example, if you have a dedicated storage node that is also a very large Oracle server, it may be that all shared memory is always occupied by Oracle processes, and that it would be more efficient to have client<->nsrmmd communications go through TCP/IP. (Truth be told I think it’s still there more for handling of esoteric bugs or systems where shared memory has, for some reason, been disabled.)

If you’re still on a pre-NetWorker 7.3.2 release, you’re also lucky enough to be able to create a file called no_suppress, which turns off that annoying “x lines of output has been suppressed” message that appears in savegroup completions when you need to debug errors. Unfortunately, someone in their infinite wisdom in NetWorker development* decided that this is a bad thing and pulled it out. Unfortunately, while supposedly there have been various “changes” that allow you to see this information, you can’t, unless you run the backup command manually from the client, which sort of defeats the purposes of true and accurate debugging of issues! (EMC PowerLink documents and manual pages insist you can see it by reviewing the /nsr/tmp/sg/* information and the messages/daemon.raw files. This is, ahem, inaccurate.)

So, if you’re concerned about say, why there is an /nsr/debug directory on your backup server (or client, or storage node), there’s no need to be – it’s all part of the general architecture of NetWorker, and is there to serve a real purpose when needed.


* Prefacing what I’m about to say with I don’t have access to the source code so this is all speculation – I believe this has been done due to the inherent architectural limitations of nsrjobd and using a RAP database to store jobs data. This results in the artificial need to reduce as much as possible information that may appear in the savegroup completion notifications. That’s just my theory, but everything I see in the behaviour of the jobs database backs that theory up.

 

Having a strong support background, and having a preference for working in software rather than hardware, when I use NetWorker’s nsrjb utility with tape libraries I like to get as much feedback as possible on operations.

When performing command line operations with nsrjb, this is relatively straight forward – just add on a few “-v” options to the operation. Indeed, my customers sometimes probably tire of hearing me start every nsrjb command with:

# nsrjb -vvv

It’s become such a force of habit that I literally don’t consider running it without those three verbose flags any more*. If you’re not sure why you would do this, or have never run nsrjb with verbose flags, you should go and do it – you get a lot more information about the progress of the operation, which can actually be very important if you’re debugging or otherwise trying to resolve hanging hardware operations.

Being able to run nsrjb in verbose mode is all well and good, but it doesn’t help you diagnose periodic or random failures if they’re only happening during backup operations. To keep track of progress on these, you need to use a setting within the jukebox resource – the debug trace level option.

This updates the daemon.raw log file with extra details about autochanger operations, in a similar way to the additional details you get when you run nsrjb with additional verbosity flags. It can be set to a number, and like the debug option for a lot of NetWorker commands, defaults to 0 (for no additional information), and can be set to ever larger numbers that generate a lot of extra information. It doesn’t generate the same information as running nsrjb with multiple “-v” flags, but it still helps.

Now, before I go further, I’d suggest that this is probably something you shouldn’t leave enabled all the time. That is, if you’re having an intermittent problem with a tape library, consider turning this on so that you have the option of reviewing log files from overnight backups to see if there’s any relevant/helpful information from the time of failures. However, when you finish your debugging, turn it back off by setting it to zero.

There’s two ways you can set it – either from the command line, using nsradmin, or from NetWorker Management Console when viewing a jukebox resource with “diagnostic” details turned on.

Let’s look at nsradmin first:

# nsradmin -s nox
NetWorker administration program.
Use the "help" command for help, "visual" for full-screen mode.
nsradmin> show name:; debug trace level:
nsradmin> print type: NSR jukebox
                        name: LTO1_LIB;
           debug trace level: 0;
nsradmin> update debug trace level: 9
           debug trace level: 9;
Update? y
updated resource id 31.0.255.110.0.0.0.0.60.6.146.73.0.0.0.0.10.0.0.1(5348)
nsradmin> print
                        name: LTO1_LIB;
           debug trace level: 9;

It’s as simple as that!

If you want to see it from within NMC, you’d make the changes in the “Debug Trace Level” field, as per the screen-shot below:

Advanced tab of Autochanger properties

Advanced tab of Autochanger properties

(The debug trace level setting is the bottom-most option on the left-hand side of this tab.)

You’ll find that debug level of 9 will more than likely be more than enough for any regular fault finding exercise, as the amount of information produced by this is quite high. For instance, at the bottom of this post I’ve got output from a single operation, “nsrjb -lvvv -S 1″ to the volume in slot one into a drive.

Remember when you’re done of course to turn the trace level back off by setting the debug trace level to zero.

nox nsrlcpd getenv_ulong(NSR_LCPD_BASEELEMOFFSET=0,0:65535, 0) returning 0
nox nsrlcpd getenv_ulong(NSR_LCPD_NO_LOAD_EXP=0,0:1, 0) returning 0
nox nsrlcpd getenv_ulong(NSR_LCPD_NOVUELMASC=0,0:1, 0) returning 0
nox nsrlcpd getenv_ulong(NSR_LCPD_AUTOEJECT=0,0:1, 0) returning 0
nox nsrlcpd getenv_ulong(NSR_LCPD_AUTOEXT=0,0:1, 0) returning 0
nox nsrlcpd getenv_ulong(NSR_LCPD_AUTOEXT2=0,0:1, 0) returning 0
nox nsrlcpd getenv_ulong(NSR_LCPD_RES_DRVACCESS=0,0:1, 0) returning 0
nox nsrlcpd Event LE_CONTINUE raised for cmdid 16055
nox nsrlcpd Processing: cmdid 16055 & cmdop 12 for event: LE_CONTINUE
nox nsrlcpd Event LE_COMP_OK raised for cmdid 16055
nox nsrlcpd Responder woke up due to event
nox nsrlcpd LCPD_HDR: server:nox.anywebdb.com vers:1 jbid:LTO1_LIB cmdid:16055 is_master=yes cmdop:UPDATE.
nox nsrlcpd \ncomplete :no
nox nsrlcpd \n
nox nsrmmgd Loading volume `Clone.001' from slot `1' into device `/dev/nst0'.
nox nsrlcpd LCPD_HDR: server:nox.anywebdb.com vers:1 jbid:LTO1_LIB cmdid:16056 is_master=no cmdop:LOAD.
nox nsrlcpd     MOVE MEDIA INFO
nox nsrlcpd         source : S:000031
nox nsrlcpd         source attrlist : NULL
nox nsrlcpd         dest   : D:000001
nox nsrlcpd Event LE_NEWCMD raised for cmdid 16056
nox nsrlcpd Processing: cmdid 16056 & cmdop 4 for event: LE_NEWCMD
Unable to render the following message: freeing unused errinfo with msgid 14947

nox nsrlcpd Event LE_CONTINUE raised for cmdid 16056
nox nsrlcpd Processing: cmdid 16056 & cmdop 4 for event: LE_CONTINUE
nox nsrlcpd lg_lstat(): Calling lstat64().
nox nsrlcpd lg_open(): Calling open64().
nox nsrlcpd lcpd_last_status_set: OK->OK (requested OK).
nox nsrlcpd Event LE_CONTINUE raised for cmdid 16056
nox nsrlcpd Processing: cmdid 16056 & cmdop 4 for event: LE_CONTINUE
nox nsrlcpd lg_lstat(): Calling lstat64().
nox nsrlcpd lg_open(): Calling open64().
nox nsrlcpd LCPD_HDR: server:nox.anywebdb.com vers:1 jbid:LTO1_LIB cmdid:16057 is_master=no cmdop:STATUS.
nox nsrlcpd Responder woke up due to event
nox nsrlcpd LCPD_HDR: server:nox.anywebdb.com vers:1 jbid:LTO1_LIB cmdid:16057 is_master=yes cmdop:STATUS.
nox nsrlcpd pid:7606 lcp_rpc_vers:1, jb_list_len:1
nox nsrlcpd \n
nox nsrlcpd lcpd_last_status_set: OK->OK (requested OK).
Unable to render the following message: freeing unused errinfo with msgid 14947

nox nsrlcpd Event LE_CONTINUE raised for cmdid 16056
nox nsrlcpd Processing: cmdid 16056 & cmdop 4 for event: LE_CONTINUE
nox nsrlcpd Event LE_COMP_OK raised for cmdid 16056
nox nsrlcpd Responder woke up due to event
nox nsrlcpd LCPD_HDR: server:nox.anywebdb.com vers:1 jbid:LTO1_LIB cmdid:16056 is_master=yes cmdop:LOAD.
nox nsrlcpd \ncomplete :no
nox nsrlcpd offset:0    tag:S:000031    ok:yes full: no barcode:       ? dest:D:000001 bay:?
nox nsrlcpd \n
nox nsrlcpd offset:0      tag:D:000001    ok:yes full:yes barcode:       ? source:S:000031 serial#:? bay:?
nox nsrlcpd \n
nox nsrd /dev/nst0 Verify label operation in progress
nox nsrd /dev/nst0 Mount operation in progress
nox nsrlcpd LCPD_HDR: server:nox.anywebdb.com vers:1 jbid:LTO1_LIB cmdid:16058 is_master=no cmdop:STATUS.
nox nsrlcpd Responder woke up due to event
nox nsrlcpd LCPD_HDR: server:nox.anywebdb.com vers:1 jbid:LTO1_LIB cmdid:16058 is_master=yes cmdop:STATUS.
nox nsrlcpd pid:7606 lcp_rpc_vers:1, jb_list_len:1
nox nsrlcpd \n
nox nsrd [Jukebox `LTO1_LIB', operation # 30]. Finished with status: succeeded
nox nsrlcpd Processing: cmdid 4294967295 & cmdop 17 for event: LE_NEWCMD
nox nsrlcpd Event ? raised for cmdid 4294967295
nox nsrlcpd Processing: cmdid 4294967295 & cmdop 17 for event: ?
nox nsrlcpd lg_lstat(): Calling lstat64().
nox nsrlcpd lg_open(): Calling open64().
nox nsrlcpd lcpd_last_status_set: OK->OK (requested OK).
nox nsrlcpd Polling current hw status jukebox:LTO1_LIB status:OK
nox nsrlcpd Event ? raised for cmdid 4294967295
nox nsrlcpd Processing: cmdid 4294967295 & cmdop 17 for event: ?
nox nsrlcpd Event LE_COMP_OK raised for cmdid 4294967295


* With the one exception being when it’s used in scripting.

debug trace level option.

 

Note: Please understand I’m not suggesting you play around with this feature randomly or without good reason. You shouldn’t make use of this unless you’re advised to by support – or unless you really, really know what you’re doing.

In the past if you’ve had a fairly serious issue with NetWorker you’ll probably have been advised to temporarily stop NetWorker, and restart it in debug mode. For example, support may have asked you to run, say:

# nsrd -D9

Which launches the main NetWorker server process (the “master” process) in debug mode level 9. There’s 99 levels of debug, ranging from 1 to 99, with the special level of “0″ being “off”. Debug mode level 1 is not really all that interesting; debug model level 3 can be useful sometimes for tracing name resolution issues. Level 9 produces fairly interesting details for people in support roles – I frequently make use of it when trying to debug issues customers may be encountering. (After that, the levels start to get less useful to anyone except for EMC engineers, and level 99 falls into the what, are you insane? category.)

There’s a problem with this strategy of stopping NetWorker and rerunning it though, as you may have already guessed – sometimes stopping and starting effectively kills off the problem you’re having. What is needed is the ability to put NetWorker into debug mode on the fly.

To address this, there’s been a support tool floating around for a while that EMC support would occasionally pass on to customers to help out with enabling debug mode without having to restart NetWorker. It’s called dbgcommand. However, dbgcommand wasn’t provided by default, and was typically only handed out either (a) when it was explicitly requested or (b) when engineering asked support to get the customer to run it.

As you can imagine, this has been a tad frustrating for just about everyone involved. Actually, I’ll rephrase that this has been quite frustrating for everyone involved.

Thanks in part to various people (e.g., my colleagues and I) continuing to ask for it, the pseudo-magical handing out of dbgcommand has been dropped, and it’s now included in NetWorker as of 7.5. [Note: I think it's also in 7.4.4...]

So, how do you use it?

Surprisingly easily:

# dbgcommand -p pid command

Where pid is the process ID of the NetWorker daemon you want to interact with.

Typically your usage would be say:

# dbgcommand -p pid Debug=X

Where X is a value between 0 and 99, or perhaps:

# dbgcommand -p pid FlushDnsCache

For example, if the NetWorker server process, /usr/sbin/nsrd, were running with a PID of 23278, you could put it in debug mode level 9 without a restart by typing the command:

# dbgcommand -p 23278 Debug=9

When you adjust a processes debug level, it gets logged in the appropriate daemon log – for example:

0 1234248435 2 0 0 3051682432 23277 0 nox nsrd 2 %s 1 0 54 Changing Debug level of process (id 23278) from 0 to 9
0 1234248446 2 0 0 3051682432 23277 0 nox nsrd 2 %s 1 0 54 Changing Debug level of process (id 23278) from 9 to 0

Or:

0 1234249625 2 0 0 3051682432 23277 0 nox nsrd 2 %s 1 0 56 Changing FlushDnsCache of process (id 23278) from 0 to 1

So there you have it – no longer do you have to stop and restart NetWorker to run in debug mode, you can alter the debug level of whatever NetWorker process you need to on the fly with minimum fuss.

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha