Quibbles – nsrmmd vs process IDs

OK, there’s not a lot about NetWorker that drives me nuts. I think I’ve done only one other “Quibbles” topic here so far, but I’ve reached the point on this one where I’d like to vent some exasperation.

There are times – not often, but they occasionally happen – where for some reason or another, a device will lock up and become unresponsive. When it reaches the point where the only way to recover is either to kill the controlling nsrmmd process or to restart NetWorker, things get tough.

The reason for this is that NetWorker does not, anywhere, provide a mapping between each nsrmmd, the device it controls, and the process ID for that device.

Honestly, this is one of those basic administrative usability issues; there is no excuse for it not having been resolved and available for the last 5 years, if not the last 10. It comes down to either laziness or apathy – people have been asking for it for long enough that, with all the changes made to nsrmmd over the years, it should have been added a long time ago.

What do you think?

5 thoughts on “Quibbles – nsrmmd vs process IDs”

  1. Hi Preston. You’re absolutely right here.
    Actually, I personally think NetWorker should be able to display the process name and ID for all NetWorker tasks. Clone and stage sessions are good examples where it would make life a ton easier if you knew the process ID.

    Here’s a scenario… You have more than one clone session running and – “oh oooooh” – the right-click “stop” on a group doesn’t work; you want to kill only one of the two clone jobs and leave the other to complete.

    If NMC displayed the PID (process ID) then it would be very easy to do. This has always reminded me of Russian roulette, with fewer consequences.

    I’ve suggested this change to a couple of the right people at EMC, so from here on in… I can only hope.
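
    In the meantime, about the only workaround is to guess from the process table. A minimal sketch, assuming a Linux server and assuming the clones run under processes named nsrclone (that process name is an assumption here – check your own ps output before killing anything):

```shell
# Sketch: list candidate clone processes so you can pick the one to kill.
# "nsrclone" as the process name is an assumption - verify before killing.
list_clones() {
    # Reads "ps -eo pid,lstart,args"-style lines on stdin and keeps only
    # those mentioning nsrclone. The [n] trick stops grep matching itself.
    grep '[n]srclone'
}

# Typical use on a live server:
#   ps -eo pid,lstart,args | list_clones
#   kill <pid-of-the-stuck-clone>
```

    Listing the start time alongside the PID at least lets you distinguish the older, stuck job from the one you want to leave running.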

  2. Hi,

    You can get the PID information with dbgcommand, which is shipped with NetWorker. It’s not perfect, but it is pretty easy to get the info when you need it.

    1. Hi Rickard,

      Thanks for the tip. Given how closely EMC used to guard dbgcommand it’s still turning up surprises. I like this, but I still feel that it’s inappropriate that this information isn’t readily available in the device properties, particularly given how long people have been asking for it.

      For anyone interested, the process is:

      1. Find the nsrd process ID
      2. Run:

      dbgcommand -p nsrdPID PrintDevInfo

      3. Review the output in the /nsr/logs/daemon.raw file.
      4. Compare against a “ps -eaf | grep nsrmmd” or other process listing to match nsrmmd numbers to process IDs.

      For example:

      nox nsrd sn_mmd_if: nox
      nox nsrd sn_ndmp:
      nox nsrd sn_max_mmds: 0
      nox nsrd sn_excess_mmds: 0
      nox nsrd 	sn_num_devices: 11
      nox nsrd 		d_device: /d/nsr/idata/backup
      nox nsrd 		d_device: /d/nsr/03
      nox nsrd 		d_device: /d/nsr/03/_AF_readonly
      nox nsrd 		d_device: /d/nsr/02
      nox nsrd 		d_device: /d/nsr/02/_AF_readonly
      nox nsrd 		d_device: /d/nsr/01
      nox nsrd 		d_device: /d/nsr/idata/clone/_AF_readonly
      nox nsrd 		d_device: /d/nsr/idata/backup/_AF_readonly
      nox nsrd 		d_device: /d/nsr/01/_AF_readonly
      nox nsrd 		d_device: /dev/nst0
      nox nsrd 		d_device: /d/nsr/idata/clone
      nox nsrd 	sn_num_mmds: 11
      nox nsrd 		mm_number: 1
      nox nsrd 		mm_number: 2
      nox nsrd 		mm_number: 3
      nox nsrd 		mm_number: 4
      nox nsrd 		mm_number: 5
      nox nsrd 		mm_number: 6
      nox nsrd 		mm_number: 7
      nox nsrd 		mm_number: 8
      nox nsrd 		mm_number: 9
      nox nsrd 		mm_number: 10
      nox nsrd 		mm_number: 11
      nox nsrd }
      

      Then:

      [root@nox ~]# ps -eaf | grep nsrmmd
      root      5864  5854  0 10:27 ?        00:00:00 /usr/sbin/nsrmmdbd
      root      6036  5854  0 10:28 ?        00:00:21 /usr/sbin/nsrmmd -n 1
      root      6042  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 2
      root      6047  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 3
      root      6052  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 4
      root      6057  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 5
      root      6062  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 6
      root      6071  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 7
      root      6076  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 8
      root      6081  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 9
      root      6086  5854  0 10:28 ?        00:00:00 /usr/sbin/nsrmmd -n 10
      root      7646  5854  0 16:29 ?        00:00:00 /usr/sbin/nsrmmd -n 11
      

      I agree that in an emergency this will work – and I’m very grateful you pointed it out – but I still think this is something that needs to be improved on.
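
      For what it’s worth, step 4 can be scripted. A minimal sketch, assuming a Linux server where each nsrmmd is started with the -n flag as in the ps listing above (the awk logic assumes “ps -eaf”-style output, where the PID is the second field):

```shell
# Sketch: map each "nsrmmd -n <number>" instance to its process ID,
# parsing a "ps -eaf"-style listing where the PID is the second field.
map_mmds() {
    awk '/nsrmmd -n/ {
        for (i = 1; i <= NF; i++)
            if ($i == "-n")
                print "nsrmmd " $(i + 1) " => PID " $2
    }'
}

# Typical use on a live server:
#   ps -eaf | map_mmds
```

      Cross-referencing the mm_number values from the dbgcommand output against this listing then tells you which PID belongs to the device you need to kill.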

  3. Hi Preston,

    This is somewhat along the lines of process IDs and killing them. Have you had any experience in killing hung nsrjb processes via the ‘NSR operation status’ resource in nsradmin?

    We had a hung nsrjb process that was then causing all other jukebox operations to queue up behind it. In the end, we killed the NetWorker services and brought them back up. However, I came across an alternative solution when perusing the command reference that I was hoping you could comment on:

    Could I use nsradmin to update the ‘cancellation’ attribute of the hung operation instance? The command reference for NSR_OP states that this cancellation attribute can have, among other things, a value of “full” or “immediate” assigned to it: “full” would be akin to gracefully killing the process, whereas “immediate” would be killing it immediately.

    Would you, or anybody else, mind giving your thoughts on the above?

    1. Hi James,

      You certainly can use the operation status flag to abort the individual operation if you wish; I believe, however, that the primary difference between “full” and “immediate” is that “full” instructs the controlling daemon to hang around for the operation to terminate after sending the appropriate halt signal, whereas “immediate” just returns immediately.

      In theory, neither option works too well if the issue causing the failure is sufficiently low level that it’s causing a SCSI hang/block operation. I’d probably liken the two more to Oracle’s graceful vs non-graceful shutdown. I.e., a “full” in the NSR operation would be akin to Oracle’s “shutdown immediate”, whereas an “immediate” in the NSR operation would be akin to Oracle’s “shutdown abort”.

      In other words, neither is particularly clean, but one leaves slightly less mess for you to deal with later 🙂

      Cheers.
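
      For anyone wanting to try the nsradmin route James describes, a session might look something like the following. This is a sketch only – the resource and attribute names come from the NSR_OP man page cited above, but the server name here is a placeholder and the exact operation name will differ on your system, so print the resource first and confirm against the command reference before updating anything:

```
# Sketch only - confirm against the NSR_OP man page first.
nsradmin -s backupserver
nsradmin> print type: NSR operation status
(note the name of the hung operation from the output)
nsradmin> . type: NSR operation status; name: <operation name>
nsradmin> update cancellation: full
```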
