Debugging – device daemons on a storage node

On one of my lab environments I recently had a problem where the devices (AFTD) on a firewalled storage node suddenly stopped working.

This manifested in a fairly odd way:

  • NetWorker was still running on the storage node;
  • nsrports was still perfectly functional in both directions;
  • echo "print type: NSRLA" | nsradmin -p 390113 -s storageNode -i also worked perfectly normally.

The storage node could even initiate backups … it just couldn’t write them, because the device daemons (nsrmmd) weren’t running and it was configured to use itself for backups.

So what was going wrong?

After stopping and restarting the services a couple of times in between checking firewall rules, I decided it couldn’t be the firewall – I was in full control of it, and while I’m not by any means an expert on firewalls, I was 100% certain that there’d been no changes to firewall rules at all in the last few months.

Next step was the logs, which revealed a particularly odd error, one I’d not seen before:

42503 06/04/13 18:03:55 4 2 12 1103501632 3915 0 mondas nsrmmd RPC severe Remote system error - No route to host 
57925 06/04/13 18:03:57 2 0 0 1103501632 3915 0 mondas nsrmmd NSR warning Exiting idle nsrmmd #8 after 3 unsuccessful remap attempts to /nsr/tmp/snmd_mmf.map memory map file. 
83447 06/04/13 18:03:57 2 0 0 1103501632 3915 0 mondas nsrmmd NSR warning shutdown nsrmmd 8 with pid 3915. 
42503 06/04/13 18:03:55 4 2 12 2396641600 3916 0 mondas nsrmmd RPC severe Remote system error - No route to host 
57925 06/04/13 18:03:58 2 0 0 2396641600 3916 0 mondas nsrmmd NSR warning Exiting idle nsrmmd #9 after 3 unsuccessful remap attempts to /nsr/tmp/snmd_mmf.map memory map file. 
83447 06/04/13 18:03:58 2 0 0 2396641600 3916 0 mondas nsrmmd NSR warning shutdown nsrmmd 9 with pid 3916. 
33638 06/04/13 18:04:01 1 5 0 1013503696 3915 0 mondas nsrmmd NSR notice Shutting down nsrmmd #8, with PID 3915, at HOST mondas 
33638 06/04/13 18:04:01 1 5 0 2306643664 3916 0 mondas nsrmmd NSR notice Shutting down nsrmmd #9, with PID 3916, at HOST mondas 
42503 06/04/13 18:04:03 4 2 12 4123437376 3909 0 mondas nsrmmd RPC severe Remote system error - No route to host 
57925 06/04/13 18:04:05 2 0 0 4123437376 3909 0 mondas nsrmmd NSR warning Exiting idle nsrmmd #2 after 3 unsuccessful remap attempts to /nsr/tmp/snmd_mmf.map memory map file.

The /nsr/tmp/snmd_mmf.map file is new in NetWorker 8 and belongs to nsrsnmd, a new daemon that controls the device daemons on each individual storage node, taking that control work away from nsrd on the server itself. The file remains on disk between restarts of the daemons.
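If you want to check quickly whether a storage node's log shows the same signature, something like the following rough Python sketch could do it. It assumes the log lines are laid out exactly as in the excerpt above (date and time in the second and third fields, PID in the eighth, host name in the tenth); I have only inferred that layout from my own log, so treat the field positions as a guess. The names find_remap_failures, RemapFailure and SAMPLE are mine, not anything NetWorker provides.

```python
import re
import sys
from collections import namedtuple

RemapFailure = namedtuple(
    "RemapFailure", "timestamp host pid daemon_number attempts map_file"
)

# Matches only the message text seen in the excerpt above.
_REMAP_MESSAGE = re.compile(
    r"Exiting idle nsrmmd #(?P<num>\d+) after (?P<attempts>\d+) "
    r"unsuccessful remap attempts to (?P<path>\S+) memory map file"
)

# Three lines copied from the log excerpt, used as a small self-check.
SAMPLE = """
42503 06/04/13 18:03:55 4 2 12 1103501632 3915 0 mondas nsrmmd RPC severe Remote system error - No route to host
57925 06/04/13 18:03:57 2 0 0 1103501632 3915 0 mondas nsrmmd NSR warning Exiting idle nsrmmd #8 after 3 unsuccessful remap attempts to /nsr/tmp/snmd_mmf.map memory map file.
83447 06/04/13 18:03:57 2 0 0 1103501632 3915 0 mondas nsrmmd NSR warning shutdown nsrmmd 8 with pid 3915.
"""


def find_remap_failures(lines):
    """Return one RemapFailure per 'unsuccessful remap attempts' line."""
    failures = []
    for line in lines:
        match = _REMAP_MESSAGE.search(line)
        if not match:
            continue
        fields = line.split()
        # Assumed positions (from the excerpt): 1=date, 2=time, 7=PID, 9=host.
        failures.append(
            RemapFailure(
                timestamp=f"{fields[1]} {fields[2]}",
                host=fields[9],
                pid=int(fields[7]),
                daemon_number=int(match.group("num")),
                attempts=int(match.group("attempts")),
                map_file=match.group("path"),
            )
        )
    return failures


if __name__ == "__main__":
    # Pass a log file path as the first argument, or fall back to SAMPLE.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            source = handle.readlines()
    else:
        source = SAMPLE.splitlines()
    for failure in find_remap_failures(source):
        print(failure)
```

Every nsrmmd on the node reporting the same map file, as mine did, points at the shared file rather than at any one device.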

A search of support.emc.com yielded nothing for this particular error, which is not unusual for a fairly esoteric-looking error on a reasonably new release of NetWorker. Without many more options to try immediately, I decided to try out the technique so disliked by engineering: I shut down NetWorker and removed the /nsr/tmp directory. Upon restart … Voilà! All device daemons started up.
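For anyone nervous about deleting things outright, a more cautious variant is to rename the directory out of the way so it can be put back if needed. This is not what I did (I simply removed /nsr/tmp), so consider the Python sketch below an untested-in-anger alternative. It assumes NetWorker has already been stopped on the storage node and that it recreates whatever it needs on restart, which is what I saw after the removal. The set_aside helper and the ".old-" suffix are my own invention.

```python
import os
import sys
import time


def set_aside(path, suffix=None):
    """Rename a file or directory out of the way.

    Returns the new name, or None if the path does not exist.
    Only run this while NetWorker is stopped on the storage node.
    """
    if not os.path.lexists(path):
        return None
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    # Strip a trailing separator so "/nsr/tmp/" and "/nsr/tmp" behave alike.
    target = f"{path.rstrip(os.sep)}.old-{suffix}"
    if os.path.lexists(target):
        raise FileExistsError(target)
    os.rename(path, target)
    return target


if __name__ == "__main__":
    # Example: python set_aside.py /nsr/tmp
    if len(sys.argv) != 2:
        sys.exit("usage: set_aside.py PATH")
    moved_to = set_aside(sys.argv[1])
    print(f"moved to {moved_to}" if moved_to else "nothing to move")
```

The same helper works on just the /nsr/tmp/snmd_mmf.map file if you would rather leave the rest of the directory alone, though I have only confirmed the fix with the whole directory gone.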

I’m guessing somehow the file became corrupted; looking back over my lab logs, backup failures had started about 6 hours after a power outage, and while I’d managed to perform a controlled shutdown on UPS, the only thing I can think of is that the power surge before the outage may have caused a minor glitch. I doubt I’ll ever know the exact cause.

Not fully knowing the purpose of the /nsr/tmp/snmd_mmf.map file, I don’t know why it isn’t deleted on initial daemon startup, but there’s likely a reason behind it.

In the meantime, if all the device daemons on a storage node suddenly stop working under NetWorker 8, that file may be a candidate for removal.

2 thoughts on “Debugging – device daemons on a storage node”

  1. You sir are a lifesaver =) I just upgraded from 8.0.0.4 to 8.0.2 and had this issue on all my storage nodes. Fix worked like a charm..
