Sometimes it’s helpful to run NetWorker in debug mode – but sometimes, you just want to throw the nsrmmd processes into debug mode, and depending on your site, there may be a lot of them.

So, I finally got around to writing a “script” to throw all nsrmmd processes into debug mode. It hardly warrants being a script, but it may be helpful to others. Of course, this is Unix only – I’ll leave it as an exercise to the reader to generate the equivalent Windows script.

The entire script is as follows:

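#!/bin/sh
# Set the debug level of every running nsrmmd process.
# Usage: dbgnsrmmd.sh level

if [ $# -ne 1 ]
then
	echo "Usage: $0 level" >&2
	exit 1
fi

DBG=$1
PLATFORM=`uname`

# Collect the PIDs of all nsrmmd processes.
if [ "$PLATFORM" = "Linux" ]
then
	PROCLIST=`ps -C nsrmmd -o pid | grep -v PID`
elif [ "$PLATFORM" = "SunOS" ]
then
	PROCLIST=`ps -ea -o pid,comm | grep 'nsrmmd$' | awk '{print $1}'`
else
	echo "$0: unsupported platform: $PLATFORM" >&2
	exit 1
fi

# Echo each command before running it, so there is a record of what was done.
for pid in $PROCLIST
do
	echo dbgcommand -p $pid Debug=$DBG
	dbgcommand -p $pid Debug=$DBG
done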

The above is applicable only to Solaris and Linux so far – I’ve not customised for say, HPUX or AIX simply because I don’t have either of those platforms hanging around in my lab. To invoke, you’d simply run:

# dbgnsrmmd.sh level

Where level is a number between 0 (for off) and 99 (for … “are you insane???”). Running it on one of my lab servers, it works as follows:

[root@nox bin]# dbgnsrmmd.sh 9
dbgcommand -p 4972 Debug=9
dbgcommand -p 4977 Debug=9
dbgcommand -p 4979 Debug=9
dbgcommand -p 4982 Debug=9
dbgcommand -p 4991 Debug=9
dbgcommand -p 4999 Debug=9

Note that when you invoke dbgcommand against a sub-daemon such as nsrmmd (as opposed to nsrd itself), you won’t get an alert in the daemon.{raw|log} file to indicate the debug level has changed.
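
For platforms the script doesn't cover, one possible starting point is the POSIX form of ps, which most Unix systems accept. This is only a sketch and has not been tried on HPUX or AIX (HPUX, for instance, generally wants the UNIX95 environment variable set before ps accepts -o). The function names filter_nsrmmd_pids and list_nsrmmd_pids are made up for illustration.

```sh
#!/bin/sh
# Sketch only: not verified on HPUX or AIX.

# Read "pid command" pairs on stdin and print the PID of every
# process whose command name (with or without a path) is nsrmmd.
filter_nsrmmd_pids() {
	awk '$2 ~ /(^|\/)nsrmmd$/ { print $1 }'
}

# List nsrmmd PIDs using POSIX ps options; "pid=" and "comm="
# suppress the header line, so no "grep -v PID" is needed.
list_nsrmmd_pids() {
	ps -e -o pid= -o comm= | filter_nsrmmd_pids
}
```

If this behaves on your platform, PROCLIST=`list_nsrmmd_pids` could stand in for the per-platform branches.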
 

Everyone who has worked with ADV_FILE devices knows this situation: a disk backup unit fills, and the saveset(s) being written hang until you clear up space, because as we know savesets in progress can’t be moved from one device to another:

Savesets hung on full ADV_FILE device until space is cleared

Honestly, what makes me really angry (I’m talking Marvin the Martian really angry here) is that if a tape device fills and another tape of the same pool is currently mounted, NetWorker will continue to write the saveset on the next available device:

Saveset moving from one tape device to another

What’s more, if it fills and there’s a drive that currently does not have a tape mounted, NetWorker will mount a new tape in that drive and continue the backup in preference to dismounting the full tape and reloading a volume in the current drive.

There’s an expression for the behavioural discrepancy here: That sucks.

If anyone wonders why I say VTLs shouldn’t need to exist, but I still go and recommend them and use them, that’s your number one reason.

 

For a while now I’ve been working with EMC support on an issue that’s only likely to strike sites that have intermittent connectivity between the server and storage nodes and that stage from ADV_FILE on the storage node to ADV_FILE on the server.

The crux of the problem is that if you’re staging from storage node to server and comms between the sites are lost for long enough that NetWorker:

  • Detects the storage node nsrmmd processes have failed, and
  • Attempts to restart the storage node nsrmmd processes, and
  • Fails to restart the storage node nsrmmd processes

Then you can end up in a situation where the staging aborts in an ‘interesting’ way. The first hint of the problem is that you’ll see a message such as the following in your daemon.raw:

68975 10/15/2009 09:59:05 AM  2 0 0 526402000 4495 0 tara.pmdg.lab nsrmmd filesys_nuke_ssid: unable to unlink /backup/84/05/notes/c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 on device `/backup': No such file or directory

(The above was rendered for your convenience.)

However, if you look for the cited file, you’ll find that it doesn’t exist. That’s not quite the end of the matter though. Unfortunately, while the saveset file that was being staged didn’t stay on disk, its media database details did. So in order to restart staging, it becomes necessary to first locate the saveset in question and delete the media database entry for the (failed) server disk backup unit copy. Interestingly, this is only ever to be found on the RW device, not the RO device:

[root@tara ~]# mminfo -q "ssid=c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342"
 volume        client       date      size   level  name
Tara.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001.RO    fawn      10/15/2009 1287 MB manual  /usr/share
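
Removing the stale entry could be sketched as follows. This assumes you have already looked up the clone ID of the failed copy (for example by adding -r "volume,ssid,cloneid" to the mminfo query above) and that nsrmm -d -S ssid/cloneid is used to delete just that one instance. The function name and the RUN switch are illustrative only; by default the command is printed rather than run.

```sh
#!/bin/sh
# Sketch only: deletes ONE instance (clone) of a saveset from the media
# database, leaving the other copies alone. Dry-run unless RUN=1.

# $1 = saveset ID, $2 = clone ID of the failed copy on the server's
# disk backup unit (Tara.001 in the example above).
remove_failed_copy() {
	ssid=$1
	cloneid=$2
	if [ -z "$ssid" ] || [ -z "$cloneid" ]
	then
		echo "usage: remove_failed_copy ssid cloneid" >&2
		return 1
	fi
	# nsrmm -d -S ssid/cloneid removes just that instance.
	if [ "${RUN:-0}" = "1" ]
	then
		nsrmm -d -S "$ssid/$cloneid"
	else
		echo nsrmm -d -S "$ssid/$cloneid"
	fi
}
```

Check the printed command against the mminfo output before setting RUN=1; deleting the wrong instance would remove a good copy.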

We had hoped that it was fixed in 7.5.1.5, but my tests aren’t showing that to be the case. Regardless, it’s certainly around in 7.4.x as well and (given the nature of it) has quite possibly been around for a while longer than that.

As I said at the outset, this isn’t likely to affect many sites, but it is something to be aware of.

 

In many environments with storage nodes, a common requirement is to share backup devices between the server and/or storage nodes (regardless of whether the storage nodes are dedicated or full). The primary goal is to reduce the number of devices, or the number of tape libraries, required, minimising cost while still maximising the flexibility of the environment.

There are two mechanisms available for device sharing. These are:

  • Library sharing – free from any licensing, this is the cheapest but least flexible
  • Dynamic drive sharing – requiring additional licenses, this is more flexible but comes at a higher cost in terms of maintenance, debugging and complexity.

It’s easiest to gain an understanding of how these two options work with some diagrams.

First, let’s consider library sharing:

Conceptual diagram of library sharing

(Not shown: connection to tape robot head – i.e., control port connection.)

In this configuration, more than one host connects to specific devices in the tape library. These are hard or permanent connections; that is, once a device has been allocated to one server/storage node, it stays allocated to that host until the library is reconfigured.

This is a static allocation of resources that has the backup administrator allocate a specific number of devices per server/storage node based on the expected requirements of the environment. For instance, in the example above, the server has permanent mappings to 3 of the 6 tape drives in the library; the full storage node has permanent mappings to 2 of the drives, and the dedicated storage node has a permanent mapping to the one remaining drive in the tape library.

The key advantages of this allocation method are:

  • Zero licensing cost,
  • Guaranteed device availability,
  • Per-host/device isolation, preventing faults on one system from cascading to another.

The disadvantages of this allocation method are:

  • No dynamic reallocation of resources in the event of requirement spikes that were not anticipated,
  • Can’t be reconfigured “on the fly”,
  • If a backup device fails and a host only has access to one device, it won’t be able to back up or recover without configuration changes.

Where you would typically use this allocation method:

  • In VTLs – since NetWorker licenses VTLs by capacity*, you can allocate as many virtual drives as you want, giving one or more virtual drives to every host in the datazone that holds more than a certain amount of data, which significantly reduces the LAN impact of backup.
  • In PTLs where backup/recovery load is shared reasonably equally by two or more storage nodes (counting the server as a storage node in this context) and having only one library is desirable.

Moving on to dynamic drive sharing, this model resembles the following:

Conceptual overview of dynamic drive sharing

In this model, licenses are purchased, on a per-drive basis, for dynamic sharing. (So if you have 6 tape drives, such as in the above example, you would need up to 6 dynamic drive sharing licenses – you don’t have to share every device in the library – some could remain statically mapped if desired.)

When a library with dynamically shared drives is set up within NetWorker, the correct path to the device, on a per-host basis, will be established for that device within the configuration. This might mean that the device “/dev/nst0” on the backup server might be known in the configuration as being all three of the following:

  • /dev/nst0
  • rd=stnode:/dev/rmt/0cbn
  • rd=dedstnode:\\.\Tape0
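
The naming convention in that list (a bare path for a device local to the server, rd=host:path for a device on a storage node) can be pulled apart with a few lines of shell. This is just a sketch: split_device is a made-up helper, and “(server)” is a placeholder printed for local devices.

```sh
#!/bin/sh
# Sketch only: split a NetWorker device name into "host path".
# Names without an rd= prefix are local to the server, printed here
# with the placeholder host "(server)".
split_device() {
	case $1 in
		rd=*:*)
			rest=${1#rd=}
			printf '%s %s\n' "${rest%%:*}" "${rest#*:}"
			;;
		*)
			printf '%s %s\n' "(server)" "$1"
			;;
	esac
}
```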

When a host with dynamic access to drives needs a device (for either backup or recovery), the NetWorker server (or whichever host has control over the actual tape robot) will load a tape into a free, mappable drive, then notify the storage node which device it should use to access media. The storage node will then use the media until it no longer needs to, with the host that controls the robot handling any post-use unmounts or media changes.

The advantages of dynamic drive sharing are:

  • With maximal device sharing enabled, resource spikes can be handled by dynamically allocating a useful number of drives to host(s) that need them at any given time,
  • Fewer drives are typically required than would be in a library sharing/statically mapped allocation method.

The disadvantages of dynamic drive sharing are:

  • With multiple hosts able to see the same devices, isolating devices from SCSI resets and other SAN events from non-accessing hosts requires constant vigilance. HBAs and SAN settings must be configured appropriately, and these settings must be migrated/checked every time drivers change, systems are updated, etc.
  • It is relatively easy to misconfigure dynamic drive sharing by planning to use fewer physical tape drives than are really necessary. (I.e., it’s not that cheap from a hardware perspective, either.)
  • Each drive that is dynamically shared requires a license.
  • Unlike products such as, say, NetBackup, due to the way nsrmmd processes work and don’t share nicely with each other, volumes must be unmounted before devices are transferred from one host with dynamic drive sharing to another, even if both hosts will be using the same volume. (This falls into the “lame” feature category.)

You would typically use dynamic drive sharing in situations such as:

  • Where a small, select number of hosts with significant volumes of data require LAN-free backups,
  • Where small/isolated storage nodes are still SAN connected (e.g., DMZ storage nodes).

Many of the architectural reasons why dynamic drive sharing was originally developed have in some senses gone away with greater penetration of VTLs into the backup arena. Given that it’s a straightforward proposition to configure a large number of virtual tape drives, instead of messing about with dynamic drive sharing one can choose to just use library sharing in VTL environments to achieve the best of both worlds.


* Currently non-EMC VTLs, while still licensed by capacity, typically co-receive unlimited autochanger licenses. Even so, such licenses are not limited by the number of virtual tape drives.

 

OK, there’s not a lot about NetWorker that drives me nuts. I think I’ve done only one other “Quibbles” topic here so far, but I’ve reached the point on this one where I’d like to vent some exasperation.

There are times – not often, but they occasionally happen – where for some reason or another, a device will lock up and become unresponsive. When this reaches a point where the only way to recover is to either kill the controlling nsrmmd process or restart NetWorker, things get tough.

The reason for this is that NetWorker does not, anywhere, provide a mapping between each device, the nsrmmd that controls it and the process ID of that nsrmmd.

Honestly, this is one of those basic administrative usability issues for which there is no excuse: it should have been resolved and available for the last 5 years, if not the last 10 years. It comes down to either laziness or apathy – people have been asking for it long enough that, with all the changes made to nsrmmd over the years, it should have been added a long time ago.

What do you think?

 

The fantastic thing about NetWorker is that being a three-tier architecture, a datazone may encompass far more than just a single site or datacentre. That is, you can design a system where the NetWorker server is in Sydney, and you have storage nodes in Melbourne, Adelaide, Perth, Brisbane, Darwin and Hobart. The server would be responsible for coordinating all backups and storing/retrieving data from the Sydney datacentre, and each storage node would be responsible for the storage/retrieval of backups local to its own site.

(Or, to use a non-Australian example, you could have a datazone where your backup server is in London, and you have storage nodes in Paris, New York and Cape Town.)

When a NetWorker datazone encompasses only a single datacentre, there’s usually very little tweaking that needs to be done to the server <-> storage node communications, once they’re established. However, when we start talking about datazones that encompass WANs, we do have to take into account the level of latency between the storage nodes and the backup server.

Luckily, there are settings within NetWorker to account for this. Specifically, there are three key settings, all maintained within the NetWorker server resource itself. These are:

  • nsrmmd polling interval
  • nsrmmd restart interval
  • nsrmmd control timeout

To view these settings in the NetWorker management console, you first have to turn on diagnostic mode. Then, right click the server (absolute topmost entry in the configuration tree) and choose “Properties”. These settings are maintained in the “Media” pane:

Controlling nsrmmd settings in NMC

So, what does each of these settings do?

  • nsrmmd polling interval – This is the number of minutes that elapses between times that the NetWorker master process (nsrd) probes the nsrmmd to determine that it is still running. You could think of it as the heartbeat parameter. By default, this is 3 minutes.
  • nsrmmd restart interval – This is how long, in minutes, NetWorker will wait between restart attempts of an nsrmmd process. By default, this is 2 minutes.
  • nsrmmd control timeout – This is the number of minutes NetWorker waits for storage node requests to be completed. By default, this is 5 minutes.

Note that NetWorker is intelligent about this – the man pages, for instance, explicitly refer to “remote nsrmmd” in each of the first two options, meaning that we should expect local nsrmmd processes on the backup server itself to be dealt with faster, even if these settings are increased.

All these settings work well for regular-sized LAN-contained datazones. However, they may not be optimal in either of the following two scenarios:

  • Very busy datazones that have a large number of devices, even if they’re in the same LAN;
  • WAN-connected datazones.

In either of these scenarios, if you’re seeing periodic phases where NetWorker goes through restarting nsrmmd processes, particularly if this is happening during backups, then it’s a good idea to try to bump up these values to something more compatible with your environment.

My first recommendation, which works for most sites without any further tweaking, is to double each of the first two settings – i.e., increase nsrmmd polling interval to 6 minutes and nsrmmd restart interval to 4 minutes – and to increase nsrmmd control timeout from 5 to 7 minutes. (I don’t think it’s usually necessary to double nsrmmd control timeout, because the delay in such timeouts is usually caused by devices, not the bandwidth of the connection, and therefore you don’t need to drastically increase the value.)
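
If you’d rather not click through NMC, the same change can probably be made with nsradmin. The sketch below only generates the command file; it assumes the three attributes are hidden attributes of the NSR resource (hence “option hidden”) and that nsradmin applies updates from an input file without prompting. The function name nsrmmd_tuning_commands is made up for illustration.

```sh
#!/bin/sh
# Sketch only: write nsradmin commands that show, then update, the
# three nsrmmd attributes of the NSR (server) resource.

# $1 = polling interval, $2 = restart interval, $3 = control timeout
# (all in minutes); the commands are written to stdout.
nsrmmd_tuning_commands() {
	cat <<EOF
option hidden
. type: NSR
show nsrmmd polling interval; nsrmmd restart interval; nsrmmd control timeout
print
update nsrmmd polling interval: $1; nsrmmd restart interval: $2; nsrmmd control timeout: $3
print
EOF
}
```

To apply the values suggested above you might save the output with nsrmmd_tuning_commands 6 4 7 > /tmp/nsrmmd.cmds, review the file, and then run nsradmin -i /tmp/nsrmmd.cmds on the NetWorker server.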

© 2012 The NetWorker Blog