For a while now I’ve been working with EMC support on an issue that’s only likely to strike sites that have intermittent connectivity between the server and storage nodes and that stage from ADV_FILE on the storage node to ADV_FILE on the server.
The crux of the problem is that if you’re staging from storage node to server and comms between the sites are lost for long enough that NetWorker:
- Detects the storage node nsrmmd processes have failed, and
- Attempts to restart the storage node nsrmmd processes, and
- Fails to restart the storage node nsrmmd processes
Then you can end up in a situation where the staging aborts in an ‘interesting’ way. The first hint of the problem is that you’ll see a message such as the following in your daemon.raw:
68975 10/15/2009 09:59:05 AM 2 0 0 526402000 4495 0 tara.pmdg.lab nsrmmd filesys_nuke_ssid: unable to unlink /backup/84/05/notes/c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 on device `/backup’: No such file or directory
(The above was rendered for your convenience.)
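If you need to find these events across a large rendered log, the following is a minimal sketch rather than anything EMC supplies. It assumes the message wording matches the line above and that the last path component is the long-form ssid (which is how it is used in the mminfo query later in this post). The function name `failed_unlink_ssid` is made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Matches the message text shown in the rendered daemon.raw line above.
# Assumption: the wording is the same across the releases you run.
UNLINK_RE = re.compile(r"filesys_nuke_ssid: unable to unlink (\S+) on device")


def failed_unlink_ssid(line):
    """Return (path, ssid) for a filesys_nuke_ssid unlink failure, else None."""
    match = UNLINK_RE.search(line)
    if match is None:
        return None
    path = match.group(1)
    # Assumption: the file name on the disk device is the long-form ssid.
    return path, PurePosixPath(path).name
```

Feeding each rendered log line through this gives you the list of ssids to check in the media database.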
However, if you look for the cited file, you’ll find that it doesn’t exist. That’s not quite the end of the matter though. Unfortunately, while the saveset file that was being staged didn’t stay on disk, its media database details did. So in order to restart staging, it becomes necessary to first locate the saveset in question and delete the media database entry for the (failed) server disk backup unit copy. Interestingly, this is only ever to be found on the RW device, not the RO device:
[root@tara ~]# mminfo -q "ssid=c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342"
 volume        client   date       size   level  name
Tara.001       fawn     10/15/2009 1287 MB manual /usr/share
Fawn.001       fawn     10/15/2009 1287 MB manual /usr/share
Fawn.001.RO    fawn     10/15/2009 1287 MB manual /usr/share
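The cleanup step can be sketched as follows. This is an illustration only, under some loud assumptions: that you have re-run mminfo with a report such as -r "volume,ssid(53),cloneid" so that each row carries the volume, long ssid and clone ID, and that `nsrmm -d -S ssid/cloneid` is the right way to remove a single copy on your release (check the nsrmm man page first). The function name and the clone IDs in the usage example are invented. The code only builds the command strings; it never runs them.

```python
def stale_clone_commands(report, server_volume):
    """Build (but do not run) nsrmm commands for the copy on server_volume."""
    commands = []
    for line in report.splitlines():
        fields = line.split()
        # Skip blank lines, headers and anything not "volume ssid cloneid".
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        volume, ssid, cloneid = fields
        # Exact match: only the failed copy on the server's RW volume,
        # never the storage node copies (Fawn.001 / Fawn.001.RO here).
        if volume == server_volume:
            commands.append(f"nsrmm -d -S {ssid}/{cloneid}")
    return commands


# Hypothetical report text; the clone IDs are made up.
sample = """\
Tara.001 c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 1255566425
Fawn.001 c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 1255566300
Fawn.001.RO c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 1255566299
"""
for command in stale_clone_commands(sample, "Tara.001"):
    print(command)
```

Printing the commands instead of executing them leaves a chance to review each ssid/cloneid pair before anything is deleted from the media database.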
We had hoped that it was fixed in 7.5.1.5, but my tests aren’t showing that to be the case. Regardless, it’s certainly around in 7.4.x as well and (given the nature of it) has quite possibly been around for a while longer than that.
As I said at the outset, this isn’t likely to affect many sites, but it is something to be aware of.
I’ve seen a few of these on 7.5.1 (with 7.6 [?] nsrd, nsrclone, ansrd, nsrmmd and nsrmmdbd binaries), but not many so far. However, we’re staging from disk to tape, all on the same host. Querying the ssids with mminfo finds nothing. So far it has not caused staging to hang, AFAICT. From what you’ve said, it seems there’s apparently little need to report it, but hopefully it won’t become a significant issue before it’s fixed.
Interesting…
We have NetWorker 7.6.0 build 148 installed on a 32-bit Windows 2003 SP2 server. This is the NetWorker server.
The client is 7.5 SP3. It is a virtual client running 64-bit Windows 2008 R2 on an ESXi server.
The backup server is attached to a Clariion CX4 and we back up to disk.
A 2T LUN is presented to the server, which sees it as an internal disk.
I’ve used diskpart to set the alignment:
DISKPART> create partition primary align=64
DiskPart succeeded in creating the specified partition.
DISKPART> list partition
  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
* Partition 1    Primary           1183 GB    64 KB
Then I assign a drive letter to the disk.
Then format the disk as NTFS with the default allocation unit size.
In NetWorker, create an AFTD device > label and mount the device > verify the path to the AFTD.
Now when I try to back up the save set I get:
0 12-August-2010 16:18:15 2 0 0 7932 6964 0 bkpsrv nsrmmd 08/12/10 16:18:15 nsrmmd #1: There was an error creating the file: “K:AFTD_windows_clients30359c87289d-00000006-f9639226-4c639226-00081f00-ae9d5429” errno: 2
The disk partition is formatted as GPT. The backup set is about 1.5T. The backup runs fine until about 367G and then restarts.
EMC can’t help, as they can’t figure out what to do.