Jul 122012

As mentioned in my main NetWorker 8 introduction article, one of the biggest architectural enhancements in NetWorker v8 is a complete overhaul of the backup to disk architecture. This isn’t an update to AFTDs, it’s a complete restart – the architecture that was is no longer, and it’s been replaced by something newer, fresher, and more scalable.

In order to understand what’s involved in the changes, we first need to step back and consider how the AFTD architecture works in NetWorker v7.x. For the purposes of my examples, I’m going to consider a theoretical 10TB of disk backup capacity available to a NetWorker server. Now, under v7.x, you’d typically end up with filesystems and AFTDs that look like the following:

AFTD changes - v7.x AFTD

(In the above diagram, and those to follow, a red line/arrow indicates a write session coming into the AFTD, and a green line/arrow indicates a read session coming out of the AFTD.)

That is, you’d slice and dice that theoretical 10TB of disk capacity into a bunch of smaller sized filesystems, with typically one AFTD per filesystem. In 7.x, there are two nsrmmd processes per AFTD – one to handle write operations (the main/real path to the AFTD), and one to handle read operations from the AFTD device – the shadow or _AF_readonly path on the volume.

So AFTDs under this scenario delivered simultaneous backup and recovery operations by a bit of sleight-of-hand; in some ways, NetWorker was tricked into thinking it was dealing with two different volumes. In fact, that’s what lead to there being two instances of a saveset in the media database for any saveset created on AFTD – one for the read/write volume, and one of the read-only volume, with a clone ID of 1 less than the instance on the read/write volume. This didn’t double the storage; the “trick” was largely maintained in the media database, with just a few meta files maintained in the _AF_readonly path of the device.

Despite the improved backup options offered by v7.x AFTDs, there were several challenges introduced that somewhat limited the applicability of AFTDs in larger backup scenarios. Much as I’ve never particularly liked virtual tape libraries (seeing them as a solution to a problem that shouldn’t exist), I found myself typically recommending a VTL for disk backup in NetWorker ahead of AFTDs. The challenges in particular, as I saw them, were:

  • Limits on concurrency between staging, cloning and recovery from AFTDs meant that businesses often struggled to clone and reclaim space non-disruptively. Despite the inherent risks, this lead to many decisions not to clone data first, meaning only one copy was ever kept;
  • Because of those limits, disk backup units would need to be sliced into smaller allotments – such as the 5 x 2TB devices cited in the above diagram, so that space reclamation would be for smaller, more discrete chunks of data, but spread across more devices simultaneously;
  • A saveset could never exceed the amount of free space on an AFTD – NetWorker doesn’t support continuing a saveset from one full AFTD to another;
  • Larger savesets would be manually sliced up by the administrator to fit on an AFTD, introducing human error, or would be sent direct to tape, potentially introducing shoe-shining back into the backup configuration.

As a result of this, a lot of AFTD layouts saw them more being used as glorified staging units, rather than providing a significant amount of near-line recoverability options.

Another, more subtle problem from this architecture was that nsrmmd itself is not a process geared towards a high amount of concurrency; while based on its name (media multiplexor daemon) we know that it’s always been designed to deal with a certain number of concurrent streams for tape based multiplexing, there are limits to how many savesets an nsrmmd process can handle simultaneously before it starts to dip in efficiency. This was never so much an issue with physical tape – as most administrators would agree, using multiplexing above 4 for any physical tape will continue to work, but may result in backups which are much slower to recover from if a full filesystem/saveset recovery is required, rather than a small random chunk of data.

NetWorker administrators who have been using AFTD for a while will equally agree that pointing a large number of savesets at an AFTD doesn’t guarantee high performance – while disk doesn’t shoe-shine, the drop-off in nsrmmd efficiency per saveset would start to be quite noticeable at a saveset concurrency of around 8 if client and network performance were not the bottleneck in the environment.

This further encouraged slicing and dicing disk backup capacity – why allocate 10TB of capacity to a single AFTD if you’d get better concurrency out of 5 x 2TB AFTDs? To minimise the risk of any individual AFTD filling while others still had capacity, you’d configure the AFTDs to each have target sessions of 1 – effectively round-robbining the  starting of savesets across all the units.

I think that pretty much provides a good enough overview about AFTDs under v7.x that can talk about AFTDs in NetWorker v8.

Keeping that 10TB of disk backup capacity in play, under NetWorker v8, you’d optimally end up with the following configuration:

AFTD changes - v8.x AFTD

You’re seeing that right – multiple nsrmmd processes for a single AFTD – and I don’t mean a read/write nsrmmd and a shadow-volume read-only nsrmmd as per v7.x. In fact, the entire concept and implementation of the shadow volume goes away under NetWorker v8. It’s not needed any longer. Huzzah!

Assuming dynamic nsrmmd spawning (yes), and up to certain limits, NetWorker will now spawn one nsrmmd process for an AFTD each time it hits the target sessions setting for the AFTD volume. That raises one immediate change – for a v8.x AFTD configuration, bump up the target sessions for the devices. Otherwise you’ll end up spawning a lot more nsrmmd processes than are appropriate. Based on feedback from EMC, it would seem that the optimum target setting for a consolidate AFTD is 4. Assuming linear growth, this would mean that you’d have the following spawning rate:

  • 1 saveset, 1 x nsrmmd
  • 2 savesets, 1 x nsrmmd
  • 3 savesets, 1 x nsrmmd
  • 4 savesets, 1 x nsrmmd
  • 5 savesets, 2 x nsrmmd
  • 6 savesets, 2 x nsrmmd
  • 7 savesets, 2 x nsrmmd
  • 8 savesets, 2 x nsrmmd
  • 9 savesets, 3 x nsrmmd

Your actual rate may vary of course, depending on cloning, staging and recovery operations also being performed at the same time. Indeed, for the time being at least, NetWorker dedicates a single nsrmmd process to any nsrclone or nsrstage operation that is run (either manually or as a scheduled task), and yes, you can actually simultaneously recover, stage and clone all at the same time. NetWorker handles staging in that equation by blocking capacity reclamation while a process is reading from the AFTD – this prevents a staging operation removing a saveset that is needed for recovery or cloning. In this situation, a staging operation will report as per:

[root@tara usr]# nsrstage -b Default -v -m -S 4244365627
Obtaining media database information on server tara.pmdg.lab
80470:nsrstage: Following volumes are needed for cloning
80471:nsrstage:         AFTD-B-1 (Regular)
5874:nsrstage: Automatically copying save sets(s) to other volume(s)
Starting migration operation for Regular save sets...
6217:nsrstage: ...from storage node: tara.pmdg.lab
81542:nsrstage: Successfully cloned all requested Regular save sets (with new cloneid)
79629:nsrstage: Clones were written to the following volume(s) for Regular save sets:
6359:nsrstage: Deleting the successfully cloned save set 4244365627
Recovering space from volume 15867081 failed with the error 'volume 
mounted on ADV_FILE disk AFTD1 is reading'.
Refer to the NetWorker log for details.
89216:nsrd: volume mounted on ADV_FILE disk AFTD1 is reading

When a space reclamation is subsequently run (either as part of overnight reclamation, or an explicit execution of nsrim -X), the something along the following lines will get logged:

nsrd NSR info Index Notice: nsrim has finished crosschecking the media db 
nsrd NSR info Media Info: No space was recovered from device AFTD2 since there was no saveset eligible for deletion. 
nsrd NSR info Media Info: No space was recovered from device AFTD4 since there was no saveset eligible for deletion. 
nsrd NSR info Media Info: Deleted 87 MB from save set 4244365627 on volume AFTD-B-1 
nsrd NSR info Media Info: Recovered 87 MB by deleting 1 savesets from device AFTD1. 
nsrsnmd NSR warning volume (AFTD-B-1) size is now set to 2868 MB

Thus, you shouldn’t need to worry about concurrency any longer.

Another change introduced to the AFTD architecture is the device name is now divorced from the path. This is stored in a different field. For example:

AFTD path/device name separation

In the above example, the AFTD device has a device name of AFTD1, and the path to it is /d/backup1 – or to be more accurate, the first path to it is /d/backup1. I’ll get to the real import of that in a moment.

There is a simpler management benefit: previously if you setup an AFTD and needed to later change the path that it was mounted from, you had to do the following:

  1. Unmount the disk backup unit within NetWorker
  2. Delete the AFTD definition within NetWorker
  3. Create a new AFTD definition pointing to the new location
  4. At the OS, remount the AFTD filesystem at the new location
  5. Mount the AFTD volume from its new location in NetWorker

This was a tedious process, and the notion of “deleting” an AFTD, even though it had no bearing on the actual data stored on it, did not appeal to a lot of administrators.

However, the first path specified to an AFTD in the “Device Access Information” field refers to the mount point on the owner storage node, so under v8, all you’ve got to do in order to relocate the AFTD is:

  1. Unmount the disk backup unit within NetWorker.
  2. Remount the AFTD filesystem in its new location.
  3. Adjust the first entry in the “Device Access Information” field.
  4. Remount the AFTD volume in its new location.

This may not seem like a big change, but it’s both useful and more logical. Obviously another benefit of this is that you no longer have to remember device paths when performing manual nsrmm operations against AFTDs – you just specify the volume name. So your nsrmm commands would go from:

# nsrmm -u -f /d/backup1

to, in the above example:

# nsrmm -u -f AFTD1

The real benefit of this though is what I’ve been alluding to by talking about the first mount point specified. You can specify alternate mount points for the device. However, these aren’t additional volumes on the server – they’re mount points as seen by clients. This allows a bypass of using nsrmmd to perform a write to a disk backup unit from the client, and instead sees the client write via whatever operating system mount mechanism it’s used (CIFS or NFS).

In this configuration, your disk backup environment can start to look like the following:

AFTD changes - v8.x AFTD with NFS

You may be wondering what advantage this offers, given the backup is still written across the network – well, past an initial negotiation with nsrmmd on the storage node/server for the file to write to, the client then handles the rest of the write itself, not bothering nsrmmd again. In other words – so long as the performance of the underlying filesystem can scale, the number of streams you can write to an AFTD can expand beyond the maximum number of nsrmmd processes an AFTD or storage node can handle.

At this point, the “Device Access Information” is used as a multi-line field, with each subsequent line representing an alternate path the AFTD is visible at:

AFTD multimount

So, the backup process will work such that if the pool allows the specified AFTD to be used for backup, and the AFTD volume is visible to a client with “Client direct” setting enabled (a new option in the Client resource), on one of the paths specified, then the client will negotiate access to the device through a device-owner nsrmmd process, and go on to write the backup itself.

Note that this isn’t designed for sharing AFTDs between storage nodes – just between a server/storage node and clients.

Also, in case you’re skeptical, if your OS supports gathering statistics from the network-mount mechanism in use, you can fairly readily see whether NetWorker is honouring the direct access option. For instance, in the above configuration, I had a client called ‘nimrod’ mounting the disk backup unit via NFS from the backup server; before the backup started on a new, fresh mount, nfsstat showed no activity. After the backup, nfsstat yielded the following:

[root@nimrod ~]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
834        0          0
Client nfs v3:
null         getattr       setattr      lookup       access       readlink 
0         0% 7          0% 3         0% 4         0% 9         1% 0         0% 
read         write         create       mkdir        symlink      mknod 
0         0% 790       94% 1         0% 0         0% 0         0% 0         0% 
remove       rmdir         rename       link         readdir      readdirplus 
1         0% 0          0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo        pathconf     commit 
10        1% 2          0% 0         0% 6         0%

NetWorker also reports in the savegroup completion report that a client successfully negotiated direct file access (DFA) for a backup, too:

* nimrod:All savefs nimrod: succeeded.
* nimrod:/boot 86705:save: Successfully established DFA session with adv_file device for save-set ID '1694225141' (nimrod:/boot).
V nimrod: /boot                    level=5,          0 KB 00:00:01      0 files
* nimrod:/boot completed savetime=1341899894
* nimrod:/ 86705:save: Successfully established DFA session with adv_file device for save-set ID '1677447928' (nimrod:/).
V nimrod: /                        level=5,         25 MB 00:01:16    327 files
* nimrod:/ completed savetime=1341899897
* nimrod:index 86705:save: Successfully established DFA session with adv_file device for save-set ID '1660670793' (nox:index:65768fd6-00000004-4fdfcfef-4fdfd013-00041a00-3d2a4f4b).
 nox: index:nimrod                 level=5,        397 KB 00:00:00     25 files
* nimrod:index completed savetime=1341903689

Finally, logging is done in the server daemon.raw to also indicate that a DFA save has been negotiated:

91787 07/10/2012 05:00:05 PM  1 5 0 1626601168 4832 0 nox nsrmmd NSR notice Save-set ID '1694225141' 
(nimrod:/boot) is using direct file save with adv_file device 'AFTD1'.

The net result of all these architectural changes to AFTDs is a significant improvement over v7.x AFTD handling, performance and efficiency.

(As noted previously though, this doesn’t apply just to AFTDs, for what it’s worth – that client direct functionality also applies to Data Domain Boost devices, which allows a Data Domain system to integrate even more closely into a NetWorker environment. Scaling, scaling, scaling: it’s all about scaling.)

In order to get the full benefit, I believe sites currently using AFTDs will probably go through the most pain; those who have been using VTLs may have a cost involved in the transition, but they’ll be able to transition to an optimal architecture almost immediately. However, sites with multiple smaller AFTDs won’t see the full benefits of the new architecture until they redesign their backup to disk environment, increasing the size of backup volumes. That being said, the change pain will be worth it for the enhancements received.

Jul 112012


Those of us who have been using NetWorker for many years know it’s been a long time between major version numbers. That’s not to say that NetWorker has sat still; after an earlier period of slower development earlier in the 7.x tree, we saw in NetWorker 7.6 that EMC had well and truly re-committed to the product. At the time of its release, many of us wondered why NetWorker 7.6 hadn’t been numbered NetWorker 8. The number of changes from 7.5 to 7.6 were quite large, and the service packs for 7.6 also turned out to be big incremental improvements of some form or another.

Now though, NetWorker 8 has arrived, and one thing is abundantly clear: it most certainly is a big enough collection of improvements and changes to warrant a major version number increase. More importantly, and beyond the number of changes, it warrants the v8 moniker on the basis of the type of changes: this is not really NetWorker v7+1 under the hood, but a significantly newer beast.


I’ve gone back and forth on whether I should try to do a single blog post outlining NetWorker 8, or a series of posts. The solution? Both. This piece will be a brief overview; I’ll then do some deeper dives into specific areas of changes within the product.

If you’re interested in the EMC official press release for NetWorker 8, by the way, you’ll find it here. EMC have also released a video about the new version, available here.


The first, most important thing I’ll say about NetWorker 8 is that you must take time to read both the release notes and the upgrade guide before you go about updating your system to the new version. I cannot stress this enough: if you are not prepared to carefully, thoroughly read the release notes and the upgrade guide, do not upgrade at all.

(To find those documents, mosey over to PowerLink – I expect it will be updated within the next 12-24 hours with the official documents and downloads.)

An upgrade technique which I’ve always advocated, as of NetWorker 8, now becomes mandatory: you must upgrade all your storage nodes prior to upgrading the NetWorker server. I.e., after preliminary index and bootstrap backups, your upgrade sequence should look like the following:

  1. Stop NetWorker on server and storage nodes.
  2. Remove the nsr/tmp directory on the server and the storage nodes.
  3. Upgrade NetWorker on all storage nodes.
  4. Start NetWorker on storage nodes.
  5. Upgrade NetWorker on server.
  6. Start NetWorker on the server.
  7. Perform post-upgrade tests.
  8. Schedule client upgrades.

Any storage node not upgraded to NetWorker 8 will not be able to communicate with the NetWorker v8 server for device access. This has obvious consequences if your environment is using storage nodes locked by reasons of client compatibility, release compatibility or embedded functionality, to a 7.x release. E.g., customers using the EDL embedded storage node will not be upgrade to NetWorker 8 unless EMC comes through with an 8.0 EDL storage node update. (I genuinely do not know if this will happen, for what it’s worth.)

If you have a lot of storage nodes, and upgrading both the server and the storage nodes in the same day is not an option, then you can rest assured that a NetWorker 8 storage node is backwards compatible with a NetWorker 7.6.x server. That is, you can over however many days you need to, upgrade your storage nodes to v8, before upgrading your NetWorker server. I’d still suggest though that for best operations you should upgrade the storage node(s) and server all on the same day.

The upgrade guide also states that you can’t jump from a NetWorker 7.5.x (or lower, presumably) release to NetWorker 8. You must first upgrade the NetWorker server to 7.6.x, before subsequently upgrading to NetWorker 8. (In this case, the upgrade will at bare minimum involve upgrading the server to 7.6, starting it and allowing the daemons to fully stabilise, then shutting it down and continuing the upgrade process.)

The enhancements

Working through the release notes, I’m categorising the NetWorker updates into 5 core categories:

  1. Architectural – Deep functionality and/or design changes to how NetWorker operates. This will likely also result in some changes of the types listed below, but the core detail of the change should be considered architectural.
  2. Interface – Changes to how you can interact with NetWorker. Usually this means changes to NMC.
  3. Operational – Changes which primarily affect backup and recovery operations, or may represent a reduction in functionality between this and the previous version.
  4. Reporting – Changes to how NetWorker reports information, either in logs, the GUI, or via the command line.
  5. Security – Changes that directly affect or relate to system security.

For the most part, documentation updates have been limited to updates based on the changes in the software, so I’m disinclined to treat that as a separate category of updates in and of itself.

Architectural Changes

As you may imagine with a major version change, this area saw, in my opinion, the most, and biggest changes to NetWorker for quite some time.

When NetWorker 7.0 came out, the implementation of AFTD was a huge change compared to the old file-type devices. We finally had disk backup that behaved like disk, and many long-term NetWorker users, myself included, were excited about the potential it offered.

Yet, as the years went on, the limitations of the AFTD architecture caused just as much angst as the improvements had solved. While the read-write standard volume and read-only shadow volume implementation were useful, they were in actual fact just trickery to get around read-write concurrency. There were also several other issues, most notably the inability to have savesets move from one disk backup unit to another if an AFTD volume filled, and the perennial problem of not being able to simultaneously clone (or stage) from a volume whilst recovering from it.

Now, if you’re looking through the release notes and hoping to see reference to savesets being able to fill one disk backup unit and continue seamlessly onto another disk backup unit, you’re in for a shock: it’s not there. If like me, you’ve been waiting quite a long time for that feature to arrive, you’re probably as disappointed as I was when I participated in the Alpha for NetWorker 8. However, all is not lost! The reason it’s not there is that it’s no longer required at all.

Consider why we wanted savesets that fill one AFTD volume to fail over to another. Typically it was because:

  • Due to inability to have concurrent clone/read operations, data couldn’t be staged from them fast enough;
  • AFTDs did not play well with each other when more than one device shared the same underlying disk;
  • Backup performance was insufficient when writing a large number of savesets to a single very large disk backup unit.

In essence, if you had 10TB of disk space (formatted) to present to NetWorker for disk backup, a “best practice” configuration would have typically seen this presented as 5 x 2TB filesystems, and one AFTD created per filesystem. So, we were all clamouring for saveset continuation options because of the underlying design.

With version 8, the underlying design has been changed. It’s likely going to necessitate a phased transition of your disk backup capacity if you’re currently using AFTDs, but the benefits are very likely to outweigh that inconvenience. The new design allows for multiple nsrmmd processes to concurrently access (read and write) an AFTD. Revisiting that sample 10TB disk backup allocation, you’d go from 5 x 2TB filesystems to a single 10TB filesystem instead, while being able to simultaneously backup, clone, recover, stage, and do it more efficiently.

The changes here are big enough that I’ll post a deep dive article specifically on it within the next 24 hours. For now, I’ll move on.

Something I didn’t pickup during the alpha and beta testing relates to client read performance – previous versions of NetWorker always read chunks of data from the client at a fixed, 64KB size. (Note, this isn’t related to the device write block size.) You can imagine that a decade or so ago, this would have been a good middle ground in terms of efficiency. However, these days clients in an enterprise backup environment are just as likely to be powerhouse machines attached to high speed networking and high speed storage. So now, a NetWorker client will dynamically alter the block sized it uses to read data, based on an ongoing NetWorker measurement of its performance.

NetWorker interprocess communications have been improved where devices are concerned. For version 7.x and lower, the NetWorker server daemon (nsrd) has always been directly responsible with communicating with and controlling individual nsrmmd processes, either locally or on storage nodes. As the number of devices increase within an environment, this has seen a considerable load placed on the backup server – particularly during highly busy times. In simple terms, the communication has looked like this:

Pre v8 nsrd/nsrmmd communications

While no doubt simpler to manage, this has resulted in nsrd having to service a lot more interrupts, reducing its efficiency, and thereby reducing the efficiency of the overall backup environment. So, as of v8, a new daemon has been introduced – nsrsnmd, which is a storage node media multiplexor daemon manager. The communications path now sees nsrmmd processes on each storage node managed by a local nsrsnmd, which in turn receives overall instruction by NetWorker. The benefit of this is that nsrd now has far fewer interruptions, particularly in larger environments, to deal with:

nsrd/nsrsnmd/nsrmmd communications, v8

Undoubtedly, larger datazones will see benefit from this change.

The client direct feature has been expanded considerably in NetWorker 8. You can nominate a client to do client-direct backups, update the paths details for an AFTD to include the client pathname to the device, and the client can subsequently write directly to the AFTD itself. This, too, I’ll cover more in the deep-dive on AFTD changes. By the way, that isn’t just for AFTDs – if you’ve got a Data Domain Boost system available, the Client Direct attribute equally works there, too. Previously this had been pushed out to NetWorker storage nodes – now it extends to the end clients.

There’s some useful changes for storage nodes now – you can control at the storage node level whether the system is enabled or disabled: much easier than say, enabling or disabling all the devices attached to the storage node. Further, you can now specify, at the storage node level, what are the clone storage nodes for backups on that storage node. In short: configuration and maintenance relating to storage nodes has become less fiddly.

The previous consolidated backup level has been thrown away and rewritten as synthetic fulls. A bit like source based deduplication this can hide real issues – i.e., while it makes it faster to achieve a full backup, it doesn’t necessarily improve recovery performance (aside from reducing the number of backups required to achieve a recovery.) I’ll do a deep dive on synthetic fulls, too – though in a couple of weeks.

An architectural change I disagree with: while not many of us used ConnectEMC, it appears to be going away (though is still technically present in NetWorker), and is now replaced by a “Report Home” functionality. That, I don’t mind. What I do mind is that it is enabled by default. Yes, your server will automatically email a third party company (i.e., EMC) when a particular fault occurs in NetWorker under v8 and higher. Companies that have rigorous security requirements will undoubtedly be less than pleased over this. There is however, a way to disable it, though not as I’ve noticed as yet, from within NMC. From within nsradmin, you’ll find a resource type called “NSR report home”; change the email address for this from NetWorkerProfile@emc.com to an internal email address for your own company, if you have concerns. (E.g., Sunday morning, just past, at 8am: I just had an entire copy of my lab server’s configuration database emailed to me, having changed the email address. I would argue that this feature, being turned on by default, is a potential security issue.)

A long-term nice to have change is the introduction of a daemon environment file. Previously it was necessary to establish system-wide environment variables, which where easy to forget about – or, if on Unix, it was tempting to insert NetWorker specific environment variables into the NetWorker startup script (/etc/init.d/networker). However, if they were inserted there, they’d equally be lost when you upgraded NetWorker and forgot to save a copy of that file. That’s been changed now with a nsr/nsrrc file, which is read on daemon startup.

Home Base has been removed in NetWorker 8 – and why wouldn’t it? EMC has dropped the product. Personally I think this isn’t necessarily a great decision, and EMC will end up being marketed against for having made it. I think the smarter choice here would have been to either open source Home Base, or make it a free component of NetWorker and Avamar getting only minimal upgrades to remain compatible with new OS releases. While BMR is no longer as big a thing as it used to be before x86 virtualisation, Home Base supported more than just the x86 platform, and it seems like a short-sighted decision.

Something I’m sure will be popular in environments with a great many tape drives (physical or virtual) is that preference is now given in recovery situations towards loading volumes for recovery into read-only devices if they’re available. It’s always been a good idea when you have a strong presence of tape in an environment to keep at least one drive marked as read-only so it’s free for recoveries, but NetWorker would just as likely load a recovery volume into a read-write enabled volume. No big deal, unless it then disrupts backups. That should happen less often now.

There’s also a few minor changes too – the SQL Anywhere release included with NMC has been upgraded, as has Apache, and in what will be a boon to companies that have NetWorker running for very long periods of time without a shutdown: the daemon.raw log can now be rolled in real-time, rather than on daemon restart. Finally on the architectural front, Linux gets a boost in terms of support for persistent named paths to devices: there can now be more than 1024 of them. I don’t think this will have much impact for most companies, but clearly there’s at least one big customer of EMC who needed it…

Interface Changes

Work has commenced to allow NMC to have more granular control over backups. You can now right-click on a save session, regardless of whether it originates from a group that has been started from the command line or automatically. However, you still can’t (as yet) stop a manual save, nor can you stop a cloning session. More work needs to be done here – but it is a start. Given the discrepancies between what the release notes say you should be able to stop, and what NMC actually lets you stop, I expect we’ll see further changes on this in a patch or service pack for NetWorker 8.

Multiple resource editing comes into NMC. You can select a bunch of resources, right-click on the column that you wish to change to the same value for each of them, and make the change to all at once:

Multi resource edit 1 of 2

Multi resource edit 2 of 2

While this may not seem like a big change, I’m sure there’s a lot of administrators who work chiefly in NMC that’ll be happy to see the above.

While I’ve not personally used the NetWorker Module for Microsoft Applications (NMM) much, everyone who does use it regularly says that it doesn’t necessarily become easier to use with time. Hopefully now, with NMM configuration options added to the New Client Wizard in NMC, NMM will become easier to deal with.

Operational Changes

There’s been a variety of changes around bare metal recovery. The most noticeable, as I mentioned under architecture, was the removal of EMC Home Base. However, EMC have introduced custom recovery ISOs for the x86 and x86_64 Windows platforms (Windows 7, Windows 2008 R2 onwards). Equally though, Windows XP/2003 ASR based recovery is no longer supported for NetWorker clients running NetWorker v8. However, given the age of these platforms I’d suggest it was time to start phasing out functionality anyway. While we’re talking about reduction in functionality, be advised that NMM 2.3 will not work with NetWorker 8 – you need to make sure that you upgrade your NMM installs first (or concurrently) to v2.4, which has also become recently available.

While NetWorker has always recycled volumes automatically (despite ongoing misconceptions about the purpose of auto media management), it does it on an as-needs basis, during operations that require writable media. This may not always be desirable for some organisations, who would like to see media freshly labeled and waiting for NetWorker when it comes time to backup or clone. This can now be scheduled as part of the pool settings:

Volume recycling management

Of course, as with anything to do with NetWorker pools, you can’t adjust the bootstrap pools, so if you want to use this functionality you should be using your own custom pools. (Or to be more specific: you should be using your own custom pools, no matter what.) When recycling occurs, standard volume label deletion/labelling details are logged, and a log message is produced at the start of the process along the lines of:

tara.pmdg.lab nsrd RAP notice 4 volumes will be recycled for pool `Tape' in 
jukebox `VTL1'.

NDMP doesn’t get left out – so long as you’ve entered the NDMP password details into a client resource, you can now select savesets by browsing the NDMP system, which will be a handy feature for some administrators. If you’re using NetApp filers, you get another NDMP benefit: checkpoint restarts are now supported on that platform.

NetWorker now gets a maintenance mode – you can either tell NetWorker to stop accepting new backup sessions, new recovery sessions, or both. This is configured in the NetWorker server resource:

NetWorker Maintenance Mode

Even recently I’d suggested on the NetWorker mailing list that this wasn’t a hugely necessary feature, but plenty of other people disagreed, and they’ve convinced me of its relevance – particularly in larger NetWorker configurations.

For those doing NetWorker client side compression, you can all shout “Huzzah!”, for EMC have introduced new client side compression directives: gzip and bzip2, with allowance to specify the level, as well, of the compression. Never fear though – the old compressasm is still there, and the default. While using these new compression options can result in a longer backup at higher CPU utilisation, it can have a significant decrease in the amount of data to be sent across the wire, which will be of great benefit to anyone backing up through problematic firewalls or across WANs. I’ll have a follow-up article which outlines the options and results in the next day or so.

Reporting Changes

Probably the biggest potential reporting change in NetWorker 8 is the introduction of nsrscm_filter, an EMC provided savegroup filtering system. I’m yet to get my head around this – the documentation for it is minimal to the point where I wonder why it’s been included at all, and it looks like it’s been written by the developer for developers, rather than administrators. Sorry EMC, but for a tool to be fundamentally useful, it should be well documented. To be fair, I’m going through a period of high intensity work and not always of a mood to trawl through documentation when I’m done for the day, so maybe it’s not as difficult as it looks – but it shouldn’t look that difficult, either.

On the flip-side, checking (from the command line) what went on during a backup now is a lot simpler thanks to a new utility, nsrsgrpcomp. If you’ve ever spent time winnowing through /nsr/tmp/sg to dig out specific details for a savegroup, I think you’ll fall in love with this new utility:

[root@nox ~]# nsrsgrpcomp
 nsrsgrpcomp [ -s server ] groupname
 nsrsgrpcomp [ -s server ] -L [ groupname ]
 nsrsgrpcomp [ -s server ] [ -HNaior ] [ -b num_bytes | -l num_lines ] [ -c clientname ] [ -n jobname ] [ -t start_time ] groupname
 nsrsgrpcomp [ -s server ] -R jobid
[root@nox ~]# nsrsgrpcomp belle
belle, Probe, "succeeded:Probe"
* belle:Probe + PATH=/usr/gnu/bin:/usr/local/bin:/bin:/usr/bin:.:/bin:/sbin:/usr/sbin:/usr/bin
* belle:Probe + CHKDIR=/nsr/bckchk
* belle:Probe + README=
* belle:/Volumes/Secure Store 90543:save: The save operation will be slower because the directory 
entry cache has been discarded.
* belle:/Volumes/Secure Store belle: /Volumes/Secure Store level=incr, 16 MB 00:00:09 45 files
* belle:/Volumes/Secure Store completed savetime=1341446557
belle, index, "succeeded:9:index"
* belle:index 86705:save: Successfully established DFA session with adv_file device for 
save-set ID '3388266923' (nox:index:aeb3cb67-00000004-4fdfcfec-4fdfcfeb-00011a00-3d2a4f4b).
* belle:index nox: index:belle level=9, 29 MB 00:00:02 27 files
* belle:index completed savetime=1341446571

Similarly, within NMC, you can now view the notification for a group not only for the most recent group execution, but a previous one – very useful if you need to do a quick visual comparison between executions:

Checking a group status over multiple days

There’s a new notification for when a backup fails to start at the scheduled time – I haven’t played around with this yet, but it’s certainly something I know a lot of admins have been asking to see for quite a while.

I’ve not really played around with NLM license administration in NMC. I have to admit, I’m old-school here, and “grew up” using LLM from the command line with all its somewhat interesting command line options. So I can’t say what NMC used to offer, off-hand, but the release notes tell us that NMC now reports NLM license usage, which is certainly does.

Security Changes

Multi-tenancy has come to NetWorker v8. With it, you can have restricted datazones that give specified users access to a part of NetWorker, but not the overall setup. Unsurprisingly, it looks a lot like what I proposed in November 2009 in “Enhancing NetWorker Security: A theoretical architecture“. There’s only so many ways you could get this security model in place, and EMC obviously agreed with me on it. Except on the naming: I suggested a resource type called “NSR admin zone”; instead, it got named “NSR restricted data zone”. The EMC name may be more accurate, but mine was less verbose. (Unlike this blog post.)

I’d suggest while the aim is for multi-tenancy, at the moment the term operational isolation is probably a closer fit. There’s still a chunk of visibility across restricted datazones unless you lock down the user(s) in the restricted data zone to only being able to run an ad-hoc backup or request a recovery. A user in a restricted datazone, as soon as he/she has the ability to monitor the restricted datazone, for instance, can still run nsrwatch against the NetWorker server and see details about clients they should not have visibility on (e.g., “Client X has had NetWorker Y installed” style messages shouldn’t appear, when X is not in the restricted datazone.) I would say this will be resolved in time.

As such, my personal take on the current state of multi-tenancy within NetWorker is that it’s useful for the following situations:

  • Where a company has to provide administrative access to subsets of the configuration to specific groups or individuals – e.g., in order for DBAs to agree to having their backups centralised, they may want the ability to edit the configuration for their hosts;
  • Where a company intends to provide multi-tenancy backups and allow a customer to only be able to restore their data – no monitoring, no querying, just running the relevant recovery tool.

In the first scenario, it wouldn’t matter if an administrator of a restricted datazone can see details about hosts outside of his/her datazone; in the second instance, a user in a restricted datazone won’t get to see anything outside of their datazone, but it assumes the core administrators/operators of the NetWorker environment will handle all aspects of monitoring, maintaining and checking – and possibly even run more complicated recoveries for the customers themselves.

There’s been a variety of other security enhancements; there’s tighter integration on offer now between NetWorker and LDAP services, with the option to map LDAP user groups to NetWorker user groups. As part of the multi-tenancy design, there’s some additional user groups, too: Security Administrators, Application Administrators, and Database Administrators. The old “Administrators” user group goes away: for this reason you’ll need to make sure that your NMC server is also upgraded to NetWorker v8. On that front, nsrauth (i.e., strong authentication, not the legacy ‘oldauth’) authentication is now mandatory between the backup server and the NMC server when they’re on separate hosts.

Finally, if you’re using Linux in a secure environment, NetWorker now supports SELinux out of the box – while there’s been workarounds available for a while, they’ve not been officially sanctioned ones, so this makes an improvement in terms of support.

In Summary

No major release of software comes out without some quirks, so it would be disingenuous of me to suggest that NetWorker 8.0 doesn’t have any. For instance, I’m having highly variable success in getting NetWorker 8 to function correctly with OS X 10.7/Lion. On one Mac within my test environment, it’s worked flawlessly from the start. On another, after it installed, nsrexecd refused to start, but then after 7.6.3.x was reinstalled, then v8 was upgraded, it did successfully run for a few days. However, another reboot saw nsrexecd refusing to start, citing errors along the lines of “nsrexecd is already running under PID = <x>. Only one instance at a time can be run“, where the cited PID was a completely different process (ranging from Sophos to PostgreSQL to internal OS X processes). In other words, if you’re running a Mac OS X client, don’t upgrade without first making sure you’ve got the old client software also available: you may need it.

Yet, since every backup vendor ends up with some quirks in a dot-0 release (even those who cheat and give it a dot-5 release number), I don’t think you can hold this against NetWorker, and we have to consider the overall benefits brought to the table.

Overall, NetWorker 8 represents a huge overhaul from the NetWorker 7.x tree: significant AFTD changes, significant performance enhancements, the start of a fully integrated multi-tenancy design, better access to backup results, and changes to daemon architecture to allow the system to scale to even larger environments means that it will be eagerly accepted by many organisations. My gut feel is that we’ll have a faster adoption of NetWorker 8 than the standard bell-curve introduction – even in areas where features are a little unpolished, the overall enhancements brought to the table will make it a very attractive upgrade proposition.

Over the coming couple of weeks, I’ll publish some deep dives on specific areas of NetWorker, and put some links at the bottom of this article so they’re all centrally accessible.

Should you upgrade? Well, I can’t answer that question: only you can. So, get to the release notes and the upgrade notes now that I’ve whet your appetite, and you’ll be in a position to make an informed decision on it.

Deep Dives

The first deep dive is available: NetWorker 8 Advanced File Type Devices. (There’ll be a pause of a week or so before the next one, but I’ll have another, briefer article about some NetWorker 8 enhancements in a few days time.)

For an overview of the new synthetic full backup options in NetWorker 8, check out the article “NetWorker 8: Synthetic Fulls“.

Shorter Posts

A quick overview of the changes in client side compression.


Oct 282010

Sometimes it’s helpful to run NetWorker in debug mode – but sometimes, you just want to throw the nsrmmd processes into debug mode, and depending on your site, there may be a lot of them.

So, I finally got around to writing a “script” to throw all nsrmmd processes into debug mode. It hardly warrants being a script, but it may be helpful to others. Of course, this is Unix only – I’ll leave it as an exercise to the reader to generate the equivalent Windows script.

The entire script is as follows:



if [ "$PLATFORM" = "Linux" ]
	PROCLIST=`ps -C nsrmmd -o pid | grep -v PID`
elif [ "$PLATFORM" = "SunOS" ]
	PROCLIST=`ps -ea -o pid,comm | grep 'nsrmmd$' | awk '{print $1}'`


for pid in $PROCLIST
	echo dbgcommand -p $pid Debug=$DBG
	dbgcommand -p $pid Debug=$DBG

The above is applicable only to Solaris and Linux so far – I’ve not customised for say, HPUX or AIX simply because I don’t have either of those platforms hanging around in my lab. To invoke, you’d simply run:

# dbgnsrmmd.sh level

Where level is a number between 0 (for off) and 99 (for … “are you insane???”). Running it on one of my lab servers, it works as follows:

[root@nox bin]# dbgnsrmmd.sh 9
dbgcommand -p 4972 Debug=9
dbgcommand -p 4977 Debug=9
dbgcommand -p 4979 Debug=9
dbgcommand -p 4982 Debug=9
dbgcommand -p 4991 Debug=9
dbgcommand -p 4999 Debug=9
Note that when you invoke dbgcommand against a sub-daemon such as nsrmmd (as opposed to nsrd itself), you won’t get an alert in the daemon.{raw|log} file to indicate the debug level has changed.
Nov 252009

Everyone who has worked with ADV_FILE devices knows this situation: a disk backup unit fills, and the saveset(s) being written hang until you clear up space, because as we know savesets in progress can’t be moved from one device to another:

Savesets hung on full ADV_FILE device until space is cleared

Honestly, what makes me really angry (I’m talking Marvin the Martian really angry here) is that if a tape device fills and another tape of the same pool is currently mounted, NetWorker will continue to write the saveset on the next available device:

Saveset moving from one tape device to another

What’s more, if it fills and there’s a drive that currently does have a tape mounted, NetWorker will mount a new tape in that drive and continue the backup in preference to dismounting the full tape and reloading a volume in the current drive.

There’s an expression for the behavioural discrepancy here: That sucks.

If anyone wonders why I say VTLs shouldn’t need to exist, but I still go and recommend them and use them, that’s your number one reason.

Oct 162009

For a while now I’ve been working with EMC support on an issue that’s only likely to strike sites that have intermittent connectivity between the server and storage nodes and that stage from ADV_FILE on the storage node to ADV_FILE on the server.

The crux of the problem is that if you’re staging from storage node to server and comms between the sites are lost for long enough that NetWorker:

  • Detects the storage node nsrmmd processes have failed, and
  • Attempts to restart the storage node nsrmmd processes, and
  • Fails to restart the storage node nsrmmd processes

Then you can end up in a situation where the staging aborts in an ‘interesting’ way. The first hint of the problem is that you’ll see a message such as the following in your daemon.raw:

68975 10/15/2009 09:59:05 AM  2 0 0 526402000 4495 0 tara.pmdg.lab nsrmmd filesys_nuke_ssid: unable to unlink /backup/84/05/notes/c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 on device `/backup’: No such file or directory

(The above was rendered for your convenience.)

However, if you look for the cited file, you’ll find that it doesn’t exist. That’s not quite the end of the matter though. Unfortunately, while the saveset file that was being staged didn’t stay on disk, its media database details did. So in order to restart staging, it becomes necessary to first locate the saveset in question and delete the media database entry for the (failed) server disk backup unit copy. Interestingly, this is only ever to be found on the RW device, not the RO device:

[root@tara ~]# mminfo -q "ssid=c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342"
 volume        client       date      size   level  name
Tara.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001.RO    fawn      10/15/2009 1287 MB manual  /usr/share

We had hoped that it was fixed in, but my tests aren’t showing that to be the case. Regardless, it’s certainly around in 7.4.x as well and (given the nature of it) has quite possibly been around for a while longer than that.

As I said at the outset, this isn’t likely to affect many sites, but it is something to be aware of.

Oct 122009

In many environments with storage nodes, a common requirement is to share backup devices between the server and/or storage nodes (regardless of whether the storage nodes are dedicated or full). The primary goal is to reduce the number of devices, or the number of tape libraries required in order to minimise cost while still maximising flexibility of the environment.

There are two mechanisms available for device sharing. These are:

  • Library sharing – free from any licensing, this is the cheapest but least flexible
  • Dynamic drive sharing – requiring additional licenses, this is more flexible but comes at a higher cost in terms of maintenance, debugging and complexity.

It’s easiest to gain an understanding of how these two options work with some diagrams.

First, let’s consider library sharing:

Conceptual diagram of library sharing

Conceptual diagram of library sharing

(Not shown: connection to tape robot head – i.e., control port connection.)

In this configuration, more than one host connects to specific devices in the tape library. These are hard or permanent connections; that is, once a device has been allocated to one server/storage node, it stays allocated to that host until the library is reconfigured.

This is a static allocation of resources that has the backup administrator allocate a specific number of devices per server/storage node based on the expected requirements of the environment. For instance, in the example above, the server has permanent mappings to 3 of the 6 tape drives in the library; the full storage node has permanent mappings to 2 of the drives, and the dedicated storage node has a permanent mapping to the one remaining drive in the tape library.

The key advantages of this allocation method are:

  • Zero licensing cost,
  • Guaranteed device availability,
  • Per-host/device isolation, preventing faults on one system from cascading to another.

The disadvantages of this allocation method are:

  • No dynamic reallocation of resources in the event of requirement spikes that were not anticipated
  • Can’t be reconfigured “on the fly”
  • If a backup device fails and a host only has access to one device, it won’t be able to backup or recover without configuration changes.

Where you would typically use this allocation method:

  • In VTLs – since NetWorker licenses VTLs by capacity*, you can allocate as many virtual drives as you want, providing each host with more than a certain amount of data in the datazone with one or more virtual drives, significantly reducing LAN impact of backup.
  • In PTLs where backup/recovery load is shared reasonably equally by two or more storage nodes (counting the server as a storage node in this context) and having only one library is desirable.

Moving on to dynamic drive sharing, this model resembles the following:

Conceptual overview of dynamic drive sharing

Conceptual overview of dynamic drive sharing

In this model, licenses are purchased, on a per-drive basis, for dynamic sharing. (So if you have 6 tape drives, such as in the above example, you would need up to 6 dynamic drive sharing licenses – you don’t have to share every device in the library – some could remain statically mapped if desired.)

When a library with dynamically shared drives is setup within NetWorker, the correct path to the device, on a per-host basis, will be established for that device within the configuration. This might mean that the device “/dev/nst0” on the backup server might be known in the configuration as being all three of the following:

  • /dev/nsto
  • rd=stnode:/dev/rmt/0cbn
  • rd=dedstnode:\.Tape0

When a host with dynamic access to drives needs a device (for either backup or recovery), the NetWorker server (or whichever host has control over the actual tape robot) will load a tape into a free, mappable drive, then notify the storage node which device it should use to access media. The storage node will then use the media until it no longer needs to, with the host that controls the robot handling any post-use unmounts or media changes.

The advantages of dynamic drive sharing are:

  • With maximal device sharing enabled, resource spikes can be handled by dynamically allocating a useful number of drives to host(s) that need them at any given time,
  • Fewer drives are typically required than would be in a library sharing/statically mapped allocation method.

The disadvantages of dynamic drive sharing are:

  • With multiple hosts able to see the same devices, isolating devices from SCSI resets and other SAN events from non-accessing hosts requires constant vigilance. HBAs and SAN settings must be configured appropriately, and these settings must be migrated/checked every time drivers change, systems are updated, etc.
  • It is relatively easy to misconfigure dynamic drive sharing by planning to use too few physical tape drives than are really necessary. (I.e., it’s not that cheap, from a hardware perspective either.)
  • Each drive that is dynamically shared requires a license.
  • Unlike products such as say, NetBackup, due to the way nsrmmd’s work and don’t share nicely with each other, volumes must be unmounted before devices are transferred from one host with dynamic drive sharing to another, even if both hosts will be using the same volume. (This falls into the “lame” feature category.)

Where you would typically use dynamic drive sharing includes areas such as:

  • A small, select number of hosts with significant volumes of data require LAN-free backups,
  • With small/isolated storage nodes that are still SAN connected (e.g., DMZ storage nodes).

A lot of the architectural reasons as to why dynamic drive sharing was originally developed has in some senses gone away with greater penetration of VTLs into the backup arena. Given that it’s a straight forward proposition to configure a large number of virtual tape drives, instead of messing about with dynamic drive sharing one can instead choose to just use library sharing in VTL environments to achieve the best of both worlds.

* Currently non-EMC VTLs, while still licensed by capacity, typically co-receive unlimited autochanger licenses. Even so, such licenses are not limited by the the number of virtual tape drives.

May 222009

OK, there’s not a lot about NetWorker that drives me nuts. I think I’ve done only one other “Quibbles” topic here so far, but I’ve reached the point on this one where I’d like to vent some exasperation.

There are times – not often, but they occasionally happen – where for some reason or another, a device will lock up and become unresponsive. When this reaches a point where the only way to recover is to either kill the controlling nsrmmd process or restarting NetWorker, things get tough.

The reason for this is that NetWorker does not, anywhere, provide a mapping between each nsrmmd, the device it controls and the process ID for that device.

Honestly, this is one of these basic administrative usability issues for which there is no excuse that it hasn’t been resolved and available for the last 5 years, if not the last 10 years. It comes down to either laziness or apathy – people have been asking for it long enough that with all the changes done to nsrmmd over the years, it should have been added a long time ago.

What do you think?

 Posted by at 7:41 pm  Tagged with:

Dealing with low-bandwidth connection to storage nodes

 Architecture, NetWorker  Comments Off on Dealing with low-bandwidth connection to storage nodes
May 112009

The fantastic thing about NetWorker is that being a three-tier architecture, a datazone may encompass far more than just a single site or datacentre. That is, you can design a system where the NetWorker server is in Sydney, and you have storage nodes in Melbourne, Adelaide, Perth, Brisbane, Darwin and Hobart. The server would be responsible for coordinating all backups and storing/retrieving data from the Sydney datacentre, and each storage node would be responsible for the storage/retrieval of backups local to that datazone.

(Or, to use a non-Australian example, you could have a datazone where your backup server is in London, and you have storage nodes in Paris, New York and Cape Town.)

When a NetWorker datazone encompasses only a single datacentre, there’s usually very little tweaking that needs to be done to the server <-> storage node communications, once they’re established. However, when we start talking about datazones that encompass WANs, we do have to take into account the level of latency between the storage nodes and the backup server.

Luckily, there’s settings within NetWorker to account for this. Specifically, there are three key settings, all maintained within the NetWorker server resource itself. These are:

  • nsrmmd polling interval
  • nsrmmd restart interval
  • nsrmmd control timeout

To view these settings in the NetWorker management console, you first have to turn on diagnostic mode. Then, right click the server (absolute topmost entry in the configuration tree) and choose “Properties”. These settings are maintained in the “Media” pane:

Controlling nsrmmd settings in NMC

Controlling nsrmmd settings in NMC

So, what do each of these settings do?

  • nsrmmd polling interval – This is the number of minutes that elapses between times that the NetWorker master process (nsrd) probes the nsrmmd to determine that it is still running. You could think of it as the heartbeat parameter. By default, this is 3 minutes.
  • nsrmmd restart interval – This is how long, in minutes, NetWorker will wait between restart attempts of an nsrmmd process. By default, this is 2 minutes.
  • nsrmmd control timeout – This is the number of minutes NetWorker waits for storage node requests to be completed. By default, this is 5 minutes.

Note that NetWorker is intelligent about this – the man pages for instance explicitly refers to “remote nsrmmd” in each of the first two options, meaning that we should expect local nsrmmd processes on the backup server itself to be dealt with faster, even if these settings are increased.

All these settings work well for regular-sized LAN-contained datazones. However, they may not be optimal in either of the following two scenarios:

  • Very busy datazones that have a large number of devices, even if they’re in the same LAN;
  • WAN-connected datazones.

In either of these scenarios, if you’re seeing periodic phases where NetWorker goes through restarting nsrmmd processes, particularly if this is happening during backups, then it’s a good idea to try to bump up these values to something more compatible with your environment.

My first recommendation, that works for most sites without any further tweaking, is to double each of the first two settings – i.e., increase nsrmmd polling interval to 6 minutes, increase nsrmmd restart interval to 4 minutes, and increase nsrmmd control timeout from 5 to 7 minutes. (I don’t think it’s usually necessary to double nsrmmd control timeout, because usually the delay in such timeouts are caused by devices, not the bandwidth of the connection, and therefore you don’t need to drastically increase the value.)