Jul 12, 2012
 

As mentioned in my main NetWorker 8 introduction article, one of the biggest architectural enhancements in NetWorker v8 is a complete overhaul of the backup to disk architecture. This isn’t an update to AFTDs, it’s a complete restart – the architecture that was is no longer, and it’s been replaced by something newer, fresher, and more scalable.

In order to understand what’s involved in the changes, we first need to step back and consider how the AFTD architecture works in NetWorker v7.x. For the purposes of my examples, I’m going to consider a theoretical 10TB of disk backup capacity available to a NetWorker server. Now, under v7.x, you’d typically end up with filesystems and AFTDs that look like the following:

AFTD changes - v7.x AFTD

(In the above diagram, and those to follow, a red line/arrow indicates a write session coming into the AFTD, and a green line/arrow indicates a read session coming out of the AFTD.)

That is, you’d slice and dice that theoretical 10TB of disk capacity into a bunch of smaller sized filesystems, with typically one AFTD per filesystem. In 7.x, there are two nsrmmd processes per AFTD – one to handle write operations (the main/real path to the AFTD), and one to handle read operations from the AFTD device – the shadow or _AF_readonly path on the volume.

So AFTDs under this scenario delivered simultaneous backup and recovery operations by a bit of sleight-of-hand; in some ways, NetWorker was tricked into thinking it was dealing with two different volumes. In fact, that’s what led to there being two instances of a saveset in the media database for any saveset created on an AFTD – one for the read/write volume, and one for the read-only volume, with a clone ID of 1 less than the instance on the read/write volume. This didn’t double the storage; the “trick” was largely maintained in the media database, with just a few meta files maintained in the _AF_readonly path of the device.
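
If you want to see this for yourself on a 7.x server, a quick mminfo query against any saveset on an AFTD will show both instances. The output below is purely illustrative – the saveset ID, CloneIDs and read-only volume naming are from a hypothetical lab, and the exact column layout will vary by release:

# mminfo -q "ssid=4244365627" -r "volume,ssid,cloneid"
 volume              ssid        clone id
 AFTD-B-1.RO         4244365627  1341909729
 AFTD-B-1            4244365627  1341909730

Note the read-only instance carrying the lower CloneID – that’s the detail NetWorker’s volume selection logic keys off.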

Despite the improved backup options offered by v7.x AFTDs, there were several challenges introduced that somewhat limited the applicability of AFTDs in larger backup scenarios. Much as I’ve never particularly liked virtual tape libraries (seeing them as a solution to a problem that shouldn’t exist), I found myself typically recommending a VTL for disk backup in NetWorker ahead of AFTDs. The challenges in particular, as I saw them, were:

  • Limits on concurrency between staging, cloning and recovery from AFTDs meant that businesses often struggled to clone and reclaim space non-disruptively. Despite the inherent risks, this led to many decisions not to clone data first, meaning only one copy was ever kept;
  • Because of those limits, disk backup units would need to be sliced into smaller allotments – such as the 5 x 2TB devices cited in the above diagram, so that space reclamation would be for smaller, more discrete chunks of data, but spread across more devices simultaneously;
  • A saveset could never exceed the amount of free space on an AFTD – NetWorker doesn’t support continuing a saveset from one full AFTD to another;
  • Larger savesets would be manually sliced up by the administrator to fit on an AFTD, introducing human error, or would be sent direct to tape, potentially introducing shoe-shining back into the backup configuration.

As a result of this, a lot of AFTD layouts saw them used more as glorified staging units, rather than providing a significant amount of near-line recoverability.

Another, more subtle problem from this architecture was that nsrmmd itself is not a process geared towards a high amount of concurrency; while based on its name (media multiplexor daemon) we know that it’s always been designed to deal with a certain number of concurrent streams for tape based multiplexing, there are limits to how many savesets an nsrmmd process can handle simultaneously before it starts to dip in efficiency. This was never so much an issue with physical tape – as most administrators would agree, using multiplexing above 4 for any physical tape will continue to work, but may result in backups which are much slower to recover from if a full filesystem/saveset recovery is required, rather than a small random chunk of data.

NetWorker administrators who have been using AFTD for a while will equally agree that pointing a large number of savesets at an AFTD doesn’t guarantee high performance – while disk doesn’t shoe-shine, the drop-off in nsrmmd efficiency per saveset would start to be quite noticeable at a saveset concurrency of around 8 if client and network performance were not the bottleneck in the environment.

This further encouraged slicing and dicing disk backup capacity – why allocate 10TB of capacity to a single AFTD if you’d get better concurrency out of 5 x 2TB AFTDs? To minimise the risk of any individual AFTD filling while others still had capacity, you’d configure the AFTDs to each have target sessions of 1 – effectively round-robining the starting of savesets across all the units.

I think that pretty much provides a good enough overview of AFTDs under v7.x that we can now talk about AFTDs in NetWorker v8.

Keeping that 10TB of disk backup capacity in play, under NetWorker v8, you’d optimally end up with the following configuration:

AFTD changes - v8.x AFTD

You’re seeing that right – multiple nsrmmd processes for a single AFTD – and I don’t mean a read/write nsrmmd and a shadow-volume read-only nsrmmd as per v7.x. In fact, the entire concept and implementation of the shadow volume goes away under NetWorker v8. It’s not needed any longer. Huzzah!

Assuming dynamic nsrmmd spawning (yes, nsrmmd processes are now spawned dynamically), and up to certain limits, NetWorker will now spawn one nsrmmd process for an AFTD each time it hits the target sessions setting for the AFTD volume. That raises one immediate change – for a v8.x AFTD configuration, bump up the target sessions for the devices. Otherwise you’ll end up spawning a lot more nsrmmd processes than are appropriate. Based on feedback from EMC, it would seem that the optimum target sessions setting for a consolidated AFTD is 4. Assuming linear growth, this would mean that you’d have the following spawning rate:

  • 1 saveset, 1 x nsrmmd
  • 2 savesets, 1 x nsrmmd
  • 3 savesets, 1 x nsrmmd
  • 4 savesets, 1 x nsrmmd
  • 5 savesets, 2 x nsrmmd
  • 6 savesets, 2 x nsrmmd
  • 7 savesets, 2 x nsrmmd
  • 8 savesets, 2 x nsrmmd
  • 9 savesets, 3 x nsrmmd

Your actual rate may vary of course, depending on cloning, staging and recovery operations also being performed at the same time. Indeed, for the time being at least, NetWorker dedicates a single nsrmmd process to any nsrclone or nsrstage operation that is run (either manually or as a scheduled task), and yes, you can actually recover, stage and clone all at the same time. NetWorker handles staging in that equation by blocking capacity reclamation while a process is reading from the AFTD – this prevents a staging operation from removing a saveset that is needed for recovery or cloning. In this situation, a staging operation will report along the following lines:

[root@tara usr]# nsrstage -b Default -v -m -S 4244365627
Obtaining media database information on server tara.pmdg.lab
80470:nsrstage: Following volumes are needed for cloning
80471:nsrstage:         AFTD-B-1 (Regular)
5874:nsrstage: Automatically copying save sets(s) to other volume(s)
79633:nsrstage: 
Starting migration operation for Regular save sets...
6217:nsrstage: ...from storage node: tara.pmdg.lab
81542:nsrstage: Successfully cloned all requested Regular save sets (with new cloneid)
        4244365627/1341909730
79629:nsrstage: Clones were written to the following volume(s) for Regular save sets:
        800941L4
6359:nsrstage: Deleting the successfully cloned save set 4244365627
Recovering space from volume 15867081 failed with the error 'volume 
mounted on ADV_FILE disk AFTD1 is reading'.
Refer to the NetWorker log for details.
89216:nsrd: volume mounted on ADV_FILE disk AFTD1 is reading

When a space reclamation is subsequently run (either as part of overnight reclamation, or as an explicit execution of nsrim -X), something along the following lines will be logged:

nsrd NSR info Index Notice: nsrim has finished crosschecking the media db 
nsrd NSR info Media Info: No space was recovered from device AFTD2 since there was no saveset eligible for deletion. 
nsrd NSR info Media Info: No space was recovered from device AFTD4 since there was no saveset eligible for deletion. 
nsrd NSR info Media Info: Deleted 87 MB from save set 4244365627 on volume AFTD-B-1 
nsrd NSR info Media Info: Recovered 87 MB by deleting 1 savesets from device AFTD1. 
nsrsnmd NSR warning volume (AFTD-B-1) size is now set to 2868 MB

Thus, you shouldn’t need to worry about concurrency any longer.
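
If you do take the advice above and bump up the target sessions on your AFTDs, that can be done in NMC or from the command line via nsradmin. The following is a minimal sketch only – the device name is from my lab, and you should check the attribute name against your own server’s resource definitions before scripting anything around it:

# nsradmin
nsradmin> . type: NSR device; name: AFTD1
Current query set
nsradmin> update target sessions: 4
Update? y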

Another change introduced to the AFTD architecture is that the device name is now divorced from the device path, which is stored in a different field. For example:

AFTD path/device name separation

In the above example, the AFTD device has a device name of AFTD1, and the path to it is /d/backup1 – or to be more accurate, the first path to it is /d/backup1. I’ll get to the real import of that in a moment.

There is a simpler management benefit: previously if you set up an AFTD and needed to later change the path that it was mounted from, you had to do the following:

  1. Unmount the disk backup unit within NetWorker
  2. Delete the AFTD definition within NetWorker
  3. Create a new AFTD definition pointing to the new location
  4. At the OS, remount the AFTD filesystem at the new location
  5. Mount the AFTD volume from its new location in NetWorker

This was a tedious process, and the notion of “deleting” an AFTD, even though it had no bearing on the actual data stored on it, did not appeal to a lot of administrators.

However, the first path specified to an AFTD in the “Device Access Information” field refers to the mount point on the owner storage node, so under v8, all you’ve got to do in order to relocate the AFTD is:

  1. Unmount the disk backup unit within NetWorker.
  2. At the OS, remount the AFTD filesystem in its new location.
  3. Adjust the first entry in the “Device Access Information” field.
  4. Remount the AFTD volume within NetWorker.

This may not seem like a big change, but it’s both useful and more logical. Obviously another benefit of this is that you no longer have to remember device paths when performing manual nsrmm operations against AFTDs – you just specify the volume name. So your nsrmm commands would go from:

# nsrmm -u -f /d/backup1

to, in the above example:

# nsrmm -u -f AFTD1

The real benefit of this though is what I’ve been alluding to by talking about the first mount point specified. You can specify alternate mount points for the device. However, these aren’t additional volumes on the server – they’re mount points as seen by clients. This allows a client to bypass nsrmmd when writing to a disk backup unit, and instead write via whatever operating system mount mechanism it uses (CIFS or NFS).
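
For example, on a Linux client the AFTD filesystem might simply be NFS mounted from the backup server before the backup runs – the hostnames and export below are hypothetical, and the NFS export/permissions setup is left as an exercise:

# mount -t nfs nox.pmdg.lab:/d/backup1 /d/backup1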

In this configuration, your disk backup environment can start to look like the following:

AFTD changes - v8.x AFTD with NFS

You may be wondering what advantage this offers, given the backup is still written across the network – well, past an initial negotiation with nsrmmd on the storage node/server for the file to write to, the client then handles the rest of the write itself, not bothering nsrmmd again. In other words – so long as the performance of the underlying filesystem can scale, the number of streams you can write to an AFTD can expand beyond the maximum number of nsrmmd processes an AFTD or storage node can handle.

At this point, the “Device Access Information” is used as a multi-line field, with each subsequent line representing an alternate path the AFTD is visible at:

AFTD multimount

So, the backup process will work such that if the pool allows the specified AFTD to be used for backup, and the AFTD volume is visible to a client (with the “Client direct” setting enabled – a new option in the Client resource) on one of the paths specified, then the client will negotiate access to the device through a device-owner nsrmmd process, and go on to write the backup itself.

Note that this isn’t designed for sharing AFTDs between storage nodes – just between a server/storage node and clients.
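
To give a feel for how the pieces hang together, here’s a hedged nsradmin sketch. The resources are from my lab, and while the “device access information” and “client direct” attribute names match what NMC displays, you should verify them (and the exact syntax for listing client-visible paths) against the v8 administration guide before relying on this:

# nsradmin
nsradmin> . type: NSR device; name: AFTD1
Current query set
nsradmin> show device access information
nsradmin> print
                        device access information: /d/backup1;
nsradmin> . type: NSR client; name: nimrod
Current query set
nsradmin> update client direct: Yes
Update? y

Additional client-visible paths simply become further entries in that same “Device Access Information” field, as shown in the multimount screenshot above.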

Also, in case you’re skeptical, if your OS supports gathering statistics from the network-mount mechanism in use, you can fairly readily see whether NetWorker is honouring the direct access option. For instance, in the above configuration, I had a client called ‘nimrod’ mounting the disk backup unit via NFS from the backup server; before the backup started on a new, fresh mount, nfsstat showed no activity. After the backup, nfsstat yielded the following:

[root@nimrod ~]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
834        0          0
Client nfs v3:
null         getattr       setattr      lookup       access       readlink 
0         0% 7          0% 3         0% 4         0% 9         1% 0         0% 
read         write         create       mkdir        symlink      mknod 
0         0% 790       94% 1         0% 0         0% 0         0% 0         0% 
remove       rmdir         rename       link         readdir      readdirplus 
1         0% 0          0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo        pathconf     commit 
10        1% 2          0% 0         0% 6         0%

NetWorker also reports in the savegroup completion report that a client successfully negotiated direct file access (DFA) for a backup:

* nimrod:All savefs nimrod: succeeded.
* nimrod:/boot 86705:save: Successfully established DFA session with adv_file device for save-set ID '1694225141' (nimrod:/boot).
V nimrod: /boot                    level=5,          0 KB 00:00:01      0 files
* nimrod:/boot completed savetime=1341899894
* nimrod:/ 86705:save: Successfully established DFA session with adv_file device for save-set ID '1677447928' (nimrod:/).
V nimrod: /                        level=5,         25 MB 00:01:16    327 files
* nimrod:/ completed savetime=1341899897
* nimrod:index 86705:save: Successfully established DFA session with adv_file device for save-set ID '1660670793' (nox:index:65768fd6-00000004-4fdfcfef-4fdfd013-00041a00-3d2a4f4b).
 nox: index:nimrod                 level=5,        397 KB 00:00:00     25 files
* nimrod:index completed savetime=1341903689

Finally, the server’s daemon.raw also logs that a DFA save has been negotiated:

91787 07/10/2012 05:00:05 PM  1 5 0 1626601168 4832 0 nox nsrmmd NSR notice Save-set ID '1694225141' 
(nimrod:/boot) is using direct file save with adv_file device 'AFTD1'.

The net result of all these architectural changes to AFTDs is a significant improvement over v7.x AFTD handling, performance and efficiency.

(As noted previously though, this doesn’t apply just to AFTDs, for what it’s worth – that client direct functionality also applies to Data Domain Boost devices, which allows a Data Domain system to integrate even more closely into a NetWorker environment. Scaling, scaling, scaling: it’s all about scaling.)

In order to get the full benefit, I believe sites currently using AFTDs will probably go through the most pain; those who have been using VTLs may have a cost involved in the transition, but they’ll be able to move to an optimal architecture almost immediately. Sites with multiple smaller AFTDs, on the other hand, won’t see the full benefits of the new architecture until they redesign their backup to disk environment, increasing the size of their backup volumes. That being said, the change pain will be worth it for the enhancements received.

7 New Year’s Backup Resolutions for Companies

Dec 27, 2011
 

New years resolutions for backup

I’d like to suggest that companies be prepared to make (and keep!) 7 New Year’s resolutions when it comes to the field of backup and recovery:

  1. We will test our backups: If you don’t have a testing regime in place, you don’t have a backup system at all.
  2. We will duplicate our backups: Your backup system should not be a single point of failure. If you’re not cloning, replicating or duplicating your backups in some form, your backup system could be the straw that breaks the camel’s back when a major issue occurs.
  3. We will document our backups: As for testing, if your backup environment is undocumented, it’s not a system. All you’ve got is a collection of backups, which, if the right people are around at the right time and in the right frame of mind, you could get a recovery from it. If you want a backup system in place, you not only have to test your backups, you also have to keep them well documented.
  4. We will train our administrators and operators: It never ceases to amaze me the number of companies that deploy enterprise backup software and then insist that administrators and operators just learn how to use it themselves. While the concept of backup is actually pretty simple (“hey, you, back it up or you’ll lose it!”), the practicality of it can be a little more complex, particularly given that as an environment grows in size, so does the scope and the complexity of a backup system. If you don’t have some form of training (whether it’s internal, by an existing employed expert, or external), you’re at the edge of the event horizon, peering over into the abyss.
  5. We will implement a zero error policy: Again, there’s no such thing as a backup system when there’s no zero error policy. No ifs, no buts, no maybes. If you don’t rigorously implement a zero error policy, you’re flipping a coin every time you do a recovery, regardless of what backup product you use. (To learn more about a zero error policy, check out the trial podcast I did where that was the topic.)
  6. We will appoint a Data Protection Advocate: There’s a lot of data “out there” within a company, not necessarily under central IT control. Someone needs to be thinking about it. That someone should be the Data Protection Advocate (DPA). This person should be tasked with being the somewhat annoying person who is present at every change control meeting, raising her or his hand and saying “But wait, how will this affect our ability to protect our data?” That person should also be someone who wanders around the office(s) looking under desks for those pesky departmental servers and “test” boxes that are deployed, the extra hard drives attached to research machines, etc. If you have multiple offices, you should have a DPA per office. (The role of the DPA is outlined in this post, “What don’t you backup?“)
  7. We will assemble an Information Protection Advisory Council (IPAC): Sitting at an equal tier to the change control board, and reporting directly to the CTO/CIO/CFO, the IPAC will liaise with the DPA(s) and the business to make sure that everyone is across the contingencies that are in place for data protection, and be the “go-to” point for the business when it comes to putting new functions in place. They should be the group that sees a request for a new system or service and collectively liaises with the business and IT to ensure that the information generated by that system/service is protected. (If you want to know more about an IPAC and its role in the business, check out “But where does the DPA fit in?“)

And there you have it – the New Year’s resolutions for your company. You may be surprised – while there’ll be a little effort getting these in place, once they’re there, you’re going to find backup, recovery, and the entire information protection process a lot easier to manage, and a lot more reliable.

Basics – Cloning vs Staging

Mar 21, 2011
 

When people are just starting to get into NetWorker, a common situation is that they get confused about the difference between cloning and staging. (This isn’t helped by the fact that NetWorker can report in-progress staging operations as cloning – a perennial source of annoyance.)

So, what’s the difference?

  • A clone operation is where NetWorker duplicates a saveset. It makes a registered copy of the saveset, and at the conclusion of the operation is aware that it has an additional copy.
  • A stage operation is where NetWorker moves a saveset. It first makes a registered copy of the saveset, and then at the conclusion of the operation removes reference to the instance that it copied from.

Typically when we talk about staging, we talk about moving from a disk (media type ‘FILE’ or media type ‘ADV_FILE’) volume to tape. In such a situation where NetWorker stages from an actual FILE/ADV_FILE volume, it not only removes reference to the original saveset, but it actually removes it from the source volume as well. That’s to be expected – it’s a real disk filesystem that NetWorker is accessing, and removing a saveset is as simple as just running an operating system ‘delete’ command.

While it’s not often done, NetWorker does support staging from tape – but obviously when it’s done reading from the source volume, it can’t then selectively erase chunks of savesets from the source tape. Instead, all it does in that situation is delete, from the media database, the reference to the saveset having been on the source volume.
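
Either way, the command line form is the same; the following is a hedged sketch (the destination pool name and saveset ID are purely illustrative), using the same nsrstage options I’ve shown elsewhere on this blog:

# nsrstage -b "Offsite Tape" -v -m -S 4244365627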

(In case you’re new to NetWorker and intend to now run off and try some staging – make sure, please, before you do, to read the second article ever posted on the NetWorker Blog – “Instantiating Savesets“. It has some very cautionary information about staging operations.)

One final thing – while I said that a clone or stage operation is done against a saveset, it’s not always as simple as that. If you tell NetWorker to clone or stage an individual saveset/saveset instance, that’s exactly what it will do. However, if you tell NetWorker to clone or stage a NetWorker tape volume, it will clone/stage the entire volume, with multiplexing left intact.
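
As a hedged illustration of that distinction – the pool, saveset and volume identifiers below are purely illustrative:

# nsrclone -b "Default Clone" -S 4244365627/1341909730
# nsrclone -b "Default Clone" 800941L4

The first form clones just the nominated saveset instance; the second clones everything on the volume, with the multiplexing left intact.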

Regardless of those caveats though, remember the simple rule – a clone is a copy operation, and a stage is a move operation.

Mar 17, 2011
 

In a previous blog post, I discussed how much I liked the scheduled cloning operations introduced in NetWorker 7.6 SP1. Since then, I’ve had several people comment on it saying that while they’re able to manually start scheduled cloning operations, they’re not able to stop scheduled cloning operations in NMC – regardless of whether they were manually or automatically started.

Now I thought I’d been able to manually stop a scheduled cloning operation via NMC during beta testing, but I may have confused myself with something else, and when I noticed the same issue, it led me to think – can I stop this some other way, maybe from the command line? (For what it’s worth, the inability to stop a scheduled clone from NMC is a known issue, and there’s an EMC request running for it.)

It turns out that, since NMC can’t do it, the command line is how you stop a scheduled cloning operation – and it’s actually fairly simple in the end. To do so, you use jobquery and jobkill.

First, use jobquery to identify the scheduled clone job you want:

# jobquery
jobquery> show name:; job id:; job state:
jobquery> print type: clone job; job state: SESSION ACTIVE:
                      job id: 64002;
                   job state: SESSION ACTIVE;
                        name: clone.linux clones;

Once you’ve got that job ID, all you have to do is quit jobquery, and run:

# jobkill -j jobID

In this case – it would be:

# jobkill -j 64002
Terminating job 64002

That’s it – that’s how you stop a scheduled clone job.

Sep 25, 2010
 

Introduction

I was lucky enough to get to work on the beta testing programme for NetWorker 7.6 SP1. While there are a bunch of new features in NW 7.6 SP1 (with the one most discussed by EMC being the Data Domain integration), I want to talk about three new features that I think are quite important, long term, in core functionality within the product for the average Joe.

These are:

  • Scheduled Cloning
  • AFTD enhancements
  • Checkpoint restarts

Each of these on their own represents significant benefit to the average NetWorker site, and I’d like to spend some time discussing the functionality they bring.

[Edit/Aside: You can find the documentation for NetWorker 7.6 SP1 available in the updated Documentation area on nsrd.info]

Scheduled Cloning

In some senses, cloning has been the bane of the NetWorker administrator’s life. Up until NW 7.6 SP1, NetWorker has had two forms of cloning:

  • Per-group cloning, immediately following completion of backups;
  • Scripted cloning.

A lot of sites use scripted cloning, simply due to the device/media contention frequently caused by per-group cloning. I know this well; since starting to work with NetWorker in 1996, I’ve written countless NetWorker cloning scripts, and am currently the primary programmer for IDATA Tools, which includes what I can only describe as a cloning utility on steroids (‘sslocate’).

Will those scripts and utilities go away with scheduled cloning? Well, I don’t think they’re always going to go away – but I do think they’ll be treated more as utilities than core code for the average site, since scheduled cloning will be able to meet many of the cloning requirements companies have.

I had heard that scheduled cloning was on the way long before the 7.6 SP1 beta, thanks mainly to one day getting a cryptic email along the lines of “if we were to do scheduled cloning, what would you like to see in it…” – so it was pleasing, when it arrived, to see that much of my wish list had made it in there. As a first-round implementation of the process, it’s fabulous.

So, let’s look at how we configure scheduled clones. First off, in NMC, you’ll notice a new menu item in the configuration section:

Scheduled Cloning Resource, in menu

This will undoubtedly bring joy to the heart of many a NetWorker administrator. If we then choose to create a new scheduled clone resource, we can create a highly refined schedule:

Scheduled Clone Resource, 1 of 2

Let’s look at those options first before moving onto the second tab:

  • Name and comment is pretty self explanatory – nothing to see there.
  • Browse and retention – define, for the clone schedule, both the browse and retention time of the savesets that will be cloned.
  • Start time – Specify exactly what time the cloning is to start.
  • Schedule period – Weekly allows you to specify which days of the week the cloning is to run. Monthly allows you to specify which dates of the month the cloning will run.
  • Storage node – Allows you to specify which storage node the clone will be written to. Great for situations where you have, say, an off-site storage node and you want the data streamed directly across to it.
  • Clone pool – Which pool you want to write the clones to – fairly expected.
  • Continue on save set error – This is a big help. Standard scripting of cloning will fail if one of the savesets selected to clone has an error (regardless of whether that’s a read error, or it disappears (e.g., is staged off) before it gets cloned, etc.) and you haven’t used the ‘-F’ option. Click this check box and the cloning will at least continue and finish all savesets it can hit in one session.
  • Limit number of save set clones – By default this is 1, meaning NetWorker won’t create more than one copy of the saveset in the target pool. This can be increased to a higher number if you want multiple clones, or it can be set to zero (for unlimited), which I wouldn’t see many sites having a need for.

Once you’ve got the basics of how often and when the scheduled clone runs, etc., you can move on to selecting what you want cloned:

Scheduled Clone Resource, 2 of 2

I’ve only just refreshed my lab server, so a bit of imagination is required with the above screen shot to flesh out what this may look like in a normal site. But, you can choose savesets to clone based on:

  • Group
  • Client
  • Source Pool
  • Level
  • Name

or

  • Specific saveset ID/clone IDs

When specifying savesets based on group/client/level/etc., you can also specify how far back NetWorker is to look for savesets to clone. This avoids a situation whereby you might, say, enable scheduled cloning and suddenly have media from 3 years ago requested.

You might wonder about the practicality of being able to schedule a clone for specific SSID/CloneID combos. I can imagine this would be particularly useful if you need to do ad-hoc cloning of a particularly large saveset. E.g., if you’ve got a saveset that’s say, 10TB, you might want to configure a schedule that would start specifically cloning this at 1am Saturday morning, with your intent being to then delete the scheduled clone after it’s finished. In other words, it’s to replace having to do a scheduled cron or at job just for a single clone.
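
If you do go down that path, the SSID/CloneID pairs to plug into the schedule can be pulled out with mminfo first. The query below is a hedged sketch with a hypothetical client and saveset name:

# mminfo -q "client=bigdb.pmdg.lab,name=/oracle" -r "ssid,cloneid,sumsize,savetime"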

Once configured, and enabled, scheduled cloning runs like a dream. In fact, it was one of the first things I tackled while doing beta testing, and almost every subsequent day found myself thinking at 3pm “why is my lab server suddenly cloning? – ah yes, that’s why…”

AFTD Enhancements

There’s not a huge amount to cover in terms of AFTD enhancements – they’re effectively the same enhancements that were rolled into NetWorker 7.5 SP3, which I’ve previously covered here. So, that means there are better volume selection criteria for AFTD backups, but we don’t yet have backups being able to continue from one AFTD device to another. (That’s still in the pipeline and being actively worked on, so it will come.)

Even this one key change – the way in which volumes are picked in AFTDs for new backups – will be a big boon for a lot of NetWorker sites. It will allow administrators to not focus so much on the “AFTD data shuffle”, as I like to consider it, and instead focus on higher level administration of the backup environment.

(These changes are effectively “under the hood”, so there’s not much I can show in the way of screen-shots.)

Checkpoint Restarting

When I first learned NetBackup, I immediately saw the usefulness of checkpoint restarting, and have been eager to see it appear in NetWorker since that point. I’m glad to say it’s appeared in (what I consider to be) a much more usable form. So what is checkpoint restarting? If you’re not familiar with the term, it’s where the backup product has regular points at which it can restart from, rather than having to restart an entire backup. Previously NetWorker has only done this at the saveset level, but that’s not really what the average administrator would think of when ‘checkpoints’ are discussed. NetBackup, last I looked at it, does this at periodic intervals – e.g., every 15 minutes or so.

Like almost everything in NetWorker, we get more than one way to run a checkpoint:

Checkpoint restart options

Within any single client instance you can choose to enable checkpoint restarting, with the restart options being:

  • Directory – If a backup failure occurs, restart from the last directory that NetWorker had started processing.
  • File – If a backup failure occurs, restart from the last file NetWorker had started processing.
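
The same options surface as attributes on the client resource, so they can presumably also be set via nsradmin. The sketch below is from memory of the 7.6 SP1 resource definitions – the client name is hypothetical, and you should confirm the attribute names on your own install before scripting against them:

# nsradmin
nsradmin> . type: NSR client; name: bigfiler.pmdg.lab
Current query set
nsradmin> update checkpoint enabled: Yes; checkpoint granularity: Directory
Update? y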

Now, the documentation warns that with checkpoint enabled, you’ll get a slight performance hit on the backup process. However, that performance hit is nothing compared to the performance and potentially media hit you’d take if you’re 9.8TB through a 10TB saveset and the client is accidentally rebooted!

Furthermore, in my testing (which admittedly focused on savesets smaller than 10GB), I invariably found that with either file or directory level checkpointing enabled, the backup actually ran faster than the normal backup. So maybe it’s also based on the hardware you’re running on, or maybe that performance hit doesn’t come in until you’re backing up millions of files – but either way, I’m not prepared to say it’s going to be a huge performance hit for anyone yet.

Note what I said earlier though – this can be configured on a single client instance. This lets you configure checkpoint restarting even on the individual client level to suit the data structure. For example, let’s consider a fileserver that offers both departmental/home directories, and research areas:

  • The departmental/home directories will have thousands and thousands of directories – have a client instance for this area, set for directory level checkpointing.
  • The research area might feature files that are hundreds of gigabytes each – use file level checkpointing here.

When I’d previously done a blog entry wishing for checkpoint restarts (“Sub saveset checkpointing would be good“), I’d envisaged the checkpointing being done via the continuation savesets – e.g., “C:”, “<1>C:”, “<2>C:”, etc. It hasn’t been implemented this way; instead, each time the saveset is restarted, a new saveset is generated of the same level, catering to whatever got backed up during that saveset. On reflection, I’m not the slightest bit hung up over how it’s been implemented, I’m just overjoyed to see that it has been implemented.

Now you’re probably wondering – does the checkpointing work? Does it create any headaches for recoveries? Yes, and no. As in, yes it works, and no, throughout all my testing, I wasn’t able to create any headaches in the recovery process. I would feel very safe about having checkpoint restarts enabled on production filesystems right now.

Bonus: Mac OS X 10.6 (“Snow Leopard”) Support

For some time, NetWorker has had some issues supporting Mac OS X 10.6, and it’s certainly caused some problems for various customers as this operating system continues to get market share in the enterprise. I was particularly pleased during a beta refresh to see an updated Mac OS X client for NetWorker. This works excellently for backup, recovery, installation, uninstallation, etc. I’d suggest on the basis of the testing I did that any site with OS X should immediately upgrade those clients at least to 7.6 SP1.

In Summary

The only major glaring question for me, looking at NetWorker 7.6 SP1, is the obvious one: this has so many updates and so many new features – way more than we’d see in a normal service pack – why the heck isn’t it NetWorker v8?

Mar 03, 2010
 

While I touched on this in the second blog posting I made (Instantiating Savesets), it’s worthwhile revisiting this topic more directly.

Using ADV_FILE devices can play havoc with conventional tape rotation strategies; if you aren’t aware of these implications, it could cause operational challenges when it comes time to do recovery from tape. Let’s look at the lifecycle of a saveset in a disk backup environment where a conventional setup is used. It typically runs like this:

  1. Backup to disk
  2. Clone to tape
  3. (Later) Stage to tape
  4. (At rest) 2 copies on tape

Looking at each stage of this, we have:

Saveset on ADV_FILE device

The saveset, once written to an ADV_FILE volume, has two instances. The instance recorded as being on the read-only part of the volume will have an SSID/CloneID of X/Y. The instance recorded as being on the read-write part of the volume will have an SSID/CloneID of X/Y+1. Because NetWorker prefers the instance with the lowest CloneID, a recovery request will seek the “instance” on the read-only volume. Of course, there’s only one actual instance (hence why I object so strongly to the ‘validcopies’ field introduced in 7.6 reporting 2) – the two instances reported are “smoke and mirrors” to allow simultaneous backup to and recovery from an ADV_FILE volume.

The next stage sees the saveset cloned:

ADV_FILE + Tape Clone

This leaves us with 3 ‘instances’ – 2 physical, one virtual. Our SSID/CloneIDs are:

  • ADV_FILE read-only: X/Y
  • ADV_FILE read-write: X/Y+1
  • Tape: X/Y+n, where n > 1.

At this point, any recovery request will still call for the instance on the read-only part of the ADV_FILE volume, so as to help ensure the fastest recovery initiation.

At some future point, as disk capacity starts to run out on the ADV_FILE device, the saveset will typically be staged out:

ADV_FILE staging to tape

At the conclusion of the staging operation, the physical + virtual instances of the saveset on the ADV_FILE device are removed, leaving us with:

Savesets on tape only

So, at this point, we end up with:

  • A saveset instance on a clone volume with an SSID/CloneID of X/Y+n.
  • A saveset instance on (typically) a non-clone volume with an SSID/CloneID of X/Y+n+m, where m > 0.

So, where does this leave us? (Or if you’re not sure where I’ve been heading yet, you may be wondering what point I’m actually trying to make.)

Note what I’ve been saying each time – NetWorker, when it needs to read from a saveset for recovery purposes, will want to pick the saveset instance with the lowest CloneID. At the point where we’ve got a clone copy and a staged copy, both on tape, the clone copy will have the lowest CloneID.

The net result is that NetWorker will, in these circumstances, when both tapes aren’t online, request the clone volume for recovery – even though in a large number of cases, this will be the volume that’s offsite.
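
You can see which instance NetWorker will chase by listing the instances and their CloneIDs – the saveset ID below is illustrative:

# mminfo -q "ssid=4244365627" -r "volume,ssid,cloneid"

The instance with the lowest CloneID is the one NetWorker will request.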

For NetWorker versions 7.3.1 and lower, there was only one solution to this – you had to hunt down the actual clone saveset instances NetWorker was asking for, mark them as suspect, and reattempt the recovery. If you managed to mark them all as suspect, then you’d be able to ‘force’ NetWorker into facilitating the recovery from the volume(s) that had been staged to. However, after the recovery you had to make sure you backed out of those changes, so that both the clones and the staged copies would be considered not-suspect.

Some companies, in this situation, would instigate a tape rotation policy such that clone volumes would be brought back from off-site before savesets were likely to be staged out, with subsequently staged media sent offsite. This has a dangerous side-effect of temporarily leaving all copies of backups on-site, jeopardising disaster recovery situations, and hence it’s something that I couldn’t in any way recommend.

The solution introduced around 7.3.2, however, is far simpler – a media database flag called offsite. This isn’t to be confused with the convention of setting a volume’s location field to ‘offsite’ when the media is removed from site. Annoyingly, this flag remains unqueryable; you can set it, and NetWorker will use it, but you can’t, say, search for volumes with the ‘offsite’ flag set.

The offsite flag has to be manually set, using the command:

# nsrmm -o offsite volumeName

(where volumeName typically equals the barcode).

Once this is set, NetWorker’s standard saveset (and therefore volume) selection logic is subtly adjusted. Normally, if there are no online instances of a saveset, NetWorker will request the instance with the lowest CloneID. However, saveset instances that are on volumes with the offsite flag set will be deemed ineligible, and NetWorker will look for a saveset instance that isn’t flagged as being offsite.

The net result is that when following a traditional backup model with ADV_FILE disk backup (backup to disk, clone to tape, stage to tape), it’s very important that tape offsiting procedures be adjusted to set the offsite flag on clone volumes as they’re removed from the system.

The good news is that you don’t normally have to do anything when it’s time to pull the tape back onsite. The flag is automatically cleared* for a volume as soon as it’s put back into an autochanger and detected by NetWorker. So when the media is recycled, the flag will be cleared.

If you come from a long-term NetWorker site and the convention is still to mark savesets as suspect in this sort of recovery scenario, I’d suggest that you update your tape rotation policies to instead use the offsite flag. If, on the other hand, you’re about to implement an ADV_FILE based backup to disk policy, I’d strongly recommend you plan in advance to configure a tape rotation policy that uses the offsite flag as cloned media is sent away from the primary site.


* If you did need to explicitly clear the flag, you can run:

# nsrmm -o notoffsite volumeName

Which would turn the flag back off for the given volumeName.

Feb 22, 2010
     

The scenario:

  • A clone or stage operation has aborted (or otherwise failed)
  • It has been restarted
  • It hangs waiting for a new volume even though there’s a partially written volume available.

This is a relatively easy problem to explain. Let’s first look at the log messages that appear. To generate this error, I started cloning some data to the “Default Clone” pool, with only one volume in the pool, then aborted. Shortly thereafter I tried to run the clone again, and when NetWorker wouldn’t write to the volume I unmounted and remounted it – a common thing that newer administrators will try in this scenario. This is where you’ll hit the following error in the logs:

media notice: Volume `800829L4' ineligible for this operation; Need a different volume
from pool `Default Clone'
media info: Suggest manually labeling a new writable volume for pool 'Default Clone'

So, what’s the cause of this problem? It’s actually relatively easy to explain.

A core component of NetWorker’s media database design is that a saveset can only ever have one instance on a piece of media. This applies equally to failed and complete saveset instances.

The net result is that this error/situation occurs because it’s meant to – NetWorker doesn’t permit more than one instance of a saveset to appear on the same piece of physical media.

So what do you do when this error comes up?

  • If you’re backing up to disk, an aborted saveset should normally be cleared up automatically by NetWorker after the operation is aborted. However, in certain instances this may not be the case. For NetWorker 7.5 vanilla and 7.5.1.1/7.5.1.2, this should be done by expiring the saveset instance – using nsrmm to flag the instance as having an expiry date within a few minutes or seconds. For all other versions of NetWorker, you should just be able to delete the saveset instance (see the sketch after this list).
  • When working with tape (virtual or physical), the approach I’d most recommend would be to move on to another tape, or – if the instance is the only instance on that tape – to relabel the tape. (Some would argue that you can use nsrmm to delete the saveset instance from the tape and then re-attempt the operation, but since NetWorker is so heavily designed to prevent multiple instances of a saveset on a piece of media, I’d strongly recommend against this.)
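
For the disk case (on versions where deletion is supported), clearing the aborted instance is a one-liner; the volume and SSID/CloneID below are purely illustrative, and you can pull the real values from mminfo first:

# mminfo -q "volume=AFTD-B-1" -avot
# nsrmm -d -S 4244365627/1341909730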

Overall it’s a fairly simple issue, but knowing how to recognise it lets you resolve it quickly and painlessly.

Oct 27, 2009
     

NetWorker has an irritating quirk where it doesn’t allow you to clone or stage incomplete savesets. I can understand the rationale behind it – the data isn’t completely usable – but that rationale is wrong.

If you don’t think this is the case, all you have to do to test it is start a backup, cancel it mid-way through a saveset, then attempt to clone that saveset. Here’s an example:

[root@tara ~]# save -b Big -q -LL /usr
Oct 25 13:07:15 tara logger: NetWorker media: (waiting) Waiting for 1
writable volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
<backup running, CTRL-C pressed>
(interrupted), exiting
[root@tara ~]# mminfo -q "volume=BIG995S3"
 volume        client       date      size   level  name
BIG995S3       tara.pmdg.lab 10/25/2009 175 MB manual /usr
[root@tara ~]# mminfo -q "volume=BIG995S3" -avot
 volume        client           date     time         size ssid      fl   lvl name
BIG995S3       tara.pmdg.lab 10/25/2009 01:07:15 PM 175 MB 14922466  ca manual /usr
[root@tara ~]# nsrclone -b Default -S 14922466
5876:nsrclone: skipping aborted save set 14922466
5813:nsrclone: no complete save sets to clone

Now, you may be wondering why I’m hung up on not being able to clone or stage this sort of data. The answer is simple: sometimes the only backup you have is a broken backup. You shouldn’t be punished for this!

Overall, NetWorker has a fairly glowing pedigree in terms of enforced data viability:

  • It doesn’t recycle savesets until all dependent savesets are also recyclable;
  • It’s damn aggressive at making sure you have current backups of the backup server’s bootstrap information;
  • If there’s any index issue it’ll end up forcing a full backup for savesets even if it’s backed them up before;
  • It won’t overwrite data on recovery unless you explicitly tell it to;
  • It lets you recover from incomplete savesets via scanner/uasm (see the sketch after this list)!

and so on.
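
On that last scanner/uasm point, the following is a rough sketch from memory of pulling an incomplete saveset straight off a volume – the device and relocation path are hypothetical, and you should confirm the exact options against the scanner and uasm man pages for your release before relying on it:

# scanner -S 14922466 /dev/nst0 -x uasm -rv -m /usr=/tmp/recovered

(SSID 14922466 is the aborted /usr saveset from the example above.)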

So, logically, there is little sense in refusing to clone/stage incomplete savesets.

There may be programmatic reasons why NetWorker doesn’t permit cloning/staging incomplete savesets, but these aren’t sufficient reasons. NetWorker’s pedigree of extreme focus on recoverability remains tarnished by this inability.

You can’t escape cloning with cross-site backups

Sep 09, 2009
     

Periodically someone will tell me that they don’t need to clone because they run cross-site backups. I.e., they’ll have an architecture such as the following:

Sample cross site backup configuration

Looking at a tape-only environment for simplicity, this configuration sees the backup media immediately off-sited by virtue of the tape library being physically located in another site.

The fallacious assumption made by some companies is that by running off-site backups, they don’t need to clone – their backups are off-site as soon as they’re generated, after all. This is incorrect, and can be readily shown when we evaluate a “site destroyed” scenario.

If the disaster recovery site is destroyed:

  • All historical backup information has been lost. We have only the “current” data stored on site.

If the production site is destroyed:

  • We are left with only one copy of our data, and can easily encounter a catastrophic single point of failure within the backup environment.

Therefore, regardless of whether you run cross-site backups, if you want full data protection, you still need to clone, so that:

  • If the production site is destroyed, you don’t have to rely on single copies of backups.
  • If the disaster recovery site is destroyed, you still have access to historical backups.

Please, don’t make the mistake of thinking that cross-site backups are sufficient justification to avoid cloning.