When people are just starting to get into NetWorker, a common situation is that they get confused about the difference between cloning and staging. (This isn’t helped given NetWorker can report in-progress staging operations as cloning – a perennial source of annoyment.)

So, what’s the difference?

  • A clone operation is where NetWorker duplicates a saveset. It makes a registered copy of the saveset, and at the conclusion of the operation is aware that it has an additional copy.
  • A stage operation is where NetWorker moves a saveset. It first makes a registered copy of the saveset, and then at the conclusion of the operation removes reference to the instance that it copied from.

Typically when we talk about staging, we talk about moving from a disk (media type ‘FILE’ or media type ‘ADV_FILE’) volume to tape. In such a situation where NetWorker stages from an actual FILE/ADV_FILE volume, it not only removes reference to the original saveset, but it actually removes it from the source volume as well. That’s to be expected – it’s a real disk filesystem that NetWorker is accessing, and removing a saveset is as simple as just running an operating system ‘delete’ command.

While it’s not often done, NetWorker does support staging from tape – but obviously when it’s done reading from the source volume, it can’t then selectively erase chunks of savesets from the source tape. Instead, all it does in that situation is delete, from the media database, the reference to the saveset having been on the source volume.

(In case you’re new to NetWorker and intend to now run off and try some staging – make sure, please, before you do, to read the second article ever posted on the NetWorker Blog – “Instantiating Savesets“. It has some very cautionary information about staging operations.)

One final thing – while I said that a clone or stage operation is done against a saveset, it’s not always as simple as that. If you tell NetWorker to clone or stage an individual saveset/saveset instance, that’s exactly what it will do. However, if you tell NetWorker to clone or stage a NetWorker tape volume, it will clone/stage the entire volume, with multiplexing left intact.

Regardless of those caveats though, remember the simple rule – a clone is a copy operation, and a stage is a move operation.

 

While I touched on this in the second blog posting I made (Instantiating Savesets), it’s worthwhile revisiting this topic more directly.

Using ADV_FILE devices can play havoc with conventional tape rotation strategies; if you aren’t aware of these implications, it could cause operational challenges when it comes time to do recovery from tape. Let’s look at the lifecycle of a saveset in a disk backup environment where a conventional setup is used. It typically runs like this:

  1. Backup to disk
  2. Clone to tape
  3. (Later) Stage to tape
  4. (At rest) 2 copies on tape

    Looking at each stage of this, we have:

    Saveset on ADV_FILE deviceThe saveset, once written to an ADV_FILE volume, has two instances. The instance recorded as being on the read-read only part of the volume will have an SSID/CloneID of X/Y. The instance recorded as being on the read-write part of the volume will have an SSID/CloneID of X/Y+1. This higher CloneID is what causes NetWorker, upon a recovery request, to seek the “instance” on the read-only volume. Of course, there’s only one actual instance (hence why I object so strongly to the ‘validcopies’ field introduced in 7.6 reporting 2) – the two instances reported are “smoke and mirrors” to allow simultaneous backup to and recovery from an ADV_FILE volume.

    The next stage sees the saveset cloned:

    ADV_FILE + Tape CloneThis leaves us with 3 ‘instances’ – 2 physical, one virtual. Our SSID/CloneIDs are:

    • ADV_FILE read-only: X/Y
    • ADV_FILE read-write: X/Y+1
    • Tape: X/Y+n, where n > 1.

    At this point, any recovery request will still call for the instance on the read-only part of the ADV_FILE volume, so as to help ensure the fastest recovery initiation.

    At some future point, as disk capacity starts to run out on the ADV_FILE device, the saveset will typically be staged out:

    ADV_FILE staging to tapeAt the conclusion of the staging operation, the physical + virtual instances of the saveset on the ADV_FILE device are removed, leaving us with:

    Savesets on tape only

    So, at this point, we end up with:

    • A saveset instance on a clone volume with SSID/CloneID of: X/Y+n.
    • A saveset instance on (typically) a non-clone volume with SSID/CloneID of: X/Y+n+m, where m > 0.

    So, where does this leave us? (Or if you’re not sure where I’ve been heading yet, you may be wondering what point I’m actually trying to make.)

    Note what I’ve been saying each time – NetWorker, when it needs to read from a saveset for recovery purposes, will want to pick the saveset instance with the lowest CloneID. At the point where we’ve got a clone copy and a staged copy, both on tape, the clone copy will have the lowest CloneID.

    The net result is that NetWorker will, in these circumstances, when both tapes aren’t online, request the clone volume for recovery – even though in an extreme number of cases, this will be the volume that’s offsite.

    For NetWorker versions 7.3.1 and lower, there was only one solution to this – you had to hunt down the actual clone saveset instances NetWorker was asking for, mark them as suspect, and reattempt the recovery. If you managed to mark them all as suspect, then you’d be able to ‘force’ NetWorker into facilitating the recovery from the volume(s) that had been staged to. However, after the recovery you had to make sure you backed out of those changes, so that both the clones and the staged copies would be considered not-suspect.

    Some companies, in this situation, would instigate a tape rotation policy such that clone volumes would be brought back from off-site before savesets were likely to be staged out, with subsequently staged media sent offsite. This has a dangerous side-effect of temporarily leaving all copies of backups on-site, jeapordising disaster recovery situations, and hence it’s something that I couldn’t in any way recommend.

    The solution introduced around 7.3.2 however is far simpler – a mminfo flag called offsite. This isn’t to be confused with the convention of setting a volume location field to ‘offsite’ when the media is removed from site. Annoyingly, this remains unqueryable; you can set it, and NetWorker will use it, but you can’t say, search for volumes with the ‘offsite’ flag set.

    The offsite flag has to be manually set, using the command:

    # nsrmm -o offsite volumeName

    (where volumeName typically equals the barcode).

    Once this is set, then NetWorker’s standard saveset (and therefore volume) selection criteria is subtly adjusted. Normally if there are no online instances of a saveset, NetWorker will request the saveset with the lowest CloneID. However, saveset instances that are on volumes with the offsite flag set will be deemed ineligible and NetWorker will look for a saveset instance that isn’t flagged as being offsite.

    The net result is that when following a traditional backup model with ADV_FILE disk backup (backup to disk, clone to tape, stage to tape), it’s very important that tape offsiting procedures be adjusted to set the offsite flag on clone volumes as they’re removed from the system.

    The good news is that you don’t normally have to do anything when it’s time to pull the tape back onsite. The flag is automatically cleared* for a volume as soon as it’s put back into an autochanger and detected by NetWorker. So when the media is recycled, the flag will be cleared.

    If you come from a long-term NetWorker site and the convention is still to mark savesets as suspect in this sort of recovery scenario, I’d suggest that you update your tape rotation policies to instead use the offsite flag. If on the other hand, you’re about to implement an ADV_FILE based backup to disk policy, I’d strongly recommend you plan in advance to configure a tape rotation policy that uses the offsite flag as cloned media is sent away from the primary site.


    * If you did need to explicitly clear the flag, you can run:

    # nsrmm -o notoffsite volumeName

    Which would turn the flag back off for the given volumeName.

     

    The scenario:

    • A clone or stage operation has aborted (or otherwise failed)
    • It has been restarted
    • It hangs waiting for a new volume even though there’s a partially written volume available.

    This is a relatively easy problem to explain. Let’s first look at the log messages that happens. To generate this error, I started cloning some data to the “Default Clone” pool, with only one volume in the pool, then aborted. Shortly thereafter I tried to run the clone again, and when NetWorker wouldn’t write to the volume I unmounted and remounted it – a common thing that newer administrators will try in this scenario. This is where you’ll hit the following error in the logs:

    media notice: Volume `800829L4' ineligible for this operation; Need a different volume
    from pool `Default Clone'
    media info: Suggest manually labeling a new writable volume for pool 'Default Clone'

    So, what’s the cause of this problem? It’s actually relatively easy to explain.

    A core component in NetWorker’s media database design is that a saveset can only ever have one instance on a piece of media. This applies as equally to failed as complete saveset instances.

    The net result is that this error/situation will occur because it’s meant to – NetWorker doesn’t permit more than one instance of a saveset to appear on the same piece of physical media.

    So what do you do when this error comes up?

    • If you’re backing up to disk, an aborted saveset should normally be cleared up automatically by NetWorker after the operation is aborted. However, in certain instances this may not be the case. For NetWorker 7.5 vanilla and 7.5.1.1/7.5.1.2, this should be done by expiring the saveset instance – using nsrmm to flag the instance as having an expiry date within a few minutes or seconds. For all other versions of NetWorker, you should just be able to delete the saveset instance.
    • When working with tape (virtual or physical), the most recommended approach would be to move on to another tape, or if the instance is the only instance on that tape, relabel the tape. (Some would argue that you can use nsrmm to delete the saveset instance from the tape and then re-attempt the operation, but since NetWorker is so heavily designed to prevent multiple instances of a saveset on a piece of media, I’d strongly recommend against this.)

    Overall it’s a fairly simple issue, but knowing how to recognise it lets you resolve it quickly and painlessly.

     

    NetWorker has an irritating quirk where it doesn’t allow you to clone or stage incomplete savesets. I can understand the rationale behind it – it’s not completely usable data, but that rationale is wrong.

    If you don’t think this is the case, all you have to do to test is start a backup, cancel it mid-way through a saveset, then attempt to clone that saveset. Here’s an example:

    [root@tara ~]# save -b Big -q -LL /usr
    Oct 25 13:07:15 tara logger: NetWorker media: (waiting) Waiting for 1
    writable volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
    <backup running, CTRL-C pressed>
    (interrupted), exiting
    [root@tara ~]# mminfo -q "volume=BIG995S3"
     volume        client       date      size   level  name
    BIG995S3       tara.pmdg.lab 10/25/2009 175 MB manual /usr
    [root@tara ~]# mminfo -q "volume=BIG995S3" -avot
     volume        client           date     time         size ssid      fl   lvl name
    BIG995S3       tara.pmdg.lab 10/25/2009 01:07:15 PM 175 MB 14922466  ca manual /usr
    [root@tara ~]# nsrclone -b Default -S 14922466
    5876:nsrclone: skipping aborted save set 14922466
    5813:nsrclone: no complete save sets to clone

    Now, you may be wondering why I’m hung up on not being able to clone or stage this sort of data. The answer is simple: sometimes the only backup you have is a broken backup. You shouldn’t be punished for this!

    Overall, NetWorker has a fairly glowing pedigree in terms of enforced data viability:

    • It doesn’t recycle savesets until all dependent savesets are also recyclable;
    • It’s damn aggressive at making sure you have current backups of the backup server’s bootstrap information;
    • If there’s any index issue it’ll end up forcing a full backup for savesets even if it’s backed them up before;
    • It won’t overwrite data on recovery unless you explicitly tell it to;
    • It lets you recover from incomplete savesets via scanner/uasm!

    and so on.

    So, logically, there makes little sense in refusing to clone/stage incomplete savesets.

    There may be programmatic reasons why NetWorker doesn’t permit cloning/staging incomplete savesets, but these aren’t sufficient reasons. NetWorker’s pedigree of extreme focus on recoverability remains tarnished by this inability.

    © 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha