Normally you don’t want to be in this position, but sometimes you’ll strike a situation where the only possible location of data that you need to get back is in a saveset that aborted (i.e., failed) during the backup process. Now, if the saveset/media is almost completely hosed, you’re probably going to need to recover using the scanner|uasm process, but if it was just a case of a failed backup, you can direct a partial saveset recovery using the recover command.

When you’re at this point, the first thing you need to do is find the saveset ID of the aborted saveset, but I’ll leave that as an exercise for the reader. Once you’ve got the aborted saveset ID, it’s as simple as running a saveset recovery. The basic command might look like this:

C:\> recover -d path -s buServer -iN -S ssid

Where:

  • ‘path’ is the path that you want to recover to. Note that in these situations, it’s usually a very, very good idea to make sure you recover to somewhere new, rather than overwriting any existing files.
  • ‘buServer’ is the backup server that you want to recover from.
  • ‘ssid’ is the saveset ID for the aborted saveset that you want to recover from.

Depending on whether you’re doing a directed recovery, etc., you may end up with a few additional arguments, but the above is fairly much what you need in this situation. (If you’re confident that a specific path or file you want back is going to be in the part of the saveset backed up, you can always add that path at the end of the recovery command, too.)
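
If you script this sort of recovery, the command above can be assembled programmatically. The following is a minimal sketch only: the function name is hypothetical, the destination, server and saveset ID values are placeholders, and it uses nothing beyond the options already shown above (-d, -s, -iN and -S).

```python
def build_recover_command(dest_path, server, ssid, paths=()):
    """Build the argument list for a saveset recovery by saveset ID.

    Only the options shown in the command above are used:
    -d (destination), -s (server), -iN and -S (saveset ID).
    """
    cmd = ["recover", "-d", dest_path, "-s", server, "-iN", "-S", str(ssid)]
    cmd.extend(paths)  # optional specific paths/files to pull from the saveset
    return cmd


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(build_recover_command("/restore/aborted", "buServer", 1234567890)))
```

Returning a list rather than a single string means it can be handed straight to something like subprocess.run without worrying about quoting paths that contain spaces.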

Once the recovery runs, you’ll get a standard file-by-file listing of what is being recovered, but the recovery will end with what looks like an error. In effect, though, it’s just a notification that NetWorker has hit the data that was ‘in transit’, so to speak, when the saveset was aborted. The error will look similar to the following:

5041:recover: Unable to read checksum from save stream

16294:recover: Encountered an error recovering C:\temp2\Temp\744\win_x86\networkr\hba\emc-homebase-agent-6.1.2-win-x86.exe

53363:recover: Recover of rsid 851692923 failed: Error receiving files from NSR server `tara'

The process cannot access the file because it is being used by another process.

Received 231 matching file(s) from NSR server `tara'

Recover errors with 1 file(s)

Recover completion time: 4/20/2010 3:41:12 PM

At that point, you know that you’ve got back all the data you’re going to get back, and you can search through the recovered files for the data you want.
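
If you capture the recover output to a file, a few lines of Python can pull out the summary figures. This is a sketch that assumes the output looks like the sample above; the function name is made up for illustration, and other NetWorker versions may word these messages differently.

```python
import re

# Abbreviated copy of the sample output shown above.
SAMPLE = r"""
5041:recover: Unable to read checksum from save stream
16294:recover: Encountered an error recovering C:\temp2\Temp\744\win_x86\networkr\hba\emc-homebase-agent-6.1.2-win-x86.exe
53363:recover: Recover of rsid 851692923 failed: Error receiving files from NSR server `tara'
Received 231 matching file(s) from NSR server `tara'
Recover errors with 1 file(s)
"""


def summarise_recover_output(text):
    """Extract the received/error counts and the files that hit an error."""
    received = re.search(r"Received (\d+) matching file\(s\)", text)
    errors = re.search(r"Recover errors with (\d+) file\(s\)", text)
    failed = re.findall(r"Encountered an error recovering (.+)", text)
    return {
        "received": int(received.group(1)) if received else None,
        "errors": int(errors.group(1)) if errors else 0,
        "failed_files": [name.strip() for name in failed],
    }


if __name__ == "__main__":
    print(summarise_recover_output(SAMPLE))
```

The file named in the "Encountered an error" line is the one that was in transit when the backup aborted, so it is the one to treat as incomplete.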

 

For quite a while I worked under the assumption that you could do the following with directives in NetWorker:

<< /path >>
+skip: *.mp3
<< /path/subpath/criticalpath >>
forget

The logic of this is that it should be possible to skip files in one directory, but forget that directive in a lower directory, and thus still be able to back up files matching particular criteria in a subpath.

Recent discussions on the NetWorker mailing list left me questioning whether I was correct in my assumption. I thought I’d tested it long ago, but the discussions on the list (and the tests that I did) seemed to indicate this wasn’t the way NetWorker worked.

It turns out I was testing incorrectly. Instead of testing with an exact specification such as the above, I was testing “lazily”:

<< /path >>
+skip: *
<< /path/subpath/criticalpath >>
forget

The mistake that I made was in using “*” rather than “*.mp3”: I should have been testing my actual use case. In short:

  • Obviously skipping “*” will result in NetWorker determining that everything is being skipped, at which point there is no need to continue to traverse any directory path beneath the point in which the “skip *” is encountered.
  • However, if just skipping a particular pattern, then NetWorker will have to continue to traverse all subdirectories from the path it encounters the skip command, meaning that the forget directive will still be honoured at a deeper directory path.

So I wasn’t wrong about my long-term belief, I just tested incorrectly.

This does mean that you can use skip, followed by forget, so long as your skip isn’t too open in its selection criteria.
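
To make the difference concrete, here is a toy model of the behaviour described above. It is emphatically not NetWorker’s implementation: the tree layout, the function name and the rule that a skipped directory is never entered are all assumptions made to mirror the explanation in this post.

```python
from fnmatch import fnmatch

# Hypothetical directory tree: dicts are directories, None marks a file.
TREE = {
    "path": {
        "notes.txt": None,
        "song.mp3": None,
        "subpath": {"criticalpath": {"keep.mp3": None}},
    }
}


def backed_up(tree, directives, path="", inherited=()):
    """Yield the files that would be backed up under this toy model.

    directives maps a directory path to a list of skip patterns, or the
    string "forget" to drop the patterns inherited from higher up.
    """
    active = list(inherited)
    for directive in directives.get(path, []):
        if directive == "forget":
            active = []
        else:
            active.append(directive)
    for name, child in sorted(tree.items()):
        if any(fnmatch(name, pattern) for pattern in active):
            continue  # skipped: files are omitted, directories are never entered
        full = f"{path}/{name}"
        if isinstance(child, dict):
            yield from backed_up(child, directives, full, tuple(active))
        else:
            yield full


if __name__ == "__main__":
    forget = {"/path/subpath/criticalpath": ["forget"]}
    print(list(backed_up(TREE, {"/path": ["*.mp3"], **forget})))
    print(list(backed_up(TREE, {"/path": ["*"], **forget})))
```

With “*.mp3” the walk still reaches criticalpath, so the forget takes effect there; with “*” the walk never gets below /path, so the forget is never seen.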

 

Sometimes NetWorker may not want to cooperate when it comes to moving media in and out of drives, or around in a tape library. While nsrjb -n will do the trick for some media load operations where you don’t want to mount media, it’s not always available. Sometimes you will need to do a media move operation without NetWorker – either in situations where NetWorker isn’t running, or at times when NetWorker is disagreeing with the output of sjirdtag.

In these cases, you want to work with sjimm.

The usage for sjimm is:

[root@tara ~]# sjimm
1642:sjimm: usage: sjimm jukebox {drive|slot|inlt|mt} src {drive|slot|inlt|mt} dst

In this case, ‘jukebox’ will be the x.y.z component of the SCSI device ID as output from inquire (or as determined by checking the control port field for the jukebox.)

For instance, on my lab system, inquire shows me:

[root@tara ~]# inquire -l | grep Autochanger
scsidev@0.0.0:SPECTRA PYTHON          5500|Autochanger (Jukebox), /dev/sg6

So I know then that the jukebox component of my sjimm command will be 0.0.0.
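
If you want to pick that value up in a script rather than by eye, something like the following works against the inquire line shown above. It is a sketch that assumes the “scsidev@x.y.z:...|Autochanger” layout of that sample; the function name is hypothetical.

```python
import re

# Sample lines in the format shown above.
INQUIRE = """\
scsidev@0.0.0:SPECTRA PYTHON          5500|Autochanger (Jukebox), /dev/sg6
scsidev@0.3.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst2
"""


def autochanger_ports(inquire_output):
    """Return the x.y.z control port of every autochanger line."""
    ports = []
    for line in inquire_output.splitlines():
        match = re.match(r"\s*scsidev@(\d+\.\d+\.\d+):.*\|Autochanger", line)
        if match:
            ports.append(match.group(1))
    return ports


if __name__ == "__main__":
    print(autochanger_ports(INQUIRE))
```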

So say I wanted to move the tape in slot 23 into the first drive in my autochanger. I’d use the command:

[root@tara ~]# sjimm 0.0.0 slot 23 drive 1

Note though that this doesn’t mount the tape. If I then run nsrjb, for my drive area I see:

drive 1 (/dev/nst0) slot   :   
drive 2 (/dev/nst1) slot   :   
drive 3 (/dev/nst2) slot   :   
drive 4 (/dev/nst3) slot   :   
drive 5 (/dev/nst4) slot   :   
drive 6 (/dev/nst5) slot   :

Note too that I didn’t give the drive to load the tape into as an operating system device, but instead as a device number as per the autochanger’s definition. (I’ll get to tracing that in a minute.)

I can verify that there is a tape in the given drive at this point by running the command:

[root@tara ~]# nsrmm -p -f /dev/nst0
Verified LTO Ultrium-4 tape 800823L4 on /dev/nst0

When you’re done with the tape, you can then move it back:

[root@tara ~]# sjimm 0.0.0 drive 1 slot 23

Note that depending on the drive type, it may be necessary before issuing the above command to issue the mt command to take the media “offline”, which usually issues an eject command to the drive – e.g.:

[root@tara ~]# mt -f /dev/nst0 rewoff

Other than that, there’s actually not a lot to sjimm. You can move tapes from slots to drives, slots to CAP slots, drives to slots, slots to slots, etc.
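
Since the argument order is easy to get backwards, a small wrapper that validates the element types from the usage message can help. This is only a sketch: the function name is hypothetical, and it builds the argument list without running anything.

```python
# Element types taken from the sjimm usage message shown earlier.
ELEMENT_TYPES = {"drive", "slot", "inlt", "mt"}


def build_sjimm_command(jukebox, src_type, src, dst_type, dst):
    """Build: sjimm jukebox {drive|slot|inlt|mt} src {drive|slot|inlt|mt} dst"""
    for kind in (src_type, dst_type):
        if kind not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {kind!r}")
    return ["sjimm", jukebox, src_type, str(src), dst_type, str(dst)]


if __name__ == "__main__":
    # Load slot 23 into drive 1, then put the tape back again.
    print(" ".join(build_sjimm_command("0.0.0", "slot", 23, "drive", 1)))
    print(" ".join(build_sjimm_command("0.0.0", "drive", 1, "slot", 23)))
```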

However, I did mention that I’d help you work out what drive number corresponds to what operating system device. Obviously if you’ve got the library configured, you can just use nsrjb’s output to see the autochanger device <-> OS device path mapping. If you don’t yet have a tape library configured in NetWorker, or the issue is determining which drive is currently mapped to which path (after, say, a tape drive replacement), you need to do a little more digging.

So, in this case you’d run sjisn – which is designed to report serial numbers and device details for tape library components. Like sjimm, sjisn takes the control port of the tape library we want to communicate with – e.g.:

[root@tara ~]# sjisn 0.0.0

Serial Number data for 0.0.0 (SPECTRA  PYTHON          ):
Library:
Serial Number: XYZZY
SCSI-3 Device Identifiers:
ATNN=SPECTRA PYTHON          XYZZY
WWNN=11223344ABCDEF00
Drive at element address 1:
SCSI-3 Device Identifiers:
ATNN=ZF7584364
Drive at element address 2:
SCSI-3 Device Identifiers:
ATNN=ZF7584366
Drive at element address 3:
SCSI-3 Device Identifiers:
ATNN=ZF7584368
Drive at element address 4:
SCSI-3 Device Identifiers:
ATNN=ZF7584370
Drive at element address 5:
SCSI-3 Device Identifiers:
ATNN=ZF7584372
Drive at element address 6:
SCSI-3 Device Identifiers:
ATNN=ZF7584374

The number given in the “Drive at element address” line for each drive represents, literally, the drive number according to the tape library itself. I.e., when it refers to drive 1, it means the drive with serial number ZF7584364.

Moving on, we can then run inquire -l to provide the device details so as to align internal library drive numbers to operating system paths, cross-referencing by the serial numbers (or WWNs when using a fibre-channel tape library).  In this case, I’ll just show the details for two of the tape drives:

scsidev@0.3.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst2
                                           S/N:    ZF7584368
                                           ATNN=IBM     ULT3580-TD4     ZF7584368
                                           WWNN=11223344ABCDEF03
scsidev@0.4.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst3
                                           S/N:    ZF7584370
                                           ATNN=IBM     ULT3580-TD4     ZF7584370
                                           WWNN=11223344ABCDEF04

So, you can see from the above that we can map the drives as follows:

  • The drive known to the OS as /dev/nst2, which has a serial number of ZF7584368 maps to the library device number 3.
  • The drive known to the OS as /dev/nst3, which has a serial number of ZF7584370 maps to the library device number 4.

So this would give us the drive numbers to use in sjimm if we needed to move tapes in or out of those drives without using NetWorker’s NMC or nsrjb.
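
The same cross-reference can be done mechanically. The sketch below assumes both outputs are laid out as in the excerpts above (in particular, that sjisn reports the bare serial number in each drive’s ATNN line and that inquire -l prints an S/N line under each tape device); the function names are made up for illustration.

```python
import re

# Excerpts in the format shown above.
SJISN = """\
Drive at element address 3:
SCSI-3 Device Identifiers:
ATNN=ZF7584368
Drive at element address 4:
SCSI-3 Device Identifiers:
ATNN=ZF7584370
Drive at element address 5:
SCSI-3 Device Identifiers:
ATNN=ZF7584372
"""

INQUIRE = """\
scsidev@0.0.0:SPECTRA PYTHON          5500|Autochanger (Jukebox), /dev/sg6
scsidev@0.3.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst2
                                           S/N:    ZF7584368
scsidev@0.4.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst3
                                           S/N:    ZF7584370
"""


def drives_from_sjisn(text):
    """Map library drive number -> drive serial number."""
    drives, current = {}, None
    for line in text.splitlines():
        match = re.search(r"Drive at element address (\d+):", line)
        if match:
            current = int(match.group(1))
            continue
        match = re.search(r"ATNN=(\S+)\s*$", line)
        if match and current is not None:
            drives[current] = match.group(1)
    return drives


def tapes_from_inquire(text):
    """Map serial number -> OS device path, for tape devices only."""
    tapes, device = {}, None
    for line in text.splitlines():
        if line.lstrip().startswith("scsidev@"):
            match = re.search(r"\|Tape, (\S+)", line)
            device = match.group(1) if match else None
        elif device:
            match = re.search(r"S/N:\s*(\S+)", line)
            if match:
                tapes[match.group(1)] = device
    return tapes


def map_drives(sjisn_text, inquire_text):
    """Map library drive number -> OS device path (None if not found)."""
    tapes = tapes_from_inquire(inquire_text)
    return {num: tapes.get(sn) for num, sn in drives_from_sjisn(sjisn_text).items()}


if __name__ == "__main__":
    print(map_drives(SJISN, INQUIRE))
```

A drive that comes back as None is one the library reports but the operating system excerpt does not, which is exactly the case worth investigating after a drive replacement.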

As a side-note, that’s also how you’d go about identifying the correct device order for a manual jbconfig operation when the library device order is out of sync with the operating system devices – cross-checking via sjisn and inquire.

 

While this is pertinent to all versions of NetWorker, it particularly seems relevant mentioning now, since as of 7.5.2, we’re now seeing revised messaging from NetWorker when a tape becomes prematurely full. These new messages now state:

nsrd media notice: LTO Ultrium-4 tape 800814L4 used 2039 MB of 800 GB capacity
nsrd media notice: NetWorker media: (Warning) 800814L4 marked full prematurely.
  Verify possible error on the device /dev/nst4, advertised capacity is 800 GB
  marked full at 2039 MB

Now, it’s worth noting here that a tape filling up this soon normally means there is an issue, and this version of the message, while only subtly different, is certainly more informative, which is a good thing. When we consider VTLs, however, it’s a different story. In a virtual tape library, we normally want to use much smaller media sizes than the nominal capacity of the drive type we’re configured for. That way you’re writing virtual volumes that are 50GB or 100GB rather than 800GB. In my case, referring to the above, my lab VTL uses virtual media sizes of 1GB (with compression).

So, how do you go about this? Well, it’s easiest to accomplish when you first set up the environment. You need to change the “Volume Default Capacity” of each virtual device to suit the allocated media sizes. To do this, in NMC turn on View->Diagnostic Mode, then when viewing device properties, enter the appropriate size in gigabytes (followed by “G” or “GB”) in the “Volume default capacity” field of the Configuration tab, shown below:

Changing default volume size

Now, if you can do that on your VTL devices before you start labelling volumes, you’re done and dusted. However, if you’ve previously labelled your media, you either have to relabel the currently blank virtual media or wait until NetWorker gets around to recycling the currently used media.

You can query mminfo to see what the default capacity is registered at – e.g.,

[root@tara ~]# mminfo -m
state volume                  written  (%)  expires     read mounts capacity
800801L4                2254 MB full 02/26/2011   0 KB     5    800 GB
800802L4                   0 KB   0%     undef    0 KB     5   1000 MB
800804L4                   0 KB   0%     undef    0 KB     5   1000 MB
800805L4                   0 KB   0%     undef    0 KB     3    800 GB
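
To spot volumes still carrying the old default, you can compare the capacity column against the size you expect. This sketch assumes the layout of the listing above, including an empty “state” column on every row; a populated state column would shift the fields, so treat it as a starting point. The function names are hypothetical.

```python
# The listing shown above.
MMINFO = """\
state volume                  written  (%)  expires     read mounts capacity
800801L4                2254 MB full 02/26/2011   0 KB     5    800 GB
800802L4                   0 KB   0%     undef    0 KB     5   1000 MB
800804L4                   0 KB   0%     undef    0 KB     5   1000 MB
800805L4                   0 KB   0%     undef    0 KB     3    800 GB
"""


def volume_capacities(mminfo_output):
    """Return {volume: capacity} using the first and last two fields of each row."""
    capacities = {}
    for line in mminfo_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "state":
            continue  # blank line or header
        capacities[fields[0]] = " ".join(fields[-2:])
    return capacities


def mismatched(capacities, expected):
    """Volumes whose registered capacity differs from the expected one."""
    return sorted(vol for vol, cap in capacities.items() if cap != expected)


if __name__ == "__main__":
    print(mismatched(volume_capacities(MMINFO), "1000 MB"))
```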

Now, what effect does this have on how much you can write to the volumes? The short answer is none. All you’re doing is adjusting the default capacity assigned to new volumes that are labelled in these (virtual) tape drives. We see what happens when NetWorker breaches the default volume capacity all the time with physical tape: it just keeps writing until it hits end of physical tape. Nothing more, nothing less. So this means when you fill up your virtual media, NetWorker doesn’t complain at all:

nsrd media notice: LTO Ultrium-4 tape 800802L4 on /dev/nst3 is full
nsrd media notice: LTO Ultrium-4 tape 800802L4 used 2793 MB of 1000 MB capacity
nsrd media info: WORM capable for device /dev/nst3 has been set

Is this something you must do? Well, no, not technically. However, remembering that I advocate a zero error policy, the above is something I’d definitely strongly recommend for virtual devices. Doing so will eliminate what would otherwise be false errors on the virtual tapes within the NetWorker daemon logs. That means if you have to search for media issues, or refer your daemon logs to your support provider for analysis, they won’t be seeing bunches of “tape filled prematurely” issues.
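
The two “used X of Y capacity” notices quoted in this post differ only in how the amount written compares with the registered capacity. As a rough illustration, the sketch below works out that percentage from a notice line. It assumes 1 GB = 1024 MB (the post does not say which convention NetWorker uses), and the function name is hypothetical.

```python
import re

# Assumption: binary units, expressed in MB.
UNITS_MB = {"KB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}


def fill_percentage(notice):
    """Return used/capacity as a percentage, or None if the line doesn't match."""
    match = re.search(
        r"used (\d+) (KB|MB|GB|TB) of (\d+) (KB|MB|GB|TB) capacity", notice
    )
    if not match:
        return None
    used = int(match.group(1)) * UNITS_MB[match.group(2)]
    capacity = int(match.group(3)) * UNITS_MB[match.group(4)]
    return 100.0 * used / capacity


if __name__ == "__main__":
    for line in (
        "nsrd media notice: LTO Ultrium-4 tape 800814L4 used 2039 MB of 800 GB capacity",
        "nsrd media notice: LTO Ultrium-4 tape 800802L4 used 2793 MB of 1000 MB capacity",
    ):
        print(f"{fill_percentage(line):.2f}%")
```

A figure far below 100% is what sits behind the “marked full prematurely” warning; a figure above 100% is just NetWorker writing past the default capacity, which it does without complaint.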

 

The scenario:

  • A clone or stage operation has aborted (or otherwise failed)
  • It has been restarted
  • It hangs waiting for a new volume even though there’s a partially written volume available.

This is a relatively easy problem to explain. Let’s first look at the log messages that appear. To generate this error, I started cloning some data to the “Default Clone” pool, with only one volume in the pool, then aborted. Shortly thereafter I tried to run the clone again, and when NetWorker wouldn’t write to the volume I unmounted and remounted it – a common thing that newer administrators will try in this scenario. This is where you’ll hit the following error in the logs:

media notice: Volume `800829L4' ineligible for this operation; Need a different volume
from pool `Default Clone'
media info: Suggest manually labeling a new writable volume for pool 'Default Clone'

So, what’s the cause of this problem? It’s actually relatively easy to explain.

A core component in NetWorker’s media database design is that a saveset can only ever have one instance on a piece of media. This applies equally to failed and complete saveset instances.

The net result is that this error/situation will occur because it’s meant to – NetWorker doesn’t permit more than one instance of a saveset to appear on the same piece of physical media.

So what do you do when this error comes up?

  • If you’re backing up to disk, an aborted saveset should normally be cleared up automatically by NetWorker after the operation is aborted. However, in certain instances this may not be the case. For NetWorker 7.5 vanilla and 7.5.1.1/7.5.1.2, this should be done by expiring the saveset instance – using nsrmm to flag the instance as having an expiry date within a few minutes or seconds. For all other versions of NetWorker, you should just be able to delete the saveset instance.
  • When working with tape (virtual or physical), the most recommended approach would be to move on to another tape, or if the instance is the only instance on that tape, relabel the tape. (Some would argue that you can use nsrmm to delete the saveset instance from the tape and then re-attempt the operation, but since NetWorker is so heavily designed to prevent multiple instances of a saveset on a piece of media, I’d strongly recommend against this.)
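
The one-instance-per-volume rule is simple enough to model. The sketch below is an illustration of the rule as described in this post, not of how NetWorker checks it internally; the saveset ID, clone ID and the second volume label are hypothetical values.

```python
def volumes_holding(instances, ssid):
    """Volumes that already hold an instance (complete or aborted) of ssid."""
    return {volume for volume, instance_ssid, _cloneid in instances
            if instance_ssid == ssid}


def writable_candidates(pool_volumes, instances, ssid):
    """Pool volumes that could still accept a new instance of ssid."""
    blocked = volumes_holding(instances, ssid)
    return [volume for volume in pool_volumes if volume not in blocked]


if __name__ == "__main__":
    # (volume, ssid, cloneid) rows; the aborted clone left an instance behind.
    instances = [("800829L4", "4000000001", "1262000000")]
    print(writable_candidates(["800829L4"], instances, "4000000001"))
    print(writable_candidates(["800829L4", "800830L4"], instances, "4000000001"))
```

With only one volume in the pool the candidate list is empty, which is the situation where the clone sits waiting for a new volume.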

Overall it’s a fairly simple issue, but knowing how to recognise it lets you resolve it quickly and painlessly.

 

Looking at the stats both for this new site and the previous site, I’ve compiled a list of the top 10 read articles on The NetWorker Blog for 2009. The top 3 of course match the three articles that routinely turn out to be the most popular in any given month, which says something about their relevance to the average NetWorker administrator.

(Note: I’ve excluded non-article pages from the top 10.)

Number 10 – Instantiating Savesets

The very first article on the blog, Instantiating Savesets detailed the importance of distinguishing between all instances of a saveset and a specific instance of a saveset.

This distinction between using just the saveset ID, and using a saveset ID/clone ID combination becomes particularly important when staging from disk backup units. If clones exist and you stage using just the saveset ID, when NetWorker cleans up at the end of the staging operation it will remove reference to the clones as well as deleting the original from the disk backup unit. (Something you really don’t want to have happen.)

Recommendation to EMC: Perhaps it would be worthwhile requiring a “-y” argument to nsrstage if staging savesets from disk backup units and specifying only the saveset ID.

Recommendation to NetWorker administrators: Always be careful when staging that you specify both the saveset and the clone ID.

Number 9 – Basics – Important mminfo fields

In May I wrote about a few key mminfo fields – notably:

  • savetime
  • sscreate
  • ssinsert
  • sscomp
  • ssaccess

Sadly, I didn’t get the result I wanted with EMC on ssaccess. Documented as being updated whenever a saveset fragment is accessed for backup and recovery, the most I could get was an acknowledgement that it was currently broken and to lodge an RFE to get it fixed. (The alternative was to have the documentation changed to take out reference to read operations – something I didn’t want to have happen!)

Recommendation to EMC: ssaccess would be a particularly useful mminfo field, particularly when analysing recovery statistics for NetWorker. Please fix it.

Number 8 – Basics – Listing files in a backup

Want to know what files were backed up as part of the creation of a saveset? If you do, you’re not unique – this has remained a very popular article since it was written in January.

Recommendation to EMC: This information can be retrieved via a combination of mminfo/nsrinfo, but it would be handy if NMC supported drilling down into a saveset to provide a file listing.

Number 7 – Using yum to install NetWorker on Linux

NetWorker’s need for dependency resolution on Linux for installation of the client packages in particular drew a lot of people to this article.

Number 6 – Basics – mminfo, savetime, and greater than/less than

This article explained why NetWorker uses the greater than and less than signs in mminfo in a way that newcomers to the product might find backwards. If you’re not aware of why mminfo works the way it does for specifying savetimes, you should be.

Number 5 – 7.5(.1) changed behaviour – deleting savesets from adv_file devices

This was a particularly unpleasant bug introduced into NetWorker 7.5, thankfully resolved now in the cumulative service releases and NetWorker 7.6.

The gist of it is that in NetWorker 7.5/7.5.1 (aka 7.5 SP1), if you deleted a saveset on a disk backup unit, NetWorker would suffer a serious failure where it would from that point have issues cleaning regular expired savesets from the disk backup unit and insist that the disk backup unit had major issues. The primary error would manifest as:

nsrd adv_file warning: Failed to fetch the saveset(ss_t) structure for ssid 1890993582

This was fixed in 7.5.1.2, thankfully.

Recommendation to EMC: Never let this bug see the light of day again, please. (So far you’re doing an excellent job, by the way.)

Number 4 – NetWorker 7.5.1 Released

I’ve recently noticed a disturbing trend among many vendors, EMC included, where once a new release is made of a product, sales and account staff become overly enthusiastic about recommending new releases. This comes on top of not really having any technical expertise. (Please be patient, I’m trying to put this as diplomatically as possible.)

One of the worst instances I’ve seen of this in the last few years was the near-hysterical pumping of 7.5 thanks to some useful features to do with virtualisation in particular. I’ll admit that my articles on the integration between Oracle Module 5 and NetWorker 7.5, as well as Probe Based Backups may have added to this. However, there was somewhat of a stampede to 7.5 when it came out, and consequently, when it had some issues, there was strong enthusiasm for the release of 7.5.1.

This, by the way, is why IDATA maintains for its support customers a recommended versions list that is not automatically updated when new versions of products come out.

Recommendation to EMC: Remind your sales staff that existing users already have the product, and not to just go blindly convincing them to upgrade. Otherwise you’ll eventually start sounding like this.

Number 3 – Carry a jukebox with you (if you’re using Linux)

During 2009, Mark Harvey’s LinuxVTL project first got the open source LinuxVTL working with NetWorker in a single drive configuration, then eventually, in multi-drive configurations. (Mark assures me, by the way, that patches are coming real soon to allow multiple robots on the same storage node/server.)

Lesson for me: With the LinuxVTL configured on multiple lab servers in my environment, I’ve really taken to VTLs this year, and considerably changed my attitude on using them. (I’ll say again: I still resent that they’re needed, but I now respect them a lot more than I previously did.)

Lesson for others: Even Mark himself says that the open source VTL shouldn’t be used for production backups. Don’t be cheap with your backup system: this is an excellent tool for lab setups, training, diagnostics, etc., but it is not a replacement for a production-ready VTL system. If you want a VTL, buy a VTL.

Number 2 – Basics – Parallelism in NetWorker

Some would say that the high popularity of an article about parallelism in NetWorker indicates that it’s not sufficiently documented.

I’m not entirely convinced that’s the case. But it does go to show that it’s an important topic when it comes to performance tuning, and summary articles about how the various types of parallelism interact are obviously popular.

Lesson for everyone: Now that the performance tuning guide has been updated and made more relevant in NetWorker 7.6, I’d recommend that people wanting an official overview of some of the parallelism options check that out in addition to the article above.

Number 1 – Basics – Fixing “NSR peer information” errors

Goodness this was a popular article in 2009 – detailing how to fix the “NSR peer information” errors that can come up from time to time in the NetWorker logs. If you’re not familiar with this error yet, it’s likely you will eventually as a NetWorker administrator see an error such as:

39078 02/02/2009 09:45:13 PM  0 0 2 1152952640 5095 0 nox nsrexecd SYSTEM error: There is already a machine using the name: “faero”. Either choose a different name for your machine, or delete the “NSR peer information” entry for “faero” on host: “nox”

Recommendation for EMC: Users shouldn’t really need to be Googling for a solution to this problem. Let’s see an update to NetWorker Management Console where these errors/warnings are reported in the monitoring log, with the administrator being able to right click on them and choose to clear the peer information after confirming that they’re confident no nefarious activity is happening.

Wrapping Up

I have to say, it was a fantastically satisfying year writing the blog, and I’m looking forward to seeing what 2010 brings in terms of most useful articles.

 

Something that continues to periodically come up is the need to remind people running manual staging to ensure they specify both the SSID and the Clone ID when they stage. I did some initial coverage of this when I first started the blog, but I wanted to revisit and demonstrate exactly why this is necessary.

The short version of why is simple: If you stage by SSID alone, NetWorker will delete/purge all instances of the saveset other than the one you just created. This is Not A Good Thing for 99.999% of what we do within NetWorker.

So to demonstrate, here’s a session where I:

  1. Generate a backup
  2. Clone the backup to tape
  3. Stage the saveset only to tape

In between each step, I’ll run mminfo to get a dump of what the media database says about saveset availability.

Part 1 – Generate the Backup

Here’s a very simple backup for the purposes of this demonstration, and the subsequent mminfo command to find out about the backup:

[root@tara ~]# save -b Default -LL -q /etc
save: /etc  106 MB 00:00:07   2122 files
completed savetime=1258093549

[root@tara ~]# mminfo -q "client=tara.pmdg.lab,name=/etc" -r volume,ssid,cloneid,savetime
 volume        ssid          clone id  date
Default.001    2600270829  1258093549 11/13/2009
Default.001.RO 2600270829  1258093548 11/13/2009

There’s nothing out of the ordinary here, so we’ll move onto the next step.

Part 2 – Clone the Backup

We’ll just do a manual clone to the Default Clone pool. Here we’ll specify the saveset ID alone, which is fine for cloning – but is often what leads people to being in the habit of not specifying a particular saveset instance. I’m using very small VTL tapes, so don’t be worried that in this case I’ve got a clone of /etc spanning 3 volumes:

[root@tara ~]# nsrclone -b "Default Clone" -S 2600270829
[root@tara ~]# mminfo -q "client=tara.pmdg.lab,name=/etc" -r volume,ssid,cloneid,savetime
 volume        ssid          clone id  date
800843S3       2600270829  1258094164 11/13/2009
800844S3       2600270829  1258094164 11/13/2009
800845S3       2600270829  1258094164 11/13/2009
Default.001    2600270829  1258093549 11/13/2009
Default.001.RO 2600270829  1258093548 11/13/2009

As you can see there, it’s all looking fairly ordinary at this point – nothing surprising is going on at all.

Part 3 – Stage by Saveset ID Only

In this next step, I’m going to stage by saveset ID alone rather than specifying the saveset ID/clone ID, which is the correct way of staging, so as to demonstrate what happens at the conclusion of the staging. I’ll be staging to a pool called “Big”:

[root@tara ~]# nsrstage -b Big -v -m -S 2600270829
Obtaining media database information on server tara.pmdg.lab
Parsing save set id(s)
Migrating the following save sets (ids):
 2600270829
5874:nsrstage: Automatically copying save sets(s) to other volume(s)

Starting migration operation...
Nov 13 17:34:00 tara logger: NetWorker media: (waiting) Waiting for 1 writable
volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
5884:nsrstage: Successfully cloned all requested save sets
5886:nsrstage: Clones were written to the following volume(s):
 BIG991S3
6359:nsrstage: Deleting the successfully cloned save set 2600270829
Successfully deleted original clone 1258093548 of save set 2600270829
from media database.
Successfully deleted AFTD's companion clone 1258093549 of save set 2600270829
from media database with 0 retries.
Successfully deleted original clone 1258094164 of save set 2600270829
from media database.
Recovering space from volume 4294740163 failed with the error
'Cannot access volume 800844S3, please mount the volume or verify its label.'.
Refer to the NetWorker log for details.
6330:nsrstage: Cannot access volume 800844S3, please mount the volume
or verify its label.
Completed recover space operation for volume 4177299774
Refer to the NetWorker log for any failures.
Recovering space from volume 4277962971 failed with the error
'Cannot access volume 800845S3, please mount the volume or verify its label.'.
Refer to the NetWorker log for details.
6330:nsrstage: Cannot access volume 800845S3, please mount the volume
or verify its label.
Recovering space from volume 16550059 failed with the error
'Cannot access volume 800843S3, please mount the volume or verify its label.'.
Refer to the NetWorker log for details.
6330:nsrstage: Cannot access volume 800843S3, please mount the volume
or verify its label.

You’ll note there’s a bunch of output there about being unable to access the clone volumes the saveset was previously cloned to. When we then check mminfo, we see the consequences of the staging operation though:

[root@tara ~]# mminfo -q "client=tara.pmdg.lab,name=/etc" -r volume,ssid,cloneid,savetime
 volume        ssid          clone id  date
BIG991S3       2600270829  1258095244 11/13/2009

As you can see – no reference to the clone volumes at all!

Now, has the clone data been erased? No, but it has been removed from the media database, meaning you’d have to manually scan the volumes back in order to be able to use them again. Worse, if those volumes only contained clone data that was subsequently removed from the media database, they may become eligible for recycling and get re-used before you notice what has gone wrong!
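
One way to catch this sort of loss early is to compare the mminfo listings taken before and after the staging operation. The sketch below assumes the “volume ssid cloneid date” layout used in this session; the function name is hypothetical.

```python
# Listings from the session above (before and after staging by SSID alone).
BEFORE = """\
 volume        ssid          clone id  date
800843S3       2600270829  1258094164 11/13/2009
800844S3       2600270829  1258094164 11/13/2009
800845S3       2600270829  1258094164 11/13/2009
Default.001    2600270829  1258093549 11/13/2009
Default.001.RO 2600270829  1258093548 11/13/2009
"""

AFTER = """\
 volume        ssid          clone id  date
BIG991S3       2600270829  1258095244 11/13/2009
"""


def parse_instances(mminfo_output):
    """Return a set of (volume, ssid, cloneid) tuples, skipping the header."""
    rows = set()
    for line in mminfo_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
            rows.add((fields[0], fields[1], fields[2]))
    return rows


if __name__ == "__main__":
    lost = parse_instances(BEFORE) - parse_instances(AFTER)
    for volume, ssid, cloneid in sorted(lost):
        print(f"no longer in media database: {ssid}/{cloneid} on {volume}")
```

Losing the two disk backup unit instances is expected after staging; the three tape volumes holding clone 1258094164 are the ones that should not have disappeared.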

Wrapping Up

Hopefully the above session will have demonstrated the danger of staging by saveset ID alone. If instead of staging by saveset ID we staged by saveset ID and clone ID, we’d have had a much more desirable outcome. Here’s a (short) example of that:

[root@tara ~]# save -b Default -LL -q /tmp
save: /tmp  2352 KB 00:00:01     67 files
completed savetime=1258094378
[root@tara ~]# mminfo -q "name=/tmp" -r volume,ssid,cloneid
 volume        ssid          clone id
Default.001    2583494442  1258094378
Default.001.RO 2583494442  1258094377
[root@tara ~]# nsrclone -b "Default Clone" -S 2583494442

[root@tara ~]# mminfo -q "name=/tmp" -r volume,ssid,cloneid
 volume        ssid          clone id
800845S3       2583494442  1258095244
Default.001    2583494442  1258094378
Default.001.RO 2583494442  1258094377
[root@tara ~]# nsrstage -b Big -v -m -S 2583494442/1258094377
Obtaining media database information on server tara.pmdg.lab
Parsing save set id(s)
Migrating the following save sets (ids):
 2583494442
5874:nsrstage: Automatically copying save sets(s) to other volume(s)

Starting migration operation...

5886:nsrstage: Clones were written to the following volume(s):
 BIG991S3
6359:nsrstage: Deleting the successfully cloned save set 2583494442
Successfully deleted original clone 1258094377 of save set 2583494442 from
media database.
Successfully deleted AFTD's companion clone 1258094378 of save set 2583494442
from media database with 0 retries.
Completed recover space operation for volume 4177299774
Refer to the NetWorker log for any failures.

[root@tara ~]# mminfo -q "name=/tmp" -r volume,ssid,cloneid
 volume        ssid          clone id
800845S3       2583494442  1258095244
BIG991S3       2583494442  1258096324

The recommendation that I always make is that you forget about using saveset IDs alone unless you absolutely have to. Instead, get yourself into the habit of always specifying a particular instance of a saveset ID via the “ssid/cloneid” option. That way, if you do any manual staging, you won’t wipe out access to data!
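
If you script manual staging, it is worth having the script look up the clone ID itself so the ssid/cloneid form is always used. This is a minimal sketch assuming the mminfo layout shown above; the function name is hypothetical, and the nsrstage options are simply the ones used in this session (-b, -v, -m, -S).

```python
# Listing from the session above, taken after cloning /tmp.
MMINFO = """\
 volume        ssid          clone id
800845S3       2583494442  1258095244
Default.001    2583494442  1258094378
Default.001.RO 2583494442  1258094377
"""


def stage_command(mminfo_output, source_volume, pool):
    """Build an nsrstage command for the instance held on source_volume."""
    for line in mminfo_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == source_volume:
            ssid, cloneid = fields[1], fields[2]
            # Always name the specific instance, never the bare saveset ID.
            return ["nsrstage", "-b", pool, "-v", "-m", "-S", f"{ssid}/{cloneid}"]
    raise LookupError(f"no saveset instance found on volume {source_volume}")


if __name__ == "__main__":
    print(" ".join(stage_command(MMINFO, "Default.001.RO", "Big")))
```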

 

When I was at University, a philosophy lecturer remarked rather sagely that University is the last place people can go to learn for the sake of learning.

That’s sort of correct, but not always so. People can fumble through their jobs on a day to day basis learning what they have to, but they can also work along the basis of trying to soak up as much information as they can along the way. I’m not always a knowledge sponge – particularly if my caffeine quota is on the light side for the day, but I like to think I learn the odd thing here and there.

In the spirit of knowledge acquisition, here’s a few smaller things I’ve learned recently:

  • When simulating network connectivity problems, there’s a big difference between yanking the network cable and shutting down the network interface. (I was doing the interface shutdown, another person was doing the network cable unplug – and our results didn’t correlate.) Lesson: When escalating a case to vendor support, always spell out how you’re simulating the “comms failure” a customer is having.
  • The ‘bigasm’ utility starts to fall in a heap and becomes extremely unreliable once you exceed about 2100 GB of data generated for a single file. Lesson: When setting out to generate 2.3+ TB of backup data, create a bunch of files and have a bigasm directive to generate a smaller amount of data per file.
  • When setting up tests that will take a couple of days to run, always triple check what you’re about to do before you start it. Lesson: If you make a typo of 250 files at 100 GB each instead of 250 files at 10 GB each, bigasm/NetWorker won’t interpolate what you really meant.
  • There’s a hell of a difference between Solaris 10 AMD release 2 and release 8. Lesson: If wanting to get a Solaris 10 AMD 64-bit OS working in Parallels Desktop for Mac v5 with networking, go for release 8. It will save many forehead bruises.
  • ext3 is about as “modern” a filesystem as I am an elite sportsperson. Lesson: If wanting to achieve decent operational activities with backup to disk under Linux, use XFS instead of ext3.
  • All eSATA is not created equal. Lesson: When using a motherboard SATA -> eSATA converter, make sure the dual drive dock you order doesn’t work as a port multiplier.
 

Within NetWorker, data (savesets) can go through several stages in its lifecycle. Here’s a simple overview of those stages:

Basic data lifecycle

The first stage, obviously, is when data is initially being written – the “in progress” stage.

After the backup completes, data enters two stages – a browsable period and a retention period. These periods may have 100% overlap, or they may be distinctly different. For instance, the “standard” browse/retention policies chosen by NetWorker when you create a new client are:

  • Browse period – 1 month
  • Retention period – 1 year

A common mistake people make with NetWorker is to assume that the retention period starts when the browse period finishes; in actual fact, the retention and browse periods start at the same time, but the browse period can finish before the retention period. So using that standard setting as an example, the saveset is browsable for the first month of the 12 months that it is retained – it is not the case that the saveset is browsable for 1 month, then retained for a further 12.
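The overlap can be sketched with a little integer arithmetic. This is an illustration only: the function name `saveset_state` and its labels are hypothetical, ages and periods are given in whole months, and it deliberately ignores dependencies between backups.

```shell
# Hypothetical illustration: both periods are measured from the backup time.
# Arguments: saveset age, browse period, retention period (whole months).
saveset_state() {
    age=$1
    browse=$2
    retention=$3
    if [ "$age" -lt "$browse" ]; then
        echo "browsable"
    elif [ "$age" -lt "$retention" ]; then
        echo "retained, not browsable"
    else
        echo "past retention"
    fi
}

# With the standard 1 month browse / 1 year retention policies:
saveset_state 0 1 12    # browsable
saveset_state 6 1 12    # retained, not browsable
saveset_state 12 1 12   # past retention
```

Note that the saveset leaves retention at month 12, not month 13, because the two clocks run side by side rather than one after the other.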

Once data is no longer within the retention period, and there are no backups that depend on it still within the retention period, data is considered to be recyclable.

When data is recyclable:

  • If it is on tape:
    • The data will remain available until the media is recycled. This will only happen once all the backups on the media are also recyclable, and either the administrator manually recycles the media or NetWorker re-uses it.
  • If it is on a disk backup unit (ADV_FILE) device:
    • The data will be erased from the disk backup unit the next time a volume clean operation is run, or nsrim is run (either as an overnight standard event by NetWorker, or manually via nsrim -X).

This isn’t the “whole picture” for data lifecycle within NetWorker, but it is a good brief overview to give you an idea of how data is managed within the environment.

 

Many of us with NetWorker have been in the situation where a backup has started (particularly when it’s for a newly configured group), and instead of going to the pool we want it to go to, it goes to the Default pool. For sites using multiple pools, it’s usually the case that no media will be in the Default pool, and hence the backup won’t go anywhere.

In those situations, determining why NetWorker is suddenly requesting media in the Default pool is quite easy. Sometimes however, the answer is not so easy. A media request may come out of the blue, with no server-initiated activities behind it, and nothing may be logged to indicate what is causing the request. It could be that an end-user is attempting to run a backup, or that a backup process that was server initiated has gone awry, restarted, and for some reason targeted the Default pool.

This leads me to what I’d call “Default pool debugging 101” … or “how to save yourself a lot of hair tearing”. I had a customer once who called me and expressed a level of exasperation over having already spent several days off and on chasing down what might be causing the persistent request for “1 writable volume in the Default pool”.

My solution in such situations is simple: if you can’t spot what is going wrong (that is, why NetWorker is asking for media in the wrong pool), then label a volume into that pool and see what writes to it. In such cases one of three things will typically happen:

  1. The volume will be loaded but then not used because a process requested it, was aborted, and for some reason NetWorker didn’t detect the abort.
  2. The volume will be loaded and written to by a manual backup process, in which case the metadata for the backup can be used to identify who (or what) has sent the data to the wrong pool.
  3. The volume will be loaded and written to by an errant scheduled backup process that experienced some failure “a while ago”, in which case it can be staged, upon completion, to the correct pool.
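For the second and third outcomes, the media database can tell you who wrote to the volume. A rough sketch, assuming an mminfo report whose first column is the client name: the helper name `summarise_writers` is hypothetical, and the mminfo invocation in the comment assumes the pool is named Default and that the `pool`, `client` and `name` attributes are available in your release.

```shell
# Hypothetical helper: read an mminfo report on stdin whose first column is
# the client name, skip the header line, and count savesets per client.
summarise_writers() {
    awk 'NR > 1 && NF { count[$1]++ }
         END { for (c in count) print c, count[c] }' | sort
}

# Example usage on the backup server, once something has written to the volume:
#   mminfo -q "pool=Default" -r "client,name" | summarise_writers
```

A single client in the output usually points at a manual or errant backup from that host; many clients suggests a group is misconfigured.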

I’m the first person to jump to the defense of elegant and well considered solutions. Doing the mundane thing of just labeling media into the “incorrect” pool that NetWorker is requesting media for smacks of inelegance or even a pseudo “brute force” approach. However, sometimes the easiest solution is also the best – instead of wasting considerable amounts of time chasing phantoms, why not just cut to the chase in such media situations where the solution isn’t obvious, and let NetWorker tell you where the request is coming from?

© 2012 The NetWorker Blog