Basics – Recovering from an aborted saveset

Apr 24, 2010

Normally you don’t want to be in this position, but sometimes you’ll strike a situation where the only possible location of data that you need to get back is in a saveset that aborted (i.e., failed) during the backup process. Now, if the saveset/media is almost completely hosed, you’re probably going to need to recover using the scanner|uasm process, but if it was just a case of a failed backup, you can direct a partial saveset recovery using the recover command.
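(For completeness: the scanner|uasm route looks roughly like the following. This is a sketch only – the device name and relocation directory are placeholders of mine, and you should check the scanner and uasm man pages on your platform before relying on it:

# scanner -S ssid /dev/nst0 -x uasm -rv -m /=/tmp/recovered

That tells scanner to read the nominated saveset from the tape device and hand the stream to uasm, which recovers it with everything relocated under /tmp/recovered.)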

When you’re at this point the first thing you need to do is find the saveset ID of the aborted saveset, but I’ll leave that as an exercise to the reader. Now, once you’ve got the aborted saveset ID, it’s as simple as running a saveset recovery. The basic command might look like this:

C:\> recover -d path -s buServer -iN -S ssid

Where:

  • ‘path’ is the path that you want to recover to. Note that in these situations, it’s usually a very, very good idea to make sure you recover to somewhere new, rather than overwriting any existing files.
  • ‘buServer’ is the backup server that you want to recover from.
  • ‘ssid’ is the saveset ID for the aborted saveset that you want to recover from.

Depending on whether you're doing a directed recovery, etc., you may end up with a few additional arguments, but the above is pretty much what you need in this situation. (If you're confident that a specific path or file you want back is in the portion of the saveset that was backed up, you can always add that path at the end of the recovery command, too.)
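As for finding the saveset ID of an aborted saveset in the first place, one approach – a sketch only, with a placeholder client name and time window, so check mminfo's man page for the report attributes on your release – is to query the media database and look at the saveset flags:

# mminfo -avot -q "client=clientName,savetime>=24 hours ago" -r "savetime,ssid,sumflags,name"

An aborted saveset will show an 'a' in the flags column (e.g., 'ca').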

Once the recovery runs, you'll get a standard file-by-file listing of what is being recovered, but the recovery will end with what looks like an error – though it's effectively just a notification that NetWorker has hit the data that was 'in transit', so to speak, when the saveset was aborted. This error will look similar to the following:

5041:recover: Unable to read checksum from save stream

16294:recover: Encountered an error recovering C:\temp2\Temp744\win_x86\networkr\hba\emc-homebase-agent-6.1.2-win-x86.exe

53363:recover: Recover of rsid 851692923 failed: Error receiving files from NSR server `tara'

The process cannot access the file because it is being used by another process.

Received 231 matching file(s) from NSR server `tara'

Recover errors with 1 file(s)

Recover completion time: 4/20/2010 3:41:12 PM

At that point, you know that you’ve got back all the data you’re going to get back, and you can search through the recovered files for the data you want.

(As an aside, don’t forget to join the forums if you’ve got questions that aren’t answered in this blog.)

Saveset sizes from 32-bit Windows

Jan 17, 2010

There’s currently a bug within NetWorker whereby if you’re using a 32-bit Windows client that has a filesystem large enough such that the savesets generated are larger than 2TB, you’ll get a massively truncated size reported in the savegroup completion. In fact, for a 2,510 GB saveset, the savegroup completion report will look like this:

Start time:   Sat Nov 14 17:42:52 2009
End time:     Sun Nov 15 06:58:57 2009

--- Successful Save Sets ---
* cyclops:Probe savefs cyclops: succeeded.
* cyclops:C:\bigasms 66135:(pid 3308): NSR directive file (C:\bigasms\nsr.dir) parsed
 cyclops: C:\bigasms              level=full,   1742 MB 13:15:56    255 files
 trash.pmdg.lab: index:cyclops     level=full,     31 KB 00:00:00      7 files
 trash.pmdg.lab: bootstrap         level=full,    213 KB 00:00:00    198 files

However, when checked through NMC, nsrwatch or mminfo, you'll find that the correct size for the saveset is actually shown:

[root@trash ~]# mminfo
 volume        client       date      size   level  name
XFS.002        cyclops   11/14/2009 2510 GB   full  C:\bigasms
XFS.002.RO     cyclops   11/14/2009 2510 GB   full  C:\bigasms

The reporting issue doesn't affect recoverability, but if you're reviewing savegroup completion reports, the data sizes will likely either (a) be a cause for concern, or (b) affect any automated parsing you're doing of the report.
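If automated parsing is the concern, one workaround – a sketch only; the client name and time window are examples, and you should verify the report attributes against mminfo's man page on your release – is to pull saveset sizes straight from the media database rather than from the savegroup report:

# mminfo -q "client=cyclops,savetime>=24 hours ago" -r "client,name,sumsize,level,savetime"

As shown above, the media database holds the correct figure.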

I’ve managed to secure a fix for 7.4.4 for this, with requests in to get it ported to 7.5.1 as well, and to get it integrated into the main trees for permanent inclusion upon the next service packs, etc. If you’ve been putting up with this problem for a while or have just noticed it and want it fixed, the escalation patch number was NW110493.

(It’s possible that this problem affects more than just 32-bit Windows clients – i.e,. it could affect other 32-bit clients as well. I’d be interested in knowing if someone has spotted it on another operating system. I’d test, but my lab environment is currently otherwise occupied and generating 2+TB of data, even at 90MB/s, is a wee bit long.)

Nov 25, 2009

Everyone who has worked with ADV_FILE devices knows this situation: a disk backup unit fills, and the saveset(s) being written hang until you clear up space, because as we know savesets in progress can’t be moved from one device to another:

[Image: Savesets hung on full ADV_FILE device until space is cleared]

Honestly, what makes me really angry (I’m talking Marvin the Martian really angry here) is that if a tape device fills and another tape of the same pool is currently mounted, NetWorker will continue to write the saveset on the next available device:

[Image: Saveset moving from one tape device to another]

What’s more, if it fills and there’s a drive that currently does have a tape mounted, NetWorker will mount a new tape in that drive and continue the backup in preference to dismounting the full tape and reloading a volume in the current drive.

There’s an expression for the behavioural discrepancy here: That sucks.

If anyone wonders why I say VTLs shouldn’t need to exist, but I still go and recommend them and use them, that’s your number one reason.

Oct 27, 2009

NetWorker has an irritating quirk where it doesn't allow you to clone or stage incomplete savesets. I can understand the rationale behind it – it's not completely usable data – but that rationale is wrong.

If you don’t think this is the case, all you have to do to test is start a backup, cancel it mid-way through a saveset, then attempt to clone that saveset. Here’s an example:

[root@tara ~]# save -b Big -q -LL /usr
Oct 25 13:07:15 tara logger: NetWorker media: (waiting) Waiting for 1
writable volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
<backup running, CTRL-C pressed>
(interrupted), exiting
[root@tara ~]# mminfo -q "volume=BIG995S3"
 volume        client       date      size   level  name
BIG995S3       tara.pmdg.lab 10/25/2009 175 MB manual /usr
[root@tara ~]# mminfo -q "volume=BIG995S3" -avot
 volume        client           date     time         size ssid      fl   lvl name
BIG995S3       tara.pmdg.lab 10/25/2009 01:07:15 PM 175 MB 14922466  ca manual /usr
[root@tara ~]# nsrclone -b Default -S 14922466
5876:nsrclone: skipping aborted save set 14922466
5813:nsrclone: no complete save sets to clone

Now, you may be wondering why I’m hung up on not being able to clone or stage this sort of data. The answer is simple: sometimes the only backup you have is a broken backup. You shouldn’t be punished for this!

Overall, NetWorker has a fairly glowing pedigree in terms of enforced data viability:

  • It doesn’t recycle savesets until all dependent savesets are also recyclable;
  • It’s damn aggressive at making sure you have current backups of the backup server’s bootstrap information;
  • If there’s any index issue it’ll end up forcing a full backup for savesets even if it’s backed them up before;
  • It won’t overwrite data on recovery unless you explicitly tell it to;
  • It lets you recover from incomplete savesets via scanner/uasm!

and so on.

So, logically, it makes little sense to refuse to clone or stage incomplete savesets.

There may be programmatic reasons why NetWorker doesn’t permit cloning/staging incomplete savesets, but these aren’t sufficient reasons. NetWorker’s pedigree of extreme focus on recoverability remains tarnished by this inability.

Avoiding 2GB saveset chunks

Aug 19, 2009

Periodically a customer will report to me that a client is generating savesets in 2GB chunks. That is, they get savesets like the following:

  • C: – 2GB
  • <1>C: – 2GB
  • <2>C: – 2GB
  • <3>C: – 1538MB

Under much earlier versions of NetWorker, this was expected; these days, it really shouldn’t happen. (In fact, if it does happen, it should be considered a potential error condition.)

The release notes for 7.4.5 suggest that if you’re currently experiencing chunking in the 7.4.x series, going to 7.4.5 may very well resolve the issue. However, if that doesn’t do the trick for you, the other way of doing it is to switch from nsrauth to oldauth authentication on the backup server for the client exhibiting the problem.

To do this, you need to fire up nsradmin against the client process on the server and adjust the NSRLA record. Here's an example session, using a NetWorker backup server called 'tara':

[root@tara ~]# nsradmin -p 390113 -s tara
NetWorker administration program.
Use the "help" command for help, "visual" for full-screen mode.
nsradmin> show type:; name:; auth methods:
nsradmin> print type: NSRLA
                        type: NSRLA;
                        name: tara.pmdg.lab;
                auth methods: "0.0.0.0/0,nsrauth/oldauth";

So, what we want to do is adjust the ‘auth methods’ for the client that is chunking data, and we want to switch it to using ‘oldauth’ instead. Assuming we have a client called ‘cyclops’ that is exhibiting this problem, and we want to only adjust cyclops, we would run the command:

nsradmin> update auth methods: "cyclops,oldauth","0.0.0.0/0,nsrauth/oldauth"
                auth methods: "cyclops,oldauth", "0.0.0.0/0,nsrauth/oldauth";
Update? y
updated resource id 4.0.186.106.0.0.0.0.42.47.135.74.0.0.0.0.192.168.50.7(7)

Once this has been done, it’s necessary to stop and restart the NetWorker services on the backup server for the changes to take effect.
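(On a Linux or Unix backup server, that restart would typically look something like the following – a sketch only, and the init script name/location varies by platform, so check yours first:

# nsr_shutdown
# /etc/init.d/networker start

On a Windows backup server, restart the NetWorker services from the Services control panel instead.)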

So the obvious follow up questions and their answers are:

  • Why would you need to change the security model from nsrauth to oldauth to fix this problem? It seems that in some instances the security/authentication model can cause NetWorker issues with certain clients that force a reversion to chunking. Switching to the oldauth method prevents this behaviour.
  • Should you just change every client to using oldauth? No – oldauth is being retired over time, and nsrauth is more secure, so it's best to only do this as a last resort. Indeed, if you can upgrade to 7.4.5, that may be the better solution.
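On that last point: if you do later upgrade and want to move cyclops back to nsrauth, the same nsradmin session can be used to restore the default shown earlier – a sketch, reusing the values from the example above:

nsradmin> update auth methods: "0.0.0.0/0,nsrauth/oldauth"

As before, a stop/start of the NetWorker services is needed for the change to take effect.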

[Edit – 2009-10-27]

If you’re on 7.5.1, then in order to avoid chunking you need to be at least on 7.5.1.5 (that’s cumulative patch cluster 5 for 7.5.1.); if you’re one of those sites experiencing recovery problems from continuation/chunked savesets, you are going to need 7.5.1.6. Alternatively, you’ll need LGTsc31925 for whatever platform/release of 7.5.1 that you’re running.

Sub-saveset checkpointing would be good

Jul 29, 2009

Generally speaking I don’t have a lot of time for NetBackup, primarily due to the lack of dependency checking. That’s right, a backup product that doesn’t ensure that fulls are kept for as long as necessary to guarantee recoverability of dependent incrementals isn’t something I enjoy using.

That being said, there are some nifty ideas within NetBackup that I’d like to see eventually make their way into NetWorker.

One of those nifty ideas is the notion of image checkpointing. To use the NetWorker vernacular, this would be sub-saveset checkpointing. The notion of checkpointing is to allow a saveset to be restarted from a point as close to the failure as possible rather than from the start. E.g., your backup may be 20GB into a 30GB filesystem and a failure occurs. With image checkpointing turned on in NetBackup, the backup won’t need to re-run the entire 20GB previously done, but will pick up from the last point in the backup that a checkpoint was taken.

I’m not saying this would be easy to implement in NetWorker. Indeed, if I were to be throwing a bunch of ideas into a group of “Trivial”, “Easy”, “Hmmm”, “Hard” and “Insanely Difficult” baskets, I’d hazard a guess that the modifications required for sub-saveset checkpointing would fall at least into the “Hard” basket.

To paraphrase a great politician though, sometimes you need to choose to do things not because they’re easy, but because they’re hard.

So, first – why is sub-saveset checkpointing important? Well, as data sizes increase, and filesystems continue to grow, having to restart the entire saveset because of a failure “somewhere” within the stream is increasingly inefficient. For the most part, we work through these issues, but as filesystems continue to grow in size and complexity, this makes it harder to hit backup windows when failures occur.

Secondly – how might sub-saveset checkpointing be done? Well, NetWorker is already capable of doing this – sort of – via chunking, or fragments. Long-term NetWorker users will be well aware of this: savesets once had a maximum size of 2GB, so if you were backing up a 7GB filesystem called "/usr", you'd get:

/usr
<1>/usr
<2>/usr
<3>/usr

In the above, “/usr” was considered the “parent” of “<1>/usr”, “<1>/usr” was the parent of “<2>/usr”, and so on. (Parent? man mminfo – read about pssid.)
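(If you want to see that relationship for yourself, something like the following should do it – a sketch only, assuming you still have chunked savesets for /usr in your media database; check the mminfo man page for the exact report attributes on your release:

# mminfo -q "name=/usr" -r "ssid,pssid,name,sumsize"

A non-zero pssid indicates the saveset chunk that the listed saveset continues from.)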

Now, I’m not suggesting a whole-hearted return to this model – it’s a pain in the proverbial to parse and calculate saveset sizes, etc., and I’m sure there’s other inconveniences to it. However, it does an entry to the model we’re looking for – if needing to restart from a checkpoing, a backup could continue via a chunked/fragmented saveset.

The difficulty lies in differentiating between the "broken" part of the parent saveset chunk and the "correct" part of the child saveset chunk, which would likely require an extension to at least the media database. However, I think it's achievable: given that the media database already contains details about segments within savesets (i.e., file/record markers, etc.), in theory it should be possible to include a "bad" flag so that a chunk of data at the end of a saveset chunk can be declared bad, indicating to NetWorker that it needs to move on to the next child chunk.

It’s fair to say that most people would be happy with needing to go through a media database upgrade (i.e., a change to the structure as part of starting a new version of NetWorker) in order to get sub-saveset checkpointing.

Feb 16, 2009

Ever need to adjust the browse/retention time for a saveset, but you’ve not been sure how to do so? Here’s how.

To change the browse or retention time, you’ll need to find out the saveset ID (SSID) of the given saveset. This can be done with mminfo.

For instance, say you had a backup done last night of a machine called ‘archon’ that has now been rebuilt, but you want to keep the old backup for much longer than normal – e.g., ten years instead of the normal 3.

First, to find out what you need to change, get a list of the SSIDs:

# mminfo -q "client=archon,savetime>=24 hours ago" -r name,ssid
 name                          ssid
/                              4036558666
/Volumes/TARDIS/Yojimbo        4019781450
/Volumes/Yu                    4003004234

(If you’re confused about that savetime command, see my other post here.)

Now, for each of the SSIDs returned, we'll run an nsrmm command to adjust the browse and retention time*.

The basic nsrmm command for adjusting the browse and retention time is:

# nsrmm -S ssid -w browse -e retent

or, for a single instance of a saveset:

# nsrmm -S ssid/cloneid -w browse -e retent

Where the ‘browse’ and ‘retent’ values can be either one of the two following:

  • A literal date in US date format ** – e.g., “12/31/2019” for 31 December 2019.
  • A ‘fuzzy’ english worded date – e.g., “+10 years” for 10 years from today.

Note that (rather obviously) your browse time cannot exceed your retention time, and generally it's recommended that you set the browse time equal to the retention time.

So in this case, you’d run for each SSID or SSID/CloneID you want to affect:

# nsrmm -S ssid -w "+10 years" -e "+10 years"

Which will look like the following, based on my mminfo output:

# nsrmm -S 4036558666 -w "+10 years" -e "+10 years"
# nsrmm -S 4019781450 -w "+10 years" -e "+10 years"
# nsrmm -S 4003004234 -w "+10 years" -e "+10 years"

It’s that simple.


* You can also do this against an instance of a saveset by using the SSID/Clone ID; to do that variant, request “-r name,ssid,cloneid”, then use the two numbers in the nsrmm command separated by a forward slash – ssid/cloneid.

** The restriction on US date format may have eased in 7.5. I’m going to do some additional playing around with locales sometime soonish.

Jan 25, 2009

Following a recent discussion I’ve been having on the NetWorker Mailing List, I thought I should put a few details down about clone IDs.

If you don’t clone your backups (and if you don’t: why not?), you may not have really encountered clone IDs very much. They’re the shadowy twin of the saveset ID, and serve a fairly important purpose.

From hereon in, I’ll use the following nomenclature:

  • SSID = Save Set ID
  • CLID = CLone ID

“SSID” is pretty much the standard NetWorker terminology for saveset ID, but usually clone ID is just written as “clone ID” or “clone-id”, etc., which gets a bit tiresome after a while.

Every saveset in NetWorker is tagged with a unique SSID. However, every copy of a saveset is tagged with the same SSID, but a different CLID.

You can see this when you ask mminfo to show both:

[root@nox ~]# mminfo -q "savetime>=18 hours ago,pool=Staging,client=archon,
name=/Volumes/TARDIS" -r volume,ssid,cloneid,nsavetime
 volume        ssid          clone id  save time
Staging-01     3962821973  1228135765 1228135764
Staging-01.RO  3962821973  1228135764 1228135764

(If you must know, being a fan of Doctor Who, all my Time Machine drives are called “TARDIS” – and no, I don’t backup my Time Machine copies with NetWorker, it would be a truly arduous and wasteful thing to do; I use my Time Machine drives for other database dumps from my Macs.)

In this case we’re not only seeing the SSID and CLID, but also a special instance of the SSID/CLID combination – that which is assigned for disk backup units. In the above example, you’ll note that the CLID associated with the read-only (.RO) version of the disk backup unit is exactly one less than the CLID associated with the read-write version of the disk backup unit. This is done by NetWorker for a very specific reason.

So, you might wonder then what the purpose of the CLID is, since we use the SSID to identify an individual saveset, right?

I had hunted for ages for a really good analogy on SSID/CLIDs, and stupidly the most obvious one never occurred to me. One of the NetWorker Mailing List’s most helpful posters, Davina Treiber, posted the (in retrospect) obvious and smartest analogy I’ve seen – comparing savesets to books in a library. To paraphrase, while a library may have multiple copies of the same book (with each copy having the same ISBN – after all, it’s the same book), they will obviously need to keep track of the individual copies of the book to know who has which copy, how many copies they have left, etc. Thus, the library would assign an individual copy number to each instance of the book they have, even if they only have one instance.

This, quite simply, is the purpose of the CLID – to identify individual instances of a single saveset. This means that you can, for example, do any of the following (and more!):

  • Clone a saveset by reading from a particular cited copy.
  • Recover from a saveset by reading from a particular cited copy.
  • Instruct NetWorker to remove from its media database reference to a particular cited copy.

In particular, in the final example, if you know that a particular tape is bad, and you want to delete that tape, you only want NetWorker to delete reference to the saveset instances on that tape – you wouldn’t want to also delete reference to perfectly good copies sitting on other tapes. Thus you would refer to SSID/CLID.
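(For reference, deleting a single instance from the media database uses the same SSID/CLID notation – a sketch only; substitute the real values reported by mminfo:

# nsrmm -d -S ssid/cloneid

Without the /cloneid, nsrmm would be referring to the saveset as a whole rather than just the instance on the bad tape.)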

I’ve not been using the terminology SSID/CLID randomly. When working with NetWorker in a situation where you either want to, or must specify a specific instance of a saveset, you literally use that in the command. E.g.,:

# nsrclone -b "Daily Clone" -S 3962821973/1228135764

Would clone the saveset 3962821973 to the “Daily Clone” pool, using the saveset instance (CLID) 1228135764.

The same command could be specified as:

# nsrclone -b "Daily Clone" -S 3962821973

However, this would mean that NetWorker would pick which instance of the saveset to read from in order to clone the nominated saveset. The same thing happens when NetWorker is asked to perform a recovery in standard situations (i.e., non-SSID based recoveries).

So, how does NetWorker pick which instance of a saveset should be used to facilitate a recovery? The algorithm used goes a little like this:

  • If there are instances online, then the most available instance is used.
  • If there are multiple instances equally online, then the instance with the lowest CLID is requested.
  • If all instances are offline, then the instance with the lowest CLID not marked as offsite is requested.

The first point may not immediately make sense. Most available? If, say, you have two copies on tape, and one is sitting in a library while the other is physically mounted in a tape drive and not in use, the tape in the drive will be used.

For the second point, consider disk backup units – adv_file type devices. In this case, both the RW and the RO "versions" of the saveset (remembering there's only one real physical copy on disk – NetWorker just munges some details to make it appear to the media database that there are two copies) are equally online – they're both mounted disk volumes. So, to prevent recoveries automatically running from the RW "version" of the saveset on disk, when the instances are set up, the "version" on the RO portion of the disk backup unit is assigned a CLID one less than the CLID of the "version" on the RW device.

Thus, we get “guaranteed” recovery/reading from the RO version of the disk backup unit. In normal circumstances, that is. (You can still force recovery/reading from the RW version if you so desire.)

In the final point, if all copies are equally offline, NetWorker previously just requested the copy with the lowest CLID. This works well in a tape only environment – i.e.:

  • Backup to tape
  • Clone backup to another tape
  • Send clone offsite
  • Keep ‘original’ onsite

In this scenario, NetWorker would ask for the ‘original’ by virtue of it having the lowest CLID. However, the CLID is only generated when the saveset is cloned. Thus, consider the backup to disk scenario:

  • Backup to disk
  • Clone from disk to tape
  • Send clone offsite
  • Later, when disk becomes full or savesets are too old, stage from disk to tape
  • Keep new “originals” on-site.

This created a problem – in this scenario, if you went to do a recovery after staging, NetWorker would (annoyingly for many!) request the clone version of the saveset. That meant either requesting the clone be pulled back from the offsite location, doing an SSID/CLID recovery, marking the clone SSID/CLID as suspect, or mounting the "original". However you looked at it, it was a lot of work you really shouldn't have needed to do.

NetWorker 7.3.x however introduced the notion of an offsite flag; this isn’t the same as setting the volume location to offsite however. It’s literally a new flag:

# nsrmm -o offsite 800841

Would mark the volume 800841 in the media database as not being onsite – i.e., as having a less desirable availability for recovery/read operations.

The net result is that in this situation, even if the offsite clone has a lower CLID, if it is flagged as offsite, but there’s a clone with a higher CLID not flagged as offsite, NetWorker will bypass that normal “use the lowest CLID” preference to instead request the onsite copy.

It would certainly be preferable however if a future version of NetWorker could have read priority established as a flag for pools; that way, rather than having to bugger around with the offsite flag (which, incidentally, can only be set/cleared from the command line, and can’t be queried!), an administrator could nominate “This pool has highest recovery priority, whereas this pool has lower recovery priority”. That way, NetWorker would pick the lowest CLID in the highest recovery priority pool.

(I wait, and hope.)
