Generally speaking, you don’t want to be mucking around with the contents of your disk backup units except under extreme circumstances. In fact, I really recommend that you don’t do so unless you 100% know what you’re doing.

So this post is all about that 1-5% of the time where you may find it necessary to, say, search for a saveset that’s reported in the media database but that you’re having problems accessing from the disk backup unit.

It’s actually trivially easy, once you know how.

You may be familiar with the following style of query:

# mminfo -q "savetime>=24 hours ago" -r "volume,client,level,sumsize,savetime(22)"

The (22) at the end of the savetime report parameter tells mminfo to allow 22 characters for the reporting of the savetime. The benefit of this is that you get not only the savetime date, but also the time as well.

NetWorker actually allows you to put that (number) postfix onto any field that it can output in mminfo. This can output additional information, such as the above, or give more room to output longer fields, or even limit the width of fields when you don’t need too much information. (E.g., if the first 4 characters of all client names can uniquely identify the client, you might limit the client to 4 characters in an mminfo report.)

Now, where we’re heading with this, is that the sorts of filenames used for the savesets written to disk backup units are not some random collection of strings – they’re actually the long saveset ID.

Consider then a filename of:

/d/nsr/02/63/05/cd3a182f-00000006-7b7801de-497801de-01871a00-3d2a4f4b

This isn’t just a random filename; it’s the saveset ID, just in a format you may not be used to.

To get the long saveset ID in mminfo output, we use the (number) postfix on the ssid field. This would be as follows:

# mminfo -q "ssid=071462366" -r "ssid(53)"
cd3a182f-00000006-7b7801de-497801de-01871a00-3d2a4f4b

With that information in hand, you can then search for a file with the same name as the long saveset ID on disk.
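As a minimal sketch of that search, assuming the disk backup unit lives under /d/nsr as in the example filename above. The function name find_saveset_file is made up for illustration and is not a NetWorker tool:

```shell
# Hypothetical helper: look for a file named after the long saveset ID
# anywhere beneath a disk backup unit directory.
#   $1 = top-level directory of the disk backup unit (e.g. /d/nsr)
#   $2 = long saveset ID, as reported by mminfo -r "ssid(53)"
find_saveset_file() {
    find "$1" -type f -name "$2" -print
}

# Example usage (read-only; it never modifies the disk backup unit):
#   find_saveset_file /d/nsr cd3a182f-00000006-7b7801de-497801de-01871a00-3d2a4f4b
```

Because this only reads the directory tree, it keeps to the "don’t muck around with the contents" rule from the start of this post.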

You can also do a reverse lookup. Say for instance, you know there’s an issue with a particular saveset file on a disk backup unit. To find out what the actual saveset ID is for this saveset, you can run the counter-query:

# mminfo -q "ssid=cd3a182f-00000006-7b7801de-497801de-01871a00-3d2a4f4b" -r ssid
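If you’re starting from a full path rather than a bare filename, the long saveset ID is simply the last path component. A small sketch (the helper name long_ssid_from_path is hypothetical):

```shell
# Hypothetical helper: strip the directory part of a disk backup unit
# path, leaving the long saveset ID that the file is named after.
long_ssid_from_path() {
    basename "$1"
}

long_ssid_from_path /d/nsr/02/63/05/cd3a182f-00000006-7b7801de-497801de-01871a00-3d2a4f4b

# The result can then be dropped into the reverse lookup, e.g.:
#   mminfo -q "ssid=$(long_ssid_from_path "$file")" -r ssid
```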

So, there you go – very easy!

 

Take a basic mminfo query, add someone not familiar with how NetWorker stores and works with dates/times, and you have instant chaos*. In this post I want to help people who are just starting out with mminfo understand how it works with dates.

So let’s look at a basic query that tends to cause a lot of confusion:

# mminfo -q "client=archon,savetime<=2 weeks ago"

As a long-term NetWorker user and implementation/support consultant, not to mention a long-term member of the NetWorker mailing list, I see this question come up fairly frequently. The output appears “broken” – rather than returning backups taken within the last two weeks, as people tend to expect, the query returns all backups for the client archon that were taken two weeks ago or earlier.

‘Huh?’ I hear you ask.

Indeed, this is oft-used as an example of how “broken” NetWorker is. In fact, the real state is far more prosaic.

NetWorker stores and works with times as seconds since the epoch. When you supply dates to NetWorker, either in the fuzzy format above or as a literal date string, it converts that date into a timestamp of seconds since the epoch. (If you want, you can find out what a savetime is in seconds, rather than as an interpreted date, by choosing a report specification of ‘nsavetime’ in mminfo.)
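To get a feel for what such a timestamp looks like, here is a rough sketch that turns an nsavetime value back into a readable date. It assumes GNU date (the -d "@N" form), and the function name is made up:

```shell
# Convert an nsavetime (seconds since the epoch) to a readable UTC date.
# Assumes GNU date; BSD/macOS date uses "date -u -r SECONDS" instead.
nsavetime_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

nsavetime_to_utc 1233034744   # prints 2009-01-27 05:39:04
```

The result here is in UTC; mminfo will typically show the same moment in the server’s local time zone, so expect the hours to differ.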

So if you then think of the query:

# mminfo -q "client=archon,savetime<=2 weeks ago"

It has a different meaning. You’re actually asking NetWorker:

  • Convert ‘2 weeks ago’ into a timestamp: ‘now’ minus two weeks, in seconds since the epoch. Let’s call that Z.
  • Give me all the backups for the client ‘archon’ where the savetime is less than or equal to Z.
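
The two steps above can be sketched in plain shell arithmetic. This is only an illustration of the comparison, not how NetWorker implements it, and the helper name is hypothetical:

```shell
# Seconds in two weeks: 14 days * 86400 seconds per day.
TWO_WEEKS=$((14 * 86400))

# Hypothetical helper: succeed if a savetime satisfies "savetime<=Z",
# where Z is "now" minus two weeks, all expressed in epoch seconds.
#   $1 = savetime in epoch seconds, $2 = "now" in epoch seconds
older_than_two_weeks() {
    z=$(($2 - TWO_WEEKS))
    [ "$1" -le "$z" ]
}

now=$(date +%s)
if older_than_two_weeks $((now - 3 * 86400)) "$now"; then
    echo "a 3-day-old backup matches"
else
    echo "a 3-day-old backup does not match"
fi
```

A backup from three days ago has a larger timestamp than Z, so it fails the “less than or equal to” test, which is exactly why recent backups are missing from the output.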

If you don’t like to think of it as all referring to seconds since an epoch, there’s another, perhaps simpler way of thinking about it – that being:

  • Treat “<” as meaning before.
  • Treat “>” as meaning after.

Thus, in this scenario, the query:

# mminfo -q "client=archon,savetime<=2 weeks ago"

Can be interpreted to mean, “give me all backups of the client archon taken two weeks ago or earlier”.

You’re obviously welcome to use whichever interpretation you feel makes more sense – seconds/math or before/after – it doesn’t really matter which. Once you get the hang of this, though, mminfo will make a lot more sense.

* I’ve unfortunately seen someone who got < and > wrong (and didn’t check their results) relabel all tapes in a tape library that had backups younger than 3 months, rather than older than 3 months. Hence, ‘chaos’ is an appropriate term.

 

A fairly common question I get asked is “How can I find out what files were backed up?”

This is actually fairly easy, particularly if you’re prepared to use the command line. You need to run two commands – mminfo, and nsrinfo.

The command mminfo accesses the NetWorker media database, and is used to pull out details of the saveset whose files you want to view. The nsrinfo command is then used to retrieve the relevant information from the client file index.

For example, consider the following situation – there are two incremental backups of the “/etc” directory on the machine “faero”, and we want to know what was backed up in each backup. First, run mminfo to retrieve the nsavetime, which we use in nsrinfo. The mminfo command might resemble the following:

# mminfo -q "name=/etc,volume=Default.001.RO,level=incr" \
    -r "savetime(22),nsavetime"
     date     time      save time
     01/27/09 09:57:52 1233010672
     01/27/09 16:39:04 1233034744

Having retrieved the nsavetime field, we can then feed that into nsrinfo in order to get the list of files for that backup:

# nsrinfo -t 1233034744 faero
scanning client `faero' for savetime 1233034744(Tue Jan 27 16:39:04 2009)
from the backup namespace
/etc/svc/volatile//
/etc/svc/
/etc/mnttab//
/etc/
/
5 objects found

(So the most common invocation format of nsrinfo is: “nsrinfo -t nsavetime clientName”)
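
If you have several savesets to look at, the two commands can be chained. Below is a rough sketch that pulls the nsavetime column out of a report like the one above; the helper name is made up, and it assumes nsavetime is the last column of the report:

```shell
# Hypothetical helper: read an mminfo report whose last column is
# nsavetime and print only the nsavetime values (header lines skipped).
nsavetimes_from_report() {
    awk '$NF ~ /^[0-9]+$/ { print $NF }'
}

# Sample report text copied from the mminfo output above.
nsavetimes_from_report <<'EOF'
     date     time      save time
     01/27/09 09:57:52 1233010672
     01/27/09 16:39:04 1233034744
EOF

# On a NetWorker server the same helper could sit in a pipeline, e.g.:
#   mminfo -q "name=/etc,level=incr" -r "savetime(22),nsavetime" |
#       nsavetimes_from_report |
#       while read -r t; do nsrinfo -t "$t" faero; done
```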

Like most NetWorker commands, nsrinfo will also accept a “-v” option for verbosity. Include this in your nsrinfo command and you get a whole lot more information. For example, a short excerpt from the same nsavetime/saveset used above would resemble the following:

# nsrinfo -v -t 1233034744 faero
scanning client `faero' for savetime 1233034744(Tue Jan 27 16:39:04 2009)
from the backup namespace
UNIX ASDF v2 file `/etc/svc/volatile//', NSR size=160, fid = 0.0, file size=512
UNIX ASDF v2 file `/etc/svc/', NSR size=632, fid = 4294967295.1520, file size=1024
  ndirentry->1433       ..
  ndirentry->0  volatile//
  ndirentry->1945       repository.db
  ndirentry->978        repository-boot
  ndirentry->1002       repository-manifest_import
  ndirentry->4310       repository-manifest_import-20070225_055641
  ndirentry->714        repository-boot-20070907_074755
  ndirentry->1001       repository-manifest_import-20070907_074828
  ndirentry->44611      repository-manifest_import-20070225_093651
  ndirentry->988        repository-boot-20071004_111149
  ndirentry->1014       repository-boot-20080414_023012
  ndirentry->1066       repository-boot-20070920_041017
UNIX ASDF v2 file `/etc/mnttab//', NSR size=156, fid = 0.0, file size=512
UNIX ASDF v2 file `/etc/', NSR size=5040, fid = 4294967295.1433, file size=4608
  ndirentry->2  ..
  ndirentry->1434       TIMEZONE

As you can see, this is a lot more information. It’s not necessarily information you need all the time, but like so many other chunks of information retrievable from NetWorker, it’s useful to know how to retrieve it, and that it’s available should you need it.

If you’re wondering how NetWorker knows which saveset to retrieve based on the nsavetime, it’s simple – for any individual client, no two savesets will ever be generated with the same nsavetime. Check it out for yourself if you’re not sure. For example, from a backup with parallelism of 12 for one client (i.e., higher parallelism than savesets), the savesets were generated as follows:

# mminfo -q "client=faero" -r "name,level,savetime(22),nsavetime" -ot
 name                            lvl     date     time      save time
/opt/ActivePerl-5.8             full     01/27/09 09:49:01 1233010141
/opt/IDATA                      full     01/27/09 09:49:02 1233010142
/space/debug/2                  full     01/27/09 09:49:03 1233010143
/space/debug/1                  full     01/27/09 09:49:04 1233010144
/opt/SUNWrtvc                   full     01/27/09 09:49:05 1233010145
/opt/SUNWmlib                   full     01/27/09 09:49:06 1233010146
/etc                            full     01/27/09 09:50:15 1233010215
index:faero                     full     01/27/09 09:55:29 1233010529
bootstrap                       full     01/27/09 09:55:30 1233010530

So you can see – even with parallelism greater than one, there’s always at least one second difference between the start time for savesets.
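
If you’d rather not eyeball a long report, a quick check for repeated nsavetime values might look like the following sketch. The helper name is hypothetical, and it again assumes nsavetime is the last column:

```shell
# Hypothetical helper: print any nsavetime that occurs more than once in
# an mminfo report whose last column is nsavetime. No output means every
# saveset in the report has a distinct nsavetime.
duplicate_nsavetimes() {
    awk '$NF ~ /^[0-9]+$/ { print $NF }' | sort -n | uniq -d
}

# A few lines copied from the report above; this prints nothing.
duplicate_nsavetimes <<'EOF'
 name                            lvl     date     time      save time
/opt/ActivePerl-5.8             full     01/27/09 09:49:01 1233010141
/opt/IDATA                      full     01/27/09 09:49:02 1233010142
/space/debug/2                  full     01/27/09 09:49:03 1233010143
EOF
```

Run it against the output of a single client’s query, since the uniqueness described here is per client.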

 

Following a recent discussion I’ve been having on the NetWorker Mailing List, I thought I should put a few details down about clone IDs.

If you don’t clone your backups (and if you don’t: why not?), you may not have really encountered clone IDs very much. They’re the shadowy twin of the saveset ID, and serve a fairly important purpose.

From here on in, I’ll use the following nomenclature:

  • SSID = Save Set ID
  • CLID = CLone ID

“SSID” is pretty much the standard NetWorker terminology for saveset ID, but usually clone ID is just written as “clone ID” or “clone-id”, etc., which gets a bit tiresome after a while.

Every saveset in NetWorker is tagged with a unique SSID. However, every copy of a saveset is tagged with the same SSID, but a different CLID.

You can see this when you ask mminfo to show both:

[root@nox ~]# mminfo -q "savetime>=18 hours ago,pool=Staging,client=archon,name=/Volumes/TARDIS" \
    -r volume,ssid,cloneid,nsavetime
 volume        ssid          clone id  save time
Staging-01     3962821973  1228135765 1228135764
Staging-01.RO  3962821973  1228135764 1228135764

(If you must know, being a fan of Doctor Who, all my Time Machine drives are called “TARDIS” – and no, I don’t back up my Time Machine copies with NetWorker, it would be a truly arduous and wasteful thing to do; I use my Time Machine drives for other database dumps from my Macs.)

In this case we’re not only seeing the SSID and CLID, but also a special instance of the SSID/CLID combination – that which is assigned for disk backup units. In the above example, you’ll note that the CLID associated with the read-only (.RO) version of the disk backup unit is exactly one less than the CLID associated with the read-write version of the disk backup unit. This is done by NetWorker for a very specific reason.

So, you might wonder then what the purpose of the CLID is, since we use the SSID to identify an individual saveset, right?

I had hunted for ages for a really good analogy on SSID/CLIDs, and stupidly the most obvious one never occurred to me. One of the NetWorker Mailing List’s most helpful posters, Davina Treiber, posted the (in retrospect) obvious and smartest analogy I’ve seen – comparing savesets to books in a library. To paraphrase, while a library may have multiple copies of the same book (with each copy having the same ISBN – after all, it’s the same book), they will obviously need to keep track of the individual copies of the book to know who has which copy, how many copies they have left, etc. Thus, the library would assign an individual copy number to each instance of the book they have, even if they only have one instance.

This, quite simply, is the purpose of the CLID – to identify individual instances of a single saveset. This means that you can, for example, do any of the following (and more!):

  • Clone a saveset by reading from a particular cited copy.
  • Recover from a saveset by reading from a particular cited copy.
  • Instruct NetWorker to remove from its media database reference to a particular cited copy.

In particular, in the final example, if you know that a particular tape is bad, and you want to delete that tape, you only want NetWorker to delete reference to the saveset instances on that tape – you wouldn’t want to also delete reference to perfectly good copies sitting on other tapes. Thus you would refer to SSID/CLID.

I’ve not been using the terminology SSID/CLID randomly. When working with NetWorker in a situation where you either want to, or must, specify a specific instance of a saveset, you literally use that in the command. E.g.:

# nsrclone -b "Daily Clone" -S 3962821973/1228135764

Would clone the saveset 3962821973 to the “Daily Clone” pool, using the saveset instance (CLID) 1228135764.
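
If you need the SSID/CLID form for more than one instance, it can be assembled from mminfo output. A rough sketch follows; the helper name is hypothetical, and it assumes a report of volume,ssid,cloneid,nsavetime with no spaces in the volume names:

```shell
# Hypothetical helper: turn report lines of "volume ssid cloneid nsavetime"
# into SSID/CLID pairs, skipping the header line.
ssid_clid_pairs() {
    awk '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { print $2 "/" $3 }'
}

# Sample report text copied from the mminfo output earlier in this post.
ssid_clid_pairs <<'EOF'
 volume        ssid          clone id  save time
Staging-01     3962821973  1228135765 1228135764
Staging-01.RO  3962821973  1228135764 1228135764
EOF
```

Each line it prints is in the same SSID/CLID form that the -S argument above uses.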

The same command could be specified as:

# nsrclone -b "Daily Clone" -S 3962821973

However, this would mean that NetWorker would pick which instance of the saveset to read from in order to clone the nominated saveset. The same thing happens when NetWorker is asked to perform a recovery in standard situations (i.e., non-SSID based recoveries).

So, how does NetWorker pick which instance of a saveset should be used to facilitate a recovery? The algorithm used goes a little like this:

  • If there are instances online, then the most available instance is used.
  • If there are multiple instances equally online, then the instance with the lowest CLID is requested.
  • If all instances are offline, then the instance with the lowest CLID not marked as offsite is requested.
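
To make those three rules concrete, here is a toy model of the selection in shell. It is only a sketch of the rules as listed, not NetWorker’s actual code, and the numeric availability ranking is something I’ve made up for the example:

```shell
# Input: one line per saveset instance, "<CLID> <availability> <offsite>".
#   availability: 2 = mounted in a drive, 1 = online (e.g. in a library),
#                 0 = offline  (this ranking is made up for the sketch)
#   offsite:      1 = flagged offsite, 0 = not flagged
# Output: the CLID that the rules above would prefer.
pick_instance() {
    # The offsite flag only counts against copies that are offline.
    awk '{ penalty = ($2 == 0) ? $3 : 0; print $2, penalty, $1 }' |
        sort -k1,1nr -k2,2n -k3,3n |
        awk 'NR == 1 { print $3 }'
}

# Two equally online instances: the lower CLID is chosen.
printf '%s\n' '1228135765 1 0' '1228135764 1 0' | pick_instance
```

The sort does the work: most available first, then (for offline copies only) not-offsite before offsite, then lowest CLID.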

The first point may not immediately make sense. Most available? If you, say, have 2 copies on tape, and one tape is in a library but the other is physically mounted in a tape drive and is not in use, the tape in the drive will be used.

For the second point, consider disk backup units – adv_file type devices. In this case, both the RW and the RO “version” of the saveset (remembering, there’s only one real physical copy on disk, NetWorker just munges some details to make it appear to the media database that there are 2 copies) are equally online – they’re both mounted disk volumes. So, to prevent recoveries automatically running from the RW “version” of the saveset on disk, when the instances are set up, the “version” on the RO portion of the disk backup unit is assigned a CLID one less than the CLID of the “version” on the RW device.

Thus, we get “guaranteed” recovery/reading from the RO version of the disk backup unit. In normal circumstances, that is. (You can still force recovery/reading from the RW version if you so desire.)

In the final point, if all copies are equally offline, NetWorker previously just requested the copy with the lowest CLID. This works well in a tape only environment – i.e.:

  • Backup to tape
  • Clone backup to another tape
  • Send clone offsite
  • Keep ‘original’ onsite

In this scenario, NetWorker would ask for the ‘original’ by virtue of it having the lowest CLID. However, each copy only gets its CLID when that copy is created, so a copy written later by cloning or staging ends up with a higher CLID than one written earlier. Thus, consider the backup to disk scenario:

  • Backup to disk
  • Clone from disk to tape
  • Send clone offsite
  • Later, when disk becomes full or savesets are too old, stage from disk to tape
  • Keep new “originals” on-site.

This created a problem – in this scenario, if you went to do a recovery after staging, then NetWorker would (annoyingly for many!) request the clone version of the saveset. This meant either requesting that it be pulled back from the offsite location, doing an SSID/CLID recovery, marking the clone SSID/CLID as suspect, or mounting the “original”. However you looked at it, it was a lot of work that you really shouldn’t have needed to do.
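
The problem is easy to see with made-up numbers. The sketch below assumes, as in the scenario above, that copies written later get higher CLIDs; the CLIDs and labels are invented purely for illustration:

```shell
# Made-up CLIDs, in the order the copies were written:
#   100 = backup on disk, 200 = clone to tape (sent offsite),
#   300 = staged copy on tape (kept onsite)
# Staging removes the disk copy, so only 200 and 300 remain.
remaining='200 offsite-clone
300 onsite-staged'

# "Lowest CLID wins" among equally offline copies.
lowest_clid_copy() {
    sort -k1,1n | awk 'NR == 1 { print $2 }'
}

printf '%s\n' "$remaining" | lowest_clid_copy   # prints offsite-clone
```

With nothing but the CLID to go on, the offsite clone is the one that gets requested.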

NetWorker 7.3.x, however, introduced the notion of an offsite flag. This isn’t the same as setting the volume location to offsite; it’s literally a new flag:

# nsrmm -o offsite 800841

Would mark the volume 800841 in the media database as not being onsite – i.e., having a less desirable availability for recovery/read operations.

The net result is that in this situation, even if the offsite clone has a lower CLID, if it is flagged as offsite, but there’s a clone with a higher CLID not flagged as offsite, NetWorker will bypass that normal “use the lowest CLID” preference to instead request the onsite copy.

It would certainly be preferable however if a future version of NetWorker could have read priority established as a flag for pools; that way, rather than having to bugger around with the offsite flag (which, incidentally, can only be set/cleared from the command line, and can’t be queried!), an administrator could nominate “This pool has highest recovery priority, whereas this pool has lower recovery priority”. That way, NetWorker would pick the lowest CLID in the highest recovery priority pool.

(I wait, and hope.)

© 2012 The NetWorker Blog