Jan 30, 2013

NetWorker Bootstrap

One question you’ll periodically see raised in NetWorker forums is … “why am I getting a bootstrap backup with every backup?”

The answer is actually a fairly simple one: if the backup server is in an auto-starting group, then every time that auto-starting group is run, you’ll get a bootstrap backup.

However, if the backup server isn’t in any group that automatically starts, the server will act to protect the media database and configuration files by doing a bootstrap backup with every backup. That way, between the bootstraps and indices backed up with every backup, there’s a good chance that the backup server will be sufficiently recoverable.

In more recent versions of NetWorker, you’ll even see an alert to tell you when this has happened:

[root@tara ~]# savegrp -c test01 AFTD
3:savegrp: Added 'tara.pmdg.lab' to the group 'AFTD' for bootstrap backup

While some people erroneously think otherwise, electing not to back up the backup server itself is a completely unsound, unsafe and unrecommended data protection strategy. Always make sure your backup server gets backed up daily, with full backups at a suitably regular interval, and make sure those backups are cloned, and tested, as well.
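If you want to confirm for yourself that bootstraps are being generated – and capture their saveset IDs and volume locations for your disaster recovery documentation – the simplest check I know of is mminfo’s bootstrap report. The following is just a sketch; the output will obviously be specific to your own server:

# mminfo -B
<output removed – lists the date, time, ssid, file/record numbers and volume for each recent bootstrap>

Keep a copy of that report somewhere other than the backup server itself; it’s exactly the information you’ll want in front of you if you ever have to run a bootstrap recovery.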

Jan 14, 2013

Index

A question you’ll see fielded regularly is whether or not it’s “best practice” to have dedicated index pools.

To me, there are only a couple of pros to having index backups in their own pools, and a lot of potential cons.

Pros

  1. In the event of needing to perform a total index restore, all index data will be on the same set of media.
  2. In more complex setups, it allows indices to be directed to specific devices, or storage nodes.

Cons

  1. In physical and virtual tape environments, it generates an artificial requirement for additional drives and media. Particularly in the realm of physical tape devices and media, this is odious (and costly); even in the realm of virtualised tape drives and media, it increases the likelihood of running up against storage node/device count limitations. That cost for physical media, of course, isn’t just limited to additional tape drives and media, but extends to all forms of handling – e.g., offsiting costs, operator costs, etc.
  2. Further, with physical tape, your available capacity decreases as you add pools, since any media which has been labelled into a particular pool can no longer be used for backups for another pool, until the tape has been recycled/relabelled. Thus, once say, an LTO-5 tape has been labelled for writing with the index pool, that’s 1.5TB of backup capacity that has been immediately sequestered from core data backups. (This typically isn’t an issue for VTLs where storage is thin-provisioned.)
  3. One final consideration for physical tape – having a dedicated index pool often means having larger slot counts to accommodate permanently having index media in the library. This isn’t required when indices are written to the active data pools, since you just factor for the standard media requirements.
  4. When working with full disk backup (advanced file type devices, Data Domain Boost, etc.), it increases the need for cloning and/or staging activities. (And again, the number of devices required increase.)
  5. More devices means more nsrmmd processes, which in turn means more memory requirements for the backup server and any affected storage node. Of course, this will be a relatively small impact, but it still must be considered as a ‘con’.
  6. While NetWorker has, over successive versions, become better at handling changes to pool configurations while backup activities are happening, it isn’t perfect. Artificially increasing the number of pools you’re running increases the risk you’ll want to make changes to pools while backups are being performed.

To me, all those “cons” add up to one conclusion – except under exceptional circumstances, it’s generally best practice to avoid having indices backed up to their own pool. Like all “rules”, there’ll be exceptions, but I’ll wager there are far fewer necessary exceptions to this than there are instances of it being done.
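If you’re reviewing an existing configuration and want to confirm where index backups are currently landing, a quick and admittedly blunt check is to interrogate the media database for the index savesets belonging to the backup server. The following is just a sketch – substitute your own server name for tara.pmdg.lab:

# mminfo -q "client=tara.pmdg.lab" -r "name,pool,volume" | grep "^index:"
<output removed – one line per index saveset, showing the pool and volume it was written to>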

Quibbles – Why can’t I rename clients in the GUI?

Nov 16, 2009

For what it’s worth, I believe that the continuing lack of support for renaming clients as a function within NMC (as opposed to the current, highly manual process) represents an annoying and non-trivial gap in functionality that causes administrators headaches and undue work.

For me, this was highlighted most recently when a customer of mine needed to shift their primary domain, and all clients had been created using the fully qualified domain name. All 500 clients. Not 5, not 50, but 500.

The current mechanisms for renaming clients may be “OK” if you only rename one client a year, but more and more often I’m seeing sites renaming up to 5 clients a year as a regular course of action. If most of my customers are doing it, surely they’re not unique.

Renaming clients in NetWorker is a pain. And I don’t mean a “oops I just trod on a pin” style pain, but a “oh no, I just impaled my foot on a 6 inch rusty nail” style pain. It typically involves:

  • Taking care to note the client ID
  • Recording the client configuration for all instances of the client
  • Deleting all instances of the client
  • Renaming the index directory
  • Recreating all instances of the client, being sure on first instance creation to include the original client ID

(If the client is explicitly named in pool resources, they have to be updated as well, first clearing the client from those pools and then re-adding the newly “renamed” client.)
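To give a feel for just how manual this is, here’s a rough sketch of the sort of session involved on a Unix backup server. The hostnames are made up, the nsradmin output has been elided, and it’s intended to be illustrative rather than a procedure to follow verbatim:

# nsradmin
nsradmin> print type: NSR client; name: oldclient.pmdg.lab
<output removed – note the client id and every other attribute for each instance>
nsradmin> quit
# NOTE: Delete all instances of the client (via nsradmin or NMC), then:
# cd /nsr/index
# mv oldclient.pmdg.lab newclient.pmdg.lab
# NOTE: Recreate the client instances, supplying the recorded client id on the
#       first instance created, then re-add the "new" client to any pools that
#       named it explicitly.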

This is not fun stuff. Further, the chance for human error in the above list is substantial, and when we’re playing with indices, human error can result in situations where it becomes very problematic to either facilitate restores or ensure that backup dependencies have appropriate continuity.

Now, I know that facilitating a client rename from within a GUI isn’t easy, particularly since the NMC server may not be on the same host as the NetWorker server. There’s a bunch of potential pool changes, client resource changes and filesystem changes involved, plus the need to put in appropriate rollback code so that if the system aborts half-way through it can at least revert to the old client name.

As I’ve argued in the past though, just because something isn’t easy doesn’t mean it shouldn’t be done.

Jul 8, 2009

One of the most common mistakes made by people new to NetWorker when setting up directives is using the skip directive when what they actually need to use is the null directive.

Both directives can be used to prevent the backup of nominated content on a client, but using skip in situations where null should be used can result in very dicey recovery situations.

If you’re wanting to (no pun intended) “skip the working” here, this is the rule you should typically follow:

  • If wanting to exclude directories, use null.
  • If wanting to exclude files, use skip.

However, it’s not as cut and dried as the above suggests, so I recommend you keep reading.

The difference between null and skip is simple yet important, and it strongly affects how recoveries work, because it determines how NetWorker updates indices. One of the best ways I have of describing this is that:

  • The skip directive acts as an opaque shutter on the indices;
  • The null directive acts as a window on the indices.

This means that if you use skip against different directories in the same tree for different backups, each recovery you run afterwards will only show what wasn’t skipped. If you use null however, then each recovery will show both what has been null’d, and what wasn’t.

The best way to see how this works is by example, so I’ve prepared in my server’s home directory two subdirectories:

  • /home/preston/test_null
  • /home/preston/test_skip

In each directory, I’ve created and populated 2 subdirectories, “01” and “02”. So the full structure is:

  • /home/preston/test_null/01
  • /home/preston/test_null/02
  • /home/preston/test_skip/01
  • /home/preston/test_skip/02

In each case, “.nsr” client-side directives were set up in the ‘parent’ directories – /home/preston/test_null and /home/preston/test_skip.

For the first backup, the directives were to “null” or “skip” the 02 directories and allow the 01 directories to be backed up. For the second backup, the directives were to “null” or “skip” the 01 directories, allowing the 02 directories to be backed up.

Here’s what it looked like for the test_null directory:

# cd /home/preston/test_null
# ls -l
total 8
drwxr-xr-x 2 root root 4096 Jul  6 18:23 01
drwxr-xr-x 2 root root 4096 Jul  6 18:23 02

# cat .nsr
<< . >>
null: 02

# save -b Default -e "+2 weeks" .
<output removed>
# recover -s nox
Current working directory is /home/preston/test_null/
recover> ls
 01     .nsr

# NOTE: Reverse contents of .nsr
# cat .nsr
<< . >>
null: 01

# save -b Default -e "+2 weeks" .
<output removed>
# recover -s nox
Current working directory is /home/preston/test_null/
recover> ls
 01     02     .nsr

As you can see above, even after the second backup, where the ’01’ directory was most recently excluded, both directories are shown in the recovery browser.

Here’s what it looked like for the test_skip directory:

# cd /home/preston/test_skip
# ls -l
total 8
drwxr-xr-x 2 root root 4096 Jul  6 18:28 01
drwxr-xr-x 2 root root 4096 Jul  6 18:28 02

# cat .nsr
<< . >>
skip: 02

# save -b Default -e "+2 weeks" .
<output removed>
# recover -s nox
Current working directory is /home/preston/test_skip/
recover> ls
 01     .nsr

# NOTE: Reverse contents of .nsr
# cat .nsr
<< . >>
skip: 01

# save -b Default -e "+2 weeks" .
<output removed>
# recover -s nox
Current working directory is /home/preston/test_skip/
recover> ls
 02     .nsr

As you can see, that’s a substantial difference in what is shown for recovery purposes. Indeed, to be able to recover the “01” part of the test_skip directory, we need to explicitly target a time between when the first backup was run, and when the second backup was run:

# mminfo -q "name=/home/preston/test_skip" -r "name,savetime(23)"
 name                               date     time
/home/preston/test_skip         07/06/2009 06:29:53 PM
/home/preston/test_skip         07/06/2009 06:31:17 PM

# recover -s nox
Current working directory is /home/preston/test_skip/
recover> ls
 02     .nsr
recover> changetime 18:30
6497:recover: time changed to Mon 06 Jul 2009 06:30:00 PM EST
recover> ls
 01     .nsr

As you can see, there’s substantial difference between skip and null. For this reason, please ensure you pick the right mechanism for excluding content from backup!

Does NetWorker scan the indices for changed files when it does a backup?

May 21, 2009

This is a fairly common question to see asked – does NetWorker, when a non-full backup is run, scan the existing client indices to determine what files have changed from previous backups?

The short answer is: no.

The more in-depth answer is that NetWorker will use one of a few different mechanisms for determining what files should be backed up in a non-full backup scenario, and none of those mechanisms involve scanning the client indices. These mechanisms are:

  • Check for files that have changed since a certain date. Whenever a non-full backup is run, the NetWorker server includes in the backup command the last savetime. Thus, all changed files can be quickly calculated from this.
  • Check for changes according to the change journal (Windows only).
  • Check for changes based on the archive bit (Windows only).

Personally, I really dislike the use of the archive bit. Too many programmers on Windows take liberty with this odious little setting, and it’s become so bastardised and unreliable that my very firm recommendation is you follow the instructions in the NetWorker administration guide to turn off use of the archive bit in incremental backups. (Hint: search for NSR_AVOID_ARCHIVE*).

So, there are three ways that NetWorker can be expected to use to determine what files should be backed up in a non-full backup – and none of those mechanisms involve an index scan.
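If you want to see the date-based mechanism for yourself, you can approximate it by handing save a -t option directly. The example below is only a sketch – the path and date are made up, and I’m not claiming it’s the exact command savegrp constructs – but it demonstrates the principle that only files modified after the nominated time get saved:

# save -b Default -t "06/18/2009" /home/preston
<output removed – only files modified since the nominated date are saved>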


* [Updated 2009-06-18]

Expanding on this more fully – on the backup server itself, establish an environment variable called NSR_AVOID_ARCHIVE and set it to any value other than “No”. I prefer to set it to “YES” or 1 so it’s entirely clear what the desired result is.

On Unix, this can be set in /etc/profile or in the NetWorker startup script; however, the problem with setting it in the NetWorker startup script is that you have to remember to re-create the setting every time you upgrade NetWorker, since the startup script is fully replaced each time.

In Windows, set it as a system environment variable under the properties for the system itself. These variables are established before programs are started, meaning that NetWorker will be aware of them when it starts.
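By way of a hedged example of what that looks like in practice – on a Unix backup server you might append something like the following to /etc/profile, while on Windows you could create the system variable from an administrative command prompt with setx (after which the NetWorker services should be restarted so they pick it up):

# NOTE: Unix – append to /etc/profile (or equivalent)
NSR_AVOID_ARCHIVE=Yes
export NSR_AVOID_ARCHIVE

# NOTE: Windows – run from an administrative command prompt
C:\> setx NSR_AVOID_ARCHIVE Yes /M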

Feb 26, 2009

Is your backup server a modern, state of the art machine with high speed disk, significant IO throughput capabilities and ample RAM so as to not be a bottleneck in your environment?

If not, why?

Given the nature of what it does – support systems via backup and recovery – your backup server is, by extension, “part of” your most critical production server(s). I’m not saying that your backup server should be more powerful than any of your production servers, but what I do want to say is that your backup server shouldn’t be a restricting agent in relation to the performance requirements of those production servers.

Let me give you an example – the NetWorker index region. Using Unix for convenience, we’re talking about /nsr/index. This region should either be on equally high speed drives as your fastest production system drives, or on something that is still suitably fast.

For instance, in much smaller companies, I’ve often seen the production servers with SCSI drives or SCSI JBODs, while the backup server is just a machine with a couple of mirrored SATA drives.

In larger companies, you’ll have the backup server connected to the SAN with the rest of the production systems, but while the production systems will get access to 15,000 RPM SCSI drives, the backup server will instead get 7,200 RPM SATA drives (or worse, previously, 5,400 RPM ATA drives).

This is a flawed design process for one very important reason – for every file you back up, you need to generate and maintain index data. That is, NetWorker server disk IO occurs in conjunction with backups*.

More importantly, when it comes time to do a recovery, and indices must be accessed, do you want to pull index records for say, 20,000,000 files from slow disk drives or fast disk drives?

(Now, as we move towards flash drives for critical performance systems, I’m not going to suggest that if you’re using flash storage for key systems you should also use it for backup systems. There is always a price point at which you have to start scaling back what you want vs what you need. However, in those instances I’d suggest that if you can afford flash drives for critical production systems, you can afford 15,000 RPM SCSI drives for the backup servers’ /nsr/index region.)

Where cost for higher speed drives becomes an issue, another option is to scale back the speed of the individual drives but use more spindles, even if the actual space used on each drive is less than the capacity of the drive**.

In that case for instance, you might have 15,000 RPM drives for your primary production servers, but the backup servers’ /nsr/index region might reside on 7,200 RPM SATA drives successfully, so long as they’re arrayed (no pun intended) in such a way that there are sufficient spindles to make reading back data fast. Equally then, in such a situation, hardware RAID (or software RAID on systems with sufficient CPUs and cores to equal or exceed hardware RAID performance) will allow for faster processing of data for writing (e.g., RAID-5 or RAID-3).
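If you’re unsure whether your current index region is up to the job, it’s worth simply watching it during a busy backup window. On a Linux backup server, for example, something as basic as the sketch below – run while a large group is active – will show whether the disks hosting /nsr/index are saturating:

# iostat -x 5
<output removed – watch the utilisation and wait/service time figures for the device(s) holding /nsr/index>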

In the end, your backup server should be like a butler (or a personal assistant, if you prefer the term) – always there, always ready and able to assist with whatever it is you want done, but never, ever an impediment.


* I see this as a similar design flaw to say, using 7,200 RPM drives as a copy-on-write snapshot area for 15,000 RPM drives.
** Ah, back in the ‘old’ days, where a database might be spread across 40 x 2GB drives, using only 100 MB from each drive!

Jan 27, 2009

A fairly common question I get asked is “How can I find out what files were backed up?”

This is actually fairly easy, particularly if you’re prepared to use the command line. You need to run two commands – mminfo, and nsrinfo.

The command mminfo accesses the NetWorker media database, and is used to pull out details of the saveset whose files you want to view. The nsrinfo command is then used to retrieve the relevant information from the client file index.

For example, consider the following situation – there are two incremental backups of the “/etc” directory on the machine “faero”, and we want to know what was backed up in each backup. First, run mminfo to retrieve the nsavetime, which we then use in nsrinfo. The mminfo command might resemble the following:

# mminfo -q "name=/etc,volume=Default.001.RO,level=incr" -r "savetime(22),nsavetime"
     date     time      save time
     01/27/09 09:57:52 1233010672
     01/27/09 16:39:04 1233034744

Having retrieved the nsavetime field, we can then feed that into nsrinfo in order to get the list of files for that backup:

# nsrinfo -t 1233034744 faero
scanning client `faero' for savetime 1233034744(Tue Jan 27 16:39:04 2009)
from the backup namespace
/etc/svc/volatile//
/etc/svc/
/etc/mnttab//
/etc/
/
5 objects found

(So the most common invocation format of nsrinfo is: “nsrinfo -t nsavetime clientName”)

Like most NetWorker commands, nsrinfo will also accept a “-v” option for verbosity. Include this in your nsrinfo command and you get a whole lot more information. For example, a short excerpt from the same nsavetime/saveset used above would resemble the following:

# nsrinfo -v -t 1233034744 faero
scanning client `faero' for savetime 1233034744(Tue Jan 27 16:39:04 2009)
from the backup namespace
UNIX ASDF v2 file `/etc/svc/volatile//', NSR size=160, fid = 0.0, file size=512
UNIX ASDF v2 file `/etc/svc/', NSR size=632, fid = 4294967295.1520, file size=1024
  ndirentry->1433       ..
  ndirentry->0  volatile//
  ndirentry->1945       repository.db
  ndirentry->978        repository-boot
  ndirentry->1002       repository-manifest_import
  ndirentry->4310       repository-manifest_import-20070225_055641
  ndirentry->714        repository-boot-20070907_074755
  ndirentry->1001       repository-manifest_import-20070907_074828
  ndirentry->44611      repository-manifest_import-20070225_093651
  ndirentry->988        repository-boot-20071004_111149
  ndirentry->1014       repository-boot-20080414_023012
  ndirentry->1066       repository-boot-20070920_041017
UNIX ASDF v2 file `/etc/mnttab//', NSR size=156, fid = 0.0, file size=512
UNIX ASDF v2 file `/etc/', NSR size=5040, fid = 4294967295.1433, file size=4608
  ndirentry->2  ..
  ndirentry->1434       TIMEZONE

As you can see, this is a lot more information. It’s not necessarily information you need all the time, but like so many other chunks of information retrievable from NetWorker, it’s useful to know how to retrieve it, and that it’s available should you need it.

If you’re wondering how NetWorker knows which saveset to retrieve based on the nsavetime, it’s simple – for any individual client, no two savesets will ever be generated with the same nsavetime. Check it out for yourself if you’re not sure. For example, from a backup with a parallelism of 12 for one client (i.e., higher parallelism than the number of savesets), the savesets were generated as follows:

# mminfo -q "client=faero" -r "name,level,savetime(22),nsavetime" -ot
 name                            lvl     date     time      save time
/opt/ActivePerl-5.8             full     01/27/09 09:49:01 1233010141
/opt/IDATA                      full     01/27/09 09:49:02 1233010142
/space/debug/2                  full     01/27/09 09:49:03 1233010143
/space/debug/1                  full     01/27/09 09:49:04 1233010144
/opt/SUNWrtvc                   full     01/27/09 09:49:05 1233010145
/opt/SUNWmlib                   full     01/27/09 09:49:06 1233010146
/etc                            full     01/27/09 09:50:15 1233010215
index:faero                     full     01/27/09 09:55:29 1233010529
bootstrap                       full     01/27/09 09:55:30 1233010530

So you can see – even with parallelism greater than one, there’s always at least one second’s difference between the start times of savesets.
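If you want to satisfy yourself of this on your own server, a quick (if rough) check is to look for duplicate nsavetimes for a given client – the following sketch, again using faero, should come back empty:

# mminfo -q "client=faero" -r "nsavetime" | sort | uniq -d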