Backup to disk has well and truly become entrenched as a core backup strategy in most companies. By “backup to disk” I’m referring to either of ADV_FILE devices or VTLs – i.e., the general notion of backing up first to disk. For the rest of the article, since I’m feeling a little lazy today, I’ll follow industry norm and call backup to disk by the generic “B2D”.

Now, in most companies, there’ll still be physical tape involved. Long-term backups held on sufficiently replicated storage – even with deduplication – are going to remain costly for some time to come; but once B2D appears within an organisation, one of two architecture decisions will typically occur:

  1. B2D region designed to hold a “significant” nearline capacity, where “significant” refers to a business-appropriate amount of recent backups.
  2. B2D region designed as a “staging” region to have just enough capacity, where “just enough” means that if data isn’t staged daily (or near-daily), staging areas will become full and backups will stop.

Having observed B2D regions designed as staging-only on several occasions now, I’m even more firmly convinced that B2D as staging is a false economy that fails to take into consideration a few key metrics. Sure, buying say, 5TB or 10TB of disk is cheaper than buying 40TB with deduplication, but the cost of storage doesn’t end with the purchase. In fact, since the actual dollar cost of storage is typically amortised out over its expected deployment time, that cost often ends up being pretty minimal.

There are three distinct costs that I see as evident when using B2D purely as a staging region. These are:

  • Staff time.
  • Physical wear and tear.
  • Increased risk of recovery failure.

Before I go further, I want to cover a term I used in the title of this post: “busy state staging”. It refers to environments where a significant portion of each day is spent with the B2D region being used to stage out from disk to physical tape, so as to free up room. There are probably four key activities a backup system can be doing at any one time. These are:

  • Backup
  • Recovery
  • Duplication/Cloning
  • Maintenance

Backup, recovery and cloning are all givens; maintenance functions encompass media import/export/labelling and configuration activities, and most definitely include staging. That’s right – staging is not any of backup, recovery or cloning; it falls into the category of moving data around in order to keep the system running. It’s effectively an overhead function for the environment, and as we know, the aim in any environment is to keep overheads to a minimum.

Over the expected deployment period of the B2D region in a backup system, I’d argue that those three costs previously cited add up to enough to demonstrate that the vast majority of businesses should not deploy B2D in a staging-only configuration. Let’s consider each of them individually.

Staff Time

This is the easiest to factor in. Let’s say your backup administrator has to spend roughly an hour a day between monitoring and maintaining free capacity on a staging-only B2D region. Now add up those hours per day, per week, per year across the lifetime of a deployment, and see how much it represents based on the hourly rate of the backup administrator. Assume $40 per hour, 4 weeks annual leave a year. So that leaves 48 weeks, 5 hours per week at $40 an hour. That’s $9,600 per year of staff costs through managing a poorly provisioned B2D region.

Usually that’s not the final cost though in staff time – my personal experience is that there’s a higher tendency in environments that use B2D for staging to need to engage temporary contractors, etc., to help fill in on projects where systems administration staff don’t have available time to do other projects in the company. So let’s assume that as a result of the backup administrator having to focus on B2D staging an hour a day the organisation has to engage a contractor one week a year to make up the short-fall. Assuming a contracting rate of $80 per hour, that’s $3,200 per year.

Now, assuming B2D storage has been provisioned over a 3 year period, we’re adding $38,400 to the maintenance impact of a staging-only region.

My gut feel, by the way, is that in an appropriately provisioned B2D architecture, the backup administrator will spend at most one fifth of the time in B2D storage administration; and there won’t be a need to engage contractors for that reason. So that $38,400 cost would shrink to say, $5,760 of time. In anyone’s books, that’s a good percentage saving.
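For anyone who wants to plug in their own rates, here is a minimal sketch of the arithmetic above. The rates, leave and deployment period are the ones quoted in the text; the 40-hour contractor week is my assumption (it is what makes one week at $80 an hour come to $3,200), and the function names are purely illustrative.

```python
# Staff-time cost of a staging-only B2D region, using the figures quoted above.
ADMIN_RATE = 40           # dollars per hour
CONTRACTOR_RATE = 80      # dollars per hour
WORKING_WEEKS = 52 - 4    # four weeks of annual leave
HOURS_PER_WEEK = 5        # roughly one hour a day, five days a week
CONTRACTOR_HOURS = 40     # assumption: one contractor week is 40 hours
YEARS = 3                 # provisioning period for the B2D storage


def staging_only_cost():
    """Admin time plus contractor backfill over the deployment period."""
    admin_per_year = WORKING_WEEKS * HOURS_PER_WEEK * ADMIN_RATE
    contractor_per_year = CONTRACTOR_HOURS * CONTRACTOR_RATE
    return (admin_per_year + contractor_per_year) * YEARS


def well_provisioned_cost():
    """One fifth of the admin time, and no contractor backfill."""
    admin_per_year = WORKING_WEEKS * HOURS_PER_WEEK * ADMIN_RATE
    return admin_per_year * YEARS // 5


print(staging_only_cost())      # 38400
print(well_provisioned_cost())  # 5760
```

On those numbers the saving is $32,640 over three years, or 85% of the staging-only staff cost.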

Physical Wear and Tear

We’d count ourselves lucky if the only impact of using B2D in a staging configuration were staff costs. There’s more though. The wear and tear on both physical media and physical tape drives will be significantly increased, as these units will be running more frequently. Not only that, rather than having a reduced priority, the service time on physical tape is almost as critical as in a tape-only environment. The net consequence is that rather than being able to, say, work with a next-day service contract for the physical tape libraries, organisations are forced to stick with a 4-hour same-day response contract. As we know, there’s usually a pretty significant price difference between these types of contracts!

Increased Risk of Recovery Failure

We’d equally count ourselves lucky if the only impacts of using B2D in a staging-only configuration were just staff time and increased maintenance costs. The real insidious cost though is the risk of a recovery failure. In this, I’m not referring to any limitations that may exist around simultaneously recovering from B2D media while staging/cloning from it. What I’m referring to is the risk that a backup may not actually run in the first place because a staging region becomes full, blocking new sessions from starting. When considered from a backup perspective, that may not sound like a lot. Turning it around to the purpose of a backup: imagine the consequence of that never-backed-up data being needed for a recovery. It may be logical to say “if it can’t be backed up, then we can’t factor it into recovery requirements”, but disasters, emergencies and auditors do not come when it’s convenient for us.

With this in mind, any backup that fails to run because a staging area is full should be considered from the full impact of a recovery SLA being breached for that data. That may sound harsh, but I’d actually suggest it’s a more business-focused rather than IT-focused approach to backup.

How’s that busy-state staging sounding now?

Enterprise data protection is one of those areas where businesses are most tempted to do cost cutting. We see it with Icarus support contracts, with inappropriate coupling of services, and we see it with B2D staging areas. We can intuit with almost no effort that busy state staging isn’t the best backup model. If your system is busy 20 hours a day between backup, cloning and maintenance functions, then it’s obvious that there’s at least an increased risk of parts failure; but the cost of the architecture is also magnified by wasted staff time, increased maintenance contract costs, and the potential failure to facilitate business-required recoveries.

When we take all those things into consideration, architecting B2D for significant or at least appropriate nearline recovery purposes rather than just staging becomes the cheaper option.

 

The scenario:

  • A clone or stage operation has aborted (or otherwise failed)
  • It has been restarted
  • It hangs waiting for a new volume even though there’s a partially written volume available.

This is a relatively easy problem to explain. Let’s first look at the log messages that appear. To generate this error, I started cloning some data to the “Default Clone” pool, with only one volume in the pool, then aborted. Shortly thereafter I tried to run the clone again, and when NetWorker wouldn’t write to the volume I unmounted and remounted it – a common thing that newer administrators will try in this scenario. This is where you’ll hit the following error in the logs:

media notice: Volume `800829L4' ineligible for this operation; Need a different volume
from pool `Default Clone'
media info: Suggest manually labeling a new writable volume for pool 'Default Clone'

So, what’s the cause of this problem? It’s actually relatively easy to explain.

A core component in NetWorker’s media database design is that a saveset can only ever have one instance on a piece of media. This applies equally to failed and complete saveset instances.

The net result is that this error/situation will occur because it’s meant to – NetWorker doesn’t permit more than one instance of a saveset to appear on the same piece of physical media.
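To make the rule concrete, here is a toy model of the eligibility check. It is only an illustration of the behaviour described above, not how NetWorker implements it; the function name and the saveset IDs are hypothetical, and only the volume name comes from the log.

```python
def eligible_volumes(pool, ssid):
    """Return the volumes in `pool` that could accept a new instance of `ssid`.

    `pool` maps a volume name to the set of saveset IDs that already have
    an instance (complete or aborted) on that volume.
    """
    return [vol for vol, ssids in pool.items() if ssid not in ssids]


# Hypothetical saveset IDs; the single volume is the one named in the log.
default_clone = {"800829L4": {"1111111111"}}  # holds the aborted clone instance

# Retrying the same saveset finds nothing to write to: "Need a different volume".
print(eligible_volumes(default_clone, "1111111111"))  # []

# A different saveset can still use the partially written volume.
print(eligible_volumes(default_clone, "2222222222"))  # ['800829L4']
```

This is why unmounting and remounting the volume changes nothing: the volume itself is fine, it is simply ineligible for that particular saveset.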

So what do you do when this error comes up?

  • If you’re backing up to disk, an aborted saveset should normally be cleared up automatically by NetWorker after the operation is aborted. However, in certain instances this may not be the case. For NetWorker 7.5 vanilla and 7.5.1.1/7.5.1.2, this should be done by expiring the saveset instance – using nsrmm to flag the instance as having an expiry date within a few minutes or seconds. For all other versions of NetWorker, you should just be able to delete the saveset instance.
  • When working with tape (virtual or physical), the most recommended approach would be to move on to another tape, or if the instance is the only instance on that tape, relabel the tape. (Some would argue that you can use nsrmm to delete the saveset instance from the tape and then re-attempt the operation, but since NetWorker is so heavily designed to prevent multiple instances of a saveset on a piece of media, I’d strongly recommend against this.)

Overall it’s a fairly simple issue, but knowing how to recognise it lets you resolve it quickly and painlessly.

 

Looking at the stats both for this new site and the previous site, I’ve compiled a list of the top 10 read articles on The NetWorker Blog for 2009. The top 3 of course match the three articles that routinely turn out to be the most popular in any given month, which says something about their relevance to the average NetWorker administrator.

(Note: I’ve excluded non-article pages from the top 10.)

Number 10 – Instantiating Savesets

The very first article on the blog, Instantiating Savesets detailed the importance of distinguishing between all instances of a saveset and a specific instance of a saveset.

This distinction between using just the saveset ID, and using a saveset ID/clone ID combination becomes particularly important when staging from disk backup units. If clones exist and you stage using just the saveset ID, when NetWorker cleans up at the end of the staging operation it will remove reference to the clones as well as deleting the original from the disk backup unit. (Something you really don’t want to have happen.)

Recommendation to EMC: Perhaps it would be worthwhile requiring a “-y” argument to nsrstage if staging savesets from disk backup units and specifying only the saveset ID.

Recommendation to NetWorker administrators: Always be careful when staging that you specify both the saveset and the clone ID.

Number 9 – Basics – Important mminfo fields

In May I wrote about a few key mminfo fields – notably:

  • savetime
  • sscreate
  • ssinsert
  • sscomp
  • ssaccess

Sadly, I didn’t get the result I wanted with EMC on ssaccess. Documented as being updated whenever a saveset fragment is accessed for backup and recovery, the most I could get was an acknowledgement that it was currently broken and to lodge an RFE to get it fixed. (The alternative was to have the documentation changed to take out reference to read operations – something I didn’t want to have happen!)

Recommendation to EMC: ssaccess would be a particularly useful mminfo field, particularly when analysing recovery statistics for NetWorker. Please fix it.

Number 8 – Basics – Listing files in a backup

Want to know what files were backed up as part of the creation of a saveset? If you do, you’re not unique – this has remained a very popular article since it was written in January.

Recommendation to EMC: This information can be retrieved via a combination of mminfo/nsrinfo, but it would be handy if NMC supported drilling down into a saveset to provide a file listing.

Number 7 – Using yum to install NetWorker on Linux

NetWorker’s need for dependency resolution on Linux for installation of the client packages in particular drew a lot of people to this article.

Number 6 – Basics – mminfo, savetime, and greater than/less than

This article explained why NetWorker uses the greater than and less than signs in mminfo in a way that newcomers to the product might find backwards. If you’re not aware of why mminfo works the way it does for specifying savetimes, you should be.

Number 5 – 7.5(.1) changed behaviour – deleting savesets from adv_file devices

This was a particularly unpleasant bug introduced into NetWorker 7.5, thankfully resolved now in the cumulative service releases and NetWorker 7.6.

The gist of it is that in NetWorker 7.5/7.5.1 (aka 7.5 SP1), if you deleted a saveset on a disk backup unit, NetWorker would suffer a serious failure where it would from that point have issues cleaning regular expired savesets from the disk backup unit and insist that the disk backup unit had major issues. The primary error would manifest as:

nsrd adv_file warning: Failed to fetch the saveset(ss_t) structure for ssid 1890993582

This was fixed in 7.5.1.2, thankfully.

Recommendation to EMC: Never let this bug see the light of day again, please. (So far you’re doing an excellent job, by the way.)

Number 4 – NetWorker 7.5.1 Released

I’ve recently noticed a disturbing trend among many vendors, EMC included, where once a new release is made of a product, sales and account staff become overly enthusiastic about recommending new releases. This comes on top of not really having any technical expertise. (Please be patient, I’m trying to put this as diplomatically as possible.)

One of the worst instances I’ve seen of this in the last few years was the near-hysterical pumping of 7.5 thanks to some useful features to do with virtualisation in particular. I’ll admit that my articles on the integration between Oracle Module 5 and NetWorker 7.5, as well as Probe Based Backups may have added to this. However, there was somewhat of a stampede to 7.5 when it came out, and consequently, when it had some issues, there was strong enthusiasm for the release of 7.5.1.

This is why, by the way, IDATA maintains for its support customers a recommended versions list that is not automatically updated when new versions of products come out.

Recommendation to EMC: Remind your sales staff that existing users already have the product, and not to just go blindly convincing them to upgrade. Otherwise you’ll eventually start sounding like this.

Number 3 – Carry a jukebox with you (if you’re using Linux)

During 2009, Mark Harvey’s LinuxVTL project first got the open source LinuxVTL working with NetWorker in a single drive configuration, then eventually, in multi-drive configurations. (Mark assures me, by the way, that patches are coming real soon to allow multiple robots on the same storage node/server.)

Lesson for me: With the LinuxVTL configured on multiple lab servers in my environment, I’ve really taken to VTLs this year, and considerably changed my attitude on using them. (I’ll say again: I still resent that they’re needed, but I now respect them a lot more than I previously did.)

Lesson for others: Even Mark himself says that the open source VTL shouldn’t be used for production backups. Don’t be cheap with your backup system: this is an excellent tool for lab setups, training, diagnostics, etc., but it is not a replacement for a production-ready VTL system. If you want a VTL, buy a VTL.

Number 2 – Basics – Parallelism in NetWorker

Some would say that the high popularity of an article about parallelism in NetWorker indicates that it’s not sufficiently documented.

I’m not entirely convinced that’s the case. But it does go to show that it’s an important topic when it comes to performance tuning, and summary articles about how the various types of parallelism interact are obviously popular.

Lesson for everyone: Now that the performance tuning guide has been updated and made more relevant in NetWorker 7.6, I’d recommend that people wanting an official overview of some of the parallelism options check that out in addition to the article above.

Number 1 – Basics – Fixing “NSR peer information” errors

Goodness this was a popular article in 2009 – detailing how to fix the “NSR peer information” errors that can come up from time to time in the NetWorker logs. If you’re not familiar with this error yet, it’s likely you will eventually as a NetWorker administrator see an error such as:

39078 02/02/2009 09:45:13 PM  0 0 2 1152952640 5095 0 nox nsrexecd SYSTEM error: There is already a machine using the name: “faero”. Either choose a different name for your machine, or delete the “NSR peer information” entry for “faero” on host: “nox”

Recommendation for EMC: Users shouldn’t really need to be Googling for a solution to this problem. Let’s see an update to NetWorker Management Console where these errors/warnings are reported in the monitoring log, with the administrator being able to right click on them and choose to clear the peer information after confirming that they’re confident no nefarious activity is happening.

Wrapping Up

I have to say, it was a fantastically satisfying year writing the blog, and I’m looking forward to seeing what 2010 brings in terms of most useful articles.

 

Something that continues to periodically come up is the need to remind people running manual staging to ensure they specify both the SSID and the Clone ID when they stage. I did some initial coverage of this when I first started the blog, but I wanted to revisit and demonstrate exactly why this is necessary.

The short version of why is simple: If you stage by SSID alone, NetWorker will delete/purge all instances of the saveset other than the one you just created. This is Not A Good Thing for 99.999% of what we do within NetWorker.

So to demonstrate, here’s a session where I:

  1. Generate a backup
  2. Clone the backup to tape
  3. Stage the saveset only to tape

In between each step, I’ll run mminfo to get a dump of what the media database says about saveset availability.

Part 1 – Generate the Backup

Here’s a very simple backup for the purposes of this demonstration, and the subsequent mminfo command to find out about the backup:

[root@tara ~]# save -b Default -LL -q /etc
save: /etc  106 MB 00:00:07   2122 files
completed savetime=1258093549

[root@tara ~]# mminfo -q "client=tara.pmdg.lab,name=/etc" -r volume,ssid,cloneid,savetime
 volume        ssid          clone id  date
Default.001    2600270829  1258093549 11/13/2009
Default.001.RO 2600270829  1258093548 11/13/2009

There’s nothing out of the ordinary here, so we’ll move onto the next step.

Part 2 – Clone the Backup

We’ll just do a manual clone to the Default Clone pool. Here we’ll specify the saveset ID alone, which is fine for cloning – but is often what leads people to being in the habit of not specifying a particular saveset instance. I’m using very small VTL tapes, so don’t be worried that in this case I’ve got a clone of /etc spanning 3 volumes:

[root@tara ~]# nsrclone -b "Default Clone" -S 2600270829
[root@tara ~]# mminfo -q "client=tara.pmdg.lab,name=/etc" -r volume,ssid,cloneid,savetime
 volume        ssid          clone id  date
800843S3       2600270829  1258094164 11/13/2009
800844S3       2600270829  1258094164 11/13/2009
800845S3       2600270829  1258094164 11/13/2009
Default.001    2600270829  1258093549 11/13/2009
Default.001.RO 2600270829  1258093548 11/13/2009

As you can see there, it’s all looking fairly ordinary at this point – nothing surprising is going on at all.

Part 3 – Stage by Saveset ID Only

In this next step, I’m going to stage by saveset ID alone, rather than specifying the saveset ID/clone ID combination (the correct way of staging), so as to demonstrate what happens at the conclusion of the staging. I’ll be staging to a pool called “Big”:

[root@tara ~]# nsrstage -b Big -v -m -S 2600270829
Obtaining media database information on server tara.pmdg.lab
Parsing save set id(s)
Migrating the following save sets (ids):
 2600270829
5874:nsrstage: Automatically copying save sets(s) to other volume(s)

Starting migration operation...
Nov 13 17:34:00 tara logger: NetWorker media: (waiting) Waiting for 1 writable
volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
5884:nsrstage: Successfully cloned all requested save sets
5886:nsrstage: Clones were written to the following volume(s):
 BIG991S3
6359:nsrstage: Deleting the successfully cloned save set 2600270829
Successfully deleted original clone 1258093548 of save set 2600270829
from media database.
Successfully deleted AFTD's companion clone 1258093549 of save set 2600270829
from media database with 0 retries.
Successfully deleted original clone 1258094164 of save set 2600270829
from media database.
Recovering space from volume 4294740163 failed with the error
'Cannot access volume 800844S3, please mount the volume or verify its label.'.
Refer to the NetWorker log for details.
6330:nsrstage: Cannot access volume 800844S3, please mount the volume
or verify its label.
Completed recover space operation for volume 4177299774
Refer to the NetWorker log for any failures.
Recovering space from volume 4277962971 failed with the error
'Cannot access volume 800845S3, please mount the volume or verify its label.'.
Refer to the NetWorker log for details.
6330:nsrstage: Cannot access volume 800845S3, please mount the volume
or verify its label.
Recovering space from volume 16550059 failed with the error
'Cannot access volume 800843S3, please mount the volume or verify its label.'.
Refer to the NetWorker log for details.
6330:nsrstage: Cannot access volume 800843S3, please mount the volume
or verify its label.

You’ll note there’s a bunch of output there about being unable to access the clone volumes the saveset was previously cloned to. When we then check mminfo, we see the consequences of the staging operation though:

[root@tara ~]# mminfo -q "client=tara.pmdg.lab,name=/etc" -r volume,ssid,cloneid,savetime
 volume        ssid          clone id  date
BIG991S3       2600270829  1258095244 11/13/2009

As you can see – no reference to the clone volumes at all!

Now, has the clone data been erased? No, but it has been removed from the media database, meaning you’d have to manually scan the volumes back in order to be able to use them again. Worse, if those volumes only contained clone data that was subsequently removed from the media database, they may become eligible for recycling and get re-used before you notice what has gone wrong!

Wrapping Up

Hopefully the above session will have demonstrated the danger of staging by saveset ID alone. If instead of staging by saveset ID we staged by saveset ID and clone ID, we’d have had a much more desirable outcome. Here’s a (short) example of that:

[root@tara ~]# save -b Default -LL -q /tmp
save: /tmp  2352 KB 00:00:01     67 files
completed savetime=1258094378
[root@tara ~]# mminfo -q "name=/tmp" -r volume,ssid,cloneid
 volume        ssid          clone id
Default.001    2583494442  1258094378
Default.001.RO 2583494442  1258094377
[root@tara ~]# nsrclone -b "Default Clone" -S 2583494442

[root@tara ~]# mminfo -q "name=/tmp" -r volume,ssid,cloneid
 volume        ssid          clone id
800845S3       2583494442  1258095244
Default.001    2583494442  1258094378
Default.001.RO 2583494442  1258094377
[root@tara ~]# nsrstage -b Big -v -m -S 2583494442/1258094377
Obtaining media database information on server tara.pmdg.lab
Parsing save set id(s)
Migrating the following save sets (ids):
 2583494442
5874:nsrstage: Automatically copying save sets(s) to other volume(s)

Starting migration operation...

5886:nsrstage: Clones were written to the following volume(s):
 BIG991S3
6359:nsrstage: Deleting the successfully cloned save set 2583494442
Successfully deleted original clone 1258094377 of save set 2583494442 from
media database.
Successfully deleted AFTD's companion clone 1258094378 of save set 2583494442
from media database with 0 retries.
Completed recover space operation for volume 4177299774
Refer to the NetWorker log for any failures.

[root@tara ~]# mminfo -q "name=/tmp" -r volume,ssid,cloneid
 volume        ssid          clone id
800845S3       2583494442  1258095244
BIG991S3       2583494442  1258096324

The recommendation that I always make is that you forget about using saveset IDs alone unless you absolutely have to. Instead, get yourself into the habit of always specifying a particular instance of a saveset ID via the “ssid/cloneid” option. That way, if you do any manual staging, you won’t wipe out access to data!
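The difference between the two sessions can be boiled down to a toy model of the media database clean-up. This is a sketch of the behaviour observed above, not NetWorker’s actual logic: the `stage` function and its tuple layout are my own invention, and the pairing of a disk volume with its “.RO” twin is inferred from the “AFTD's companion clone” messages in the output.

```python
def stage(media_db, ssid, dest_volume, new_cloneid, cloneid=None):
    """Return the media database as it looks after a staging operation.

    media_db is a list of (volume, ssid, cloneid) tuples.
    With cloneid=None, every existing instance of the saveset is removed.
    With a cloneid, only that instance and its AFTD companion are removed.
    """
    def companion(volume):
        # Disk backup units appear as a pair: "X" (read-write) and "X.RO".
        return volume[:-3] if volume.endswith(".RO") else volume + ".RO"

    if cloneid is None:
        doomed = {e for e in media_db if e[1] == ssid}
    else:
        staged_vols = {v for (v, s, c) in media_db if s == ssid and c == cloneid}
        paired_vols = staged_vols | {companion(v) for v in staged_vols}
        doomed = {e for e in media_db if e[1] == ssid and e[0] in paired_vols}

    survivors = [e for e in media_db if e not in doomed]
    return survivors + [(dest_volume, ssid, new_cloneid)]


# The /tmp saveset from the session above, after cloning and before staging.
tmp_db = [
    ("800845S3", 2583494442, 1258095244),
    ("Default.001", 2583494442, 1258094378),
    ("Default.001.RO", 2583494442, 1258094377),
]

# Staging by ssid/cloneid keeps the clone on 800845S3 ...
print(stage(tmp_db, 2583494442, "BIG991S3", 1258096324, cloneid=1258094377))

# ... whereas staging by ssid alone would leave only the BIG991S3 copy.
print(stage(tmp_db, 2583494442, "BIG991S3", 1258096324))
```

The first call reproduces the final mminfo listing above; the second shows what the same saveset would have looked like had the clone ID been left off.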

 

For a while now I’ve been working with EMC support on an issue that’s only likely to strike sites that have intermittent connectivity between the server and storage nodes and that stage from ADV_FILE on the storage node to ADV_FILE on the server.

The crux of the problem is that if you’re staging from storage node to server and comms between the sites are lost for long enough that NetWorker:

  • Detects the storage node nsrmmd processes have failed, and
  • Attempts to restart the storage node nsrmmd processes, and
  • Fails to restart the storage node nsrmmd processes

Then you can end up in a situation where the staging aborts in an ‘interesting’ way. The first hint of the problem is that you’ll see a message such as the following in your daemon.raw:

68975 10/15/2009 09:59:05 AM  2 0 0 526402000 4495 0 tara.pmdg.lab nsrmmd filesys_nuke_ssid: unable to unlink /backup/84/05/notes/c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 on device `/backup’: No such file or directory

(The above was rendered for your convenience.)

However, if you look for the cited file, you’ll find that it doesn’t exist. That’s not quite the end of the matter though. Unfortunately, while the saveset file that was being staged didn’t stay on disk, its media database details did. So in order to restart staging, it becomes necessary to first locate the saveset in question and delete the media database entry for the (failed) server disk backup unit copy. Interestingly, this is only ever to be found on the RW device, not the RO device:

[root@tara ~]# mminfo -q "ssid=c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342"
 volume        client       date      size   level  name
Tara.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001.RO    fawn      10/15/2009 1287 MB manual  /usr/share

We had hoped that it was fixed in 7.5.1.5, but my tests aren’t showing that to be the case. Regardless, it’s certainly around in 7.4.x as well and (given the nature of it) has quite possibly been around for a while longer than that.

As I said at the outset, this isn’t likely to affect many sites, but it is something to be aware of.

 

In NetWorker, staging refers to moving savesets from one piece of media to another. The move operation is two-fold, consisting of:

  • A clone operation (source -> target)
  • A delete operation (source)

Both of these operations are done automatically as part of the staging process, with the delete being defined as the maximum supported operation for the media that the source was on; for tape based media, this means deleting the media database entries for the saveset(s) staged, and for disk this means both the media database delete and the filesystem delete of the saveset from the disk backup unit.

There are a few reasons why you use staging policies in NetWorker:

  1. To free up space on disk backup units.
  2. To move backups/clones from a previous media type to a new media type.
  3. To move backups/clones from older, expiring media to new media for long-term retention.

The latter two options usually refer to tape -> tape staging, which these days is the far less common use of staging in NetWorker. The most common use is now for managing used space on disk backup units, and that’s what we’ll consider here.

There are two ways you can stage within NetWorker – either as a scheduled task, or as a manual task.

Scheduled Staging

Scheduled staging occurs by creating one or more staging policies. Typically in a standard configuration for disk backup units, you’ll have one staging policy per disk backup unit. For example:

Creating a staging policy

The staging policy consists of settings that define:

  • Name/Comment – Identification details for the staging policy, as you’d expect.
  • Enabled – Default is no, set to Yes if you want the staging policy to run. (Note that you can’t start a disabled staging policy manually – or you couldn’t, last time I checked.)
  • Which disk backup units (devices) will be read from. Choose both the read-only and the read-write version of the disk backup units. (Unless there are significant issues, NetWorker will always read from the read-only version of the disk backup unit anyway.)
  • Destination pool – where savesets will be written to.
  • High water mark – this is expressed as a percentage of the total filesystem capacity that the disk backup resides on (which is why each disk backup unit should be on its own filesystem!). It basically means “if the occupied space in savesets reaches <nominated> percent, then start staging data off”.
  • Low water mark – again, a percentage of the total filesystem capacity that the disk backup resides on. If staging is initiated due to a high watermark value, then staging will continue until the disk backup unit can be freed up such that the used space at the end of the staging is equal to or less than the low water mark.
  • Save set selection – will be one of oldest/largest/youngest/smallest. For most disk backup units, the choice is normally between oldest saveset or largest saveset.
  • Max storage period / period unit – defines the maximum amount of time, in either days or hours, that savesets can remain on disk before they must be staged out. This will occur irrespective of any watermarks (and the watermarks, similarly, will occur irrespective of any maximum storage period).
  • Recover space interval / interval unit – defines how frequently NetWorker will check to see if there are any recyclable savesets that can be removed from disk. (Aborted savesets, while checked for in this, should be automatically cleaned up when they are aborted.)
  • Filesystem check interval / interval unit – defines how frequently NetWorker will check to see whether it should actually perform any staging.

While there’s a lot of numbers/settings there in a small dialog, they actually all make sense. For instance, let’s consider the staging policy defined in the above picture. It shows a policy called “Daily” that will move savesets out from /d/nsr/01 to the Daily pool, with the following criteria:

  • If the disk backup unit becomes 85% full, it will commence staging until it has moved enough savesets to return the disk backup unit to 50% capacity.
  • Any saveset that is older than 7 days will be staged.
  • Whenever savesets are staged, they will be picked in order of oldest to newest.
  • It will check for recyclable/removable savesets every 4 hours.
  • It will check to see if staging can be run every 7 hours.

Note that if the disk backup unit becomes full, staging by watermark will be automatically kicked off, even if the 7 hour wait between staging checks hasn’t been reached.
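The watermark behaviour can be sketched as follows. This is a simplified model under stated assumptions, not NetWorker’s algorithm: it only covers “oldest saveset” selection, it ignores the maximum storage period, and the function name and the (savetime, size) layout are hypothetical.

```python
def select_for_staging(savesets, capacity, high_pct, low_pct):
    """Pick savesets to stage, oldest first, using high/low water marks.

    savesets is a list of (savetime, size) tuples; capacity is the size of
    the filesystem in the same units as size. Nothing is selected unless
    used space has reached the high water mark; once it has, savesets are
    selected until used space is at or below the low water mark.
    """
    used = sum(size for _, size in savesets)
    if used * 100 < high_pct * capacity:
        return []

    chosen = []
    for saveset in sorted(savesets):      # lowest savetime (oldest) first
        if used * 100 <= low_pct * capacity:
            break
        chosen.append(saveset)
        used -= saveset[1]
    return chosen


# A 100-unit disk backup unit that is 90% full, with the 85%/50% marks above.
on_disk = [(3, 25), (1, 20), (4, 15), (2, 30)]
print(select_for_staging(on_disk, 100, 85, 50))  # [(1, 20), (2, 30)]
```

Staging the two oldest savesets takes the unit from 90% down to 40%, which is the first point at which it sits at or below the 50% low water mark.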

One important note here – If you have multiple disk backup units with volumes labelled into the same pool, you can choose to either have them all in one staging policy, have one staging policy per disk backup unit, or have a series of staging policies with one or more disk backup units in each policy. There are risks/costs associated with each option. If you have too many defined under a single staging policy, then staging becomes very “single-threaded” in terms of disks read from; this can significantly slow down the staging policy. Alternatively, if you have one staging policy per disk backup unit, but the number of disk backup units exceeds the number of tape drives, you can end up with significant contention between staging, cloning and tape-recovery operations. It’s a fine balancing act.

Manual staging

Manual staging is accomplished by running the nsrstage command. That, incidentally, is what happens in the background for scheduled staging – the staging policy runs, evaluates what needs to be staged, then runs nsrstage.

The standard way of invoking nsrstage is:

# nsrstage -b destinationPool -m -v -S ssid/cloneid

or

# nsrstage -b destinationPool -m -v -S -f /path/to/file.txt

In the first, you’re staging a single saveset instance. NOTE: You must always specify ssid/cloneid; if you don’t – if you just specify the saveset ID, then when NetWorker cleans up at the end of the staging operation, it will delete all other instances of the saveset. So if you’ve got a clone, you’ll lose reference to the clone!

In the second instance, you’re staging multiple ssid/cloneid combinations, specified one per line in a plain text file.
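One way to build that text file is to parse mminfo output requested with “-r volume,ssid,cloneid”. The sketch below is an illustration only: the function names are hypothetical, it assumes volume names contain no whitespace, and picking the instance on the read-only (“.RO”) disk volume is my assumption, based on how the manual staging sessions earlier on this page were run.

```python
def readonly_instances(mminfo_output):
    """Return 'ssid/cloneid' strings for instances on read-only disk volumes.

    Expects the columns volume, ssid, cloneid (extra columns are ignored).
    """
    specs = []
    for line in mminfo_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose second field is not numeric.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        volume, ssid, cloneid = fields[0], fields[1], fields[2]
        if volume.endswith(".RO"):
            specs.append(f"{ssid}/{cloneid}")
    return specs


def spec_file_text(specs):
    """Format the specs one per line, ready to be written to the -f file."""
    return "".join(spec + "\n" for spec in specs)


sample = """\
 volume        ssid          clone id  date
800843S3       2600270829  1258094164 11/13/2009
Default.001    2600270829  1258093549 11/13/2009
Default.001.RO 2600270829  1258093548 11/13/2009
"""
print(spec_file_text(readonly_instances(sample)), end="")  # 2600270829/1258093548
```

Because every line carries a clone ID, a file built this way cannot accidentally stage by saveset ID alone.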

(There are alternate mechanisms for calling nsrstage to either clean the disk backup unit, or stage by volume. These aren’t covered here.)

As with all pools these days in NetWorker, you can either have the savesets staged with their original retention period in place, or stage them to a pool with a defined retention policy, in which case their retention policy will be adjusted accordingly when they are staged.

Scripting

You can of course script staging operations; particularly when running manually you’ll likely first run mminfo, then run nsrstage against a bunch of savesets. Alternatively, you may want to check out dbufree, a utility within the IDATA Tools utility suite; this offers considerable enhancements over regular staging, including (but not limited to):

  • Stage savesets selected on a disk backup unit by any valid mminfo query (e.g., stage by group…)
  • Specify how much space you want to free up, rather than watermark based
  • Stage in chunks – stage enough space to free up a nominated amount of capacity, stop to allow reclamation to take place, then start staging again.
  • Only stage savesets that have clones.
  • Enhanced saveset order options – in addition to biggest/smallest/oldest/newest, there’s options to order savesets by client (to logically assist in speeding up recovery from multiple backups) in any of the ‘primary’ sort methods as a sub-order.
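For the plain “run mminfo, then run nsrstage” style of scripting mentioned at the start of this section, a minimal sketch might look like the following. It only builds the command lines, using the nsrstage options shown earlier (-b, -m, -v, -S); the function name is hypothetical, and actually running the commands (for example with subprocess.run) is left to you.

```python
import shlex


def nsrstage_commands(pool, specs):
    """Build one nsrstage argument list per 'ssid/cloneid' spec.

    Refuses a bare saveset ID, so a script built on this cannot stage by
    saveset ID alone.
    """
    commands = []
    for spec in specs:
        ssid, sep, cloneid = spec.partition("/")
        if not (ssid and sep and cloneid):
            raise ValueError(f"expected ssid/cloneid, got {spec!r}")
        commands.append(["nsrstage", "-b", pool, "-m", "-v", "-S", spec])
    return commands


for argv in nsrstage_commands("Big", ["2583494442/1258094377"]):
    print(shlex.join(argv))  # nsrstage -b Big -m -v -S 2583494442/1258094377
```

Keeping each command as an argument list rather than a single string means pool names containing spaces, such as “Default Clone”, need no manual quoting.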

Cloning

Talking about staging wouldn’t be complete if we also didn’t mention cloning. In an ideal configuration for disk backup units, you should:

  • Backup to disk
  • Clone to tape
  • (Later), stage to tape.

This can create some recovery implications. I’ve covered that previously in this post.

© 2012 The NetWorker Blog