Push/Pull

A common question asked by people new to NetWorker is whether it supports a push or pull recovery model.

The answer, as you’d expect for an enterprise backup product, is both. However, the recovery processes aren’t named push and pull.

If you’re not aware of push and pull recovery models, they work thusly:

  • A push recovery model is where all recovery requests are handled by the backup administrator, or at least, on the backup server, and the data retrieved is transferred out to the client.
  • A pull recovery model has the client that wishes to receive the data initiate the recovery and retrieve the data from the backup server.

NetWorker supports both, and in fact more, but it uses the term directed recoveries.

Technically, all recoveries in NetWorker are directed. They involve three clients, which are:

  • Source – the host from where the data was originally backed up;
  • Target – the host that the data is to be recovered to;
  • Control – the host that initiates the recovery.

Now, because the backup server has the NetWorker client software on it, it can be any one of those clients. A workgroup style “push” recovery would typically work with clients aligned as follows:

  • Source and Target – Host where the data came from originally
  • Control – The backup server

On the other hand, a workgroup style “pull” recovery would typically work with clients aligned as follows:

  • Source, Target and Control – The host where the data came from originally.

NetWorker’s directed recovery model is more powerful and flexible than the above two examples, though. For example, you can run a recovery where all three hosts are different machines – e.g.:

  • Source – Production database server
  • Target – Development database server
  • Control – Backup server

In this situation the directed recovery would be used to act as a means of getting data from production into the development area.
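To make the three roles concrete, here is a minimal Python sketch that models a recovery as a source/target/control triple and maps it back onto the informal push/pull vocabulary. The class, the function and the host names are all hypothetical, made up for illustration; nothing here is a NetWorker API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectedRecovery:
    source: str   # host the data was originally backed up from
    target: str   # host the data is recovered to
    control: str  # host that initiates the recovery


def classify(recovery: DirectedRecovery, backup_server: str) -> str:
    """Map a role assignment onto the informal push/pull vocabulary."""
    if recovery.source == recovery.target == recovery.control:
        return "pull"
    if recovery.source == recovery.target and recovery.control == backup_server:
        return "push"
    return "directed (neither simple push nor pull)"


# Hypothetical host names, mirroring the three alignments described above.
print(classify(DirectedRecovery("db-prod", "db-prod", "db-prod"), "backup01"))
print(classify(DirectedRecovery("db-prod", "db-prod", "backup01"), "backup01"))
print(classify(DirectedRecovery("db-prod", "db-dev", "backup01"), "backup01"))
```

The point of the sketch is simply that push and pull are two particular assignments of the same three roles, which is why NetWorker can treat them as special cases of one model.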

So the answer to that original question is: yes, NetWorker supports a push recovery model. And yes, NetWorker supports a pull recovery model. But it also supports more.

 

A recurring theme on the NetWorker Mailing List is people not understanding how Automedia Management operates in tape libraries.

Here’s the complete list of what automedia management does when enabled for a tape library:

  1. It enables NetWorker to automatically label and use any blank tape (i.e., a tape that does not have a NetWorker label header on it) within the library if and only if there are no other appendable or recyclable volumes in the library to service the pool that requires media.

It does not have anything to do with the following:

  1. Recycling volumes within a pool (this is handled automatically, so long as the volume isn’t set for manual recycling)
  2. Recycling volumes between pools (this is handled by configuring the pool to “recycle to other pools” and “recycle from other pools”)
  3. Inventorying volumes as they are added to the library (this is barcode and NetWorker version dependent)
  4. Depositing volumes (this is an operator function)
  5. Withdrawing volumes (this is an operator function)
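The one thing automedia management does do can be sketched as a small decision function. This is a simplified model in Python, not NetWorker’s actual selection logic: the dictionary keys and state names are hypothetical, the preference for appendable over recyclable volumes is my assumption, and “recycle from other pools” is ignored entirely.

```python
def pick_volume(volumes, pool, auto_media_management):
    """Return (action, volume name) for a pool that needs media, or None.

    `volumes` is a list of dicts with hypothetical keys: 'name',
    'pool' (None when unlabelled) and 'state', which is one of
    'appendable', 'recyclable', 'full' or 'blank'.
    """
    # Volumes already in the pool always win over blank media.
    for state, action in (("appendable", "append"), ("recyclable", "recycle")):
        for vol in volumes:
            if vol["pool"] == pool and vol["state"] == state:
                return action, vol["name"]
    # Only when nothing suitable exists does automedia management step in.
    if auto_media_management:
        for vol in volumes:
            if vol["state"] == "blank":
                return "label", vol["name"]
    return None


library = [
    {"name": "800840L4", "pool": "Daily", "state": "full"},
    {"name": "800841L4", "pool": None, "state": "blank"},
]
print(pick_volume(library, "Daily", auto_media_management=True))
print(pick_volume(library, "Daily", auto_media_management=False))
```

With the setting disabled, the same library leaves the pool waiting for media, which is the situation that usually prompts the mailing list question in the first place.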

 

 

Upgrading NetWorker

So a new version of NetWorker has come out, or is coming out, and it’s been decided that you’re going to upgrade, but you want a few tips for making that upgrade as painless as possible. Here are my five rules for upgrading NetWorker:

  1. Read the release notes. If you’re not going to read the release notes, you are better off staying on your current version, no matter what issues you’re having. I can’t stress enough the importance of reading the release notes and having a thorough grasp of:
    • What has changed?
    • What are the known issues with the new release?
    • What issues were resolved between the release you’re currently running and the new release?
  2. Do a bootstrap and index backup if upgrading between major or minor releases. If going between service packs on the same release, you can skip the index backup so long as your backups have been successful lately, but ensure you still do a bootstrap backup.
  3. Unload all tapes (physical or virtual) in jukeboxes before the upgrade. You’ll see why shortly.
  4. Upgrade in this order:
    • Storage node(s) on the day of the upgrade, before the NetWorker server
    • Server on the day of the upgrade, after the storage node(s)
    • Client(s) later, at suitable times
  5. After the upgrade but before the NetWorker services are restarted on the storage node(s) and server, delete the nsr/tmp directory on those hosts.

Obviously the standard caveats apply, such as following any additional instructions in the release notes or upgrade notes, but sticking to the above rules as well can save a lot of hassle over time. I’ve noticed over the years that odd, random problems following upgrades can be solved by clearing the nsr/tmp directory on the server and storage nodes. If there are no tapes in the jukeboxes when the services first start after the upgrade, there’s less futzing for NetWorker to take care of before it’s fully up and running, too.
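If you keep a host inventory, the ordering in rule 4 is easy to apply mechanically. The following Python sketch sorts a list of hosts into upgrade order; the role labels and the client names are hypothetical, and it only plans the order rather than touching any host.

```python
# Lower number means upgraded earlier, following rule 4.
UPGRADE_ORDER = {"storage node": 0, "server": 1, "client": 2}


def upgrade_plan(hosts):
    """Sort (hostname, role) pairs into the order given in rule 4."""
    # sorted() is stable, so hosts with the same role keep their listed order.
    return sorted(hosts, key=lambda host: UPGRADE_ORDER[host[1]])


hosts = [
    ("client-a", "client"),
    ("tara", "server"),
    ("fawn", "storage node"),
    ("client-b", "client"),
]
for name, role in upgrade_plan(hosts):
    print(f"{role:12}  {name}")
```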

 

In a previous post, I described how one could use jobquery and jobkill to terminate running scheduled clones in situations where NMC doesn’t allow the clone to be stopped from within the GUI. However, jobquery isn’t necessarily the most intuitive of interfaces if you’re not using it all the time.

I was pleasantly surprised when I was preparing some documentation to note that jobkill, as of 7.6 SP2, has become interactive if there are multiple jobs running, which reduces the need to run jobquery if you’re wanting to just stop one scheduled operation.

In 7.6 SP2, if you run jobkill without any arguments, and there are jobs running, you’ll get an interactive session such as the following.

# jobkill
                      job id: 3104018;
                        name: tara-5;
                        type: savegroup job;
                     command: ;
           NW Client name/id: ;
                  start time: 1312763880;
------------------------------------------------------
                      job id: 3104025;
                        name: /d/01;
                        type: save job;
                     command: \
save -s tara.pmdg.lab -g nox-5 -LL -f - -m tara.pmdg.lab -t 1312026303 \
-l 5 -q -W 78 -N /d/01 /d/01;
           NW Client name/id: tara.pmdg.lab;
                  start time: 1312763880;
------------------------------------------------------
                      job id: 3104026;
                        name: /;
                        type: save job;
                     command: \
save -s tara.pmdg.lab -g nox-5 -LL -f - -m tara.pmdg.lab -t 1312026306 \
-l 5 -q -W 78 -N / /;
           NW Client name/id: tara.pmdg.lab;
                  start time: 1312763880;
------------------------------------------------------
Specify jobid to kill ('q' to quit, 'r' to refresh): 3104018
Terminating job 3104018
Specify jobid to kill ('q' to quit, 'r' to refresh): q

So there you go – jobkill is interactive, helpful and now saves the hassle of running jobquery first.
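One small annoyance in that listing is the start time, which looks like a Unix epoch value. Assuming that is what it is, the following Python sketch parses a listing in the format shown above and converts the start time to a readable timestamp. The parser is my own and handles only the layout seen here (attributes ending in a semicolon, records separated by dashed lines, backslash line continuations); the sample is a trimmed copy of the session above.

```python
import re
from datetime import datetime, timezone

SAMPLE = """\
                      job id: 3104018;
                        name: tara-5;
                        type: savegroup job;
                  start time: 1312763880;
------------------------------------------------------
                      job id: 3104026;
                        name: /;
                        type: save job;
                     command: \\
save -s tara.pmdg.lab -N / /;
                  start time: 1312763880;
"""


def parse_jobs(text):
    """Split a job listing into one dict of attributes per job."""
    jobs = []
    for block in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        block = block.replace("\\\n", "")  # join backslash-continued lines
        job = {}
        for attr in block.split(";\n"):
            if ":" not in attr:
                continue
            key, _, value = attr.partition(":")
            job[key.strip()] = value.strip()
        if job:
            jobs.append(job)
    return jobs


def started(job):
    """Interpret 'start time' as seconds since the Unix epoch (an assumption)."""
    return datetime.fromtimestamp(int(job["start time"]), tz=timezone.utc)


for job in parse_jobs(SAMPLE):
    print(job["job id"], job["name"], started(job).isoformat())
```

Times are printed in UTC here; swap in your local timezone if you want them to line up with what NMC shows.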

 

So you’ve got a pending message sitting there in NMC, NetWorker logs and nsrwatch:

Waiting for 1 writable volume(s) to backup pool 'Default'...

Backtracking why something is asking for this can be tedious and time consuming. You can dig through logs, or you can do process listings, etc., but here’s the thing: if it doesn’t jump out at you almost immediately, there’s a good chance it’s a client-initiated backup, and that may not be as easy to spot.

To me, this problem falls into one of those simple choice categories where you can either:

(a) Spend an indeterminate amount of time working out what is asking for a Default volume; or

(b) Just simply label a volume into the Default pool and see what gets written to it.

To me, there’s practically no choice in the situation. If I can label a volume into the Default pool and see what’s writing, I can then stop that backup by contacting the system owner, or finding and hunting down the backup process and stopping it myself.

While I appreciate it may not appear a graceful solution, we have to remember that sometimes it’s wasteful to spend a large amount of time to solve a problem one way, when there’s a simpler (albeit slightly uglier) way to immediately isolate the cause of a problem.
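If you want to catch these requests early, say from a script that watches the logs, pulling the pool name out of the pending message is straightforward. A minimal Python sketch follows; the regular expression covers only the exact wording quoted above, and other variants of the message are not handled.

```python
import re

# Matches only the wording of the pending message quoted above.
PENDING = re.compile(
    r"Waiting for (\d+) writable volume\(s\) to backup pool '([^']+)'"
)


def parse_pending(message):
    """Return (volume count, pool name), or None if the line doesn't match."""
    match = PENDING.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_pending("Waiting for 1 writable volume(s) to backup pool 'Default'..."))
```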

 

When reviewing a customer’s logs today, I noticed yet another one of those minor (but great) changes that periodically get added to NetWorker. This one is subtle, but both important and useful; it seems to be in at least 7.6.1.3 onwards, but I’m not sure if it appeared earlier.

As the server starts up, the client ID of the NetWorker server now gets clearly, explicitly logged in the daemon file. If you subsequently have an issue (e.g., a DNS change) that triggers a client ID change, having this detail automatically logged will at least (a) help you notice and (b) be useful when changing it back.

The information is logged clearly:

nsrd <<< NSR server host 'tara.pmdg.lab' using 
Client id '85acae6f-00000004-464fbdd1-464fbdd0-00010000-c0a86404' >>>

(Line wrap added for clarity – it’s all on one line in the log.)

While you may not immediately need this information, being aware that it’s available is always A Good Thing.
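Because the entry has a fixed shape, it’s easy to pull out of the log and compare across restarts. Here is a minimal Python sketch; it assumes the entry always looks like the one above, and the second log line (with a different client ID) is invented purely to show what a change would look like.

```python
import re

CLIENT_ID = re.compile(
    r"NSR server host '([^']+)' using\s+Client id '([0-9a-f-]+)'"
)


def server_client_ids(lines):
    """Return (host, client id) for every matching startup entry, in order."""
    found = []
    for line in lines:
        match = CLIENT_ID.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


log = [
    "nsrd <<< NSR server host 'tara.pmdg.lab' using "
    "Client id '85acae6f-00000004-464fbdd1-464fbdd0-00010000-c0a86404' >>>",
    # Invented entry: what a later restart with a changed ID might look like.
    "nsrd <<< NSR server host 'tara.pmdg.lab' using "
    "Client id '00000000-00000000-00000000-00000000-00000000-00000000' >>>",
]
ids = server_client_ids(log)
if len({client_id for _, client_id in ids}) > 1:
    print("Client ID changed between restarts:", ids)
```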

 

When people are just starting to get into NetWorker, a common situation is that they get confused about the difference between cloning and staging. (This isn’t helped given NetWorker can report in-progress staging operations as cloning – a perennial source of annoyance.)

So, what’s the difference?

  • A clone operation is where NetWorker duplicates a saveset. It makes a registered copy of the saveset, and at the conclusion of the operation is aware that it has an additional copy.
  • A stage operation is where NetWorker moves a saveset. It first makes a registered copy of the saveset, and then at the conclusion of the operation removes reference to the instance that it copied from.

Typically when we talk about staging, we talk about moving from a disk (media type ‘FILE’ or media type ‘ADV_FILE’) volume to tape. In such a situation where NetWorker stages from an actual FILE/ADV_FILE volume, it not only removes reference to the original saveset, but it actually removes it from the source volume as well. That’s to be expected – it’s a real disk filesystem that NetWorker is accessing, and removing a saveset is as simple as just running an operating system ‘delete’ command.

While it’s not often done, NetWorker does support staging from tape – but obviously when it’s done reading from the source volume, it can’t then selectively erase chunks of savesets from the source tape. Instead, all it does in that situation is delete, from the media database, the reference to the saveset having been on the source volume.

(In case you’re new to NetWorker and intend to now run off and try some staging – make sure, please, before you do, to read the second article ever posted on the NetWorker Blog – “Instantiating Savesets”. It has some very cautionary information about staging operations.)

One final thing – while I said that a clone or stage operation is done against a saveset, it’s not always as simple as that. If you tell NetWorker to clone or stage an individual saveset/saveset instance, that’s exactly what it will do. However, if you tell NetWorker to clone or stage a NetWorker tape volume, it will clone/stage the entire volume, with multiplexing left intact.

Regardless of those caveats though, remember the simple rule – a clone is a copy operation, and a stage is a move operation.
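The copy-versus-move rule can be modelled in a few lines. This Python sketch uses a toy “media database” (a dictionary mapping a saveset ID to the set of volumes holding an instance of it); the saveset ID and volume names are hypothetical, and real staging involves far more than this bookkeeping.

```python
def clone(media_db, ssid, source, destination):
    """Copy: the saveset ends up registered on both volumes."""
    if source not in media_db[ssid]:
        raise ValueError(f"saveset {ssid} has no instance on {source}")
    media_db[ssid].add(destination)


def stage(media_db, ssid, source, destination):
    """Move: make the copy first, then drop the reference to the source."""
    clone(media_db, ssid, source, destination)
    media_db[ssid].discard(source)


media_db = {"4078": {"disk01"}}

clone(media_db, "4078", "disk01", "tape01")
print(sorted(media_db["4078"]))   # instances on both disk01 and tape01

stage(media_db, "4078", "disk01", "tape02")
print(sorted(media_db["4078"]))   # disk01 is no longer referenced
```

Note that stage is literally clone plus a removal, which is also why an in-progress staging operation can look like a clone from the outside.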

 

In a previous blog post, I discussed how much I liked the scheduled cloning operations introduced in NetWorker 7.6 SP1. Since then, I’ve had several people comment on it saying that while they’re able to manually start scheduled cloning operations, they’re not able to stop scheduled cloning operations in NMC – regardless of whether they were manually or automatically started.

Now I thought I’d been able to manually stop a scheduled cloning operation via NMC during beta testing, but I may have confused myself with something else, and when I noticed the same issue, it led me to think – can I stop this some other way, maybe from the command line? (For what it’s worth, the inability to stop a scheduled clone from NMC is a known issue, and there’s an EMC request running for it.)

It turns out that, with NMC unable to do it, the command line is how you stop a scheduled cloning operation, and it’s fairly simple. To do so, you use jobquery and jobkill.

First, use jobquery to identify the scheduled clone job you want:

# jobquery
jobquery> show name:; job id:; job state:
jobquery> print type: clone job; job state: SESSION ACTIVE:
                      job id: 64002;
                   job state: SESSION ACTIVE;
                        name: clone.linux clones;

Once you’ve got that job ID, all you have to do is quit jobquery, and run:

# jobkill -j jobID

In this case – it would be:

# jobkill -j 64002
Terminating job 64002

That’s it – that’s how you stop a scheduled clone job.
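If you script this, the easy mistake is copying the job ID straight out of the jobquery output, trailing semicolon and all. A minimal Python sketch of a helper that tidies the ID and builds the jobkill command line shown above; the helper name is hypothetical, and it only builds the command rather than running it.

```python
import shlex


def jobkill_command(job_id):
    """Build the argument list for 'jobkill -j <job id>'."""
    # jobquery prints values as 'job id: 64002;', so drop a stray semicolon.
    job_id = str(job_id).strip().rstrip(";")
    if not job_id.isdigit():
        raise ValueError(f"not a numeric job ID: {job_id!r}")
    return ["jobkill", "-j", job_id]


print(shlex.join(jobkill_command("64002;")))
```

The returned list can be handed to subprocess.run() on a host where jobkill is available.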

 

If you’ve got multiple jukeboxes within a NetWorker environment, but primarily work with one of them, you may find ‘nsrjb’ to be a bit of a pain any time you forget to specify the jukebox name. If you’re not familiar with this, here’s how nsrjb reacts in this situation:

[root@tara ~]# nsrjb
1:	VTL1 	[enabled]
2:	VTL2 	[enabled]
No jukebox selected.
Please select a jukebox to use:? [1] _

(Slight aside: never assume the numbered list is the same; NetWorker doesn’t guarantee the order being the same between executions – in fact, I actually only put in an RFE about this a couple of days ago, as I’m hoping it could at least be alphabetically ordered at all times…)

If you want to avoid the jukebox-prompt from nsrjb, one of the easiest ways is to specify the jukebox name as part of the command – e.g.,

[root@tara ~]# nsrjb -j VTL1

That’s fine of course, but if the vast majority of the time you perform operations on a single jukebox, you can specify a default jukebox as an environment variable (NSR_JUKEBOX) and streamline your processes. For example, on Linux, using bash, this might look as follows:

[root@tara ~]# export NSR_JUKEBOX=VTL1
[root@tara ~]# nsrjb
Jukebox VTL1: (Ready to accept commands)
slot  volume         pool           barcode   volume id        recyclable
1: 800840L4       ClientTesting  800840L4  3814088325       no
2: 800841L4       ClientTesting  800841L4  3797311146       no
3: 800842L4       ClientTesting  800842L4  3847642669       no
4: 800843L4       ClientTesting  800843L4  3780533937       no
5: 800844L4       ClientTesting  800844L4  3763756765       yes
6: 800845L4       ClientTesting  800845L4  3864419885       yes
<snip>

Being an environment variable, this is something you can choose to set locally – say, on a per storage-node basis, when you have multiple storage nodes. It’s relatively common for instance to have a tape library on one or more storage nodes, so for the appropriate logins (or even at a system level) on each storage node it would be possible to set the local jukebox as the default, thereby streamlining usage of the units.

As an example, here’s a lab storage node with the setting in use:

[root@fawn ~]# export NSR_JUKEBOX="rd=fawn:VTL3"
[root@fawn ~]# nsrjb -s tara
Jukebox rd=fawn:VTL3: (Ready to accept commands)
<snip>

For something that can take you less than 30 seconds to set, the environment variable NSR_JUKEBOX can certainly be a big time saver if you have multiple jukeboxes in your environment and (like me) you’re a command line junkie.
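The same trick works from scripts that shell out to nsrjb: set NSR_JUKEBOX only in the child process’s environment and leave your own session alone. A minimal Python sketch, with a hypothetical helper name; the nsrjb call itself is left as a comment since it needs a NetWorker host to run.

```python
import os


def nsrjb_env(jukebox, base_env=None):
    """Return a copy of the environment with NSR_JUKEBOX set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NSR_JUKEBOX"] = jukebox
    return env


env = nsrjb_env("rd=fawn:VTL3")
# On a storage node, subprocess.run(["nsrjb", "-s", "tara"], env=env)
# would then pick up the default jukebox without prompting.
print(env["NSR_JUKEBOX"])
```

Copying the environment rather than modifying os.environ means one script can talk to different jukeboxes without the settings leaking between calls.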

 

Normally you don’t want to be in this position, but sometimes you’ll strike a situation where the only possible location of data that you need to get back is in a saveset that aborted (i.e., failed) during the backup process. Now, if the saveset/media is almost completely hosed, you’re probably going to need to recover using the scanner|uasm process, but if it was just a case of a failed backup, you can direct a partial saveset recovery using the recover command.

When you’re at this point the first thing you need to do is find the saveset ID of the aborted saveset, but I’ll leave that as an exercise to the reader. Now, once you’ve got the aborted saveset ID, it’s as simple as running a saveset recovery. The basic command might look like this:

C:\> recover -d path -s buServer -iN -S ssid

Where:

  • ‘path’ is the path that you want to recover to. Note that in these situations, it’s usually a very, very good idea to make sure you recover to somewhere new, rather than overwriting any existing files.
  • ‘buServer’ is the backup server that you want to recover from.
  • ‘ssid’ is the saveset ID for the aborted saveset that you want to recover from.

Depending on whether you’re doing a directed recovery, etc., you may end up with a few additional arguments, but the above is fairly much what you need in this situation. (If you’re confident that a specific path or file you want back is going to be in the part of the saveset backed up, you can always add that path at the end of the recovery command, too.)
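For anyone wrapping this in a script, the command above can be assembled as an argument list, with any specific paths appended at the end. A minimal Python sketch; the function name and the example values are hypothetical, and only the arguments shown in the command above are used.

```python
def recover_command(dest, server, ssid, paths=()):
    """Build the saveset recovery command line described above."""
    return ["recover", "-d", dest, "-s", server, "-iN", "-S", str(ssid), *paths]


# Recover everything that made it into the aborted saveset, to a new location.
print(recover_command(r"C:\temp2", "tara", 851692923))

# Or ask only for a path you expect to be in the part that was backed up.
print(recover_command(r"C:\temp2", "tara", 851692923, [r"C:\Temp\744"]))
```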

Once the recovery runs, you’ll get a standard file-by-file listing of what is being recovered, but the recovery will end with what looks like an error – it’s effectively though just a notification that NetWorker has hit the data that was ‘in transit’, so to speak, when the saveset was aborted. This error will look similar to the following:

5041:recover: Unable to read checksum from save stream

16294:recover: Encountered an error recovering C:\temp2\Temp\744\win_x86\networkr\hba\emc-homebase-agent-6.1.2-win-x86.exe

53363:recover: Recover of rsid 851692923 failed: Error receiving files from NSR server `tara'

The process cannot access the file because it is being used by another process.

Received 231 matching file(s) from NSR server `tara'

Recover errors with 1 file(s)

Recover completion time: 4/20/2010 3:41:12 PM

At that point, you know that you’ve got back all the data you’re going to get back, and you can search through the recovered files for the data you want.
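Since the recovery is expected to end in an error here, it can help to capture the output and pull out the two numbers that matter. A minimal Python sketch that reads the summary lines in the format shown above; it assumes those lines always take that form, and returns None for anything it can’t find.

```python
import re


def recover_summary(output):
    """Extract the received and errored file counts from recover output."""
    received = re.search(r"Received (\d+) matching file\(s\)", output)
    errors = re.search(r"Recover errors with (\d+) file\(s\)", output)
    return {
        "received": int(received.group(1)) if received else None,
        "errors": int(errors.group(1)) if errors else None,
    }


output = """\
5041:recover: Unable to read checksum from save stream
Received 231 matching file(s) from NSR server `tara'
Recover errors with 1 file(s)
"""
print(recover_summary(output))
```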

(As an aside, don’t forget to join the forums if you’ve got questions that aren’t answered in this blog.)

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha