Periodically people who are new to NetWorker will lament that it doesn’t support “push” recoveries. That is, a recovery run on the server that pushes data out to the client.

This supposed lack of support comes down to NetWorker using an alternate nomenclature for the term, and making it more generic. In NetWorker, you have two distinct styles of recovery:

  • A local recovery (run on the machine where the backup was taken)
  • A directed recovery

In actual fact, a local recovery is effectively just a “special case” of a directed recovery. So let’s look at what a directed recovery involves.

For directed recoveries, we need to think of 3 different types of client. These are:

  • The source client – the host from where the original backup was taken.
  • The destination client – the host to where the recovery will be written.
  • The control client – the host that runs the recovery.

You can hopefully see from the above how a local recovery is just a special case of a directed recovery – it’s just a directed recovery where the source, destination and control clients are the original source client.

(Similarly, a push recovery, in terms of workgroup backup products, is one where the control client is the backup server, and the source/destination clients are different to the backup server.)
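The three roles can be expressed as a small classification. The sketch below is illustrative only: the function name and the server name `backupsrv` are made up for the example and are not part of NetWorker.

```python
def classify_recovery(source: str, destination: str, control: str, server: str) -> str:
    """Name the style of recovery implied by the three client roles."""
    # All three roles on one host: the "special case" of a directed recovery.
    if source == destination == control:
        return "local"
    # Control runs on the backup server; data comes from and goes to other hosts.
    if control == server and source != server and destination != server:
        return "push"
    # Anything else is a general directed recovery.
    return "directed"


# 'backupsrv' is a hypothetical backup server name.
print(classify_recovery("medusa", "medusa", "medusa", "backupsrv"))
print(classify_recovery("medusa", "medusa", "backupsrv", "backupsrv"))
print(classify_recovery("medusa", "cyclops", "cyclops", "backupsrv"))
```

The last call matches the 2-way example used in the rest of this post: medusa is the source, and cyclops is both destination and control client.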

As you can imagine, in order to perform directed recoveries, you need some special permissions; this prevents any user on any host from recovering data from another machine. There are two ways that one can have permission to do directed recoveries:

  • Any member of the NetWorker administrator user group can conduct a directed recovery without any additional setup.
  • Users configured in the ‘Remote Access’ properties for a client can access that client either as a source or destination client in a directed recovery.

As an example, let’s consider a 2-way directed recovery, where our destination and control clients are the same host, and the source client is another machine. Our destination/control client is called ‘cyclops’, and our source client is called ‘medusa’.

In this case, without running as a member of the NetWorker administrator group, we need to configure the ‘Remote Access’ field on medusa to permit the appropriate user on cyclops to have access to its data. Within the client properties for ‘medusa’, the remote access field resembles the following:

Remote Access properties for directed recoveries

Note that users can be specified either in the old method (as per the above), or in the newer standard. (The above user specification, in the new standard, would be “user=preston,host=cyclops”).

Once this has been set up, the recovery program can be run on the client cyclops. You can always see whether permissions are working for directed recoveries right from the outset of a recovery, as the recovery program will list, in the “Select Source” dialog, all hosts that the client has permission to access data from:

Selecting source client in directed recovery

Thus, in order to start a directed recovery once permissions are set up, all you have to do is change the source client from the current host (cyclops) to the client you want to recover from (in this case, medusa), and you’re already well progressed down the directed recovery path.

Having selected the client we want to recover from, we can then select the client we want to recover to:

Selecting target client in directed recovery

With that done, we can now browse the source client for files we want to recover, selecting them the same way we would select files from a local client:

Selecting files for directed recovery

Note that at the bottom of the window we can see in the status area a statement indicating that we’re recovering from medusa to cyclops.

Now, once you’ve selected files to recover, remember that in directed recoveries it’s usually a very, very good idea to change where you want to recover to on the destination – often you don’t want it to be the same as the original backup location on the source. To do this of course, use the “Recover Options” item in the “Options” menu:

Relocating files for directed recovery

With that done, you can start the recovery. Here are the recovery details from our example case:

Directed recovery complete

There you go, that’s all there is to it; as you can see, a directed recovery is really no more difficult than a regular recovery in NetWorker.

To close off though, here are some tips:

  1. Cross platform directed recoveries are no longer supported in NetWorker (they were, for a time, in NetWorker 6, but posed too much of a security issue). While the control client may be of a different platform, you must ensure that your source and destination clients are of the same platform type – Unix/Linux to Unix/Linux, and Windows to Windows.
  2. All 3 of the source, destination and control clients must be defined clients within NetWorker. While previously this was not required, in more recent versions of NetWorker it has been introduced to increase the security of directed recoveries.
  3. Watch out for special savesets, particularly with Windows. Don’t try, for instance, to run a directed recovery from the backup server of a client’s SYSTEM FILES: saveset back out to the client. I believe these days NetWorker warns you, but in the past it would overwrite the control client’s SYSTEM FILES: saveset with the one being recovered, since these special savesets didn’t support directed recovery.
  4. Be sure to decide in advance, for directed recoveries, how to handle duplicate files (i.e., files already existing on the target) when the control and target clients aren’t the same host – NetWorker won’t allow interactive prompting for duplicate files when the control and target clients are different.
  5. In Unix environments, directed recoveries are a great way of getting out of particularly sticky situations – e.g., someone accidentally clobbering the /etc/passwd or /etc/shadow file. Push the recovery back out to the client in question to restore access.
  6. When using directed recoveries for binaries, make sure the source and destination clients are the same operating system type and bitness.
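The same recovery can be driven from the command line, where the advice about relocating files and deciding duplicate handling in advance becomes explicit options. The sketch below only assembles an argument list; it does not run anything. It assumes the `recover` options as I understand them from the man page (`-s` server, `-c` source client, `-R` destination client, `-d` relocation directory, `-i` with N, Y or R for skip, overwrite or rename, and `-a` for a non-interactive run), so check them against the documentation for your NetWorker version. The helper name, the server name `backupsrv` and the paths are invented for the example.

```python
import shlex


def build_directed_recover(server, source, destination, relocate_to, paths, on_duplicate="N"):
    """Assemble (but do not run) a directed 'recover' command line.

    on_duplicate: 'N' = skip existing files, 'Y' = overwrite, 'R' = rename.
    The flag meanings are an assumption to verify against your recover man page.
    """
    if on_duplicate not in {"N", "Y", "R"}:
        raise ValueError("on_duplicate must be one of 'N', 'Y' or 'R'")
    if not paths:
        raise ValueError("at least one path to recover is required")
    return [
        "recover",
        "-s", server,           # backup server
        "-c", source,           # source client (where the backup was taken)
        "-R", destination,      # destination client (where data is written)
        "-d", relocate_to,      # relocate rather than overwrite the original path
        f"-i{on_duplicate}",    # duplicate handling decided up front
        "-a", *paths,           # non-interactive recovery of the listed paths
    ]


cmd = build_directed_recover("backupsrv", "medusa", "cyclops", "/tmp/from-medusa", ["/etc/hosts"])
print(shlex.join(cmd))
```

Building the command as a list keeps the duplicate-handling choice mandatory, which mirrors the restriction that NetWorker cannot prompt interactively when the control and target clients differ.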
 

I’ll preface this by saying that I’m not from Victoria; I hail from New South Wales in Australia, and there’s usually quite a strong rivalry between the two states. Personally, I don’t subscribe to that rivalry – the reasons behind it have absolutely no pull for me.

Traditional rivalry aside though, what can’t be argued about is that Tourism Victoria produces simply the most amazing TV advertisements. I don’t need these ads to convince me of the merits of visiting Victoria, but I do really love their ads.

Two in particular are:

Honestly, if you’re looking for somewhere to holiday, immerse yourself in Victoria.

 

Perhaps one of the most common mistakes that companies can make is to focus on their backup window. You might say this is akin to putting the cart before the horse. While the backup window is important, in a well designed backup system, it’s actually only of tertiary importance.

Here’s the actual order of importance in a backup environment:

  1. Recovery performance.
  2. Cloning (duplication) performance.
  3. Backup performance.

That is, the system must be designed to:

  1. First ensure that all data can be recovered within the required timeframes,
  2. Second ensure that all data that needs to be cloned is cloned within a suitable timeframe to allow off-siting,
  3. Third ensure that all data is backed up within the required backup window.

Obviously for environments with well considered backup windows (i.e., good reasons for the backup window requirements), the backup window should be met – there’s no question about that. However, meeting the backup window should not be done at the expense of impacting either the cloning window or the recovery window.

Here’s a case in point: block level backups of dense filesystems often allow for much smaller backup windows – however, due to the way that individual files are reconstructed (read from media, reconstruct in cache, copy back to filesystem), they do this at the expense of required recovery times. (This also goes to the heart of what I keep telling people about backup: test, test, test.)

The focus on the recovery performance in particular is the best possible way (logically, procedurally, best practices – however you want to consider it) to drive the entire backup system architecture. It shouldn’t be a case of how many TB per hour you want to back up, but rather, how many TB per hour you need to recover. Design the system to meet recovery performance requirements and backup will naturally follow*.

If your focus has up until now been the backup window, I suggest you zoom out so you can see the bigger picture.


* I’ll add that for the most part, your recovery performance requirements shouldn’t be “x TB per hour” or anything so arbitrary. Instead, they should be decided by your system maps and your SLAs, and instead should focus on business requirements – e.g., a much more valid recovery metric is “the eCommerce system must be recovered within 2 hours” (that would then refer to all dependencies that provide service to and access for the eCommerce system).
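To make the footnote concrete, a throughput figure can be derived from a business requirement rather than picked arbitrarily. The sketch below is a back-of-the-envelope calculation only: the dependency sizes, the 2 hour target and the half hour of non-transfer overhead are invented numbers, and the function name is hypothetical.

```python
def required_recovery_rate(total_gb: float, rto_hours: float, overhead_hours: float = 0.0) -> float:
    """GB/hour that must be sustained to restore total_gb within the recovery target.

    overhead_hours covers time that is not data transfer (media loads,
    application restart, verification) and is subtracted from the target.
    """
    usable_hours = rto_hours - overhead_hours
    if usable_hours <= 0:
        raise ValueError("no time left for data transfer within the recovery target")
    return total_gb / usable_hours


# Hypothetical sizes for everything the eCommerce system depends on.
dependencies_gb = {"web tier": 50, "application tier": 150, "database": 400}

rate = required_recovery_rate(sum(dependencies_gb.values()), rto_hours=2, overhead_hours=0.5)
print(f"Sustained recovery rate needed: {rate:.0f} GB/hour")
```

The point of working this way round is that the number falls out of the SLA and the system map; the backup architecture is then sized to meet it.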

 

When Snow Leopard first came out, I was reasonably impressed with how easily NetWorker continued to operate with it – and for desktop users and administrators of fixed-location servers, that should remain the case.

For laptop users though, it’s turning out to be a slightly different story. My ongoing experience now is that if I switch locations repeatedly (e.g., home to work, work to customer site, customer site to work), the NetWorker client daemons eventually get so bogged down that it’s necessary to reboot to get back to working backups. In fact, a couple of times I’ve needed to go so far as to reinstall NetWorker on my laptop in order to get it running again smoothly. (That’s using NetWorker 7.5.1.)

If you’ve got mobile users upgraded to Snow Leopard who are now experiencing backup problems, a reboot (unfortunately) may be your first port of call – the nature of the daemon hang-up seems to prevent proper process shutdown, which in turn prevents the daemons from properly restarting. If the reboot fails, a client reinstall should fix it.

From my experience so far, it seems to only happen when locations are changed multiple times.

 

A while ago, I ran a post titled Ethical Obligations of Backup Administrators. Following up from that now I want to talk about the procedural obligations implicit to working in the role of being a backup administrator.

Now, to start with, if you think that the primary procedural obligation of a backup administrator is to ensure that the backups work or run, then you need to think more about the end obligation than the start obligation. (This is a primary topic of consideration in my book.)

Before I set out the procedural obligations, I need to define recoverable. You may think this is a self-obvious definition – however, if it were, a lot of problems that regularly occur in backup systems wouldn’t happen at all. Thus, by recoverable I mean the following:

  1. The item that was backed up can be retrieved from the backup media.
  2. The item that is retrieved from the backup media is usable as a replacement to the data that was backed up.
  3. The item can be retrieved within the required window.

A backup should not be deemed to be recoverable unless it meets all three of the above requirements. No ifs, no buts, no maybes. (Indeed, it’s worth noting that many “soft” recovery failures are caused by a failure to meet the third requirement – in mission critical systems, getting the data back in time is just as important as getting the data back at all.)
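The all-three-or-nothing rule is easy to encode when recording the outcome of a recovery test. This is only a sketch of the definition above; the class and field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTest:
    """Outcome of one recovery test (field names are illustrative)."""
    retrieved: bool          # the item came back from the backup media
    usable: bool             # the retrieved item works as a replacement
    elapsed_minutes: float   # how long the recovery took
    window_minutes: float    # how long it was allowed to take

    def recoverable(self) -> bool:
        # All three requirements must hold; two out of three is a failure.
        return self.retrieved and self.usable and self.elapsed_minutes <= self.window_minutes


# Data came back intact but too late: a "soft" failure, and still a failure.
late = RecoveryTest(retrieved=True, usable=True, elapsed_minutes=190, window_minutes=120)
print(late.recoverable())
```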

Since most people work well with lists, I’ll define these procedural obligations as a list, ordered in priority starting at the highest:

  1. To ensure that all required data is recoverable. By “data” I’m not just referring to raw data, but all items, files, information, databases, systems, etc., designated as requiring recovery.
  2. To maintain a zero error policy. There is no such thing as 100% certainty, but the closest you can get to it is by maintaining a zero error policy. In essence, by maintaining a zero error policy, you become immediately aware of any issues that may compromise the above rule.
  3. To maintain documentation for the environment. No system is complete without documentation. In particular, if someone with adequate skills cannot interact with it after reading the documentation, then the system is not documented and is not a system.
  4. To maintain an issues register. This is somewhat implicit in the maintenance of a zero error policy, but it is worth remembering that not all issues in a backup system are to do with errors. An issue may be that department heads approve of, or insist on, non-standard backups, or that a system went into production without adequate testing, etc.
  5. To be across ongoing capacity management and forecasting requirements. A backup system can’t work reliably if it could halt at any moment due to capacity constraints or minor data growth. Thus, the backup administrator must have a finger on the pulse of the capacity of the system.
  6. To maintain reports. A backup system does not work in isolation, and thus a backup administrator must ensure that reports (both daily/operational and long term/management) are accurate and timely.
  7. To document all data that is not required for recovery. There should be no “unknowns” in a backup system. Thus, any systems or data that are designated to not require recovery (e.g., QA systems) must be documented as such, and periodically rechecked to confirm this remains the case.

As I said from the outset, many of these obligations are implicit to the role of being a backup administrator. However, for organisations wanting to formalise their processes and their role descriptions, thus achieving higher guarantees of reliability within their backup system, clearly documenting these obligations is vital.

 

A few days ago a customer was having a rather odd problem. They’re currently running NetWorker 7.3.3 and getting ready to jump directly to NetWorker 7.5.1, but to do so they wanted to first run up a NetWorker 7.5.1 server and confirm current client types, databases, etc., will back up without issue*.

So the customer installed NetWorker 7.5.1 on a new Linux host, created some devices and pools, but then encountered a particularly odd problem when they went to create the clients. NMC would allow them to fill in all the properties for the client, but when they clicked OK in the new client dialog box, nothing would happen. No errors were produced, but nor were any clients actually created.

When they raised this with me I was a little puzzled for a few minutes, then asked if they were using the NMC that comes with NetWorker 7.5.1, or the NMC that comes with 7.3.3 and had just added the new server to the control zone.

The answer was that the control zone for the existing NMC that came with NetWorker 7.3.3 was just extended to include the 7.5.1 server.

For pools, devices and groups this was not a problem – these were all successfully created on the 7.5.1 server using the 7.3.3 NMC. However, when it came to clients, it wouldn’t work.

The reason is quite simple – as new features and functions are added to NetWorker over time, different fields within a configuration resource may or may not become mandatory. Some of the time this is obvious, because we’re required to fill in certain fields – e.g., client names, schedules, etc. However, in other instances, NetWorker has predefined defaults that it slots into place if a value isn’t entered – e.g., parallelism, priority, browse/retention time, etc. Just because defaults are put into place however doesn’t mean that fields are any less mandatory – it’s all about allowing you to create resources quicker.

So, what’s all this got to do with differing NMC/NSR versions? In short, everything!

You see, what happened for this customer is that between NetWorker 7.3 and 7.5, there has been a raft of client based functionality added – e.g., data deduplication, support for defining a client as being virtual, etc.

Undoubtedly some of these new features have mandatory values – so that if the server is probing details for the clients, it can safely request say, dedupe status or virtual status without worrying about getting an (undefined!) style response. Each version of NetWorker is “aware”, via base configuration, what fields must be supplied when creating a new resource, and thus, the scenario for this customer would have been:

  1. Fill in client properties in NMC 7.3.3
  2. Attempt “client create” 7.3.3 -> 7.5.1.
  3. The 7.5.1 server reviews the proposed client resource and,
  4. The 7.5.1 server rejects the proposed client resource as not having all the mandatory fields filled in.
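The scenario can be modelled in a few lines. This is a toy illustration of the reasoning, not how NetWorker is implemented: the attribute names and the per-version mandatory sets are hypothetical, chosen only to mirror the dedupe and virtual-client examples mentioned above.

```python
# Hypothetical mandatory attributes per server version (not NetWorker's real schema).
MANDATORY = {
    "7.3": {"name", "schedule"},
    "7.5": {"name", "schedule", "dedup_status", "virtual_client"},
}


def missing_attributes(resource: dict, server_version: str) -> set:
    """Return the mandatory attributes the proposed resource does not supply."""
    return MANDATORY[server_version] - set(resource)


def create_client(resource: dict, server_version: str) -> dict:
    """Accept the resource only if every mandatory attribute is present."""
    missing = missing_attributes(resource, server_version)
    if missing:
        raise ValueError(f"client rejected, missing: {sorted(missing)}")
    return resource


# An older console only knows about the older set of fields.
from_old_console = {"name": "medusa", "schedule": "Default"}

print(missing_attributes(from_old_console, "7.3"))
print(sorted(missing_attributes(from_old_console, "7.5")))
```

In this toy model the rejection raises an error that names the missing fields, which is exactly the feedback the customer did not get.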

Should NetWorker/NMC have provided an error to explain what was going wrong? Undoubtedly that would have been good, and I’d suggest that NMC/NSR should be able to better communicate resource creation/update failure in these circumstances. However, that being said, the fundamental problem remained the same – the version of NMC in use couldn’t create new clients because it wasn’t supplying all the mandatory details to the more recent version of NetWorker.

In many small sites, the NMC server and the NetWorker server are on the same host, and are thus upgraded in lock-step. However, for sites where the NMC server is installed on another host, this is a valuable lesson – unless you have a very valid reason, don’t run a version of NMC that matches an earlier version of NetWorker than the current server version. It may work (mostly), but if it does fail, it’s unlikely to be immediately obvious why it’s failing.


* This is what I’d call an excellent upgrade policy – you can read the release notes until they’re 100% memorised, but nothing quite beats actually running up your own test server.

 

I’m pleased to report that IDATA Tools v4.1 is now available. This new version features a host of updates, including but not limited to:

  • New utilities:
    • mediafree – Designed for use in VTL environments, this utility allows you to free up media within VTLs on the basis of all savesets on individual volumes (a) exceeding a user-nominated time since generation, and (b) having clones in all user nominated pools. This can be run interactively or automated using command line options.
    • backup-report – Designed to produce and email a daily report of all backups generated the previous day, delivering for each backup the client, saveset name, size, start and finish time, pools and volumes written to. This can be delivered in one of CSV, HTML or Excel format, with HTML and Excel format including totals, etc. While the default execution is for the previous day, the actual timeframe can be user specified. A sample HTML report covering weekend backups on a lab server can be seen here.
  • Reporting enhancements:
    • Various utilities have been updated to support non-US date formats as a configuration option.
    • Utility recyclable-volumes now has an option to report recyclable volumes by location rather than pool; for “lights out” style environments this allows quick checks of available media per jukebox. Additionally, sites that make use of the NetWorker location field will be able to quickly see what volumes in external storage have become recyclable as well.
  • Configuration and documentation enhancements:
    • Core utilities that require configuration file setup now include a -H help option which produces a sample configuration class; this sample class can then be copied and pasted into the configuration file and adapted to suit local needs.
    • Some previously included features were inadequately documented; these have been corrected.
  • And of course, reported bugs have also been fixed.

For more information on all the utilities in IDATA Tools, check out my original post on them, Turbocharged Administration with IDATA Tools, and the announcement for IDATA Tools v4.

 

In my opinion (and after all, this is my blog), there’s a fundamental misconception in the storage industry that backup is a part of Information Lifecycle Management (ILM).

My take is that backup has nothing to do with ILM. Backup instead belongs to a sister (or shadow) activity, Information Lifecycle Protection – ILP. The comparison between the two is somewhat analogous to the comparison I made in “Backup is a Production Activity” between operational production systems and infrastructure support production systems; that is, one is directly related to the operational aspects of the data, and the other exists to support the data.

Here’s an example of what Information Lifecycle Protection would look like:

Information Lifecycle Protection

Obviously there’s some simplification going on in the above diagram – for instance, I’ve encapsulated any online storage based fault-protection into “RAID”, but it does serve to get the basic message across.

If we look at say, Wikipedia’s entry on Information Lifecycle Management, backup is mentioned as being part of the operational aspects of ILM – this is actually a fairly standard definition of the perceived position of backup within ILM; however, standard definition or not, I have to disagree.

At its heart, ILM is about ensuring correct access and lifecycle retention policies for data: neither of these core principles encapsulates the activities in information lifecycle protection. ILP on the other hand is about making sure the data remains available to meet the ILM policies. If you think this is a fine distinction to make, you’re not necessarily wrong. My point is not that there’s a huge difference, but that there’s an important difference.

To me, it all boils down to a fundamental need to separate access from protection/availability, and the reason I like to maintain this separation is how it affects end users, and the level of awareness they need to have for it. In their day-to-day activities, users should have an awareness of ILM – they should know what they can and can’t access, they should know what they can and can’t delete, and they should know where they will need to access data from. They shouldn’t however need to concern themselves with RAID, they shouldn’t need to concern themselves with snapshots, they shouldn’t need to concern themselves with replication, and they shouldn’t need to concern themselves with backup.

NOTE: I do, in my book, make it quite clear that end users have a role in backup in that they must know that backup doesn’t represent a blank cheque for them to delete data willy-nilly, and that they should know how to request a recovery; however, in their day to day job activities, backups should not play a part in what they do.

Ultimately, that’s my distinction: ILM is about activities that end-users do, and ILP is about activities that are done for end-users.

 

Anyone who has either an understanding of the role that computers played in World War II, or formal training in computer science, will instantly recognise the name Alan Turing. An incredibly intelligent mathematician and one of the founding fathers (if you will) of computer science, Alan Turing was a giant of his time.

However, his profound contribution to the start of computer science and more importantly, the code breaking in World War II, was not enough to save him from an utterly draconian punishment for the crime of being who he was. This punishment, and the extreme psychological pressures that came with it, led to his suicide in 1954.

So while his punishment and his suicide should never have happened, it’s gratifying to at last see an official apology for his treatment come from the British government.

 

In order to speed up jukebox operations, NetWorker maintains a cache, or a map, if you will, of the current expected jukebox state based on the operations that have happened since it was last fully queried. This avoids having to do (time) costly SCSI probes before every operation.

(This, for what it’s worth, is why you can’t have another process, or another person, playing with the jukebox as well as NetWorker. For instance, a customer once had their jukebox accessible to all the developers on-site. They found on average the jukebox got into a terrible state several times a day, and thought they had a lemon of a product (either NetWorker or the STK L700) until they found out that having developers open the library door, arbitrarily pull tapes out and put new tapes in was not a good idea.)

Coming back to jukeboxes though, there are times when the cache is out of sync with reality. A few of the more common scenarios where this will happen are:

  • In disaster recovery situations
  • In situations where someone has manually moved around media
  • In situations where NetWorker has lost track of state due to a lengthy timeout on an error

In situations such as these, there’s an invaluable tool called sjirdtag that can come to the rescue. Rather than consulting NetWorker’s cached view of the library, sjirdtag delves down into what the library describes as its own content. I.e., it’s like peeking inside the library without having to leave your desk.

In order to use sjirdtag, you need to know the SCSI control port of the library; this is reported in the library properties in NetWorker management console, or you can find it out relatively quickly via inquire:

[root@tara ~]# inquire -l

-l flag found: searching all LUNs, which may take over 10 minutes per adapter
 for some fibre channel adapters.  Please be patient.

scsidev@0.0.0:STK     L700            5500|Autochanger (Jukebox), /dev/sg1
                                           S/N:    XYZZY     
                                           ATNN=STK     L700            XYZZY     
                                           WWNN=5123456003030303
scsidev@0.1.0:QUANTUM SDLT600         5500|Tape, /dev/nst0
                                           S/N:    ZF7584364
                                           ATNN=QUANTUM SDLT600         ZF7584364
                                           WWNN=5123456003030303

In this case, our library (a VTL presenting itself as an STK L700) is on scsidev@0.0.0. So, when we want to check the contents of the library, we run the command sjirdtag 0.0.0 – which looks like the following:

[root@tara ~]# sjirdtag 0.0.0
Tag Data for 0.0.0, Element Type DATA TRANSPORT:
        Elem[001]: tag_val=0 pres_val=1 med_pres=0 med_side=0
Tag Data for 0.0.0, Element Type STORAGE:
        Elem[001]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800843S3                       >
        Elem[002]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800844S3                       >
        Elem[003]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800845S3                       >
        Elem[004]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800846S3                       >
        Elem[005]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800847S3                       >
        Elem[006]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800848S3                       >
        Elem[007]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800849S3                       >
        Elem[008]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800850S3                       >
        Elem[009]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800851S3                       >
        Elem[010]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800852S3                       >
        Elem[011]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800853S3                       >
        Elem[012]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800854S3                       >
        Elem[013]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800855S3                       >
        Elem[014]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800856S3                       >
        Elem[015]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800857S3                       >
        Elem[016]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800858S3                       >
        Elem[017]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800859S3                       >
        Elem[018]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800860S3                       >
        Elem[019]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800861S3                       >
        Elem[020]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800862S3                       >
        Elem[021]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG990S3                       >
        Elem[022]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG991S3                       >
        Elem[023]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG992S3                       >
        Elem[024]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG993S3                       >
        Elem[025]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG994S3                       >
        Elem[026]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG995S3                       >
        Elem[027]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG996S3                       >
        Elem[028]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG997S3                       >
        Elem[029]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG998S3                       >
        Elem[030]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<BIG999S3                       >
        Elem[031]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<CLN001L1                       >
        Elem[032]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<CLN002L1                       >
Tag Data for 0.0.0, Element Type MEDIA TRANSPORT:
        Elem[001]: tag_val=0 pres_val=1 med_pres=0 med_side=0
Tag Data for 0.0.0, Element Type IMPORT/EXPORT:
        Elem[001]: tag_val=0 pres_val=1 inp_enab=1 exp_enab=1 access=1 full=0 imp_exp=1
        Elem[002]: tag_val=0 pres_val=1 inp_enab=1 exp_enab=1 access=1 full=0 imp_exp=1
        Elem[003]: tag_val=0 pres_val=1 inp_enab=1 exp_enab=1 access=1 full=0 imp_exp=1
        Elem[004]: tag_val=0 pres_val=1 inp_enab=1 exp_enab=1 access=1 full=0 imp_exp=1

For those who are unfamiliar with sjirdtag, let’s break this up into the four sections presented (using the capitalisation in the output – not shouting):

  • DATA TRANSPORT – Refers to the tape drives within the library – i.e., the units responsible for transporting the data.
  • STORAGE – The slots used by the library for storage of cartridges. This does not refer to the slot(s) in the CAP/MAS.
  • MEDIA TRANSPORT – The robot head(s). There’ll be one per robot head.
  • IMPORT/EXPORT – The contents of the slots in the CAP/MAS.

If you’re wondering about those element numbers, they’re essentially the positions or numbers of the units as assigned by the library. In particular, for the drives (DATA TRANSPORT) section, these refer to the drives in order as they are presented by the tape library; this means that if your operating system drive mappings don’t match the library sequence, the output here also won’t match the operating system sequence of devices.

Now for each element other than the CAP/MAS areas, we get the following selection of information:

tag_val=[0|1] pres_val=[0|1] med_pres=[0|1] med_side=[0|1]

Each of these items means:

  • tag_val – Indicates that there’s SCSI tag data for that element. 1 for yes, 0 for no.
  • med_pres – Jukebox state indicates that there is media present in this location. 1 for yes, 0 for no.
  • pres_val – A bit of an airy-fairy value; if set to 1, then it means that the med_pres value should be fairly believable. If set to 0 but the med_pres value is 1, then while there may be media present, there may also be an error condition. If set to 0, and med_pres is set to 0, then it also means that the med_pres value should be fairly believable.
  • med_side – For jukeboxes/media that support double-sided media (e.g., older optical disk libraries), this indicates which side of the media is in use; for tape based libraries, this will always be 0.
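The interplay between pres_val and med_pres reduces to three outcomes. The function below is a sketch of the interpretation given in the list above, not of anything sjirdtag itself does; the function name and the wording of the results are invented.

```python
def describe_element(pres_val: int, med_pres: int) -> str:
    """Interpret the pres_val/med_pres pair for one library element."""
    if pres_val == 0 and med_pres == 1:
        # Media is reported, but the report cannot be fully trusted.
        return "suspect: media may be present, or there may be an error condition"
    # In every other combination med_pres is treated as believable.
    return "media present" if med_pres == 1 else "empty"


for pres_val, med_pres in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(pres_val, med_pres, describe_element(pres_val, med_pres))
```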

For any element that has a volume with a barcode, this will be shown on the line underneath the element details with the format:

VolumeTag=<PCL                 >

For our import/export regions, the additional options (inp_enab, exp_enab, access, full and imp_exp) are effectively undocumented, but my assumptions about these items are:

  • inp_enab – Slot can be used for import.
  • exp_enab – Slot can be used for export.
  • access – Slot is accessible.
  • imp_exp – Slot is an import/export slot.

(The other option, “full”, most definitely indicates whether the slot is occupied or not.)
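Because the output is so regular, it is straightforward to turn into a data structure for scripted checks. The parser below is a minimal sketch that assumes the layout shown in the sample output in this post; other NetWorker versions or libraries may format things differently, and the function name is made up.

```python
import re

SECTION = re.compile(r"^Tag Data for (\S+), Element Type (.+):$")
ELEMENT = re.compile(r"^Elem\[(\d+)\]:\s+(.*)$")
VOLTAG = re.compile(r"^VolumeTag=<(.*)>$")


def parse_sjirdtag(text: str) -> dict:
    """Return {section name: {element number: {'flags': {...}, 'volume': str or None}}}."""
    sections = {}
    current = None   # elements of the section being read
    last = None      # most recent element, so a VolumeTag line can attach to it
    for raw in text.splitlines():
        line = raw.strip()
        if m := SECTION.match(line):
            current = sections.setdefault(m.group(2), {})
            last = None
        elif (m := ELEMENT.match(line)) and current is not None:
            flags = {key: int(value) for key, value in
                     (pair.split("=") for pair in m.group(2).split())}
            last = {"flags": flags, "volume": None}
            current[int(m.group(1))] = last
        elif (m := VOLTAG.match(line)) and last is not None:
            last["volume"] = m.group(1).strip()  # barcode is space padded
    return sections


sample = """\
Tag Data for 0.0.0, Element Type DATA TRANSPORT:
        Elem[001]: tag_val=0 pres_val=1 med_pres=0 med_side=0
Tag Data for 0.0.0, Element Type STORAGE:
        Elem[001]: tag_val=1 pres_val=1 med_pres=1 med_side=0
                   VolumeTag=<800843S3                       >
Tag Data for 0.0.0, Element Type IMPORT/EXPORT:
        Elem[001]: tag_val=0 pres_val=1 inp_enab=1 exp_enab=1 access=1 full=0 imp_exp=1
"""

state = parse_sjirdtag(sample)
print(state["STORAGE"][1]["volume"])
print(state["DATA TRANSPORT"][1]["flags"]["med_pres"])
```

With the output in this form, questions such as "is there a tape in drive 1?" or "is anything sitting in the CAP?" become dictionary lookups.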

As evidenced by the “airy-fairy” nature of the pres_val tag, there’s no 100% guarantee that this information is physically accurate. However, it is an accurate reflection of the state that the library thinks it’s in, and thus is an accurate reflection of how the library will behave in response to requested operations. Furthermore, if the state shown by sjirdtag differs from the state shown by nsrjb, then it’s a good indication that it’s time to reset/reinventory the library. I.e., time to run:

# nsrjb -HEvvv
# nsrjb -II

(The reset instructs NetWorker to throw away its state information, tell the library to reinitialise itself, and then refresh the volume state. The inventory command shown assumes a barcode-supported library with barcoded volumes.)
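The "does the library agree with NetWorker?" check amounts to comparing two slot-to-barcode maps. The sketch below assumes you already have both views as dictionaries; how the NetWorker view is extracted from nsrjb output is deliberately left out, and the function name and sample data are hypothetical.

```python
def find_mismatches(library_view: dict, networker_view: dict) -> dict:
    """Map slot -> (library barcode, NetWorker barcode) wherever the two views differ.

    None means the slot is reported as empty, or is missing from that view.
    """
    slots = sorted(set(library_view) | set(networker_view))
    return {
        slot: (library_view.get(slot), networker_view.get(slot))
        for slot in slots
        if library_view.get(slot) != networker_view.get(slot)
    }


# Hypothetical data: what the library reports versus what NetWorker has cached.
library_view = {1: "800843S3", 2: "800844S3", 3: None}
networker_view = {1: "800843S3", 2: None, 3: "800845S3"}

mismatches = find_mismatches(library_view, networker_view)
if mismatches:
    print("Views disagree; a reset and reinventory is probably due:", mismatches)
```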

Things that I routinely use (or get customers to use) sjirdtag for include:

  • Checking to see if there is a tape in a drive that NetWorker thinks is empty.
  • Checking to see if the tape NetWorker thinks is in a drive really is in the drive.
  • Checking to see if operators at a remote library have loaded media into the CAP/MAS.
  • Checking to see if there is a tape stuck in the robot gripper.
  • Finding the bootstrap volume when a disaster recovery (mmrecov) is required.

If you’ve not used sjirdtag before, it’s worthwhile scheduling a time where there’s minimal activity in the library so you can check it out.

© 2012 The NetWorker Blog