Resolutions Check-in

In December last year I posted “7 new years backup resolutions for companies”. Since it’s the end of January 2012, I thought I’d check in on those resolutions and suggest where a company should be up to on them, as well as offering some next steps.

  1. Testing – The first resolution related to ensuring backups are tested. By now at least an informal testing plan should be in place if none were before. The next step will be to deal with some of the aspects below so as to allow a group to own the duty of generating an official data protection test plan, and then formalise that plan.
  2. Duplication – There should be documented details of what is and what isn’t duplicated within the backup environment. Are only production systems duplicated? Are only production Tier 1 systems duplicated? The first step towards achieving satisfactory duplication/cloning of backups is to note the current level of protection and expand outwards from that. The next step will be to develop tier guidelines to allow a specification of what type of backup receives what level of duplication. If there are already service tiers in the environment, this can serve as a starting point, slotting existing architecture and capability onto those tiers. Where existing architecture is insufficient, it should be noted and budgets/plans should be developed next to deal with these short-falls.
  3. Documentation – As I mentioned before, the backup environment should be documented. Each team that is involved in the backup process should have assigned at least one individual to write documentation relating to their sections (e.g., Unix system administrators would write the Unix backup and recovery guidelines, Windows system administrators would do the same for Windows, and so on). This should actually involve three people: the writer, the peer reviewer, and the manager or team leader who accepts the documentation as sufficiently complete. The next step after this will be to hand over documentation to the backup administrator(s) who will be responsible for collation, contribution of their sections, and periodic re-issuing of the documents for updates.
  4. Training – If staff (specifically administrators and operators) had previously not been trained in backup administration, a training programme should be in the works. The next step, of course, will be to arrange budget for that training.
  5. Implementing a zero error policy – First step in implementing a zero error policy is to build the requisite documents: an issues register, an exceptions register, and an escalations register. Next step will be to adjust the work schedules of the administrators involved to allow for additional time taken to resolve the ‘niggly’ backup problems that have been in the environment for some time as the switchover to a zero error policy is enacted.
  6. Appointing a Data Protection Advocate – The call should have gone out for personnel (particularly backup and/or system administrators) to nominate themselves for the role of DPA within the organisation, or if it is a multi-site organisation, one DPA per site. By now, the organisation should be in a position to decide who becomes the DPA for each site.
  7. Assembling an Information Protection Advisory Council (IPAC) – Getting the IPAC in place is a little more effort because it’s going to involve more groups. However, by now there should be formal recognition of the need for this council, and an informal council membership. The next step will be to have the first formal meeting of the council, where the structure of the group and the roles of the individuals within the group are formalised. Additionally, the IPAC may very well need to make the final decision on who is the DPA for each site, since that DPA will report to them on data protection activities.

It’s worth remembering at this point that while these tasks may seem arduous at first, they’re absolutely essential to a well-running backup system that actually meshes with the needs of the business. In essence: the longer they’re put off, the more painful they’ll be.

How are you going?

 

Obviously the NetWorker Blog gets a lot of referrals from search engines via people looking specifically for help on particular NetWorker issues they’re encountering. Even just in the last 8+ hours, here are just some of the search terms that people used:

nmc doesn’t start

restore networker aborted saveset

networker disk backup module

nsr_render_log command

nsr_render_log daemon.raw

networker centos support

39077:jbconfig: error, you must install the lus scsi passthrough driver before configuring

And the list goes on and on, on a daily basis. This was reflected in the Top 10 for 2011 (and indeed, the top 10 for every previous year, too).

I’ll let you all in on a little secret though: all of those tips, all of those NetWorker basics articles and how to use nsradmin user guides – they’re all just the tip of the iceberg when it comes to getting a working backup system in place.

You see, a lot of sites don’t have a backup system at all – they just have some backup software and backup hardware and configuration. That doesn’t represent a backup system at all. From my article, “What is a backup system?“, I provided this diagram to explain such beasts:

Backup system

As you can see, the technology (the backup software, hardware and configuration) represents just one entry point to having a backup system. The others though are all equally critical; and when you add them all in together, it becomes clear that a backup system will derive much of its success and reliability from the human and business factors.

The technology, you see, is the easiest part of the backup environment; and it’s also the part that’s most likely to appeal to IT people. If you were to graph how much time the average site spends on each of those activities, it would probably look like this:

Imbalanced backup systems

When in actual fact, it should look more like this:

Balanced backup system

The short description? If you chart the amount of time you spend on your backup “system”, and the Technology aspect (software, hardware, configuration) becomes a Pacman to the rest of the components, eating away at the rest of those facets, then you’ve got a cannibalistic environment that’s surviving as much on luck as on good design.

That’s why I bang on so much about backup theory – because all the latest and greatest technology in the world won’t help you at all if you don’t have everything else set up in conjunction with it:

  • The people involved need to know their roles, and participate in both the architecture of the environment and its ongoing operation;
  • The processes for use of the system must be well established;
  • The system must be thoroughly documented;
  • The system must be tested or you’ve got no way of establishing reliability;
  • The Service Level Agreements have to be established or else there’s no point whatsoever to what you’re doing.

Backup theory isn’t the boring part of a backup system; I’d suggest it’s actually the most interesting part of it. Just as I suggested that companies need to plan to follow some new years resolutions for backup systems, I’d equally suggest that the people involved in backups should start making it their goal to spend a balanced amount of time on the components that form a backup system.

If you don’t have the theory, you actually don’t have a system.

If you want to know more, you should treat yourself to my book (now available in Kindle format).

 

One of the core concepts I try to drive home in my book is that you don’t get a backup system by installing enterprise backup software.

Here’s a diagram to help explain what really goes into making a backup system:

Backup system

In short, you can have as much technology as you want, but without the rest of those pieces all you’ve got is a budget sink-hole.

If you want to understand how all these concepts fit together, you really should take the time to invest in my book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy“.

 

When something is going wrong in a NetWorker environment, the first thing you need to do is be able to run up some basic tests. If the issue has anything to do with NetWorker clients, you’ll want to be able to initiate a series of network, probe and index based tests. If you’ve got nothing scripted, ‘check-clients’ from IDATA Tools may very well be what you’re looking for.

As a command line tool, ‘check-clients’ can power through a suite of different tests and data gathering activities against your clients, all with minimal effort on your part. Let’s look at the tests that are currently available:

[root@nox bin]# check-clients -l
Test Name           Test Description
------------------- ------------------------------------------------------
client_ids          Returns client ID for each configured client
empty               Report clients with empty indices
index               Perform nsrck -L3 on each client
index_rebuild       Perform nsrck -L6 on each client
info                Retrieve client information
list_active         List all configured clients in active groups
list_all            List all clients currently configured
performance         Check backup performance via bigasm
ping                Ping each client
probe               Savgroup probe for each client
resolution          Test/confirm name resolution
rpcinfo             Test rpcinfo/portmapper access
used_space          Calculates used space for backups

Now technically, not all of the above are actually tests as such – for instance, the used_space option was one recently requested by a customer to report on all backups currently held by a backup server for a client. Running it on one of my lab machines, the output looks like the following:

[root@nox bin]# check-clients -g all_active -t used_space
============================================================
Running test: used_space (Calculates used space for backups)
============================================================
        Client                         Used Space (GB)
        ----------------------------   --------------------
        archon                                    362.60783
        faero                                       0.00000
        luyten                                      0.00000
        nox                                       544.40887
        ----------------------------   --------------------
                 Total for 4 clients              907.01669
        ----------------------------   --------------------

To me, that’s a combo test/information gathering option; specifically the customer was after this particular test so that they could spot any newly added clients that hadn’t been backing up (i.e., by having a “Used Space” of 0 GB).
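If you wanted to automate that check, a small filter along the following lines would pull the zero-usage clients out of the report. It is only a sketch: it assumes the two-column layout shown above (client name, then used space in GB), and the function name zero_space_clients is my own invention rather than part of check-clients.

```shell
#!/bin/sh
# zero_space_clients: read a used_space report on stdin and print the
# clients whose "Used Space (GB)" column is zero.
# Assumes the two-column layout shown above; the function name is illustrative.
zero_space_clients() {
    awk 'NF == 2 && $2 ~ /^[0-9]+(\.[0-9]+)?$/ && $2 + 0 == 0 { print $1 }'
}

# Sample taken from the lab output above. In practice you would pipe in:
#   check-clients -g all_active -t used_space | zero_space_clients
zero_space_clients <<'EOF'
        Client                         Used Space (GB)
        ----------------------------   --------------------
        archon                                    362.60783
        faero                                       0.00000
        luyten                                      0.00000
        nox                                       544.40887
        ----------------------------   --------------------
                 Total for 4 clients              907.01669
        ----------------------------   --------------------
EOF
```

The header, separator and total lines are skipped because they either have more than two fields or a non-numeric second field.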

Equally, there’s use in periodically running the “client_ids” test – running and keeping the output of this test will help you in any sticky situation where you suddenly need access to a previous client’s client ID:

[root@nox bin]# check-clients -a -t client_ids
=======================================================================
Running test: client_ids (Returns client ID for each configured client)
=======================================================================
        aralathan = 65100d33-00000004-464fcacc-464fcacb-00050000-c0a86404
        archon = 3f33ca7b-00000004-43a4837c-43a484d7-00030000-c0a80006
        asgard = 00b151ed-00000004-43a4837b-43a4837a-00010000-c0a80006
        djwmp = 5560bbf6-00000004-4910cd4b-4910cd4a-01961a00-3d2a4f4b
        faero = 76c06b0a-00000004-453e8e44-453e8e43-00310000-c0a86406
        loki = d3f277da-00000004-4857452f-4857452e-00020000-c0a86404
        luyten = 93166424-00000004-4a2f8cde-4a2f8cdd-01041a00-3d2a4f4b
        nimrod = d6454919-00000004-496aaadc-496aaadb-006f1a00-3d2a4f4b
        nox = 85acae6f-00000004-464fbdd1-464fbdd0-00010000-c0a86404
        valhalla = 61d3ca1e-00000004-495525db-4955299a-00051500-98e71c17

Moving on into actual test territory, multiple tests can be teamed up to do a chunk of information gathering in one command. For instance, combining a ping test and a name resolution test against all active clients is as simple as:

[root@nox bin]# check-clients -g all_active -t ping,resolution
=====================================
Running test: ping (Ping each client)
=====================================
	archon  (0 responses, expected 4)
	faero  (0 responses, expected 4)
	luyten  (4 responses)
	nox.pmdg.lab  (4 responses)

=======================================================
Running test: resolution (Test/confirm name resolution)
=======================================================

	archon
		Name: archon (archon.pmdg.lab) (192.168.100.1) 
		Name: archon.pmdg.lab (archon.pmdg.lab) (192.168.100.1) 
		Addr: 192.168.100.1 (archon.pmdg.lab) 

	faero
		Name: faero (faero.pmdg.lab) (192.168.100.10) 
		Name: faero.pmdg.lab (faero.pmdg.lab) (192.168.100.10) 
		Addr: 192.168.100.10 (faero.pmdg.lab) 

	luyten
		Name: luyten (luyten.pmdg.lab) (192.168.100.18) 
		Name: luyten.pmdg.lab (luyten.pmdg.lab) (192.168.100.18) 
		Addr: 192.168.100.18 (luyten.pmdg.lab) 

	nox.pmdg.lab
		Name: nox.pmdg.lab (nox.pmdg.lab) (192.168.100.4) 
		Name: nox (nox.pmdg.lab) (192.168.100.4) 
		Addr: 192.168.100.4 (balrog.pmdg.lab (unknown))

None of this is re-inventing the wheel of course, but being able to just run a single command that cycles through and tests every active client (or even all clients) is particularly useful.
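When the client list is long, you may only care about the hosts that fell short on the ping test. The sketch below filters for them. It assumes the tool only prints "expected N" when fewer replies came back than were sent, which is what the sample output suggests but is not something I have confirmed. The function name short_ping_clients is illustrative.

```shell
#!/bin/sh
# short_ping_clients: read ping test output on stdin and print the clients
# that returned fewer responses than expected.
# Assumption: "expected N" appears only on lines for clients that fell short.
short_ping_clients() {
    awk '/responses, expected/ { print $1 }'
}

# Sample lines from the output above; normally you would pipe in:
#   check-clients -g all_active -t ping | short_ping_clients
short_ping_clients <<'EOF'
        archon  (0 responses, expected 4)
        faero  (0 responses, expected 4)
        luyten  (4 responses)
        nox.pmdg.lab  (4 responses)
EOF
```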

Even performance testing is catered for with check-clients; reaching out to the clients, the utility can run bigasm tests automatically – a great way for easily testing where performance hits are happening on the network. For example, a quick/basic demo of this option is below:

[root@nox bin]# check-clients -c luyten,nox.anywebdb.com -b Staging -S 50 -t performance
===============================================================
Running test: performance (Check backup performance via bigasm)
===============================================================
        luyten (Solaris/UNIX style test)
                Backup 50 MB to Staging
                50 MB took 12 seconds (4.17 MB/s)
        nox.pmdg.lab (Linux/UNIX style test)
                Backup 50 MB to Staging
                50 MB took 3 seconds (16.67 MB/s)
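
The throughput figures in that output are simply megabytes divided by elapsed seconds. As a quick sanity check, the arithmetic can be reproduced as follows (the function name throughput_mb_s is just for illustration):

```shell
#!/bin/sh
# throughput_mb_s MB SECONDS: print MB/s to two decimal places.
# Returns non-zero if SECONDS is not positive.
throughput_mb_s() {
    awk -v mb="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) exit 1
        printf "%.2f\n", mb / secs
    }'
}

throughput_mb_s 50 12   # the luyten figures from the test above
throughput_mb_s 50 3    # the nox figures from the test above
```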

If you are looking around for a test kit option for NetWorker – and want access to a heap of other goodies at the same time – then ‘check-clients’ out of the IDATA Tools suite may very well be what you need.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

A is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
B is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
C is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
D is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
E is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
F is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
G is for Growth. Your backup environment should be scoped to handle at least 2 years' growth upon implementation. If it isn't, budgets haven't been established correctly.
H is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
I is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
J is for Jekyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jekyll might – not storm through them like Mr Hyde, flying apart.
K is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
L is for Logs. Treasure your logs. Don't throw them away too quickly, and make sure they're backed up too. With access to your logs, you can answer in 3 years' time why a backup from yesterday is proving problematic to recover from.
M is for Magnetic Tape. It's not going away any time soon. Don't kid yourself: you'll still be using it in backup and recovery systems for some time to come.
N is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
O is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
P is for Procedures; without them, you don't have a backup system at all.
Q is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
R is for Recovery, the most important facet of data protection.
S is for SLAs (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
T is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
U is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
V is for VTL. Whether you like the need for them or not, they're not going away any time soon.
W is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet your recovery windows, then your clone windows, then, and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
X is for X-Ray. If you can't X-Ray your backup status, drill down and see what happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
Y is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
Z is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short nor succinct, yet hopefully useful nonetheless.

 

Over at Daily WTF, there’s a new story that has two facets of relevance to any backup administrator. Titled “Bourne Into Oblivion“, the key points for a backup administrator are:

  • Cascading failures.
  • Test, test, test.

In my book, I discuss both the implications of cascading failures and the need to test within a backup environment. Indeed, my ongoing attitude is that if you want to assume something about an untested backup, assume it’s failed. (Similarly, if you want to make an assumption about an unchecked backup, assume it failed too.)

While normally in backup, cascading failures come down to situations such as “the original failed, and the clone failed too”, this article points out a more common form of data loss through cascading failures –  the original failure coupled with backup failure.

In the article, a shell script includes the line:

rm -rf $var1/$var2

Any long-term Unix user will shudder to think of what can happen with the above script. (I’d hazard a guess that a lot of Unix users have themselves written scripts such as the above, and suffered the consequences. What we can hope for in most situations is that we do it on well backed up personal systems rather than corporate systems with inadequate data protection!)

Something I’ve seen in several sites however is the unfortunate coupling of the above shell script with the execution of said script on a host that has read/write network mounted a host of other filesystems across the corporate network. (Indeed, the first system administration group I ever worked with told me a horror story about a script with a similar command run from a system with automounts enabled under /a.)

The net result in the story at Daily WTF? Most of a corporate network wiped out by a script run with the above command where a new user hadn’t populated either $var1 or $var2, making the script instead:

rm -rf /

You could almost argue that there’s already been a cascading failure in the above – allowing scripts to be written that have the potential for that much data loss and allowing said scripts to be run on systems that mount many other systems.
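
For what it’s worth, guarding against empty variables in a script like that is cheap. The following is a minimal sketch of one defensive pattern, not what the script in the story did or should have looked like. The function name safe_remove is hypothetical, and the demonstration only ever touches a scratch directory created with mktemp.

```shell
#!/bin/sh
# safe_remove BASE SUBDIR: refuse to call rm unless both parts are non-empty,
# so an unset variable can never collapse the path down to "/".
safe_remove() {
    base=${1-}
    sub=${2-}
    if [ -z "$base" ] || [ -z "$sub" ]; then
        echo "safe_remove: refusing to run with an empty path component" >&2
        return 1
    fi
    rm -rf -- "$base/$sub"
}

# Demonstration against a throwaway directory only.
scratch=$(mktemp -d)
mkdir "$scratch/old_data"
safe_remove "$scratch" old_data && echo "removed $scratch/old_data"
safe_remove "" "" || echo "empty arguments were rejected"
rmdir "$scratch"
```

A terser alternative is the standard shell expansion "${var1:?}/${var2:?}", which makes a non-interactive shell abort with an error if either variable is unset or empty. Neither approach replaces tested backups; they just remove one way of needing them.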

The true cascading failure however was that the backup media was unusable, having been repeatedly overwritten rather than replaced. Whether this meant that the backups ran after the above incident, or that the backups couldn’t recover all required data (e.g., running an incremental on top of a tape with a previous incremental on top of a tape with a previous full, each time overwriting the previous data), or that the tapes were literally unusable due to high overuse (or indeed, all 3), the results were the same – data loss coupled with recovery loss.

With backups not being tested periodically, such errors (in some form) can creep into any environment. Obviously in the case in this article, there’s also the problem that either (a) procedures were not established regarding rotation of media or (b) procedures were not followed.

The short of it: any business that collectively thinks that either formalisation of backup processes or the rigorous checking of backups is unnecessary is just asking for data loss.

 

A rather smart gent whom I used to work with at another company, Mark Harvey, has in his own time been working on an open source VTL implementation that can be used for testing/lab purposes. I.e., he’s not aiming for it to be the next competitor to EDLs, FalconStor, etc., but rather, something that people can use when needing to test jukebox functionality without wanting to carry a jukebox around with them.

While Mark has primarily focused on getting his VTL software working with NetBackup, recently he’s made some progress in getting it to work (with a couple of limitations) with NetWorker.

The current limitation is that NetWorker doesn’t quite like the identity of the virtual drives – it sees them all as having the same serial number, and prohibits creating multiple drives with the same serial number. (The VTL presents differing serial numbers, but NetWorker may be working on the WWNN, which is the same on each device…)

Limit yourself to one drive though, and you’re fine.

Getting the VTL installed and configured

To get started, you first need to download the VTL code – Mark hosts it at linuxvtl.googlepages.com.

My testing was with the 2009-06-09 tarball on a CentOS 5.3 virtual machine with NetWorker 7.5.1. I’m not going to repeat the installation instructions – I suggest you build the RPMs and install the sg3_utils package (required), following the instructions included in Mark’s package.

Before you actually configure a jukebox in NetWorker, you need to strip down the number of devices in the VTL to 1, and the instructions below are geared towards that. Assuming you’ve not yet started the VTL software:

Create /etc/mhvtl/device.conf

Mark’s /etc/init.d/mhvtl startup script will create this file if it doesn’t exist, but we want to manually configure the file to define only one device. Below is the device.conf file I’ve used:

[root@tara mhvtl]# cat device.conf

VERSION: 2

# VPD page format:
# <page #> <Length> <x> <x+1>... <x+n>

# NOTE: The order of records is IMPORTANT...
# The 'Unit serial number:' should be last (except for VPD data)
# i.e.
# Order is : Vendor ID, Product ID, Product Rev and serial number finally
# Zero, one or more VPD entries.
#
# Each 'record' is sperated by one (or more) blank lines.
# Each 'record' starts at column 1

Library: 0 CHANNEL: 0 TARGET: 0 LUN: 0
 Vendor identification: STK
 Product identification: L700
 Product revision level: 5500
 Unit serial number: XYZZY

Drive: 1 CHANNEL: 0 TARGET: 1 LUN: 0
 Vendor identification: QUANTUM
 Product identification: SDLT600         
 Product revision level: 5500
 Unit serial number: ZF7584364         
 Max density: 0x46
 VPD: b0 04 00 02 01 00

(Note – yes, you can specify the serial number above, but no, if you create a second device with a different serial number it doesn’t yet work.)
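
Since the single-drive restriction is easy to forget when editing the file later, a quick check before starting the VTL may help. This is a sketch only: it assumes each record starts at column 1 with "Drive:", as the comments in the file require, and the function name count_drives is mine.

```shell
#!/bin/sh
# count_drives FILE: print how many "Drive:" records an mhvtl device.conf defines.
# Relies on each record starting at column 1, as the file's own comments require.
count_drives() {
    grep -c '^Drive:' "$1"
}

# Default path is the one used in this article; pass another file as $1 to override.
conf=${1:-/etc/mhvtl/device.conf}
if [ -r "$conf" ]; then
    # grep -c exits non-zero when the count is 0, but still prints the count.
    drives=$(count_drives "$conf") || true
    if [ "$drives" -eq 1 ]; then
        echo "$conf: 1 drive defined"
    else
        echo "$conf: $drives drives defined, expected exactly 1" >&2
    fi
fi
```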

Defining the library contents

After creating the device config, you need to configure the library contents – this is done by creating the file /etc/mhvtl/library_contents. Mine looks like the following:

[root@tara mhvtl]# cat library_contents
# Define how many tape drives you want in the vtl..
# The 'XYZZY_...' is the serial number assigned to
# this tape device.

Drive 1: ZF7584364

# Place holder for the robotic arm. Not really used.
Picker 1:

# Media Access Port
# (mailslots, Cartridge Access Port, <insert your favourate name here>)
# Again, define how many MAPs this vtl will contain.
MAP 1:
MAP 2:
MAP 3:
MAP 4:

# And the 'big' on, define your media and in which slot contains media.
# When the rc script is started, all media listed here will be created
# using the default media capacity.
Slot 1:    800843S3
Slot 2: 800844S3
Slot 3: 800845S3
Slot 4: 800846S3
Slot 5: 800847S3
Slot 6: 800848S3
Slot 7: 800849S3
Slot 8: 800850S3
Slot 9: 800851S3
Slot 10: 800852S3
Slot 11: 800853S3
Slot 12: 800854S3
Slot 13: 800855S3
Slot 14: 800856S3
Slot 15: 800857S3
Slot 16: 800858S3
Slot 17: 800859S3
Slot 18: 800860S3
Slot 19: 800861S3
Slot 20: 800862S3
Slot 21:
Slot 22:
Slot 23:
Slot 24:
Slot 25:
Slot 26:
Slot 27:
Slot 28:
Slot 29:
Slot 30:
Slot 31: CLN001L1
Slot 32: CLN002L1

In the above configuration, we’ve got a library with 32 presented slots, with slots 1-20 occupied by writable tapes, and slots 31-32 occupied with cleaning cartridges. Feel free to manipulate the numbers as you wish. (If you’re wondering about the choice of barcode labels, I’m terribly predictable. Every time I start a sequence of barcode labels in examples, I always start with 800843.)
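
If you do want a different number of tapes, typing the slot lines by hand gets tedious. A small generator along these lines would produce them; it is a sketch under the assumption that barcodes are a plain number followed by a fixed suffix, as in my example. The function name gen_slots is illustrative.

```shell
#!/bin/sh
# gen_slots FIRST COUNT SUFFIX: print COUNT "Slot N: <barcode>" lines,
# numbering barcodes upwards from FIRST and appending SUFFIX to each.
# Assumes the numeric part has no leading zeros (shell arithmetic would
# otherwise treat it as octal).
gen_slots() {
    first=$1
    count=$2
    suffix=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'Slot %d: %d%s\n' $((i + 1)) $((first + i)) "$suffix"
        i=$((i + 1))
    done
}

# Reproduces the 20 occupied data slots from the file above.
gen_slots 800843 20 S3
```

The empty slots, the cleaning cartridges and the Drive, Picker and MAP entries would still need to be added around this output.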

Getting NetWorker and the VTL working together

Once the VTL has been configured, start it using the init script:

[root@tara mhvtl]# /etc/init.d/mhvtl start
vtllibrary process PID is 5315

So long as everything is working, you should see processes along the lines of:

[root@tara mhvtl]# ps -eaf | grep vtl
vtl       5310     1  0 08:56 ?        00:00:00 vtltape -q 1
vtl       5315     1  0 08:56 ?        00:00:00 vtllibrary -q 0

Looking in /opt/vtl, the default location for the VTL data, you should see the following files (suitably adjusted for any changes you make to barcodes/contents):

[root@tara mhvtl]# ls /opt/vtl
800843S3  800845S3  800847S3  800849S3  800851S3  800853S3
800855S3  800857S3  800859S3  800861S3  CLN001L1  800844S3
800846S3  800848S3  800850S3  800852S3  800854S3  800856S3
800858S3  800860S3  800862S3  CLN002L1
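
Rather than comparing that listing against library_contents by eye, the comparison can be scripted. The following is a minimal sketch: it assumes every occupied "Slot N: BARCODE" line should have a matching entry of the same name under the VTL data directory, which is what the listing above suggests. The function name missing_media is hypothetical.

```shell
#!/bin/sh
# missing_media CONTENTS DIR: print every barcode listed in a "Slot" line of
# CONTENTS that has no matching entry under DIR. Empty slots are ignored.
missing_media() {
    awk '$1 == "Slot" && NF >= 3 { print $3 }' "$1" |
    while read -r barcode; do
        [ -e "$2/$barcode" ] || echo "$barcode"
    done
}

# Paths are the defaults used in this article; the check is skipped elsewhere.
contents=/etc/mhvtl/library_contents
if [ -r "$contents" ] && [ -d /opt/vtl ]; then
    missing_media "$contents" /opt/vtl
fi
```

No output means every tape and cleaning cartridge named in library_contents was created.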

If we check the NetWorker inquire output, we get the following*:

[root@tara mhvtl]# inquire -l

-l flag found: searching all LUNs, which may take over 10 minutes per adapter
 for some fibre channel adapters.  Please be patient.

scsidev@0.0.0:STK     L700     5500|Autochanger (Jukebox), /dev/sg1
                                    S/N:    XYZZY     
                                    ATNN=STK     L700            XYZZY     
                                    WWNN=5123456003030303
scsidev@0.1.0:QUANTUM SDLT600  5500|Tape, /dev/nst0
                                    S/N:    ZF7584364
                                    ATNN=QUANTUM SDLT600         ZF7584364
                                    WWNN=5123456003030303

Assuming you get inquire output like the above, you next need to create your tape library. Below is the output of the jbconfig command:

[root@tara mhvtl]# jbconfig

Jbconfig is running on host tara.pmdg.lab (Linux 2.6.18-128.1.16.el5),
 and is using tara.pmdg.lab as the NetWorker server.

 1) Configure an AlphaStor Library.
 2) Configure an Autodetected SCSI Jukebox.
 3) Configure an Autodetected NDMP SCSI Jukebox.
 4) Configure an SJI Jukebox.
 5) Configure an STL Silo.

What kind of Jukebox are you configuring? [1] 2
14484:jbconfig: Scanning SCSI buses; this may take a while ...
Installing 'Standard SCSI Jukebox' jukebox - scsidev@0.0.0.

What name do you want to assign to this jukebox device? MHVTL
15814:jbconfig: Attempting to detect serial numbers on the jukebox and drives ...

15815:jbconfig: Will try to use SCSI information returned by jukebox to configure drives.

Turn NetWorker auto-cleaning on (yes / no) [yes]? yes

The following drive(s) can be auto-configured in this jukebox:
 1> sdlt600 @ 0.1.0 ==> /dev/nst0
These are all the drives that this jukebox has reported.

To change the drive model(s) or configure them as shared or NDMP drives,
 you need to bypass auto-configure. Bypass auto-configure? (yes / no) [no] no

Jukebox has been added successfully

The following configuration options have been set:

> Jukebox description to the control port and model.
> Autochanger control port to the port at which we found it.
> Networker managed tape autocleaning on.
> Barcode reading to on.
> Volume labels that match the barcodes.
> Slot intended to hold cleaning cartridge to 32.  Please insure that a
 cleaning cartridge is in that slot
> Number of times we will use a new cleaning cartridge to 5.
> Cleaning interval for the tape drives to 6 months.

You can review and change the characteristics of the autochanger and its
 associated devices using the NetWorker Management Console.

Would you like to configure another jukebox? (yes/no) [no]no

Using the VTL with NetWorker

Once you’ve got the jukebox created, start up with a simple command – plain old nsrjb:

[root@tara mhvtl]# nsrjb

Jukebox MHVTL: (Ready to accept commands)
14118:nsrjb: No volumes found in the media database...continuing.
slot  volume                      pool  barcode   volume id  recyclable
 1: -*                                  800843S3  -                    
 2: -*                                  800844S3  -                    
 3: -*                                  800845S3  -                    
 4: -*                                  800846S3  -                    
 5: -*                                  800847S3  -                    
 6: -*                                  800848S3  -                    
 7: -*                                  800849S3  -                    
 8: -*                                  800850S3  -                    
 9: -*                                  800851S3  -                    
10: -*                                  800852S3  -                    
11: -*                                  800853S3  -                    
12: -*                                  800854S3  -                    
13: -*                                  800855S3  -                    
14: -*                                  800856S3  -                    
15: -*                                  800857S3  -                    
16: -*                                  800858S3  -                    
17: -*                                  800859S3  -                    
18: -*                                  800860S3  -                    
19: -*                                  800861S3  -                    
20: -*                                  800862S3  -                    
21:                                                                        
22:                                                                        
23:                                                                        
24:                                                                        
25:                                                                        
26:                                                                        
27:                                                                        
28:                                                                        
29:                                                                        
30:                                                                        
31: -*                                  CLN001L1  -                    
32: Cleaning Tape (5 uses left)         CLN002L1  -                    
 *not registered in the NetWorker media data base

drive 1 (/dev/nst0) slot   :

(Note that I ran that about 30 seconds after the jukebox was created, so it had already transitioned into the “Ready to accept commands” state.)

The VTL isn’t built for speed, but it’s still zippy enough for lab testing. Here’s the output from a verbose label command, with timestamps added:

[root@tara mhvtl]# date ; nsrjb -Lvvv -b Default -S 1; date
Sat Jul 11 09:09:53 EST 2009
setting verbosity level to `3'
Info: Preparing to load volume `-' from slot 1 into device `/dev/nst0'.
Info: Loading volume `-' from slot `1' into device `/dev/nst0'.
Info: Load sleep for 5 seconds.
Info: Performing operation `Verify label' on device `/dev/nst0'.
Info: Operation `Verify label' in progress on device `/dev/nst0'
Info: Cannot read the current volume label `Tape label read for volume
 ? in pool ?, is not recognised by Networker: Input/output error'.
Info: nsrmmgd assumes the volume is unlabeled and will write a new label.
Info: Performing operation `Label without mount' on device `/dev/nst0'.
Info: Operation `Label without mount' in progress on device `/dev/nst0'
Info: Label: `800843S3', pool: `Default', capacity: `<NULL>'.
Info: Performing operation `Eject' on device `/dev/nst0'.
Info: Operation `Eject' in progress on device `/dev/nst0'
Info: Eject sleep for 5 seconds.
Info: Preparing to unload volume `800843S3' from device `/dev/nst0' to slot 1.
Info: Unloading volume `800843S3' from device `/dev/nst0' to slot 1.
Info: Unload sleep for 5 seconds.
Sat Jul 11 09:10:33 EST 2009
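
In the run above the two timestamps are 40 seconds apart, and 15 of those seconds are the three 5-second sleeps. As a rough sketch of how the date-before, date-after bracketing could be wrapped up for repeated lab timings, the following shell function reports elapsed seconds for any command. The name timed is a hypothetical helper, not part of NetWorker or the VTL, and it assumes a date command that supports +%s (as GNU date on Linux does).

```shell
#!/bin/sh
# Hypothetical helper: run a command and report wall-clock seconds on stderr.
# Assumes `date +%s` is available (true for GNU date on Linux).
timed() {
    start=$(date +%s)
    rc=0
    "$@" || rc=$?          # run the command, remembering its exit status
    end=$(date +%s)
    echo "elapsed: $((end - start))s" >&2
    return "$rc"
}
```

For example, timed nsrjb -Lvvv -b Default -S 1 would run the same label operation as above and print the elapsed time once it finishes.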

Writing a backup, we get the obligatory screenshot:

Screenshot showing backup to Mark's VTL

It’s still about recovery

Even though this is for lab usage only, we still need to make sure that what we write to virtual tape is what we get back. So after that backup, I ran a recovery, restoring the backup to another location. Running checksums against the original file and the recovered copy yielded:

[root@tara /]# md5sum /usr/share/doc/crash-4.0/README /backup/recover_test/doc/crash-4.0/README
73568e4d9e09ce2847673dd5156cb571  /usr/share/doc/crash-4.0/README
73568e4d9e09ce2847673dd5156cb571  /backup/recover_test/doc/crash-4.0/README
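
A single file is a fairly thin test, so here is a minimal sketch of how the same md5sum comparison might be extended to every file under a recovered directory. The function name verify_recovery and its output format are my own assumptions for illustration. It also assumes filenames without embedded newlines.

```shell
#!/bin/sh
# Hypothetical helper: compare MD5 checksums of every regular file under a
# source directory against the same relative path under a recovered directory.
# Prints one line per missing or differing file and returns non-zero if any.
verify_recovery() {
    if [ ! -d "$1" ] || [ ! -d "$2" ]; then
        echo "usage: verify_recovery SOURCE_DIR RECOVERED_DIR" >&2
        return 2
    fi
    # Resolve both to absolute paths, since we change directory below.
    src=$(cd "$1" && pwd)
    dst=$(cd "$2" && pwd)
    mismatches=$(
        cd "$src" && find . -type f | while IFS= read -r rel; do
            if [ ! -f "$dst/$rel" ]; then
                echo "MISSING  $rel"
            elif [ "$(md5sum < "$rel" | cut -d' ' -f1)" != \
                   "$(md5sum < "$dst/$rel" | cut -d' ' -f1)" ]; then
                echo "DIFFERS  $rel"
            fi
        done
    )
    if [ -n "$mismatches" ]; then
        printf '%s\n' "$mismatches"
        return 1
    fi
    echo "OK: all files match"
}
```

With the paths from the test above, this could be called as verify_recovery /usr/share/doc /backup/recover_test/doc.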

Caveats

In conclusion, I’d like to offer a few caveats:

  1. In case I’ve not mentioned this enough, this is not a production VTL. Please don’t think I’m advocating it as a replacement for a full VTL.
  2. The VTL will let you back up as much as you want to any piece of media, so be careful with space management – it’s on your own head to manage media sizes, etc.
  3. The default placement of the VTL object files (i.e., media) is in /opt/vtl, which is likely to be in the root filesystem on an average Linux host. Thus, if you don’t keep an eye on media capacity, you’re going to overrun your root filesystem (or whatever filesystem the VTL data is stored in).
  4. You still need either a NetWorker autochanger license or to be running in eval mode to be able to use/configure this.
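
Since caveats 2 and 3 both come down to watching disk space, here is a minimal sketch of a check that could be run by hand or from cron. The function names and the 80% default threshold are arbitrary choices of mine, not anything shipped with the VTL, and /opt/vtl is simply the default location mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the percentage used (digits only) of the
# filesystem that holds the given path, using POSIX-format df output.
fs_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: warn when the filesystem holding the VTL media is at
# or above a threshold. Defaults: /opt/vtl and 80% (an arbitrary figure).
check_vtl_space() {
    dir=${1:-/opt/vtl}
    limit=${2:-80}
    used=$(fs_used_pct "$dir")
    if [ -z "$used" ]; then
        echo "could not determine usage for $dir" >&2
        return 2
    fi
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: filesystem holding $dir is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: filesystem holding $dir is ${used}% full"
}
```

Calling check_vtl_space with no arguments checks /opt/vtl against the 80% figure; check_vtl_space /some/other/path 90 would check a different location and threshold.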

Again, if I haven’t said this enough – this is for lab testing.

[2009-07-13 Edit]

Proving that sometimes I just don’t read the documentation sufficiently, with a little bit more digging I discovered that Mark has also implemented a mktape command that creates media with user-nominated sizes. By stopping NetWorker, deleting the VTL media, recreating the media with the nominated sizes, then restarting the VTL and NetWorker, you can control capacity using this VTL. Most importantly, that means you can simulate tape changes.

[2009-11-15]

See here for an update article covering multiple drive support, now that Mark has this working in a way which is compatible with NetWorker.


* Note – I’ve adjusted the spacing slightly in the inquire output so that it fits in the average browser. That’s the only manipulation that was done, though.

© 2012 The NetWorker Blog