Continuing the Commands you should know topic, here’s a single variant of nsrjb that you really, really should know.

Introduced somewhere in the NetWorker 7.3.x releases, the fast inventory operation available in nsrjb is a boon when you’re introducing a lot of barcoded media to a library.

By running the command:

# nsrjb -II

NetWorker will inventory only those volumes that don’t need to be loaded, i.e., just the barcoded volumes that are already in the media database. Thus, this command can be a great time-saver.

 

Is your backup server a modern, state of the art machine with high speed disk, significant IO throughput capabilities and ample RAM so as to not be a bottleneck in your environment?

If not, why?

Given the nature of what it does – support systems via backup and recovery – your backup server is, by extension, “part of” your most critical production server(s). I’m not saying that your backup server should be more powerful than any of your production servers, but what I do want to say is that your backup server shouldn’t be a restricting agent in relation to the performance requirements of those production servers.

Let me give you an example – the NetWorker index region. Using Unix for convenience, we’re talking about /nsr/index. This region should either be on equally high speed drives as your fastest production system drives, or on something that is still suitably fast.

For instance, in much smaller companies, I’ve often seen the production servers have SCSI drives or SCSI JBODs, while the backup server is just a machine with a couple of mirrored SATA drives.

In larger companies, you’ll have the backup server connected to the SAN with the rest of the production systems, but while the production systems will get access to 15,000 RPM SCSI drives, the backup server will instead get 7,200 RPM SATA drives (or worse, previously, 5,400 RPM ATA drives).

This is a flawed design process for one very important reason – for every file you back up, you need to generate and maintain index data. That is, NetWorker server disk IO occurs in conjunction with backups*.

More importantly, when it comes time to do a recovery, and indices must be accessed, do you want to pull index records for say, 20,000,000 files from slow disk drives or fast disk drives?

(Now, as we move towards flash drives for critical performance systems, I’m not going to suggest that if you’re using flash storage for key systems you should also use it for backup systems. There is always a price point at which you have to start scaling back what you want vs what you need. However, in those instances I’d suggest that if you can afford flash drives for critical production systems, you can afford 15,000 RPM SCSI drives for the backup server’s /nsr/index region.)

Where cost for higher speed drives becomes an issue, another option is to scale back the speed of the individual drives but use more spindles, even if the actual space used on each drive is less than the capacity of the drive**.

In that case, for instance, you might have 15,000 RPM drives for your primary production servers, but the backup server’s /nsr/index region might reside on 7,200 RPM SATA drives successfully, so long as they’re arrayed (no pun intended) in such a way that there are sufficient spindles to make reading back data fast. Equally, in such a situation, hardware RAID (or software RAID on systems that have sufficient CPUs and cores that it equals or exceeds hardware RAID performance) will allow for faster processing of data for writing (e.g., RAID-5 or RAID-3).

In the end, your backup server should be like a butler (or a personal assistant, if you prefer the term) – always there, always ready and able to assist with whatever it is you want done, but never, ever an impediment.


* I see this as a similar design flaw to say, using 7,200 RPM drives as a copy-on-write snapshot area for 15,000 RPM drives.
** Ah, back in the ‘old’ days, where a database might be spread across 40 x 2GB drives, using only 100 MB from each drive!

 

We want to be able to do index recoveries, right? If something terrible happens on the backup server, being able to recover the indices is pretty important.

There’s something which is often forgotten however when it comes to index backups, and if you don’t know about it, you may get stung.

In some environments, when a client is decommissioned, the client is deleted from the resource database but the index is left in place. (I’m not a fan of this – if you want to continue to recover from a client, you should leave the client configured!)

NetWorker will only do index maintenance tasks on clients that are configured. These maintenance tasks are:

  • Backup
  • Recovery
  • Upgrade

If you really want to be able to recover from clients at a future date, even if you’re not actively backing them up now, do you really want to risk that the indices for those clients are no longer available?

 

My colleague Brian Norris has an excellent suggestion in his blog about periodically capturing the client IDs within the NetWorker datazone.

I plan on making his suggestion slightly redundant for users of IDATA Tools – I plan to update the client-report utility to include the client ID to assist with this very strategy.

 

There was recently a discussion on the NetWorker mailing list that was sparked by someone asking what sort of LTO-4 library would be recommended by the community.

This led to some very interesting and useful feedback to the person posing the question. A lot of people had feedback about different libraries they’d used – both good and bad – and questions to ask, such as slot count and CAP/Mail slot size. I felt it was also important to weigh in on media movement speed, as I think this is often something which is disregarded when evaluating libraries – even though it often can play a factor in backup throughput, usability and perceived performance.

Rather than summarise myself, here’s what I had to say on the topic then:

Too often people worry about the speed and capacity of the media, and forget about the incidental factors, such as robotic movement times and even load/seek time on the media. These can play an important factor in backup and – more importantly, really – recovery schedules. When it comes to measuring backup performance, the sequence of “returning to slot, picking next tape, placing in drive” can actually start to make a significant impact on what I refer to as your overnight “backup bandwidth”. If it takes say, 70 seconds for one library to do it and your drives write at 160MB/s, then that’s about a 10.9GB interruption to your backups. If another library can do the same thing in 30 seconds, that’s just a 4.7GB interruption to your backups. (I’m deliberately excluding load/unload times of the media, because in a realistic comparison it would be the same drives in both libraries…) Repeat that say, 30 times a night, and suddenly you’re deciding whether you can afford to lose about 328GB in backup time a night or 141GB in backup time a night. For bigger sites, these numbers can actually become very important.
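
The per-swap figures above are simple multiplication: swap time multiplied by drive write speed. A minimal sketch of that arithmetic follows, assuming 1GB = 1024MB (the convention the 4.7GB figure implies). The function names are made up for illustration, and the printed values are unrounded, so they may differ slightly from the rounded figures quoted above.

```python
def interruption_gb(swap_seconds, drive_mb_per_s):
    """Data a drive could have written during one media swap, in GB (1 GB = 1024 MB)."""
    return swap_seconds * drive_mb_per_s / 1024


def nightly_loss_gb(swap_seconds, drive_mb_per_s, swaps_per_night):
    """Total backup capacity lost to media swaps over one night, in GB."""
    return interruption_gb(swap_seconds, drive_mb_per_s) * swaps_per_night


if __name__ == "__main__":
    # Figures from the example: 160MB/s drives, 30 swaps a night,
    # one library swapping media in 70 seconds and another in 30 seconds.
    for swap_seconds in (70, 30):
        per_swap = interruption_gb(swap_seconds, 160)
        per_night = nightly_loss_gb(swap_seconds, 160, 30)
        print(f"{swap_seconds}s swap: {per_swap:.2f}GB per swap, {per_night:.1f}GB per night")
```

Plugging in your own swap time, drive speed and nightly swap count gives a quick feel for how much robot movement costs you at your site.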

If you are considering a new tape library, be sure to ask about not only the “easy” numbers, such as how much it costs, how much maintenance costs, and how fast the drives read/write, but also the more challenging numbers – how much time it takes to move media around.

 

I’ve had a few customers in the last few days now upgrade from NetWorker 7.3.x to 7.4.x, and the most common issue that seems to be coming up is a failure in CDI. The scenario is that tape drives become effectively unusable, and whenever an attempt is made to use the drives, an error along the lines of the following comes up:

nsrd media alert: device deviceName: serial number mismatch, check system device ordering. Expected…

or:

nsrd media warning: deviceName reading: read open error: command completed successfully (drive status is … The drive serial number has changed)

The first thing to do when you get this error, it seems, is to check the status of CDI. If you had CDI turned on (i.e., set to “SCSI Commands”), turn it off for every tape device and then test again. That will usually do the trick.

 

If you’re using a modern NetWorker environment, the chances are that you’ll periodically notice entries such as the following in the daemon.log / daemon.raw files on the backup server:

39078 02/02/2009 09:45:13 PM  0 0 2 1152952640 5095 0 nox nsrexecd SYSTEM error: There is already a machine using the name: “faero”. Either choose a different name for your machine, or delete the “NSR peer information” entry for “faero” on host: “nox”

While this may look confronting, it’s actually a trivially easy error to fix that requires just a minute or so of your time with nsradmin. First, note the client that the error is about, and the client that the error is being recorded from. In this case, the error is about the client faero, while the error is being registered against the host nox.

To fix, run up nsradmin against the client service on nox:

# nsradmin -p nsrexec -s nox

(alternatively, you can use: nsradmin -p 390113 -s nox)

At the nsradmin> prompt, enter the command:

delete type: NSR peer information; name: faero

And answer yes when prompted to confirm. For example, the session might resemble the following:

nsradmin> delete type: NSR peer information; name: faero
                        type: NSR peer information;
               administrator: root, "user=root,host=nox";
                        name: faero;
               peer hostname: faero;
          Change certificate: ;
    certificate file to load: ;
Delete? y
deleted resource id 17.0.83.117.0.0.0.0.210.37.85.73.0.0.0.0.10.0.0.1(1)

There, you’ve done it. Note that you should be periodically scanning your daemon raw/log files for errors and trying to eliminate them. The goal should be that any error or warning reported in the file is something that you do need to worry about/investigate, rather than having a lot of “false positives” floating around in the system.
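
As a rough illustration of that kind of periodic scan, the sketch below pulls the client name and the reporting host out of messages like the one shown earlier, and prints the nsradmin commands used above. It assumes a plain-text log passed as the first argument (daemon.raw would need to be rendered to text first). The regular expression is based only on the single sample message in this post, and the function names are hypothetical.

```python
import re
import sys

# Matches the nsrexecd name-conflict message; accepts straight or curly quotes.
# Built from the one sample line in this post, so other message variants may not match.
PEER_CONFLICT = re.compile(
    r'already a machine using the name:\s*["“]([^"”]+)["”]'
    r'.*?on host:\s*["“]([^"”]+)["”]'
)


def find_peer_conflicts(lines):
    """Return a sorted list of unique (peer, host) pairs named in conflict messages."""
    found = set()
    for line in lines:
        match = PEER_CONFLICT.search(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return sorted(found)


def suggested_fix(peer, host):
    """Return the shell command and the nsradmin command described in this post."""
    return (
        f"nsradmin -p nsrexec -s {host}",
        f"delete type: NSR peer information; name: {peer}",
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for peer, host in find_peer_conflicts(log):
            shell_cmd, nsradmin_cmd = suggested_fix(peer, host)
            print(f"{peer} conflicts on {host}:")
            print(f"  # {shell_cmd}")
            print(f"  nsradmin> {nsradmin_cmd}")
```

This only reports what to run; the deletion itself is still done by hand in nsradmin, so you get the confirmation prompt before anything is removed.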

[Update, 2009-05-12]

I thought I’d mention that one of the most common times I see these warnings occur is after I’ve uninstalled/reinstalled NetWorker on a client, as opposed to having upgraded. Since on some clients it’s more or less necessary to uninstall/reinstall rather than upgrade, that helps explain why the information is lost periodically. My surmise is that on a new install, the NetWorker client processes generate a new ‘certificate’ or ‘identity’. As this new information conflicts with existing information the backup server has on the client, that’s what triggers the error.

It could be that other factors can cause this, but it seems that this is at least a primary cause.

 

Totally off topic, if you’re interested in AI at all, you may want to check out 3 Laws Unsafe, a site that takes a fresh and analytical look at the much touted “save humanity from evil robot/AI oppression” 3 laws of Robotics as proposed by Isaac Asimov. Personally, I’m not a fan of the 3 laws – I think they’re highly unethical, completely breakable by a single rogue programmer, and approach the problem from a purely mechanical perspective, failing to understand that if humanity does create artificial intelligence – particularly if it leads to a singularity (which seems inevitable) – then humanity does have an obligation not to create such intelligences as slaves.

 

I’ve previously been amused by stories claiming that tape is dead due to disk backup or VTL being able to seamlessly replace tape. However, lately it just sounds like a tiresome broken record. It’s usually accompanied by poorly described comments such as “tape is unreliable” or “tape is slow”. Honestly, do people who make the claims that tape is dead because of the latest backup-to-disk options understand what enterprise tape is? Sure, tape is slow and unreliable if you’re say, looking at DDS for your backups, but step up to enterprise media and the story is quite different.

So the story that’s set me off on this little rebuttal is “Disk Encroaches on Tape Backup’s Territory”. The article includes the following gem:

Consequently, many companies have been on the lookout for a better option. With the steady decline in prices for disk storage, it has become a replacement for tape at some companies. Virtual Tape Libraries (VTLs), which mimic tape backup systems while storing information on disk, deliver the robust features found with a tape backup system while eliminating tape’s shortcomings.

This conveniently leaves off the rather obvious point of – if your building is on fire and you have 30 seconds to spare, you might be able to run into the computer room and pull out the most recent tape, but I challenge you to pull out your VTL in that time.

More realistically though, you can offsite your media in a regular fashion, but you can’t offsite VTL or disk backup. Yes, you could position these devices offsite, connected by say, dark fibre, but what then happens if that site catches on fire? Yes, you could then replicate the VTLs or disk backup units between that site and another site, but (a) how expensive will that be, and (b) what happens if corruption is introduced? For true safety, you need to know that at least one of your backups is not only offsite, but also offline.

I’m not dismissing either disk backup or VTL – both have a valid and in fact very important role to play in most enterprise backup solutions. However, the ongoing need to parrot that “tape is dead” is not only inaccurate, but tiresome.

 

(Or, information is like water – you can drink from it, you can swim in it, or you can drown in it. What do you want to do?)

IT people work in what I’d refer to as information rich domains. That is, there’s a huge amount of information out there that can be of use, and so the struggle is not necessarily a lack of information, but a challenge in finding the information you want.

(This, for what it’s worth, is why I think that certification exams as a whole are at best poorly representative of skills. Certifications for the most part seem to be about rote recall, which doesn’t reflect real life situations. That is, in real life when faced with a challenging technical problem, I don’t think many people lock themselves in a room devoid of any contact with anyone or anything else and attempt to solve the problem based on memory.)

Real life problem resolution is about not only having access to a plethora of information, but also being able to find the key bits of information. Yes, you need a certain base amount of knowledge in the area to get started, but after that the solution will come from your overall ability to problem solve, and your skill or capability to retrieve the right information.

There are a few things I do that I think help me to access the information I need quickly. I’m not saying this suits everyone, but it works for me, so people of a similar ilk may find it useful.

First, when it comes to file storage, I’m incredibly anal retentive. (That means for instance, that it literally gives me the shudders if I look at someone’s desktop (Windows, Linux or Mac) and it’s full of files. To me that’s just like having a desk covered in papers and files 3 inches deep on every surface*.)

So I have lots of folders – lots, and lots, and lots of folders, nested, structured and named in such a way that I can quickly access stored data. Yes, it may take a few clicks to navigate through folders, but I find this easier than searching through a few folders with hundreds or even thousands of files.

Being on a Mac, I make heavy use of Spotlight, the integrated search tool. To be quite frank, in Mac OS X 10.4 this feature sucked, performance-wise, and I regretted every time I tried to use it. In 10.5/Leopard however, it screams, and is fast enough that I even use it as an application launcher when I’m in a hurry.

Next, and what has helped me most in the last two years, is a product called Yojimbo, from Bare Bones Software. This makes use of the SQLite component of Mac OS X to do information storage with full text search. I simply drop PDFs, text files, web locations, HTML pages, etc., into Yojimbo (and, when I have the time, add a few tags for additional identification).

The beauty of Yojimbo is that I don’t have to actually go to it in order to search it. One of its features is full Spotlight integration, so at any point that I’m looking for information I can just hit CMD+Space, type in the query in Spotlight, and get search results for both files still on disk, and content in Yojimbo’s database. Currently my Yojimbo database is about 1.5GB and continuing to grow as I bring more documentation into it**. (In fact, I don’t store documentation on my filesystem any more – unless I’m being lazy, I put everything into Yojimbo as I get it now.) If I need to send a document I find to a customer or colleague, I can export it and drop it in an email in a matter of seconds.

The final bit of organisation I do is archiving. I don’t like deleting information – in the past I’ve suffered the consequences of deleting something that I later found was no longer available. (E.g., needing to access a software compatibility guide from say, 1999 due to ancient versions of software in use.) At the same time though, I don’t want searches for current issues and problems to be cluttered with matching keywords from documents that are so old that all they’ll do is hinder, not help me. So I keep older, historical information out of reach of my day-to-day searches. Usually that’s stored on a separate fileserver – it’s there, it’s protected, it’s available if I want to access it, but it’s not getting in the way of what I need to do today.

So there’s a rough overview of how I stay organised. I know it won’t help everyone, but if you’re drowning in information, it may just be the start of a lifeline.


* I had a boss once who called such disorganisation a “discussion feature” for when he had customers in his office. You can imagine what I thought of that.

** Of course, it gets backed up! (I had to learn some AppleScript in order to properly quit Yojimbo at a strategically appropriate time of the day, copy the files for subsequent backup, and then restart it.)

© 2012 The NetWorker Blog