Whenever I conduct training, I tell a story about a disaster recovery test that I ran for a customer in the early 2000s. This had to be run within the customer datacentre, and the setup looked vaguely like this:
As anyone who has worked within an actual datacentre computer room can attest to, these rooms get pretty noisy, and pretty cold. For instance, even at that relatively close distance I actually couldn’t hear the tape library in operation. If I needed to be physically aware of what it was doing, I had to walk to the front of it and cup my hands between my face and the glass panel and peer at the robot and drives. (It was voyeuristic, in a geeky sort of way.)
Now this customer had STK 9840 tape drives, which you might know are very fast when it comes to load/unload operations. When you combine a fast drive with a fast robot mechanism such as was in the STK 9740, it means that the following NetWorker process can actually happen quite quickly:
Load Tape -> Read Label -> Eject Tape -> Unload Tape
In fact, I wasn’t quite aware when I first started the disaster recovery exercise just how quickly these operations would run, as the implementation had been done by other staff before I joined the company – I was just in to do the recovery testing. The testing was made all the more interesting given this was the first system I used with SmartMedia, Legato’s library virtualisation software at the time.
The challenge I had was a simple one: I kept on issuing volume load operations that NetWorker appeared to be accepting as valid, but not processing. I.e., I’d issue a volume load operation, get up from my desk, wander around to the front of the library, peer inside and … see no activity. No tapes in drives, no moving robot, nothing.
After a few iterations of this – you know, the mindless “maybe if I keep doing it, it’ll start working” sort of approach that we all sometimes suffer – it occurred to me to check the logs, and sure enough, NetWorker was reporting that it was loading the volume, detecting a tape label it didn’t know about (of course! – doofus!) and spitting the tape back out again.
The only thing was that it was happening just fast enough that (because I’d felt no need to rush) it would be done by the time I’d get up, grab my phone, and head around to the front of the library. Because I couldn’t hear the damn thing, I had no idea what was going on.
Once I realised how the system worked – in terms of speed of operations – the rest of the process worked smoothly.
A year or two later, I was helping a customer transition from ArcServe to NetWorker, and they had an interesting tape library. I can’t remember the brand now, but it had 66 slots, with only 22 slots facing the robot at any one time. The slots themselves were on a carousel, and that carousel could spin into one of three positions to allow the robot access to the slots. I thought that was a pretty weird design, but then I was confronted with just how long it would take the library to become ready to import a tape after it was dropped in its CAP. In fact, confronted would be better said as confounded – with the computer room several floors beneath the floor I was working on, it was possible to go and put a tape in the CAP and still find the library offline by the time I got back upstairs. It left me certain that NetWorker was misbehaving.
In fact, NetWorker was reasonably fine – but again, I was working with a library I wasn’t familiar with. It turned out that the barcode reader on the robot head couldn’t actually read the media barcode from the CAP – instead, the robot had to (slowly, laboriously) take the tape out of the CAP, put it in a special “reading the barcode slot” at the bottom of the library, read the barcode, then (slowly, laboriously) take the tape out of the special “reading the barcode slot” and return it to the CAP.
Both the STK 9740/9840 experience and the carousel robot experience taught me some very important lessons – lessons which I think every backup administrator should make a point of experiencing:
- You can’t accurately diagnose your environment unless you know how it normally works.
- You can’t know how your environment normally works unless you are aware of the physical timings of activities.
- You can’t know the physical timings of activities unless you physically watch them.
- Therefore you can’t accurately diagnose your environment unless you physically watch your environment.
Now, I’m not suggesting you have to watch your environment all the time – for most backup administrators that would suggest having a desk in a very chilly and noisy computer room. (I once spent every work day for a month in a frigid computer room with overhead cooling. While listening to Justin Bieber for 30 minutes would have been more torturous, it would have only been by a very small amount.)
What you do need to do though is ensure you know how long the basic operations take – loading tape, unloading tape, withdrawing and depositing tapes, re-initialising the library after a reset, powering on, etc. This means sitting with the physical components of your backup system and running the various commands and becoming familiar with how long they take to complete. You can read as many tech specs and manuals as you want, but until you’ve sat down with your tape library (or a comparable one), and experienced the timings yourself, you’re going to be working in the dark when it comes to debugging the system if issues occur.
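If you want to keep a record of those timings while you sit with the library, a small stopwatch script is one way to do it. The following is only a sketch: `time_command` and `summarise` are hypothetical helper names, and the command being timed is a harmless stand-in that you would replace with whatever load, unload, deposit or withdraw command your own backup product provides. It also assumes the command blocks until the operation finishes, which is not a given. If the command returns before the robot has stopped moving, the number means little, which is exactly why you still need to watch the hardware.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times; return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        # check=True raises if the command fails, so a failed operation
        # is never silently recorded as a (misleadingly fast) timing.
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.monotonic() - start)
    return durations


def summarise(name, durations):
    """Reduce a list of durations to the figures worth writing down."""
    return {
        "operation": name,
        "runs": len(durations),
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


if __name__ == "__main__":
    # Stand-in command only: replace each entry with the real command line
    # for the operation you want to time on your own library.
    operations = {
        "placeholder operation": [sys.executable, "-c", "pass"],
    }
    for name, argv in operations.items():
        print(summarise(name, time_command(argv)))
```

Run it a few times when the environment is healthy and keep the results somewhere you can find them; the next time an operation seems slow or seems to do nothing at all, you have a baseline to compare against.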
It’s actually a natural extension to standard system administration practices. A system administrator who is familiar with his or her system will have a reasonably good idea of what processes should be running under normal operations (or rather – what key processes), and what average/peak loading conditions should do to the host. Taking it to the physical layer as a backup administrator is perfectly normal.