Whenever I conduct training, I tell a story about a disaster recovery test that I ran for a customer in the early 2000s. This had to be run within the customer datacentre, and the setup looked vaguely like this:

[Figure: System setup]

As anyone who has worked within an actual datacentre computer room can attest, these rooms get pretty noisy, and pretty cold. For instance, even at that relatively close distance I actually couldn’t hear the tape library in operation. If I needed to be physically aware of what it was doing, I had to walk to the front of it and cup my hands between my face and the glass panel and peer at the robot and drives. (It was voyeuristic, in a geeky sort of way.)

Now this customer had STK 9840 tape drives, which you might know are very fast when it comes to load/unload operations. When you combine a fast drive with a fast robot mechanism such as was in the STK 9740, it means that the following NetWorker process can actually happen quite quickly:

Load Tape -> Read Label -> Eject Tape -> Unload Tape

In fact, I wasn’t quite aware when I first started the disaster recovery exercise just how quickly these operations would run, as the implementation had been done by other staff before I joined the company – I was just in to do the recovery testing. The testing was made all the more interesting given this was the first system I used with SmartMedia, Legato’s library virtualisation software at the time.

The challenge I had was a simple one: I kept on issuing volume load operations that NetWorker appeared to be accepting as valid, but not processing. I.e., I’d issue a volume load operation, get up from my desk, wander around to the front of the library, peer inside and … see no activity. No tapes in drives, no moving robot, nothing.

After a few iterations of this – you know, the mindless “maybe if I keep doing it, it’ll start working” sort of approach that we all sometimes suffer – it occurred to me to check the logs, and sure enough, NetWorker was reporting that it was loading the volume, detecting a tape label it didn’t know about (of course! – doofus!) and spitting the tape back out again.

The only thing was that it was happening just fast enough that (because I’d felt no need to rush) it would be done by the time I’d get up, grab my phone, and head around to the front of the library. Because I couldn’t hear the damn thing, I had no idea what was going on.

Once I realised how the system worked – in terms of speed of operations – the rest of the process worked smoothly.

A year or two later, I was helping a customer transition from ArcServe to NetWorker, and they had an interesting tape library. I can’t remember the brand now, but it had 66 slots, with 22 slots only ever facing the robot at any one time. The slots themselves were on a carousel, and that carousel could spin into 1 of 3 positions to allow the robot access to the slots. I thought that was a pretty weird design, but then I was confronted with just how long it would take the library to become ready to import a tape after it was dropped in its CAP. In fact, confronted would be better said as confounded – with the computer room several floors beneath the floor I was working on, it was possible to go and put a tape in the CAP and still have the library offline before I got back upstairs. It left me certain that NetWorker was misbehaving.

In fact, NetWorker was reasonably fine – but again, I was working with a library I wasn’t familiar with. It turned out that the barcode reader on the robot head couldn’t actually read the media barcode from the CAP. Instead, the robot had to (slowly, laboriously) take the tape out of the CAP, put it in a special “reading the barcode” slot at the bottom of the library, read the barcode, then (slowly, laboriously) take the tape back out of that slot and return it to the CAP.

Both the STK 9740/9840 experience and the carousel robot experience taught me some very important lessons – lessons which I think every backup administrator should make sure to experience:

  1. You can’t accurately diagnose your environment unless you know how it normally works.
  2. You can’t know how your environment normally works unless you are aware of the physical timings of activities.
  3. You can’t know the physical timings of activities unless you physically watch them.
  4. Therefore you can’t accurately diagnose your environment unless you physically watch your environment.

Now, I’m not suggesting you have to watch your environment all the time – for most backup administrators that would suggest having a desk in a very chilly and noisy computer room. (I once spent every work day for a month in a frigid computer room with overhead cooling. While listening to Justin Bieber for 30 minutes would have been more torturous, it would have only been by a very small amount.)

What you do need to do though is ensure you know how long the basic operations take – loading tape, unloading tape, withdrawing and depositing tapes, re-initialising the library after a reset, powering on, etc. This means sitting with the physical components of your backup system and running the various commands and becoming familiar with how long they take to complete. You can read as many tech specs and manuals as you want, but until you’ve sat down with your tape library (or a comparable one), and experienced the timings yourself, you’re going to be working in the dark when it comes to debugging the system if issues occur.
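As a rough companion to watching the hardware yourself, the timings can also be recorded so that later observations have something to be compared against. The following is a minimal sketch, not a NetWorker tool: the helper names `time_command` and `looks_abnormal` and the 50% tolerance are assumptions made for illustration, and the command shown is a harmless placeholder that you would replace with whatever load, unload or reset command your own library uses. Note that a command may return before the robot has physically finished, which is exactly why the stopwatch does not replace standing in front of the library.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd several times; return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        # check=True raises if the command fails, so failures are not timed as successes
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.monotonic() - start)
    return durations


def looks_abnormal(observed, baseline, tolerance=0.5):
    """True if observed differs from the median baseline by more than tolerance (a fraction)."""
    typical = statistics.median(baseline)
    return abs(observed - typical) > tolerance * typical


if __name__ == "__main__":
    # Placeholder command (hypothetical): substitute your library's own operation here.
    cmd = [sys.executable, "-c", "pass"]
    baseline = time_command(cmd)
    print(f"median: {statistics.median(baseline):.2f}s over {len(baseline)} runs")
```

With a baseline of roughly 40 seconds for a load, a "load" that completes in one second is flagged as abnormal, which is the kind of clue that would have pointed at the load-and-reject cycle in the story above.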

It’s actually a natural extension to standard system administration practices. A system administrator who is familiar with his or her system will have a reasonably good idea of what processes should be running under normal operations (or rather – what key processes), and what average/peak loading conditions should do to the host. Taking it to the physical layer as a backup administrator is perfectly normal.

 

There’s a simple rule to remember when it comes to removable media handling (both within backups, and generally within IT) – if you don’t know where your media is, you can’t be certain someone hasn’t misappropriated it.

Taking this further, if you can’t be sure of the security of your backup media, you can’t be sure of the security of your backups; and if you can’t be sure of the security of your backups, you can’t be sure of the security of your data.

So, how can you be certain of the security of your media, and therefore your backups and data?

Here are a few guidelines:

  • Always use reputable media handling companies. There are two parts to this requirement. First, you want to make sure that the company that handles and stores your media knows how to treat it carefully. That means correct handling procedures, storage in appropriate environmental conditions, and storage in a location that is unlikely to be affected by disasters that could affect your datacentre. Second, you need to know that the media is always secure. This means signed, authorised access, a known reputation for security, audited processes and (preferably) premises that you can periodically visit to confirm security levels.
  • Store media securely on-site too. It is far from the case that media can only be stolen when off-site or travelling to/from site. Indeed, some of Australia’s biggest media losses have occurred on-site due to poor media handling security. (I seriously doubt Australia is unique in this). Tapes shouldn’t be kept insecurely anywhere on-site. When being transported from the computer room to on-site storage, they should be securely monitored at all times. When readying for transport off-site, they should be kept under lock and key, or kept in a secure location. And when at-rest on-site, they should also be kept under lock and key.
  • Media encryption. For a long time media encryption has been available only to the high end of enterprise backup. However, with tape technologies such as LTO-4 incorporating hardware encryption, any company using removable media in their backup environment should either:
    • Already be using media encryption, or
    • Be actively planning a move to media encryption, or
    • If nothing else, use NetWorker’s software encryption on critically sensitive data if the business is too small to afford hardware-encryption devices. This means taking a hit on backup performance, but as the old saying goes, you can’t have your cake and eat it too. I.e., there’s always a cost to encryption.
  • Secure key management. Media encryption doesn’t mean a thing if you’re not using some form of secure key management. Discuss and plan backup key management with your corporate security policy makers.
  • Have established, immutable processes for the recall of media. Media that has been sent to offsite storage should only be returned under specific, agreed circumstances. Normally that may be a fixed rotation policy, with provisions for recall for recoveries with specific authorisation. Make sure that authorisation process is locked down with your offsite media vendor so that social engineering attacks can’t be employed (particularly when it comes to ex-employees).
  • Use strong password management for backup server access. As I’ve previously discussed, your entire backup environment is only as secure as your backup server. This places a special responsibility on backup and system administrators to ensure that the backup environment is highly secure.
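The rule about always knowing where your media is can be made a little more concrete with a simple custody register. This is only an illustrative sketch: the `MediaRecord` fields, the location names, and the 30-day and 1-day limits are all assumptions chosen for the example, not anything NetWorker or an offsite vendor provides. The idea is simply to flag any tape whose location is unknown, or whose location has not been confirmed recently enough.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MediaRecord:
    barcode: str
    location: str             # e.g. "library", "onsite-safe", "in-transit", "offsite-vault"
    last_confirmed: datetime  # when someone last verified the tape was really there


def unaccounted_media(records, now,
                      max_age=timedelta(days=30),
                      max_transit=timedelta(days=1)):
    """Return barcodes whose location is unknown or not confirmed recently enough."""
    flagged = []
    for rec in records:
        age = now - rec.last_confirmed
        # Tapes in transit get a much shorter limit than tapes at rest
        limit = max_transit if rec.location == "in-transit" else max_age
        if rec.location == "unknown" or age > limit:
            flagged.append(rec.barcode)
    return flagged
```

A tape recorded as "in-transit" for three days, or one sitting in an on-site safe that nobody has checked for three months, would both be flagged for follow-up under these assumed limits.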

Of course, there’s more to backup systems security than the above, but I wanted to focus primarily on physical security considerations for removable media, which for a lot of sites represent the weakest link in the security of the backup environment (and by extension, a significantly weak link in the security of the company’s IT systems and data as well).

If you fail to focus on removable media security, you potentially leave your company open to data loss.

© 2012 The NetWorker Blog