Jul 312009

Everyone makes mistakes. That’s part of being human. Indeed, I’d suggest that anyone who expects you to never make mistakes in your job may perhaps be either demented or live at right angles to reality.

I believe the best, the most realistic and useful thing we can aim for is to never make the same mistake twice.

Thus, below I present my “worst recovery ever” as an example of a mistake that I certainly don’t intend to ever have happen to me again. It happened in my last job, and since then I’ve changed the way I work when it comes to recoveries.

It was Friday afternoon – about 3pm in fact. There was a training course running, and for once the training network was behaving itself. As our management had (once) bought into the notion of network computing, our training environment was sufficiently convoluted such that Sun-Rays referred to our production backup server+fileserver+SunRay server, then ran RDP to a VMware server.

One of our engineers who had little to do with NetWorker (ironically, he was hired to be my field replacement for NetWorker, but circumstances changed when he was hired) ran one of those notoriously bad RedHat updates that for a while killed glibc and a bunch of other system files if you didn’t happen to be in North America. So after much diagnosing and discussing, it was necessary to do an OS recovery. The only problem was that the OS on his laptop was so hosed that you couldn’t start any more login sessions, so you were stuck with what was there.

I started the recovery, selected files and viewed volumes. However, some of the files we needed were on media that was outside of the library, and because we were backing up to disk, then cloning, then staging, NetWorker wanted the (offsite) clones, not the onsite originals. His laptop didn’t have administration rights on the backup server, so I ssh’d across to the backup server, set the appropriate tapes to have a ‘suspect’ status, then ran up the recovery again. I selected the root filesystem (/) and kicked off the recovery.

About 2-3 minutes later someone came to ask me about why they couldn’t access the fileserver any more. Checking, I couldn’t log in either. It was odd – the training course was still working, but nothing new would work.

And then it hit me. I never logged out of the Sun system before I kicked off the recovery. Not only that, when I kicked off the recovery, I’d run “recover -c linuxClienton the Sun server.

That’s right, I was recovering a Linux filesystem on top of a Solaris system – including /dev, including all the base binaries. And because it was a recovery designed to overwrite a clobbered operating system, I’d told it to force-overwrite everything it came across.

I was …unhappy… with myself. I aborted the recovery, obviously, but the damage had already been done. No-one else could do anything, and I couldn’t start the recovery of the backup server because it was also the Sun-Ray server, and there were a bunch of paying students wrapping up their training course. So I had to wait for the training course to complete before I could even start the recovery. Practically everyone else got an early mark, since there wasn’t much they could do.

The recovery turned out to be somewhat problematic. Because the Solaris OS was hosed, no new processes could be started on that – the only solution, after much consideration, was an OS reinstall. At some point though the OS disks for that server had been taken out to a customer site because that customer had lost their disks, or something along those lines, so an earlier revision Solaris installer disk was used. However, it turned out that earlier revision disk wasn’t compatible with that hardware, and these were the days when Solaris took forever to install when you didn’t have tonnes of RAM. So by 10pm that night, I’d given up on being able to install the OS that night. Murphy’s law strikes as much with recoveries as it does with anything else in IT, so of course I’d not slept a wink the night before, and I lived one and a half hours away from the office by train at the best of times – at that time of the night, 3 hours at least.

I fell into bed around 2am, but fretting over the recovery, didn’t sleep much either. I’d recently purchased a Sun workstation myself, so I knew I had an install disk that would work, which was why I’d chosen to come home rather than download Solaris from the office. Later that morning saw me heading back into the office to keep going, and of course, my installer of Solaris was now faulty so the OS install hit bad sectors on the disk during the process and I hit my head repeatedly against the column in my office.

Lady luck then struck. Or at least wandered by and waved. There was a detached mirror left over from some disk swapping from about 2-3 months ago.

Finally I managed to boot the system from the previously attached mirror, get that mirror re-syncing, and installed NetWorker. Once the mirrors had resynced the recovery started in earnest, and the system finally came back.

By that stage though it was about 5pm on Saturday afternoon. I was wrecked from having not slept for a couple of days, not to mention still bloody angry with myself for the entire chain of events.

So I vowed I’d never make the same mistake again.

How, I hear you ask, will I prevent myself from ever making this mistake again? I check, check, check. In real recovery situations, I never run a recovery command any longer without first checking what host I’m logged into. It takes me an extra 10 seconds, maybe even fewer, but it guarantees that I don’t get that sick filling in the pit of my stomach that comes with FUBAR’ing one system when trying to recover another.

If I had done that simple check all those years ago, I would have had a lovely, quiet weekend.

Sorry, the comment form is closed at this time.