Stop

 Architecture, Backup theory, NetWorker, Recovery
Mar 29 2012
 

For the last 6 weeks my life has seemingly been about constant interruptions. The house we’re renting has just been sold, and while I appreciate, as a landlord myself, the constraints of home ownership, I’ve also been made acutely aware of the challenges of trying to live a normal life while you’re constantly being asked to facilitate inspections, access, etc. The simple fact is that for 6 weeks, I’ve not been able to do much at all on weekends. Sure, the interruptions may only take an hour or two each day they occur, but since they happen in the middle of the day, there’s a whole bunch of things that you just can’t get to. A couple of weeks ago, for example, a festival over a long weekend was entirely unattainable.

Which brings me to the topic of this post – how much does your backup system interrupt you from your work?

If you’re a backup administrator, you probably question the logic of my question – after all, having to spend time on the backup system is just a case of doing your job.

However, this isn’t really the full story. Even if you’re a dedicated backup administrator, your job shouldn’t really be interruption based. An interruption based job, in that respect, implies a firefighting role – and a firefighting role is going to occur because of any combination of the following:

  • Architectural issues;
  • Procedural issues;
  • Hardware/software issues.

None of these should be all-encompassing enough that they become a dominating factor. Timesheets often demonstrate this in the way we record our time. For more years than I can count I’ve worked in jobs where time has to be accounted for, usually in 15 minute increments. But timesheets never account for spin-down and spin-up time. That is, if you’re working on something and a new task comes up that you have to switch across to, that switch is not instantaneous. (For further details, check here.)

So if your backup system is regularly acting as an interrupt system, are you working productively, or do you have an annoy-a-tron in your environment?

If you’re suffering high levels of interrupts in your backup environment, it’s time to look at changing the environment, even if that change means a temporary spike in work load or a requirement to bring some temporary staff on. With the possible exception of recoveries, no backup environment should be interrupt driven.

With the exception of recoveries, all other activities within a backup environment should be handled as either:

  • Change requests – a formal system tracking and monitoring successful implementation of non-major updates and alterations to the environment. This would cover new clients, new backup modules, etc.
  • Projects – a formal process for delivering substantial changes to the backup environment. (E.g., replacing an existing tape library with a combined backup to disk + long-term tape solution.)

Now I said “with the exception of recoveries” because, quite frankly, recoveries are the most important activity that can be done in a backup environment. As such, I want to note their processes explicitly. Recoveries should fall into one of three different categories:

  • User serviced – Recoveries that end-users or people other than backup administrators/operators can initiate, monitor and complete without intervention. This may be file recoveries from NAS units that integrate with snapshot/rollback functionality, it may be access to a NetWorker recovery GUI, or it may be the ability to initiate recovery from within an application module. These should be practically invisible to the backup administrators/operators.
  • Scheduled – Non-urgent recoveries that are requested via a formal process and submitted to the appropriate recovery facilitator to complete. These would be slotted into the facilitator’s work schedule on a priority basis.
  • Emergency – Critical recoveries (you could call these priority 1 recoveries), performed regardless of whether the official recovery request has been submitted or not.
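
To make the three categories concrete, here is a minimal sketch of how a recovery request might be triaged. The `RecoveryRequest` fields, the `classify` function and the rule that a critical request always counts as an emergency are my assumptions for illustration, not part of any NetWorker feature:

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryClass(Enum):
    USER_SERVICED = "user serviced"
    SCHEDULED = "scheduled"
    EMERGENCY = "emergency"


@dataclass
class RecoveryRequest:
    """Hypothetical request record; the field names are illustrative only."""
    description: str
    critical: bool      # a critical fault that needs immediate action
    self_service: bool  # requester can run the recovery without backup staff


def classify(request: RecoveryRequest) -> RecoveryClass:
    # Assumption: a critical recovery is an emergency regardless of
    # paperwork or of whether the requester could service it themselves.
    if request.critical:
        return RecoveryClass.EMERGENCY
    if request.self_service:
        return RecoveryClass.USER_SERVICED
    return RecoveryClass.SCHEDULED


def interrupts_administrator(request: RecoveryRequest) -> bool:
    # Only emergencies should pull the administrator off planned work;
    # scheduled recoveries are slotted in, user serviced ones are invisible.
    return classify(request) is RecoveryClass.EMERGENCY
```

Under this sketch only the emergency class is allowed to behave as an interrupt; everything else either never reaches the administrator or arrives as scheduled work.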

In any environment, no matter how well architected, there will always be the risk of emergency situations requiring immediate action – critical faults don’t tend to be something you can just schedule into your work day, for instance.

However, in a well architected backup environment with functioning equipment, fire-fighting should be a minimal aspect of the job, rather than an all-encompassing part of the backup administrator’s role.

Sep 16 2011
 

You all know about POETS day, don’t you? It’s a great acronym:

P-ss Off Early, Tomorrow’s Saturday

It’s a pretty good summation of a lot of the IT industry – we’re reluctant to kick off major changes on a Friday because … well, the weekend follows, and if something goes wrong, it could be disruptive.

But a day or so ago, Matt Stace (@matstace) tweeted:

If it’s not good enough to deploy on a Friday, what makes it good enough to deploy any other day of the week?

Some might think this is a little trite, but there’s actually good wisdom in Mat’s comment – if we lack the confidence that something we’re working on can be deployed safely on Friday, why should we be any more confident that it can be deployed safely at another time? In fact, when you stop and think about it, in the light of cold logic, there are only two explanations:

  1. You aren’t sufficiently certain that what you’re going to deploy is ready, or
  2. You’re superstitious.

Now, I’m as willing as the next person to claim that Murphy’s Law takes a perverse delight in visiting computer rooms, but realistically, that’s just a tendency to catastrophise* things when they come up unexpectedly.

So, if you’re sitting back and saying that you or the company should hold off doing something on a Friday because, well, it’s Friday, it’s time to sit back and ask yourself – is it because you’re being superstitious, or is it because it’s just simply not ready to be done, regardless of what day it is?

I know I will be.


* Thanks to my good friend Christopher Banks (aka @bipolarbearnz) for introducing me to that word last night. I’ll be using it daily for months, I think.

Jul 15 2011
 

Your backup server is behaving perfectly normally, but you want to do one minor change to it. For example, you’ve read the performance tuning guide and realised you need to double the amount of RAM in the server. So you shut it down, install the extra memory, reboot it and it … goes to hell in a handbasket.

What happened?

Maybe filesystems didn’t mount.

Maybe a tape drive or library didn’t reappear.

Maybe … just maybe, someone made a change previously, but either (a) didn’t commit it to happen permanently or (b) didn’t test it with a reboot.

Your backup server is like any other production system, and therefore there’s a strong risk that uncontrolled change will cause issues. So, always make sure you follow these two rules:

  • If you make a change that takes you from a non-working to a working state, make sure you commit the change and reboot to test;
  • If you make an addition to the system that would be lost or otherwise not present after a reboot, make sure you commit the change and have it peer reviewed. If unsure, reboot.

Peer review is everything in these situations, but reboot tests are quite critical. In particular, the more hardware is involved in the system (and nothing says hardware like “tape library”!), the more you should be rigorously testing change. No ifs, no buts. This is important.
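
A reboot is the real test, but one class of uncommitted change can be spotted beforehand: filesystems that are mounted now but have no entry that would bring them back. The following is a rough sketch only, assuming a Linux host where persistent mounts live in /etc/fstab and current mounts are listed in /proc/mounts. The function names and the default list of filesystem types are mine, and it knows nothing about mounts managed elsewhere (automounters, systemd mount units and so on):

```python
def fstab_mount_points(fstab_text):
    """Return the set of mount points declared in fstab-style text."""
    points = set()
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])  # second field is the mount point
    return points


def mounted_points(mounts_text, fs_types):
    """Return currently mounted mount points whose type is in fs_types."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] in fs_types:
            points.add(fields[1])
    return points


def uncommitted_mounts(fstab_text, mounts_text,
                       fs_types=("ext4", "xfs", "nfs", "nfs4")):
    """Mount points that are live now but absent from fstab."""
    return mounted_points(mounts_text, fs_types) - fstab_mount_points(fstab_text)
```

On a Linux backup server you would feed it the contents of /etc/fstab and /proc/mounts. Anything it reports is a candidate for “someone mounted this by hand and never committed it”. An empty result proves nothing about tape drives, libraries or kernel settings, so it complements the reboot test rather than replacing it.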

The A-Z of Backup and Recovery

 Architecture, Backup theory, Data loss, Features, NetWorker, Support
Jan 07 2010
 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

Pros:

  • Maximum control over backup granularity, down to the individual file level.
  • Coupled with NetWorker modules, allows for comprehensive application-consistent backups of enterprise products such as Oracle, Exchange Server, Sybase, Microsoft SQL Server, SAP, etc.
  • Very strong support for granular recovery options.
  • Least affected by changes to underlying virtual machine backup options.

Cons:

  • Each in-guest backup is unaware of other backups that may be happening on other virtual machines on the same server. Thus, the backups have the potential to actively compete for CPU, RAM, Network Bandwidth and Storage IOs. An aggressive or ill-considered approach to in-guest backup configuration can bring an entire virtual environment to its knees.
  • Suffers same problems as conventional per-host agent backup solutions, most notably in consideration of potential performance inhibitors such as dense filesystems. Can result in the longest backups of all options.
  • Bare Metal Recovery options are often more problematic or involved.

And there we have it. Maybe neither short, nor succinct, yet hopefully useful none-the-less.

Dec 28 2009
 

So you’re a busy backup administrator and you’re getting ready to go on leave. It’s 4pm on your final day before the holiday, you’ve finally got everything off your plate, and you think to yourself, “Now I’ve finally got the time, I’ll just quickly upgrade NetWorker before I leave.”

This unfortunately is a variant of that Friday change rule violation known as POETS.

There are three distinctly wrong things with this scenario:

  • Infrastructure upgrade done without change control.
  • Infrastructure upgrade done at the last minute.
  • Infrastructure upgrade done without follow-up monitoring.

Any one of those scenarios is enough to cause a nightmare situation – either for yourself, getting call-outs when you’re meant to be on holidays, or for your colleagues, left in the lurch after you switch your phone off for two weeks and go on a holiday to the East Islands.

All three though? That’s just asking for trouble.

(This lesson doesn’t actually just apply to NetWorker – it applies across the board for system, application and storage administration. Don’t modify the system just before going away for a while.)

Just before this holiday season, I had a customer upgrade* their NetWorker server from 7.3.x to 7.5 before going on leave. Not 7.5.1, not 7.5.1.8, 7.5. This didn’t go so well, and a few days later when the fill-in administrators noticed the issue**, it took a bit of work to rectify the various problems, and some backups during that time failed.

This however is by no means unique. Following Twitter I noticed one on-call person suffer a hideous xmas day and following day working on a call-out from what appeared to be an untested change done by someone else before that other person went on holiday.

And non-betting man that I am, I’d bet a considerable wad of money (and win) that this fellow’s experience wasn’t unique for IT workers over xmas 2009.

In short: choosing to do an untested/uncontrolled upgrade just before going on holidays can be either self-destructive or selfish (or even both) – it may lose you your holiday, depending on the level of the fail and the backup (or lack thereof) within your company, or it may cause a colleague to have an insufferably unpleasant time. (Alternatively, if you can be reached, it may result in you having a bad time on your holiday in order to help out a colleague having a bad time as well.)

The problem with rushing through upgrades at the last minute is that they tend to be poorly done, even if they seem simple enough. Even if change control is being followed, if that change control has been rushed through (as it can sometimes be done as a “last minute” activity), then it provides no guarantee that the change will work smoothly. And don’t forget: Murphy’s Law works in the datacentre as well. Something that looks easy, that you should be able to do with your eyes closed, when done as a rush job at the last minute can come unstuck quite easily.

So please – for your sake, for your colleagues’ sake, for NetWorker’s sake and for the sake of your company: please don’t upgrade just before you go on holidays.


* upgrade = “update” in NetWorker speak

** Which should serve as a reminder that you should never only have one backup administrator.

Directives and change control

 NetWorker, Policies
Jul 15 2009
 

It’s easy to change NetWorker directives. A few clicks here and there if you use NMC, then a couple of lines of text rattled off into the right fields, and suddenly you’ve made anywhere from small, precise changes to massive changes to a backup.

It’s for this reason that I think that modifying directives within the backup configuration should be considered important enough that they warrant their own change control processes. (I’ve previously talked about the backup administrator needing to be part of the change control authorisation process – this is another aspect however.)

Now, don’t get me wrong – despite what former employees may think, I’m not keen on excessive levels of red tape. In fact, I think a smart system should be designed at all times to minimise administrative overheads while ensuring that all accounting is still correctly done.

That being said, directives are, for want of a better term, dangerous. Misused, they can result in recovered systems being unusable – in data loss.

With this in mind, like other aspects of the backup system (adding clients, removing clients, adjusting savesets etc.), adjusting directives or applying directives to clients should also form part of change control.

Whenever directives are being changed, or applied, the following questions should be asked:

  • What is not working as desired?
  • What is the solution required?
  • What are the minimal steps required to make those changes?
  • How can system recoverability following the changes be tested?

It’s that final point that often goes missing with directives. Once, a long time ago (long enough to be NetWorker 5.5.3), a customer providing backup services to a host of companies set up a “zero error policy” but, due to budget and time constraints, merely kept on adjusting directives to remove any file from the backup that couldn’t be opened/read during the backup process. The end result was unrecoverable systems.

By placing directive maintenance into the realm of change control, we don’t seek to add more red tape to the backup system, but more thought, and more consideration of the consequences of changes that may adversely affect data and systems recovery.
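
As a rough illustration of how little red tape this needs, the four questions above can be captured as a record that simply refuses to go forward while any of them is unanswered. This is a sketch under my own assumptions: the `DirectiveChangeRequest` class and its field names are hypothetical and not tied to NetWorker or any change control product:

```python
from dataclasses import dataclass, fields


@dataclass
class DirectiveChangeRequest:
    """Hypothetical change record; one field per question asked above."""
    problem: str            # What is not working as desired?
    required_solution: str  # What is the solution required?
    minimal_steps: str      # What are the minimal steps required?
    recovery_test: str      # How can recoverability be tested afterwards?


def unanswered(request):
    """Names of the questions left blank, in the order they are asked."""
    return [f.name for f in fields(request)
            if not getattr(request, f.name).strip()]


def ready_for_approval(request):
    # A directive change goes forward only when every question,
    # including the recovery test, has an answer.
    return not unanswered(request)
```

A request that skips files which couldn’t be read, with nothing in `recovery_test`, would be bounced back, which is exactly the point where the “zero error policy” story went wrong.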

Mar 05 2009
 

This blog post has moved, and can now be found at the Enterprise Backup Blog, here.