You all know about POETS day, don’t you? It’s a great acronym:

P-ss Off Early, Tomorrow’s Saturday

It’s a pretty good summation of a lot of the IT industry – we’re reluctant to kick off major changes on a Friday because … well, the weekend follows, and if something goes wrong, it could be disruptive.

But a day or so ago, Mat Stace (@matstace) tweeted:

If it’s not good enough to deploy on a Friday, what makes it good enough to deploy any other day of the week?

Some might think this is a little trite, but there’s actually good wisdom in Mat’s comment – if we lack the confidence that something we’re working on can be deployed safely on Friday, why should we be any more confident that it can be deployed safely at another time? In fact, when you stop and think about it, in the light of cold logic, there are only two explanations:

  1. You aren’t sufficiently certain that what you’re going to deploy is ready, or
  2. You’re superstitious.

Now, I’m as willing as the next person to claim that Murphy’s Law takes a perverse delight in visiting computer rooms, but realistically, that’s just a tendency to catastrophise* things when they come up unexpectedly.

So, if you’re sitting back and saying that you or the company should hold off doing something on a Friday because, well, it’s Friday, it’s time to stop and ask yourself – is it because you’re being superstitious, or is it because it’s simply not ready to be done, regardless of what day it is?

I know I will be.


* Thanks to my good friend Christopher Banks (aka @bipolarbearnz) for introducing me to that word last night. I’ll be using it daily for months, I think.

 

Your backup server is behaving perfectly normally, but you want to do one minor change to it. For example, you’ve read the performance tuning guide and realised you need to double the amount of RAM in the server. So you shut it down, install the extra memory, reboot it and it … goes to hell in a handbasket.

What happened?

Maybe filesystems didn’t mount.

Maybe a tape drive or library didn’t reappear.

Maybe … just maybe, someone made a change previously, but either (a) didn’t commit it to happen permanently or (b) didn’t test it with a reboot.

Your backup server is like any other production system, and therefore there’s a strong risk that uncontrolled change will cause issues. So, always make sure you follow these two rules:

  • If you make a change that takes you from a non-working to a working state, make sure you commit the change and reboot to test;
  • If you make an addition to the system that would be lost or otherwise not present after a reboot, make sure you commit the change and have it peer reviewed. If unsure, reboot.

Peer review is everything in these situations, but reboot tests are quite critical. In particular, the more hardware is involved in the system (and nothing says hardware like “tape library”!), the more you should be rigorously testing change. No ifs, no buts. This is important.
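
A reboot is the real test, but one class of uncommitted change can be spotted beforehand: filesystems that are mounted right now yet missing from the persistent mount table. What follows is only a minimal sketch, assuming a Linux-style host where the persistent table looks like /etc/fstab and the live table looks like /proc/mounts. The function names and the ignore list are my own for illustration, and it says nothing about tape drives or libraries.

```python
def parse_mount_points(table_text):
    """Return the mount points (second field) from fstab-style text."""
    points = set()
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def uncommitted_mounts(fstab_text, mounts_text,
                       ignore=("/proc", "/sys", "/dev", "/run")):
    """List mount points that are live now but absent from the persistent table.

    The ignore list (an assumption) drops kernel pseudo-filesystems that are
    never expected to appear in fstab.
    """
    persistent = parse_mount_points(fstab_text)
    active = parse_mount_points(mounts_text)
    return sorted(
        point for point in active - persistent
        if not any(point == i or point.startswith(i + "/") for i in ignore)
    )
```

On a real host you would feed it the contents of the two files; anything it returns is a candidate for "worked until the reboot".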
 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

A is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
B is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
C is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
D is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
E is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
F is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
G is for Growth. Your backup environment should be scoped to handle at least 2 years' growth upon implementation. If it isn't, budgets haven't been established correctly.
H is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
I is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
J is for Jekyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jekyll might – not storm through them like Mr Hyde, flying apart.
K is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
L is for Logs. Treasure your logs. Don't throw them away too quickly, and make sure they're backed up too. With access to your logs, you can answer in 3 years' time why a backup from yesterday is proving problematic to recover from.
M is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
N is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
O is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
P is for Procedures; without them, you don't have a backup system at all.
Q is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
R is for Recovery, the most important facet of data protection.
S is for SLAs (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactive environment rather than a proactive environment.
T is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
U is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
V is for VTL. Whether you like the need for them or not, they're not going away any time soon.
W is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet your recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
X is for X-Ray. If you can't X-Ray your backup status, drill down and see what happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
Y is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
Z is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short nor succinct, yet hopefully useful nonetheless.

 

So you’re a busy backup administrator and you’re getting ready to go on leave. It’s 4pm on your final day before the holiday, you’ve finally got everything off your plate, and you think to yourself, “Now I’ve finally got the time, I’ll just quickly upgrade NetWorker before I leave.”

This unfortunately is a variant of that Friday change rule violation known as POETS.

There are three distinctly wrong things with this scenario:

  • Infrastructure upgrade done without change control.
  • Infrastructure upgrade done at the last minute.
  • Infrastructure upgrade done without follow-up monitoring.

Any one of those scenarios is enough to cause a nightmare situation – either for yourself, getting call-outs when you’re meant to be on holidays, or for your colleagues, left in the lurch after you switch your phone off for two weeks and go on a holiday to the East Islands.

All three though? That’s just asking for trouble.
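
If you like your checklists executable, the three problems above can be written as a pre-flight test. Treat this as a rough sketch: the PlannedUpgrade fields, the upgrade_objections name and the three-day settle period are all assumptions of mine, not anything NetWorker or a change control tool provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PlannedUpgrade:
    """Hypothetical record of an infrastructure upgrade awaiting go-ahead."""
    description: str
    change_approved: bool                    # has change control signed off?
    scheduled_for: datetime
    owner_leave_starts: Optional[datetime]   # None if the owner is not going away
    follow_up_monitor: Optional[str]         # who watches the system afterwards


def upgrade_objections(upgrade: PlannedUpgrade,
                       settle_period: timedelta = timedelta(days=3)) -> List[str]:
    """Return the reasons this upgrade should not go ahead as planned.

    settle_period is an assumed minimum gap between the upgrade and the
    owner's leave; pick whatever suits your environment.
    """
    objections = []
    if not upgrade.change_approved:
        objections.append("no change control approval")
    leave = upgrade.owner_leave_starts
    if leave is not None and leave - upgrade.scheduled_for < settle_period:
        objections.append("too close to the owner's leave")
    if not upgrade.follow_up_monitor:
        objections.append("nobody assigned to follow-up monitoring")
    return objections
```

An empty list means none of the three objections apply; it does not mean the upgrade itself has been tested.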

(This lesson doesn’t actually just apply to NetWorker – it applies across the board for system, application and storage administration. Don’t modify the system just before going away for a while.)

Just before this holiday season, I had a customer upgrade* their NetWorker server from 7.3.x to 7.5 before going on leave. Not 7.5.1, not 7.5.1.8, 7.5. This didn’t go so well, and a few days later when the fill-in administrators noticed the issue**, there was a bit of work to rectify the various issues and some backups during that time didn’t work.

This however is by no means unique. Following Twitter I noticed one on-call person suffer a hideous xmas day and following day working on a call-out from what appeared to be an untested change done by someone else before that other person went on holiday.

And non-betting man that I am, I’d bet a considerable wad of money (and win) that this fellow’s experience wasn’t unique for IT workers over xmas 2009.

In short: choosing to do an untested/uncontrolled upgrade just before going on holidays can be either self-destructive or selfish (or even both) – it may lose you your holiday, depending on the level of the fail and the backup (or lack thereof) within your company, or it may cause a colleague to have an insufferably unpleasant time. (Alternatively, if you can be reached, it may result in you having a bad time on your holiday in order to help out a colleague having a bad time as well.)

The problem with rushing through upgrades at the last minute is that they tend to be poorly done, even if they seem simple enough. Even if change control is being followed, if that change control has been rushed through (as it can sometimes be done as a “last minute” activity), then it provides no guarantee that the change will work smoothly. And don’t forget: Murphy’s Law works in the datacentre as well. Something that looks easy, that you should be able to do with your eyes closed, when done as a rush job at the last minute can come unstuck quite easily.

So please – for your sake, for your colleagues’ sake, for NetWorker’s sake and for the sake of your company: please don’t upgrade just before you go on holidays.


* upgrade = “update” in NetWorker speak

** Which should serve as a reminder that you should never only have one backup administrator.

 

It’s easy to change NetWorker directives. A few clicks here and there if you use NMC, then a couple of lines of text rattled off into the right fields, and suddenly you’ve made anywhere from small, precise changes to massive changes to a backup.

It’s for this reason that I think modifying directives within the backup configuration should be considered important enough to warrant its own change control process. (I’ve previously talked about the backup administrator needing to be part of the change control authorisation process – this is another aspect however.)

Now, don’t get me wrong – despite what former employees may think, I’m not keen on excessive levels of red tape. In fact, I think a smart system should be designed at all times to minimise administrative overheads while ensuring that all accounting is still correctly done.

That being said, directives are, for want of a better term, dangerous. Misused, they can result in recovered systems being unusable – in data loss.

With this in mind, like other aspects of the backup system (adding clients, removing clients, adjusting savesets etc.), adjusting directives or applying directives to clients should also form part of change control.

Whenever directives are being changed, or applied, the following questions should be asked:

  • What is not working as desired?
  • What is the solution required?
  • What are the minimal steps required to make those changes?
  • How can system recoverability following the changes be tested?

It’s that final point that often goes missing with directives. Once, a long time ago (long enough to be NetWorker 5.5.3), a customer providing backup services to a host of companies set up a “zero error policy” but due to budget and time constraints merely kept on adjusting directives to remove any file from the backup that couldn’t be opened/read during the backup process. The end result was unrecoverable systems.
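
One cheap way to approach that final question is to keep a list of the files a system cannot be recovered without, and check every proposed exclusion against it before the change is approved. The sketch below is illustrative only: it uses Python's shell-style fnmatch patterns rather than NetWorker's actual directive syntax, and the function name and paths are hypothetical.

```python
from fnmatch import fnmatchcase


def excluded_but_required(skip_patterns, required_paths):
    """Map each recovery-critical path to the skip patterns that would drop it.

    skip_patterns are shell-style wildcards (an assumption, not NetWorker
    directive syntax); required_paths is a hand-maintained list of files
    the system cannot be recovered without.
    """
    hits = {}
    for path in required_paths:
        matches = [pattern for pattern in skip_patterns
                   if fnmatchcase(path, pattern)]
        if matches:
            hits[path] = matches
    return hits
```

A non-empty result is a reason to stop and ask why the file could not be read in the first place, rather than quietly skipping it.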

By placing directive maintenance into the realm of change control, we don’t seek to add more red tape to the backup system, but more thought, and more consideration of the consequences of changes that may adversely affect data and systems recovery.

 

…and if not, why?

A common mistake made in many companies is the failure to include the backup administrator (or, if there is a team, the team leader for data protection) in the change control approval process.

Typically the sorts of roles involved in change control include:

  • CIO or other nominated “final say” manager.
  • Tech writing the change request.
  • Tech’s manager approving the change request.
  • Network team.

Obviously there are exceptions, and many companies will have variances – for instance, in most consulting companies, a sales manager will also get to have a say in change control, since interruptions to sales processes at the wrong time can break a deal.

Too infrequently included in change control is the backup administrator, or the team responsible for backup administration. The common sense approach to data protection would seem to suggest this is lunacy. After all, if a change fails, surely one potential remedy will be to recover from backup?

The error is three-fold:

  • Implicit assumption that any issue is recoverable from;
  • Implicit assumption that the backup system is always available;
  • Implicit assumption that what you need backed up is backed up.

Out of all of those assumptions, perhaps only the last is forgivable. As I point out in my book, and many have pointed out before me, it’s always better to back up a little too much than not quite enough. Thus, in a reasonable environment that has been properly configured, systems should be protected.

The three-fold assumptions error can actually be summarised more succinctly though – assuming that having a backup system is a blank cheque on data recovery.

Common issues I’ve seen caused by failures to include backup administrators in change control include:

  • Having major changes timed to occur at the same time as scheduled down-time in the backup environment;
  • Kicking off full backups of large systems prior to changes without notification to the backup administrators, swamping media availability;
  • Scheduling changes to occur just prior to the next backup, making possible the maximal amount of data loss within the periodic backup frequency;
  • Not running fresh, full backups of version-critical database content after upgrades, and thus suffering significant outages later when a cross-version recovery is required;
  • Not checking version compatibility for applications or operating systems, resulting in “upgrades” that can’t be backed up;
  • Wasting backup administrators’ time searching for reasons why failures occurred because change outages ran during the backups.

To be blunt, any of the above scenarios that occur without pre-change signoff are inexcusable and represent a communications flaw within an organisation.
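
The point about scheduling changes just prior to the next backup is simple arithmetic, and worth seeing with numbers. As a small sketch (the function names and the idea of comparing against an RPO are my own framing): whatever has changed since the last completed backup is what a failed change can cost you.

```python
from datetime import datetime, timedelta


def exposure_at_change(last_backup_completed: datetime,
                       change_start: datetime) -> timedelta:
    """Time's worth of data that exists only on the live system at change time."""
    if change_start < last_backup_completed:
        raise ValueError("change starts before the last backup completed")
    return change_start - last_backup_completed


def exceeds_rpo(exposure: timedelta, rpo: timedelta) -> bool:
    """True if a failed change could lose more data than the RPO allows."""
    return exposure > rpo
```

With a nightly backup completing at 23:00, a change at 21:00 the next evening puts 22 hours of data at risk, while the same change at 00:30 puts an hour and a half at risk.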

Any change that has potential to impact on or be impacted by the backup system should be subject to approval by, or at the least notification to, the backup administrators. The logical consequence of this rule is: any change that has anything to do with IT systems should logically impact on or be impacted by the backup system.

Note that by impact on, I don’t mean just cause a deleterious effect to the backup system, but also more simply, require resources from the backup system (e.g., for the purposes of recovery, or even additional resources for more backups).

All of this falls into establishing policies surrounding the backup system, and I’m not talking about what backs up when – but rather, implications that companies must face as a result of having backup systems in place. Helping organisations understand those policies is a major focus of my book.

© 2012 The NetWorker Blog