I’ve debated for a while whether to do this, since it might come across as somewhat twee. I think, though, that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects of a backup and recovery system.

So, here goes:

A is for Audit. Your backup system should be able to stand up to an audit as complete and trustworthy.
B is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
C is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
D is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
E is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
F is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
G is for Growth. Your backup environment should be scoped to handle at least 2 years' growth upon implementation. If it isn't, budgets haven't been established correctly.
H is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
I is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
J is for Jekyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jekyll might – not storm through them like Mr Hyde, flying apart.
K is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
L is for Logs. Treasure your logs. Don't throw them away too quickly, and make sure they're backed up too. With access to your logs, you can answer in 3 years' time why a backup from yesterday is proving problematic to recover from.
M is for Magnetic Tape. It's not going away any time soon. Don't kid yourself; you'll still be using it in backup and recovery systems for some time to come.
N is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
O is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
P is for Procedures; without them, you don't have a backup system at all.
Q is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
R is for Recovery, the most important facet of data protection.
S is for SLAs (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
T is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
U is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
V is for VTL. Whether you like the need for them or not, they're not going away any time soon.
W is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet your recovery windows, then your clone windows, then, and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
X is for X-Ray. If you can't X-Ray your backup status, drill down and see what happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
Y is for Yes. Yes, you should be backing up. Yes, you should be checking the backup status. Yes, you should be able to recover.
Z is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short nor succinct, yet hopefully useful nonetheless.

 

There was a recent discussion on the NetWorker mailing list as to whether some additional logging information that appeared in 7.4.x was worthwhile or whether it was worthless to the point of getting in the way of an administrator.

So that everyone is across what I’m talking about, the messages that started in 7.4.x are along the lines of:

nsrim: Only one browsable Full exists for saveset X. Its browse period is equal to retention period.

So here’s my take on the discussion: log files aren’t to be resented.

I recognise there’s a point where log files either become useless or waste people’s time. However, there’s really only one time for this – when the exact same information is needlessly repeated. In the case of these log messages, though, it’s not the exact same information needlessly repeated. It’s different information – it’s going to be about a different saveset each time.

What is the message about, you may be wondering? Well, I don’t actually know for sure. My suspicion is that it’s a message introduced to deal with processing saveset retention following changes introduced for pool based retention policies. But it doesn’t matter.

One thing that will drive me nuts with just about any product is encountering an issue where there are insufficient logs to actually work out what is going on. Obviously, there’s a fine line to walk – log too much and you waste space and potentially reveal too much about the IP of the package. However, log too little and it becomes extremely challenging for the people doing support (or the people who write the patches (or the people who wrote the software)) to resolve an issue. I don’t believe that having accurate logs guarantees quickly resolving an issue, but they certainly help – and not having them certainly hinders.

So my point is – don’t resent your log files. The amount of space they generally take up in NetWorker is quite minimal (compared to, say, the index region), and so you shouldn’t be concerned about space. Nor, I’ll insist, should you be concerned about how to go about stripping out messages you don’t need to review when scanning log files. Backup administrators of enterprise products in particular should be quite conversant with log analysis and text extraction.
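As a rough illustration of the kind of text extraction meant here, the following Python sketch hides already-reviewed messages while reading a log, without touching the log itself. The one pattern is taken from the nsrim message quoted earlier; the function name `strip_noise`, the `NOISE_PATTERNS` list and the second sample line are made up for the example, and a real daemon log will have its own layout.

```python
import re
from typing import Iterable, Iterator, Sequence

# Messages we have already reviewed and do not need to see again.
# The pattern comes from the nsrim message quoted above; add your own.
NOISE_PATTERNS = [
    re.compile(r"nsrim: Only one browsable Full exists for saveset"),
]


def strip_noise(
    lines: Iterable[str],
    patterns: Sequence[re.Pattern] = NOISE_PATTERNS,
) -> Iterator[str]:
    """Yield only the lines that match none of the noise patterns."""
    for line in lines:
        if not any(p.search(line) for p in patterns):
            yield line


# Sample input. The second line is invented purely for illustration.
SAMPLE = [
    "nsrim: Only one browsable Full exists for saveset X. "
    "Its browse period is equal to retention period.",
    "example: some other message worth reading",
]

if __name__ == "__main__":
    for line in strip_noise(SAMPLE):
        print(line)
```

Because `strip_noise` accepts any iterable of lines, an open file handle can be passed in place of `SAMPLE`. The filtering happens only at reading time, so the full log stays intact for support or engineering if it is needed later.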

If those extra logged entries allow me to quickly find something in a Knowledge Base, or similarly allow support to find something quickly in an engineering database, or allow a patch developer to isolate the section of code that causes the problem, or allow the core developer to target the section of code to write an enhancement, it’s fantastic, and well worth the extra few bytes here and there that occupy my filesystems.

 

Something I mention in my book, but which is worth elaborating further upon, is the need to keep backups of your backup server for as long as your longest backups – if not longer. One of the primary reasons for this of course is the indices; recovering older indices is traditionally easier than the laborious alternative of scanning in potentially a multitude of media.

There is, however, another equally important reason why your backup server should have a browse/retention time at least equal to the longest at your site – the logs.

Being able to recover your backup logs (i.e., nsr/logs/daemon*, nsr/logs/messages, etc.) is like having your own personal time machine for the backup system. This becomes important when you hit recovery situations that you just can’t explain. That is, an error you are getting now, when you try to do a recovery of files backed up 2 years ago, may not make any sense at all. However, if you’re able to recover the backup server logs from that period in time, they may very well fill in the missing information for you. The most common thing I find this helps with is identifying whether what you’re trying to recover was ever actually backed up in the first place. That is, the scenario runs something along the lines of:

  • User asks for file from arbitrary date – e.g., 29 May 2006.
  • Can’t browse to 29 May 2006, but can browse to 28 May 2006 and 30 May 2006.
  • Recover backup server logs from 30 May 2006 to see that the client could not be contacted for backup on that day.
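Once the old logs have been recovered, the last step above comes down to searching them for the client and date in question. A minimal Python sketch of that search follows; the log lines, their layout, the client names and the function name `find_mentions` are all invented for illustration, so adjust the date text to whatever form your recovered logs actually use.

```python
from typing import Iterable, List


def find_mentions(lines: Iterable[str], client: str, date_text: str) -> List[str]:
    """Return the lines that contain both the client name and the date text."""
    client_lc = client.lower()
    return [
        line.rstrip("\n")
        for line in lines
        if date_text in line and client_lc in line.lower()
    ]


# Invented log lines in an invented layout, purely for illustration.
RECOVERED_LOG = [
    "2006-05-28 client01 backup completed\n",
    "2006-05-29 client01 could not be contacted\n",
    "2006-05-29 client02 backup completed\n",
]

if __name__ == "__main__":
    for hit in find_mentions(RECOVERED_LOG, "client01", "2006-05-29"):
        print(hit)
```

An empty result is informative too: if the recovered logs never mention the client for that date, that is worth knowing before spending time scanning media.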

Now, some would argue that not being able to recover is the real problem – this isn’t always the case. Sometimes, due to circumstances beyond your control, you literally can’t recover – such as, say, a situation like the above where there was a failure to back up in the first place. In situations such as this, being unable to explain why the recovery can’t be facilitated is just as bad as not being able to recover.

© 2012 The NetWorker Blog