Resolutions Check-in

In December last year I posted “7 new years backup resolutions for companies”. Since it’s the end of January 2012, I thought I’d check in on those resolutions and suggest where a company should be up to on them, as well as offering some next steps.

  1. Testing – The first resolution related to ensuring backups are tested. By now at least an informal testing plan should be in place if none were before. The next step will be to deal with some of the aspects below so as to allow a group to own the duty of generating an official data protection test plan, and then formalise that plan.
  2. Duplication – There should be documented details of what is and what isn’t duplicated within the backup environment. Are only production systems duplicated? Are only production Tier 1 systems duplicated? The first step towards achieving satisfactory duplication/cloning of backups is to note the current level of protection and expand outwards from that. The next step will be to develop tier guidelines to allow a specification of what type of backup receives what level of duplication. If there are already service tiers in the environment, this can serve as a starting point, slotting existing architecture and capability onto those tiers. Where existing architecture is insufficient, it should be noted and budgets/plans should be developed next to deal with these short-falls.
  3. Documentation – As I mentioned before, the backup environment should be documented. Each team that is involved in the backup process should have assigned at least one individual to write documentation relating to their sections (e.g., Unix system administrators would write Unix backup and recovery guidelines, etc., Windows system administrators would do the same for Windows, and so on). This should actually include 3 people: the writer, the peer reviewer, and the manager or team leader who accepts the documentation as sufficiently complete. The next step after this will be to handover documentation to the backup administrator(s) who will be responsible for collation, contribution of their sections, and periodic re-issuing of the documents for updates.
  4. Training – If staff (specifically administrators and operators) had previously not been trained in backup administration, a training programme should be in the works. The next step, of course, will be to arrange budget for that training.
  5. Implementing a zero error policy – First step in implementing a zero error policy is to build the requisite documents: an issues register, an exceptions register, and an escalations register. Next step will be to adjust the work schedules of the administrators involved to allow for additional time taken to resolve the ‘niggly’ backup problems that have been in the environment for some time as the switchover to a zero error policy is enacted.
  6. Appointing a Data Protection Advocate – The call should have gone out for personnel (particularly backup and/or system administrators) to nominate themselves for the role of DPA within the organisation, or if it is a multi-site organisation, one DPA per site. By now, the organisation should be in a position to decide who becomes the DPA for each site.
  7. Assembling an Information Protection Advisory Council (IPAC) – Getting the IPAC in place is a little more effort because it’s going to involve more groups. However, by now there should be formal recognition of the need for this council, and an informal council membership. The next step will be to have the first formal meeting of the council, where the structure of the group and the roles of the individuals within the group are formalised. Additionally, the IPAC may very well need to make the final decision on who is the DPA for each site, since that DPA will report to them on data protection activities.

It’s worth remembering at this point that while these tasks may seem arduous at first, they’re absolutely essential to a well running backup system that actually meshes with the needs of the business. In essence: the longer they’re put off, the more painful they’ll be.

How are you going?

 

New years resolutions for backup

I’d like to suggest that companies be prepared to make (and keep!) 7 new years resolutions when it comes to the field of backup and recovery:

  1. We will test our backups: If you don’t have a testing regime in place, you don’t have a backup system at all.
  2. We will duplicate our backups: Your backup system should not be a single point of failure. If you’re not cloning, replicating or duplicating your backups in some form, your backup system could be the straw that breaks the camel’s back when a major issue occurs.
  3. We will document our backups: As for testing, if your backup environment is undocumented, it’s not a system. All you’ve got is a collection of backups, which, if the right people are around at the right time and in the right frame of mind, you could get a recovery from it. If you want a backup system in place, you not only have to test your backups, you also have to keep them well documented.
  4. We will train our administrators and operators: It never ceases to amaze me the number of companies that deploy enterprise backup software and then insist that administrators and operators just learn how to use it themselves. While the concept of backup is actually pretty simple (“hey, you, back it up or you’ll lose it!”), the practicality of it can be a little more complex, particularly given that as an environment grows in size, so does the scope and the complexity of a backup system. If you don’t have some form of training (whether it’s internal, by an existing employed expert, or external), you’re at the edge of the event horizon, peering over into the abyss.
  5. We will implement a zero error policy: Again, there’s no such thing as a backup system when there’s no zero error policy. No ifs, no buts, no maybes. If you don’t rigorously implement a zero error policy, you’re flipping a coin every time you do a recovery, regardless of what backup product you use. (To learn more about a zero error policy, check out the trial podcast I did where that was the topic.)
  6. We will appoint a Data Protection Advocate: There’s a lot of data “out there” within a company, not necessarily under central IT control. Someone needs to be thinking about it. That someone should be the Data Protection Advocate (DPA). This person should be tasked with being the somewhat annoying person who is present at every change control meeting, raising her or his hand and saying “But wait, how will this affect our ability to protect our data?” That person should also be someone who wanders around the office(s) looking under desks for those pesky departmental servers and “test” boxes that are deployed, the extra hard drives attached to research machines, etc. If you have multiple offices, you should have a DPA per office. (The role of the DPA is outlined in this post, “What don’t you backup?“)
  7. We will assemble an Information Protection Advisory Council (IPAC): Sitting at an equal tier to the change control board, and reporting directly to the CTO/CIO/CFO, the IPAC will liaise with the DPA(s) and the business to make sure that everyone is across the contingencies that are in place for data protection, and be the “go-to” point for the business when it comes to putting new functions in place. They should be the group that sees a request for a new system or service and collectively liaises with the business and IT to ensure that the information generated by that system/service is protected. (If you want to know more about an IPAC and its role in the business, check out “But where does the DPA fit in?“)

And there you have it – the new years resolutions for your company. You may be surprised – while there’ll be a little effort getting these in place, once they’re there, you’re going to find backup, recovery, and the entire information protection process a lot easier to manage, and a lot more reliable.

 

For some time I’ve been debating whether to generate podcasts for the NetWorker blog.

Rather than continue to vacillate, I’ve decided to do a sample podcast, make it available here for downloading, and decide what to do based on feedback received.

While raw technical posts don’t translate well to podcasts (how do you quote screen output, for instance?), there’s a lot of backup theory related posts I make which can readily converted.

So, please follow the link below to the first podcast, in which I go over a topic near and dear to my heart: What is a zero error policy?

If you’re interested in me producing more podcasts, please let me know. Without feedback, I’ll likely leave it at just this trial. If people are interested though, I’ll setup a proper podcast stream within iTunes and get to work.

Podcast 001: What is a zero error policy?

Cheers!

 

Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the alter of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not however a blanket requirement that no error happens.

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently) successful your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.

You might wonder why I don’t call for a 100% success rate with backups – quite frankly much as it may be highly desirable, given the nature of a backup system – to touch on so many parts of an operating IT environment, it’s also one of the most vulnerable systems to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually asserting with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

 

Consider the following two questions:

  1. Do you manage your backups, or do your backups manage you?
  2. Does your organisation decide how backups should be done based on SLAs, etc., or do the backups dictate how production operates?

As you can well imagine, the answers to the above questions will very quickly tell you whether you’ve got a healthy, or a sick backup environment.

While it’s obvious how both questions should be answered, I’d wager that at least some readers will be getting that little twinge reading the above knowing that I’ve just described their backup environment as sick. And I don’t mean sick as in Gen-Y “fully sick”, I mean unwell.

If your backup environment manages you (most specifically your time and the amount of hair you’ve got left), or your backup environment dictates how production works, then you’ve got some problems you need to address. Now.

A lazy backup admin is a healthy backup admin

In 1996, I joined a system administration team that had one guiding motto: be lazy. Their attitude towards work was without a doubt the most influential one I’ve ever encountered, and it still guides my work life to this day.

I don’t mean lazy as in “avoid work”.

I mean lazy as in “automate! automate! automate!”

As far as they were concerned, the goal of the system administrator should be to automate all regular activities to the point that they should either be only ever doing one of four activities:

  1. Automating processes.
  2. Checking results of automated processes.
  3. Waiting for something to go wrong/intervention to be required.
  4. Working on a project.

The same approach should be taken in backups. You should not be say, mindlessly doing repetitive tasks that could be automated – you should be automating them and then checking the automation results. You shouldn’t be fixing errors on a daily basis, you should have a zero error policy, and error processing as an exceptional rather than an every day task. Or you should be working on the next phase of expanding or updating the backup environment.

Et tu, defendo?

The backup system shouldn’t be ambushing primary production. It should be there as a guardian, a defender – not the system that stabs from the shadows, or hogs the limelight.

Every backup product, and every backup system, will of course have limitations. But these limitations should not prevent critical activities in production from being undertaken. Instead, limitations should be ameliorated such that what needs to be done in production can still be done, with appropriate workarounds in place. If the limitations are hard ones which require a rethink of how production is done, it should not be at the expense of the business functions or the end users. This may require mitigation with other technologies – for instance, a classic scenario in situations where the backup product can’t run backups as frequently as SLAs require is to mix traditional backups and snapshots.

Some SLAs, in the light of the available budget and technology should be reassessed. However, that’s not to say all of them should in such situations. A sick backup system is where any SLA, no matter how justified, that can’t be immediately met by the backup system “as is”, is abandoned.

You’re not the boss of me

So, are you in charge of your backup system, or is your backup system is in charge of you?

If you can’t answer that question the right way, it’s time to seize control and make sure next time someone asks you, you can.

 

In the past I’ve talked about the importance of having zero error policies.

In “What is a zero error policy?“, I said:

Having a zero error policy requires the following three rules:

1. All errors shall be known.

2. All errors shall be resolved.

3. No error shall be allowed to continue to occur indefinitely.

If you’ve not read that article, I suggest you go read it, as well as the follow-up article, “Zero error policy management“.

I’m going to make, and stand by, with fervid determination, the following assertion:

If you do not have a zero-error policy for your backups, you do not have a valid backup system.

No ifs, no buts, no maybes, no exceptions.

Why? Because why. Because across all the sites I’ve seen, regardless of size, regardless of complexity, the only ones that actually work properly are those where every error is captured, identified, and dealt with. Only those sites would I point at and say “They have every chance of meeting their SLAs”.

In my book, I introduce the notion that just deploying software and thinking you have a backup system is like making a sacrifice to a volcano. So, without a zero error policy, what does a network diagram of your IT environment look like?

It looks like this:

Network diagram of backup environment without zero error policies

 

In previous articles I’ve discussed the need for zero error policies. This was covered first in What is a Zero Error Policy?, and followed up in Zero Error Policy Management. (If you’ve not read those articles, you really should before continuing.)

Key to ensuring a zero error policy is not only adopted, but also achieved, is a good understanding of the error lifecycle. That’s right – errors have a lifecycle, which is not only well defined, but actually helps us to keep them under control. An error lifecycle will resemble the following:

The error lifecycleThe start of the lifecycle is our Test and Detect loop:

  • Detect – An error is determined to have happened either as a result of a significant fault, or as a result of routine monitoring and analysis.
  • Test – An error is determined to have happened as a result of actual testing (formal or informal).

Once it’s determined that an error has happened, we then move into the resolution cycle, which consists of:

  • Diagnose – Determine the nature of the error – i.e., the root cause. If you don’t understand the actual cause, you can’t be certain that any solution you come up with is complete.
  • Rectify – Having understood the error, it’s time to resolve it. There’s two standard resolution techniques: complete resolution or workaround. Either are acceptable, so long as the resolution technique chosen is acceptable to the business and appropriate to the error.
  • Document – Once an error is solved, it needs to be documented. As has been said on numerous occasions, “Those who don’t learn from history are doomed to repeat it.” One of the worst possible error situations for instance is one where you’ve solved it in the past, but you can’t remember what you did and thus have to repeat the entire process. At minimum, documentation requires 3 components: (a) what lead to the error, (b) how the error manifests/is detected, and (c) how the error was resolved.

The error lifecycle doesn’t stop there though, as indicated by the diagram; instead, we add that error into a test and detection register – having encountered it, we should be able to more easily be on the look out for another instance. This is hopefully where the error finishes: being monitored for, but never again recurring. In the event though that it does reoccur, the diagnosis, rectification and documentation process should be simpler.

There you have it – the error lifecycle. Knowing it allows you to manage errors, rather than errors managing you.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

AA is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
BB is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
CC is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
DD is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
EE is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
FF is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
GG is for Growth. Your backup environment should be scoped to handle at least 2 years growth upon implementation. If it isn't, budgets haven't been established correctly.
HH is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
II is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
JJ is for Jeckyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jeckyll might – not storm through them like Mr Hyde, flying apart.
KK is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
LL is for Logs. Treasure your logs. Don't throw them away too quickly, make sure they're backed up too. With access to your logs, you can answer in 3 years time why a backup from yesterday is proving problematic to recover from.
MM is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
NN is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
OO is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
PP is for Procedures; without them, you don't have a backup system at all.
QQ is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
RR is for Recovery, the most important facet of data protection.
SS is for SLAs. (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
TT is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
UU is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
VV is for VTL. Whether you like the need for them or not, they're not going away any time soon.
WW is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet you recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
XX is for X-Ray. If you can't X-Ray your backup status, drill down and see how happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
YY is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
ZZ is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short, nor succinct, yet hopefully useful none-the-less.

 

The most visited post in August was again, Carry a jukebox with you (if you’re using Linux). I think part of this must be attributed to the linkage of Linux with Free. I.e., because Linux is seen as low cost (or no cost), there’s a core group, particularly of open source fans, who want to come up with a totally free solution for their environment, no matter what environment that is.

However, I don’t think that’s all that can be attributed to why this article keep on drawing people in. Despite my reservations about VTL, a lot of people are interested in deploying them. It’s important to stress again – I don’t dislike VTLs, I just wish we didn’t need them. Recognising though that we do need them, I can appreciate the management benefits that they bring to an environment.

From a support perspective of course I’m a big fan – with a VTL I can carry a jukebox around wherever I go.

The Linux VTL post even beat out old standards – the parallelism and NSR peer information related posts, which normally win hands down every month.

(From a policy and procedural perspective though, it was good to see that the introductory post to zero error policies, What is a Zero Error Policy?, got the next most attention. I can’t really stress enough how important I think zero error policies are to systems management in general, and backup/data protection specifically.)

 

In the first article on the subject, What is a zero error policy?, I established the three rules that need to be followed to achieve a zero error policy, viz:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

As a result of various questions and discussions I’ve had about this, I want to expand on the zero error approach to backups to discuss management of such a policy.

Saying that you’re going to implement a zero error policy – indeed, wanting to implement a zero error policy, and actually implementing are significantly different activities. So, in order to properly manage a zero error policy, the following three components must be developed, maintained and followed:

  1. Error classification.
  2. Procedures for dealing with errors.
  3. Documentation of the procedures and the errors.

In various cases I’ve seen companies try to implement a zero error policy by following one or two of the above, but they’ve never succeeded unless they’ve implemented all three.

Let’s look at each one individually.

Error Classification

Classification is at the heart of many activities we perform. In data storage, we classify data by its importance and its speed requirements, and assign tiers. In systems protection, we classify systems by whether they’re operational production, infrastructure support production, development, Q&A, test, etc. Stepping outside of IT, we routinely do things by classification – we pay bills in order of urgency, or we go shopping for the things we need sooner rather than the things we’re going to run out of in three months time, etc. Classification is not only important, but it’s also something we do (and understand the need for) naturally – i.e., it’s not hard to do.

In the most simple sense, errors for data protection systems can be broken down into three types:

  • Critical errors – If error X occurs then data loss occurs.
  • Hard errors – If error X occurs and data loss occurs, then recoverability cannot be achieved.
  • Soft errors – If error X occurs and data loss occurs, then recoverability can still be achieved, but with non-critical data recoverability uncertain.

Here’s a logical follow-up from the above classification – any backup system designed such that it can cause a critical error has been incorrectly designed. What’s an example of a critical error? Consider the following scenario:

  • Database is shutdown at 22:00 for cold backups by scheduled system task
  • Cold backup runs overnight
  • Database is automatically started at 06:00 by scheduled system task

Now obviously our preference would be to use a backup module, but that’s actually not the risk of critical error here: it’s the divorcing of the shutdown/startup from the actual filesystem backup. Why does this create a “critical error” situation, you may ask? On any system where exclusive file locking takes place, if for any reason the backup is still running when the database is started, corruption is likely to occur. (For example, I have seen Oracle databases on Windows destroyed by such scenarios.)

So, a critical error is one where the failure in the backup process will result in data loss. This is an unacceptable error; so, not only must we be able to classify critical errors, but all efforts must be made to ensure that no scenarios which permit critical errors are ever introduced to a system.

Moving on, a hard error is one where we can quantify that if the error occurs and we subsequently have data loss (recovery required), then we will not be able to facilitate that recovery to within our preferred (or required) windows. So if a client completely fails to backup overnight, or one filesystem on the client fails, then we would consider that to be a hard error – the backup did not work and thus if there is a failure on that client we cannot use that backup to recover.

A soft error, on the other hand, is an error that will not prevent core recovery from happening. These are the most difficult to classify. Using NetWorker as an example, you could say that these will often be the warnings issued during the backups where the backup still manages to complete. Perhaps the most common example of this is files being open (and thus inaccessible) during backup. However, we can’t (via a blanket rule) assume that any warning is a soft error – it could be a hard error in disguise.

To use language as an example, a syntax error is one which is immediately obvious. A semantic error is one where the meaning is not obvious. Thus, syntax errors cause an immediate failure, whereas semantic errors usually cause a bug.

Taking that analogy back to soft vs hard errors, and using our file-open example, you can readily imagine a scenario where files open during backup could constitute a hard or a soft error. In the case of a soft error, it may refer to temporary files that are generated by a busy system during backup processing. Such temporary files may have no relevance to the operational state of a recovered system, and thus the recoverability of the temporary files does not affect the recoverability* of the system as a whole. On the other hand, if critical data files are missed due to being open at the time of the backup, then the recoverability of the system as a whole is compromised.

So, to achieve a zero error policy, we must be able to:

  1. Classify critical errors, and ensure situations that can lead to them are designed out of the solution.
  2. Classify hard errors.
  3. Classify soft errors and be able to differentiate them from hard errors.

One (obvious) net result of this is that you must always check your backup results. No ifs, no buts, no maybes. For those who want to automatically parse backup results, as mentioned in the first article, it also means you must configure the automatic parser such that any unknown result is treated as an error for examination and either action or rule updating.

[Note: An interesting newish feature in NetWorker was the introduction of the "success threshold" option for backup groups. Set to "Warning", by default, this will see savesets that generated warnings (but not hard errors) flagged as successful. The other option is "Success", which means that in order for a saveset to be listed as a successful saveset, it must complete without warning. One may be able to argue that in an environment where all attempts have been made to eliminate errors, and the environment operates under a zero-error policy, then this option should be changed from the default to the more severe option.]

Procedures for dealing with errors

The ability to classify an error as critical, hard, or soft is practically useless unless procedures are established for dealing with the errors. Procedures for dealing with errors will be driven, at first, by any existing SLAs within the organisation. I.e., the SLA for either maximum amount of data loss or recovery time will drive the response to any particular error.

That response however shouldn’t be an unplanned reaction. That is, there should be procedures which define:

  1. By what time backup results will be checked.
  2. To whom (job title), to where (documentation), and by when critical and hard errors shall be reported.
  3. To where (documentation) soft errors shall be reported.
  4. For each system that is backed up, responses to hard errors. (E.g., some systems may require immediate re-run of the backup, whereas others may require the backup to be re-run later, etc.)

Note that this isn’t an exhaustive list – for instance, it’s obvious that any critical errors must be immediately responded to, since data loss has occurred. Equally it doesn’t take into account routine testing, etc., but the above procedures are more for the daily procedures associated with enacting a zero error policy.

Now, you may think that that the above requirements don’t constitute the need for procedures – that the processes can be followed informally. It may seem a callous argument to make, but in my experience in data protection, informal policies lead to laxity in following up those policies. (Or: if it isn’t written down, it isn’t done.)

Obviously when checks aren’t done it’s rarely for a malicious reason. However, knowing that “my boss would like a status report on overnight backups by 9am” is elastic – and so if we’re feeling there’s other things we need to look at first, we can choose to interpret that as “would like by 9am, but will settle for later”. If however there’s a procedure that says “management must have backup reports by 9am”, it takes away that elasticity. Where that is important is it actually helps in time management – tasks can be done in a logical and process required order, because there’s a definition of importance of activities within the role. This is critically important – not only for the person who has to perform the tasks, but also for those who would otherwise feel that they can assign other tasks that interrupt these critical processes. You’ve heard that a good offense is a good defense? Well, a good procedure is also a good defense – against lower priority interruptions.

Documentation of the procedures and the errors

There are two acutely different reasons why documentation must be maintained (or three, if you want to start including auditing as a reason). So, to rephrase that, there are three acutely different reasons why documentation must be maintained. These are as follows:

  1. For auditing and compliance reasons it will be necessary to demonstrate that your company has procedures (and documentation for those procedures) for dealing with backup failures.
  2. To deal with sudden staff absence – it may be as simple as someone not being able to make it in on time, or it could be the backup administrator gets hit by a bus and will be in traction in the hospital for two weeks (or worse).
  3. To assist any staff member who does not have an eidetic memory.

In day to day operations, it’s the third reason that’s the most important. Human memory is a wonderfully powerful search and recall tool, yet it’s also remarkably fallible. Sometimes I can remember seeing the exact message 3 years prior in an error log from another customer, but forget that I’d asked a particular question only a day ago and ask it again. We all have those moments. And obviously, I also don’t remember what my colleagues did half an hour ago if I wasn’t there with them at the time.

I.e., we need to document errors because that guarantees us being able to reference them later. Again – no ifs, no buts, no maybes. Perhaps the most important factor in documenting errors in a data protection environment though is documenting in a system that allows for full text search. At bare minimum, you should be able to:

  1. Classify any input error based on:
    • Date/Time
    • System (server and client)
    • Application (if relevant)
    • Error type – critical, hard, soft
    • Response
  2. Conduct a full text search (optionally date restricted):
    • On any of the methods used to classify
    • On the actual error itself

The above scenario fits nicely with Wiki systems, so that may be one good scenario, but there are others out there that can be equally used.

The important thing though is to get the documentation done. What may initially seem time consuming when a zero error policy is enacted will quickly become quick and automatic; combined with the obvious reduction in errors over time in a zero error policy, the automatic procedural response to errors will actually streamline the activities of the backup administrator.

That documentation obviously, on a day to day basis, provides the most assistance to the person(s) in the ongoing role of backup administrator. However, in any situation where someone else has to fill in, this documentation becomes even more important – it allows them to step into the role, data mine for any message they’re not sure of and see what the local response was if a situation had happened before. Put yourself into the shoes of that other person … if you’re required to step into another person’s role temporarily, do you want to do it with plenty of supporting information, or with barely anything more than the name of the system you have to administer?

Wrapping Up

Just like when I first discussed zero error policies, you may be left thinking at the end of this that it sounds like there’s a lot of work involved in managing a zero error policy. It’s important to understand however that there’s always effort involved in any transition from a non-managed system to a managed system (i.e., from informal policies to formal procedures). However, for the most part this extra work mainly comes in at the institution of the procedures – namely in relation to:

  • Determining appropriate error categorisation techniques
  • Establishing the procedures
  • Establishing the documentation of the procedures
  • Establishing the documentation system used for the environment

Once these activities have been done, day to day management and operation of the zero error policy becomes a standard part of the job, and therefore doesn’t represent a significant impact to work. That’s for two key reasons: once these components are in place then following them really doesn’t take a lot of extra time, and that time that it does take is actually factored into the job, so the extra time taken can hardly be considered wasteful or frivolous.

At both a personal and ethical level, it’s also extremely satisfying to be able to answer the question, “How many errors slipped through the net today?” with “None”.

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha