Long-term blog readers will know that I advocate a zero error policy within backup environments.
This is elucidated in my posts:
- What is a zero error policy?
- Zero error policy management
- Error lifecycle management
- No zero error policy? No backup system
You could say that those posts are precursors to this post, and if you’re not familiar with what I’ve had to say there, you may want to read those first.
One of the critical mistakes I periodically see when companies try to implement a zero error policy is that they focus too much on the errors.
The errors, though, are often just the “tip of the iceberg”.
For instance, take the simplest of errors: an open file error. You might run a backup of a Windows filesystem which reports a collection of errors relating to files that were skipped because they were open at the time.
Yet, those open files aren’t really the error. Seeing them as the error is usually a case of mistaking cause and effect. In this scenario, the error is one of:
- The backup software is misconfigured, or
- The backup software is missing modules that allow it to back up open files.
In the first case, it may be that the file(s) which are reported as open and couldn’t be backed up actually don’t need to be backed up. They may be temporary files, or cache files, or some other short-lived collection of files that have no importance in terms of data protection. So the error there isn’t the individual files that failed to back up, but the failure to configure the exclusions for the client appropriately.
In the second case, it may be that those files really do need to be backed up, but to do so requires a special module. They may be database files (e.g., Microsoft SQL Server, Microsoft Exchange, etc.), or some other collection of files that must be quiesced before backup. In this case, the error is that the system is being backed up inconsistently.
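To make the distinction concrete, here is a minimal sketch of how a list of skipped open files might be grouped by likely root cause rather than treated as individual errors. The suffix lists, category names and function names are all hypothetical; in practice the lists would come out of discussion with the system owners, not from a hard-coded guess.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

# Hypothetical suffix lists for illustration only; real ones would be
# agreed with the system owners.
DISPOSABLE_SUFFIXES = {".tmp", ".temp", ".cache"}
DATABASE_SUFFIXES = {".mdf", ".ldf", ".edb"}  # SQL Server and Exchange files


def root_cause(path: str) -> str:
    """Map one skipped open file to the problem it most likely represents."""
    suffix = PureWindowsPath(path).suffix.lower()
    if suffix in DISPOSABLE_SUFFIXES:
        # Short-lived file: the real error is a missing exclusion.
        return "missing exclusion"
    if suffix in DATABASE_SUFFIXES:
        # Needs quiescing: the real error is a missing database module.
        return "missing database module"
    # Anything else is not something to guess at.
    return "discuss with system owner"


def group_by_root_cause(skipped_files):
    """Collapse many per-file errors into a handful of underlying problems."""
    groups = defaultdict(list)
    for path in skipped_files:
        groups[root_cause(path)].append(path)
    return dict(groups)
```

Run against a night's worth of open file errors, something like this turns hundreds of lines into two or three problems to solve, plus a short list of files that need a conversation.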
Zero error policies aren’t about playing whack-a-mole with errors; they’re about solving problems.
After all, the captain of the Titanic couldn’t have averted the disaster by stopping the ship just short of the iceberg and having someone take a pickaxe to the top of it.
The net result of this is that having a zero error policy requires the following two processes/activities:
- Discussion of errors with system owners/nominated key users;
- Root cause analysis.
If either of those is missing, you’re more likely to be making guesses (educated ones at best) as to the correct resolution to the errors. However, if you have both in place, you can more confidently review any error as it occurs and make an informed (and even documented) decision as to how to resolve the underlying issue that it represents.
Without them, a zero error policy may actually make the situation worse.
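As a rough illustration of what a documented decision could look like, the sketch below records an error alongside the two activities listed above and refuses to call the decision informed unless both are filled in. The class and field names are hypothetical, not taken from any backup product.

```python
from dataclasses import dataclass


@dataclass
class ErrorDecision:
    """Hypothetical record of how one backup error was resolved."""
    error: str
    owner_feedback: str = ""  # outcome of discussion with the system owner
    root_cause: str = ""      # outcome of root cause analysis
    resolution: str = ""

    def is_informed(self) -> bool:
        # Without both inputs, the resolution is only a guess.
        return bool(self.owner_feedback.strip()) and bool(self.root_cause.strip())
```

A record with a resolution but no owner feedback or root cause is exactly the whack-a-mole case: the error has gone away, but nobody can say whether the problem has.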