Error lifecycle management

In previous articles I’ve discussed the need for zero error policies. This was covered first in “What is a Zero Error Policy?”, and followed up in “Zero Error Policy Management”. (If you’ve not read those articles, you really should before continuing.)

Key to ensuring a zero error policy is not only adopted, but also achieved, is a good understanding of the error lifecycle. That’s right – errors have a lifecycle, which is not only well defined, but actually helps us to keep them under control. An error lifecycle will resemble the following:

[Figure: The error lifecycle]

The start of the lifecycle is our Test and Detect loop:

  • Detect – An error is determined to have happened either as a result of a significant fault, or as a result of routine monitoring and analysis.
  • Test – An error is determined to have happened as a result of actual testing (formal or informal).

Once it’s determined that an error has happened, we then move into the resolution cycle, which consists of:

  • Diagnose – Determine the nature of the error – i.e., the root cause. If you don’t understand the actual cause, you can’t be certain that any solution you come up with is complete.
  • Rectify – Having understood the error, it’s time to resolve it. There are two standard resolution techniques: complete resolution or workaround. Either is fine, so long as the technique chosen is acceptable to the business and appropriate to the error.
  • Document – Once an error is solved, it needs to be documented. As has been said on numerous occasions, “Those who don’t learn from history are doomed to repeat it.” One of the worst possible situations, for instance, is meeting an error you’ve solved in the past, but being unable to remember what you did and so having to repeat the entire process. At minimum, documentation requires three components: (a) what led to the error, (b) how the error manifests/is detected, and (c) how the error was resolved.
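The resolution cycle above can be sketched as a small state machine. This is only an illustration, not a prescribed implementation: the names ErrorRecord, Stage and Resolution, and the sample error text, are all hypothetical. It enforces the order Diagnose, Rectify, Document, and produces the three minimum documentation components at the end.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    FOUND = "found"            # surfaced by the Test and Detect loop
    DIAGNOSED = "diagnosed"
    RECTIFIED = "rectified"
    DOCUMENTED = "documented"


class Resolution(Enum):
    COMPLETE = "complete resolution"
    WORKAROUND = "workaround"


@dataclass
class ErrorRecord:
    """Hypothetical record that tracks one error through the resolution cycle."""
    summary: str
    stage: Stage = Stage.FOUND
    root_cause: str = ""
    resolution: Optional[Resolution] = None
    fix: str = ""
    manifestation: str = ""

    def _require(self, expected: Stage) -> None:
        # Refuse to skip a step, e.g. rectifying before the cause is understood.
        if self.stage is not expected:
            raise ValueError(
                f"expected stage {expected.value}, got {self.stage.value}"
            )

    def diagnose(self, root_cause: str) -> None:
        self._require(Stage.FOUND)
        self.root_cause = root_cause
        self.stage = Stage.DIAGNOSED

    def rectify(self, resolution: Resolution, fix: str) -> None:
        self._require(Stage.DIAGNOSED)
        self.resolution = resolution
        self.fix = fix
        self.stage = Stage.RECTIFIED

    def document(self, manifestation: str) -> dict:
        self._require(Stage.RECTIFIED)
        self.manifestation = manifestation
        self.stage = Stage.DOCUMENTED
        # The three minimum components: cause, manifestation, resolution.
        return {
            "cause": self.root_cause,
            "manifestation": self.manifestation,
            "resolution": f"{self.resolution.value}: {self.fix}",
        }


# Illustrative walk through the cycle with made-up details.
record = ErrorRecord("nightly job failed")
record.diagnose("scratch volume ran out of space")
record.rectify(Resolution.WORKAROUND, "purged old scratch files")
notes = record.document("job log ends with a write failure")
```

Raising an exception when a step is skipped is one design choice among several; the point is simply that a fix applied before diagnosis cannot be known to be complete.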

As the diagram indicates, though, the error lifecycle doesn’t stop there. Instead, we add the error to a test and detection register – having encountered it once, we should find it easier to be on the lookout for another instance. This is hopefully where the error finishes: monitored for, but never recurring. If it does recur, though, the diagnosis, rectification and documentation process should be simpler, because the cause and the fix are already on record.
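One possible shape for a test and detection register is a list of known error signatures that routine monitoring checks log lines against. The sketch below assumes errors show up as text in logs and can be matched with regular expressions; the names KnownError and DetectionRegister, and the sample entry, are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    name: str
    pattern: str      # regular expression describing how the error manifests
    resolution: str   # pointer to the documented fix


class DetectionRegister:
    """Hypothetical register of previously documented errors."""

    def __init__(self) -> None:
        self._entries = []

    def add(self, entry: KnownError) -> None:
        # Compile once so repeated scans stay cheap.
        self._entries.append((re.compile(entry.pattern), entry))

    def scan(self, log_lines):
        """Return (line, KnownError) pairs for each line matching a known error."""
        hits = []
        for line in log_lines:
            for regex, entry in self._entries:
                if regex.search(line):
                    hits.append((line, entry))
        return hits


# Illustrative entry and log lines; the details are made up.
register = DetectionRegister()
register.add(KnownError(
    name="disk-full",
    pattern=r"No space left on device",
    resolution="purge old scratch files, then rerun the job",
))
hits = register.scan([
    "job started",
    "write failed: No space left on device",
])
```

Each hit carries the documented resolution with it, which is what makes a recurrence quicker to handle than the first occurrence.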

There you have it – the error lifecycle. Knowing it allows you to manage errors, rather than errors managing you.
