Obviously the NetWorker Blog gets a lot of referrals from search engines via people looking specifically for help on particular NetWorker issues they’re encountering. In the last 8+ hours alone, here are just some of the search terms that people used:

nmc doesn’t start

restore networker aborted saveset

networker disk backup module

nsr_render_log command

nsr_render_log daemon.raw

networker centos support

39077:jbconfig: error, you must install the lus scsi passthrough driver before configuring

And the list goes on and on, on a daily basis. This was reflected in the Top 10 for 2011 (and indeed, the top 10 for every previous year, too).

I’ll let you all in on a little secret though: all of those tips, all of those NetWorker basics articles and how to use nsradmin user guides – they’re all just the tip of the iceberg when it comes to getting a working backup system in place.

You see, a lot of sites don’t have a backup system at all – they just have some backup software and backup hardware and configuration. That doesn’t represent a backup system at all. In my article, “What is a backup system?”, I provided this diagram to explain such beasts:

Backup system

As you can see, the technology (the backup software, hardware and configuration) represents just one entry point to having a backup system. The others though are all equally critical; and when you add them all in together, it becomes clear that a backup system will derive much of its success and reliability from the human and business factors.

The technology, you see, is the easiest part of the backup environment; and it’s also the part that’s most likely to appeal to IT people. If you were to graph how much time the average site spends on each of those activities, it would probably look like this:

Imbalanced backup systems

When in actual fact, it should look more like this:

Balanced backup system

The short description? If you chart the amount of time you spend on your backup “system”, and the Technology aspect (software, hardware, configuration) becomes a Pacman to the rest of the components, eating away at those other facets, then you’ve got a cannibalistic environment that’s surviving as much on luck as on good design.

That’s why I bang on so much about backup theory – because all the latest and greatest technology in the world won’t help you at all if you don’t have everything else set up in conjunction with it:

  • The people involved need to know their roles, and participate in both the architecture of the environment and its ongoing operation;
  • The processes for use of the system must be well established;
  • The system must be thoroughly documented;
  • The system must be tested or you’ve got no way of establishing reliability;
  • The Service Level Agreements have to be established or else there’s no point whatsoever to what you’re doing.

Backup theory isn’t the boring part of a backup system; I’d suggest it’s actually the most interesting part of it. Just as I suggested that companies need to plan to follow some New Year’s resolutions for backup systems, I’d equally suggest that the people involved in backups should start making it their goal to spend a balanced amount of time on the components that form a backup system.

If you don’t have the theory, you actually don’t have a system.

If you want to know more, you should treat yourself to my book (now available in Kindle format).

 

There’s a pertinent adage in cooking when it comes to using wine in recipes:

If you wouldn’t drink it, don’t cook with it.

It’s simple: if you don’t like the taste of it in a glass, what makes you think you’ll like the taste of food you’ve added it to?

There are two similar rules for backup, and they’re particularly important when it comes time to do those periodic hardware refreshes in your environment:

If it’s not good enough to run production, don’t use it for DR.

If it’s not good enough to run production, don’t use it for backup.

The way in which both of these come into play is quite simple:

  1. If it’s not good enough to run production, don’t use it for DR. I’ve seen companies have a hardware refresh cycle of “move production equipment to DR, buy new production equipment”. However, invariably that equipment is being pulled out of production because it’s either lacking in capacity, or lacking in performance. That equipment is then going to be replaced with new equipment with planned usage time of (typically) 2-3 years. So let’s assume you get a year down the track – your in-use storage capacity has gone up, your processing load has increased, then there’s a major production fault and you have to failover to DR. At which point, you’re trying to run your production environment on something that was sized to max out 12 months ago. Chances of it adequately running production? Minimal.
  2. If it’s not good enough to run production, don’t use it for backup. Another common mistake is a situation whereby, say, a storage array is pulled out of production and replaced with a new, faster array with more capacity. People invariably hate to see things go to waste, so someone suggests “let’s use the old array as {backup to disk | VTL | etc}”. Again, it sounds simple enough on the face of it, except the equipment was either lacking in performance, or lacking in capacity. If it was lacking in performance, you’re putting it into a situation where it has to keep up with copies from something that was purchased, at the outset, to be significantly faster than it. It’s similar with capacity – you’re going to be trying to back up a very large bucket to a much smaller bucket.

Whether your company likes the idea of it or not, backup and disaster recovery are not areas that should be assigned “hand me downs” by the rest of the business. They require their own capital budget, and planning that allows for the following two factors:

  1. Performance should at least match the throughput on offer from production;
  2. It should exceed your production capacity.

If either of these conditions is not met, your strategy is insufficient.
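As a rough sketch of the two conditions above, the check below compares a proposed backup or DR platform against production. The `Platform` class, its field names and all of the figures are assumptions made up for illustration; they are not measurements from any real environment.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Hypothetical sizing summary; the field names are illustrative only."""
    name: str
    throughput_mb_s: float  # sustained throughput in MB/s
    capacity_tb: float      # usable capacity in TB


def sizing_problems(production, candidate):
    """Return the reasons the candidate is unfit to protect production."""
    problems = []
    # Condition 1: performance should at least match production throughput.
    if candidate.throughput_mb_s < production.throughput_mb_s:
        problems.append("throughput below production")
    # Condition 2: capacity should exceed production capacity.
    if candidate.capacity_tb <= production.capacity_tb:
        problems.append("capacity does not exceed production")
    return problems


# Made-up figures: a retired array proposed as a backup-to-disk target.
new_array = Platform("new production array", throughput_mb_s=1200, capacity_tb=80)
old_array = Platform("retired array", throughput_mb_s=400, capacity_tb=30)
print(sizing_problems(new_array, old_array))
```

An empty list means both conditions hold; anything else is a reason to question the hand-me-down.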

 

Are your service level agreements and your backup software support contracts in alignment?

A lot of companies will make the decision to run with “business hours” backup support – 9 to 5, or some variant like that, Monday to Friday. This is seen as a cheaper option, and for some companies, depending on their requirements, it can be a perfectly acceptable arrangement too. That’s usually the case where there are no SLAs, or smaller environments where the business is geared to being able to operate for protracted periods with minimal IT.

What can sometimes be forgotten in attempts to restrain budgets is whether reduced support for production support systems has any impact on meeting business requirements relating to service level agreements. If for instance, you have to start getting data flowing back within 2 hours of a failure, a system fails at midnight and the subsequent recovery has issues, your chances of being able to hit your service level agreement start to plummet if you don’t have a support contract that guarantees you access to help at this point in time.
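The midnight scenario can be worked through with some simple date arithmetic. This is only a sketch: it assumes Monday to Friday, 9 to 5 support with no public holidays, and the function names and the date are invented for the example.

```python
from datetime import datetime, time, timedelta


def next_support_start(failure, opens=time(9, 0), closes=time(17, 0)):
    """Earliest moment help is available under Monday-Friday business-hours support."""
    moment = failure
    while True:
        is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
        if is_weekday and opens <= moment.time() < closes:
            return moment  # support is open right now
        if is_weekday and moment.time() < opens:
            return datetime.combine(moment.date(), opens)  # opens later today
        # Otherwise roll forward to midnight of the next day and check again.
        moment = datetime.combine(moment.date() + timedelta(days=1), time(0, 0))


def support_arrives_within_sla(failure, sla):
    """True if contracted support becomes available before the SLA deadline."""
    return next_support_start(failure) <= failure + sla


# The scenario above: a failure at midnight with a 2 hour SLA.
midnight_failure = datetime(2012, 1, 10, 0, 0)  # a Tuesday, chosen arbitrarily
print(support_arrives_within_sla(midnight_failure, timedelta(hours=2)))  # False
```

With these assumptions the first support call cannot be answered until 09:00, seven hours after the SLA deadline has passed.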

A common response to this from management – be it IT, or financial – is “we’ll buy per-incident support if we need to”. In other words, the service level agreements the business has established necessitate a better support contract than is budgeted for, so it is ‘officially’ planned to “wing it” in the event of a serious issue.

I describe that as an Icarus Support Contract.

Icarus, as you may remember, is from Greek mythology. His father Daedalus fashioned wings out of feathers and wax so that he and Icarus could escape from prison. They escaped, but Icarus, enjoying the sensation of flight so much, disregarded his father’s warnings about flying too high. The higher he got, the closer he was to the sun. Then, eventually, the sun melted the wax, his wings fell off, and he fell to his death into the sea.

Planning to buy per-incident support is effectively building a contingency plan based on unbooked, unallocated resources.

It’s also about as safe as relying on wings held together by wax when flying high. Sure, if you’re lucky, you’ll sneak through it; but do you really want to trust data recovery and SLAs to luck? What if those unbooked resources are already working on something for someone who does have a 24×7 contract? There’s a creek for that – and a paddle too.

In a previous job, I once discussed disaster recovery preparedness with an IT manager at a financial institution. Their primary site and their DR site were approximately 150 metres away from one another, leaving them with very little wiggle room in the event of a major catastrophe in the city. (Remember, the site being inaccessible can be just as deadly to business as the site being destroyed – and while there are far fewer things that may destroy two city blocks, there are plenty more things that might cut off two city blocks from human access for days.)

When questioned about the proximity of the two sites, he wasn’t concerned. Why? They were a big financial institution, they had emergency budget, and they were a valued customer of a particular server/storage manufacturer. Quite simply, if something happened and they lost both sites, they’d just go and buy or rent a truckload of new equipment and get themselves back operational again via backups. I always found this a somewhat dubious preparedness strategy – it’s definitely an example of an Icarus support contract.

I’ve since talked to account managers at multiple server/storage vendors, including the one used in this scenario, and all of them, in this era of shortened inventory streams, have scoffed at the notion of being able to instantly drop in 200+ servers and appropriate storage at the drop of a hat – especially in a situation where there’s a disaster and there’s a run on such equipment. (In Australia for instance, a lot of high end storage kit usually takes 3-6 weeks to arrive since it’s normally shipped in from overseas.)

Icarus was a naïve fool who got lost in the excitement of the moment. The fable of Icarus teaches us the perils of ignoring danger and enjoying the short-term too much. In this case, relying on future unbooked resources in the event of an issue in order to save a few dollars here and there in the now isn’t all that reliable. It’s like the age-old tape cost-cutting: if you manage to shave 10% off the backup media budget by deciding not to back up certain files or certain machines, you may very well get thanked for it. However, no-one will remember congratulating you when there’s butt-kicking to be done if it turns out that data no longer being backed up actually needed recovery.

So what is an Icarus support contract? Well, it’s a contract where you rely on luck. It’s a gamble – that in the event of a serious problem, you can buy immediate assistance at the drop of a hat. Just how bad can planning on being lucky get? Well, consider that over the last 18 months the entire world has been dealing with Icarus financial contracts – they were officially called Sub-Prime Mortgages, but the net result was the same – they were contracts and financial agreements built around the principle of luck.

Do your business a favor, and avoid Icarus support contracts. That’s the real way to get lucky in business – to not factor luck into your equations.

 

In the first article on the subject, What is a zero error policy?, I established the three rules that need to be followed to achieve a zero error policy, viz:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

As a result of various questions and discussions I’ve had about this, I want to expand on the zero error approach to backups to discuss management of such a policy.

Saying that you’re going to implement a zero error policy (indeed, wanting to implement one) and actually implementing one are significantly different activities. So, in order to properly manage a zero error policy, the following three components must be developed, maintained and followed:

  1. Error classification.
  2. Procedures for dealing with errors.
  3. Documentation of the procedures and the errors.

In various cases I’ve seen companies try to implement a zero error policy by following one or two of the above, but they’ve never succeeded unless they’ve implemented all three.

Let’s look at each one individually.

Error Classification

Classification is at the heart of many activities we perform. In data storage, we classify data by its importance and its speed requirements, and assign tiers. In systems protection, we classify systems by whether they’re operational production, infrastructure support production, development, QA, test, etc. Stepping outside of IT, we routinely do things by classification – we pay bills in order of urgency, or we go shopping for the things we need sooner rather than the things we’re going to run out of in three months’ time, etc. Classification is not only important, but it’s also something we do (and understand the need for) naturally – i.e., it’s not hard to do.

In the most simple sense, errors for data protection systems can be broken down into three types:

  • Critical errors – If error X occurs then data loss occurs.
  • Hard errors – If error X occurs and data loss occurs, then recoverability cannot be achieved.
  • Soft errors – If error X occurs and data loss occurs, then recoverability can still be achieved, but with non-critical data recoverability uncertain.

Here’s a logical follow-up from the above classification – any backup system designed such that it can cause a critical error has been incorrectly designed. What’s an example of a critical error? Consider the following scenario:

  • Database is shut down at 22:00 for cold backups by scheduled system task
  • Cold backup runs overnight
  • Database is automatically started at 06:00 by scheduled system task

Now obviously our preference would be to use a backup module, but that’s actually not the risk of critical error here: it’s the divorcing of the shutdown/startup from the actual filesystem backup. Why does this create a “critical error” situation, you may ask? On any system where exclusive file locking takes place, if for any reason the backup is still running when the database is started, corruption is likely to occur. (For example, I have seen Oracle databases on Windows destroyed by such scenarios.)
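One way to avoid that divorce is to drive all three steps from a single job, so the database is only restarted once the backup has actually ended. The sketch below is not NetWorker-specific: `stop_db`, `run_backup` and `start_db` are hypothetical hooks standing in for whatever a site really uses to stop the database, back up its files and start it again.

```python
def cold_backup(stop_db, run_backup, start_db):
    """Run shutdown, backup and startup as one sequence.

    The three callables are hypothetical hooks supplied by the site.
    run_backup must block until the backup has finished and return
    True on success.
    """
    stop_db()
    try:
        # The call blocks for as long as the backup takes, so the
        # database cannot be started while files are still being read.
        succeeded = run_backup()
    finally:
        # Restart only after the backup has ended, even if it raised.
        start_db()
    return succeeded


# Usage with stand-in hooks that just record the order of events.
events = []
ok = cold_backup(lambda: events.append("stop"),
                 lambda: events.append("backup") or True,
                 lambda: events.append("start"))
print(events, ok)
```

The point of the design is that there is no clock-driven 06:00 startup at all: the restart is triggered by the backup finishing, not by the time of day.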

So, a critical error is one where the failure in the backup process will result in data loss. This is an unacceptable error; so, not only must we be able to classify critical errors, but all efforts must be made to ensure that no scenarios which permit critical errors are ever introduced to a system.

Moving on, a hard error is one where we can quantify that if the error occurs and we subsequently have data loss (recovery required), then we will not be able to facilitate that recovery to within our preferred (or required) windows. So if a client completely fails to back up overnight, or one filesystem on the client fails, then we would consider that to be a hard error – the backup did not work and thus if there is a failure on that client we cannot use that backup to recover.

A soft error, on the other hand, is an error that will not prevent core recovery from happening. These are the most difficult to classify. Using NetWorker as an example, you could say that these will often be the warnings issued during the backups where the backup still manages to complete. Perhaps the most common example of this is files being open (and thus inaccessible) during backup. However, we can’t (via a blanket rule) assume that any warning is a soft error – it could be a hard error in disguise.

To use language as an example, a syntax error is one which is immediately obvious. A semantic error is one where the meaning is not obvious. Thus, syntax errors cause an immediate failure, whereas semantic errors usually cause a bug.

Taking that analogy back to soft vs hard errors, and using our file-open example, you can readily imagine a scenario where files open during backup could constitute a hard or a soft error. In the case of a soft error, it may refer to temporary files that are generated by a busy system during backup processing. Such temporary files may have no relevance to the operational state of a recovered system, and thus the recoverability of the temporary files does not affect the recoverability* of the system as a whole. On the other hand, if critical data files are missed due to being open at the time of the backup, then the recoverability of the system as a whole is compromised.
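The file-open example can be sketched as a small classifier. Everything here is an assumption for illustration: the `Severity` names simply mirror the three error types above, and the patterns and paths are invented, since what counts as a disposable file is entirely site-specific.

```python
from enum import Enum
from fnmatch import fnmatch


class Severity(Enum):
    CRITICAL = "critical"  # the backup process itself causes data loss
    HARD = "hard"          # recovery cannot be achieved from this backup
    SOFT = "soft"          # core recovery is still achievable


# Hypothetical site-specific patterns for files whose loss does not matter.
DISPOSABLE_PATTERNS = ["/tmp/*", "*.tmp", "/var/cache/*"]


def classify_open_file(path):
    """Classify a 'file was open during backup' warning by what was missed."""
    if any(fnmatch(path, pattern) for pattern in DISPOSABLE_PATTERNS):
        return Severity.SOFT
    # Anything not known to be disposable is assumed to matter.
    return Severity.HARD


print(classify_open_file("/tmp/sort1234"))
print(classify_open_file("/u01/oradata/users01.dbf"))
```

Defaulting to a hard error is deliberate: a warning only becomes soft once someone has decided that the missed file does not affect recoverability.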

So, to achieve a zero error policy, we must be able to:

  1. Classify critical errors, and ensure situations that can lead to them are designed out of the solution.
  2. Classify hard errors.
  3. Classify soft errors and be able to differentiate them from hard errors.

One (obvious) net result of this is that you must always check your backup results. No ifs, no buts, no maybes. For those who want to automatically parse backup results, as mentioned in the first article, it also means you must configure the automatic parser such that any unknown result is treated as an error for examination and either action or rule updating.

[Note: An interesting newish feature in NetWorker was the introduction of the "success threshold" option for backup groups. Set to "Warning", by default, this will see savesets that generated warnings (but not hard errors) flagged as successful. The other option is "Success", which means that in order for a saveset to be listed as a successful saveset, it must complete without warning. One may be able to argue that in an environment where all attempts have been made to eliminate errors, and the environment operates under a zero-error policy, then this option should be changed from the default to the more severe option.]

Procedures for dealing with errors

The ability to classify an error as critical, hard, or soft is practically useless unless procedures are established for dealing with the errors. Procedures for dealing with errors will be driven, at first, by any existing SLAs within the organisation. I.e., the SLA for either maximum amount of data loss or recovery time will drive the response to any particular error.

That response however shouldn’t be an unplanned reaction. That is, there should be procedures which define:

  1. By what time backup results will be checked.
  2. To whom (job title), to where (documentation), and by when critical and hard errors shall be reported.
  3. To where (documentation) soft errors shall be reported.
  4. For each system that is backed up, responses to hard errors. (E.g., some systems may require immediate re-run of the backup, whereas others may require the backup to be re-run later, etc.)

Note that this isn’t an exhaustive list – for instance, it’s obvious that any critical errors must be immediately responded to, since data loss has occurred. Equally it doesn’t take into account routine testing, etc., but the above procedures are more for the daily procedures associated with enacting a zero error policy.

Now, you may think that the above requirements don’t constitute the need for procedures – that the processes can be followed informally. It may seem a callous argument to make, but in my experience in data protection, informal policies lead to laxity in following up those policies. (Or: if it isn’t written down, it isn’t done.)

Obviously when checks aren’t done it’s rarely for a malicious reason. However, knowing that “my boss would like a status report on overnight backups by 9am” is elastic – and so if we’re feeling there’s other things we need to look at first, we can choose to interpret that as “would like by 9am, but will settle for later”. If however there’s a procedure that says “management must have backup reports by 9am”, it takes away that elasticity. Where that is important is it actually helps in time management – tasks can be done in a logical and process required order, because there’s a definition of importance of activities within the role. This is critically important – not only for the person who has to perform the tasks, but also for those who would otherwise feel that they can assign other tasks that interrupt these critical processes. You’ve heard that a good offense is a good defense? Well, a good procedure is also a good defense – against lower priority interruptions.

Documentation of the procedures and the errors

There are two acutely different reasons why documentation must be maintained (or three, if you want to start including auditing as a reason). So, to rephrase that, there are three acutely different reasons why documentation must be maintained. These are as follows:

  1. For auditing and compliance reasons it will be necessary to demonstrate that your company has procedures (and documentation for those procedures) for dealing with backup failures.
  2. To deal with sudden staff absence – it may be as simple as someone not being able to make it in on time, or it could be the backup administrator gets hit by a bus and will be in traction in the hospital for two weeks (or worse).
  3. To assist any staff member who does not have an eidetic memory.

In day to day operations, it’s the third reason that’s the most important. Human memory is a wonderfully powerful search and recall tool, yet it’s also remarkably fallible. Sometimes I can remember seeing the exact message 3 years prior in an error log from another customer, but forget that I’d asked a particular question only a day ago and ask it again. We all have those moments. And obviously, I also don’t remember what my colleagues did half an hour ago if I wasn’t there with them at the time.

I.e., we need to document errors because that guarantees us being able to reference them later. Again – no ifs, no buts, no maybes. Perhaps the most important factor in documenting errors in a data protection environment though is documenting in a system that allows for full text search. At bare minimum, you should be able to:

  1. Classify any input error based on:
    • Date/Time
    • System (server and client)
    • Application (if relevant)
    • Error type – critical, hard, soft
    • Response
  2. Conduct a full text search (optionally date restricted):
    • On any of the methods used to classify
    • On the actual error itself
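To make those minimum requirements concrete, here is a small in-memory sketch. The `ErrorRecord` fields follow the classification list above; the class and function names, and the sample entries, are invented for the example, and a real site would more likely use a wiki or database than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ErrorRecord:
    """One entry in the error log; fields mirror the classification list."""
    when: datetime
    server: str
    client: str
    application: Optional[str]
    error_type: str   # "critical", "hard" or "soft"
    message: str      # the actual error text
    response: str     # what was done about it


def search(records, text, since=None, until=None):
    """Case-insensitive full text search, optionally restricted by date."""
    needle = text.lower()
    hits = []
    for rec in records:
        if since is not None and rec.when < since:
            continue
        if until is not None and rec.when > until:
            continue
        haystack = " ".join([rec.server, rec.client, rec.application or "",
                             rec.error_type, rec.message, rec.response]).lower()
        if needle in haystack:
            hits.append(rec)
    return hits


# Made-up entries, not real NetWorker messages.
log = [
    ErrorRecord(datetime(2012, 1, 9, 2, 15), "backupsrv", "db01", "Oracle",
                "hard", "example: backup of /u01 did not complete",
                "re-ran backup at 07:00"),
    ErrorRecord(datetime(2012, 1, 10, 1, 40), "backupsrv", "web03", None,
                "soft", "example: temporary file open during backup",
                "noted, no action required"),
]
for hit in search(log, "oracle"):
    print(hit.when, hit.client, hit.response)
```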

The above scenario fits nicely with Wiki systems, so that may be one good scenario, but there are others out there that can be equally used.

The important thing though is to get the documentation done. What may initially seem time consuming when a zero error policy is enacted will soon become quick and automatic; combined with the obvious reduction in errors over time in a zero error policy, the automatic procedural response to errors will actually streamline the activities of the backup administrator.

That documentation obviously, on a day to day basis, provides the most assistance to the person(s) in the ongoing role of backup administrator. However, in any situation where someone else has to fill in, this documentation becomes even more important – it allows them to step into the role, data mine for any message they’re not sure of and see what the local response was if a situation had happened before. Put yourself into the shoes of that other person … if you’re required to step into another person’s role temporarily, do you want to do it with plenty of supporting information, or with barely anything more than the name of the system you have to administer?

Wrapping Up

Just like when I first discussed zero error policies, you may be left thinking at the end of this that it sounds like there’s a lot of work involved in managing a zero error policy. It’s important to understand however that there’s always effort involved in any transition from a non-managed system to a managed system (i.e., from informal policies to formal procedures). However, for the most part this extra work mainly comes in at the institution of the procedures – namely in relation to:

  • Determining appropriate error categorisation techniques
  • Establishing the procedures
  • Establishing the documentation of the procedures
  • Establishing the documentation system used for the environment

Once these activities have been done, day to day management and operation of the zero error policy becomes a standard part of the job, and therefore doesn’t represent a significant impact to work. That’s for two key reasons: once these components are in place then following them really doesn’t take a lot of extra time, and that time that it does take is actually factored into the job, so the extra time taken can hardly be considered wasteful or frivolous.

At both a personal and ethical level, it’s also extremely satisfying to be able to answer the question, “How many errors slipped through the net today?” with “None”.

 

In my book, I recommend that all businesses should adopt a zero error policy in regards to backup. I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

Now, the problem with talking about zero error policies is that many people get excited about the wrong things when it comes to them. That is, they either focus on:

  • This will be too expensive!

or

  • Who gets into trouble when errors DO occur?

Not only are these attitudes not helpful, but they’re not necessary either.

Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

You may think that rule (2) implies rule (3), and it does, but rule (3) gives us a special case/allowance for noting that some errors are permitted, in the short term, if there is a sufficient reason.

The actual purpose of the zero error policy is to ensure that any error or abnormal report from the backup system is treated as something requiring investigation and resolution. If this sounds like a lot of work, there’s a couple of key points to make:

  • When switching from any other policy to a zero error policy, yes, there will be a settling-in period that takes more time and effort, but once the initial hurdle has been cleared there should not be a significant ongoing drain of resources;
  • Given the importance of successful backups (i.e., being able to successfully recover when required), the work that is required is not only important, but arguably both necessary and ethically required.

Let’s step through those three rules.

All errors shall be known

Recognising that there must be limits to the statement “all errors shall be known”, we take this to mean that if an error is reported it will be known about. The most simple interpretation of this is that all savegroup completion reports must be read. For the purposes of a NetWorker backup environment, any run-time backup error is going to appear in the savegroup completion report, and so reading the report and checking on a per-host basis is the most appropriate action.

There are some logical consequences of this requirement:

  1. Backup reports shall be checked.
  2. Recoveries shall be tested.
  3. An issue register shall be maintained.
  4. Backup logs shall be kept for at least the retention period of the backups they are for.

Note: By “…all savegroup completion reports must be read”, I’m not suggesting that you can’t automatically parse results – however, there’s a few rules that have to be carefully followed on this. Discussed more in my book, the key rule however is that when adopting both automated parsing and a zero error policy, one must configure the system such that any unknown output/text is treated as an error. I.e., anything not catered for at time of writing of an automated parser must be flagged as a potential error so that it is either dealt with or added to the parsing routine.
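A minimal sketch of that key rule follows. The patterns and sample lines are placeholders invented for the example, not real savegroup completion report text; a real rule set would be built from the messages actually seen in a site’s own reports.

```python
import re

# Hypothetical patterns, not taken from real NetWorker output.
KNOWN_OK = [re.compile(r"completed successfully", re.I)]
KNOWN_ERROR = [re.compile(r"\bfailed\b", re.I)]


def parse_report(lines):
    """Sort report lines into ok, error and unknown buckets."""
    result = {"ok": [], "error": [], "unknown": []}
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if any(p.search(text) for p in KNOWN_ERROR):
            result["error"].append(text)
        elif any(p.search(text) for p in KNOWN_OK):
            result["ok"].append(text)
        else:
            # Not catered for when the parser was written.
            result["unknown"].append(text)
    return result


def needs_attention(result):
    """Unknown output counts as an error until someone has looked at it."""
    return result["error"] + result["unknown"]


sample = [
    "client-a: completed successfully",
    "client-b: backup failed",
    "client-c: some message nobody has seen before",
]
report = parse_report(sample)
print(needs_attention(report))
```

Each unknown line ends in one of two ways: it is dealt with as an error, or a new pattern is added so the parser recognises it next time.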

All errors shall be resolved

Errors aren’t meant to just keep occurring. Here’s some reasonably common errors within a NetWorker environment:

  • System fails backup every night because it’s been decommissioned.
  • System fails backup every night because it’s been incorrectly configured for inclusive backups and a filesystem/saveset is no longer present.
  • File open errors on Windows systems.
  • Errors about files changing during backup on Linux/Unix systems.

There’s not a single error in the above list (and I could have made it 5x longer) that can’t be resolved. The purpose of stating “all errors shall be resolved” is to discourage administrators (either backup or individual system administrators) from leaving errors unchallenged.

Every error represents a potential threat to the backup system, in one of two distinct ways:

  1. Real errors represent a recovery threat.
  2. Spurious errors may discourage the detection of a real error.

What’s a spurious error? That’s one where the fault condition is known. E.g., “that backup fails every night because one of the systems has been turned off”. In most cases, spurious errors are going to either come down to at best a domain error (“I didn’t fix that because it’s someone else’s problem”) or at worst, laziness (“I haven’t found the <1 minute required to turn off the backup for a decommissioned system”).

Spurious errors, I believe, are actually as bad, if not worse, than the real errors. While we work to protect our systems against real errors, it’s a fact of life and systems administration that they will periodically occur. Systems change, minor bugs may surface, environmental factors may play a part, etc. The role of the backup administrator therefore is to be constantly vigilant in detecting errors, taking preventative actions where applicable, and corrective actions where necessary.

Allowing spurious errors to continually occur within a backup system is however inappropriate, and runs totally contrary to good administration practices. The key problem is that if you come to anticipate that particular backups will have failures, you become lax in your checking, and thus may skip over real errors that creep in. As an example, consider the “client fails because it has been decommissioned” scenario. In NetWorker terms, this may mean that a particular savegroup completes every day with a status of “1 client failed”. So, every day, an administrator may note that the group had 1 failed client and not bother to check the rest of the report, since that failed client is expected. But what if another administrator had decommissioned that client? What if that client is no longer in the group, but another client is now being reported as failed every day?

That’s the insidious nature of spurious errors.
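The trap in that example is checking the count of failures rather than which clients failed. A sketch of the safer comparison, using invented client names:

```python
def unexpected_failures(failed_clients, expected_failures):
    """Failed clients that nobody has accounted for."""
    return sorted(set(failed_clients) - set(expected_failures))


def stale_expectations(failed_clients, expected_failures):
    """Expected failures that did not happen, so the expectation is out of date."""
    return sorted(set(expected_failures) - set(failed_clients))


expected = {"oldhost"}      # the decommissioned client everyone is used to seeing
monday = ["oldhost"]        # report says "1 client failed"
tuesday = ["mailserver"]    # report still says "1 client failed"

print(unexpected_failures(monday, expected))   # []
print(unexpected_failures(tuesday, expected))  # ['mailserver']
print(stale_expectations(tuesday, expected))   # ['oldhost']
```

Both days show one failed client, but only the comparison by name reveals that Tuesday’s failure is a real one.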

No error shall be allowed to continue indefinitely

No system is perfect, so we do have to recognise that some errors may have a life-span greater than a single backup job. However, in order for a zero error policy to work properly, we must give time limits to any failure condition.

There are two aspects to this rule – one is the obvious, SLA style aspect, to do with how long an error is allowed to keep occurring before it is escalated and/or must be resolved. (E.g., “No system may have 3 days of consecutive backup failures”).
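A rule such as the 3 day example can be checked mechanically. This sketch assumes one result per nightly backup, recorded as a boolean; the function names and the default limit are illustrative only.

```python
def consecutive_failures(history):
    """Count the unbroken run of failures at the end of the history.

    history is a list of booleans, oldest first, where True means that
    night's backup succeeded.
    """
    streak = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        streak += 1
    return streak


def breaches_limit(history, limit=3):
    """True once the current run of failures has reached the limit."""
    return consecutive_failures(history) >= limit


print(breaches_limit([True, False, False]))         # False
print(breaches_limit([True, False, False, False]))  # True
```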

The other aspect to this rule that can be more challenging to work with is dealing with those “expected” errors. E.g., consider a situation where the database administrators are trialling upgrades to Oracle on a development server. In this case, it may be known that the development system’s database backups will fail for the next 3 days. In such instances, to correctly enable zero-error policies, one must maintain not only an issues register, but an expected issues register – that is, noting which errors are going to happen, and when they should stop happening*.
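An expected issues register can be as simple as a list of entries with an end date, so that an “expected” failure stops being excused once its window closes. The structure, client names and dates below are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExpectedIssue:
    """Hypothetical register entry for a failure that is known in advance."""
    client: str
    reason: str
    starts: date
    ends: date  # last day on which the failure is still acceptable


def is_excused(client, day, register):
    """True only if the client has a register entry covering that day."""
    return any(entry.client == client and entry.starts <= day <= entry.ends
               for entry in register)


# The development server is expected to fail for 3 days during the trial.
register = [
    ExpectedIssue("oradev01", "Oracle upgrade trial",
                  date(2012, 1, 9), date(2012, 1, 11)),
]

print(is_excused("oradev01", date(2012, 1, 10), register))   # True
print(is_excused("oradev01", date(2012, 1, 12), register))   # False
print(is_excused("oraprod01", date(2012, 1, 10), register))  # False
```

Once the window has passed, the same failure is treated like any other error, which is what keeps an expected issue from quietly becoming a permanent one.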

Summarising

Zero error policies are arguably not only a functional but ethical requirement of good backup administration. While they may take a little while to implement, and may formalise some of the work processes involved in the backup system, these should not be seen as a detriment. Indeed, I’d go so far as to suggest that you can’t actually have a backup system without a zero error policy. That is, without a zero error policy you can still get backups/recoveries, but with fewer degrees of certainty – and the more certainty you can build into a backup environment, the more it becomes a backup system.

[Ready for more? Check out the next post on this topic, Zero Error Policy Management.]


* In the example given, we could in theory use the “scheduled backup” feature of a client instance to disable backups for that particular client. However, that feature has a limitation in that there’s no allowances for automatically turning scheduled backups on again at a later date. Nevertheless, it’s a common enough scenario that it serves the purpose of the example.

© 2012 The NetWorker Blog