Obviously the NetWorker Blog gets a lot of referrals from search engines via people looking specifically for help on particular NetWorker issues they’re encountering. Even just in the last 8+ hours, here are just some of the search terms that people used:

nmc doesn’t start

restore networker aborted saveset

networker disk backup module

nsr_render_log command

nsr_render_log daemon.raw

networker centos support

39077:jbconfig: error, you must install the lus scsi passthrough driver before configuring

And the list goes on and on, on a daily basis. This was reflected in the Top 10 for 2011 (and indeed, the top 10 for every previous year, too).

I’ll let you all in on a little secret though: all of those tips, all of those NetWorker basics articles and how to use nsradmin user guides – they’re all just the tip of the iceberg when it comes to getting a working backup system in place.

You see, a lot of sites don’t have a backup system at all – they just have some backup software and backup hardware and configuration. That doesn’t represent a backup system at all. From my article, “What is a backup system?“, I provided this diagram to explain such beasts:

Backup system

As you can see, the technology (the backup software, hardware and configuration) represents just one entry point to having a backup system. The others though are all equally critical; and when you add them all in together, it becomes clear that a backup system will derive much of its success and reliability from the human and business factors.

The technology, you see, is the easiest part of the backup environment; and it’s also the part that’s most likely to appeal to IT people. If you were to graph how much time the average site spends on each of those activities, it would probably look like this:

Imbalanced backup systemsWhen in actual fact, it should look more like this:

Balanced backup system

The short description? If you chart the amount of time you spend on your backup “system”, and the the Technology aspect (software, hardware, configuration) becomes a Pacman to the rest of the components, eating away at the rest of those facets, then you’ve got a cannibalistic environment that’s surviving as much as anything on luck/good fortune as it is on good design.

That’s why I bang on so much about backup theory – because all the latest and greatest technology in the world won’t help you at all if you don’t have everything else set up in conjunction with it:

  • The people involved need to know their roles, and participate in both the architecture of the environment and its ongoing operation;
  • The processes for use of the system must be well established;
  • The system must be thoroughly documented;
  • The system must be tested or you’ve got no way of establishing reliability;
  • The Service Level Agreements have to be established or else there’s no point whatsoever to what you’re doing.

Backup theory isn’t the boring part of a backup system; I’d suggest it’s actually the most interesting part of it. Just as I suggested that companies need to plan to follow some new years resolutions for backup systems, I’d equally suggest that the people involved in backups should start making it their goal to spend a balanced amount of time on the components that form a backup system.

If you don’t have the theory, you actually don’t have a system.

If you want to know more, you should treat yourself to my book (now available in Kindle format).

 

There’s a pertinent adage in cooking when it comes to using wine in recipes:

If you wouldn’t drink it, don’t cook with it.

It’s simple: if you don’t like the taste of it in a glass, what makes you think you’ll like the taste of food you’ve added it to?

There are two similar rules for backup, and they’re particularly important when it comes time to do those periodic hardware refreshes in your environment:

If it’s not good enough to run production, don’t use it for DR.

If it’s not good enough to run production, don’t use it for backup.

The way in which both of these come into play is quite simple:

  1. If it’s not good enough to run production, don’t use it for DR. I’ve seen companies have a hardware refresh cycle of “move production equipment to DR, buy new production equipment”. However, invariably that equipment is being pulled out of production because it’s either lacking in capacity, or lacking in performance. That equipment is then going to be replaced with new equipment with planned usage time of (typically) 2-3 years. So let’s assume you get a year down the track – your in-use storage capacity has gone up, your processing load has increased, then there’s a major production fault and you have to failover to DR. At which point, you’re trying to run your production environment on something that was sized to max out 12 months ago. Chances of it adequately running production? Minimal.
  2. If it’s not good enough to run production, don’t use it for backup. Another common mistake is a situation whereby say, a storage array is pulled out of production and replaced with a new, faster array with more capacity. People invariably hate to see things go to waste, so someone suggests “let’s use the old array as {backup to disk | VTL | etc}”. Again, sounds simple enough on the face of it, except the equipment was either lacking in performance, or lacking in capacity. If it was lacking in performance, you’re putting it into a situation where you’re going to be copying off something that is purchased, on the outset, to be significantly faster than it. It’s similar with capacity – you’re going to be trying to backup a very large bucket to a much smaller bucket.

Whether your company likes the idea of it or not, backup and disaster recovery are not areas that should be assigned “hand me downs” by the rest of the business. They require their own capital budget, and a planning that allows for the following two factors:

  1. Performance should at least match the throughput on offer from production;
  2. It should exceed your production capacity.

If either of these conditions are not met, your strategy is insufficient.

 

Every backup you do has a half-life, which isn’t the retention period of the backup. Now, if you’re new to NetWorker, don’t go looking for a half life setting for clients or savesets or groups; I’m referring to a concept here rather than a literal configuration option.

In most environments (in environments where the backup system is not being used for archive or HSM), a backup is most likely to be used within a short period of it being generated. That highest-probability period of usage is what I would suggest should be considered the half-life of the backup. Like regular notions of half-life, it’s not just a one-off measurement, but one that can be continued to applied throughout the lifespan of the backup.

I.e., through each successive half-life iteration, the likelihood of the backup being recalled for recovery halves again. Unlike regular half-life considerations though, the potency – or the importance – of the backup remains the same regardless of its half-life state. That is, a backup you don’t recover from until nearly the end of its life is still likely to be just as important as a backup you recover from 30 minutes after it was completed.

In normal circumstances though, what the half-life of a backup affects is the urgency of a recovery request for that backup. This, in turn, reflects the way in which your backup environment needs to facilitate recoveries. As the half-life of the backup continues to decrease, you can typically take longer to perform the recovery, but at the other end of the spectrum when the backup is quick, a recovery request will similarly expect a rapid response.

You effectively design the backup system to suit the half-life of your backups. If your backups are most likely to be used for recovery within the first two weeks of their generation, then you need to ensure that those backups are your fastest to recover from. From an architecture point of view, this would typically mean storage decisions such as ensuring that at least 2 weeks worth of backups are on disk – either as VTL backups or ADV_FILE type backups. Over time you can move backups out to slower media – making room for new, incoming backups, and keeping old backups recoverable at an appropriate level of cost effectiveness for the likely urgency of a recovery request.

For the most part, we’d normally only need to consider 4 levels of half-life for backups before we hit a level of such diminishing urgency that it becomes a bit like the high availability problem (i.e., the jump from 99.99% availability to 99.999% availability is a far more expensive proposition than the jump from 99.9% availability to 99.99% availability, etc).

These levels would be:

  • Online – For backups that have the highest recovery priority, you’ll likely use a combination of backup and snapshot software. Your “online” backups are snapshots that can be instantly retrieved from.
  • Nearline – For backups that have been recently done, you’ll want to keep them almost-immediately accessible; in a disk backup realm this means within a VTL or on ADV_FILE – in a tape only realm you’d be ensuring these are still within your tape library.
  • Offline – For backups that were done “a while ago”, you’ll want to keep them locally available for recovery purposes but not necessarily hogging more expensive backup space. In a backup to disk/VTL environment, this would either mean staging to physical tape and keeping within a tape library, or keeping on-site in a media vault. For a tape-only environment, it refers to keeping the media on-site in the media vault.
  • Offsite – For backups that have been done “some time ago”, they can typically be kept off-site with a records retention company, or in disaster recovery storage, etc.

(Note that in all of this I’m not talking about clones – copies of your backups – you need them regardless of the half-life of your backup, so I’m taking them as a given at each stage of the process. For obvious reasons, clones and originals should never be in the same location except when they’re being purged.)

There’s another way we talk about half-lives in backups – RTO (recovery time objective) and RPO (recovery point objective). However, RTO and RPO frequently intimidates business. If you’re struggling to get the business to focus on RTOs and RPOs, start with the more readily understandable term of backup half-life and see how you go.

 

Are your service level agreements and your backup software support contracts in alignment?

A lot of companies will make the decision to run with “business hours” backup support – 9 to 5, or some variant like that, Monday to Friday. This is seen as a cheaper option, and for some companies, depending on their requirements, it can be a perfectly acceptable arrangement too. That’s usually the case where there are no SLAs, or smaller environments where the business is geared to being able to operate for protracted periods with minimal IT.

What can sometimes be forgotten in attempts to restrain budgets is whether reduced support for production support systems has any impact on meeting business requirements relating to service level agreements. If for instance, you have to start getting data flowing back within 2 hours of a failure, a system fails at midnight and the subsequent recovery has issues, your chances of being able to hit your service level agreement start to plummet if you don’t have a support contract that guarantees you access to help at this point in time.

A common response to this from management – be it IT, or financial – is “we’ll buy per-incident support if we need to“. In other words, the service level agreements the business has established necessitates a better support contract than is budgeted for, so it is ‘officially’ planned to “wing it” in the event of a serious issue.

I describe that as an Icarus Support Contract.

Icarus, as you may remember, is from Greek mythology. His father Daedalus fashioned wings out of feathers and wax so that he and Icarus could escape from prison. They escaped, but Icarus, enjoying the sensation of flight so much, disregarded his father’s warnings about flying too high. The higher he got, the closer he was to the sun. Then, eventually, the sun melted the wax, his wings fell off, and he fell to his death into the sea.

Planning to buy per-incident support is effectively building a contingency plan based on unbooked, unallocated resources.

It’s also about as safe as relying on wings held together by wax when flying high. Sure, if you’re lucky, you’ll sneak through it; but is do you really want to trust data recovery and SLAs to luck? What if those unbooked resources are already working on something for someone who does have a 24×7 contract? There’s a creek for that – and a paddle too.

In a previous job, I once discussed disaster recovery preparedness with an IT manager at a financial institution. Their primary site and their DR site were approximately 150 metres away from one other, leaving them with very little wiggle room in the event of a major catastrophe in the city. (Remember, the site being inaccessible can be just as deadly to business as the site being destroyed – and while there’s a lot less things that may destroy two city blocks, there’s plenty more things that might cut off two city blocks from human access for days.)

When questioned about the proximity of the two sites, he wasn’t concerned. Why? They were a big financial institution, they had emergency budget, and they were a valued customer of a particular server/storage manufacturer. Quite simply, if something happened and they lost both sites, they’d just go and buy or rent a truckload of new equipment and get themselves back operational again via backups. I always found this a somewhat dubious preparedness strategy – it’s definitely an example of an Icarus support contract.

I’ve since talked to account managers at multiple server/storage vendors, including the one used in this scenario, and all of them, in this era of shortened inventory streams, have scoffed at the notion of being able to instantly drop in 200+ servers and appropriate storage at the drop of a hat – especially in a situation where there’s a disaster and there’s a run on such equipment. (In Australia for instance, a lot of high end storage kit usually takes 3-6 weeks to arrive since it’s normally shipped in from overseas.)

Icarus was a naïve fool who got lost in the excitement of the moment. The fable of Icarus teaches us the perils of ignoring danger and enjoying the short-term too much. In this case, relying on future unbooked resources in the event of an issue in order to save a few dollars here and there in the now isn’t all that reliable. It’s like the age-old tape cost-cutting: if you manage to shave 10% off the backup media budget by deciding not to backup certain files or certain machines, you may very well get thanked for it. However, no-one will remember congratulating you when there’s butt-kicking to be done if it turns out that data no longer being backed up actually needed recovery.

So what is an Icarus support contract? Well, it’s a contract where you rely on luck. It’s a gamble – that in the event of a serious problem, you can buy immediate assistance at the drop of a hat. Just how bad can planning on being lucky get? Well, consider that over the last 18 months the entire world has been dealing with Icarus financial contracts – they were officially called Sub-Prime Mortgages, but the net result was the same – they were contracts and financial agreements built around the principle of luck.

Do your business a favor, and avoid Icarus support contracts. That’s the real way to get lucky in business – to not factor luck into your equations.

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha