When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash in hand jobs, a business that doesn’t rely on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet there are still improvements that can be made to data protection attitudes within many organisations, and I wanted to outline things that are still commonly done incorrectly in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that doesn’t get claimed on is.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
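
The first of those problems, dumps finishing after the filesystem backup has already started, is easy to check for. Here is a minimal sketch, assuming you can obtain a finish time for each dump and a start time for the filesystem backup. The function name, database names and times are all hypothetical, and nothing here is a NetWorker feature.

```python
from datetime import datetime

def dumps_at_risk(dump_finish_times, fs_backup_start):
    """Return the names of dumps that finish after the filesystem backup starts.

    Such dumps may be missed, or captured half-written, by the filesystem backup.
    """
    return sorted(
        name
        for name, finished in dump_finish_times.items()
        if finished > fs_backup_start
    )

# Hypothetical schedule: filesystem backup starts at 22:00.
fs_start = datetime(2012, 1, 10, 22, 0)
dumps = {
    "sales_db": datetime(2012, 1, 10, 21, 40),  # finished in time
    "hr_db": datetime(2012, 1, 10, 22, 25),     # still running at 22:00
}
late = dumps_at_risk(dumps, fs_start)
print(late)
```

A check like this only detects the timing problem. Integrating the database backup into the backup product removes the dependency on timing altogether.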

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet, in practically no organisation will administrators of those specific systems get such free rein over the data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I’ve seen made in backup systems planning is assuming that what is bought today for backup will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
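
As a rough illustration of why today’s purchase rarely lasts, here is a minimal sketch of a compound growth projection. The growth rate and capacities are hypothetical, and a constant rate is itself a simplifying assumption. As noted above, real growth has spikes and troughs, so the inputs need to be re-measured and the projection re-run regularly.

```python
def project_capacity(current_tb, annual_growth_rate, years):
    """Capacity needed at the end of each year, assuming constant compound growth."""
    return [current_tb * (1 + annual_growth_rate) ** year
            for year in range(1, years + 1)]

def first_year_exceeding(current_tb, annual_growth_rate, purchased_tb, horizon_years=10):
    """Return the first year in which projected need exceeds what was purchased.

    Returns None if the purchase lasts for the whole planning horizon.
    """
    projection = project_capacity(current_tb, annual_growth_rate, horizon_years)
    for year, needed_tb in enumerate(projection, start=1):
        if needed_tb > purchased_tb:
            return year
    return None

# Hypothetical site: 100 TB protected today, 25% annual growth, 200 TB purchased.
print(project_capacity(100, 0.25, 4))
print(first_year_exceeding(100, 0.25, 200))
```

With these made-up figures the purchase is outgrown in the fourth year, which is exactly the sort of result that should be checked against the planned refresh cycle.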

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.
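
To make that concrete, here is a minimal sketch of a failure register: every failure is recorded, it stays outstanding until someone documents a resolution, and the resolved entries remain as a reference. The class and field names are hypothetical. A real site would more likely use its ticketing system, fed from the backup product’s own reports.

```python
from dataclasses import dataclass, field

@dataclass
class BackupFailure:
    client: str
    saveset: str
    error: str
    resolution: str = ""  # documented fix, kept for future reference

    @property
    def resolved(self):
        return bool(self.resolution)

@dataclass
class FailureRegister:
    failures: list = field(default_factory=list)

    def record(self, client, saveset, error):
        """Capture a failure; nothing is allowed to go unrecorded."""
        failure = BackupFailure(client, saveset, error)
        self.failures.append(failure)
        return failure

    def resolve(self, failure, resolution):
        """Close a failure only when the resolution has been written down."""
        if not resolution.strip():
            raise ValueError("a resolution must be documented")
        failure.resolution = resolution

    def outstanding(self):
        """Failures still awaiting resolution; the policy aims for an empty list."""
        return [f for f in self.failures if not f.resolved]

register = FailureRegister()
first = register.record("mailhost", "/var/mail", "client unreachable")
register.record("dbhost", "/dumps", "dump directory empty")
register.resolve(first, "Firewall rule restored; backup re-run successfully")
print([f.client for f in register.outstanding()])
```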

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies. Their processes without a doubt play into disaster recovery planning, and play a fairly important part, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – where staff would not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse, and in fact should point to quite the opposite requirement.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also continuance and alignment. If this isn’t the case yet, initial steps towards that shift can be achieved initially by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.

 

Obviously the NetWorker Blog gets a lot of referrals from search engines via people looking specifically for help on particular NetWorker issues they’re encountering. Even just in the last 8+ hours, here are just some of the search terms that people used:

nmc doesn’t start

restore networker aborted saveset

networker disk backup module

nsr_render_log command

nsr_render_log daemon.raw

networker centos support

39077:jbconfig: error, you must install the lus scsi passthrough driver before configuring

And the list goes on and on, on a daily basis. This was reflected in the Top 10 for 2011 (and indeed, the top 10 for every previous year, too).

I’ll let you all in on a little secret though: all of those tips, all of those NetWorker basics articles and how to use nsradmin user guides – they’re all just the tip of the iceberg when it comes to getting a working backup system in place.

You see, a lot of sites don’t have a backup system at all – they just have some backup software and backup hardware and configuration. That doesn’t represent a backup system at all. In my article, “What is a backup system?”, I provided this diagram to explain such beasts:

[Figure: Backup system]

As you can see, the technology (the backup software, hardware and configuration) represents just one entry point to having a backup system. The others though are all equally critical; and when you add them all in together, it becomes clear that a backup system will derive much of its success and reliability from the human and business factors.

The technology, you see, is the easiest part of the backup environment; and it’s also the part that’s most likely to appeal to IT people. If you were to graph how much time the average site spends on each of those activities, it would probably look like this:

[Figure: Imbalanced backup systems]

When in actual fact, it should look more like this:

[Figure: Balanced backup system]

The short description? If you chart the amount of time you spend on your backup “system”, and the Technology aspect (software, hardware, configuration) becomes a Pacman to the rest of the components, eating away at the rest of those facets, then you’ve got a cannibalistic environment that’s surviving as much on luck as it is on good design.

That’s why I bang on so much about backup theory – because all the latest and greatest technology in the world won’t help you at all if you don’t have everything else set up in conjunction with it:

  • The people involved need to know their roles, and participate in both the architecture of the environment and its ongoing operation;
  • The processes for use of the system must be well established;
  • The system must be thoroughly documented;
  • The system must be tested or you’ve got no way of establishing reliability;
  • The Service Level Agreements have to be established or else there’s no point whatsoever to what you’re doing.

Backup theory isn’t the boring part of a backup system; I’d suggest it’s actually the most interesting part of it. Just as I suggested that companies need to plan to follow some New Year’s resolutions for backup systems, I’d equally suggest that the people involved in backups should start making it their goal to spend a balanced amount of time on the components that form a backup system.

If you don’t have the theory, you actually don’t have a system.

If you want to know more, you should treat yourself to my book (now available in Kindle format).

 

There’s a pertinent adage in cooking when it comes to using wine in recipes:

If you wouldn’t drink it, don’t cook with it.

It’s simple: if you don’t like the taste of it in a glass, what makes you think you’ll like the taste of food you’ve added it to?

There are two similar rules for backup, and they’re particularly important when it comes time to do those periodic hardware refreshes in your environment:

If it’s not good enough to run production, don’t use it for DR.

If it’s not good enough to run production, don’t use it for backup.

The way in which both of these come into play is quite simple:

  1. If it’s not good enough to run production, don’t use it for DR. I’ve seen companies have a hardware refresh cycle of “move production equipment to DR, buy new production equipment”. However, invariably that equipment is being pulled out of production because it’s either lacking in capacity, or lacking in performance. That equipment is then going to be replaced with new equipment with planned usage time of (typically) 2-3 years. So let’s assume you get a year down the track – your in-use storage capacity has gone up, your processing load has increased, then there’s a major production fault and you have to failover to DR. At which point, you’re trying to run your production environment on something that was sized to max out 12 months ago. Chances of it adequately running production? Minimal.
  2. If it’s not good enough to run production, don’t use it for backup. Another common mistake is a situation whereby, say, a storage array is pulled out of production and replaced with a new, faster array with more capacity. People invariably hate to see things go to waste, so someone suggests “let’s use the old array as {backup to disk | VTL | etc}”. Again, it sounds simple enough on the face of it, except the equipment was either lacking in performance, or lacking in capacity. If it was lacking in performance, you’re putting it into a situation where you’re going to be copying from something that was purchased, from the outset, to be significantly faster than it. It’s similar with capacity – you’re going to be trying to back up a very large bucket to a much smaller bucket.

Whether your company likes the idea of it or not, backup and disaster recovery are not areas that should be assigned “hand me downs” by the rest of the business. They require their own capital budget, and planning that allows for the following two factors:

  1. Performance should at least match the throughput on offer from production;
  2. It should exceed your production capacity.

If either of these conditions is not met, your strategy is insufficient.
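
The two rules can be written down as a simple check. This is a minimal sketch: the function name, units and figures are hypothetical, and it assumes you have comparable throughput and capacity measurements for production and for the proposed DR or backup target.

```python
def sizing_problems(prod_throughput_mb_s, prod_capacity_tb,
                    target_throughput_mb_s, target_capacity_tb):
    """Return the reasons a DR/backup target fails the two rules (empty if it passes)."""
    problems = []
    # Rule 1: performance should at least match production throughput.
    if target_throughput_mb_s < prod_throughput_mb_s:
        problems.append("throughput below production")
    # Rule 2: capacity should exceed production capacity.
    if target_capacity_tb <= prod_capacity_tb:
        problems.append("capacity does not exceed production")
    return problems

# Hypothetical figures: an old array handed down after a production refresh.
old_array = sizing_problems(prod_throughput_mb_s=800, prod_capacity_tb=60,
                            target_throughput_mb_s=400, target_capacity_tb=40)
print(old_array)
```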

 

Every backup you do has a half-life, which isn’t the retention period of the backup. Now, if you’re new to NetWorker, don’t go looking for a half life setting for clients or savesets or groups; I’m referring to a concept here rather than a literal configuration option.

In most environments (that is, environments where the backup system is not being used for archive or HSM), a backup is most likely to be used within a short period of it being generated. That highest-probability period of usage is what I would suggest should be considered the half-life of the backup. Like regular notions of half-life, it’s not just a one-off measurement, but one that can continue to be applied throughout the lifespan of the backup.

I.e., through each successive half-life iteration, the likelihood of the backup being recalled for recovery halves again. Unlike regular half-life considerations though, the potency – or the importance – of the backup remains the same regardless of its half-life state. That is, a backup you don’t recover from until nearly the end of its life is still likely to be just as important as a backup you recover from 30 minutes after it was completed.
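
The halving can be sketched numerically. This assumes a smooth exponential decay, which is my simplification of the idea rather than something you can measure precisely, and the 14-day half-life is a hypothetical figure.

```python
def relative_recovery_likelihood(age_days, half_life_days):
    """Likelihood of a recovery request relative to a brand-new backup.

    Halves with every half-life that passes. This says nothing about the
    importance of the backup, which stays the same throughout its life.
    """
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (age_days / half_life_days)

# Hypothetical 14-day half-life, checked at four successive half-lives.
for age in (0, 14, 28, 42):
    print(age, relative_recovery_likelihood(age, 14))
```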

In normal circumstances though, what the half-life of a backup affects is the urgency of a recovery request for that backup. This, in turn, reflects the way in which your backup environment needs to facilitate recoveries. As a backup passes through successive half-lives, you can typically take longer to perform the recovery, but at the other end of the spectrum, when the backup is recent, a recovery request will similarly expect a rapid response.

You effectively design the backup system to suit the half-life of your backups. If your backups are most likely to be used for recovery within the first two weeks of their generation, then you need to ensure that those backups are your fastest to recover from. From an architecture point of view, this would typically mean storage decisions such as ensuring that at least 2 weeks worth of backups are on disk – either as VTL backups or ADV_FILE type backups. Over time you can move backups out to slower media – making room for new, incoming backups, and keeping old backups recoverable at an appropriate level of cost effectiveness for the likely urgency of a recovery request.

For the most part, we’d normally only need to consider 4 levels of half-life for backups before we hit a level of such diminishing urgency that it becomes a bit like the high availability problem (i.e., the jump from 99.99% availability to 99.999% availability is a far more expensive proposition than the jump from 99.9% availability to 99.99% availability, etc).

These levels would be:

  • Online – For backups that have the highest recovery priority, you’ll likely use a combination of backup and snapshot software. Your “online” backups are snapshots that can be instantly retrieved from.
  • Nearline – For backups that have been recently done, you’ll want to keep them almost-immediately accessible; in a disk backup realm this means within a VTL or on ADV_FILE – in a tape only realm you’d be ensuring these are still within your tape library.
  • Offline – For backups that were done “a while ago”, you’ll want to keep them locally available for recovery purposes but not necessarily hogging more expensive backup space. In a backup to disk/VTL environment, this would either mean staging to physical tape and keeping within a tape library, or keeping on-site in a media vault. For a tape-only environment, it refers to keeping the media on-site in the media vault.
  • Offsite – For backups that have been done “some time ago”, they can typically be kept off-site with a records retention company, or in disaster recovery storage, etc.
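
A minimal sketch of how those four levels might map onto backup age follows. The day boundaries are entirely hypothetical (only the two weeks on disk echoes the earlier example) and should come from your own recovery patterns rather than from this table.

```python
# Hypothetical age boundaries in days; anything older than the last one goes off-site.
TIER_BOUNDARIES = [
    (1, "online"),     # snapshots, instantly retrievable
    (14, "nearline"),  # VTL / ADV_FILE, or still in the tape library
    (90, "offline"),   # on-site, staged to tape or in the media vault
]

def tier_for_age(age_days):
    """Return the storage tier a backup of the given age would sit in."""
    for max_age_days, tier in TIER_BOUNDARIES:
        if age_days <= max_age_days:
            return tier
    return "offsite"

for age in (0, 10, 60, 400):
    print(age, tier_for_age(age))
```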

(Note that in all of this I’m not talking about clones – copies of your backups – you need them regardless of the half-life of your backup, so I’m taking them as a given at each stage of the process. For obvious reasons, clones and originals should never be in the same location except when they’re being purged.)

There’s another way we talk about half-lives in backups – RTO (recovery time objective) and RPO (recovery point objective). However, RTO and RPO frequently intimidate the business. If you’re struggling to get the business to focus on RTOs and RPOs, start with the more readily understandable term of backup half-life and see how you go.

 

Are your service level agreements and your backup software support contracts in alignment?

A lot of companies will make the decision to run with “business hours” backup support – 9 to 5, or some variant like that, Monday to Friday. This is seen as a cheaper option, and for some companies, depending on their requirements, it can be a perfectly acceptable arrangement too. That’s usually the case where there are no SLAs, or smaller environments where the business is geared to being able to operate for protracted periods with minimal IT.

What can sometimes be forgotten in attempts to restrain budgets is whether reduced support for production support systems has any impact on meeting business requirements relating to service level agreements. If, for instance, you have to start getting data flowing back within 2 hours of a failure, a system fails at midnight and the subsequent recovery has issues, your chances of being able to hit your service level agreement start to plummet if you don’t have a support contract that guarantees you access to help at this point in time.
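
That midnight scenario can be put into numbers. Here is a minimal sketch, assuming a hypothetical 9 to 5, Monday to Friday support contract and ignoring public holidays and time zones: it works out how long you would wait for support to open, which can then be compared with the recovery time the business has asked for.

```python
from datetime import datetime, timedelta

def next_support_start(moment, open_hour=9, close_hour=17):
    """Return when business-hours support (Mon-Fri) is next reachable from `moment`."""
    candidate = moment
    while True:
        is_weekday = candidate.weekday() < 5  # Monday is 0, Sunday is 6
        if is_weekday and open_hour <= candidate.hour < close_hour:
            return candidate  # support is open right now
        if is_weekday and candidate.hour < open_hour:
            return candidate.replace(hour=open_hour, minute=0, second=0, microsecond=0)
        # After hours or on a weekend: move to the start of the next day and retry.
        candidate = (candidate + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0)

def wait_for_support(failure_time):
    """How long after a failure before support can even be contacted."""
    return next_support_start(failure_time) - failure_time

rto = timedelta(hours=2)
midnight_failure = datetime(2012, 1, 11, 0, 0)  # a Wednesday
wait = wait_for_support(midnight_failure)
print(wait, "exceeds RTO" if wait > rto else "within RTO")
```

If the wait alone is longer than the recovery time objective, the support contract and the SLA are not in alignment, whatever the budget says.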

A common response to this from management – be it IT, or financial – is “we’ll buy per-incident support if we need to”. In other words, the service level agreements the business has established necessitate a better support contract than is budgeted for, so it is ‘officially’ planned to “wing it” in the event of a serious issue.

I describe that as an Icarus Support Contract.

Icarus, as you may remember, is from Greek mythology. His father Daedalus fashioned wings out of feathers and wax so that he and Icarus could escape from prison. They escaped, but Icarus, enjoying the sensation of flight so much, disregarded his father’s warnings about flying too high. The higher he got, the closer he was to the sun. Then, eventually, the sun melted the wax, his wings fell off, and he fell to his death into the sea.

Planning to buy per-incident support is effectively building a contingency plan based on unbooked, unallocated resources.

It’s also about as safe as relying on wings held together by wax when flying high. Sure, if you’re lucky, you’ll sneak through it; but do you really want to trust data recovery and SLAs to luck? What if those unbooked resources are already working on something for someone who does have a 24×7 contract? There’s a creek for that – and a paddle too.

In a previous job, I once discussed disaster recovery preparedness with an IT manager at a financial institution. Their primary site and their DR site were approximately 150 metres away from one another, leaving them with very little wiggle room in the event of a major catastrophe in the city. (Remember, the site being inaccessible can be just as deadly to business as the site being destroyed – and while there are far fewer things that may destroy two city blocks, there are plenty more things that might cut off two city blocks from human access for days.)

When questioned about the proximity of the two sites, he wasn’t concerned. Why? They were a big financial institution, they had emergency budget, and they were a valued customer of a particular server/storage manufacturer. Quite simply, if something happened and they lost both sites, they’d just go and buy or rent a truckload of new equipment and get themselves back operational again via backups. I always found this a somewhat dubious preparedness strategy – it’s definitely an example of an Icarus support contract.

I’ve since talked to account managers at multiple server/storage vendors, including the one used in this scenario, and all of them, in this era of shortened inventory streams, have scoffed at the notion of being able to instantly drop in 200+ servers and appropriate storage at the drop of a hat – especially in a situation where there’s a disaster and there’s a run on such equipment. (In Australia for instance, a lot of high end storage kit usually takes 3-6 weeks to arrive since it’s normally shipped in from overseas.)

Icarus was a naïve fool who got lost in the excitement of the moment. The fable of Icarus teaches us the perils of ignoring danger and enjoying the short-term too much. In this case, relying on future unbooked resources in the event of an issue in order to save a few dollars here and there in the now isn’t all that reliable. It’s like the age-old tape cost-cutting: if you manage to shave 10% off the backup media budget by deciding not to back up certain files or certain machines, you may very well get thanked for it. However, no-one will remember congratulating you when there’s butt-kicking to be done if it turns out that data no longer being backed up actually needed recovery.

So what is an Icarus support contract? Well, it’s a contract where you rely on luck. It’s a gamble – that in the event of a serious problem, you can buy immediate assistance at the drop of a hat. Just how bad can planning on being lucky get? Well, consider that over the last 18 months the entire world has been dealing with Icarus financial contracts – they were officially called Sub-Prime Mortgages, but the net result was the same – they were contracts and financial agreements built around the principle of luck.

Do your business a favor, and avoid Icarus support contracts. That’s the real way to get lucky in business – to not factor luck into your equations.

 

Perhaps one of the most common mistakes that companies can make is to focus on their backup window. You might say this is akin to putting the cart before the horse. While the backup window is important, in a well designed backup system, it’s actually only of tertiary importance.

Here’s the actual order of importance in a backup environment:

  1. Recovery performance.
  2. Cloning (duplication) performance.
  3. Backup performance.

That is, the system must be designed to:

  1. First ensure that all data can be recovered within the required timeframes,
  2. Second ensure that all data that needs to be cloned is cloned within a suitable timeframe to allow off-siting,
  3. Third ensure that all data is backed up within the required backup window.

Obviously for environments with well considered backup windows (i.e., good reasons for the backup window requirements), the backup window should be met – there’s no question about that. However, meeting the backup window should not be done at the expense of impacting either the cloning window or the recovery window.

Here’s a case in point: block level backups of dense filesystems often allow for much smaller backup windows – however, due to the way that individual files are reconstructed (read from media, reconstruct in cache, copy back to filesystem), they do this at the expense of required recovery times. (This also goes to the heart of what I keep telling people about backup: test, test, test.)

The focus on the recovery performance in particular is the best possible way (logically, procedurally, best practices – however you want to consider it) to drive the entire backup system architecture. It shouldn’t be a case of how many TB per hour you want to backup, but rather, how many TB per hour you need to recover. Design the system to meet recovery performance requirements and backup will naturally follow*.

If your focus has up until now been the backup window, I suggest you zoom out so you can see the bigger picture.


* I’ll add that for the most part, your recovery performance requirements shouldn’t be “x TB per hour” or anything so arbitrary. Instead, they should be decided by your system maps and your SLAs, and should focus on business requirements – e.g., a much more valid recovery metric is “the eCommerce system must be recovered within 2 hours” (that would then refer to all dependencies that provide service to and access for the eCommerce system).
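
To show how a business requirement turns into a design figure, here is a minimal sketch that derives the recovery throughput each system needs from its SLA. The system names, data sizes and recovery times are hypothetical. Each data size should include the system’s dependencies, and the sketch assumes only one system is being recovered at a time.

```python
def required_recovery_rate_tb_per_hour(data_tb, rto_hours):
    """Throughput needed to restore `data_tb` within the recovery time objective."""
    if rto_hours <= 0:
        raise ValueError("rto_hours must be positive")
    return data_tb / rto_hours

# Hypothetical SLAs: (data to restore in TB, including dependencies; RTO in hours).
systems = {
    "ecommerce": (4.0, 2.0),
    "file_server": (12.0, 24.0),
}

rates = {name: required_recovery_rate_tb_per_hour(tb, rto)
         for name, (tb, rto) in systems.items()}

# The most demanding SLA sets the recovery performance the design must deliver.
design_target = max(rates.values())
print(rates, design_target)
```

Note that the largest system is not necessarily the one that drives the design. In this made-up example the smaller eCommerce system, with its tighter SLA, sets the target.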

© 2012 The NetWorker Blog