When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to back up”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash-in-hand jobs, a business that doesn’t rely on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet there are still improvements that can be made to data protection attitudes within many organisations, and I wanted to outline the things that are still commonly done incorrectly in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that doesn’t get claimed on is.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.
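As a rough illustration of what "aligned to business requirements" can look like once it is written down, here is a minimal Python sketch. The `ProtectionPolicy` class, its field names and the two check functions are all hypothetical, invented for this example rather than taken from any backup product; the point is only that an agreed RPO and RTO turn into something you can test against.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProtectionPolicy:
    """Hypothetical record of what the business agreed to for one system."""
    system: str
    rpo: timedelta        # maximum tolerable data loss
    rto: timedelta        # maximum tolerable time to restore service
    retention: timedelta  # how long backups must be kept


def rpo_breached(policy: ProtectionPolicy,
                 last_good_backup: datetime,
                 now: datetime) -> bool:
    """True if the newest recoverable copy is older than the RPO allows."""
    return now - last_good_backup > policy.rpo


def rto_breached(policy: ProtectionPolicy,
                 measured_restore_time: timedelta) -> bool:
    """True if a tested restore took longer than the RTO allows."""
    return measured_restore_time > policy.rto
```

A policy like this only has value if the numbers come from the system owners; the code simply makes a missed agreement visible.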

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
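The first problem in the list above, a dump that is still being written when the filesystem backup starts, is easy to test for when the two jobs are scheduled separately. The sketch below is a hypothetical check, not part of any backup product; it assumes you can obtain the dump's completion time and the filesystem backup's start time from your own job logs.

```python
from datetime import datetime
from typing import Optional


def dump_safe_to_back_up(dump_finished: Optional[datetime],
                         filesystem_backup_started: datetime) -> bool:
    """True only if the database dump completed before the filesystem backup began.

    dump_finished is None when the dump failed or is still running.
    """
    return (dump_finished is not None
            and dump_finished <= filesystem_backup_started)
```

A check like this only detects the problem after the fact. Integrating the database into the backup strategy removes the timing dependency altogether.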

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet, in practically no organisation will administrators of those specific systems get such free rein over the data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I’ve seen made in backup systems planning is assuming that what is bought today for backup will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
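To make the reassessment point concrete, here is a small Python sketch that projects when backup capacity runs out under compound monthly growth. The function name, parameters and the figures in the usage note are assumptions for illustration only. The value of something like this lies in re-running it whenever the measured growth rate changes, not in trusting a single projection.

```python
from typing import Optional


def months_until_full(used_tb: float,
                      capacity_tb: float,
                      monthly_growth_rate: float,
                      horizon_months: int = 120) -> Optional[int]:
    """Months of compound growth before usage exceeds capacity.

    Returns 0 if capacity is already exceeded, or None if it is not
    exceeded within horizon_months.
    """
    if used_tb > capacity_tb:
        return 0
    for month in range(1, horizon_months + 1):
        used_tb *= 1 + monthly_growth_rate
        if used_tb > capacity_tb:
            return month
    return None
```

For example, 50 TB used out of 100 TB at 3% growth per month is exhausted in 24 months, which is well inside a 3-5 year purchasing horizon.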

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.
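One way to picture this is as a register in which every failure stays open until somebody records how it was resolved. The Python sketch below is a hypothetical illustration; the `BackupFailure` record and its fields are invented here and do not describe any particular product's reporting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupFailure:
    """Hypothetical entry in a failure register."""
    client: str
    error: str
    resolution: Optional[str] = None  # what fixed it, kept for future reference


def unresolved(failures: List[BackupFailure]) -> List[BackupFailure]:
    """Failures nobody has yet investigated and documented."""
    return [f for f in failures if not f.resolution]
```

Under a zero error policy, the list returned by `unresolved` should be empty at the end of each review cycle, or each remaining entry should have an owner.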

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies. Their processes without a doubt play into disaster recovery planning, and form a fairly important part of it, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – where staff would not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse; if anything, it makes having these policies more of a requirement, not less.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.
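As a sketch of how the checks listed above might be automated, the Python below flags cartridges that are overdue for verification or old enough to migrate. Everything in it is an assumption made for illustration: the `Cartridge` fields, the five-year migration threshold and the yearly verification interval are placeholders, and real values should come from your media vendor's guidance and your own policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Cartridge:
    """Hypothetical media record; batch allows tracking of a bad production run."""
    barcode: str
    batch: str
    first_used: date
    last_verified: date


def actions_due(cartridge: Cartridge,
                today: date,
                max_age: timedelta = timedelta(days=5 * 365),
                verify_every: timedelta = timedelta(days=365)) -> List[str]:
    """Return the ageing actions this cartridge is overdue for."""
    actions = []
    if today - cartridge.last_verified > verify_every:
        actions.append("verify")
    if today - cartridge.first_used > max_age:
        actions.append("migrate")
    return actions
```

Run against the whole media inventory on a schedule, a check like this keeps the policy working even after the administrator who wrote it has moved on.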

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also continuance and alignment. If this isn’t the case yet, the first steps towards that shift can be taken by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

A is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
B is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
C is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
D is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
E is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
F is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
G is for Growth. Your backup environment should be scoped to handle at least 2 years' growth upon implementation. If it isn't, budgets haven't been established correctly.
H is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
I is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
J is for Jekyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jekyll might – not storm through them like Mr Hyde, flying apart.
K is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
L is for Logs. Treasure your logs. Don't throw them away too quickly, and make sure they're backed up too. With access to your logs, you can answer in 3 years' time why a backup from yesterday is proving problematic to recover from.
M is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
N is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
O is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
P is for Procedures; without them, you don't have a backup system at all.
Q is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
R is for Recovery, the most important facet of data protection.
S is for SLAs (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
T is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
U is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
V is for VTL. Whether you like the need for them or not, they're not going away any time soon.
W is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet your recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
X is for X-Ray. If you can't X-Ray your backup status, drill down and see what happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
Y is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
Z is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short nor succinct, yet hopefully useful nonetheless.

© 2012 The NetWorker Blog