10 Things Still Wrong with Data Protection Attitudes

Mar 07, 2012
 

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to back up”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash-in-hand jobs, a business that doesn’t rely on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet, there’s improvements that can be made to data protection attitudes within many organisations, and I wanted to outline things that can still be done incorrectly within organisations in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. But that’s no more a waste of money than an insurance policy that never gets claimed on.

However, what is a waste of money much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include the following (a sketch of capturing these requirements per system appears after the list):

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.
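
One way to make these requirements concrete is to record them per system, negotiated with the system owner rather than assumed by IT. Below is a minimal sketch of such a record – the field names, system names and values are purely illustrative assumptions, not any particular product’s schema.

```python
from dataclasses import dataclass
from datetime import time, timedelta

@dataclass
class ProtectionPolicy:
    """Backup requirements agreed with the business, not assumed by IT."""
    system: str
    backup_window_start: time   # negotiated with system owners and application administrators
    rto: timedelta              # Recovery Time Objective from the SLA
    rpo: timedelta              # Recovery Point Objective from the SLA
    retention: timedelta        # driven by business practice and legal/audit requirements

# Illustrative example only: a finance database with a 7-year retention requirement.
finance_db = ProtectionPolicy(
    system="finance-db",
    backup_window_start=time(22, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
    retention=timedelta(days=7 * 365),
)
```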

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product (see the sequencing sketch after this list);
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
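
As a minimal sketch of the first point, sequencing can be enforced by chaining the dump and the filesystem backup, aborting if the dump fails. The pg_dump invocation is just a stand-in for whatever your DBAs actually use, and trigger-filesystem-backup is a hypothetical command – substitute your backup product’s client-initiated backup mechanism.

```python
import subprocess
import sys

def run(cmd: list[str]) -> None:
    """Run one step in the chain; abort everything if it fails, so a bad
    database dump is never silently swept up by the filesystem backup."""
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"step failed: {' '.join(cmd)} - aborting backup chain")

# Step 1: the database dump must complete first...
run(["pg_dump", "--file=/backup/dumps/appdb.sql", "appdb"])
# Step 2: ...and only then is the dump area backed up (hypothetical command).
run(["trigger-filesystem-backup", "/backup/dumps"])
```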

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet in practically no organisation will administrators of those specific systems get such free rein over the data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I’ve seen made in backup systems planning is assuming that what is bought today for backup will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not safe, after all, to assume a linear growth pattern will remain accurate; there will be spikes and troughs caused by new projects or business initiatives, and by the decommissioning of systems.
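
To illustrate, here’s a toy compound-growth projection. The 25% annual growth rate and starting size are placeholder assumptions; real forecasts should come from measured history and, as noted above, be reassessed regularly rather than trusted as linear.

```python
def forecast_capacity(current_tb: float, annual_growth: float, years: int) -> list[float]:
    """Project front-end data size, compounding yearly."""
    sizes = [current_tb]
    for _ in range(years):
        sizes.append(sizes[-1] * (1 + annual_growth))
    return sizes

# 50 TB today, growing at an assumed 25% per year.
for year, tb in enumerate(forecast_capacity(50.0, 0.25, 5)):
    print(f"Year {year}: {tb:,.1f} TB of front-end data")
```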

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.

Backups seen as a substitute for Disaster Recovery

Backups are not, in themselves, a disaster recovery strategy; backup processes do, without a doubt, play into disaster recovery planning – and a fairly important part of it, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even account for half of what’s involved.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting; in reality, the two issues simply have a degree of overlap.

As this problem is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – forms where staff not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse, and in fact should point to a requirement quite the opposite.

Failing to track media ageing is akin to deciding never to service your car. For a while, you’ll get away with it. As time goes on, you’ll run into bigger and bigger problems, until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a recipe for headaches, for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the business as a whole – not only for budgetary reasons, but also for continuance and alignment. If this isn’t the case yet, initial steps towards that shift can be made by establishing an information protection advisory council within the business – a grouping of IT staff and core business staff.

Check-in – New Year’s Resolutions

Jan 31, 2012
 

Resolutions Check-in

In December last year I posted “7 New Year’s Backup Resolutions for Companies”. Since it’s now the end of January 2012, I thought I’d check in on those resolutions and suggest where a company should be up to on them, as well as offer some next steps.

  1. Testing – The first resolution related to ensuring backups are tested. By now, at least an informal testing plan should be in place if none existed before. The next step will be to deal with some of the aspects below so as to allow a group to own the duty of generating an official data protection test plan, and then to formalise that plan.
  2. Duplication – There should be documented details of what is and what isn’t duplicated within the backup environment. Are only production systems duplicated? Are only production Tier 1 systems duplicated? The first step towards achieving satisfactory duplication/cloning of backups is to note the current level of protection and expand outwards from that. The next step will be to develop tier guidelines specifying what type of backup receives what level of duplication. If there are already service tiers in the environment, these can serve as a starting point, slotting existing architecture and capability onto those tiers. Where existing architecture is insufficient, it should be noted, and budgets/plans should be developed next to deal with these shortfalls.
  3. Documentation – As I mentioned before, the backup environment should be documented. Each team involved in the backup process should assign at least one individual to write documentation for their sections (e.g., Unix system administrators would write Unix backup and recovery guidelines, Windows system administrators would do the same for Windows, and so on). This should actually involve three people: the writer, the peer reviewer, and the manager or team leader who accepts the documentation as sufficiently complete. The next step will be to hand the documentation over to the backup administrator(s), who will be responsible for collation, contribution of their own sections, and periodic re-issuing of the documents for updates.
  4. Training – If staff (specifically administrators and operators) had previously not been trained in backup administration, a training programme should be in the works. The next step, of course, will be to arrange budget for that training.
  5. Implementing a zero error policy – The first step in implementing a zero error policy is to build the requisite documents: an issues register, an exceptions register, and an escalations register (sketched in code after this list). The next step will be to adjust the work schedules of the administrators involved to allow for the additional time taken to resolve the ‘niggly’ backup problems that have been in the environment for some time as the switchover to a zero error policy is enacted.
  6. Appointing a Data Protection Advocate – The call should have gone out for personnel (particularly backup and/or system administrators) to nominate themselves for the role of DPA within the organisation, or if it is a multi-site organisation, one DPA per site. By now, the organisation should be in a position to decide who becomes the DPA for each site.
  7. Assembling an Information Protection Advisory Council (IPAC) – Getting the IPAC in place is a little more effort because it’s going to involve more groups. However, by now there should be formal recognition of the need for this council, and an informal council membership. The next step will be to have the first formal meeting of the council, where the structure of the group and the roles of the individuals within the group are formalised. Additionally, the IPAC may very well need to make the final decision on who is the DPA for each site, since that DPA will report to them on data protection activities.
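
For resolution 5, here is a minimal sketch of those three registers as structured records, so they can be reported on rather than left as loose documents. Field names and shapes are assumptions; fit them to your own issue and change processes.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Issue:
    raised: date
    summary: str
    resolution: str = ""   # stays empty until resolved: no error left unexplained

@dataclass
class ZeroErrorRegisters:
    issues: list[Issue] = field(default_factory=list)
    exceptions: list[str] = field(default_factory=list)    # known, accepted deviations
    escalations: list[str] = field(default_factory=list)   # items pushed up for resolution

registers = ZeroErrorRegisters()
registers.issues.append(Issue(date(2012, 1, 15), "client web01 missed its nightly backup"))
```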

It’s worth remembering at this point that while these tasks may seem arduous at first, they’re absolutely essential to a well running backup system that actually meshes with the needs of the business. In essence: the longer they’re put off, the more painful they’ll be.

How are you going?

7 New Year’s Backup Resolutions for Companies

Dec 27, 2011
 

New Year’s resolutions for backup

I’d like to suggest that companies be prepared to make (and keep!) 7 new years resolutions when it comes to the field of backup and recovery:

  1. We will test our backups: If you don’t have a testing regime in place, you don’t have a backup system at all.
  2. We will duplicate our backups: Your backup system should not be a single point of failure. If you’re not cloning, replicating or duplicating your backups in some form, your backup system could be the straw that breaks the camel’s back when a major issue occurs.
  3. We will document our backups: As with testing, if your backup environment is undocumented, it’s not a system. All you’ve got is a collection of backups from which, if the right people are around at the right time and in the right frame of mind, you might get a recovery. If you want a backup system in place, you not only have to test your backups, you also have to keep them well documented.
  4. We will train our administrators and operators: It never ceases to amaze me the number of companies that deploy enterprise backup software and then insist that administrators and operators just learn how to use it themselves. While the concept of backup is actually pretty simple (“hey, you, back it up or you’ll lose it!”), the practicality of it can be a little more complex, particularly given that as an environment grows in size, so does the scope and the complexity of a backup system. If you don’t have some form of training (whether it’s internal, by an existing employed expert, or external), you’re at the edge of the event horizon, peering over into the abyss.
  5. We will implement a zero error policy: Again, there’s no such thing as a backup system when there’s no zero error policy. No ifs, no buts, no maybes. If you don’t rigorously implement a zero error policy, you’re flipping a coin every time you do a recovery, regardless of what backup product you use. (To learn more about a zero error policy, check out the trial podcast I did where that was the topic.)
  6. We will appoint a Data Protection Advocate: There’s a lot of data “out there” within a company, not necessarily under central IT control. Someone needs to be thinking about it. That someone should be the Data Protection Advocate (DPA). This person should be tasked with being the somewhat annoying person who is present at every change control meeting, raising her or his hand and saying “But wait, how will this affect our ability to protect our data?” That person should also be someone who wanders around the office(s) looking under desks for those pesky departmental servers and “test” boxes that are deployed, the extra hard drives attached to research machines, etc. If you have multiple offices, you should have a DPA per office. (The role of the DPA is outlined in this post, “What don’t you backup?“)
  7. We will assemble an Information Protection Advisory Council (IPAC): Sitting at an equal tier to the change control board, and reporting directly to the CTO/CIO/CFO, the IPAC will liaise with the DPA(s) and the business to make sure that everyone is across the contingencies that are in place for data protection, and be the “go-to” point for the business when it comes to putting new functions in place. They should be the group that sees a request for a new system or service and collectively liaises with the business and IT to ensure that the information generated by that system/service is protected. (If you want to know more about an IPAC and its role in the business, check out “But where does the DPA fit in?“)

And there you have it – the New Year’s resolutions for your company. You may be surprised: while there’ll be a little effort getting these in place, once they’re there, you’re going to find backup, recovery, and the entire information protection process a lot easier to manage, and a lot more reliable.

Dec 13, 2011
 

For some time I’ve been debating whether to generate podcasts for the NetWorker blog.

Rather than continue to vacillate, I’ve decided to do a sample podcast, make it available here for downloading, and decide what to do based on feedback received.

While raw technical posts don’t translate well to podcasts (how do you quote screen output, for instance?), there are a lot of backup-theory-related posts I make which can readily be converted.

So, please follow the link below to the first podcast, in which I go over a topic near and dear to my heart: What is a zero error policy?

If you’re interested in me producing more podcasts, please let me know. Without feedback, I’ll likely leave it at just this trial. If people are interested though, I’ll setup a proper podcast stream within iTunes and get to work.

Podcast 001: What is a zero error policy?

Cheers!

Dec 05, 2011
 

Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” That is, if you were to look broadly at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error ever happens.

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently) successful your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
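
The arithmetic above, plus a rolling-average check, fits in a few lines. This is only a sketch of the rule of thumb – the sample nightly success counts are invented for illustration.

```python
hosts, filesystems_per_host = 20, 4
savesets = hosts * filesystems_per_host + hosts + 1 + 1   # 80 + 20 index + bootstrap + NMC = 102

low, high = round(savesets * 0.95), round(savesets * 0.98)
print(f"{savesets} savesets nightly; aim for {low}-{high} successes (95-98%)")

def meets_target(nightly_successes: list[int]) -> bool:
    """Any single night may dip to the low mark, provided the rolling
    average still holds the high mark."""
    average = sum(nightly_successes) / len(nightly_successes)
    return min(nightly_successes) >= low and average >= high

print(meets_target([101, 100, 99, 102, 100]))   # True: average 100.4, no night below 97
```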

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as that may be highly desirable, the nature of a backup system – touching so many parts of an operating IT environment – also makes it one of the systems most vulnerable to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if, mid-way through a backup, a client crashes or a tape drive fails. So what I’m actually allowing for with that 2-5% failure rate is “nature of the beast” failures: hardware issues, Murphy’s Law, and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

You’re not the boss of me

Jul 07, 2011
 

Consider the following two questions:

  1. Do you manage your backups, or do your backups manage you?
  2. Does your organisation decide how backups should be done based on SLAs, etc., or do the backups dictate how production operates?

As you can well imagine, the answers to the above questions will very quickly tell you whether you’ve got a healthy or a sick backup environment.

While it’s obvious how both questions should be answered, I’d wager that at least some readers will be getting that little twinge reading the above knowing that I’ve just described their backup environment as sick. And I don’t mean sick as in Gen-Y “fully sick”, I mean unwell.

If your backup environment manages you (most specifically your time and the amount of hair you’ve got left), or your backup environment dictates how production works, then you’ve got some problems you need to address. Now.

A lazy backup admin is a healthy backup admin

In 1996, I joined a system administration team that had one guiding motto: be lazy. Their attitude towards work was without a doubt the most influential one I’ve ever encountered, and it still guides my work life to this day.

I don’t mean lazy as in “avoid work”.

I mean lazy as in “automate! automate! automate!”

As far as they were concerned, the goal of the system administrator should be to automate all regular activities, to the point that they are only ever doing one of four things:

  1. Automating processes.
  2. Checking results of automated processes.
  3. Waiting for something to go wrong/intervention to be required.
  4. Working on a project.

The same approach should be taken with backups. You should not be, say, mindlessly doing repetitive tasks that could be automated – you should be automating them and then checking the automation results. You shouldn’t be fixing errors on a daily basis; you should have a zero error policy, with error processing as an exceptional rather than an everyday task. Or you should be working on the next phase of expanding or updating the backup environment.
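
As a minimal sketch of “check the automation rather than do the work”, the snippet below scans a results report and surfaces only the failures. The one-line-per-saveset format is an assumption; adapt the parsing to whatever your backup product actually emits.

```python
# Assumed report format: "client:/path STATUS", one saveset per line.
sample_report = """web01:/var succeeded
db01:/u01/oracle failed
web01:/home succeeded"""

def failed_savesets(report: str) -> list[str]:
    """Return only the savesets needing human intervention."""
    return [line.rpartition(" ")[0]
            for line in report.splitlines()
            if line.rpartition(" ")[2] != "succeeded"]

failures = failed_savesets(sample_report)
if failures:
    print("Needs intervention:", ", ".join(failures))   # exception handling, not routine
else:
    print("All savesets succeeded - nothing to do.")    # the lazy admin's happy path
```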

Et tu, defendo?

The backup system shouldn’t be ambushing primary production. It should be there as a guardian, a defender – not the system that stabs from the shadows, or hogs the limelight.

Every backup product, and every backup system, will of course have limitations. But these limitations should not prevent critical activities in production from being undertaken. Instead, limitations should be ameliorated such that what needs to be done in production can still be done, with appropriate workarounds in place. If the limitations are hard ones which require a rethink of how production is done, it should not be at the expense of the business functions or the end users. This may require mitigation with other technologies – for instance, a classic scenario in situations where the backup product can’t run backups as frequently as SLAs require is to mix traditional backups and snapshots.
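
To make that mixed approach concrete, here’s a back-of-envelope sketch: if the backup product can only deliver a daily backup but the SLA demands a four-hour RPO, snapshots fill the gap. The figures are illustrative assumptions.

```python
from datetime import timedelta

rpo = timedelta(hours=4)                # what the SLA requires
backup_interval = timedelta(hours=24)   # what the backup product can deliver

if backup_interval > rpo:
    snapshots_needed = int(backup_interval / rpo) - 1
    print(f"Schedule {snapshots_needed} snapshots between daily backups "
          f"(one every {rpo}) to honour the RPO.")
else:
    print("Backups alone already satisfy the RPO.")
```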

Some SLAs, in light of the available budget and technology, should be reassessed. That’s not to say, however, that all of them should be. A sick backup system is one where any SLA, no matter how justified, that can’t be immediately met by the backup system “as is”, is abandoned.

You’re not the boss of me

So, are you in charge of your backup system, or is your backup system in charge of you?

If you can’t answer that question the right way, it’s time to seize control and make sure next time someone asks you, you can.

Apr 15, 2011
 

In the past I’ve talked about the importance of having zero error policies.

In “What is a zero error policy?“, I said:

Having a zero error policy requires the following three rules:

1. All errors shall be known.

2. All errors shall be resolved.

3. No error shall be allowed to continue to occur indefinitely.

If you’ve not read that article, I suggest you go read it, as well as the follow-up article, “Zero error policy management“.

I’m going to make, and stand by, with fervid determination, the following assertion:

If you do not have a zero-error policy for your backups, you do not have a valid backup system.

No ifs, no buts, no maybes, no exceptions.

Why? Because across all the sites I’ve seen, regardless of size, regardless of complexity, the only ones that actually work properly are those where every error is captured, identified, and dealt with. Only those sites would I point at and say “they have every chance of meeting their SLAs”.

In my book, I introduce the notion that just deploying software and thinking you have a backup system is like making a sacrifice to a volcano. So, without a zero error policy, what does a network diagram of your IT environment look like?

It looks like this:

[Figure: network diagram of a backup environment without zero error policies]

Error lifecycle management

Jun 07, 2010
 

In previous articles I’ve discussed the need for zero error policies. This was covered first in What is a Zero Error Policy?, and followed up in Zero Error Policy Management. (If you’ve not read those articles, you really should before continuing.)

Key to ensuring a zero error policy is not only adopted, but also achieved, is a good understanding of the error lifecycle. That’s right – errors have a lifecycle, which is not only well defined, but actually helps us to keep them under control. An error lifecycle will resemble the following:

[Figure: the error lifecycle]

The start of the lifecycle is our Test and Detect loop:

  • Detect – An error is determined to have happened either as a result of a significant fault, or as a result of routine monitoring and analysis.
  • Test – An error is determined to have happened as a result of actual testing (formal or informal).

Once it’s determined that an error has happened, we then move into the resolution cycle, which consists of:

  • Diagnose – Determine the nature of the error – i.e., the root cause. If you don’t understand the actual cause, you can’t be certain that any solution you come up with is complete.
  • Rectify – Having understood the error, it’s time to resolve it. There are two standard resolution techniques: complete resolution, or a workaround. Either is acceptable, so long as the technique chosen is acceptable to the business and appropriate to the error.
  • Document – Once an error is solved, it needs to be documented. As has been said on numerous occasions, “those who don’t learn from history are doomed to repeat it.” One of the worst possible error situations, for instance, is one where you’ve solved an error in the past, but can’t remember what you did and thus have to repeat the entire process. At minimum, documentation requires three components: (a) what led to the error, (b) how the error manifests/is detected, and (c) how the error was resolved.

The error lifecycle doesn’t stop there though, as indicated by the diagram; instead, we add that error into a test and detection register – having encountered it, we should be able to more easily be on the look out for another instance. This is hopefully where the error finishes: being monitored for, but never again recurring. In the event though that it does reoccur, the diagnosis, rectification and documentation process should be simpler.

There you have it – the error lifecycle. Knowing it allows you to manage errors, rather than errors managing you.

The A-Z of Backup and Recovery

Jan 07, 2010
 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

In-guest backup:

Pros:

  • Maximum control over backup granularity, down to the individual file level.
  • Coupled with NetWorker modules, allows for comprehensive application-consistent backups of enterprise products such as Oracle, Exchange Server, Sybase, Microsoft SQL Server, SAP, etc.
  • Very strong support for granular recovery options.
  • Least affected by changes to underlying virtual machine backup options.

Cons:

  • Each in-guest backup is unaware of other backups that may be happening on other virtual machines on the same server. Thus, the backups have the potential to actively compete for CPU, RAM, network bandwidth and storage I/O. An aggressive or ill-considered approach to in-guest backup configuration can bring an entire virtual environment to its knees.
  • Suffers the same problems as conventional per-host agent backup solutions, most notably potential performance inhibitors such as dense filesystems. Can result in the longest backups of all options.
  • Bare Metal Recovery options are often more problematic or involved.

And there we have it. Maybe neither short nor succinct, yet hopefully useful nonetheless.

Most popular in August

Sep 01, 2009
 

The most visited post in August was, again, Carry a jukebox with you (if you’re using Linux). I think part of this must be attributed to the linkage of Linux with “free”. That is, because Linux is seen as low cost (or no cost), there’s a core group, particularly of open source fans, who want to come up with a totally free solution for their environment, no matter what that environment is.

However, I don’t think that’s all that can be attributed to why this article keep on drawing people in. Despite my reservations about VTL, a lot of people are interested in deploying them. It’s important to stress again – I don’t dislike VTLs, I just wish we didn’t need them. Recognising though that we do need them, I can appreciate the management benefits that they bring to an environment.

From a support perspective of course I’m a big fan – with a VTL I can carry a jukebox around wherever I go.

The Linux VTL post even beat out old standards – the parallelism and NSR peer information related posts, which normally win hands down every month.

(From a policy and procedural perspective though, it was good to see that the introductory post to zero error policies, What is a Zero Error Policy?, got the next most attention. I can’t really stress enough how important I think zero error policies are to systems management in general, and backup/data protection specifically.)