The cockatrice was a legendary beast: a two-legged dragon with the head of a rooster that could, amongst other things, turn people to stone with a glance. So it was somewhat similar to a basilisk, but a whole lot uglier, and it looked like it had been designed by a committee.

You may be surprised to know that there are cockatrice backup environments out there. Such an environment can be just as ugly as the mythical cockatrice, and just as dangerous, turning even a hardened backup expert to stone as he or she tries to sort through the “what-abouts?”, the “where-ares?” and the “who-does?”

These environments are typically quite organic, and have grown and developed over years, usually with multiple staff having been involved and/or responsible, but no one staff member having had sufficient ownership (or longevity) to establish a single unifying factor within the environment. That in itself would be challenging enough, but to really make the backup environment a cockatrice, there’ll also be a lack of documentation.

In such environments, it’s quite possible that the environment is largely acting like a backup system, but only through a combination of sheer luck and a certain level of procedural adherence, typically by operators who have remained in the environment for long enough. These are the systems for which, when the question “But why do you do X?” is asked, the answer is simply, “Because we’ve always done X.”

In this sort of system, new technologies have typically just been tacked on, sometimes shoe-horned into “pretending” they work just like the old systems, and sometimes not used at their peak efficiency because of the general reluctance to change that such systems engender. (A classic example can be seen where a deduplication system is tacked onto an existing backup environment, but is treated like a standard VTL or a standard backup-to-disk region, without any consideration for the particularities involved in using deduplication storage.)

The good news is, these environments can be fixed, and turned into true backup systems. To do so, four decisions need to be made:

  1. To embrace change. The first essential step is to eliminate the “it’s always been done this way before” mentality. This doesn’t allow for progress, or change, at all, and if there’s one common factor in any successful business, it’s the ability to change. This is true not just of the business itself, but of each component of the business – and that includes backup.
  2. To assign ownership. A backup system requires both a technical owner and a management owner. Ideally, the technical owner will be the Data Protection Advocate for the company or business group, and the management owner will be both an individual, and the Information Protection Advisory Council. (See here.)
  3. To document. The first step to pulling order out of chaos (or even general disarray and disconnectedness) is to start documenting the environment. “Document! Document! Document!”, you might hear me cry as I write this line – and you wouldn’t be too far wrong. Document the system configuration. Document the rebuild process. Document the backup and recovery processes. Sometimes this documentation will be references to external materials, but a good chunk of it will be material that your staff have to develop themselves.
  4. To plan. Organic growth is fine. Uncontrolled organic or haphazard growth is not. You need to develop a plan for the backup environment. This will be possible once the above aspects have been tackled, but two key parts to that plan should be:
    • How long will the system, in its current form, continue to service our requirements?
    • What are some technologies we should be starting to evaluate now, or at least stay abreast of, for consideration when the system has to be updated?

With those four decisions made, and implemented, the environment can be transfigured from a hodge-podge of technologies with no real unifying principle other than conformity to prior usage patterns into a collection of synergistic tools working seamlessly to optimise the data backup and recovery operations of the company.

 

Resolutions Check-in

In December last year I posted “7 new years backup resolutions for companies”. Since it’s the end of January 2012, I thought I’d check in on those resolutions and suggest where a company should be up to on them, as well as offering some next steps.

  1. Testing – The first resolution related to ensuring backups are tested. By now at least an informal testing plan should be in place if none were before. The next step will be to deal with some of the aspects below so as to allow a group to own the duty of generating an official data protection test plan, and then formalise that plan.
  2. Duplication – There should be documented details of what is and what isn’t duplicated within the backup environment. Are only production systems duplicated? Are only production Tier 1 systems duplicated? The first step towards achieving satisfactory duplication/cloning of backups is to note the current level of protection and expand outwards from that. The next step will be to develop tier guidelines to allow a specification of what type of backup receives what level of duplication. If there are already service tiers in the environment, this can serve as a starting point, slotting existing architecture and capability onto those tiers. Where existing architecture is insufficient, it should be noted and budgets/plans should be developed next to deal with these short-falls.
  3. Documentation – As I mentioned before, the backup environment should be documented. Each team that is involved in the backup process should have assigned at least one individual to write documentation relating to their sections (e.g., Unix system administrators would write Unix backup and recovery guidelines, Windows system administrators would do the same for Windows, and so on). This should actually involve 3 people: the writer, the peer reviewer, and the manager or team leader who accepts the documentation as sufficiently complete. The next step after this will be to hand over documentation to the backup administrator(s) who will be responsible for collation, contribution of their sections, and periodic re-issuing of the documents for updates.
  4. Training – If staff (specifically administrators and operators) had previously not been trained in backup administration, a training programme should be in the works. The next step, of course, will be to arrange budget for that training.
  5. Implementing a zero error policy – First step in implementing a zero error policy is to build the requisite documents: an issues register, an exceptions register, and an escalations register. Next step will be to adjust the work schedules of the administrators involved to allow for additional time taken to resolve the ‘niggly’ backup problems that have been in the environment for some time as the switchover to a zero error policy is enacted.
  6. Appointing a Data Protection Advocate – The call should have gone out for personnel (particularly backup and/or system administrators) to nominate themselves for the role of DPA within the organisation, or if it is a multi-site organisation, one DPA per site. By now, the organisation should be in a position to decide who becomes the DPA for each site.
  7. Assembling an Information Protection Advisory Council (IPAC) – Getting the IPAC in place is a little more effort because it’s going to involve more groups. However, by now there should be formal recognition of the need for this council, and an informal council membership. The next step will be to have the first formal meeting of the council, where the structure of the group and the roles of the individuals within the group are formalised. Additionally, the IPAC may very well need to make the final decision on who is the DPA for each site, since that DPA will report to them on data protection activities.

It’s worth remembering at this point that while these tasks may seem arduous at first, they’re absolutely essential to a well running backup system that actually meshes with the needs of the business. In essence: the longer they’re put off, the more painful they’ll be.

How are you going?

 

Continuing on from my post relating to dark data last week, I want to spend a little more time on data awareness classification and distribution within an enterprise environment.

Dark data isn’t the end of the story, and it’s time to introduce the entire family of data-awareness concepts. These are:

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote the previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data which is subsequently classified out of grey data state into a state where the data is known to have value, but is not either managed or protected, because it can be recreated. It could be that the decision is made that the cost (in time) of recreating the data is less expensive than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but all the “bits” (no pun intended) that are left which are neither data, grey data nor utility data. In essence, this is irrelevant data, which someone or some group may be keeping for unnecessary reasons, and in actual fact should be considered eligible for either deletion or archival and deletion.
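One way to make these categories concrete is as a small decision procedure. The Python sketch below is only an illustration: the `Item` fields and the `classify` function are hypothetical names rather than part of any product, and the rule that valuable, non-recreatable data stays grey until it is brought under ILM and ILP is an assumption drawn from the definitions above.

```python
from dataclasses import dataclass
from enum import Enum


class Awareness(Enum):
    DATA = "data"
    DARK = "dark data"
    GREY = "grey data"
    UTILITY = "utility data"
    NOISE = "noise"


@dataclass
class Item:
    """Hypothetical record describing one collection of data."""
    known: bool = False        # the business is aware of it
    managed: bool = False      # covered by storage administration (ILM)
    protected: bool = False    # covered by backup and recovery (ILP)
    reviewed: bool = False     # a classification decision has been made
    has_value: bool = False    # the business considers it worth keeping
    recreatable: bool = False  # cheaper to recreate than to manage and protect


def classify(item: Item) -> Awareness:
    """Map an item onto one of the five awareness states."""
    if not item.known:
        return Awareness.DARK
    if item.managed and item.protected:
        return Awareness.DATA
    if not item.reviewed:
        return Awareness.GREY
    if not item.has_value:
        return Awareness.NOISE
    if item.recreatable:
        return Awareness.UTILITY
    # Assumption: valuable data that cannot be recreated still needs
    # ILM/ILP policies, so it remains grey until they are assigned.
    return Awareness.GREY


if __name__ == "__main__":
    usb_drive = Item()  # nobody in IT knows about it yet
    print(classify(usb_drive).value)
    build_cache = Item(known=True, reviewed=True, has_value=True, recreatable=True)
    print(classify(build_cache).value)
```

The ordering of the checks matters: an item is dark until it is known about, and only known items can be grey, utility data or noise.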

The distribution of data by awareness within the enterprise may resemble something along the following lines:

Data Awareness Percentage Distribution

That is, ideally the largest percentage of data should be regular data which is known, managed and protected. In all likelihood for most organisations, the next biggest percentage of data is going to be dark data – the data that hasn’t been discovered yet. Ideally however, after regular and dark data have been removed from the distribution, there should be at most 20% of data left, and this should be broken up such that at least half of that remaining data is utility data, with the last 10% split evenly between grey data and noise.

The logical implications of this layout should be reasonably straightforward:

  1. At all times the majority of data within an organisation should be known, managed and protected.
  2. It should be expected that at least 20% of the data within an organisation is undiscovered, or decentralised.
  3. Once data is discovered, it should exist in a ‘grey’ state for a very short period of time; ideally it should be reclassified as soon as possible into data, utility data or noise. In particular, data left in a grey state for an extended period of time represents just as dangerous a potential data loss situation as dark data.
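The proportions described above can be turned into a simple sanity check. The following Python sketch assumes you already have an estimated percentage breakdown by awareness state; the function name `check_distribution`, the dictionary keys, and the example figures (60% data, 20% dark) are hypothetical and chosen only to satisfy the guidelines.

```python
def check_distribution(pct):
    """Return a list of warnings for a percentage breakdown by awareness state.

    `pct` maps "data", "dark", "utility", "grey" and "noise" to percentages.
    """
    warnings = []
    total = sum(pct.values())
    if abs(total - 100.0) > 1e-6:
        warnings.append(f"percentages sum to {total}, not 100")

    # The majority of data should be known, managed and protected.
    if pct["data"] <= 50:
        warnings.append("regular data is not the majority")

    # Whatever is neither regular nor dark should be at most 20% ...
    remainder = pct["utility"] + pct["grey"] + pct["noise"]
    if remainder > 20:
        warnings.append("more than 20% is utility data, grey data or noise")

    # ... and at least half of that remainder should be utility data.
    if pct["utility"] < remainder / 2:
        warnings.append("utility data is less than half of the remainder")

    return warnings


if __name__ == "__main__":
    example = {"data": 60, "dark": 20, "utility": 10, "grey": 5, "noise": 5}
    print(check_distribution(example))
```

An empty list means the breakdown is consistent with the guidelines; it says nothing about whether the estimates themselves are accurate, which is the harder problem where dark data is concerned.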

It should be noted that regular data, even in this awareness classification scheme, will still be subject to regular data lifecycle decisions (archive, tiering, deletion, etc.). In that sense, primary data eligible for deletion isn’t really noise, because it’s previously been managed and protected; noise really is ex dark-data that will end up being deleted, either as an explicit decision, or due to a failure at some future point after the decision to classify it as ‘noise’, having never been managed or protected in a centralised, coordinated manner.

Equally, utility data won’t refer to say, Q/A or test databases that replicate the content of production databases. These types of databases will again have fallen under the standard data umbrella in that there will have been information lifecycle management and protection policies established for them, regardless of what those policies actually were.

If we bring this back to roles, then it’s clear that a pivotal role of both the DPAs (Data Protection Advocates) and the IPAC (Information Protection Advisory Council) within an organisation should be the rapid coordination of classification of dark data as it is discovered into one of the data, utility data or noise states.

 

Dark Data

We’ve all heard the term Big Data - it’s something the vendors have been ramming down our throats with the same level of enthusiasm as Cloud. Personally, I think Big Data is a problem that shouldn’t exist: it serves for me as a stark criticism of OS, Application, Storage and Software companies for failing to anticipate the high end of the data growth arena and developing suitable mechanisms for dealing with it as part of the regular tool sets. After all, why should the end user have to ask him/herself: “Hmmm, do I have data or big data?”

Moving right along, recently another term has been starting to pop up, and it’s a far more interesting – and legitimate – problem.

It’s dark data.

If you haven’t heard of the term, I’m betting that you’ve either guessed the meaning or have a bit of an idea about it.

Dark data refers to all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for. Such as:

  • All those user PST files on desktops and notebooks;
  • That server a small workgroup deployed for testing purposes that’s not centrally managed or officially known about;
  • That research data an academic is storing on a 2TB USB drive connected to her laptop;
  • That offline copy of a chunk of the fileserver someone grabbed before going overseas that’s now sufficiently different from the real content of the fileserver;
  • and so on.

Dark data is a real issue within the business environment, because there’s potentially a large amount of critical information “out there” in the business but not necessarily under the control of the IT department.

You might call it decentralised data.

As we know from data protection, decentralised backups are particularly dangerous; they increase the cost of control and maintenance, they decrease the reliability of the process, and they can be a security nightmare. It’s exactly the same for dark data – in fact, worse, because by the very nature of the definition, it’s also data that’s unlikely to be backed up.

To try to control the spread of dark data, some companies will institute rigorous local storage policies, but these often present bigger headaches than they’re worth. For instance, locking down user desktops to make local storage not writeable isn’t always successful, and the added network load by shifting user profiles across to fileservers can be painful. Further, pushing these files across to centralised storage can make for extremely dense filesystems (or at least contribute towards them), trading one problem for another. Finally, it introduces new risk to the business, making users extremely unproductive if there are network or central storage issues.

There are a few things a business can do in relation to dark data so as to decrease the headache and challenges created by it. These are acceptance, anticipation, and discovery.

  1. Acceptance – Acknowledge that dark data will find its way into the organisation. Keeping the corporate head in the sand over the existence of dark data, or blindly adhering to the (false) notion that rigorous security policies will prevent storage of data anywhere in the organisation except centrally, is foolish. Now, this doesn’t mean that you have to accept that data will become dark. Instead, acknowledging that there will be dark data out there will keep it as a known issue. What’s more, because it’s actually acknowledged by the business, it can be discussed by the business. Discussion will facilitate two key factors: keeping users aware of the dangers of dark data, and encouraging users to report dark data.
  2. Anticipation – Accepting that dark data exists is one thing; anticipating what can be done about it, and how it might be found, allows a company to actually start dealing with dark data. Anticipating dark data can’t happen unless someone is responsible for it. Now, I’m not suggesting that being responsible for dark data means getting in trouble if there are issues with unprotected dark data going missing – if that were the case, not a single person in a company would want to be responsible for it. (And any person who did want to be responsible under those circumstances would likely not understand the scope of the issue.) The obvious person for this responsibility is the Data Protection Advocate. (See here and here.) You might argue that the dark data problem explicitly points out the need for one or more DPAs at every business.
  3. Discovery – No discovery process for dark data will be fully automated. There will be a level of automation that can be achieved via indexing and search engines deployed from central IT, but given dark data may be on systems which are only intermittently connected, or outside of the domain authority of IT, there will be a human element as well. This will consist of the DPA(s), end users, and team leaders, viz:
    • The DPA will be tasked with not only periodic visual inspections of his/her area of responsibility, but will also be responsible for issuing periodic reminders to staff, requesting notification of any local data storage.
    • End users should be aware (via induction, and company policies) of the need to avoid, as much as possible, the creation of data outside of the control and management of central IT. But they should equally be aware that in situations where this happens, a policy can be followed to notify IT to ensure that the data is protected or reviewed.
    • Team leaders should equally be aware of the potential for dark data creation, as per end users, but should also be tasked with liaising with IT to ensure dark data, once discovered, is appropriately classified, managed and protected. This may sometimes necessitate moving the data under IT control, but it may also at times be an acknowledgement that the data is best left local, with appropriate protection measures implemented and agreed upon.

Dark data is a real problem that will exist in practically every business; however, it doesn’t have to be a serious problem when carefully dealt with. The above three rules – acceptance, anticipation, and discovery – will ensure it stays managed.

[2012-01-27 Addendum]

There’s now a followup to this article – “Data Awareness Distribution in the Enterprise”.

 

Obviously the NetWorker Blog gets a lot of referrals from search engines via people looking specifically for help on particular NetWorker issues they’re encountering. Even just in the last 8+ hours, here are just some of the search terms that people used:

nmc doesn’t start

restore networker aborted saveset

networker disk backup module

nsr_render_log command

nsr_render_log daemon.raw

networker centos support

39077:jbconfig: error, you must install the lus scsi passthrough driver before configuring

And the list goes on and on, on a daily basis. This was reflected in the Top 10 for 2011 (and indeed, the top 10 for every previous year, too).

I’ll let you all in on a little secret though: all of those tips, all of those NetWorker basics articles and how to use nsradmin user guides – they’re all just the tip of the iceberg when it comes to getting a working backup system in place.

You see, a lot of sites don’t have a backup system at all – they just have some backup software and backup hardware and configuration. That doesn’t represent a backup system at all. In my article, “What is a backup system?”, I provided this diagram to explain such beasts:

Backup system

As you can see, the technology (the backup software, hardware and configuration) represents just one entry point to having a backup system. The others though are all equally critical; and when you add them all in together, it becomes clear that a backup system will derive much of its success and reliability from the human and business factors.

The technology, you see, is the easiest part of the backup environment; and it’s also the part that’s most likely to appeal to IT people. If you were to graph how much time the average site spends on each of those activities, it would probably look like this:

Imbalanced backup systems

When in actual fact, it should look more like this:

Balanced backup system

The short description? If you chart the amount of time you spend on your backup “system”, and the Technology aspect (software, hardware, configuration) becomes a Pacman to the rest of the components, eating away at the rest of those facets, then you’ve got a cannibalistic environment that’s surviving as much on luck/good fortune as it is on good design.

That’s why I bang on so much about backup theory – because all the latest and greatest technology in the world won’t help you at all if you don’t have everything else set up in conjunction with it:

  • The people involved need to know their roles, and participate in both the architecture of the environment and its ongoing operation;
  • The processes for use of the system must be well established;
  • The system must be thoroughly documented;
  • The system must be tested or you’ve got no way of establishing reliability;
  • The Service Level Agreements have to be established or else there’s no point whatsoever to what you’re doing.

Backup theory isn’t the boring part of a backup system; I’d suggest it’s actually the most interesting part of it. Just as I suggested that companies need to plan to follow some new years resolutions for backup systems, I’d equally suggest that the people involved in backups should start making it their goal to spend a balanced amount of time on the components that form a backup system.

If you don’t have the theory, you actually don’t have a system.

If you want to know more, you should treat yourself to my book (now available in Kindle format).

 

New years resolutions for backup

I’d like to suggest that companies be prepared to make (and keep!) 7 new years resolutions when it comes to the field of backup and recovery:

  1. We will test our backups: If you don’t have a testing regime in place, you don’t have a backup system at all.
  2. We will duplicate our backups: Your backup system should not be a single point of failure. If you’re not cloning, replicating or duplicating your backups in some form, your backup system could be the straw that breaks the camel’s back when a major issue occurs.
  3. We will document our backups: As for testing, if your backup environment is undocumented, it’s not a system. All you’ve got is a collection of backups from which, if the right people are around at the right time and in the right frame of mind, you could get a recovery. If you want a backup system in place, you not only have to test your backups, you also have to keep them well documented.
  4. We will train our administrators and operators: It never ceases to amaze me the number of companies that deploy enterprise backup software and then insist that administrators and operators just learn how to use it themselves. While the concept of backup is actually pretty simple (“hey, you, back it up or you’ll lose it!”), the practicality of it can be a little more complex, particularly given that as an environment grows in size, so does the scope and the complexity of a backup system. If you don’t have some form of training (whether it’s internal, by an existing employed expert, or external), you’re at the edge of the event horizon, peering over into the abyss.
  5. We will implement a zero error policy: Again, there’s no such thing as a backup system when there’s no zero error policy. No ifs, no buts, no maybes. If you don’t rigorously implement a zero error policy, you’re flipping a coin every time you do a recovery, regardless of what backup product you use. (To learn more about a zero error policy, check out the trial podcast I did where that was the topic.)
  6. We will appoint a Data Protection Advocate: There’s a lot of data “out there” within a company, not necessarily under central IT control. Someone needs to be thinking about it. That someone should be the Data Protection Advocate (DPA). This person should be tasked with being the somewhat annoying person who is present at every change control meeting, raising her or his hand and saying “But wait, how will this affect our ability to protect our data?” That person should also be someone who wanders around the office(s) looking under desks for those pesky departmental servers and “test” boxes that are deployed, the extra hard drives attached to research machines, etc. If you have multiple offices, you should have a DPA per office. (The role of the DPA is outlined in this post, “What don’t you backup?“)
  7. We will assemble an Information Protection Advisory Council (IPAC): Sitting at an equal tier to the change control board, and reporting directly to the CTO/CIO/CFO, the IPAC will liaise with the DPA(s) and the business to make sure that everyone is across the contingencies that are in place for data protection, and be the “go-to” point for the business when it comes to putting new functions in place. They should be the group that sees a request for a new system or service and collectively liaises with the business and IT to ensure that the information generated by that system/service is protected. (If you want to know more about an IPAC and its role in the business, check out “But where does the DPA fit in?“)

And there you have it – the new years resolutions for your company. You may be surprised – while there’ll be a little effort getting these in place, once they’re there, you’re going to find backup, recovery, and the entire information protection process a lot easier to manage, and a lot more reliable.

 

It’s that time of the year where I sit back for a moment and look at what articles have attracted the most readers over the year, and it’s a fairly eclectic bunch. Interestingly, for the first time since forever, the article about fixing NSR Peer Information issues didn’t come first – we have some new winners.

10 – New Micromanual – LinuxVTL and NetWorker

The second micromanual was a step-by-step guide for configuring the open source LinuxVTL system with NetWorker. I had hoped when I started writing micromanuals that I’d deliver them more frequently, but various factors get in the way of this. Maybe in 2012 I’ll be able to get a couple more out and available.

9 – Killing scheduled cloning operations

When NetWorker’s scheduled clone option was introduced, there were a few bugs relating to stopping a scheduled clone operation from the GUI. Sometimes you could, and sometimes you couldn’t. However, you could always kill a scheduled clone job from the command line, which is what this post explained.

8 – NetWorker Firewall Configuration on Windows

Very early in the year I was doing a lot of work with NetWorker on Windows 2008 R2, and I was noticing a few gaps in the installation process when it came to the process of automated configuration of the Windows Firewall to work with NetWorker daemons. This post explained the lessons I learnt.

7 – Carry a jukebox with you (if you’re using Linux)

This article was my first post about configuring the open source LinuxVTL system with NetWorker. Since then LinuxVTL has evolved quite a lot, and I’ll likely even need to update that micromanual early in the new year as a consequence.

6 – Why I’d choose NetWorker over NetBackup Every Time

Despite the fact that the article was titled “Why I’d choose…”, I had a rather indignant response to this post insisting I was being a jerk by writing it. I stand by every word in that post. I would not, personally, elect to choose NetBackup over NetWorker on the basis that NetBackup only has true image recovery as an option, and that NetBackup doesn’t support dependency chains for backup images. I see both of these factors as critical to a true enterprise backup product, and NetBackup only half supports one of them. That doesn’t make me a jerk, it makes me someone who gives a damn about your data.

5 – Using NetWorker Client with Opensolaris

A guest article written by Ronny Egner, this post covered off getting the NetWorker client working with the OpenSolaris version of Solaris.

4 – Basics – Fixing “NSR peer information” errors

A persistent challenge in NetWorker is when the NSR peer information gets out of whack; usually this can happen when a significant change happens on a client, and the server must have this information reset. I’d still love to see this article become irrelevant by seeing an option appear in NMC to handle it, but until then, this will remain a fairly popular article.

3 – This is wrong

Earlier this year, an Australian hosting service lost thousands of hosted domains and websites due to a “hack attack”. Supposedly the clever hackers destroyed not only the production data, but also all the backups.

What really went wrong was that the company in question had designed a very poor and inadequate backup solution. Rumours were abounding at the time that backups were just simply replicated snapshots. Snapshots may be able to act as backups, but not indefinitely, and not if they’re the only thing configured. (Backups and snapshots are effectively ‘sister’ activities in ILP.)

2 – micromanual: NetWorker Power User Guide to nsradmin

The original micromanual – “NetWorker power user guide to nsradmin” – was and remains extremely popular. There have been thousands of downloads of it since its release, including quite a number from EMC themselves, so it’s clearly a handy resource. If you’ve not downloaded it yourself but you want to boost your NetWorker productivity, it’s a must read.

1 – NetWorker 7.6 SP1

When NetWorker 7.6 SP1 came out, it was a huge release. In my opinion, it should have been numbered NetWorker 7.7 at least; it wasn’t a minor set of changes or a round of bug fixes, it included significant functionality updates (including one of my favourites – support for Boost). As the number one read article of the year, it’s been a big resource for people looking at the functionality of newer releases of NetWorker.

And that, they say, is that

This year has personally been a huge year for me. My partner and I moved state/city in June, going from a regional area just outside of Sydney to the inner west of Melbourne. We also celebrated our 15th anniversary together, surrounded by many of our new friends (who are like family to us) and a few of our old friends. We were even invited to get on the radio to talk about that, not only because of the longevity of the relationship but also for having run the anniversary party up against the monthly Melbourne Den night. (There’s a podcast coming…) It was also the year when I sorted a lot of stuff out, and to boil all this down: it was the year that I spent a lot of time focusing on my personal life and not so much on the blog.

There may still be one or two posts left for 2011, but I’m also starting to get my head around changes and new material for 2012, and I believe 2012 will be a big year for NetWorker users.

 

For some time I’ve been debating whether to generate podcasts for the NetWorker blog.

Rather than continue to vacillate, I’ve decided to do a sample podcast, make it available here for downloading, and decide what to do based on feedback received.

While raw technical posts don’t translate well to podcasts (how do you quote screen output, for instance?), there are a lot of backup theory related posts I make which can readily be converted.

So, please follow the link below to the first podcast, in which I go over a topic near and dear to my heart: What is a zero error policy?

If you’re interested in me producing more podcasts, please let me know. Without feedback, I’ll likely leave it at just this trial. If people are interested though, I’ll set up a proper podcast stream within iTunes and get to work.

Podcast 001: What is a zero error policy?

Cheers!

 

Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error happens.

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently successful) your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets and 95% of 102 is 97 savesets (both rounded). I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
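The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the counting and the 95-98% range, not anything NetWorker provides: the function names are my own, and the nightly figures in the usage lines are made-up sample data.

```python
def nightly_savesets(hosts, filesystems_per_host):
    """Count savesets per night using the breakdown in the example above."""
    filesystem_savesets = hosts * filesystems_per_host
    index_savesets = hosts   # one index saveset per host
    bootstrap_savesets = 1
    nmc_database_savesets = 1
    return (filesystem_savesets + index_savesets
            + bootstrap_savesets + nmc_database_savesets)


def target_range(total, low=0.95, high=0.98):
    """Return the (low, high) number of savesets that should succeed."""
    return round(total * low), round(total * high)


def rolling_success_rate(nightly_results):
    """Success rate over several nights of (succeeded, attempted) pairs."""
    succeeded = sum(ok for ok, _ in nightly_results)
    attempted = sum(total for _, total in nightly_results)
    return succeeded / attempted


total = nightly_savesets(hosts=20, filesystems_per_host=4)
print(total)                # 102
print(target_range(total))  # (97, 100)

# Hypothetical week: each night is (savesets succeeded, savesets attempted).
week = [(97, 102), (100, 102), (101, 102)]
print(f"{rolling_success_rate(week):.1%}")
```

A single night at the low mark is acceptable in this scheme; it is the rolling figure that should sit at or near the high mark.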

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as it may be highly desirable, a backup system by its nature touches on so many parts of an operating IT environment that it’s also one of the systems most vulnerable to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually allowing for with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

 

Long-term blog readers will know that I advocate a zero error policy within backup environments.

This is elucidated in my posts:

You could say that those posts are precursors to this post, and if you’re not familiar with what I’ve had to say there, you may want to read those first.

One of the critical mistakes I periodically see when companies try to implement a zero error policy is that they focus too much on the errors.

The errors, though, are often just the “tip of the iceberg”.

For instance, take the simplest of errors – an open file error. You might run a backup of a Windows filesystem which reports a collection of errors relating to files that were skipped because they were open at the time.

Yet, those open files aren’t really the error. Seeing them as the error is usually a case of mistaking cause and effect. In this scenario, the error is one of:

  • The backup software is misconfigured, or
  • The backup software is missing modules that allow it to back up open files.

In the first case, it may be that the file(s) which are reported as open and couldn’t be backed up actually don’t need to be backed up. They may be temporary files, or cache files, or some other short-lived collection of files that have no importance in terms of data protection. So the error there isn’t the individual files that failed to back up, but the failure to configure the exclusions for the client appropriately.

In the second case, it may be that those files really do need to be backed up, but to do so requires a special module. They may be database files (e.g., Microsoft SQL Server, Microsoft Exchange, etc.), or some other collection of files that must be quiesced before backup. In this case, the error is that the system is being backed up inconsistently.
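To make the distinction concrete, here is a minimal sketch of collapsing a pile of open-file errors into the handful of underlying problems they represent. Everything in it is an assumption for illustration: the input is a hand-built list of (client, path) pairs rather than real backup output, the client names are invented, and the suffix lists are examples only. Real rules would come out of talking to the system owners.

```python
from collections import defaultdict

# Hypothetical rules, for illustration only.
TRANSIENT_SUFFIXES = {".tmp", ".cache", ".lock"}   # candidates for exclusion
DATABASE_SUFFIXES = {".mdf", ".ldf", ".edb"}       # SQL Server / Exchange files


def suffix(path):
    """Return the lower-cased file extension of a Windows or Unix path."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    dot = name.rfind(".")
    return name[dot:].lower() if dot > 0 else ""


def root_causes(skipped):
    """Group skipped open files by client and the likely underlying problem."""
    causes = defaultdict(set)
    for client, path in skipped:
        ext = suffix(path)
        if ext in TRANSIENT_SUFFIXES:
            causes[client].add("missing exclusion")
        elif ext in DATABASE_SUFFIXES:
            causes[client].add("missing database module")
        else:
            causes[client].add("needs discussion with system owner")
    return {client: sorted(found) for client, found in causes.items()}


skipped = [
    ("win01", r"C:\Temp\a.tmp"),
    ("win01", r"C:\Temp\b.tmp"),
    ("sql01", r"D:\Data\sales.mdf"),
    ("sql01", r"D:\Data\sales.ldf"),
    ("app01", r"C:\App\state.dat"),
]
for client, found in root_causes(skipped).items():
    print(client, found)
```

Five error lines become three problems to solve, and anything the rules can't classify is flagged for a conversation rather than guessed at.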

Zero error policies aren’t about playing whack-a-mole with errors; they’re about solving problems.

After all, the captain of the Titanic couldn’t have averted the disaster by stopping the ship just short of the iceberg and having someone take a pick axe to the top of it.

The net result of this is that having a zero error policy requires the following two processes/activities:

  • Discussion of errors with system owners/nominated key users;
  • Root cause analysis.

If either of those is missing, you’re more likely making (at best educated) guesses as to the correct resolution to the errors. However, if you have those in place, you can more confidently review any error as it hits and make an informed (and even documented) decision as to how to resolve the underlying issue that it represents.

Without them, a zero error policy may actually make the situation worse.

© 2012 The NetWorker Blog