Basics: Planning A Recovery Service

 Architecture, Basics, Recovery  Comments Off on Basics: Planning A Recovery Service
Jan 302018


In Data Protection: Ensuring Data Availability, I talk quite a lot about what you need to understand and plan as part of a data protection environment. I’m often reminded of the old saying from clothing and carpentry – “measure twice, cut once”. The lesson in that statement of course is that rushing into something headlong may make your work more problematic. Taking the time to properly plan what you’re doing though can in a lot of instances (and data protection is one such instance) make the entire process easier. This post isn’t meant to be a replacement to the various planning chapters in my book – but I’m sure it’ll have some useful tips regardless.

We don’t backup just as something to do; in fact, we don’t protect data just as something to do, either. We protect data to either shield our applications and services (and therefore our businesses) from failures, and to ensure we can recover it if necessary. So with that in mind, what are some essential activities in planning a recovery service?

Hard disk and magnifying glass

First: Do you know what the data is?

Data classification isn’t something done during a data protection cycle. Maybe one day it will be when AI and machine learning is sufficiently advanced; in the interim though it requires input from people – IT, the business, and so on. Of course, there’s nothing physically preventing you from planning and implementing a recovery service without performing data classification; I’d go so far as to suggest that an easy majority of businesses do exactly that. That doesn’t mean it’s an ideal approach though.

Data classification is all about understanding the purpose of the data, who cares about it, how it is used, and so on. It’s a collection of seemingly innocuous yet actually highly important questions. It’s something I cover quite a bit in my book, and for the very good reason that I honestly believe a recovery service can be made simpler, cheaper and more efficient if it’s complimented by a data classification process within the organisation.

Second: Does the data need to exist?

That’s right – does it need to exist? This is another essential but oft-overlooked part of achieving a cheaper, simpler and more efficient recovery service: data lifecycle management. Yet, every 1TB you can eliminate from your primary storage systems, for the average business at least, is going to yield anywhere between 10 and 30TB savings in protection storage (RAID, replication, snapshots, backup and recovery, long term recovery, etc.). While for some businesses that number may be smaller, for the majority of mid-sized and higher businesses, that 10-30TB saving is likely to go much, much higher – particularly as the criticality of the data increases.

Without a data lifecycle policy, bad things happen over time:

  • Keeping data becomes habitual rather than based on actual need
  • As ‘owners’ of data disappear (e.g., change roles, leave the company, etc.), reluctance to delete, prune or manage the data tends to increase
  • Apathy or intransigence towards developing a data lifecycle programme increases.

Businesses that avoid data classification and data lifecycle condemn themselves to the torment of Sisyphus – constantly trying to roll a boulder up a hill only to have it fall back down again before they get to the top. This manifests in many ways, of course, but in designing, acquiring and managing a data recovery service it usually hits the hardest.

Third: Does the data need to be protected?

I remain a firm believer that it’s always better to backup too much data than not enough. But that’s a default, catchall position rather than one which should be the blanket rule within the business. Part of data classification and data lifecycle will help you determine whether you need to enact specific (or any) data protection models for a dataset. It may be test database instances that can be recovered at any point from production systems; it might be randomly generated data that has no meaning outside of a very specific use case, or it might be transient data merely flowing from one location to another that does not need to be captured and stored.

Remember the lesson from data lifecycle – every 1TB eliminated from primary storage can eliminate 10-30TB of data from protection storage. The next logical step after that is to be able to accurately answer the question, “do we even need to protect this?”

Fourth: What recovery models are required?

At this point, we’ve not talked about technology. This question gets us a little closer to working out what sort of technology we need, because once we have a fair understanding of the data we need to offer recovery services for, we can start thinking about what types of recovery models will be required.

This will essential involve determining how recoveries are done for the data, such as:

  • Full or image level recoveries?
  • Granular recoveries?
  • Point in time recoveries?

Some data may not need every type of recovery model deployed for it. For some data, granular recoverability is equally important as complete recoverability, for other types of data, it could be that the only way to recover it is image/full – wherein granular recoveries would simply leave data corrupted or useless. Does all data require point in time recovery? Much will, but some may not.

Other recovery models you should consider of course are how much users will be involved in recoveries. Self-service for admins? Self-service for end-users? All operator run? Chances are of course it’ll be a mix depending those previous recovery model questions (e.g., you might allow self-service individual email recovery, but full exchange recovery is not going to be an end-user initiated task.)

Fifth: What SLOs/SLAs are required?

Regardless of whether your business has Service Level Objectives (SLOs) or Service Level Agreements (SLAs), there’ll be the potential you have to meet a variety of them depending on the nature of the failure, the criticality and age of the data, and so on. (For the rest of this section, I’ll use ‘SLA’ as a generic term for both SLA and SLO). In fact, there’ll be up to three different categories of SLAs you have to meet:

  • Online: These types of SLAs are for immediate or near-immediate recoverability from failure; they’re meant to keep the data online rather than having to seek to retrieve it from a copy. This will cover options such as continuous replication (e.g., fully mirrored storage arrays), continuous data protection (CDP), as well as more regular replication and snapshot options.
  • Nearline: This is where backup and recovery, archive, and long term retention (e.g., compliance retention of backups/archives) comes into play. Systems in this area are designed to retrieve the data from a copy (or in the case of archive, a tiered, alternate platform) when required, as opposed to ensuring the original copy remains continuously, or near to continuously available.
  • Disaster: These are your “the chips are down” SLAs, which’ll fall into business continuity and/or isolated recovery. Particularly in the event of business continuity, they may overlap with either online or nearline SLAs – but they can also diverge quite a lot. (For instance, in a business continuity situation, data and systems for ‘tier 3’ and ‘tier 4’ services, which may otherwise require a particular level of online or nearline recoverability during normal operations, might be disregarded entirely until full service levels are restored.

Not all data may require all three of the above, and even if data does, unless you’re in a hyperconverged or converged environment, it’s quite possible if you’re a backup administrator, you only need to consider some of the above, with other aspects being undertaken by storage teams, etc.

Now you can plan the recovery service (and conclusion)

And because you’ve gathered the answers to the above, planning and implementing the recovery service is now the easy bit! Trust me on this – working out what a recovery service should look like for the business is when you’ve gathered the above information is a fraction of the effort compared to when you haven’t. Again: “Measure twice, cut once.”

If you want more in-depth information on above, check out chapters in my book such as “Contextualizing Data Protection”, “Data Life Cycle”, “Business Continuity”, and “Data Discovery” – not to mention the specific chapters on protection methods such as backup and recovery, replication, snapshots, continuous data protection, etc.

The designing of backup environments

 Architecture, Backup theory  Comments Off on The designing of backup environments
Feb 072012

The cockatrice was a legendary beast that was a two-legged dragon, with the head of a rooster that could, amongst other things, turn people to stone with a glance. So it was somewhat to a basilisk, but a whole lot uglier and looked like it had been designed by a committee.

You may be surprised to know that there are cockatrice backup environments out there. Such an environment can be just as ugly as the mythical cockatrice, and just as dangerous, turning even a hardened backup expert to stone as he or she tries to sort through the “what-abouts?”, the “where-ares?” and the “who-does?”

These environments are typically quite organic, and have grown and developed over years, usually with multiple staff having been involved and/or responsible, but no one staff member having had sufficient ownership (or longevity) to establish a single unifying factor within the environment. That in itself would be challenging enough, but to really make the backup environment a cockatrice, there’ll also be a lack of documentation.

In such environments, it’s quite possible that the environment is largely acting like a backup system, but through a combination of sheer luck and a certain level of procedural adherence, typically by operators who have remained in the environment for long enough. These are the systems for which, when the question “But why do you do X?”, the answer is simply, “Because we’ve always done X.”

In this sort of system, new technologies have typically just been tacked on, sometimes shoe-horned into “pretending” they work just as the old systems, and sometimes not used at their peak efficiency because of that general reluctance to change such systems engender. (A classic example for instance, can be seen where a deduplication system is tacked onto an existing backup environment, but is treated like a standard VTL or a standard backup-to-disk region, without any consideration for the particularities involved in using deduplication storage.)

The good news is, these environments can be fixed, and turned into true backup systems. To do so, there needs to be four decisions made:

  1. To embrace change. The first essential step is to eliminate the “it’s always been done this way before” mentality. This doesn’t allow for progress, or change, at all, and if there’s one common factor in any successful business, it’s the ability to change. This is not just representative of the business itself, but for each component of the business – and that includes backup.
  2. To assign ownership. A backup system requires both a technical owner and a management owner. Ideally, the technical owner will be the Data Protection Advocate for the company or business group, and the management owner will be both an individual, and the Information Protection Advisory Council. (See here.)
  3. To document. The first step to pulling order out of chaos (or even general disarray and disconnectedness) is to start documenting the environment. “Document! Document! Document!”, you might hear me cry as I write this line – and you wouldn’t be too far wrong. Document the system configuration. Document the rebuild process. Document the backup and recovery processes. Sometimes this documentation will be reference to external materials, but a good chunk of it will be material that your staff have to develop themselves.
  4. To plan. Organic growth is fine. Uncontrolled organic or haphazard growth is not. You need to develop a plan for the backup environment. This will be possible once the above aspects have been tackled, but two key parts to that plan should be:
    • How long will the system, in its current form, continue to service our requirements?
    • What are some technologies we should be starting to evaluate now, or at least stay abreast of, for consideration when the system has to be updated?

With those four decisions made, and implemented, the environment can be transfigured from a hodge-podge of technologies with no real unifying principle other than conformity to prior usage patterns into a collection of synergistic tools working seamlessly to optimise the data backup and recovery operations of the company.

%d bloggers like this: