Basics: Planning A Recovery Service

Jan 30 2018


In Data Protection: Ensuring Data Availability, I talk quite a lot about what you need to understand and plan as part of a data protection environment. I’m often reminded of the old saying from clothing and carpentry – “measure twice, cut once”. The lesson in that statement, of course, is that rushing headlong into something may make your work more problematic. Taking the time to plan properly, though, can in a lot of instances (and data protection is one such instance) make the entire process easier. This post isn’t meant to be a replacement for the various planning chapters in my book – but I’m sure it’ll have some useful tips regardless.

We don’t back up just as something to do; in fact, we don’t protect data just as something to do, either. We protect data both to shield our applications and services (and therefore our businesses) from failures, and to ensure we can recover that data if necessary. So with that in mind, what are some essential activities in planning a recovery service?


First: Do you know what the data is?

Data classification isn’t something done during a data protection cycle. Maybe one day it will be when AI and machine learning are sufficiently advanced; in the interim though it requires input from people – IT, the business, and so on. Of course, there’s nothing physically preventing you from planning and implementing a recovery service without performing data classification; I’d go so far as to suggest that an easy majority of businesses do exactly that. That doesn’t mean it’s an ideal approach though.

Data classification is all about understanding the purpose of the data, who cares about it, how it is used, and so on. It’s a collection of seemingly innocuous yet actually highly important questions. It’s something I cover quite a bit in my book, and for the very good reason that I honestly believe a recovery service can be made simpler, cheaper and more efficient if it’s complemented by a data classification process within the organisation.

Second: Does the data need to exist?

That’s right – does it need to exist? This is another essential but oft-overlooked part of achieving a cheaper, simpler and more efficient recovery service: data lifecycle management. It matters because every 1TB you can eliminate from your primary storage systems is, for the average business at least, going to yield anywhere between 10 and 30TB of savings in protection storage (RAID, replication, snapshots, backup and recovery, long term recovery, etc.). While for some businesses that number may be smaller, for the majority of mid-sized and larger businesses that 10-30TB saving is likely to go much, much higher – particularly as the criticality of the data increases.
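To put rough numbers on that rule of thumb, here is a minimal sketch in Python. The function name and its parameters are purely illustrative; the 10x and 30x multipliers are simply the range quoted above, and your own environment may sit well outside it.

```python
def protection_storage_saved(primary_tb_removed, low_multiplier=10, high_multiplier=30):
    """Return the (low, high) estimate, in TB, of protection storage saved.

    The default multipliers are the rule-of-thumb range from the text,
    not measured values for any particular environment.
    """
    if primary_tb_removed < 0:
        raise ValueError("primary_tb_removed must be non-negative")
    return (primary_tb_removed * low_multiplier,
            primary_tb_removed * high_multiplier)


# Example: pruning 5TB of stale data from primary storage.
low, high = protection_storage_saved(5)
print(f"Removing 5TB of primary data could free roughly {low}-{high}TB of protection storage")
```

If you know how many copies your own replication, snapshot and backup policies generate, substitute those figures for the defaults.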

Without a data lifecycle policy, bad things happen over time:

  • Keeping data becomes habitual rather than based on actual need
  • As ‘owners’ of data disappear (e.g., change roles, leave the company, etc.), reluctance to delete, prune or manage the data tends to increase
  • Apathy or intransigence towards developing a data lifecycle programme increases.

Businesses that avoid data classification and data lifecycle condemn themselves to the torment of Sisyphus – constantly trying to roll a boulder up a hill only to have it fall back down again before they get to the top. This manifests in many ways, of course, but in designing, acquiring and managing a data recovery service it usually hits the hardest.

Third: Does the data need to be protected?

I remain a firm believer that it’s always better to back up too much data than not enough. But that’s a default, catch-all position rather than a blanket rule for the business. Part of the value of data classification and data lifecycle is that they help you determine whether you need to enact specific (or any) data protection models for a dataset. It may be test database instances that can be recovered at any point from production systems; it might be randomly generated data that has no meaning outside of a very specific use case, or it might be transient data merely flowing from one location to another that does not need to be captured and stored.

Remember the lesson from data lifecycle – every 1TB eliminated from primary storage can eliminate 10-30TB of data from protection storage. The next logical step after that is to be able to accurately answer the question, “do we even need to protect this?”

Fourth: What recovery models are required?

At this point, we’ve not talked about technology. This question gets us a little closer to working out what sort of technology we need, because once we have a fair understanding of the data we need to offer recovery services for, we can start thinking about what types of recovery models will be required.

This will essentially involve determining how recoveries are done for the data, such as:

  • Full or image level recoveries?
  • Granular recoveries?
  • Point in time recoveries?

Some data may not need every type of recovery model deployed for it. For some data, granular recoverability is as important as complete recoverability; for other types of data, it could be that the only way to recover it is image/full – wherein granular recoveries would simply leave data corrupted or useless. Does all data require point in time recovery? Much will, but some may not.

Another recovery model question to consider, of course, is how much users will be involved in recoveries. Self-service for admins? Self-service for end-users? All operator run? Chances are, of course, it’ll be a mix depending on those previous recovery model questions (e.g., you might allow self-service individual email recovery, but full Exchange recovery is not going to be an end-user initiated task).

Fifth: What SLOs/SLAs are required?

Regardless of whether your business has Service Level Objectives (SLOs) or Service Level Agreements (SLAs), you may have to meet a variety of them depending on the nature of the failure, the criticality and age of the data, and so on. (For the rest of this section, I’ll use ‘SLA’ as a generic term for both SLA and SLO.) In fact, there’ll be up to three different categories of SLAs you have to meet:

  • Online: These types of SLAs are for immediate or near-immediate recoverability from failure; they’re meant to keep the data online rather than having to seek to retrieve it from a copy. This will cover options such as continuous replication (e.g., fully mirrored storage arrays), continuous data protection (CDP), as well as more regular replication and snapshot options.
  • Nearline: This is where backup and recovery, archive, and long term retention (e.g., compliance retention of backups/archives) come into play. Systems in this area are designed to retrieve the data from a copy (or in the case of archive, a tiered, alternate platform) when required, as opposed to ensuring the original copy remains continuously, or nearly continuously, available.
  • Disaster: These are your “the chips are down” SLAs, which’ll fall into business continuity and/or isolated recovery. Particularly in the event of business continuity, they may overlap with either online or nearline SLAs – but they can also diverge quite a lot. (For instance, in a business continuity situation, data and systems for ‘tier 3’ and ‘tier 4’ services, which may otherwise require a particular level of online or nearline recoverability during normal operations, might be disregarded entirely until full service levels are restored.)

Not all data may require all three of the above. Even if it does, unless you’re in a hyperconverged or converged environment, it’s quite possible that as a backup administrator you only need to consider some of the above, with other aspects being undertaken by storage teams, etc.
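One way to keep track of which SLA categories apply to which data, and who has to meet them, is a simple lookup. The sketch below is only an illustration: the dataset names and the split of categories between teams are hypothetical, and your organisation may divide the work quite differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class SlaCategory(Enum):
    ONLINE = "online"      # replication, CDP, snapshots
    NEARLINE = "nearline"  # backup and recovery, archive, long term retention
    DISASTER = "disaster"  # business continuity, isolated recovery


@dataclass
class Dataset:
    name: str
    required: set = field(default_factory=set)  # SlaCategory values that apply


# Hypothetical ownership; in a converged environment one team may own all three.
OWNERS = {
    SlaCategory.ONLINE: "storage team",
    SlaCategory.NEARLINE: "backup team",
    SlaCategory.DISASTER: "business continuity team",
}


def responsibilities(datasets):
    """Group dataset names by the team that owns each required SLA category."""
    by_team = {}
    for dataset in datasets:
        for category in dataset.required:
            by_team.setdefault(OWNERS[category], []).append(dataset.name)
    return {team: sorted(names) for team, names in by_team.items()}


datasets = [
    Dataset("email", {SlaCategory.ONLINE, SlaCategory.NEARLINE, SlaCategory.DISASTER}),
    Dataset("test-db", {SlaCategory.NEARLINE}),
]
for team, names in sorted(responsibilities(datasets).items()):
    print(team, "->", ", ".join(names))
```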

Now you can plan the recovery service (and conclusion)

And because you’ve gathered the answers to the above, planning and implementing the recovery service is now the easy bit! Trust me on this – working out what a recovery service should look like for the business when you’ve gathered the above information is a fraction of the effort compared to when you haven’t. Again: “Measure twice, cut once.”

If you want more in-depth information on the above, check out chapters in my book such as “Contextualizing Data Protection”, “Data Life Cycle”, “Business Continuity”, and “Data Discovery” – not to mention the specific chapters on protection methods such as backup and recovery, replication, snapshots, continuous data protection, etc.

Data protection lessons from a tomato

Jul 25 2013


Data protection lessons from a tomato? Have I gone mad?

Bear with me.

DIKW Model

If you’ve done any ITIL training, the above diagram will look familiar to you. Rather unimaginatively, it’s called the DIKW model:

Data > Information > Knowledge > Wisdom

A simple, practical example of what this diagram/model means is the following:

  • Data – Something is red, and round.
  • Information – It’s a tomato.
  • Knowledge – Tomato is a fruit.
  • Wisdom – You don’t put tomato in a fruit salad.

That’s about as complex as DIKW gets. However, being a rather simple concept, it means it can be used in quite a few areas.

When it comes to data protection, its purpose is obvious: the criticality of the data to business wisdom will have a direct impact on the level of protection you need to apply to it.

In this case, I’m expanding the definition of wisdom a little. According to my Apple dashboard dictionary, wisdom is:

the quality of having experience, knowledge, and good judgement; the quality of being wise

Further, we can talk about wisdom in terms of accumulated experience:

the body of knowledge and experience that develops within a specified society or period.

So corporate wisdom is about having the experience and knowledge required to act with good judgement, and represents the sum of the knowledge and experience a corporation has built up over time.

If you think about wisdom in terms of corporate wisdom, then you’ll understand my point. For instance, a key database for a company – or the email system – represents a tangible chunk of corporate wisdom. Core fileservers will also be pretty far up the scale. It’s unlikely, on the other hand (in a business with appropriate storage policies) that the files on a regular end-user’s desktop or laptop will go much beyond information in the DIKW scale.

Of course, there are always exceptions. I’ll get to that in a moment.

What this comes back to pretty quickly is the need for Information Lifecycle Protection. End users and the business overall are typically not interested in data – they’re interested in information. They don’t care, as such, about the backup of /u01/app/oracle/data/CORPTAX/data01.dbf – they care about the corporate tax database. That, of course, means that the IT group and the business need to build service level agreements around business functions, not servers and storage. As ITIL teaches, the agreements about networks, storage, servers, etc., come in the form of operational level agreements between the segments of IT.

Ironically, years before studying ITIL, it’s something I covered in my book in the notion of establishing system dependency maps:

System Maps

(In the diagram, the number in parentheses beside a server or function is its reference number; D:X means that it depends on the nominated referenced server/function X.)
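A dependency map of this kind can also be written down as a small graph, which makes it straightforward to derive an order in which systems would need to come back. The following is a minimal sketch in Python using the standard library's `graphlib` module (Python 3.9 or later). The system names are hypothetical and are not taken from the diagram; each entry plays the role of a "D:X" annotation.

```python
from graphlib import TopologicalSorter

# Hypothetical system map: each key depends on every system in its set,
# mirroring the "D:X" annotations described above.
depends_on = {
    "dns": set(),
    "auth": {"dns"},
    "database": {"dns", "auth"},
    "intranet": {"database", "auth"},
}


def recovery_order(dependencies):
    """Return systems ordered so each one appears after everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def impacted_by(failed, dependencies):
    """Return every system that directly or indirectly depends on `failed`."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for system, needs in dependencies.items():
            if system not in impacted and (failed in needs or needs & impacted):
                impacted.add(system)
                changed = True
    return impacted


print(recovery_order(depends_on))
print(sorted(impacted_by("auth", depends_on)))
```

`TopologicalSorter` raises `graphlib.CycleError` if two systems depend on each other, which is itself a useful thing to discover while building the map.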

What all this boils down to is the criticality of one particular activity when preparing an Information Lifecycle Protection system within an organisation: Data classification. (That of course is where you should catch any of those exceptions I was talking about before.)

In order to properly back something up with the appropriate level of protection and urgency, you need to know what it is.

Or, as Stephen Manley said the other day:

OH at Starbucks – 3 page essay ending with ‘I now have 5 pairs of pants, not 2. That’s 3 more.’ Some data may not need to be protected.

Some data may not need to be protected. Couldn’t have said it better myself. Of course, I do also say that it’s better to back up a little bit too much data than not enough, but that’s not something you should see as carte blanche to just back up everything in your environment at all times, regardless of what it is.

The thing about data classification is that most companies do it without first finding all their data. The first step, and possibly the hardest, is becoming aware of the data distribution within the enterprise. If you want to skip reading the post linked to in the previous sentence, here’s the key information from it:

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote [a] previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data which is subsequently classified out of grey data state into a state where the data is known to have value, but is neither managed nor protected, because it can be recreated. It could be that the decision is made that the cost (in time) of recreating the data is lower than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but all the “bits” (no pun intended) that are left over, which are neither data, grey data nor utility data. In essence, this is irrelevant data, which someone or some group may be keeping for unnecessary reasons, and which in actual fact should be considered eligible for either deletion or archival and deletion.
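The categories above can be read as a small decision sequence. Here is a minimal sketch of that reading in Python; the flag names are my own shorthand for this illustration, and real classification involves people and judgement rather than four booleans.

```python
from enum import Enum


class DataState(Enum):
    DATA = "data"        # known about, managed and protected
    DARK = "dark"        # not yet accounted for
    GREY = "grey"        # discovered, but no decision made yet
    UTILITY = "utility"  # has value, but is cheaper to recreate than to protect
    NOISE = "noise"      # irrelevant; candidate for deletion


def classify(known, decided=False, has_value=False, cheaper_to_recreate=False):
    """Map simplified yes/no answers onto the states described above.

    All four flags are assumptions made for this sketch.
    """
    if not known:
        return DataState.DARK
    if not decided:
        return DataState.GREY
    if not has_value:
        return DataState.NOISE
    if cheaper_to_recreate:
        return DataState.UTILITY
    return DataState.DATA


# A newly discovered share nobody has reviewed yet is grey data.
print(classify(known=True).value)
```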

Once you’ve found your data, you can classify it. What’s structured and unstructured? What’s the criticality of the data? (I.e., what level of business wisdom does it relate to?)

But even then, you’re not quite ready to determine what your information lifecycle protection policy will be for the data – well, not until you have a data lifecycle policy, which at its simplest, looks something like this:

Data Lifecycle


Of course, there’s a lot of time and a lot of decisions bunched up in that diagram, but the lifecycle of data within an organisation is actually that simple at the conceptual level. Or rather, it should be. If you want to read more about data lifecycle, see the intro piece – there are several accompanying pieces listed at the bottom of that article.

When considered from a backup perspective, the end goal of a data lifecycle policy though is simple:

Back up only that which needs to be backed up.

If data can be deleted, delete it.

If data can be archived, archive it.

The logical implication of course is – if you can’t classify it, if you can’t determine its criticality, then the core backup mantra, “always better to back up a little bit more than not enough” takes precedence, and you should be working out how to back it up. Obviously, as a fall back rule, it works, but it’s best to design your overall environment and data policies to avoid it.
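Those rules, including the fallback, can be sketched as a single function. This is only an illustration: the parameter names are hypothetical, and checking deletion before archival is my assumption about the order in which you would ask the questions.

```python
def lifecycle_action(classified, can_delete=False, can_archive=False, needs_backup=True):
    """Return the action suggested by the rules above for one dataset.

    Unclassified data falls back to "backup", following the mantra that it
    is better to back up a little too much than not enough.
    """
    if not classified:
        return "backup"   # fallback rule: criticality unknown, so protect it
    if can_delete:
        return "delete"
    if can_archive:
        return "archive"
    return "backup" if needs_backup else "none"


print(lifecycle_action(classified=False))                    # backup
print(lifecycle_action(classified=True, can_archive=True))   # archive
```

The more often the first branch is taken, the more the environment is relying on the fallback rather than on an actual data lifecycle policy.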

So to summarise:

  1. Following the DIKW model, the closer data is to representing corporate wisdom, the more critical its information lifecycle protection requirements will be.
  2. In order to determine that criticality you first have to find the data within your environment.
  3. Once you’ve found the data in your environment, you have to classify it.
  4. Once you’ve classified it, you can build a data lifecycle policy for it.
  5. And then you can configure the appropriate information lifecycle protection for it.

If you think back to EMC’s work towards mitigating the effects of accidental architectures, you’ll see where I was coming from in talking about the importance of procedural change to arrest further accidental architectures. It’s a classic ER technique – identify, triage and heal.

And we can learn all this from a tomato, sliced and salted with the DIKW model.
