In “Tape and dedupe: So not happening • The Register”, Chris Mellor asks:

Why haven’t more vendors followed CommVault in putting deduped data on tape? Is it technically too hard?

There’s a good reason for this – it’s pretty nutty. Whether we wish it or not, tape is designed for large scale high speed sequential access. Dedupe requires high speed random access in order to rehydrate. Some time ago I wrote a rebuttal to Curtis Preston’s overly generous appraisal of CommVault’s dedupe to tape strategy, and I still stand by every word I wrote there.
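To put rough numbers on that mismatch, here is a minimal Python sketch. The function name `rehydrate_seconds` and every figure in it (chunk size, seek times, streaming rate) are assumptions picked for illustration, not measurements of any tape drive or dedupe product. It models the worst case, where every chunk of a deduped file needs its own seek before it can be read.

```python
def rehydrate_seconds(file_bytes, chunk_bytes, seek_seconds, stream_bytes_per_sec):
    """Estimate the time to reassemble a file whose chunks are scattered on the medium.

    Worst-case assumption: each chunk costs one seek plus its share of streaming time.
    """
    chunks = -(-file_bytes // chunk_bytes)  # ceiling division
    return chunks * seek_seconds + file_bytes / stream_bytes_per_sec


# Assumed, illustrative figures only.
file_size = 10**9          # a 1 GB file
chunk_size = 128 * 1024    # 128 KB dedupe chunks

disk = rehydrate_seconds(file_size, chunk_size, seek_seconds=0.01, stream_bytes_per_sec=100e6)
tape = rehydrate_seconds(file_size, chunk_size, seek_seconds=30.0, stream_bytes_per_sec=100e6)

print(f"disk: {disk / 60:.1f} minutes")
print(f"tape: {tape / 3600:.1f} hours")
```

With the streaming rate held equal, the seek term is what separates the two results, and that term is exactly what tape is bad at.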

To be fair, Chris quotes someone who puts the argument very succinctly:

Steve Mackey, SpectraLogic’s sales veep for Europe and Africa, says: “The issue of dedupe is recovery. You’ve got to recover the whole tape or a set of tapes before you can recover a file. The big users of archive are looking to recover the data. Today I don’t believe dedupe on tape meets the requirements for recovery.”

Dedupe to tape is crazy. Unless we can somehow overcome the sequential access nature of tape, it will stay crazy, too. That’s why tape and dedupe generally isn’t happening.

And I’m glad that’s the case.

 

In a previous article, I discussed how deduplication is one of those technologies that still straddles the gap between bleeding edge and leading edge, and thus needs to be classified as bleeding edge.

Putting aside the bleeding edge/leading edge argument for the moment (though my view there remains the same), a growing concern I have for deduplication is that it’s popping up everywhere in little islands rather than as a fully integrated option.

The net result? Dedupe on primary storage. Rehydrate to access. Modify, then dedupe to save again. Rehydrate for next access. Dedupe for saved changes. Rehydrate to backup. Dedupe the backup. Rehydrate for recovery.

All this dedupe is making me thirsty. Worse, it’s starting to look like a roller-coaster ride, and I always have the same reaction to them – horror, then an urge to throw up a little. The cycle doesn’t even look nice:

[Figure: Dedupe/Rehydrate Cycle]

So, what’s the solution?

There’s certainly no easy solution – and currently no integrated solution. Not without some serious consideration to standards. Let’s accept, for the moment, that there’s no real option to keep in-OS/RAM data deduplicated. (I.e., at the per-operating system level – maybe there would be at a cross-OS virtualisation level within the hypervisor, but we’re not really there yet.)

One obvious factor that springs to mind is that the first, best approach to some normalisation would be to come up with a technique to transfer deduped primary storage in its deduped format to a deduped backup storage. There are already techniques for synchronising deduplicated data (e.g., when replicating between say, two Data Domain hosts). Why rehydrate when the next step is going to be a new dedupe algorithm being applied, for instance?
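As a rough illustration of what moving deduped data "as is" could look like, here is a minimal Python sketch. It is not how Data Domain replication or any other product works; the names `block_id`, `missing_blocks` and `replicate`, and the idea of modelling each store as a dictionary keyed by SHA-256 hash, are assumptions made for the example. The point is only that the source never rebuilds the full data: it sends the blocks the target does not already hold.

```python
import hashlib


def block_id(block: bytes) -> str:
    """Identify a block by its SHA-256 hash."""
    return hashlib.sha256(block).hexdigest()


def missing_blocks(recipe, target_store):
    """Return the distinct hashes in the recipe that the target does not hold yet."""
    return [digest for digest in dict.fromkeys(recipe) if digest not in target_store]


def replicate(recipe, source_store, target_store):
    """Copy only the blocks the target lacks; return how many blocks were sent."""
    to_send = missing_blocks(recipe, target_store)
    for digest in to_send:
        target_store[digest] = source_store[digest]
    return len(to_send)


# Hypothetical deduped primary store and a backup store that already has one block.
blocks = [b"alpha", b"beta", b"alpha", b"gamma"]
primary = {block_id(b): b for b in blocks}
recipe = [block_id(b) for b in blocks]          # ordered list of hashes for one file
backup = {block_id(b"alpha"): b"alpha"}

sent = replicate(recipe, primary, backup)
print(f"{sent} of {len(recipe)} referenced blocks had to be transferred")
```

The file's recipe (the ordered list of hashes) travels alongside the blocks, so the target can rebuild the data later without the source ever having rehydrated it.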

If we look at NetWorker, there are a number of places where dedupe can happen, either as part of the backup cycle, or a larger strategy:

  • Primary storage deduplication via say, a Data Domain storage box or something along those lines.
  • Archive/single instance deduplication for less frequently accessed files (say, Centera).
  • Source based dedupe backup (via an Avamar node).
  • Dedupe VTL (Data Domain or the DL4000 with a deduplication add-on).

(No, I won’t put dedupe backup to disk there. Not until ADV_FILE starts working better.)

Within the EMC product kit, there’s a lot of chance for interoperability of deduplicated data without the need to rehydrate. If anything, EMC is one of the few vendors out there (HP and IBM are the only others that spring to mind) that offer reasonably complete verticals on storage, running from the base array to the backup solution.

Based on EMC’s strong focus on deduplication with the acquisition of both Avamar and Data Domain, it seems a distinct possibility that this is at least a part of their planning. Shifting deduplicated data between disparate products without needing to rehydrate does have potential to be a game changer in terms of how we work with data, but I’ll promise you this: you won’t see this level of integration this year, and possibly not for the next few years. That level of integration is not going to be easy, it’s not going to come quickly, and it’s going to require extreme levels of testing to make sure that it actually works when it is implemented.

So for the time being, we’ll have to continue to put up with deduplication being done in little islands within our IT environments, and continue to ride the deduplication/rehydration roller-coaster. Let’s hope we all don’t get sick before solutions start to appear.

 

I’ve debated for a while whether to do this or not, since it might come across as somewhat twee. I think though that in the same way that “My Very Eager Mate Just Sat Up Near Pluto” works for planets, having an A-Z for backups might help to point out the most important aspects to a backup and recovery system.

So, here goes:

A is for Audit. Your backup system should be able to stand in front of an audit as complete and trustworthy.
B is for Backup. Without backup, you can't have recovery, and without recovery, your business is uninsured.
C is for Change Control. If your backup system isn't integrated into the change control process, neither your backup system nor your change control process works.
D is for DeDupe. You'll be seeing a lot more of it in Backup and Recovery moving forward. My money is on target dedupe being considerably more popular than source dedupe. Why? For the same reason that VTLs are around. Target dedupe = easier dedupe, both for vendors, and for companies with existing solutions to integrate.
E is for Errors, User. The most common reason you'll need to recover is from user errors. Use this to help plan how your backup system will work.
F is for Fast. Every person and their dog seems to have a story about making backups faster. Look instead for the stories about making recovery faster – they're the more important ones.
G is for Growth. Your backup environment should be scoped to handle at least 2 years' growth upon implementation. If it isn't, budgets haven't been established correctly.
H is for Help. Don't try to solve backup/recovery problems in isolation; they're too important to let stew.
I is for Insurance. It's the central purpose of backup, and if you think of it any other way, chances are you're wrong.
J is for Jekyll, not Hyde. When it comes to recovery situations, people should be able to work through them as calmly and cleanly as Dr Jekyll might – not storm through them like Mr Hyde, flying apart.
K is for Knowledge. Know your system. Know your errors. Know where to look for information. Know your support hotline numbers. Know your averages. Know your performance peaks and your troughs. Know at a glance whether your system is running smoothly or having problems.
L is for Logs. Treasure your logs. Don't throw them away too quickly, and make sure they're backed up too. With access to your logs, you can answer in 3 years' time why a backup from yesterday is proving problematic to recover from.
M is for Magnetic Tape. It's not going away any time soon. Don't kid yourself, you'll still be using it in backup and recovery systems for some time to come.
N is for Napkin. If you can't summarise your backup system on the back of a napkin, it's too complicated. There are no exceptions to this rule.
O is for Order. Backups bring Order to Chaos. Hence, your backup system must be an ordered process, rather than a chaotic and haphazard arrangement of scripts and non-processes.
P is for Procedures; without them, you don't have a backup system at all.
Q is for Query. If you're the backup administrator, you should be constantly prepared for a query about backup success. If you're a manager or system owner, you should feel confident you can get a positive response at any time to a query about backup success.
R is for Recovery, the most important facet of data protection.
S is for SLAs (Service Level Agreements). Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) form the heart of SLAs, and contrary to popular opinion in many circles, SLAs are vital to good design. Having SLAs is the first, most critical step to getting the correct budget for the correct system. Without defined recovery requirements, you can't prioritise activities properly; i.e., you'll have a reactionary environment rather than a proactive environment.
T is for Testing. In fact, T is for Testing, Testing, Testing. If your backup system doesn't include test planning, test procedures and test results, it's not a system at all.
U is for Ululate. It's that sound you make when your only copy of a backup is destroyed by a failing tape drive or failing tape because you didn't clone it, and you know that recovery failure is not an option.
V is for VTL. Whether you like the need for them or not, they're not going away any time soon.
W is for Windows. No, not that Windows. Backup Windows. Clone Windows. Recovery Windows. Design your system first to meet your recovery windows, then your clone windows, then and only then, your backup windows. If you don't do it in that order, your system isn't designed for recovery.
X is for X-Ray. If you can't X-Ray your backup status, drill down and see what happened, you should assume the worst. (OK, I'm grasping there, but what do you eXpect?)
Y is for Yes. Yes you should be backing up. Yes you should be checking the backup status. Yes you should be able to recover.
Z is for Zero Error Policy. If you don't run your backup system with a zero error policy, you're not running it properly, and it's not actually a system.

And there we have it. Maybe neither short, nor succinct, yet hopefully useful nonetheless.

 

If you think you can’t go a day without hearing something about dedupe, you’re probably right. Whether it’s every vendor arguing the case that their dedupe offerings are the best, or tech journalism reporting on it, or pundits explaining why you need it and why your infrastructure will just die without it, it seems that it’s equally the topic of the year along with The Cloud.

There is (from some at least) an argument that backup systems should be “out there” in terms of innovation; I question that in as much as I believe that the term bleeding edge is there for a reason – it’s much sharper, it’s prone to accidents, and if you have an accident at the bleeding edge level, well, you’ll bleed.

So, I always argue that there’s nothing wrong with leading edge in backup systems (so long as it is warranted), but bleeding edge is a far riskier proposition – not just in terms of potentially wasted investment, but due to the side effect of that wasted investment. If a product is outright bleeding edge then having it involved in data protection is a particularly dangerous proposition. (Only when technology is a mix of bleeding edge and leading edge can you at least start to make the argument that it should be considered in the data protection sphere.)

Personally I like the definitions of Bleeding Edge and Leading Edge in the article at Wikipedia on Technology Lifecycle. To quote:

Bleeding edge – any technology that shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus. Early adopters may win big, or may be stuck with a white elephant.

Leading edge – a technology that has proven itself in the marketplace but is still new enough that it may be difficult to find knowledgeable personnel to implement or support it.

So the question is – is deduplication leading edge, or is it still bleeding edge?

To understand the answer, we first have to consider that there are actually 5 classified stages to the technology lifecycle. These are:

  1. Bleeding edge.
  2. Leading edge.
  3. State of the art.
  4. Dated.
  5. Obsolete.

What we have to consider is – what happens when a technology exhibits attributes of more than one classification or stage of technology? To me, working in the conservative field of data protection, I think there’s only one answer: it should be classified by the “least mature” or “most dangerous” stage that it exhibits attributes for.

Thus, deduplication is still bleeding edge.
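That classification rule is simple enough to write down. The sketch below is only my own restatement of it in Python; the name `conservative_stage` is made up for the example, and "least mature" is taken to mean "earliest in the five-stage list".

```python
# The five lifecycle stages, ordered from least mature to most mature.
STAGES = ["bleeding edge", "leading edge", "state of the art", "dated", "obsolete"]


def conservative_stage(exhibited):
    """Return the least mature stage among those a technology shows attributes of."""
    return min(exhibited, key=STAGES.index)


# Deduplication shows attributes of both of the first two stages.
print(conservative_stage({"leading edge", "bleeding edge"}))  # prints "bleeding edge"
```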

Why dedupe is still bleeding edge

Clearly there are attributes of deduplication which are leading edge. It has, in field deployments, proven itself to be valuable in particular instances.

However, there are attributes of deduplication which are definitely still bleeding edge. In particular, the distinction for bleeding edge (to again quote from the Wikipedia article on Technology Lifecycle) is that it:

…shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus.

(My emphasis added.)

Clearly in at least some areas, deduplication has demonstrated its value – my rationale for it still being bleeding edge though is the second (and equally important) attribute: I’m not convinced that deduplication has sufficiently settled down into any kind of consensus.

Within deduplication, you can:

  • Dedupe primary data (less frequent, but talk is growing about this)
  • Dedupe virtualised systems
  • Dedupe archive/HSM systems (whether literally, or via single instance storage, or a combination thereof)
  • Dedupe NAS
  • For backup:
    • Do source based dedupe:
      • At the file level
      • At a fixed block level
      • At a variable block level
    • Do target based dedupe:
      • Post-backup, maintaining two pools of storage, one deduplicated, one normal. Most frequently accessed data is typically “hydrated”, whereas the deduped storage is longer term/less frequently accessed data.
      • Inline (at ingest), maintaining only one deduplicated pool of storage
    • For long term storage of deduplicated backups:
      • Replicate, maintaining two deduplicated systems
      • Transfer out to tape, usually via rehydration (the slightly better term for “undeduplicating”)
      • Transfer deduped data out to tape “as is”

Does this look like any real consensus to you?
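Even one line of that list hides real design differences. The difference between fixed block and variable block dedupe, for instance, can be sketched in a few lines of Python. This is a toy, not any vendor's algorithm: the function names, the 64-byte block size, the 16-byte window and the use of `zlib.adler32` to pick boundaries are all assumptions chosen to keep the example short. It inserts a single byte at the front of some data and counts how many blocks still match afterwards.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=64):
    """Cut data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, divisor=64):
    """Cut wherever a checksum of the last `window` bytes hits a chosen value,
    so boundaries follow the content rather than absolute offsets."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.adler32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def shared(chunks_a, chunks_b):
    """Count distinct chunks (by SHA-256) that appear in both lists."""
    ids_a = {hashlib.sha256(c).digest() for c in chunks_a}
    ids_b = {hashlib.sha256(c).digest() for c in chunks_b}
    return len(ids_a & ids_b)


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + data  # one byte inserted at the front

print("fixed blocks shared:   ", shared(fixed_chunks(data), fixed_chunks(shifted)))
print("variable blocks shared:", shared(variable_chunks(data), variable_chunks(shifted)))
```

With fixed blocks, the one-byte insert shifts every later block boundary, so almost nothing matches. With content-defined boundaries, everything after the first cut point lines up again. Two products making different choices here will not recognise each other's blocks.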

One comfort in particular that we can take from all these disparate dedupe options is that clearly there’s a lot of innovation going on. The fundamental basics behind dedupe as well are tried and trusted – we use them every time we compress a file or bunch of files. It’s just scanning for common blocks and reducing the data to the smallest possible amount.
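For anyone who hasn't looked under the hood, the basic idea fits in a few lines. The following is a minimal sketch in Python, assuming fixed 4 KB blocks and SHA-256 as the block identifier; the function names `dedupe` and `rehydrate` are mine, and real products are considerably more sophisticated.

```python
import hashlib


def dedupe(data: bytes, store: dict, block_size: int = 4096) -> list:
    """Split data into fixed-size blocks, store each unique block once,
    and return the ordered list of block hashes needed to rebuild it."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of each block
        recipe.append(digest)
    return recipe


def rehydrate(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by looking up every block in the recipe."""
    return b"".join(store[digest] for digest in recipe)


store = {}
original = b"A" * 4096 * 3 + b"B" * 4096  # three identical blocks plus one different
recipe = dedupe(original, store)

print(len(recipe), "blocks referenced,", len(store), "blocks stored")
assert rehydrate(recipe, store) == original
```

Four blocks are referenced but only two are stored. Note that `rehydrate` is a series of lookups by hash, which is cheap on disk and is exactly the access pattern that tape handles badly.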

It’s also an intelligent and logical method of moving forward in storage – i.e., we’ve reached a point in storage where both companies that purchase storage, and the vendors that provide it, are moving towards using storage more efficiently rather than just continuing to buy it. This trend started with the development of SAN and NAS, so dedupe is just the logical continuation of those storage centralisation/virtualisation paths. More so, the trend towards more intelligent use of technology is not new – consider even recent changes in products from the CPU manufacturers. Targeting Intel as a prime example, for years their primary development strategy was “fast, faster, fastest.” However, that strategy ended up hitting a brick wall – it doesn’t matter how fast an individual processor is if you actually need to do multiple things at once. Hence multi-core really hit the mainstream. Previously reserved in multi-CPU environments for high end workstations and servers, it’s now common for any new computer to come with multiple cores. (Heck, I have 2 x Quad Core processors in the machine I’m writing this article on. The CPU speeds are technically slower than my lab ESX server, but with multi-core, multi-threading, it smacks the ESX server out of the lab every time on performance. It’s more intelligent use of the resources.)

So dedupe is about shifting away from big, bigger, biggest storage to smart, smarter and smartest storage.

We’re certainly not at smartest yet.

We’re probably not even at smarter yet.

As an overall implementation strategy, deduplication is practically infantile in terms of actual industry-state vs potential industry-state. You can do it on your primary production data, or your virtualised systems or your archived data or your secondary NAS data or your backups, but so far there have been few tangible, usable advances towards being able to use it throughout your entire data lifecycle in a way which is compatible and transparent regardless of vendor or product in use.

For dedupe to be able to make that leap fully out of bleeding edge territory, it needs to make some inroads into complete data lifecycle deduplication – starting at the primary data level and finishing at backups and archives.

(And even when we can use it through the entire product lifecycle, we’ll still be stuck with working out what to do with it once it’s been generated, for longer term storage. Do we replicate between sites? Do we rehydrate to tape or do we send out the deduped data to tape? Obviously based on recent articles I don’t (yet) have much faith in the notion of writing deduped data to tape.)

If you think that there isn’t a choice for long term storage – that it has to be replication, and dedupe is a “tape killer”, think again. Consider smaller sites with constrained budget, consider sites that can’t afford dedicated disaster recovery systems, and consider sites that want to actually limit their energy impact. (I.e., sites that understand the difference in energy savings between offsite tapes and MAID for long term data storage.)

So should data protection environments implement dedupe?

You might think, based on previous comments, that my response to this is going to be a clear-cut no. That’s not quite correct however. You see, because dedupe falls into both leading edge and bleeding edge, it is something that can be implemented into specific environments, in specific circumstances.

That is, the suitability of dedupe for an environment can be evaluated on a case by case basis, so long as sites are aware that when implementing dedupe they’re not getting the full promise of the technology, but just specific windows on the technology. It may be that companies:

  • Need to reduce their backup windows, in which case source-based dedupe could be one option (among many).
  • Need to reduce their overall primary production data, in which case single instance archive is a likely way to go.
  • Need to keep more data available for recovery in VTLs (or for that matter on disk backup units), in which case target based dedupe is the likely way to go.
  • Want to implement more than one of the above, in which case they will be buying disparate technologies that don’t share common architectures or operational management systems.

I’d be mad if I were to say that dedupe is still too immature for any site to consider – yet I’d charge that anyone who says that every site should go down a dedupe path, and that every site will get fantastic savings from implementing dedupe, is equally mad.

 

Over at Backup Central, Curtis Preston says he’s convinced that dedupe to tape according to the CommVault model is a good idea, in a “crazy good” way rather than a “crazy bad” way. To summarise Curtis’ argument (and thereby establish my understanding of it), the process is:

  1. Day to day recovery of deduped tape backup would be crazy (I agree with this)
  2. Design the system so that you still facilitate most recoveries from dedupe on disk (I have no issue with this)
  3. Periodically effectively stage out the dedupe data to tape (first objection)
  4. Long-term recoveries are done from tape written in dedupe format (holy cow that’s insane!)

So, let’s look at why I think this is “crazy bad” by examining each point.

Point one – day to day recovery of deduped tape backup would be crazy

Fully agreed. I’d liken recovery from deduped data on tape to recovery of highly fragmented files from a block level backup. Block level backup products (e.g., EMC’s SnapImage) allow you to bypass the inefficiencies of the filesystem on dense structures to do a block by block backup. This can deliver fantastic time savings. For. Backup.

For recovery, file level reconstruction from block level backups can suck in a terribly horrendous way. File level reconstruction from block level backups requires recovery of the required blocks into a cache, and then the files are put back together. If your files are heavily fragmented (which is often the case on dense filesystems), the number of reads from tape required – and the amount of seeking required – is very high. Real world example: 400 GB dense filesystem (about 40,000,000 files) had full backups reduced from 15 hours to 4 hours using block level backup. Recovery of the entire filesystem took less than 4 hours – recovery of a 40 GB directory took 12 hours. Having a very large cache is one way to get around this, but that starts to get costly (and in my experience is frequently poached).
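Working the figures from that example through makes the gap plain. The short Python calculation below uses only the numbers quoted above, with two simplifications of mine: "less than 4 hours" is treated as 4 hours, and 1 GB is treated as 1024 MB. The helper name `mb_per_sec` is just for the example.

```python
MB_PER_GB = 1024  # simplification: treat 1 GB as 1024 MB


def mb_per_sec(gigabytes, hours):
    """Average throughput in MB/s for moving `gigabytes` in `hours`."""
    return gigabytes * MB_PER_GB / (hours * 3600)


cases = {
    "file level full backup (400 GB in 15 h)": mb_per_sec(400, 15),
    "block level full backup (400 GB in 4 h)": mb_per_sec(400, 4),
    "directory recovery (40 GB in 12 h)": mb_per_sec(40, 12),
}

for label, rate in cases.items():
    print(f"{label}: {rate:.2f} MB/s")

slowdown = mb_per_sec(400, 4) / mb_per_sec(40, 12)
print(f"directory recovery ran about {slowdown:.0f}x slower than the block level full")
```

The directory recovery moved a tenth of the data in three times the time, which works out to roughly a thirtyfold drop in effective throughput.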

Recovery from deduped data on tape will very likely suck just as badly.

Point two – design the system so that you facilitate most recoveries from dedupe on disk

Again, fully agreed. So far I’m in complete agreement with Curtis and CommVault. This point can be said of any backup design – design your system so that the most frequently performed recoveries are done from the fastest backup medium.

Point three – Periodically effectively stage out all dedupe data to tape

This is the crazy part, and not crazy good, but out and out crazy bad. To quote Curtis on this:

If you’re going to dedupe to tape, you first have to dedupe to disk.  You create what they call a silo on disk, which is a full backup and a set of deduped incrementals based on (and deduped against) that full backup. The retention on that silo should be long enough to satisfy most of your operational restore requests.  (Typically that’s 30 days, but it could be longer in your environment.)

What’s so crazy-bad about this?

Now, I’ll profess that I don’t know for sure which way this is being done, but it reads that new full backups are generated periodically in the dedupe environment, allowing the previous dependency chains of fulls + incrementals to be transferred out to tape. (Based on my reading of the CommVault marketing documentation, which refers to “reducing” the number of fulls required for retention cycles, this appears to be an accurate assessment.)

So this means that every X days (whatever your period-between-fulls is going to be) you have to do new fulls. Now while this isn’t so much of an issue in regular backups, in dedupe backups it’s a known fact that the initial full backups are hideously slow. This can be worn by most organisations when it’s a once-off. Every month? Even every 3 months or 6 months? Far less likely.

Point four – Long-term recoveries are done from tape written in dedupe format

Obviously some of my objections to this have already been expressed in my comments for point one, but to continue with my objections, let’s look at what Curtis has to say on this point as well:

But I also agree that if I typically do all my restores from within the last 30 days, and someone asks me for a 31 day-old file, it’s generally going to be the type of restore where the fact that it might take several minutes to complete is not going to be a huge deal.  (In the case that you did need to do a large restore from a deduped tape set, you could actually bring it back in to disk in its entirety before you initiate the restore.)

Now, I agree that recovery of longer term backups can be done from slower media in most instances.

There’s a difference between “slower media” and “a snail just overtook our data recovery”.

In the first case, I don’t believe that recovery from deduped data on tape will be in the order of “several minutes” … I think this would turn out to be a highly optimistic rather than terribly realistic time-frame. I would need to see a large number of real world instances of short recovery times to really believe this will be in an order of “several minutes”. Yes, I’m going on a gut feeling, but I feel it’s somewhat justified.

In the second case … “you could actually bring it back in to disk in its entirety” … how much storage do you want to be using here? If we’re talking bringing back the entire “silo”, that’s a lot of storage to bring back  – I’d suggest it’s going to be comparable to but orders of magnitude worse than say, recovering a 1TB virtual machine fileserver to a separate location in order to pull out a 100KB Excel spreadsheet. Let’s be accurate about this: recovering the entire silo would mean recovering all deduped backups – most notably a full of your entire environment.

If we’re talking about recovering just portions of the data on tape, then again, it’s going to be like the file-level recovery from block-level backup issue previously described, and we’ll be back to square one.

In Summary

I’ve got to be entirely blunt here – CommVault’s approach reminds me of the old (crude) expression (made as “G Rated” as possible):

“You can’t polish a poo, but you can roll it in gold dust”.

If the supporting architecture is crazy, it doesn’t matter that it can do something “nifty” – particularly if that something “nifty” will result in significantly slower recoveries (even in limited circumstances).

Yes, it’s undoubtedly the case that the CommVault approach will reduce the amount of data stored on tape, which will result in some cost savings. However, penny pinching in backup environments has a tendency to result in recovery impacts – often significant recovery impacts. For example, NetBackup gives “media savings” by not enforcing dependencies. Yes, this can result in saving money here and there on media, but can result in being unable to do complete filesystem recoveries approaching the end of a total retention period, which is plain dumb.

The CommVault approach, while saving some money on tape, will significantly expand recovery times (or require large cache areas and still take a lot of recovery time). Saving money is good. Wasting a little time during longer-term recoveries is likely to be perceived as being OK – until there’s a pressing need. Wasting a lot of time during longer-term recoveries is rarely going to be perceived as being OK.

The other saying that springs to mind is: The road to hell is paved with good intentions.

If I’m correct in my understanding of how the CommVault dedupe-to-tape strategy works based on a review of the CommVault marketing material (typically for any vendor, slim information) and Curtis’ summary, I can only say that their approach is not crazy good as Curtis concludes, but crazy bad.

© 2012 The NetWorker Blog