The one core problem with deduplication

[Edit- 2015: I must get around to writing a refutation of the article below. Keeping it for historical purposes, but I’d now argue I was approaching the problem from a reasonably flawed perspective.]

Don’t get me wrong – I’m quite the fan of deduplication, and not just because it’s really interesting technology. It has the potential to allow a lot more backup data to be kept online for much longer periods of time.

Having more backups immediately available for recovery is undoubtedly great.

I wrote previously about 7 problems with deduplication, but they’re just management problems, not functional problems. Yet there’s one core problem with deduplication: it’s a backup solution.

Deduplication is about backup.

It’s not about recovery.

Target deduplication? If it’s inline, like with Data Domain products, it’s stellar. Source deduplication? It massively reduces the amount of data you have to stream across your network.

When it comes to recovery, though, deduplication isn’t a shining knight. That data has to be rehydrated, and unless you’re doing something really intelligent, such as matching non-corrupt blocks or maintaining massive deduplication caches on the client, you’re going to be rehydrating at the target storage and streaming the full data back across the network.
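
To make the rehydration point concrete, here’s a minimal sketch of what a deduplicating store does, using fixed-size chunking and an in-memory dictionary purely for illustration (real products use far more sophisticated chunking and indexing): backups only write unique chunks, but a restore has to fetch and reassemble every chunk in the recipe, so the full logical size comes back regardless of how well the data deduplicated.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity; real products typically use variable-size chunks

def backup(data: bytes, store: dict) -> list:
    """Deduplicate: store each unique chunk once, return the recipe of chunk hashes."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # only previously unseen chunks are written
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rehydrate: every chunk in the recipe is fetched and reassembled, duplicates included."""
    return b"".join(store[digest] for digest in recipe)

store = {}
original = bytes(CHUNK_SIZE) * 1_000 + b"plus a short unique tail"  # highly redundant sample data
recipe = backup(original, store)
print(f"{len(store)} unique chunks stored for {len(recipe)} logical chunks")
restored = restore(recipe, store)
assert restored == original and len(restored) == len(original)  # the full logical size comes back
```

The savings are all on the write side; the read side still has to reassemble and move every byte.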

That 1TB database at a remote site you’ve been backing up over an ADSL link after initial seeding, thanks to source-based deduplication? How long can you afford to have the recovery take if it’s got to stream back across that ADSL link?
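
To put a rough number on it, assume (purely hypothetically) an 8 Mbps downstream link running flat out, with no protocol overhead or contention:

```python
# Back-of-the-envelope restore time over a constrained WAN link.
# The 8 Mbps figure is an assumption for illustration, not a measurement.
data_bytes = 1 * 10**12              # ~1 TB database to rehydrate and stream back
link_bits_per_second = 8 * 10**6     # assumed 8 Mbps downstream, zero contention
seconds = data_bytes * 8 / link_bits_per_second
print(f"{seconds / 86_400:.1f} days")  # roughly 11.6 days, before any overhead
```

Even if your link is faster, the shape of the problem doesn’t change: once a full rehydrated copy has to cross the WAN, the recovery time is dictated by the pipe, not by the backup product.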

I’m not saying to avoid using deduplication. I think it’s likely to become a standard feature of backup solutions within 5 years. By itself, though, it’s unlikely to speed up your recoveries. In short: if you’re deploying a data deduplication solution, after you’ve done all your sizing tests, sit down and map out which systems may present challenges during recovery from deduplicated storage (hint: it’s almost always going to be the remote ones), and make sure you have a strategy for them. Always have a strategy.

Always have a recovery strategy. After all, if you don’t, you don’t have a backup system. You’ve just got a bunch of backups.


__

PS: Thanks to Siobhán for prodding me on this topic.

4 thoughts on “The one core problem with deduplication”

  1. So what you’re actually saying is that source-based dedupe has its drawbacks when it comes to restoring. And you’re absolutely right.
    Target-based, however, does not have any major drawbacks IMO.

    I don’t know anyone who sells dedupe storage claiming that it improves recovery speed; it’s usually a bit slower than pure disk backup (on the same hardware), but the savings in hardware and power consumption usually justify that.

    1. Actually if you look at the marketing messages for most source dedupe products, you’ll see a very common feature – lots of talk about improvement of backup performance, less so about recovery performance. As per my blog article though, if you take appropriate steps in developing a strategy to mitigate that recovery factor, you’ll be OK with it.

      Target dedupe still does nothing for recovery performance, but it’s less of an issue on that front, because it equally doesn’t do anything to improve backup performance across a network either. 🙂

  2. Great post and you’ve made some great points. I would add that I am aware of some technologies that provide “delta restore awareness” at the block level. That is, only the blocks required for the restore are moved. I performed a POC on an Iron Mountain BaaS service that did this well. I had heard Avamar was going to add that capability, but I’m not sure what came of it. You make a valid point regarding RTO of data over limited bandwidth. In this hypothetical scenario I would recommend a small local Avamar grid or perhaps AVE if the supporting infrastructure is there. That data can then be replicated to another grid offsite.
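
As an aside, for readers wondering what “delta restore awareness” might look like mechanically, here is a hypothetical sketch only, not a description of Avamar’s or Iron Mountain’s actual implementation: the backup side ships per-block hashes, the client checks them against whatever blocks it still holds, and only missing or changed blocks are pulled across the network.

```python
import hashlib

def delta_restore(local_blocks, backup_block_hashes, fetch_block):
    """Rebuild an image while moving only the blocks the client no longer has intact.

    local_blocks:        blocks (possibly stale or damaged) still present on the client
    backup_block_hashes: per-block SHA-256 hex digests sent by the backup side (cheap to ship)
    fetch_block:         callback that retrieves one full block from the backup side
    """
    restored, blocks_fetched = [], 0
    for i, wanted in enumerate(backup_block_hashes):
        have = local_blocks[i] if i < len(local_blocks) else None
        if have is not None and hashlib.sha256(have).hexdigest() == wanted:
            restored.append(have)            # local block matches the backup: reuse it, send nothing
        else:
            restored.append(fetch_block(i))  # missing or changed block: this one crosses the wire
            blocks_fetched += 1
    return b"".join(restored), blocks_fetched
```

The appeal is exactly what the comment describes: recovery traffic becomes proportional to what changed, rather than to the full size of the dataset.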
