[Edit, 2015: I must get around to writing a refutation of the article below. I’m keeping it for historical purposes, but I’d now argue I was approaching the problem from a reasonably flawed perspective.]
Don’t get me wrong – I’m quite the fan of deduplication, and not just because it’s really interesting technology. It has the potential to allow a lot more backup data to be kept online for much longer periods of time.
Having more backups immediately available for recovery is undoubtedly great.
I wrote previously about 7 problems with deduplication, but they’re just management problems, not functional problems. Yet there’s one core problem with deduplication: it’s a backup solution.
Deduplication is about backup.
It’s not about recovery.
Target deduplication? If it’s inline, like with Data Domain products, it’s stellar. Source deduplication? It massively reduces the amount of data you have to stream across your network.
When it comes to recovery though, deduplication isn’t a shining knight. That data has to be rehydrated, and unless you’re doing something really intelligent in terms of matching non-corrupt blocks, or maintaining massive deduplication caches on a client, you’re going to be rehydrating at the target rest point and streaming the full data back across the network.
That 1TB database at a remote site you’ve been backing up over an ADSL link after initial seeding, thanks to source-based deduplication? How long can you afford to have the recovery take if it’s got to stream back across that ADSL link?
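To put a rough number on that question, here is a minimal back-of-the-envelope sketch. The 8 Mbit/s link rate is purely an assumption for illustration (real ADSL rates vary widely by line and direction), and `transfer_hours` is a hypothetical helper, not part of any backup product. It also assumes the full, rehydrated 1TB has to cross the link.

```python
def transfer_hours(size_bytes: float, link_bits_per_sec: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to stream size_bytes over a link.

    efficiency is the fraction of the nominal link rate actually usable
    (1.0 means no protocol overhead or contention, which is optimistic).
    """
    if link_bits_per_sec <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link rate must be positive and efficiency in (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 3600


TB = 10**12                      # decimal terabyte, in bytes
ASSUMED_LINK_BPS = 8_000_000     # assumed 8 Mbit/s; substitute your own line rate

hours = transfer_hours(1 * TB, ASSUMED_LINK_BPS)
print(f"{hours:.1f} hours ({hours / 24:.1f} days)")
# prints "277.8 hours (11.6 days)"
```

Even with an ideal, uncontended link at that assumed rate, the recovery runs well over a week, which is the point: the saving deduplication bought on the way in does not automatically apply on the way back.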
I’m not saying to avoid using deduplication. I think it’s likely to become a standard feature of backup solutions within 5 years. By itself though, it’s unlikely to speed up your recoveries. In short: if you’re deploying a data deduplication solution, after you’ve done all your sizing tests, sit down and map out what systems may present challenges during recovery from deduplicated systems (hint: it’s almost always going to be the remote ones), and make sure you have a strategy for them. Always have a strategy.
Always have a recovery strategy. After all, if you don’t, you don’t have a backup system. You’ve just got a bunch of backups.
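One way to start that mapping exercise is a simple inventory that compares a worst-case full restore against how long each system can afford to be down. What follows is only a sketch: the system names, sizes, link rates, and tolerable recovery windows are all invented for illustration, and it assumes the whole data set is rehydrated at the target and streamed back at the stated rate.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    size_gb: float     # data to restore, in decimal gigabytes
    link_mbps: float   # usable link rate back to the client, in Mbit/s
    max_hours: float   # longest recovery the business can tolerate


def restore_hours(system: System) -> float:
    """Worst-case hours to stream the full data set back to the client."""
    bits = system.size_gb * 1e9 * 8
    return bits / (system.link_mbps * 1e6) / 3600


def at_risk(systems: list[System]) -> list[tuple[str, float]]:
    """Systems whose estimated restore exceeds their window, slowest first."""
    flagged = [(s.name, restore_hours(s)) for s in systems
               if restore_hours(s) > s.max_hours]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical inventory; replace with your own systems and figures.
inventory = [
    System("hq-fileserver", size_gb=2000, link_mbps=1000, max_hours=8),
    System("remote-db", size_gb=1000, link_mbps=8, max_hours=24),
    System("branch-mail", size_gb=50, link_mbps=8, max_hours=24),
]

for name, hours in at_risk(inventory):
    print(f"{name}: about {hours:.0f} hours to restore")
# prints "remote-db: about 278 hours to restore"
```

In this made-up inventory only the remote database is flagged, which matches the hint above: the systems that need a separate recovery strategy are almost always the remote ones.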
PS: Thanks to Siobhán for prodding me on this topic.