{"id":3792,"date":"2012-06-27T19:32:29","date_gmt":"2012-06-27T09:32:29","guid":{"rendered":"http:\/\/nsrd.info\/blog\/?p=3792"},"modified":"2015-05-20T20:20:25","modified_gmt":"2015-05-20T10:20:25","slug":"the-one-core-problem-with-deduplication","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2012\/06\/27\/the-one-core-problem-with-deduplication\/","title":{"rendered":"The one core problem with deduplication"},"content":{"rendered":"<p><em>[Edit- 2015: I must get around to writing a refutation to the article below. Keeping it for historical purposes, but I&#8217;d now argue I was\u00a0approaching the problem from a reasonably flawed perspective.]<\/em><\/p>\n<p>Don&#8217;t get me wrong \u2013 I&#8217;m quite the fan of deduplication, and not just because it&#8217;s really interesting technology. It has potential to allow a\u00a0<em>lot<\/em> more backup data to be kept online for much longer periods of time.<\/p>\n<p>Having more backups immediately available for recovery is undoubtedly great.<\/p>\n<p>I wrote previously about <a title=\"7 common problems with deduplication\" href=\"https:\/\/nsrd.info\/blog\/2011\/08\/07\/7-common-problems-with-deduplication\/\" target=\"_blank\">7 problems with deduplication<\/a>, but they&#8217;re just management problems, not functional problems. Yet, there&#8217;s one,\u00a0<em>core<\/em> problem with deduplication: it&#8217;s a\u00a0<em>backup<\/em> solution.<\/p>\n<p>Deduplication is about\u00a0<em>backup<\/em>.<\/p>\n<p>It&#8217;s\u00a0<em>not<\/em> about recovery.<\/p>\n<p>Target deduplication? If it&#8217;s inline, like with Data Domain products, it&#8217;s stellar. Source deduplication? It massively reduces the amount of data you have to stream across your network.<\/p>\n<p>When it comes to recovery though, deduplication isn&#8217;t a shining knight. 
That data has to be rehydrated, and unless you&#8217;re doing something\u00a0<em>really<\/em> intelligent in terms of matching non-corrupt blocks, or maintaining massive deduplication caches on a client, you&#8217;re going to be rehydrating at the target rest point and streaming the full data back across the network.<\/p>\n<p>That 1TB database at a remote site you&#8217;ve been backing up over an ADSL link after initial seeding, thanks to source-based deduplication? How long can you afford to have the recovery take if it&#8217;s got to\u00a0stream\u00a0back across that ADSL link?<\/p>\n<p>I&#8217;m not saying to avoid using deduplication. I think it&#8217;s likely to become a standard feature of backup solutions within 5 years. By itself, though, it&#8217;s unlikely to speed up your recoveries. In short: if you&#8217;re deploying a data deduplication solution, after you&#8217;ve done all your sizing tests, sit down and map out which systems may present challenges during recovery from deduplicated systems (hint: it&#8217;s almost always going to be the remote ones), and make sure you have a strategy for them.\u00a0<em>Always<\/em> have a strategy.<\/p>\n<p>Always have a\u00a0<em>recovery<\/em> strategy. After all, if you don&#8217;t, you don&#8217;t have a backup <em>system<\/em>. You&#8217;ve just got a bunch of backups.<\/p>\n<p><em>[Edit- 2015: I must get around to writing a refutation to the article above. Keeping it for historical purposes, but I&#8217;d now argue I was\u00a0approaching the problem from a reasonably flawed perspective.]<\/em><\/p>\n<p>__<\/p>\n<p>PS: Thanks to Siobh\u00e1n for prodding me on this topic.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[Edit- 2015: I must get around to writing a refutation to the article below. 
Keeping it for historical purposes, but&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3,5],"tags":[301,1252],"class_list":["post-3792","post","type-post","status-publish","format-standard","hentry","category-architecture","category-backup-theory","tag-deduplication","tag-recovery"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-Za","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/3792","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=3792"}],"version-history":[{"count":1,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/3792\/revisions"}],"predecessor-version":[{"id":5585,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/3792\/revisions\/5585"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media?parent=3792"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nsrd.info\/bl
og\/wp-json\/wp\/v2\/categories?post=3792"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=3792"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}