{"id":3254,"date":"2011-08-07T07:29:12","date_gmt":"2011-08-06T21:29:12","guid":{"rendered":"http:\/\/nsrd.info\/blog\/?p=3254"},"modified":"2011-08-07T07:29:12","modified_gmt":"2011-08-06T21:29:12","slug":"7-common-problems-with-deduplication","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2011\/08\/07\/7-common-problems-with-deduplication\/","title":{"rendered":"7 common problems with deduplication"},"content":{"rendered":"<p>In an earlier article, I suggested some <a title=\"Deduplication and space management\" href=\"https:\/\/nsrd.info\/blog\/2011\/02\/09\/deduplication-and-space-management\/\" target=\"_blank\">space management techniques<\/a> that need to be foremost in the minds of any deduplication user. Now, more broadly, I want to mention the top 7 things you need to avoid with deduplication:<\/p>\n<p><strong>1 \u2013 Watch your multiplexing<\/strong><\/p>\n<p>Make sure you take note of what sort of multiplexing you can get away with for deduplication. For instance, when using NetWorker with a deduplication VTL, you <em>must<\/em>\u00a0use maximum on-tape multiplexing settings of 1; if you don&#8217;t, the deduplication system won&#8217;t be able to properly process the incoming data. It&#8217;ll get stored, but the deduplication ratios will fall through the floor.<\/p>\n<p>A common problem I&#8217;ve encountered is a well running deduplication VTL system which over time &#8216;suddenly&#8217; stops getting any good deduplication ratio at all. Nine times out of ten the cause was a situation (usually weeks before) where for one reason or another the VTL had to be dropped and recreated in NetWorker \u2013 but, the target and max sessions values were <em>not<\/em>\u00a0readjusted for each of the virtual drives.<\/p>\n<p><strong>2 \u2013 Get profiled<\/strong><\/p>\n<p>Sure you could just sign a purchase order for a very spiffy looking piece of deduplication equipment. Everyone&#8217;s raving about deduplication. It must be good, right? 
It must work everywhere, right?<\/p>\n<p>Well, not exactly. Deduplication can make a big impact on the at-rest data footprint of a lot of backup environments, but it can also be a terrible failure if your data doesn&#8217;t lend itself well to deduplication. For instance, if your multimedia content is growing, then your deduplication ratios are likely shrinking.<\/p>\n<p>So before you rush out and buy a deduplication system, make sure you have a preliminary assessment of your data done. The better the analysis of your data, the better the understanding you&#8217;ll have of what sort of benefit deduplication will bring your environment.<\/p>\n<p>Or to say it another way \u2013 people who go into a situation with starry eyes can sometimes be blinded.<\/p>\n<p><strong>3 \u2013 Assume lower dedupe ratios<\/strong><\/p>\n<p>A fact sheet has been thrust in front of you! A vendor fact sheet! It says that you&#8217;ll achieve a deduplication ratio of 30:1! It says that some customers have been known to see deduplication ratios of 200:1! It says &#8230;<\/p>\n<p>Well, vendor fact sheets say a lot of things, and there&#8217;s always some level of truth in them.<\/p>\n<p>But step back a moment and consider compression ratios stated for tapes. Almost all tape vendors quote a 2:1 compression ratio \u2013 some even higher. This is all well and good \u2013 but now go and run &#8216;mminfo -mv&#8217; in your environment, and calculate the sorts of compression ratios you&#8217;re really getting.<\/p>\n<p>Compression ratios don&#8217;t really equal deduplication ratios of course \u2013 there&#8217;s a chunk more complexity in deduplication ratios. 
However, anyone who has been in backup for a while will know that you&#8217;ll occasionally get backup tapes with insanely high compression ratios &#8211; say, 10:1 or more &#8211; but the average for many sites is probably closer to the 1.4:1 mark.<\/p>\n<p>My general rule of thumb these days is to assume a 7:1 deduplication ratio for an &#8216;average&#8217; site where a comprehensive data analysis has not been done. Anything more than that is cream on top.<\/p>\n<p><strong>4 \u2013 Don&#8217;t be miserly<\/strong><\/p>\n<p>Deduplication is <em>not<\/em>\u00a0to be treated as a &#8216;temporary staging area&#8217;. Otherwise you&#8217;ll have just bought yourself the most expensive backup-to-disk solution on the market. You don&#8217;t start getting any tangible benefit from deduplication until you&#8217;ve been backing up for several weeks. If you scope and buy a system that can only hold, say, 1-2 weeks&#8217; worth of data, you may as well just spend the money on regular disk.<\/p>\n<p>I&#8217;m starting to come to the conclusion that your deduplication capacity should be able to hold at least 4x your standard full cycle. So if you do full backups once a week and incrementals all other days, you need 4 weeks&#8217; worth of storage. If you do full backups once a month with incrementals\/differentials the rest of the time, you need 4 <em>months&#8217;<\/em>\u00a0worth of storage.<\/p>\n<p><strong>5 \u2013 Have a good cloning strategy<\/strong><\/p>\n<p>You&#8217;ve got deduplication.<\/p>\n<p>You may even have replication between <em>two<\/em>\u00a0deduplication units.<\/p>\n<p>But at some point, unless you&#8217;re throwing massive amounts of budget at this and have minimal retention times, the chances are that you&#8217;re going to have to start writing data out to tape to clear off older content.<\/p>\n<p>Your cloning strategy has to be <em>blazingly\u00a0fast<\/em>\u00a0and <em>damn efficient<\/em>. 
A site with 20TB of deduplicated storage should be able to keep at least 4 x LTO-5 drives running at a decent streaming speed in order to push out the data as it&#8217;s required. Why? Because it&#8217;s <em>rehydrating<\/em>\u00a0the data as it streams back out to tape. Oh, I know some backup products offer to write the data out to tape in deduplicated format, but that usually turns out to be\u00a0<a title=\"Dedupe to tape is crazy\" href=\"https:\/\/nsrd.info\/blog\/2009\/10\/26\/dedupe-to-tape-is-crazy-bad-if-the-architecture-is-crazy\/\" target=\"_blank\">bat-shit crazy<\/a>. Sure, it gets the data out to tape quicker, but once data is on tape you have to start thinking about the amount of time it takes to <em><a title=\"Why tape and dedupe just don't mix\" href=\"https:\/\/nsrd.info\/blog\/2011\/07\/06\/why-tape-and-dedupe-just-dont-mix\/\" target=\"_blank\">recover<\/a><\/em>\u00a0it.<\/p>\n<p><strong>6 \u2013 Know your trends<\/strong><\/p>\n<p>Any deduplication system should let you see what sort of deduplication ratios you&#8217;re getting. If it&#8217;s got a reporting mechanism, all the better, but in a worst-case scenario, be prepared to log in every single day for your backup cycles and see:<\/p>\n<blockquote><p>-a- What your current global deduplication ratio is<\/p>\n<p>-b- What deduplication ratio you achieved over the past 24 hours<\/p><\/blockquote>\n<p>Use that information \u2013 store it, map it, and learn from it. When do you get your best deduplication ratios? What backups do they correlate to? 
More importantly, when do you get your <em>worst<\/em>\u00a0deduplication ratios, and what backups do <em>they<\/em>\u00a0correlate to?<\/p>\n<p>(The recent addition of DD Boost functionality in NetWorker <a title=\"First Impressions - DD Boost\" href=\"https:\/\/nsrd.info\/blog\/2011\/01\/24\/first-impressions-data-domain-boost\/\" target=\"_blank\">can make this trivially easy<\/a>, by the way.)<\/p>\n<p>If you&#8217;ve got this information at hand, you can use it to trend and map capacity utilisation within your deduplication system. If you don&#8217;t, you&#8217;re flying blind with one hand tied behind your back.<\/p>\n<p><strong>7 \u2013 Know your space reclamation process and speeds<\/strong><\/p>\n<p>It&#8217;s rare for space reclamation to happen immediately in a deduplication system. It may happen daily, or weekly, but it&#8217;s unlikely to be instantaneous. (See <a title=\"Deduplication and space management\" href=\"https:\/\/nsrd.info\/blog\/2011\/02\/09\/deduplication-and-space-management\/\" target=\"_blank\">here<\/a> for more details.)<\/p>\n<p>Have a strong, clear understanding of:<\/p>\n<blockquote><p>-a- When your space reclamation runs (obviously, this should be tweaked to your environment)<\/p>\n<p>-b- How long space reclamation typically takes to complete<\/p>\n<p>-c- The impact that the space reclamation operation has on the performance of your deduplication environment<\/p>\n<p>-d- How much capacity you&#8217;re likely to reclaim, on average<\/p>\n<p>-e- What factors may <em>block<\/em>\u00a0reclamation. 
(E.g., hung replication, etc.)<\/p><\/blockquote>\n<p>If you don&#8217;t understand this, you&#8217;re flying blind and have the <em>other<\/em>\u00a0hand tied behind your back, too.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In an earlier article, I suggested some space management techniques that need to be foremost in the minds of any&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3,5,16],"tags":[100,195,196,228,229,281,299,301,613,787,915,1021,1022,1023],"class_list":["post-3254","post","type-post","status-publish","format-standard","hentry","category-architecture","category-backup-theory","category-networker","tag-advertised-capacity","tag-capacity","tag-capacity-plan","tag-cloning","tag-cloning-strategy","tag-data-profiling","tag-dedupe-ratio","tag-deduplication","tag-multiplexing","tag-reclamation","tag-space-reclamation","tag-trend","tag-trend-analysis","tag-trending"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-Qu","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/3254","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v
2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=3254"}],"version-history":[{"count":0,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/3254\/revisions"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media?parent=3254"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/categories?post=3254"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=3254"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}