{"id":5578,"date":"2015-05-20T20:21:17","date_gmt":"2015-05-20T10:21:17","guid":{"rendered":"http:\/\/nsrd.info\/blog\/?p=5578"},"modified":"2018-12-11T12:04:59","modified_gmt":"2018-12-11T02:04:59","slug":"pool-size-and-deduplication","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2015\/05\/20\/pool-size-and-deduplication\/","title":{"rendered":"Pool size and deduplication"},"content":{"rendered":"<p>When you start looking into deduplication, one of the things that&nbsp;becomes immediately&nbsp;apparent is &#8230;&nbsp;<em>size matters<\/em>. In particular, the size of your deduplication <em>pool<\/em> matters.<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Deduplication-Pool.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5579\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Deduplication-Pool.jpg\" alt=\"Deduplication Pool\" width=\"604\" height=\"456\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Deduplication-Pool.jpg 604w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Deduplication-Pool-300x226.jpg 300w\" sizes=\"auto, (max-width: 604px) 100vw, 604px\" \/><\/a><\/p>\n<p>In this respect, what I&#8217;m referring to is the analysis pool for comparison when performing deduplication. If we&#8217;re only talking target based deduplication, that&#8217;s simple \u2013 it&#8217;s the size of the bucket you&#8217;re writing&nbsp;your backup to. However, the problems with a purely target based deduplication approach to data protection are&nbsp;network congestion and time wasted \u2013 a full backup of a 1TB fileserver will still see 1TB of data transferred over&nbsp;the network to have most of its data&nbsp;dropped as being duplicate. 
That&#8217;s an awful lot of packets going to \/dev\/null,&nbsp;and an awful lot of bandwidth wasted.<\/p>\n<p>For example,&nbsp;consider the following diagram of a solution&nbsp;using target-only deduplication (e.g., VTL only or no Boost API on&nbsp;the hosts):<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Dedupe-Target-Only.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5580\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Dedupe-Target-Only.jpg\" alt=\"Dedupe Target Only\" width=\"656\" height=\"588\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Dedupe-Target-Only.jpg 656w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Dedupe-Target-Only-300x269.jpg 300w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/a><\/p>\n<p>In this diagram, consider the outline&nbsp;arrow heads to indicate&nbsp;<em>where<\/em> deduplication is being evaluated. Thus, if each server had 1TB of&nbsp;storage to be backed up, then each server would send 1TB of data over the network to the Data Domain, with deduplication&nbsp;performed only at the target end. That&#8217;s not how deduplication has to work now, but it&#8217;s a reminder of where we were only a relatively short time ago.<\/p>\n<p>That&#8217;s why source-based deduplication (e.g.,&nbsp;NetWorker Client Direct with a DDBoost-enabled&nbsp;connection, or Data Domain Boost for Enterprise Applications) brings so many efficiencies to a data protection&nbsp;system. 
While there&#8217;ll be a touch more processing performed on the individual clients, that&#8217;ll be significantly outweighed by the ofttimes&nbsp;<em>massive<\/em> reduction in data sent&nbsp;onto the network for ingestion into the deduplication appliance.<\/p>\n<p>So that might look more like:<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5581\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe.jpg\" alt=\"Source Dedupe\" width=\"656\" height=\"588\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe.jpg 656w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe-300x269.jpg 300w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/a><\/p>\n<p>I.e., in this diagram, with&nbsp;the outline arrow heads indicating&nbsp;the location of deduplication activities, we get an immediate change \u2013 each of those hosts will still have 1TB of backup to perform,&nbsp;<em>but<\/em> they&#8217;ll evaluate via hashing mechanisms whether or not that data actually needs to be sent to the&nbsp;target appliance.<\/p>\n<p>There are still efficiencies to be had even here, though, which is where the original point about pool&nbsp;<em>size<\/em> becomes critical. 
To understand why,&nbsp;let&#8217;s look at the diagram a slightly&nbsp;different way:<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe-Global.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5582\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe-Global.jpg\" alt=\"Source Dedupe Global\" width=\"656\" height=\"588\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe-Global.jpg 656w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2015\/05\/Source-Dedupe-Global-300x269.jpg 300w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/a><\/p>\n<p>In this case, we&#8217;ve still got source deduplication, but the merged lines represent something far more&nbsp;important &#8230; we&#8217;ve got&nbsp;<em>global<\/em> source deduplication.<\/p>\n<p>Or to put it a slightly different way:<\/p>\n<ul>\n<li>Target deduplication:\n<ul>\n<li>Client: &#8220;Hey, here&#8217;s&nbsp;<span style=\"text-decoration: underline;\"><strong>all<\/strong><\/span> my data. Check to see what you want to store.&#8221;<\/li>\n<\/ul>\n<\/li>\n<li>Source deduplication&nbsp;(limited):\n<ul>\n<li>Client: &#8220;Hey,&nbsp;I want to back up&nbsp;&lt;data&gt;. Tell me what I need to send based on what I&#8217;ve sent you before.&#8221;<\/li>\n<\/ul>\n<\/li>\n<li>Source&nbsp;deduplication (global):\n<ul>\n<li>Client: &#8220;Hey, I want to&nbsp;back up &lt;data&gt;. Tell me what I need to send based on anything you&#8217;ve ever received before.&#8221;<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>That&nbsp;<em>limited<\/em> deduplication isn&#8217;t necessarily limited on a per-host basis. Some products might&nbsp;deduplicate per host, while others&nbsp;might deduplicate based on particular pool sizes \u2013 e.g.,&nbsp;<em>x<\/em>TB. 
But even so,&nbsp;there&#8217;s a <em>huge<\/em> difference between deduplicating against a small comparison set&nbsp;and deduplicating against a large comparison set.<\/p>\n<p>Where that&nbsp;global deduplication pool size comes into play is the commonality of data that exists between hosts within an environment.&nbsp;Consider, for instance, the&nbsp;minimum recommended size for a Windows 2012&nbsp;installation \u2013 32GB.&nbsp;Now, assume you might get a 5:1 deduplication ratio on a Windows 2012 server (I&nbsp;<em>literally<\/em>&nbsp;picked&nbsp;a number out of the air as an example, not a fact) &#8230; that&#8217;ll mean a target-occupied&nbsp;data size of 6.4GB to hold 32GB of data.<\/p>\n<p>But we rarely consider a single server in isolation. Let&#8217;s expand this out to encompass 100 x Windows 2012 servers, each at 32GB in size. It&#8217;s here we see the importance of&nbsp;a large&nbsp;pool of data for deduplication analysis:<\/p>\n<ul>\n<li>If that deduplication analysis were being performed at the <em>per-server<\/em> level, then realistically we&#8217;d be getting 100 x 6.4GB of target data, or 640GB.<\/li>\n<li>If the deduplication analysis were being performed against all data previously deduplicated, then we&#8217;d assume that same 5:1 deduplication ratio for&nbsp;the first server backup, and then much higher deduplication ratios for each subsequent server backup, as they evaluate&nbsp;against previously stored&nbsp;data. So that might mean 1 x 5:1 + 99 x 20:1 \u2013 i.e., 6.4GB + 99 x 1.6GB &#8230; 164.8GB instead of 640GB, or even (if we want to compare against tape) 3,200GB.<\/li>\n<\/ul>\n<p>Throughout this article I&#8217;ve been using the term&nbsp;<em>pool<\/em>, but&nbsp;I&#8217;m not referring to&nbsp;NetWorker media pools \u2013 everything written to a Data Domain, for example, regardless of what media pool it&#8217;s associated with in NetWorker, will be globally deduplicated against everything else on that Data Domain. 
But this does make a strong case for&nbsp;<em>right-sizing<\/em> your appliance, and in particular planning for&nbsp;more data to be stored on it than you would for a conventional disk &#8216;staging&#8217; or &#8216;landing&#8217; area. The old model \u2013 backup to disk, transfer to tape \u2013 was premised on having a disk landing zone big enough to accommodate your biggest backup, so long as you could&nbsp;subsequently transfer that to tape before your&nbsp;<em>next<\/em> backup. (Or some variant thereof.) A common mistake when evaluating deduplication is to think along similar lines. You&nbsp;<em>don&#8217;t<\/em> want storage that&#8217;s just big enough to hold a single big backup \u2013 you want it big enough to hold many backups so you&nbsp;can actually see the strategic and operational benefit&nbsp;of deduplication.<\/p>\n<p>The net lesson is a simple one: size matters. The size of the deduplication pool, and what&nbsp;deduplication activities are compared against, will have&nbsp;a significant impact on how much space is occupied by your&nbsp;data protection activities, how long it takes to perform those activities, and what impact those activities have on your LAN or WAN.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When you start looking into deduplication, one of the things that&nbsp;becomes immediately&nbsp;apparent is &#8230;&nbsp;size matters. 
In particular, the size of&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3,1154,1181,16],"tags":[301,1231,1232,1233],"class_list":["post-5578","post","type-post","status-publish","format-standard","hentry","category-architecture","category-avamar-2","category-data-domain-2","category-networker","tag-deduplication","tag-pool-size","tag-source-based-deduplication","tag-target-based-deduplication"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-1rY","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/5578","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=5578"}],"version-history":[{"count":2,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/5578\/revisions"}],"predecessor-version":[{"id":7428,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/5578\/revisions\/7428"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-js
on\/wp\/v2\/media?parent=5578"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/categories?post=5578"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=5578"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}