{"id":6262,"date":"2017-05-05T07:29:07","date_gmt":"2017-05-04T21:29:07","guid":{"rendered":"http:\/\/nsrd.info\/blog\/?p=6262"},"modified":"2018-12-11T08:39:13","modified_gmt":"2018-12-10T22:39:13","slug":"architecture-matters-when-you-dedupe","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/","title":{"rendered":"Architecture Matters: When you dedupe"},"content":{"rendered":"<p>There was a&nbsp;time, comparatively not that long ago, when the biggest governing factor in LAN capacity for a datacentre was not the primary production workloads, but the mechanics of getting a full backup from each host over to the backup media.&nbsp;If you&#8217;ve been around in&nbsp;the data protection industry long enough you&#8217;ll have had experience of that \u2013 for instance, the drive towards 1Gbit networks over Fast Ethernet started&nbsp;more often than not in datacentres I was involved in thanks to backup. Likewise, the first systems I saw being attached&nbsp;directly to 10Gbit backbones in datacentres were the&nbsp;backup infrastructure.<\/p>\n<p>Well architected&nbsp;deduplication can eliminate that consideration. That&#8217;s not to say you won&#8217;t eventually need 10Gbit, 40Gbit or even more in&nbsp;your datacentre, but if deduplication is&nbsp;architected correctly, you won&#8217;t need to deploy&nbsp;that next level up of network&nbsp;performance to meet your backup requirements.<\/p>\n<p>In this blog article I want to take you through an example of why deduplication architecture&nbsp;matters, and I&#8217;ll focus on something that amazingly still gets consideration from time to time: post-ingest deduplication.<\/p>\n<p>Before I get started \u2013 obviously, Data&nbsp;Domain doesn&#8217;t use post-ingest deduplication. 
Its pre-ingest deduplication ensures the only data written to the appliance is already deduplicated, and it further increases efficiency by pushing deduplication segmentation and processing out to the individual clients (in a NetWorker\/Avamar environment) to limit the amount of data flowing across the network.<\/p>\n<p>A post-ingest deduplication architecture, though,&nbsp;gives your protection appliance two distinct tiers of storage \u2013 the landing or staging tier, and the deduplication tier. So that means when it&#8217;s time to do a backup,&nbsp;all your clients send all their data across the network to&nbsp;sit, at its original size, on the&nbsp;staging tier:<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/post-process-dedupe-01\/\" rel=\"attachment wp-att-6269\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6269\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-01.jpg\" alt=\"Post Process Dedupe 01\" width=\"1308\" height=\"1048\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-01.jpg 1308w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-01-300x240.jpg 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-01-768x615.jpg 768w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-01-1024x820.jpg 1024w\" sizes=\"auto, (max-width: 1308px) 100vw, 1308px\" \/><\/a><\/p>\n<p>In the example above we&#8217;ve already had backups run to the post-ingest deduplication appliance, so there&#8217;s a heap of deduplicated data sitting in the deduplication tier, but our staging tier has just landed all the backups from each of the clients in&nbsp;the environment. 
(If it were NetWorker writing to the appliance, each of those backups would be&nbsp;the full sized savesets.)<\/p>\n<p>Now, at some point after&nbsp;the backup completes (usually a preconfigured time), post-processing kicks in. This is effectively a data-migration window in a post-ingest appliance where all the data in the staging tier has to be read&nbsp;and processed for deduplication. For example, using the example above, we might start with inspecting &#8216;Backup01&#8217; for commonality to&nbsp;data on&nbsp;the deduplication tier:<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/post-process-dedupe-02\/\" rel=\"attachment wp-att-6270\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6270\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-02.jpg\" alt=\"Post Process Dedupe 02\" width=\"1137\" height=\"705\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-02.jpg 1137w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-02-300x186.jpg 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-02-768x476.jpg 768w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-02-1024x635.jpg 1024w\" sizes=\"auto, (max-width: 1137px) 100vw, 1137px\" \/><\/a><\/p>\n<p>So the post-ingest processing engine starts by reading through all the content of <strong>Backup01<\/strong> and constructs fingerprint analysis of the data that has landed.<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/post-process-dedupe-03\/\" rel=\"attachment wp-att-6271\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6271\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-03.jpg\" alt=\"Post Process Dedupe 03\" width=\"1137\" height=\"705\" 
srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-03.jpg 1137w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-03-300x186.jpg 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-03-768x476.jpg 768w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-03-1024x635.jpg 1024w\" sizes=\"auto, (max-width: 1137px) 100vw, 1137px\" \/><\/a><\/p>\n<p>As fingerprints are assembled, data can be compared against the&nbsp;data already residing in the deduplication tier. This may result in signature matches or signature&nbsp;misses, indicating new data that needs to be copied into the&nbsp;deduplication tier.<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/post-process-dedupe-04\/\" rel=\"attachment wp-att-6272\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6272\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-04.jpg\" alt=\"Post Process Dedupe 04\" width=\"1137\" height=\"705\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-04.jpg 1137w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-04-300x186.jpg 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-04-768x476.jpg 768w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-04-1024x635.jpg 1024w\" sizes=\"auto, (max-width: 1137px) 100vw, 1137px\" \/><\/a><\/p>\n<p>In this it&#8217;s similar to regular deduplication \u2013 signature matches result in pointers for existing data being updated and extended, and a signature miss results in needing to store new data on the deduplication tier.<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/post-process-dedupe-05\/\" rel=\"attachment wp-att-6273\"><img loading=\"lazy\" 
decoding=\"async\" class=\"aligncenter size-full wp-image-6273\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-05.jpg\" alt=\"Post Process Dedupe 05\" width=\"1137\" height=\"705\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-05.jpg 1137w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-05-300x186.jpg 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-05-768x476.jpg 768w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-05-1024x635.jpg 1024w\" sizes=\"auto, (max-width: 1137px) 100vw, 1137px\" \/><\/a><\/p>\n<p>Once the first backup&nbsp;file written to the staging tier has been dealt with, we can delete that file from&nbsp;the staging area and move onto the second backup file to start the process all over again. And we keep doing that&nbsp;over and over and over on the staging tier until we&#8217;re left with an empty staging tier:<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/post-process-dedupe-06\/\" rel=\"attachment wp-att-6274\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6274\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-06.jpg\" alt=\"Post Process Dedupe 06\" width=\"1137\" height=\"705\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-06.jpg 1137w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-06-300x186.jpg 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-06-768x476.jpg 768w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/Post-Process-Dedupe-06-1024x635.jpg 1024w\" sizes=\"auto, (max-width: 1137px) 100vw, 1137px\" \/><\/a><\/p>\n<p>Of course,&nbsp;that&#8217;s not the end of the process \u2013 then the deduplication tier will have to run its 
regular&nbsp;housekeeping&nbsp;operations to remove data that&#8217;s no longer referenced by anything.<\/p>\n<p>Architecturally,&nbsp;post-ingest deduplication is a kazoo to pre-ingest deduplication&#8217;s symphony&nbsp;orchestra. Sure, you might&nbsp;technically get to hear the 1812 Overture, but it&#8217;s not really going to be the same, right?<\/p>\n<p>Let&#8217;s go through where, architecturally, post-ingest&nbsp;deduplication fails you:<\/p>\n<ol>\n<li>The network becomes your bottleneck again. You have to send all your backup data to the appliance.<\/li>\n<li>The staging tier has to have at least as much capacity available as the size of your&nbsp;<em>biggest backup<\/em>,&nbsp;assuming it can execute its post-process&nbsp;deduplication within&nbsp;the window between when&nbsp;your previous backup finishes and your next backup&nbsp;starts.<\/li>\n<li>The deduplication process becomes&nbsp;<em>entirely<\/em> spindle-bound. If you&#8217;re using spinning disk, that&#8217;s a nightmare. If you&#8217;re using SSD, that&#8217;s $$$.<\/li>\n<li>There&#8217;s no way of telling in advance how much space will be occupied on the&nbsp;deduplication tier after deduplication processing completes. This can lead you into very messy situations where, say, the staging tier&nbsp;can&#8217;t empty because&nbsp;the deduplication tier has filled. (Yes, capacity maintenance is still a requirement on pre-ingest deduplication systems, but it&#8217;s <span style=\"text-decoration: underline;\"><strong>half<\/strong><\/span> the effort.)<\/li>\n<\/ol>\n<p>What this means is simple: post-ingest deduplication architectures are&nbsp;<em>asking you to pay for their architectural inefficiencies<\/em>. 
That&#8217;s where:<\/p>\n<ol>\n<li>You have to pay to increase your network bandwidth to get a complete copy of your data from client to protection&nbsp;storage within your backup window.<\/li>\n<li>You have to pay&nbsp;for both&nbsp;the staging tier storage and the deduplication tier storage.&nbsp;(In fact, the staging tier is&nbsp;often&nbsp;<em>a lot bigger<\/em> than&nbsp;the size of your biggest&nbsp;backups in a 24-hour window so the deduplication can be handled in time.)<\/li>\n<li>You have to factor the additional housekeeping operations into blackout windows, outages, etc. Housekeeping almost invariably becomes a daily rather than a weekly task, too.<\/li>\n<\/ol>\n<p>Compare all that to pre-ingest deduplication:<\/p>\n<p><a href=\"https:\/\/nsrd.info\/blog\/2017\/05\/05\/architecture-matters-when-you-dedupe\/pre-ingest-dedupe\/\" rel=\"attachment wp-att-6275\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6275\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/pre-ingest-dedupe.jpg\" alt=\"Pre-Ingest Deduplication\" width=\"914\" height=\"687\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/pre-ingest-dedupe.jpg 914w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/pre-ingest-dedupe-300x225.jpg 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2017\/05\/pre-ingest-dedupe-768x577.jpg 768w\" sizes=\"auto, (max-width: 914px) 100vw, 914px\" \/><\/a><\/p>\n<p>Using pre-ingest deduplication, especially&nbsp;Boost based deduplication, the&nbsp;segmentation and hashing happen directly where&nbsp;the data is, and rather than sending the&nbsp;<em>entire<\/em>&nbsp;data&nbsp;to be protected from the client to the Data&nbsp;Domain, we only send&nbsp;the <strong>unique<\/strong> data. Data that already resides on the&nbsp;Data Domain? 
All we send is a tiny fingerprint so the Data Domain can confirm it&#8217;s already there (and update its pointers for existing data) before moving on.&nbsp;After your first backup, that potentially means that on a day-to-day basis your network requirements&nbsp;for backup are reduced by 95% or more.<\/p>\n<p>That&#8217;s why architecture matters: you&#8217;re either doing it right, or you&#8217;re&nbsp;paying&nbsp;the price for someone else&#8217;s&nbsp;inefficiency.<\/p>\n<hr>\n<p>If you want to see&nbsp;more about how a well-architected backup environment looks (technology, people and processes), check out my&nbsp;book, <a href=\"https:\/\/www.amazon.com\/Data-Protection-Ensuring-Availability\/dp\/1482244152\/ref=mt_paperback?_encoding=UTF8&amp;me=\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>Data Protection: Ensuring Data Availability<\/strong><\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There was a&nbsp;time, comparatively not that long ago, when the biggest governing factor in LAN capacity for a datacentre 
was&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3],"tags":[1241,301,1361,625,732],"class_list":["post-6262","post","type-post","status-publish","format-standard","hentry","category-architecture","tag-architecture","tag-deduplication","tag-efficiency","tag-network","tag-performance"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-1D0","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/6262","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=6262"}],"version-history":[{"count":8,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/6262\/revisions"}],"predecessor-version":[{"id":7389,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/6262\/revisions\/7389"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media?parent=6262"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nsrd.info\
/blog\/wp-json\/wp\/v2\/categories?post=6262"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=6262"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}