{"id":1216,"date":"2009-10-28T12:05:45","date_gmt":"2009-10-28T02:05:45","guid":{"rendered":"http:\/\/nsrd.wordpress.com\/?p=1216"},"modified":"2009-10-28T12:05:45","modified_gmt":"2009-10-28T02:05:45","slug":"dedupe-leading-edge-or-bleeding-edge","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2009\/10\/28\/dedupe-leading-edge-or-bleeding-edge\/","title":{"rendered":"Dedupe: Leading Edge or Bleeding Edge?"},"content":{"rendered":"<p>If you think you can&#8217;t go a day without hearing something about dedupe, you&#8217;re probably right. Whether it&#8217;s every vendor arguing the case that their dedupe offerings are the best, or tech journalism reporting on it, or pundits explaining why you need it and why your infrastructure will just <em>die<\/em> without it, it seems that it&#8217;s the topic of the year, along with The Cloud.<\/p>\n<p>There is (from some at least) an argument that backup systems should be &#8220;out there&#8221; in terms of innovation; I question that, inasmuch as I believe that the term <em>bleeding edge<\/em> is there for a reason \u2013 it&#8217;s much sharper, it&#8217;s prone to accidents, and if you have an accident at the bleeding edge, well, you&#8217;ll bleed.<\/p>\n<p>So, I always argue that there&#8217;s nothing wrong with <em>leading edge<\/em> in backup systems (so long as it is warranted), but bleeding edge is a far riskier proposition \u2013 not just in terms of potentially wasted investment, but due to the <em>side effect<\/em> of that wasted investment. If a product is outright bleeding edge then having it involved in data protection is a particularly dangerous proposition. 
(Only when technology is a mix of bleeding edge and leading edge can you start to make the argument that it should at least be considered in the data protection sphere.)<\/p>\n<p>Personally, I like the definitions of Bleeding Edge and Leading Edge in the Wikipedia article on <a title=\"Technology Lifecycle @ Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Technology_lifecycle\" target=\"_blank\">Technology Lifecycle<\/a>. To quote:<\/p>\n<blockquote><p>Bleeding edge \u2013 any technology that shows high potential but hasn&#8217;t demonstrated its value or settled down into any kind of consensus. Early adopters may win big, or may be stuck with a white elephant.<\/p>\n<p>Leading edge \u2013 a technology that has proven itself in the marketplace but is still new enough that it may be difficult to find knowledgeable personnel to implement or support it.<\/p><\/blockquote>\n<p>So the question is \u2013 is deduplication leading edge, or is it still bleeding edge?<\/p>\n<p>To understand the answer, we first have to consider that there are actually five classified stages to the technology lifecycle. These are:<\/p>\n<ol>\n<li>Bleeding edge.<\/li>\n<li>Leading edge.<\/li>\n<li>State of the art.<\/li>\n<li>Dated.<\/li>\n<li>Obsolete.<\/li>\n<\/ol>\n<p>What we have to consider is &#8211; what happens when a technology exhibits attributes of more than one classification or stage? To me, working in the conservative field of data protection, I think there&#8217;s only one answer: it should be classified by the &#8220;least mature&#8221; or &#8220;most dangerous&#8221; stage whose attributes it exhibits.<\/p>\n<p>Thus, <em>deduplication is still bleeding edge<\/em>.<\/p>\n<h3>Why dedupe is still bleeding edge<\/h3>\n<p>Clearly there are attributes of deduplication which are leading edge. 
It has, in field deployments, proven itself to be valuable in particular instances.<\/p>\n<p>However, there are attributes of deduplication which are definitely still bleeding edge. In particular, the distinction for bleeding edge (to again quote from the Wikipedia article on Technology Lifecycle) is that it:<\/p>\n<blockquote><p>&#8230;shows high potential but hasn&#8217;t demonstrated its value or <strong>settled down into any kind of consensus<\/strong>.<\/p><\/blockquote>\n<p>(My emphasis added.)<\/p>\n<p>Clearly in at least some areas, deduplication has demonstrated its value \u2013 my rationale for it still being bleeding edge though is the second (and equally important) attribute: I&#8217;m not convinced that deduplication has sufficiently settled down into any kind of consensus.<\/p>\n<p>Within deduplication, you can:<\/p>\n<ul>\n<li>Dedupe primary data (less frequent, but talk is growing about this)<\/li>\n<li>Dedupe <a title=\"Interesting dedupe technology over at Storagezilla\" href=\"https:\/\/nsrd.info\/blog\/2009\/09\/01\/interesting-dedupe-technology-overview-at-storagezilla\/\" target=\"_blank\">virtualised<\/a> systems<\/li>\n<li>Dedupe archive\/HSM systems (whether literally, or via single instance storage, or a combination thereof)<\/li>\n<li>Dedupe NAS<\/li>\n<li>For backup:\n<ul>\n<li>Do source based dedupe:\n<ul>\n<li>At the file level<\/li>\n<li>At a fixed block level<\/li>\n<li>At a variable block level<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li>Do target based dedupe:\n<ul>\n<li>Post-backup, maintaining two pools of storage, one deduplicated, one normal. 
Most frequently accessed data is typically kept &#8220;hydrated&#8221;, whereas the deduped storage holds longer term\/less frequently accessed data.<\/li>\n<li>Inline (at ingest), maintaining only one deduplicated pool of storage<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li>For long term storage of deduplicated backups:\n<ul>\n<li>Replicate, maintaining two deduplicated systems<\/li>\n<li>Transfer out to tape, usually via rehydration (the slightly better term for &#8220;undeduplicating&#8221;)<\/li>\n<li>Transfer deduped data out to tape &#8220;as is&#8221;<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Does this look like any real consensus to you?<\/p>\n<p>One comfort in particular that we can take from all these disparate dedupe options is that clearly there&#8217;s a lot of innovation going on. The fundamentals behind dedupe are also tried and trusted \u2013 we use them every time we compress a file or a bunch of files. It&#8217;s just scanning for common blocks and reducing the data to the smallest possible amount.<\/p>\n<p>It&#8217;s also an intelligent and logical way of moving forward in storage \u2013 i.e., we&#8217;ve reached a point where both the companies that purchase storage and the vendors that provide it are moving towards using storage more <em>efficiently<\/em> rather than just continuing to buy more of it. This trend started with the development of SAN and NAS, so dedupe is just the logical continuation of those storage centralisation\/virtualisation paths. Moreover, the trend towards more intelligent use of technology is not new \u2013 consider even recent changes in products from the CPU manufacturers. Taking Intel as a prime example, for years their primary development strategy was &#8220;fast, <em>faster<\/em>, <em><strong>fastest<\/strong><\/em>.&#8221; However, that strategy ended up hitting a brick wall \u2013 it doesn&#8217;t matter how fast an individual processor is if you actually need to do multiple things at once. 
Hence multi-core really hit the mainstream. Multiple processors were previously reserved for high end workstations and servers; now it&#8217;s common for any new computer to come with multiple cores. (Heck, I have 2 x Quad Core processors in the machine I&#8217;m writing this article on. The CPU speeds are technically slower than my lab ESX server, but with multi-core, multi-threading, it smacks the ESX server out of the lab every time on performance. It&#8217;s more <em>intelligent<\/em> use of the resources.)<\/p>\n<p>So dedupe is about shifting away from big, <em>bigger<\/em>, <em><strong>biggest<\/strong><\/em> storage to <em>smart, smarter<\/em> and <em><strong>smartest<\/strong><\/em> storage.<\/p>\n<p>We&#8217;re certainly not at <em><strong>smartest<\/strong><\/em> yet.<\/p>\n<p>We&#8217;re probably not even at <em>smarter<\/em> yet.<\/p>\n<p>As an overall implementation strategy, deduplication is practically infantile in terms of actual industry-state vs potential industry-state. You can do it on your primary production data, or your virtualised systems, or your archived data, or your secondary NAS data, or your backups, but so far there have been few tangible, usable advances towards being able to use it throughout your entire data lifecycle in a way which is compatible and transparent regardless of vendor or product in use.<\/p>\n<p>For dedupe to be able to make that leap fully out of bleeding edge territory, it needs to make some inroads into complete data lifecycle deduplication \u2013 starting at the primary data level and finishing at backups and archives.<\/p>\n<p>(And even when we <em>can<\/em> use it through the entire data lifecycle, we&#8217;ll still be stuck with working out what to do with it once it&#8217;s been generated, for longer term storage. Do we replicate between sites? Do we rehydrate to tape, or do we send out the deduped data to tape? 
Obviously, based on recent articles, I don&#8217;t (yet) have much faith in the notion of <a title=\"Dedupe to tape is &quot;crazy bad&quot; if the architecture is crazy\" href=\"https:\/\/nsrd.info\/blog\/2009\/10\/26\/dedupe-to-tape-is-crazy-bad-if-the-architecture-is-crazy\/\" target=\"_blank\">writing deduped data to tape<\/a>.)<\/p>\n<p>If you think that there isn&#8217;t a choice for long term storage \u2013 that it has to be replication, and that dedupe is a &#8220;tape killer&#8221; \u2013 think again. Consider smaller sites with constrained budgets, consider sites that can&#8217;t afford dedicated disaster recovery systems, and consider sites that want to actually limit their energy impact. (I.e., sites that understand the difference in energy savings between offsite tapes and <a title=\"MAID @ Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/MAID\" target=\"_blank\">MAID<\/a> for long term data storage.)<\/p>\n<h3>So should data protection environments implement dedupe?<\/h3>\n<p>You might think, based on my previous comments, that my response to this is going to be a clear-cut <em>no<\/em>. That&#8217;s not quite correct, however. You see, because dedupe falls into both leading edge and bleeding edge, it is something that <em>can<\/em> be implemented in specific environments, in specific circumstances.<\/p>\n<p>That is, the suitability of dedupe for an environment can be evaluated on a case by case basis, so long as sites are aware that when implementing dedupe they&#8217;re not getting the full promise of the technology, but just specific windows on the technology. 
It may be that companies:<\/p>\n<ul>\n<li>Need to reduce their backup windows, in which case source-based dedupe <em>could<\/em> be one option (among many).<\/li>\n<li>Need to reduce their overall primary production data, in which case single instance archive is a likely way to go.<\/li>\n<li>Need to keep more data available for recovery in VTLs (or for that matter on disk backup units), in which case target based dedupe is the likely way to go.<\/li>\n<li>Want to implement more than one of the above, in which case they will be buying disparate technologies that don&#8217;t share common architectures or operational management systems.<\/li>\n<\/ul>\n<p>I&#8217;d be mad if I were to say that dedupe is still too immature for <em>any<\/em> site to consider \u2013 yet I&#8217;d charge that anyone who says that <em>every<\/em> site should go down a dedupe path, and that <em>every<\/em> site will get fantastic savings from implementing dedupe,\u00a0is equally mad.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you think you can&#8217;t go a day without hearing something about dedupe, you&#8217;re probably right. 
Whether it&#8217;s every vendor&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[5,12,13],"tags":[167,298,301,480,526,563,805,912,986],"class_list":["post-1216","post","type-post","status-publish","format-standard","hentry","category-backup-theory","category-general-technology","category-general-thoughts","tag-bleeding-edge","tag-dedupe","tag-deduplication","tag-inline","tag-leading-edge","tag-maid","tag-rehydrate","tag-source","tag-target"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-jC","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/1216","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=1216"}],"version-history":[{"count":0,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/1216\/revisions"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media?parent=1216"}],"wp:term":[{"taxonomy":"category","embedd
able":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/categories?post=1216"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=1216"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}