Dedupe: Leading Edge or Bleeding Edge?

If you think you can’t go a day without hearing something about dedupe, you’re probably right. Whether it’s every vendor arguing that their dedupe offerings are the best, tech journalism reporting on it, or pundits explaining why you need it and why your infrastructure will simply die without it, dedupe seems to be the topic of the year, right alongside The Cloud.

There is (from some at least) an argument that backup systems should be “out there” in terms of innovation. I question that, inasmuch as I believe the term bleeding edge exists for a reason – it’s much sharper, it’s prone to accidents, and if you have an accident at the bleeding edge, well, you’ll bleed.

So I always argue that there’s nothing wrong with leading edge in backup systems (so long as it’s warranted), but bleeding edge is a far riskier proposition – not just in terms of potentially wasted investment, but because of the side effects of that wasted investment. If a product is outright bleeding edge, then having it involved in data protection is a particularly dangerous proposition. (Only when a technology is a mix of bleeding edge and leading edge can you start to make the argument that it should at least be considered in the data protection sphere.)

Personally, I like the definitions of Bleeding Edge and Leading Edge in the Wikipedia article on the Technology Lifecycle. To quote:

Bleeding edge – any technology that shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus. Early adopters may win big, or may be stuck with a white elephant.

Leading edge – a technology that has proven itself in the marketplace but is still new enough that it may be difficult to find knowledgeable personnel to implement or support it.

So the question is – is deduplication leading edge, or is it still bleeding edge?

To understand the answer, we first have to consider that there are actually five stages in the technology lifecycle. These are:

  1. Bleeding edge.
  2. Leading edge.
  3. State of the art.
  4. Dated.
  5. Obsolete.

What we have to consider is this: what happens when a technology exhibits attributes of more than one stage? To me, working in the conservative field of data protection, there’s only one answer: it should be classified by the “least mature” or “most dangerous” stage whose attributes it exhibits.

Thus, deduplication is still bleeding edge.

Why dedupe is still bleeding edge

Clearly there are attributes of deduplication which are leading edge. It has, in field deployments, proven itself to be valuable in particular instances.

However, there are attributes of deduplication which are definitely still bleeding edge. In particular, the distinction for bleeding edge (to again quote from the Wikipedia article on Technology Lifecycle) is that it:

…shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus.

(My emphasis added.)

Clearly, in at least some areas, deduplication has demonstrated its value. My rationale for it still being bleeding edge is the second (and equally important) attribute: I’m not convinced that deduplication has sufficiently settled down into any kind of consensus.

Within deduplication, you can:

  • Dedupe primary data (less common today, though talk about it is growing)
  • Dedupe virtualised systems
  • Dedupe archive/HSM systems (whether literally, or via single instance storage, or a combination thereof)
  • Dedupe NAS
  • For backup:
    • Do source based dedupe:
      • At the file level
      • At a fixed block level
      • At a variable block level (the fixed vs. variable distinction is sketched just after this list)
    • Do target based dedupe:
      • Post-backup (post-process), maintaining two pools of storage: one deduplicated, one normal. The most frequently accessed data is typically kept “hydrated”, while longer term/less frequently accessed data sits in the deduplicated pool.
      • Inline (at ingest), maintaining only one deduplicated pool of storage
    • For long term storage of deduplicated backups:
      • Replicate, maintaining two deduplicated systems
      • Transfer out to tape, usually via rehydration (the slightly better term for “undeduplicating”)
      • Transfer deduped data out to tape “as is”
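
To make the fixed vs. variable block distinction in that list a little more concrete, here is a minimal, purely illustrative Python sketch, assuming toy parameters and a toy boundary test (real products use far more sophisticated rolling hashes and indexes, and this is not any particular vendor’s algorithm). Fixed-size chunking splits a stream at set intervals, so inserting a single byte shifts every subsequent boundary and little data matches; content-defined (variable block) chunking picks boundaries from the data itself, so most chunks still match after the insertion.

    import hashlib

    def fixed_chunks(data, size=8):
        # Split the stream at fixed intervals.
        return [data[i:i + size] for i in range(0, len(data), size)]

    def variable_chunks(data, window=4, mask=0x07):
        # Toy content-defined chunking: cut wherever a hash of the last
        # `window` bytes satisfies a boundary test, so chunk boundaries
        # depend only on local content, not on absolute offsets.
        chunks, start = [], 0
        for i in range(window - 1, len(data)):
            digest = hashlib.sha256(data[i + 1 - window:i + 1]).digest()
            if (digest[0] & mask) == 0:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    def unique_blocks(chunks):
        # Count distinct chunks by their SHA-256 fingerprint.
        return len({hashlib.sha256(c).hexdigest() for c in chunks})

    original = b"the quick brown fox jumps over the lazy dog. " * 4
    shifted = b"X" + original   # a single byte inserted at the front

    print("fixed:   ", unique_blocks(fixed_chunks(original) + fixed_chunks(shifted)))
    print("variable:", unique_blocks(variable_chunks(original) + variable_chunks(shifted)))

Run that and the fixed-block count comes out far higher than the variable-block count, which is precisely why variable (content-defined) chunking tends to deduplicate shifting data such as backups more effectively.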

Does that list look like any real consensus to you?

One comfort we can take from all these disparate dedupe options is that there’s clearly a lot of innovation going on. The fundamentals behind dedupe are also tried and trusted – we use them every time we compress a file or a bunch of files. It’s just scanning for common blocks and reducing the data to the smallest possible amount.
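
For anyone who hasn’t peeked under the hood, a rough sketch of that “scan for common blocks” idea follows. This is a toy Python illustration only, assuming made-up names (DedupedStore, write, read) rather than any particular product’s implementation; real systems also have to worry about fingerprint collisions, indexing at scale and garbage collection, but the core principle is simply fingerprint-and-reference.

    import hashlib

    class DedupedStore:
        # Minimal block-level dedupe: every unique block is stored once,
        # keyed by its fingerprint; "files" become ordered lists of keys.

        def __init__(self, block_size=4096):
            self.block_size = block_size
            self.blocks = {}   # fingerprint -> block data
            self.files = {}    # name -> list of fingerprints

        def write(self, name, data):
            refs = []
            for i in range(0, len(data), self.block_size):
                block = data[i:i + self.block_size]
                key = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(key, block)   # store only unseen blocks
                refs.append(key)
            self.files[name] = refs

        def read(self, name):
            # Reassemble ("rehydrate") the data from its block references.
            return b"".join(self.blocks[key] for key in self.files[name])

    store = DedupedStore(block_size=8)
    store.write("monday.bak", b"ABCDEFGH" * 100)
    store.write("tuesday.bak", b"ABCDEFGH" * 100 + b"new data")
    print(len(store.blocks), "unique blocks stored for two ~800 byte backups")
    assert store.read("tuesday.bak").endswith(b"new data")

Two near-identical “backups” go in, but only a couple of unique blocks come out, which is the entire pitch behind source and target based dedupe alike.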

It’s also an intelligent and logical way of moving forward in storage – we’ve reached a point where both the companies that purchase storage and the vendors that provide it are moving towards using storage more efficiently rather than just continuing to buy more of it. This trend started with the development of SAN and NAS, so dedupe is just the logical continuation of those storage centralisation/virtualisation paths.

More so, the trend towards more intelligent use of technology is not new – consider even recent changes in products from the CPU manufacturers. Taking Intel as a prime example, for years their primary development strategy was “fast, faster, fastest.” That strategy eventually hit a brick wall – it doesn’t matter how fast an individual processor is if you actually need to do multiple things at once. Hence multi-core really hit the mainstream. Previously the preserve of multi-CPU high end workstations and servers, multiple cores are now common in any new computer. (Heck, I have 2 x Quad Core processors in the machine I’m writing this article on. The CPU speeds are technically slower than those of my lab ESX server, but with multi-core, multi-threading, it smacks the ESX server out of the lab every time on performance. It’s more intelligent use of the resources.)

So dedupe is about shifting away from big, bigger, biggest storage to smart, smarter and smartest storage.

We’re certainly not at smartest yet.

We’re probably not even at smarter yet.

As an overall implementation strategy, deduplication is still in its infancy in terms of its actual industry state versus its potential industry state. You can do it on your primary production data, or your virtualised systems, or your archived data, or your secondary NAS data, or your backups, but so far there have been few tangible, usable advances towards being able to use it throughout your entire data lifecycle in a way that is compatible and transparent regardless of the vendor or product in use.

For dedupe to be able to make that leap fully out of bleeding edge territory, it needs to make some inroads into complete data lifecycle deduplication – starting at the primary data level and finishing at backups and archives.

(And even when we can use it through the entire data lifecycle, we’ll still be stuck working out what to do with the deduplicated data once it’s been generated, for longer term storage. Do we replicate between sites? Do we rehydrate to tape, or do we send the deduped data out to tape? Obviously, based on recent articles, I don’t (yet) have much faith in the notion of writing deduped data to tape.)

If you think that there isn’t a choice for long term storage – that it has to be replication, and that dedupe is a “tape killer” – think again. Consider smaller sites with constrained budgets, consider sites that can’t afford dedicated disaster recovery systems, and consider sites that want to limit their energy impact. (That is, sites that understand the difference in energy savings between offsite tapes and MAID for long term data storage.)

So should data protection environments implement dedupe?

You might think, based on my previous comments, that my response to this is going to be a clear-cut no. That’s not quite correct, however. Because dedupe falls into both leading edge and bleeding edge, it is something that can be implemented in specific environments, under specific circumstances.

That is, the suitability of dedupe for an environment can be evaluated on a case by case basis, so long as sites are aware that when implementing dedupe they’re not getting the full promise of the technology, just specific windows onto it. It may be that companies:

  • Need to reduce their backup windows, in which case source-based dedupe could be one option (among many).
  • Need to reduce their overall primary production data, in which case single instance archive is a likely way to go.
  • Need to keep more data available for recovery in VTLs (or for that matter on disk backup units), in which case target based dedupe is the likely way to go.
  • Want to implement more than one of the above, in which case they will be buying disparate technologies that don’t share common architectures or operational management systems.

I’d be mad to say that dedupe is still too immature for any site to consider – yet I’d charge that anyone who says every site should go down a dedupe path, and that every site will get fantastic savings from implementing dedupe, is equally mad.
