Architecture Matters: When you dedupe

There was a time, not all that long ago, when the biggest governing factor in LAN capacity for a datacentre was not the primary production workloads, but the mechanics of getting a full backup from each host over to the backup media. If you’ve been around the data protection industry long enough you’ll have experienced that – for instance, in the datacentres I was involved in, the drive towards 1Gbit networks over Fast Ethernet more often than not started with backup. Likewise, the first systems I saw attached directly to 10Gbit backbones in datacentres were the backup infrastructure.

Well-architected deduplication can eliminate that consideration. That’s not to say you won’t eventually need 10Gbit, 40Gbit or even more in your datacentre, but if deduplication is architected correctly, you won’t need to deploy the next level up of network performance just to meet your backup requirements.

In this blog article I want to take you through an example of why deduplication architecture matters, and I’ll focus on something that amazingly still gets consideration from time to time: post-ingest deduplication.

Before I get started – obviously, Data Domain doesn’t use post-ingest deduplication. Its pre-ingest deduplication ensures the only data written to the appliance is already deduplicated, and it further increases efficiency by pushing deduplication segmentation and processing out to the individual clients (in a NetWorker/Avamar environment) to limit the amount of data flowing across the network.

A post-ingest deduplication architecture, on the other hand, has your protection appliance feature two distinct tiers of storage – the landing or staging tier, and the deduplication tier. That means when it’s time to do a backup, all your clients send all their data across the network to sit, at its original size, on the staging tier:

Post Process Dedupe 01

In the example above we’ve already had backups run to the post-ingest deduplication appliance, so there’s a heap of deduplicated data sitting in the deduplication tier, but the staging tier has just landed all the backups from each of the clients in the environment. (If it were NetWorker writing to the appliance, each of those backups would be a full-sized saveset.)
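
To make that landing zone concrete, here’s a deliberately minimal Python sketch of the state at this point. It’s purely illustrative – the class and field names are mine, not any vendor’s – but it captures the key property: every backup sits on the staging tier at its original size.

```python
# Minimal illustrative model of a post-ingest appliance (assumed names,
# not any product's design): a staging tier of full-size backups and a
# deduplication tier of unique segments keyed by fingerprint.
from dataclasses import dataclass, field


@dataclass
class PostIngestAppliance:
    staging: dict[str, bytes] = field(default_factory=dict)       # backup name -> raw, full-size data
    dedupe_store: dict[str, bytes] = field(default_factory=dict)  # fingerprint -> unique segment

    def ingest(self, name: str, data: bytes) -> None:
        """Every byte the client sends lands on the staging tier untouched."""
        self.staging[name] = data


appliance = PostIngestAppliance()
appliance.ingest("Backup01", b"full-size saveset contents " * 1000)
appliance.ingest("Backup02", b"another full-size saveset " * 1000)

# The network and staging cost is the sum of the *original* backup sizes.
print(sum(len(d) for d in appliance.staging.values()), "bytes on the staging tier")
```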

Now, at some point after the backup completes (usually a preconfigured time), post-processing kicks in. This is effectively a data-migration window in a post-ingest appliance, where all the data in the staging tier has to be read and processed for deduplication. Continuing the example above, we might start by inspecting ‘Backup01’ for commonality with data on the deduplication tier:

Post Process Dedupe 02

So the post-ingest processing engine starts by reading through the entire content of Backup01 and constructing a fingerprint analysis of the data that has landed.

Post Process Dedupe 03
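
Purely as an illustration of that read-and-fingerprint pass, here’s a rough sketch – I’m assuming fixed 8 KiB segments and SHA-256 fingerprints for simplicity; real appliances typically use variable-length segmentation and their own hashing choices.

```python
# Illustrative fingerprinting pass over a staged backup. Segment size and
# hash choice are assumptions for the example only.
import hashlib

SEGMENT_SIZE = 8 * 1024  # assumed fixed segment size


def fingerprint_backup(data: bytes) -> list[tuple[str, bytes]]:
    """Read the staged backup end to end and return (fingerprint, segment) pairs."""
    pairs = []
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        pairs.append((hashlib.sha256(segment).hexdigest(), segment))
    return pairs


# Note the cost: every byte that landed on the staging tier has to be read again.
staged_backup = b"example saveset contents " * 4096
print(len(fingerprint_backup(staged_backup)), "segments fingerprinted")
```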

As fingerprints are assembled, the data can be compared against the data already residing in the deduplication tier. That comparison results in either signature matches, or signature misses – the latter indicating new data that needs to be copied into the deduplication tier.

Post Process Dedupe 04

In this respect it’s similar to regular deduplication – a signature match results in pointers for existing data being updated and extended, while a signature miss means new data has to be stored on the deduplication tier.

Post Process Dedupe 05
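
In code form, the match/miss decision might look something like the sketch below – again, the in-memory index and reference counts are assumptions to illustrate the logic, nothing more.

```python
# Illustrative match/miss handling against the deduplication tier.
# The index and store are simple in-memory stand-ins (assumed structures).
import hashlib

dedupe_index: dict[str, int] = {}    # fingerprint -> reference count
dedupe_store: dict[str, bytes] = {}  # fingerprint -> stored segment


def process_segment(segment: bytes) -> None:
    fp = hashlib.sha256(segment).hexdigest()
    if fp in dedupe_index:
        dedupe_index[fp] += 1        # signature match: only pointers/references change
    else:
        dedupe_index[fp] = 1         # signature miss: new data must be copied
        dedupe_store[fp] = segment   # onto the deduplication tier


for seg in (b"alpha", b"beta", b"alpha"):
    process_segment(seg)

print(len(dedupe_store), "unique segments stored")  # 2 - the repeated 'alpha' matched
```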

Once the first backup file written to the staging tier has been dealt with, we can delete that file from the staging area and move on to the second backup file to start the process all over again. And we keep doing that, over and over, until we’re left with an empty staging tier:

Post Process Dedupe 06
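
Pulling those steps together, the post-process window is essentially the loop sketched below – read a full-size backup off the staging tier, deduplicate it, delete it, repeat. The helper name and the 8 KiB segment size are assumptions for illustration.

```python
# Illustrative post-process loop: deduplicate each staged backup, then
# (and only then) reclaim its staging space. All names are assumed.
import hashlib

SEGMENT_SIZE = 8 * 1024
staging = {"Backup01": b"a" * 50_000, "Backup02": b"b" * 50_000}
dedupe_store: dict[str, bytes] = {}


def post_process(backup: bytes) -> None:
    for offset in range(0, len(backup), SEGMENT_SIZE):
        segment = backup[offset:offset + SEGMENT_SIZE]
        dedupe_store.setdefault(hashlib.sha256(segment).hexdigest(), segment)


for name in list(staging):
    post_process(staging[name])   # read and deduplicate the full-size backup
    del staging[name]             # staging space is only freed after processing

print("staging tier empty:", not staging)
```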

Of course, that’s not the end of the process – then the deduplication tier will have to run its regular housekeeping operations to remove data that’s no longer referenced by anything.
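
Conceptually that housekeeping is a sweep over the segment store for anything no longer referenced – something like the sketch below, where the retained-backup map and fingerprints are made-up examples.

```python
# Illustrative housekeeping pass: remove segments no longer referenced by
# any retained backup. Data here is entirely made up for the example.
dedupe_store = {"fp1": b"...", "fp2": b"...", "fp3": b"..."}
retained_backups = {                 # backups still within retention -> their fingerprints
    "Backup01": ["fp1", "fp3"],
    "Backup02": ["fp3"],
}

referenced = {fp for fps in retained_backups.values() for fp in fps}
for fp in list(dedupe_store):
    if fp not in referenced:         # nothing points at this segment any more
        del dedupe_store[fp]         # reclaim the space during housekeeping

print(sorted(dedupe_store))          # ['fp1', 'fp3'] - 'fp2' has been swept away
```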

Architecturally, post-ingest deduplication is a kazoo to pre-ingest deduplication’s symphony orchestra. Sure, you might technically get to hear the 1812 Overture, but it’s not really going to be the same, right?

Let’s go through where, architecturally, post-ingest deduplication fails you:

  1. The network becomes your bottleneck again. You have to send all your backup data to the appliance.
  2. The staging tier has to have at least as much capacity available as the size of your biggest backup – and that assumes it can complete its post-process deduplication in the window between when one backup finishes and the next one starts. (There’s a rough sizing sketch after this list.)
  3. The deduplication process becomes entirely spindle bound. If you’re using spinning disk, that’s a nightmare. If you’re using SSD, that’s $$$.
  4. There’s no way of telling how much space will be occupied on the deduplication tier after deduplication processing completes. This can lead you into very messy situations where, say, the staging tier can’t empty because the deduplication tier has filled. (Yes, capacity maintenance is still a requirement on pre-ingest deduplication systems, but it’s half the effort.)
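
To put some rough numbers on point 2 – and these are entirely hypothetical figures, just to show the shape of the maths – consider a 40 TB nightly full and an assumed 1 GB/s post-process read rate on the staging tier:

```python
# Back-of-the-envelope sizing for the staging tier. All figures are
# hypothetical assumptions for illustration only.
backup_size_tb = 40            # size of the biggest nightly backup (assumed)
post_process_rate_gbps = 1.0   # GB/s the staging spindles can sustain (assumed)

staging_capacity_tb = backup_size_tb  # at minimum: one full backup's worth of landing space
hours_to_dedupe = (backup_size_tb * 1024) / (post_process_rate_gbps * 3600)

print(f"Staging capacity needed: at least {staging_capacity_tb} TB")
print(f"Post-process window: ~{hours_to_dedupe:.1f} hours before staging can empty")
```

If that window (a bit over 11 hours with these made-up numbers) doesn’t fit between the end of one backup and the start of the next, the staging tier simply never empties.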

What this means is simple: post-ingest deduplication architectures are asking you to pay for their architectural inefficiencies. That’s where:

  1. You have to pay to increase your network bandwidth to get a complete copy of your data from client to protection storage within your backup window.
  2. You have to pay for both the staging tier storage and the deduplication tier storage. (In fact, the staging tier often has to be a lot bigger than your biggest backup in a 24-hour window so the deduplication can be handled in time.)
  3. You have to factor the additional housekeeping operations into blackout windows, outages, etc. Housekeeping almost invariably becomes a daily rather than a weekly task, too.

Compare all that to pre-ingest deduplication:

Pre-Ingest Deduplication

Using pre-ingest deduplication, especially Boost based deduplication, the segmentation and hashing happen directly where the data is, and rather than sending all the data to be protected from the client to the Data Domain, we only send the unique data. Data that already resides on the Data Domain? All we’ll have sent is a tiny fingerprint so the Data Domain can confirm it’s already there (and update its pointers for the existing data), before moving on. After your first backup, that potentially means that on a day-to-day basis your network requirements for backup are reduced by 95% or more.
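
As a rough sketch of that idea – an illustration of the concept only, not the actual Boost protocol; the segment size, function names and figures are all assumptions – the client-side flow looks something like this:

```python
# Illustrative client-side (pre-ingest) deduplication: segment and fingerprint
# locally, check which fingerprints the appliance has never seen, and send
# only those segments. Names and sizes are assumptions, not the Boost protocol.
import hashlib
import random

SEGMENT_SIZE = 8 * 1024
appliance_index: set[str] = set()    # fingerprints already held on protection storage


def client_backup(data: bytes) -> int:
    """Return the number of segment bytes that actually cross the network."""
    sent = 0
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in appliance_index:    # appliance has never seen this segment
            appliance_index.add(fp)      # so the segment itself is shipped
            sent += len(segment)
        # on a match, only the tiny fingerprint crossed the network
    return sent


random.seed(0)
dataset = random.randbytes(1_000_000)    # tonight's client data (stand-in)
print("first backup sent :", client_backup(dataset), "bytes")
print("second backup sent:", client_backup(dataset), "bytes")  # unchanged data -> 0
```

Run against unchanged data, the second backup ships next to nothing over the wire – which is exactly the day-to-day reduction described above.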

That’s why architecture matters: you’re either doing it right, or you’re paying the price for someone else’s inefficiency.


If you want to see more about how a well-architected backup environment looks – technology, people and processes – check out my book, Data Protection: Ensuring Data Availability.

2 thoughts on “Architecture Matters: When you dedupe”

  1. So when using CloudBoost 2.1 with NetWorker, is this performing a local dedupe or a global dedupe, similar to Data Domain?

    1. It’s very similar in process to Data Domain. The deduplication is global for the CloudBoost appliance, with the CloudBoost version of distributed deduplication pushed out to each client. For NetWorker 9.1 that applies to Linux clients; for NetWorker 9.2 it will apply to Windows clients as well.
