There’s no doubt that we have to get smarter about storage. While I’m probably somewhat excessive in my personal storage requirements, I currently have 13TB of storage attached to my desktop machine alone. If I can do that at the desktop, think of what it means at the server level…
As disk capacities continue to increase, we have to work more towards intelligent use of storage rather than continuing the practice of just bolting on extra TBs whenever we want because it’s “easier”.
One of the things we can do to manage storage requirements more intelligently – for both operational and support production systems – is to deploy deduplication where it makes sense.
That said, the real merits of target based deduplication become most apparent when we compare it to source based deduplication, and that comparison is where the majority of this article will take us.
A lot of people are really excited about source level deduplication, but like so many areas in backup, it’s not a magic bullet. In particular, I see proponents of source based deduplication start waving magic wands consisting of:
- “It will reduce the amount of data you transmit across the network!”
- “It’s good for WAN backups!”
- “Your total backup storage is much smaller!”
While each of these claims is true, they all come with big “buts”. From the outset, I don’t want it said that I’m vehemently opposed to source based deduplication; however, I will say that target based deduplication often has greater merits.
For the first item, this shouldn’t automatically be seen as a glowing recommendation. It only really counts as an advantage if the network is a primary bottleneck – and that’s far more likely to be the case when doing WAN based backups as opposed to regular backups.
In regular backups, while there may be some benefit to reducing the amount of data transmitted, what you’re often not told is that this reduction comes at a cost – namely, increased processor and/or memory load on the clients. Source based deduplication naturally has to shift some of the processing load back onto the client; otherwise the data would be transmitted and then thrown away, and proponents couldn’t argue that you transmit less data in the first place.
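To make that concrete, here’s a minimal sketch of the kind of work a source deduplicating client has to do before anything crosses the wire. This is purely illustrative – it isn’t any vendor’s actual protocol, and the `server_has_chunk`/`send_chunk` callbacks are hypothetical stand-ins for the real client/server exchange:

```python
import hashlib

# Fixed-size chunks keep the sketch simple; real products typically use
# variable-length chunking to improve match rates.
CHUNK_SIZE = 64 * 1024

def backup_file(path, server_has_chunk, send_chunk):
    """Hash each chunk locally and transmit only the chunks the backup
    server hasn't already stored; for known chunks, only the fingerprint
    crosses the network."""
    sent = skipped = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # This hashing is where the extra CPU cost lands on the client.
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if server_has_chunk(fingerprint):   # hypothetical index lookup
                skipped += 1
            else:
                send_chunk(fingerprint, chunk)  # hypothetical transmit call
                sent += 1
    return sent, skipped
```

Note that every chunk gets hashed on the client whether or not it’s ultimately sent – which is exactly where that extra client side load comes from.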
So number one, if someone is blithely telling you that you’ll push less data across your network, ask yourself the following questions:
(a) Do I really need to push less data across the network? (I.e., is the network the bottleneck at all?)
(b) Can my clients sustain a 10% to 15% load increase in processing requirements during backup activities?
This makes the first advantage of source based deduplication somewhat less tangible than it’s normally made out to be.
Onto the second proposed advantage of source based deduplication – faster WAN based backups. Undoubtedly this is true, since we don’t have to ship anywhere near as much data across the network. However, consider that we back up in order to recover. You may be able to reduce the amount of data you send across the WAN to back up, but unless you plan very carefully you may put yourself into a situation where recoveries aren’t all that useful. That is, you need to be careful to avoid trickle based recoveries. This often means it’s necessary to put a source based deduplication node in each WAN connected site, with those nodes replicating to a central location. What’s the problem with this? Well, none from a recovery perspective – but it can considerably blow out the cost. Again, informed decisions are very important to counter-balance source based deduplication hyperbole.
Finally – “your total backup storage is much smaller!”. This is true, but it’s equally an advantage of target based deduplication; while the deduplication rates may vary between the two approaches, the savings are substantial either way.
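As a back of the envelope illustration only (the figures below are assumptions, not benchmarks), the storage saving works out the same way wherever the deduplication actually happens:

```python
# Illustrative numbers only -- real deduplication ratios vary enormously with
# data type, change rate and retention. The point is simply that the saving
# accrues whether the dedup is done at the source or at the target.
logical_tb = 10 * 12          # e.g. 10 TB of fulls retained for 12 weeks
dedup_ratio = 15              # assumed 15:1 over the retention period
physical_tb = logical_tb / dedup_ratio
print(f"{logical_tb} TB logical -> {physical_tb:.0f} TB physical on disk")
# 120 TB logical -> 8 TB physical on disk
```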
Now let’s look at a couple of other factors of source based deduplication that aren’t always discussed:
- Depending on the product you choose, you may get less OS and database support than you’re getting from your current backup product.
- The backup processes and clients will change. Sometimes quite considerably, depending on whether your vendor supports integration of deduplication backup with your current backup environment, or whether you need to change the product entirely.
It’s when we look at those two concerns that target based deduplication really starts to shine. You still get deduplication, but with significantly less disruption to your environment and your processes.
Regardless of whether target based deduplication is integrated into the backup environment as a VTL, or whether it’s integrated as a traditional backup to disk device, you’re not changing how the clients work. That means whatever operating systems and databases you’re currently backing up, you’ll be able to continue to back up, and you won’t end up in the (rather unpleasant) situation of having different products for different parts of your backup environment. That’s hardly a holistic approach. With source based deduplication it may also turn out that the hosts where you’d get the most out of deduplication aren’t even eligible for it – again, something that won’t happen with target based deduplication.
The changes required to integrate target based deduplication into your environment are quite small – you just change where you’re sending your backups, and let the device(s) handle the deduplication, regardless of what operating system, database, application or type of data is being sent. Now that’s seamless.
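For contrast with the earlier client side sketch, here’s an equally simplified view of what an inline deduplicating target does: the client streams its backup unchanged, and the fingerprinting happens on arrival. Again, this is a sketch under my own assumptions, not a description of any particular VTL or appliance:

```python
import hashlib

class DedupTarget:
    """Sketch of inline target side dedup: the device fingerprints incoming
    data and stores only chunks it hasn't seen before, while the client
    simply streams its backup as it always has."""

    CHUNK_SIZE = 64 * 1024

    def __init__(self):
        self.chunk_store = {}   # fingerprint -> chunk data (stored once)
        self.recipes = {}       # backup name -> ordered list of fingerprints

    def ingest(self, backup_name, stream):
        recipe = []
        while True:
            chunk = stream.read(self.CHUNK_SIZE)
            if not chunk:
                break
            # Here the hashing cost lands on the device, not the client.
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunk_store.setdefault(fp, chunk)  # store only if new
            recipe.append(fp)
        self.recipes[backup_name] = recipe

    def restore(self, backup_name):
        # Reassemble the original stream from the stored chunks.
        return b"".join(self.chunk_store[fp] for fp in self.recipes[backup_name])
```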
Equally so, you don’t need to change your backup processes for your current clients – if it’s not broken, don’t fix it, as the saying goes. While this can be seen by some as an argument for stagnation, it’s not; change for the sake of change is not always appropriate, whereas predictability and reliability are very important factors to consider in a data protection environment.
Overall, I prefer target based deduplication. It integrates better with existing backup products, reduces the number of changes required, and does not place restrictions on the data you’re currently backing up.
What a coincidence – just yesterday I went to a Data Domain & DR presentation.
I’m a total fan of target based deduplication. It’s fast and very reliable. We did not have a single backup error resulting from the Data Domain – something not everyone can say about their tape libraries.
Of course, there can be problems. One person I spoke to had a considerably larger IT environment, and they ran into bottlenecks when writing from many servers to the DD or writing to a DD over a firewall (CPU usage went up, FW went down ;-)). And they’re expensive.
Source deduplication seems tempting for remote offices, but I was told you can’t recover in “dedupe mode”, which would probably totally blow up your recovery plans…
I wonder if the person you spoke to was using a previous generation Data Domain product? The current ones, I’m told, offer massively higher performance.
Of course, the other aspect is whether the Data Domain box was sufficiently specced for the workload it was under. A common mistake with deduplication solutions is to aim for “just good enough” or “hopefully enough”; while we can normally get away with that for traditional solutions for a while, once deduplication is involved it’s very important not to compromise on the specification of the solution.
If I remember correctly, they were using a DD 565 (as do we), which can get up to 80MB/s of throughput. That’s of course nowhere near the performance a 6xx of today would get.
Whether this was due to insufficient testing or unexpected growth, I do not know.
Anyways, I’m highly satisfied with our DDs and I’m looking forward to the fruits of their recent takeover by EMC 🙂
Excellent article, Preston! One advantage of source based dedup not mentioned is the backup window. If clients are dedup’ing, wouldn’t that significantly decrease the backup window and make it possible to serve a larger number of clients with a smaller backup environment?
Best Regards,
Vic
Hi Vic,
True, the backup window is decreased on a per client basis, but typically this also brings a big spike in CPU utilisation on the client during the backup. It also relies on “regularity” – small delta changes. (If you are dumping large amounts of fresh data onto a host each day, source based backups will suddenly struggle quite a bit.)
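As a purely hypothetical illustration (the change rates below are assumptions, not measurements), what a source deduplicating client sends is dominated by how much genuinely new data it sees each day:

```python
# Rough, illustrative arithmetic only: a source-dedup client's transmitted
# volume tracks its daily rate of new, unique data, not its total size.
host_tb = 2.0                  # assumed total data on the host
steady_state_change = 0.02     # assumed ~2% new unique data per day
fresh_data_dump_tb = 0.5       # e.g. a large new dataset copied onto the host

normal_day_gb = host_tb * steady_state_change * 1024
bad_day_gb = (host_tb * steady_state_change + fresh_data_dump_tb) * 1024
print(f"Typical day: ~{normal_day_gb:.0f} GB sent; "
      f"after a big data dump: ~{bad_day_gb:.0f} GB")
# Typical day: ~41 GB sent; after a big data dump: ~553 GB
```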
As to whether you get a “smaller” backup environment, the jury is still out on that one. If we look at Avamar as an example, you can very easily build up to a large RAIN – the idea of protecting 30 hosts with 20 backup servers (even though logically they’re the same system, as part of the RAIN) doesn’t really qualify as a “smaller” backup environment to me. Again looking at Avamar, if you also need offsite protection, you then have to replicate, which means a similar environment somewhere else again.
NOTE: I am *not* saying that you need a 20 node RAIN for 30 servers – I’m using that as an example. Clearly, based on sizing charts I’ve been through, it’s actually possible depending on the amount of data you have on those servers; I’m just not saying it’s a guaranteed fact, just an example…
Cheers,
Preston.