5 more considerations for deduplication

Recently I wrote "7 common problems with deduplication". That covered some of the practicalities you need to be aware of, but it wasn't a definitive list, and I want to expand on it a little with this post.

These are:

  1. Architecture – How will it fit together?
  2. Rehydration – Can your pipe accommodate the data?
  3. Redundancy – Are you putting all your eggs in the one basket?
  4. Replicas – How will your copies be handled and recognised by the server?
  5. Long term storage – What is your strategy for longer-term backups?

Each of these includes factors you have to consider before you go ahead with data deduplication in an environment, and I'll go through each one individually.

Architecture

If we look at NetWorker and target-based deduplication, we run into an interesting architectural issue: the way NetWorker multiplexes savesets has a direct impact on how well the datastream compresses and deduplicates. In particular, all VTL-based deduplication devices should be configured so that each virtual drive has both target sessions and max sessions set to 1.
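As a sketch, those per-drive session settings can be adjusted through nsradmin – the attribute names are the standard NSR device resource attributes, but the device path below is made up for illustration:

```
nsradmin> . type: NSR device; name: rd=storagenode:/dev/vtl/drive01
nsradmin> update target sessions: 1; max sessions: 1
```

Repeat for each virtual drive presented by the VTL, or script it if the VTL presents dozens of drives.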

In a conventional tape or backup-to-disk environment, it's common to see configurations where 4 or more sessions are streamed to each device. For physical tape this is partly about the need to keep drives streaming, but it's also about avoiding a backlog of pending savesets – i.e., keeping the backup window as narrow as possible.

If we move instead to an architecture with a 1:1 ratio of streams to virtual drives, the logical solution is to increase the number of virtual drives. Typically I'd suggest at least a 4:1 ratio of virtual drives to physical drives when a VTL is replacing a PTL – i.e., if you had 4 physical drives, you'll be configuring a VTL with at least 16 virtual drives.

However, NetWorker licensing has an odd effect here. VTLs will either get 'real' VTL licenses if they're of a particular EMC brand, or an alternate VTL license bundle, which grants 3 x Unlimited Autochanger licenses per X TB presented by the VTL.

Neither of those licenses is the issue – the issue is actually NetWorker's limitation on the number of devices per storage node or server. For NetWorker Network Edition, you're entitled to:

  • 16 devices on the server;
  • 16 devices on each storage node.

For NetWorker Power Edition, you're entitled to:

  • 32 devices on the server;
  • 32 devices on each storage node.

That’s all well and good for physical tape environments – but once you go virtual, those limitations can get very tight, very quickly. (Hint, EMC: Those limitations should be doubled or quadrupled, please.)

The net effect is that if you have, say, a 4-drive PTL and a 16-drive VTL on a single server with no storage nodes, you'll need to do one of the following:

  • Upgrade from Network Edition to Power Edition, or
  • Purchase an additional storage node license to ‘stack on’ an extra 16 devices.

Yes – you can purchase add-on storage node licenses to increase the permitted device count within the environment, without adding an actual storage node. This is handy to know in normal situations, but when it comes to deduplicating VTLs in particular, it's a must.
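The device-count arithmetic for the scenario above can be sketched as follows – the drive counts and the 16-device entitlement match the Network Edition example, and the ceiling-division licence model is an assumption for illustration:

```python
# Does a mixed PTL + VTL configuration fit within NetWorker Network
# Edition device entitlements (16 devices per server/storage node)?
PHYSICAL_DRIVES = 4                    # existing 4-drive PTL
VIRTUAL_DRIVES = PHYSICAL_DRIVES * 4   # 4:1 virtual:physical rule of thumb
DEVICES_PER_ENTITLEMENT = 16           # Network Edition limit

total_devices = PHYSICAL_DRIVES + VIRTUAL_DRIVES
# Ceiling division: how many 16-device entitlements cover the total?
entitlements_needed = -(-total_devices // DEVICES_PER_ENTITLEMENT)
extra_licences = entitlements_needed - 1   # beyond the base server entitlement

print(total_devices)    # 20 devices in total
print(extra_licences)   # 1 -> one add-on storage node licence (or Power Edition)
```

With 20 devices against a 16-device base entitlement, one add-on storage node licence (or the Power Edition upgrade) covers the shortfall.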

Rehydration

It’s all very well to have a fabulous deduplication ratio. Let’s say you’re achieving 10:1 or something along those lines. However, we don’t just deal in deduplicated data. At some point, that data is going to have to be rehydrated. Typically this’ll be for one of the following:

  • As part of a recovery, or
  • For tape-out functionality.

In either case, you're no longer concerned with the deduplication ratio you've achieved, but with the amount of rehydrated data you'll be streaming out. One immediate consideration: if you've deployed deduplication backups for branch-office scenarios, and you've been loving the 'trickle' effect of only sending unique data across the WAN, you're going to be somewhat less enamoured when you have to send the entire data stream, rehydrated, back across the WAN.

Unless, of course, you’ve architected for that situation.
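To make the WAN asymmetry concrete, here's a rough sketch – every figure in it (data size, deduplication ratio, link speed) is assumed purely for illustration:

```python
# Branch-office scenario: backup trickles deduplicated data out,
# but a full restore sends the lot back, rehydrated.
data_gb = 500        # assumed protected data at the branch office
dedup_ratio = 10     # assumed 10:1 deduplication ratio
wan_mbit = 20        # assumed WAN link speed, megabits per second

backup_gb = data_gb / dedup_ratio   # roughly what crosses the WAN per full backup

def hours(gb, mbit):
    # GB -> megabits, divided by link speed, converted to hours
    return gb * 8 * 1024 / mbit / 3600

print(round(hours(backup_gb, wan_mbit), 1))   # deduplicated transfer time
print(round(hours(data_gb, wan_mbit), 1))     # full rehydrated restore time
```

At these assumed figures the backup trickle takes under 6 hours, but the rehydrated restore takes more than two full days – which is exactly the situation you have to architect for.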

If you're doing tape-out – either cloning or staging – you still need to factor the actual rehydrated size into any sizing calculations for a physical tape library. In particular, a common mistake I'm seeing is people thinking that by implementing deduplication they can substantially reduce the number of physical tape drives in the environment. As a general rule of thumb for most sites, a reduction of between one quarter and one third of the physical devices is the most you can hope to achieve. If you pull out more than that, you're likely to suffer serious contention during tape-out operations – and you'll be totally blown out of the water whenever there's a physical fault.
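The tape-out maths can be sketched like so – the nightly volume and per-drive throughput are assumed figures, not measurements:

```python
# Back-of-envelope: nightly tape-out duration as physical drives are removed,
# showing why cutting more than ~1/4 to 1/3 of drives hurts.
rehydrated_tb = 8    # assumed nightly clone/stage volume, rehydrated
drive_mb_s = 120     # assumed sustained throughput per physical tape drive

def tapeout_hours(drives):
    # TB -> MB, divided across drives, converted to hours
    return rehydrated_tb * 1024 * 1024 / (drives * drive_mb_s) / 3600

print(round(tapeout_hours(4), 1))   # original 4-drive PTL
print(round(tapeout_hours(3), 1))   # one quarter removed: still workable
print(round(tapeout_hours(1), 1))   # aggressive cut: blows the window entirely
```

Note that these figures assume every drive streams flat-out with no contention; in practice a single drive failure in the cut-down configuration makes the picture far worse.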

Redundancy

Deduplication should never be deployed on its own – e.g., you can't just have a single Avamar RAIN or a single target deduplication unit. That's putting all your eggs in the one basket. You need some form of atomic-unit redundancy, be that a second grid you replicate to, a second Data Domain you replicate to, or tape-out.

I've heard of solutions deployed with a single Avamar RAIN – just a few nodes in the grid – with no tape-out and no replication to another site. I personally think that's data-suicide. Sure, any individual node in a RAIN can fail and the grid will continue, but you've still got the fundamental problem: what happens if you lose the grid itself?

The same applies to target-based deduplication. For ease of consideration, treat any deduplication deployment – be it Avamar, Data Domain, Quantum, FalconStor or anything else – as a single atomic unit per physical location. And if, by that definition, you've only got one unit, you've got insufficient redundancy.

Replicas

In particular with target-based deduplication, if you're using the replication functionality of the deduplication device (to avoid a NetWorker clone having to rehydrate and re-deduplicate the data), you introduce a new challenge: how do you get NetWorker to actually know about the replicas? Items for consideration here are:

  1. Can both replicas be online at the same time? I.e., does the deduplication environment support this?
  2. Will NetWorker perceive the replicas as the same physical media? I.e., do the replicas have the same volume ID? If so, NetWorker won’t permit them to be mounted in two different locations at once.
  3. How ‘atomically’ can replicas be brought online? If replicas do have the same volume ID, what is the smallest replica that can be brought online? Typically this will be either a single virtual tape, or a single disk backup unit. For virtual tapes, that’ll be more manageable. For disk backup units, it presents more of a problem.
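The volume-ID clash in point 2 can be illustrated conceptually – the volume names and IDs below are entirely made up, and this is a model of the constraint, not of any NetWorker API:

```python
# Conceptual sketch: NetWorker tracks media by volume ID, so a replica
# carrying the same volume ID as its source can't be mounted alongside it.
primary = {"DDVTL001": "vol-8f21", "DDVTL002": "vol-9a04"}    # name -> volume ID
replica = {"DDVTL001.R": "vol-8f21", "DDVTL003.R": "vol-c777"}

primary_ids = set(primary.values())
clashes = [name for name, volid in replica.items() if volid in primary_ids]
print(clashes)   # replicas that can't be online while their source is mounted
```

Only the replica whose volume ID differs from every primary volume could be brought online simultaneously; the clashing one has to wait until its source is unmounted.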

Newer technology such as DD Boost, which integrates NetWorker's cloning facilities with the inherent replication capabilities of the hardware, addresses this issue. If you're not using DD Boost, though, you need to come up with your own solution.

Long Term Storage

Want deduplication? Want enough deduplication to handle 7 years of backups? 10 years? 15 years? 'Forever' years? Long-term storage can't be left by the wayside; you have to plan and architect it into your solution.

Some deduplication vendors (EMC included) are starting to tout archive credentials for their deduplication arrays, but to be perfectly frank, the long-term cost of maintaining large amounts of spinning (or partially spun-down) disk holding deduplicated storage, versus a batch of tapes holding rehydrated storage, is still not at a point many businesses can entertain. Tape is, and will continue to be, cheap for long-term and archival storage. Anyone who tries to tell you otherwise likely has a vested interest in dropping more storage on your datacentre floor.

When planning for longer-term storage in a deduplication environment, you have to make a few decisions in advance:

  • Do longer term backups go direct to tape (or conventional disk staging areas) instead of ever hitting deduplicated storage?
  • If the longer-term backups do sit on deduplicated storage, what will be the additional size requirements?
  • Are those size requirements worth it? E.g., if you have to buy a unit with an additional 20TB of deduplication capacity in order to hold all the long-term backups you want to keep 'nearline', is it actually worth it, given the data will eventually be staged out or relocated to longer-term storage anyway – or do you go for a cheaper initial storage option as well?
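The size-requirement question can be roughed out numerically – every figure here (monthly full size, long-term dedup ratio, per-TB prices) is an assumption for illustration, not a quote:

```python
# Holding 7 years of monthly fulls on deduplicated disk vs rehydrated tape.
monthly_full_tb = 10       # assumed size of one monthly full backup
months = 7 * 12            # 7-year retention of monthlies
dedup_ratio = 8            # assumed long-term dedup ratio across the monthlies
disk_cost_per_tb = 400     # assumed deduplicated-disk cost, $/TB
tape_cost_per_tb = 30      # assumed tape media cost, $/TB

dedup_tb = monthly_full_tb * months / dedup_ratio   # capacity needed on dedup disk
tape_tb = monthly_full_tb * months                  # rehydrated capacity on tape

print(round(dedup_tb, 1), round(dedup_tb * disk_cost_per_tb))   # disk TB, disk $
print(tape_tb, tape_tb * tape_cost_per_tb)                      # tape TB, tape $
```

Even with a healthy long-term dedup ratio doing the heavy lifting on capacity, the media cost at these assumed prices still favours tape – which is the point about long-term storage above.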

Summing up

Between this and other articles, one might think I'm actually against deduplication. I'm not. However, I am dead-set against the misuse of technology. Wasteful spending, particularly in the backup environment, just leads to bigger issues – such as artificial and inaccurate budgetary constraints at a later point in time.

When it comes to deduplication, I guess there can only be one rule: eyes wide open.

2 thoughts on “5 more considerations for deduplication”

  1. Hello Preston,

    We are trying to back up our Active Directory server using NMM 2.4, and I want to use the GLR feature.
    I was going through the NMM guide but don't understand how to back up the entire AD.
    The doc says we have to specify the save set in the form:

    CN=,OU=,DC=,DC=suffix
    For example:
    CN=testuser1,OU=OU1,DC=corp,DC=xyz,DC=com
    Where the backup saves the entire domain named corp.xyz.com from its root
    level.

    But if we specify something like this, will it back up the entire AD or just testuser1?
    My question is: how do we back up the entire AD using the GLR feature?
    What should be the syntax for backing up the entire AD? (Assume our domain name is test.com.)

    Regards,
    Mangs
