Data isn’t data isn’t data

An integral part of effective data protection is data awareness. You can’t adequately protect what you don’t know about, and equally, you can’t adequately protect what you don’t understand. Understanding what sort of data you have is critical to understanding how you can protect it – and, even more so from a business perspective, how much you may need to spend in order to protect it.


As the title says, Data isn’t Data isn’t Data.

I find this most striking in organisations whose data protection solutions have developed organically over time (probably since the company was either quite small or operationally quite informal) and which are now looking at making major, hopefully far-reaching changes to their data protection strategy.

The scenario works like this: the company asks for proposals on a holistic data protection strategy, and tells prospective bidders all about where its data is, what operating systems the data sits on, and usually even what the link speeds are between its sites – but it provides little detail about the type of data involved. By type, I mean:

  • What percentage of the data is traditional database;
  • What percentage is traditional file/operating system;
  • What percentage is NAS;
  • What percentage is virtual machine images;
  • What percentage of each must be sent or stored in an encrypted format;
  • and so on.

At one time, that information wasn’t necessarily all that relevant: if everything was being sent to tape, the biggest headaches came from whether or not there were particularly dense filesystems. (You can’t stream tape backups over WAN-speed links, so you’d typically not care about link speeds so long as you could deploy sufficient tape infrastructure in each required location.) If data was already compressed or already encrypted before it was backed up, that might reduce the compression ratio achieved on individual tapes, but what’s a few tapes here and there?*

As data protection gets smarter and more efficient, though, this sort of information becomes just as important to understanding what’s involved in protecting an environment as the more traditional questions are.

Consider for instance a company that wants to protect 70TB of data using deduplication storage, so as to minimise the protection footprint and gain the most efficiency out of a disk-based backup strategy. The typical starting questions you’d need to answer for a backup and recovery environment might be, say:

  • How long do you want to keep your daily/weekly backups for?
  • How long do you want to keep monthly fulls for?
  • Do you need long term retention for yearlies or other backups?

For the purposes of simplicity, let’s stick to just those first two questions and provide some basic answers to work with:

  • Daily incrementals and weekly fulls to be kept for 6 weeks
  • Monthly backups to be kept for 12 months

We’ll also assume all data is in one location. In the old world of tape, the above would have been enough information to come up with an approximate configuration to meet the backup capacity requirements for the environment. (Noted: it would certainly not have been enough to determine speed requirements.)

But if you want to take advantage of deduplication, data isn’t data isn’t data. Knowing that you have 70TB of data doesn’t allow anyone to make any reliable recommendations about what sort of protection storage you might need if your intent is to drop tape and move to more efficient formats. OK, let’s start providing a few more details and see what happens.

Let’s say you’re told:

  • 70 TB of data
  • Weekly fulls retained for 6 weeks
  • Daily incrementals retained for 6 weeks
  • Monthly fulls retained for 12 months
  • 3.19% average daily change rate

If you’re just going to be backing up to tape or plain disk, this now gives you enough information to have a stab at a potential capacity, which would start with:

Size ~= 70 TB x 6 (weekly fulls) + 70 TB x 12 (monthly fulls) + (70 TB x 3.19% x 36 incrementals)

Size ~= 420 TB + 840 TB + 80.388 TB

Size ~= 1340.388 TB

But is that accurate? Well, no: we don’t have enough information to properly understand the environment. Is it possible, for instance, to work out how much deduplication storage you might need to provide protection for 1340.388TB of backups? What’s the ‘average’ deduplication ratio for any data, regardless of what it is? (Hint: there’s no such thing.)

Coming back to the original point of the article, data isn’t data isn’t data. So let’s start breaking this out a little more and see what happens. That 70TB of data becomes:

  • 10 TB Files
  • 5 TB Databases
  • 5 TB Mail
  • 50 TB VMware

Let’s also assume that because we now know the data types, we also know the per-type change rate rather than relying on an average change rate, and so we actually have:

  • 10 TB Files at 1.75% daily change
  • 5 TB Databases at 6% daily change
  • 5 TB Mail at 3% daily change
  • 50 TB VMware at 2% daily change (change within the VMs, not of the individual container files – which, of course, normally show 100% change)

A few things to note here:

  • I’m not suggesting the above change rates are real-world; I’ve just shoved them into a spreadsheet as examples.
  • I’m not factoring in the amount of the same content that changes each day vs unique content that changes each day**.

At this point, if we’re still sizing for either tape or conventional disk, we can come up with a more accurate storage capacity requirement. Based on those figures, our actual required capacity comes down from 1340.388 TB to 1318.50 TB – not a substantial difference, but a difference nonetheless. (The quality and accuracy of your calculation always depends on the quality and accuracy of your data, after all.)
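
To make the arithmetic concrete, here’s a minimal sketch in Python of that per-type calculation. It assumes the retention described above – 6 weekly fulls, 12 monthly fulls and 36 daily incrementals for each data type – and the structure and names are mine, purely for illustration:

# Per-type capacity sketch, no deduplication. Assumes 6 weekly fulls,
# 12 monthly fulls and 36 daily incrementals retained per data type.
WEEKLY_FULLS = 6
MONTHLY_FULLS = 12
INCREMENTALS = 36

# (size in TB, daily change rate) per data type - example figures only
data_types = {
    "Files":    (10, 0.0175),
    "Database": (5,  0.06),
    "Mail":     (5,  0.03),
    "VMware":   (50, 0.02),
}

total = 0.0
for name, (size_tb, change_rate) in data_types.items():
    # every full is stored at full size; each incremental is the daily change
    capacity = size_tb * (WEEKLY_FULLS + MONTHLY_FULLS) + size_tb * change_rate * INCREMENTALS
    total += capacity
    print(f"{name}: {capacity:.2f} TB")

print(f"Total: {total:.2f} TB")   # ~1318.50 TB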

If we assumed a flat deduplication rate we might now have enough data to come up with a sizing for deduplication storage, but in reality there are at least three deduplication ratios you want to consider, notably:

  • Deduplication achieved from first full backup
  • Deduplication achieved from subsequent full backups
  • Deduplication achieved for incremental backups

In reality it’s more complex than that – again coming back to the rate of unique vs non-unique change within the environment. And returning to data isn’t data isn’t data, those ratios will be different for each data type.

So let’s come up with some basic deduplication ratios – again, I’m just pulling numbers out of my head and these should in no way be seen as ‘average’. Let’s assume the following:

  • File backups have a first full dedupe of 4x, a subsequent full dedupe of 6x, and an incremental dedupe of 3x
  • Database backups have a first full dedupe of 2.5x, a subsequent full dedupe of 3x, and an incremental dedupe of 1.5x
  • Mail backups have a first full dedupe of 3x, a subsequent full dedupe of 4x, and an incremental dedupe of 2.5x
  • VMware backups have a first full dedupe of 6x, a subsequent full dedupe of 12x, and an incremental dedupe of 6x

If we plug those into a basic spreadsheet (given I still count on my fingers), we might see a sizing and capacity requirement something like the following – there’s a rough sketch of the calculation after the list:

  • Files – 32.93 TB
  • Database – 37.53 TB
  • Mail – 25.08 TB
  • VMware – 85.17 TB
  • Total – 180.71 TB
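
Those figures fall out of a pretty simple model. Here’s a minimal sketch of it in Python, assuming each data type keeps 18 fulls (6 weekly plus 12 monthly), with the first landing at the first-full ratio and the remaining 17 at the subsequent-full ratio, plus 36 daily incrementals at the incremental ratio. The model and layout are mine and the numbers are the made-up examples above, so treat it as illustrative only, not a sizing tool:

# Deduplicated capacity sketch. Assumes 18 retained fulls per data type
# (1 first full + 17 subsequent fulls) and 36 daily incrementals.
FULLS = 18
INCREMENTALS = 36

# per type: (size TB, daily change, first-full, subsequent-full, incremental dedupe ratios)
data_types = {
    "Files":    (10, 0.0175, 4.0, 6.0,  3.0),
    "Database": (5,  0.06,   2.5, 3.0,  1.5),
    "Mail":     (5,  0.03,   3.0, 4.0,  2.5),
    "VMware":   (50, 0.02,   6.0, 12.0, 6.0),
}

total = 0.0
for name, (size, change, first, subsequent, incremental) in data_types.items():
    capacity = (size / first                                   # first full
                + (FULLS - 1) * size / subsequent              # remaining 17 fulls
                + INCREMENTALS * size * change / incremental)  # daily incrementals
    total += capacity
    print(f"{name}: {capacity:.2f} TB")

print(f"Total: {total:.2f} TB")   # ~180.71 TB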

It’s here that you need to be aware of any gotchas. What happens, for instance, if an environment has some sort of high-security requirement for file storage, and all files on the fileservers are encrypted before being written to disk? In that scenario, the backup product would be dealing with 10 TB of data that won’t deduplicate at all – for the first full, the subsequent fulls and the incrementals alike – giving a 1:1 storage requirement for those backups. Our file backup storage would then require 186.3 TB of backup capacity (vs 32.93 TB above), bringing the total storage with deduplication to 334.08 TB.
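
(If you want to model that in the sketch above, setting all three Files deduplication ratios to 1.0 is enough: the Files line jumps from 32.93 TB to 186.3 TB – the same as the undeduplicated figure of 10 TB x 18 fulls plus 6.3 TB of incrementals – and the total climbs from 180.71 TB to 334.08 TB.)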

The example I’ve given is pretty simplistic, and in no way exhaustive, but it should start to explain why the old way of specifying nothing more than how much data you have just doesn’t cut it any more. Examples of where the above would need further clarification include:

  • What is the breakdown between virtual machines hosting regular data and database data? (increasingly important as virtualisation loads increase)
  • For each dataset, would there be any data that’s already compressed, already encrypted, or some form of multimedia? (10TB of Word documents will have a completely different storage profile to 10TB of MP4 files, for instance.)

And then, of course, as we look at multi-site environments, it’s then important to understand:

  • What is the breakdown of data per site?
  • What is the link speed between each site?

And this is just for sizing. For performance it’s obviously important to understand much more – recovery time objectives, recovery point objectives, frequency of recoveries, backup windows, and so on – but this brings us back to the title of the article:

Data isn’t data isn’t data.

So if you’re reaching that point where you are perhaps considering deduplication for the first time, remember to get your data classified by type and work with your local supplier or vendor (which I’m hoping will be EMC, of course) to understand what your likely deduplication ratios are.


* Actually, “a few tapes here and there” can add up spectacularly quickly, but that’s another matter.
** By this I mean the difference between a different 1.75% of files being edited each day on the fileserver, the same 1.75% of files being edited each day on the fileserver, or some mix thereof – this plays an important factor that I’m disregarding for simplicity.
