Architecture Matters: Protection in the Cloud (Part 2)

Jun 05, 2017

(This post follows on from Part 1.)

Particularly when we think of IaaS-style workloads in the Cloud, there are two key approaches that can be used for data protection.

The first is snapshots. Snapshots fulfil part of a data protection strategy, but we always need to remember that:

  • They’re an inefficient storage and retrieval model for long-term retention
  • Cloud or not, they’re still essentially on-platform

As I cover quite a bit in my book, a real data protection strategy will be multi-layered. Snapshots can undoubtedly provide options for meeting fast RTOs and minimal RPOs, but traditional backup systems deliver the recovery granularity needed for protection copies stretching back weeks, months or years.

Stepping back from data protection itself – public cloud is a very different operating model to traditional in-datacentre infrastructure spending. The classic in-datacentre infrastructure procurement process is an up-front investment designed around 3- or 5-year depreciation schedules. For some businesses that may mean a literal up-front purchase to cover the entire time-frame (particularly so when infrastructure budget is only released for the initial deployment project), and for others with more fluid budget options, there’ll be an investment into infrastructure that can be expanded over the 3- or 5-year solution lifetime to meet systems growth.

Cloud – public Cloud – isn’t costed or sold that way. It’s a much smaller billing window and costing model: use a GB of RAM, pay for a GB of RAM. Use a GHz of CPU, pay for a GHz of CPU. Use a GB of storage, pay for a GB of storage. Public cloud costing models often remind me of Master of the House from Les Miserables, particularly this verse:

Charge ’em for the lice, extra for the mice
Two percent for looking in the mirror twice
Here a little slice, there a little cut
Three percent for sleeping with the window shut
When it comes to fixing prices
There are a lot of tricks I knows
How it all increases, all them bits and pieces
Jesus! It’s amazing how it grows!

Master of the House, Les Miserables.

That’s the Cloud operating model in a nutshell. Minimal (or no) up-front investment, but you pay for every scintilla of resource you use – every day or month.

If you, say, deploy a $30,000 server into your datacentre, you then get to use it as much or as little as you want, without any further costs beyond power and cooling*. With Cloud, you won’t pay that $30,000 up-front fee, but you will pay for every MHz, KB of RAM and byte of storage consumed within every billing period.
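To make the two spending models concrete, here’s a minimal sketch comparing an up-front purchase with per-unit, per-month cloud billing. All prices and rates here are hypothetical, purely for illustration.

```python
def on_prem_cost(months: int, purchase: float = 30_000.0) -> float:
    """One up-front purchase; further usage is 'free' (power/cooling excluded)."""
    return purchase  # independent of how many months you run it


def cloud_cost(months: int, gb_ram: int, cpu_ghz: int, gb_storage: int,
               ram_rate: float = 5.0, cpu_rate: float = 10.0,
               storage_rate: float = 0.05) -> float:
    """Pay for every unit consumed, every month (rates are illustrative only)."""
    monthly = gb_ram * ram_rate + cpu_ghz * cpu_rate + gb_storage * storage_rate
    return monthly * months


# Over a 36-month depreciation window:
print(on_prem_cost(36))                                  # one-off spend
print(cloud_cost(36, gb_ram=64, cpu_ghz=16, gb_storage=2000))  # accrues monthly
```

The point of the sketch isn’t the specific numbers – it’s that the cloud figure scales linearly with every unit consumed, which is exactly why optimisation matters there in a way it never quite did on-premises.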

If you want Cloud to be cost-effective, you have to be able to optimise – you have to effectively game the system, so to speak. Your in-Cloud services have to be maximally streamlined. We’ve become inured to resource wastage in the datacentre because resources have been cheap for a long time. RAM size/speed grows, CPU speed grows, as does the number of cores, and storage – well, storage seems to have an infinite expansion capability. Who cares if what you’re doing generates 5 TB of logs per day? Information is money, after all.

To me, this is just the next step in the somewhat lost art of programmatic optimisation. I grew up in the days of 8-bit computing**, and we knew back then that CPU, RAM and storage weren’t infinite. This didn’t end with 8-bit computing, though. When I started in IT as a Unix system administrator, swap file sizing, layout and performance was something that formed a critical aspect of your overall configuration, because if – Jupiter forbid – your system started swapping, you needed a fighting chance that the swapping wasn’t going to kill your performance. Swap file optimisation was, to use a Bianca Del Rio line, all about the goal: “Not today, Satan.”

That’s Cloud, now. But we’re not so much talking about swap files as we are resource consumption. Optimisation is critical. A failure to optimise means you’ll pay more. The only time you want to pay more is when what you’re paying for delivers a tangible, cost-recoverable benefit to the business. (I.e., it’s something you get to charge someone else for, either immediately, or later.)

Cloud Cost

If we think about backup, it’s about getting data from location A to location B. In order to optimise it, you want to do two distinct things:

  • Minimise the number of ‘hops’ that data has to make in order to get from A to B
  • Minimise the amount of data that you need to send from A to B.

If you don’t optimise that, you end up in a ‘classic’ backup architecture that we used to rely so much on in the 90s and early 00s, such as:

Cloud Architecture Matters 1

(In this case I’m looking just at backup services that land data into object storage. There are situations where you might want higher performance than what object offers, but let’s stick just with object storage for the time being.)

I don’t think this diagram actually gives the full picture. There’s another way I like to draw it, and it looks like this:

Cloud Architecture Matters 2

In the Cloud, you’re going to pay for the systems you’re running for business purposes no matter what. That’s a cost you have to accept, and the goal is to ensure that whatever services or products you’re on-selling to your customers using those services will pay for the running costs in the Cloud***.

You want to ensure you can protect data in the Cloud, but sticking to architectures designed at the time of on-premises infrastructure – and physical infrastructure at that – is significantly sub-optimal.

Think of how traditional media servers (or in NetWorker parlance, storage nodes) needed to work. A media server is designed to be a high performance system that funnels data coming from client to protection storage. If a backup architecture still heavily relies on media servers, then the cost in the Cloud is going to be higher than you need it – or want it – to be. That gets worse if a media server needs to be some sort of highly specced system encapsulating non-optimised deduplication. For instance, one of NetWorker’s competitors provides details on their website of hardware requirements for deduplication media servers, so I’ve taken these specifications directly from their website. To work with just 200 TB of storage allocated for deduplication, a media server for that product needs:

  • 16 CPU Cores
  • 128 GB of RAM
  • 400 GB SSD for OS and applications
  • 2 TB of SSD for deduplication databases
  • 2 TB of 800 IOPs+ disk (SSD recommended in some instances) for index cache

For every 200 TB. Think on that for a moment. If you’re deploying systems in the Cloud that generate a lot of data, you could very easily find yourself having to deploy multiple systems such as the above to protect those workloads, in addition to the backup server itself and the protection storage that underpins the deduplication system.
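A quick back-of-envelope helper makes that scaling cost visible. Given the competitor’s published sizing of one media server per 200 TB of deduplication storage, this hypothetical sketch totals the infrastructure you’d need to stand up for a given backend capacity:

```python
import math

PER_SERVER_TB = 200  # competitor's published limit per media server
PER_SERVER = {
    "cpu_cores": 16,
    "ram_gb": 128,
    "ssd_gb": 400 + 2048 + 2048,  # OS/apps + dedupe databases + index cache
}


def media_servers_required(backend_tb: float) -> dict:
    """Total media-server resources needed for `backend_tb` of dedupe storage."""
    n = math.ceil(backend_tb / PER_SERVER_TB)
    return {
        "servers": n,
        "cpu_cores": n * PER_SERVER["cpu_cores"],
        "ram_gb": n * PER_SERVER["ram_gb"],
        "ssd_gb": n * PER_SERVER["ssd_gb"],
    }


print(media_servers_required(1000))  # 1 PB of dedupe storage
```

At 1 PB you’re already provisioning five such servers – 80 cores, 640 GB of RAM and over 22 TB of SSD – before you account for the backup server or the protection storage itself.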

Or, on the other hand, you could work with an efficient architecture designed to minimise the number of data hops, and minimise the amount of data transferred:

CloudBoost Workflow

That’s NetWorker with CloudBoost. Unlike that competitor, a single CloudBoost appliance doesn’t just let you address 200 TB of deduplication storage, but 6 PB of logical object storage. 6 PB, not 200 TB. All that using 4-8 CPUs and 16-32 GB of RAM, and with a metadata sizing ratio of 1:2000 (i.e., every 100 GB of metadata storage allows you to address 200 TB of logical capacity). Yes, the metadata will optimally sit on SSD, but noticeably less of it than the competitor’s media server requires – and with a significantly greater addressable range.

NetWorker and CloudBoost can do that because the deduplication workflow has been optimised. In much the same way that NetWorker and Data Domain work together, within a CloudBoost environment, NetWorker clients will participate in the segmentation, deduplication, compression (and encryption!) of the data. That’s the first architectural advantage: rather than needing a big server to handle all the deduplication of the protection environment, a little bit of load is leveraged in each client being protected. The second architectural advantage is that the CloudBoost appliance does not pass the data through. Clients send their deduplicated, compressed and encrypted data directly to the object storage, minimising the data hops involved****.
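The source-side deduplication flow described above can be sketched in a few lines. This is a toy model only – real Boost/CloudBoost deduplication uses variable-length segmentation plus compression and encryption, whereas this uses tiny fixed-size segments purely to show the shape of the workflow:

```python
import hashlib

SEGMENT_SIZE = 8  # bytes; absurdly small, for demonstration only


def backup(data: bytes, store: dict) -> int:
    """Deduplicate `data` into `store`; return bytes actually transferred."""
    sent = 0
    for i in range(0, len(data), SEGMENT_SIZE):
        segment = data[i:i + SEGMENT_SIZE]
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint not in store:   # signature miss: send the segment itself
            store[fingerprint] = segment
            sent += len(segment)
        # signature match: only the tiny fingerprint crosses the network
    return sent


store = {}
first = backup(b"ABCDEFGH" * 100, store)   # first backup of repetitive data
second = backup(b"ABCDEFGH" * 100, store)  # repeat backup: nothing new to send
print(first, second)
```

Because segmentation and fingerprinting happen at the client, only unique segments ever leave it – which is precisely the property that removes the need for a heavyweight data-funnelling media server in the middle.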

To be sure, there are still going to be costs associated with running a NetWorker+CloudBoost configuration in public cloud – but that will be true of any data protection service. That’s the nature of public cloud – you use it, you pay for it. What you do get with NetWorker+CloudBoost though is one of the most streamlined and optimised public cloud backup options available. In an infrastructure model where you pay for every resource consumed, it’s imperative that the backup architecture be as resource-optimised as possible.

IaaS workloads will only continue to grow in public cloud. If your business uses NetWorker, you can take comfort in being able to still protect those workloads while they’re in public cloud, and doing it efficiently, optimised for maximum storage potential with minimised resource cost. Remember always: architecture matters, no matter where your infrastructure is.

Hey, if you found this useful, don’t forget to check out Data Protection: Ensuring Data Availability.


* Yes, I am aware there’ll be other costs beyond power and cooling when calculating a true system management price, but I’m not going to go into those for the purposes of this blog.

** Some readers of my blog may very well recall earlier computing models. But I started with a Vic-20, then the Commodore-64, and both taught me valuable lessons about what you can – and can’t – fit in memory.

*** Many a company has been burnt by failing to cost that simple factor, but in the style of Michael Ende, that is another story, for another time.

**** Linux 64-bit clients do this now. Windows 64-bit clients are supported in NetWorker 9.2, coming soon. (In the interim Windows clients work via a storage node.)

May 05, 2017

There was a time, comparatively not that long ago, when the biggest governing factor in LAN capacity for a datacentre was not the primary production workloads, but the mechanics of getting a full backup from each host over to the backup media. If you’ve been around in the data protection industry long enough you’ll have had experience of that – for instance, the drive towards 1Gbit networks over Fast Ethernet started more often than not in datacentres I was involved in thanks to backup. Likewise, the first systems I saw being attached directly to 10Gbit backbones in datacentres were the backup infrastructure.

Well architected deduplication can eliminate that consideration. That’s not to say you won’t eventually need 10Gbit, 40Gbit or even more in your datacentre, but if deduplication is architected correctly, you won’t need to deploy that next level up of network performance to meet your backup requirements.
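A simple calculation shows why. The sustained link speed a full backup demands falls dramatically once source-side deduplication removes most of the data before it hits the wire; the 95% figure below is illustrative, in line with the day-to-day reductions discussed later in this post:

```python
def required_gbit(data_tb: float, window_hours: float,
                  reduction: float = 0.0) -> float:
    """Sustained link speed (Gbit/s) needed to move `data_tb` within the
    backup window, after `reduction` (0.0-1.0) of the data has been
    deduplicated away at the source."""
    bits = data_tb * 1e12 * 8 * (1 - reduction)
    return bits / (window_hours * 3600) / 1e9


# 50 TB full backup in an 8-hour window:
print(round(required_gbit(50, 8), 1))                   # no dedup: ~13.9 Gbit/s
print(round(required_gbit(50, 8, reduction=0.95), 2))   # 95% reduction: well under 1 Gbit/s
```

That’s the difference between having to buy the next tier of network for backup, and backup comfortably fitting inside the capacity you already have.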

In this blog article I want to take you through an example of why deduplication architecture matters, and I’ll focus on something that amazingly still gets consideration from time to time: post-ingest deduplication.

Before I get started – obviously, Data Domain doesn’t use post-ingest deduplication. Its pre-ingest deduplication ensures the only data written to the appliance is already deduplicated, and it further increases efficiency by pushing deduplication segmentation and processing out to the individual clients (in a NetWorker/Avamar environment) to limit the amount of data flowing across the network.

A post-ingest deduplication architecture, though, has your protection appliance feature two distinct tiers of storage – the landing or staging tier, and the deduplication tier. So when it’s time to do a backup, all your clients send all their data across the network to sit, at its original size, on the staging tier:

Post Process Dedupe 01

In the example above we’ve already had backups run to the post-ingest deduplication appliance; so there’s a heap of deduplicated data sitting in the deduplication tier, but our staging tier has just landed all the backups from each of the clients in the environment. (If it were NetWorker writing to the appliance, each of those backups would be the full sized savesets.)

Now, at some point after the backup completes (usually a preconfigured time), post-processing kicks in. This is effectively a data-migration window in a post-ingest appliance, where all the data in the staging tier has to be read and processed for deduplication. Using the example above, we might start by inspecting ‘Backup01’ for commonality with data on the deduplication tier:

Post Process Dedupe 02

So the post-ingest processing engine starts by reading through all the content of Backup01 and constructs fingerprint analysis of the data that has landed.

Post Process Dedupe 03

As fingerprints are assembled, data can be compared against the data already residing in the deduplication tier. This may result in signature matches or signature misses, indicating new data that needs to be copied into the deduplication tier.

Post Process Dedupe 04

In this it’s similar to regular deduplication – signature matches result in pointers for existing data being updated and extended, and a signature miss results in needing to store new data on the deduplication tier.

Post Process Dedupe 05

Once the first backup file written to the staging tier has been dealt with, we can delete that file from the staging area and move on to the second backup file to start the process all over again. And we keep doing that over and over and over on the staging tier until we’re left with an empty staging tier:

Post Process Dedupe 06

Of course, that’s not the end of the process – then the deduplication tier will have to run its regular housekeeping operations to remove data that’s no longer referenced by anything.
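The whole staging-drain cycle above can be reduced to a toy model. Note the cost it makes explicit: every staged byte has to be read back off disk during post-processing, on top of having been written at full size in the first place. (Tiny fixed-size segments are used here for illustration only.)

```python
import hashlib


def post_process(staging: list, dedupe_tier: dict) -> int:
    """Drain the staging tier into the dedupe tier; return bytes re-read
    from staging disk (every staged byte is read during processing)."""
    bytes_read = 0
    while staging:
        backup_file = staging.pop(0)          # take the next staged backup
        bytes_read += len(backup_file)        # full-size read-back
        for i in range(0, len(backup_file), 8):   # 8-byte toy segments
            seg = backup_file[i:i + 8]
            fp = hashlib.sha256(seg).hexdigest()
            if fp not in dedupe_tier:         # signature miss: copy it over
                dedupe_tier[fp] = seg
        # staged file deleted (popped) only after processing completes
    return bytes_read


staging = [b"ABCDEFGH" * 10, b"ABCDEFGH" * 10]  # two backups landed full-size
tier = {}
print(post_process(staging, tier), len(staging))  # all bytes re-read; staging empty
```

Contrast that with pre-ingest deduplication, where the full-size write and the full-size read-back simply never happen.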

Architecturally, post-ingest deduplication is a kazoo to pre-ingest deduplication’s symphony orchestra. Sure, you might technically get to hear the 1812 Overture, but it’s not really going to be the same, right?

Let’s go through where architecturally, post-ingest deduplication fails you:

  1. The network becomes your bottleneck again. You have to send all your backup data to the appliance.
  2. The staging tier has to have at least as much capacity available as the size of your biggest backup, assuming it can execute its post-process deduplication within the window between when your previous backup finishes and your next backup starts.
  3. The deduplication process becomes entirely spindle bound. If you’re using spinning disk, that’s a nightmare. If you’re using SSD, that’s $$$.
  4. There’s no way of telling how much space will be occupied on the deduplication tier after deduplication processing completes. This can lead you into very messy situations where say, the staging tier can’t empty because the deduplication tier has filled. (Yes, capacity maintenance is a requirement still on pre-ingest deduplication systems, but it’s half the effort.)

What this means is simple: post-ingest deduplication architectures are asking you to pay for their architectural inefficiencies. That’s where:

  1. You have to pay to increase your network bandwidth to get a complete copy of your data from client to protection storage within your backup window.
  2. You have to pay for both the staging tier storage and the deduplication tier storage. (In fact, the staging tier is often a lot bigger than the size of your biggest backups in a 24-hour window so the deduplication can be handled in time.)
  3. You have to factor the additional housekeeping operations into blackout windows, outages, etc. Housekeeping almost invariably becomes a daily rather than a weekly task, too.

Compare all that to pre-ingest deduplication:

Pre-Ingest Deduplication

Using pre-ingest deduplication – especially Boost-based deduplication – the segmentation and hashing happen directly where the data is, and rather than sending the entire dataset from the client to the Data Domain, we only send the unique data. Data that already resides on the Data Domain? All we send is a tiny fingerprint so the Data Domain can confirm it’s already there (and update its pointers for the existing data), before moving on. After your first backup, that potentially means your day-to-day network requirements for backup are reduced by 95% or more.
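The storage-footprint difference between the two architectures can be sketched the same way. The figures and the staging-oversizing factor below are hypothetical, but the shape of the comparison holds: a post-ingest appliance must provision a staging tier on top of its deduplication tier, while a pre-ingest appliance only ever stores deduplicated data.

```python
def post_ingest_storage_tb(daily_backup_tb: float, dedupe_tier_tb: float,
                           staging_factor: float = 1.5) -> float:
    """Staging tier (often oversized beyond a single day's backups, per the
    text above) plus the deduplication tier."""
    return daily_backup_tb * staging_factor + dedupe_tier_tb


def pre_ingest_storage_tb(dedupe_tier_tb: float) -> float:
    """Only already-deduplicated data is ever written."""
    return dedupe_tier_tb


# 40 TB of daily backups against a 200 TB dedupe tier:
print(post_ingest_storage_tb(40, 200))  # dedupe tier + oversized staging
print(pre_ingest_storage_tb(200))       # dedupe tier alone
```

That extra staging capacity is pure overhead – storage you buy solely to compensate for the architecture’s inefficiency.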

That’s why architecture matters: you’re either doing it right, or you’re paying the price for someone else’s inefficiency.

If you want to see more about how a well architected backup environment looks – technology, people and processes, check out my book, Data Protection: Ensuring Data Availability.

Jan 24, 2017

In 2013 I undertook the endeavour to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand it based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was an increasingly common scenario, it was (for the most part) mainly used as a staging activity (“disk to disk to tape”), and backup to disk was either to dumb filesystems or to Virtual Tape Libraries (VTLs).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, the core tenets of Cloud computing that made it so popular (e.g., agility and scalability) have been well and truly adopted as essential tenets of the modern datacentre, as well. Indeed, to compete against Cloud, on-premises IT has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to the business.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book the pinnacle of storage performance was the 15,000 RPM drive, and flash memory storage was something you (primarily) used in digital cameras only, with storage capacities measured in the hundreds of megabytes more than gigabytes (or now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendancy – with virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as separate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process and needed things done faster, cheaper, more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and for many businesses, profit too. Flash systems are now offering significantly more IOPs than a traditional array could – Dell EMC for instance can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPs. To achieve ten million IOPs on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability is born from. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for a good reason: you must have a grasp on the other bits before you can start considering everything else, otherwise you’re just doing point-solutions, and eventually just doing point-solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability is released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

Mar 09, 2016

I’ve been working with backups for 20 years, and if there’s been one constant in 20 years I’d say that application owners (i.e., DBAs) have traditionally been reluctant to have other people (i.e., backup administrators) in control of the backup process for their databases. This leads to some environments where the DBAs maintain control of their backups, and others where the backup administrators maintain control of the database backups.


So the question that many people end up asking is: which way is the right way? The answer, in reality, is a little fuzzy: it depends.

When we were primarily backing up to tape, there was a strong argument for backup administrators to be in control of the process. Tape drives were a rare commodity needing to be used by a plethora of systems in a backup environment, and with big demands placed on them. The sensible approach was to fold all database backups into a common backup scheduling system so resources could be apportioned efficiently and fairly.

DB Backups with Tape

Traditional backups to tape via a backup server

With limited tape resources and a variety of systems to protect, backup administrators needed to exert reasonably strong controls over what backed up when, and so in a number of organisations it was common to have database backups controlled within the backup product (e.g., NetWorker), with scheduling negotiated between the backup and database administrators. Where such processes have been established, they often continue – backups are, of course, a reasonably habitual process (and for good cause).

For some businesses though, DBAs might feel there was not enough control over the backup process – a view that might be justified by the mission criticality of the applications running on top of the database, or by the perceived licensing costs associated with using a plugin or module from the backup product to back up the database. So in these situations, if a tape library or drives weren’t allocated directly to the database, the “dump and sweep” approach became quite common, viz.:

Dump and Sweep

One of the most pervasive results of the “dump and sweep” methodology, however, is the amount of primary storage it uses. Because disk was much faster than tape, database administrators would often get significantly larger areas of storage – particularly as storage became cheaper – to conduct their dumps to. Instead of one or two days, it became increasingly common to have anywhere from 3-5 days of database dumps sitting on primary storage, swept up nightly by a filesystem backup agent.

Dump and sweep of course poses problems: in addition to needing large amounts of primary storage, the first backup for the database is on-platform – there’s no physical separation. That means the timing of getting the database backup completed before the filesystem sweep starts is critical. However, the timing for the dump is controlled by the DBA and dependent on the database load and the size of the database, whereas the timing of the filesystem backup is controlled by the backup administrator. This would see many environments spring up where over time the database grew to a size it wouldn’t get an off-platform backup for 24 hours – until the next filesystem backup happened. (E.g., a dump originally taking an hour to complete would be started at 19:00. The backup administrators would start the filesystem backup at 20:30, but over time the database backups would grow and wouldn’t complete until say, 21:00. Net result could be a partial or failed backup of the dump files the first night, with the second night being the first successful backup of the dump.)
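The timing failure described in the worked example above is easy to model: the DBA’s dump start time and the backup administrator’s sweep start time are both fixed, but the dump duration grows with the database. All times here are the hypothetical ones from the example in the text.

```python
from datetime import datetime, timedelta


def sweep_outcome(dump_start: str, dump_minutes: int, sweep_start: str) -> str:
    """Does tonight's filesystem sweep capture a complete database dump?"""
    fmt = "%H:%M"
    dump_end = datetime.strptime(dump_start, fmt) + timedelta(minutes=dump_minutes)
    sweep = datetime.strptime(sweep_start, fmt)
    if dump_end <= sweep:
        return "complete dump swept"   # off-platform copy achieved tonight
    return "partial/failed sweep"      # first good off-platform copy is tomorrow


print(sweep_outcome("19:00", 60, "20:30"))   # year one: dump done by 20:00
print(sweep_outcome("19:00", 120, "20:30"))  # database grew: dump ends 21:00
```

Because the two schedules are owned by different people and never renegotiated, the environment silently drifts from nightly off-platform protection to a 24-hour exposure window.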

Over time backup to disk entered popularity to overcome the overnight operational challenges of tape, then grew, and eventually the market expanded to include deduplication storage, purpose-built backup appliances and even what I’d normally consider to be integrated data protection appliances – ones where the intelligence (e.g., deduplication functionality) is extended out from the appliance to the individual systems being protected. That’s what we get, for instance, with Data Domain: the Boost functionality embedded in APIs on the client systems leveraging distributed segment processing to have everything being backed up participate in its own deduplication. The net result is one that scales better than the traditional 3-tier “client/server/{media server|storage node}” environment, because we’re scaling where it matters: out at the hosts being protected and up at protection storage, rather than adding a series of servers in the middle to manage bottlenecks. (I.e., we remove the bottlenecks.)

Even as large percentages of businesses switched to deduplicated storage – Data Domains mostly from a NetWorker perspective – and had the capability of leveraging distributed deduplication processes to speed up the backups, that legacy “dump and sweep” approach, if it had been in the business, often remained in the business.

We’re far enough into this now that I can revisit the two key schools of thought within data protection:

  • Backup administrators should schedule and control backups regardless of the application being backed up
  • Subject Matter Experts (SMEs) should have some control over their application backup process, because they usually have a deep understanding of how the business functions that leverage the application actually work

I’d suggest that the smaller the business, the more correct the first option is – or rather, when an environment is such that DBAs are contracted or outsourced in particular, having the backup administrator in charge of the backup process is probably more important to the business. But that creates a requirement for the backup administrator to know the ins and outs of backing up and recovering the application/database almost as deeply as a DBA themselves.

As businesses grow in size and as the number of mission critical systems sitting on top of databases/applications grow, there’s equally a strong opinion the second argument is correct: the SMEs need to be intimately involved in the backup and recovery process. Perhaps even more so, in a larger backup environment, you don’t want your backup administrators to actually be bottlenecks in a disaster situation (and they’d usually agree to this as well – it’s too stressful).

With centralised disk based protection storage – particularly deduplicating protection storage – we can actually get the best of both worlds now though. The backup administrators can be in control of the protection storage and set broad guidance on data protection at an architectural and policy level for much of the environment, but the DBAs can leverage that same protection storage and fold their backups into the overall requirements of their application. (This might be to even leverage third party job control systems to only trigger backups once batch jobs or data warehousing tasks have completed.)

Backup Process With Data Domain and Backup Server

That particular flow is great for businesses that have maintained centralised control over the backup process of databases and applications, but what about those where dump and sweep has been the design principle, and there’s a desire to keep a strong form of independence on the backup process, or where the overriding business goal is to absolutely limit the number of systems database administrators need to learn so they can focus on their job? They’re definitely legitimate approaches – particularly so in larger environments with more mission critical systems.

That’s why there’s the Data Domain Boost plugins for Applications and Databases – covering SAP, DB2, Oracle, SQL Server, etc. That gives a slightly different architecture, viz.:

DB Backups with Boost Plugin

In that model, the backup server (e.g., NetWorker) still controls and coordinates the majority of the backups in the environment, but the Boost Plugin for Databases/Applications is used on the database servers instead to allow complete integration between the DBA tools and the backup process.

So returning to the initial question – which way is right?

Well, that comes down to the real question: which way is right for your business? Pull any emotion or personal preferences out of the question and look at the real architectural requirements of the business, particularly relating to mission critical applications. Which way is the right way? Only your business can decide.

Here’s a thought I’ll leave you with though: there’s two critical components to being able to make the choice completely based on business requirements:

  • You need centralised protection storage where there aren’t the traditional (tape-inherited) limitations on concurrent device access
  • You need a data protection framework approach rather than a data protection monolith approach

The former allows you to make decisions without being impeded by arbitrary practical/physical limitations (e.g., “I can’t read from a tape and write to it at the same time”), and more importantly, the latter lets you build an adaptive data protection strategy using best of breed components at the different layers rather than squeezing everything into one box and making compromises at every step of the way. (NetWorker, as I’ve mentioned before, is a framework based backup product – but I’m talking more broadly here: framework based data protection environments.)

Happy choosing!

Data isn’t data isn’t data

 Architecture
Jan 182016

An integral part of effective data protection is data awareness. You can’t adequately protect what you don’t know about, but similarly, you can’t adequately protect what you don’t understand, either. Understanding what sort of data you have is critical to understanding how you can protect it – and even more so from a business perspective, how much you may need to spend in order to protect it.

As the title says, Data isn’t Data isn’t Data.

I think this is most striking for me in organisations which have been running with data protection solutions that have organically developed over time (probably since the company was either quite small or operationally quite informal) and are now looking at making major, hopefully long-reaching changes to their data protection strategy.

The scenario works like this: the company asks for proposals on a holistic data protection strategy that tells prospective bidders all about where data is, and what operating systems data sits on, and usually even what the link speeds are between its sites, but doesn’t have more details about the type of data it is. By type, I mean:

  • What percentage of the data is traditional database;
  • What percentage is traditional file/operating system;
  • What percentage is NAS;
  • What percentage is virtual machine images;
  • What percentage of each must be sent or stored in an encrypted format;
  • and so on.

At one time, that information wasn’t necessarily all that relevant: if it were all being sent to tape the biggest headaches came from whether or not there were particularly dense file systems. (You can’t stream tape backups over WAN-speed links so you’d typically not care about the link speed so long as you could deploy sufficient tape infrastructure in each required location.) If data was already compressed or already encrypted before it was backed up, that might reduce the compression ratio achieved on individual tapes in the data protection environment, but what’s a few tapes here and there?*

As data protection gets more efficient and smarter though, this sort of information becomes as important to understanding what will be involved in protecting it as the more traditional questions.

Consider for instance a company that wants to protect 70TB of data using deduplication storage so as to minimise the protection footprint and gain the most efficiencies out of disk based backup strategies. The typical starting questions you’d need to answer for a backup and recovery environment might be say:

  • How long do you want to keep your daily/weekly backups for?
  • How long do you want to keep monthly fulls for?
  • Do you need long term retention for yearlies or other backups?

For the purposes of simplicity, let’s stick to just those first two questions and provide some basic answers to work with:

  • Daily incrementals and weekly fulls to be kept for 6 weeks
  • Monthly backups to be kept for 12 months

We’ll also assume all data is in one location. In the old world of tape, the above would have been enough information to come up with an approximate configuration meeting the backup capacity requirements for the environment. (Noted: it would certainly not be enough for determining speed requirements.)

But if you want to take advantage of deduplication, data isn’t data isn’t data. Knowing that you have 70TB of data doesn’t allow anyone to make any reliable recommendations about what sort of protection storage you might need if your intent is to drop tape and move to more efficient formats. OK, let’s start providing a few more details and see what happens.

Let’s say you’re told:

  • 70 TB of data
  • Weekly fulls retained for 6 weeks
  • Daily incrementals retained for 6 weeks
  • Monthly fulls retained for 12 months
  • 3.19% average daily change rate

If you’re just going to be backing up to tape, or plain disk, this has now given you enough information to have a stab at coming up with a potential capacity, which would start with:

Size ~= 70 TB x 6 (weekly fulls) + 70 TB x 12 (monthly fulls) + (70 TB x 3.19% x 36 incrementals)

Size ~= 1340.388 TB
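That raw arithmetic is easy enough to sanity-check with a few lines of Python – this just reproduces the figures above, nothing more:

```python
# Raw (pre-deduplication) capacity estimate from the figures above.
full_tb = 70
weekly_fulls = 6       # weekly fulls retained for 6 weeks
monthly_fulls = 12     # monthly fulls retained for 12 months
incrementals = 36      # 6 weeks x 6 daily incrementals per week
daily_change = 0.0319  # 3.19% average daily change

size_tb = (full_tb * weekly_fulls
           + full_tb * monthly_fulls
           + full_tb * daily_change * incrementals)
print(f"{size_tb:.3f} TB")  # 1340.388 TB
```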

But is that accurate? Well, no: we don’t have enough information to properly understand the environment. Is it possible, for instance, to work out how much deduplication storage you might need to provide protection for 1340.388TB of backups? What’s the ‘average’ deduplication ratio for any data, regardless of what it is? (Hint: there’s no such thing.)

Coming back to the original point of the article, data isn’t data isn’t data. So let’s start breaking this out a little more and see what happens. That 70TB of data becomes:

  • 10 TB Files
  • 5 TB Databases
  • 5 TB Mail
  • 50 TB VMware

Let’s also assume that because we now know the data types, we also know the per-type change rate rather than relying on an average change rate, and so we actually have:

  • 10 TB Files at 1.75% daily change
  • 5 TB Databases at 6% daily change
  • 5 TB Mail at 3% daily change
  • 50 TB VMware at 2% daily change (within the VMs, not the individual container files – which of course is normally 100% change)

A few things to note here:

  • I’m not suggesting the above change rates are real-world, I’ve just shoved them into a spreadsheet as examples.
  • I’m not factoring in the amount of the same content that changes each day vs unique content that changes each day**.

At this point if we’re still sizing for either tape or conventional disk, we can more accurately come up with the storage capacity required. Based on those figures, our actual required capacity comes down from 1340.388TB to 1318.50TB. Not a substantial difference, but a difference nonetheless. (The quality and accuracy of your calculation always depends on the quality and accuracy of your data, after all.)
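If you’d like to check my fingers, the per-type version of the calculation is just as simple:

```python
# Per-type capacity, still pre-deduplication (tape or plain disk).
fulls = 6 + 12   # 6 weekly + 12 monthly fulls retained
incrementals = 36

data_tb_and_change = {
    "files": (10, 0.0175),
    "databases": (5, 0.06),
    "mail": (5, 0.03),
    "vmware": (50, 0.02),
}
size_tb = sum(tb * fulls + tb * change * incrementals
              for tb, change in data_tb_and_change.values())
print(f"{size_tb:.2f} TB")  # 1318.50 TB
```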

If we assumed a flat deduplication rate we might have enough data now to come up with a sizing for deduplication storage, but in reality there’s a minimum of three deduplication ratios you want to consider, notably:

  • Deduplication achieved from first full backup
  • Deduplication achieved from subsequent full backups
  • Deduplication achieved for incremental backups

In reality, it’s more complex than that – again, returning to the rate of unique vs non-unique change within the environment. Coming back to data isn’t data isn’t data though, that’ll be different for each data type.

So let’s come up with some basic deduplication ratios – again, I’m just pulling numbers out of my head and these should in no way be seen as ‘average’. Let’s assume the following:

  • File backups have a first full dedupe of 4x, a subsequent full dedupe of 6x, and an incremental dedupe of 3x
  • Database backups have a first full dedupe of 2.5x, a subsequent full dedupe of 3x, and an incremental dedupe of 1.5x
  • Mail backups have a first full dedupe of 3x, a subsequent full dedupe of 4x, and an incremental dedupe rate of 2.5x
  • VMware backups have a first full dedupe of 6x, a subsequent full dedupe of 12x, and an incremental dedupe rate of 6x

If we plug those into a basic spreadsheet (given I still count on my fingers), we might see a sizing and capacity requirement of:

  • Files – 32.93 TB
  • Database – 37.53 TB
  • Mail – 25.08 TB
  • VMware – 85.17 TB
  • Total – 180.71 TB

It’s here that you need to be aware of any gotchas. What happens, for instance, if an environment has some sort of high security requirement for file storage, and all files on fileservers are encrypted before being written to disk? In that scenario, the backup product would be dealing with 10TB of storage that won’t deduplicate at all. That might result in no deduplication at all for each of the backup scenarios (first full, subsequent full and incrementals) for the file data: we’d have a 1:1 storage requirement for those backups. This would mean our file backup storage would require 186.3TB of backup capacity (vs 32.93 TB above), bringing the total storage with deduplication to 334.08 TB.
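For the curious, the per-phase sizing reduces to a simple formula – each backup phase divided by its deduplication ratio. A quick script using the example ratios (remember, numbers pulled out of my head, not averages):

```python
# Deduplicated sizing: each backup phase divided by its dedupe ratio.
fulls = 6 + 12   # 6 weekly + 12 monthly fulls retained
incrementals = 36

def dedupe_tb(full_tb, change, first_x, subseq_x, incr_x):
    first = full_tb / first_x
    subsequent = (fulls - 1) * full_tb / subseq_x
    incrs = incrementals * full_tb * change / incr_x
    return first + subsequent + incrs

sizes = {
    "files": dedupe_tb(10, 0.0175, 4, 6, 3),
    "databases": dedupe_tb(5, 0.06, 2.5, 3, 1.5),
    "mail": dedupe_tb(5, 0.03, 3, 4, 2.5),
    "vmware": dedupe_tb(50, 0.02, 6, 12, 6),
}
for name, tb in sizes.items():
    print(f"{name}: {tb:.2f} TB")
print(f"total: {sum(sizes.values()):.2f} TB")  # total: 180.71 TB
```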

The example I’ve given is pretty simplistic, and in no way exhaustive, but it should start to elaborate on why the old way of specifying how much data you have just doesn’t cut it any more. Examples of where the above would need further clarification would include:

  • What is the breakdown between virtual machines hosting regular data and database data? (increasingly important as virtualisation loads increase)
  • For each dataset, would there be any data that’s already compressed, already encrypted, or some form of multimedia? (10TB of Word documents will have a completely different storage profile to 10TB of MP4 files, for instance.)

And then, of course, as we look at multi-site environments, it’s then important to understand:

  • What is the breakdown of data per site?
  • What is the link speed between each site?

This is all just for sizing alone. For performance obviously it’s then important to understand so much more – recovery time objectives, recovery point objectives, frequency of recoveries, backup window, and so on … but this brings us back to the title of the article:

Data isn’t data isn’t data.

So if you’re reaching that point where you are perhaps considering deduplication for the first time, remember to get your data classified by type and work with your local supplier or vendor (which I’m hoping will be EMC, of course) to understand what your likely deduplication ratios are.

* Actually, “a few tapes here and there” can add up spectacularly quickly, but that’s another matter.
** By this I mean the difference between a different 1.75% of files being edited each day on the fileserver, the same 1.75% of files being edited each day on the fileserver, or some mix thereof – this plays an important factor that I’m disregarding for simplicity.

Sep 232015

The LTO consortium has announced:

That the LTO Ultrium format generation 7 specifications are now available for licensing by storage mechanism and media manufacturers.

LTO-7 will feature tape capacities of up to 15TB (compressed) and streaming speeds of up to 750MB/s (compressed). LTO is now working off a 2.5:1 compression ratio – so those numbers are (uncompressed) 6TB and 300MB/s.
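The native figures are just the marketing numbers divided by that assumed ratio:

```python
# LTO-7 headline figures assume 2.5:1 compressible data.
ratio = 2.5
native_capacity_tb = 15 / ratio   # from 15TB "compressed"
native_speed_mbs = 750 / ratio    # from 750MB/s "compressed"
print(native_capacity_tb, native_speed_mbs)  # 6.0 300.0
```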

Don’t get me wrong – I’m not going to launch into a tape is dead article here. Clearly it’s not dead.

That rocket car is impressive. It hit 1,033KM/h – Mach 9.4* – over a 16KM track. There’s no denying it’s fast. There’s also no denying that you couldn’t just grab it and use it to commute to work. And if you could commute to work using it but there happened to be a small pebble on the track, what would happen?

I do look at LTO increasingly and find myself asking … how relevant is it for average businesses? It’s fast and it has high capacity – and this is increasing with the LTO-7 format. Like the rocket car above though, it’s impressive as long as you only want to go in one direction and you don’t hit any bumps.

Back when Tape Was King, a new tape format meant a general rush on storage refresh towards the new tape technology in order to get optimum speed and capacity for a hungry backup environment. And backup environments are still hungry for capacity and speed, but they’re also hungry for flexibility, something that’s not as well provided by tape. Except in very particular conditions, tape is no longer seen as the optimum first landing zone for backup data – and increasingly, it’s not being seen as the ideal secondary landing zone either. More and more businesses are designing backup strategies around minimising the amount of tape they use in their environment. It’s not in any way unusual now to see backup processes designed to keep at least all of the normal daily/weekly cycles on disk (particularly if it’s deduplication storage) and push only the long-term retention backups out to tape. (Tape is even being edged out there for many businesses, but I’ll leave that as a topic for another time.)

Much of the evolution we’ve seen in backup and recovery functionality has come from developing features around high speed random access of backups. Deduplication, highly granular recoveries, mounting from the backup system and even powering on virtual machines from backup storage all require one thing in common: disk. As we’ve come to expect that functionality in data protection products, the utility of tape for most organisations has likewise decreased significantly. Recoverability and even access-without-recovery has become a far more crucial consideration in a data protection environment than the amount of data you can fit onto a cartridge.

I have no doubt LTO-7 will win high-praise from many. But like the high speed rocket car video above, I don’t think it’s your “daily commute” data protection vehicle. It clearly has purpose, it clearly has backing, and it clearly has utility. As long as you need to go in a very straight line, don’t make any changes in direction and don’t attempt to change your speed too much.

As always, plan your data protection environment around the entire end-to-end data protection process, and the utility of that protected data.

* Oops, Mach 0.94. Thanks, Tony. (That’ll teach me to blindly copy the details from the original video description.)

Pool size and deduplication

 Architecture, Avamar, Data Domain, NetWorker
May 202015

When you start looking into deduplication, one of the things that becomes immediately apparent is … size matters. In particular, the size of your deduplication pool matters.

Deduplication Pool

In this respect, what I’m referring to is the analysis pool for comparison when performing deduplication. If we’re only talking target based deduplication, that’s simple – it’s the size of the bucket you’re writing your backup to. However, the problems with a purely target based deduplication approach to data protection are network congestion and time wasted – a full backup of a 1TB fileserver will still see 1TB of data transferred over the network to have most of its data dropped as being duplicate. That’s an awful lot of packets going to /dev/null, and an awful lot of bandwidth wasted.

For example, consider the following diagram being of a solution using target only deduplication (e.g., VTL only or no Boost API on the hosts):

Dedupe Target Only

In this diagram, consider the outline arrow heads to indicate where deduplication is being evaluated. Thus, if each server had 1TB of storage to be backed up, then each server would send 1TB of storage over to the Data Domain to be backed up, with deduplication performed only at the target end. That’s not how deduplication has to work now, but it’s a reminder of where we were only a relatively short period of time ago.

That’s why source based deduplication (e.g., NetWorker Client Direct with a DDBoost enabled connection, or Data Domain Boost for Enterprise Applications) brings so many efficiencies to a data protection system. While there’ll be a touch more processing performed on the individual clients, that’ll be significantly outweighed by the ofttimes massive reduction in data sent onto the network for ingestion into the deduplication appliance.

So that might look more like:

Source Dedupe

I.e., in this diagram with outline arrow heads indicating location of deduplication activities, we get an immediate change – each of those hosts will still have 1TB of backup to perform, but they’ll evaluate via hashing mechanisms whether or not that data actually needs to be sent to the target appliance.

There’s still efficiencies to be had even here though, which is where the original point about pool size becomes critical. To understand why, let’s look at the diagram a slightly different way:

Source Dedupe Global

In this case, we’ve still got source deduplication, but the merged lines represent something far more important … we’ve got global, source deduplication.

Or to put it a slightly different way:

  • Target deduplication:
    • Client: “Hey, here’s all my data. Check to see what you want to store.”
  • Source deduplication (limited):
    • Client: “Hey, I want to backup <data>. Tell me what I need to send based on what I’ve sent you before.”
  • Source deduplication (global):
    • Client: “Hey, I want to backup <data>. Tell me what I need to send based on anything you’ve ever received before.”
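Those three conversations can be sketched as a toy in a few lines of Python. This is purely illustrative – the class and function names are my own inventions, and the real Boost protocol is far more sophisticated – but it shows the mechanics of a client sending only segments the target has never seen:

```python
import hashlib

SEGMENT = 4096  # toy segment size

class DedupeStore:
    """Toy deduplication target: unique segments keyed by hash."""
    def __init__(self):
        self.segments = {}

    def has(self, digest):
        return digest in self.segments

    def put(self, digest, segment):
        self.segments[digest] = segment

def source_dedupe_backup(data, store):
    """Client-side: hash each segment, send only what the store lacks."""
    bytes_sent = 0
    for i in range(0, len(data), SEGMENT):
        segment = data[i:i + SEGMENT]
        digest = hashlib.sha256(segment).hexdigest()
        if not store.has(digest):   # "tell me what I need to send"
            store.put(digest, segment)
            bytes_sent += len(segment)
    return bytes_sent

store = DedupeStore()  # one global store = global deduplication
first = source_dedupe_backup(b"x" * 1_000_000, store)
second = source_dedupe_backup(b"x" * 1_000_000, store)  # identical data
print(first, second)  # the second backup sends nothing at all
```

With a single global store, the second backup (think: the second near-identical server) sends nothing; carve the store up per host and every host pays the first-backup cost all over again.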

That “limited” deduplication need not be limited on a per-host basis. Some products deduplicate per host, while others deduplicate against fixed-size pools – e.g., x TB at a time. But even so, there’s a huge difference between deduplicating against a small comparison set and deduplicating against a large comparison set.

Where that global deduplication pool size comes into play is the commonality of data that exists between hosts within an environment. Consider for instance the minimum recommended size for a Windows 2012 installation – 32GB. Now, assume you might get a 5:1 deduplication ratio on a Windows 2012 server (I literally picked a number out of the air as an example, not a fact) … that’ll mean a target occupied data size of 6.4GB to hold 32GB of data.

But we rarely consider a single server in isolation. Let’s expand this out to encompass 100 x Windows 2012 servers, each at 32GB in size. It’s here we see the importance of a large pool of data for deduplication analysis:

  • If that deduplication analysis were being performed at the per-server level, then realistically we’d be getting 100 x 6.4GB of target data, or 640GB.
  • If the deduplication analysis were being performed against all data previously deduplicated, then we’d assume that same 5:1 deduplication ratio for the first server backup, and then much higher deduplication ratios for each subsequent server backup, as they evaluate against previously stored data. So that might mean 1 x 5:1 + 99 x 20:1 … 164.8GB instead of 640GB or even (if we want to compare against tape) 3,200GB.
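In script form – with the 5:1 and 20:1 ratios plucked from the air, as above – the difference looks like this:

```python
# 100 x Windows 2012 servers at 32GB: per-host vs global dedupe pools.
servers, size_gb = 100, 32
first_ratio, global_ratio = 5, 20  # made-up example ratios

per_host_pool = servers * size_gb / first_ratio
global_pool = size_gb / first_ratio + (servers - 1) * size_gb / global_ratio
tape_equivalent = servers * size_gb

print(f"per-host dedupe: {per_host_pool:.1f} GB")  # 640.0 GB
print(f"global dedupe:   {global_pool:.1f} GB")    # 164.8 GB
print(f"no dedupe:       {tape_equivalent} GB")    # 3200 GB
```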

Throughout this article I’ve been using the term pool, but I’m not referring to NetWorker media pools – everything written to a Data Domain as an example, regardless of what media pool it’s associated with in NetWorker will be globally deduplicated against everything else on the Data Domain. But this does make a strong case for right-sizing your appliance, and in particular planning for more data to be stored on it than you would for a conventional disk ‘staging’ or ‘landing’ area. The old model – backup to disk, transfer to tape – was premised on having a disk landing zone big enough to accommodate your biggest backup, so long as you could subsequently transfer that to tape before your next backup. (Or some variant thereof.) A common mistake when evaluating deduplication is to think along similar lines. You don’t want storage that’s just big enough to hold a single big backup – you want it big enough to hold many backups so you can actually see the strategic and operational benefit of deduplication.

The net lesson is a simple one: size matters. The size of the deduplication pool, and what deduplication activities are compared against will make a significantly noticeable impact to how much space is occupied by your data protection activities, how long it takes to perform those activities, and what the impact of those activities are on your LAN or WAN.

Feb 152015

I’m pleased to say I’ve completed and made available the NetWorker usage report for 2014. I’m particularly grateful to everyone who took the time to answer the 20 questions in the survey conducted between December 1, 2014 and January 31, 2015.

The report continues to track trends in NetWorker usage within organisations: deduplication adoption continues to grow, for instance, and Data Domain remains the overwhelmingly preferred method to enable that deduplication.

This was the first survey that asked a basic question I should have been asking for the last several years (!), how big is a full backup for your environment? The results on that question were particularly insightful as to just how large some environments get, and put to rest that FUD you see occasionally from other vendors that NetWorker doesn’t cut the mustard. (It not only cuts it, but it spreads it in an entirely appetising way…)

You can find it on the main NetWorker Hub site, or access it directly here.


Not so squeezy

 Scripting, Tidbits
Nov 182014

It’s funny, the little tools you build up over the years as someone heavily involved in backup, particularly when it comes to testing.

I have two tools that help me with filesystem and performance testing – one I call generate-filesystem, and one called genbf (generate big file).

The genbf tool came about when I wanted files that were highly resistant to being compressed – and indeed, to subsequently being deduplicated as well. Sure, bigasm can produce good results, but it isn’t guaranteed to produce highly random data. That’s where genbf comes in. Best of all, it’s fast. For example, a 1GB file on my 12-core lab server gets created in under 10 seconds:

[pmdg@orilla test]$ date; genbf -s 1024 -f test.dat; date
Tue Nov 18 19:08:24 AEDT 2014
     Pre-generating random data chunk. (This may take a while.)
     0% of random data chunk generated.
     10% of random data chunk generated.
     20% of random data chunk generated.
     30% of random data chunk generated.
     40% of random data chunk generated.
     50% of random data chunk generated.
     60% of random data chunk generated.
     70% of random data chunk generated.
     80% of random data chunk generated.
     90% of random data chunk generated.
 Creating 1024 MB file test.dat
Wrote data file in 5121 chunks.
Tue Nov 18 19:08:33 AEDT 2014

OK, OK, a 1GB file can be created quickly if you’re just pulling in from /dev/zero, but here’s the file size difference pre and post-compressed:

[pmdg@orilla test]$ ls -al test.dat 
-rw-rw-r-- 1 pmdg pmdg 1073741824 Nov 18 19:08 test.dat
[pmdg@orilla test]$ pbzip2 -r test.dat
[pmdg@orilla test]$ ls -al test.dat.bz2 
-rw-rw-r-- 1 pmdg pmdg 1065615793 Nov 18 19:08 test.dat.bz2

(If you haven’t heard of pbzip2, enlighten yourself and support the author. It’s brilliant.)

When it comes to subsequently sending the generated data to Data Domain, the deduplication achieved is extremely low – 20 x 1GB files created with the standard settings above, for instance, yield an almost straight additional 20GB of occupied space.

If you want to try it out, you can download it from here. (You’ll need Perl on your system.) Standard usage is below:

genbf usage
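genbf itself is a Perl script, but if you just want the underlying idea, here’s a minimal Python sketch (an illustration only, not the actual tool): fill the file with cryptographically random bytes, which neither compression nor deduplication can do much with.

```python
import os
import zlib

def gen_big_file(path, size_mb):
    """Write size_mb of random data - resistant to compression and dedupe."""
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(os.urandom(1024 * 1024))  # fresh randomness per chunk

gen_big_file("test.dat", 4)
with open("test.dat", "rb") as f:
    raw = f.read()
compressed = zlib.compress(raw)
os.remove("test.dat")
print(len(raw), len(compressed))  # compression buys essentially nothing
```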


Aug 042014

Client Direct is a (relatively) new feature, introduced in NetWorker 8.0 (the start of the 8.x series), which allows a client to communicate directly with the backup device rather than going through a storage node.

This obviously requires the client to have access to the backup device, and only happens when you’re backing up to Data Domain Boost or an Advanced File Type Device (AFTD). (Direct client access to tape drives – physical or virtual – is instead achieved via library sharing or dynamic drive sharing.)

The real power of this feature happens when you’re using Data Domain within your environment though. When Data Domain Boost compatibility was first introduced in NetWorker, it was on the basis of a very traditional storage node model, i.e.:

Traditional client data flow



As a starting point with Data Domain Boost, the NetWorker storage node software was updated to support deduplication offloading, referred to as “distributed segment processing”. That’s where the storage node communicated with the Data Domain and determined what client data it was receiving actually needed to be sent across to the Data Domain. (This came in for NetWorker 7.6 SP1, where support for Data Domain Boost devices was added.)

It was, in effect, a pseudo source-based deduplication process … it wasn’t quite source-based, but it wasn’t technically target based, either. The storage node would act as a deduplication agent for the Data Domain. This had two advantages:

  • By distributing the deduplication processing, higher ingest rates could be achieved. (Though it should be noted that Data Domain is no slouch when it comes to ingest speed. But next to recoverability, speed is always important in a backup and recovery environment.)
  • Less data would be sent across the network; while the client data would be sent to the storage node, the storage node would not have to send as much data across to the Data Domain. (Which, of course, also helps backup speeds.)

However, the Boost distributed segment processing functionality introduced into NetWorker Storage Nodes was just the beginning. When NetWorker 8.0 introduced the “Client Direct” feature, this was combined with pushing the Boost functionality for distributed segment processing out from the storage nodes to the clients. The net result was a substantial change:

Client Direct Data Flow


Consider the implications of this: each client performs the Data Domain equivalent of source-level deduplication. It interacts with the Data Domain to perform segment analysis/processing of its own data, substantially minimising the amount of data that needs to be sent across the network. Only the unique segments need to be sent. As we know from products like Avamar, this makes a significant reduction to network traffic, and as a result, to the time taken to perform a backup.

This is the real magic of Client Direct and Data Domain Boost within NetWorker – ‘flattening’ the backup requirements by achieving a far more efficient distributed processing than can be achieved by a traditional server + storage node + client layout.

In fact, there’s two key benefits:

  • Drastic reduction in the amount of data that has to be sent across the network for a backup.
  • Substantial reduction in the requirements for and specifications of storage nodes.

Consider: in a traditional NetWorker model, storage nodes are typically hefty physical boxes. For a larger environment, they need multiple multi-core processors, significant IP and/or FC connectivity, and may also need large amounts of RAM.

Client Direct + Boost kicks all of those requirements out of the window.

To be certain, those requirements aren’t fully eliminated – processing is shifted to the clients, but at a considerable saving in network traffic, which by and large has usually been a big bottleneck in the backup process.

The biggest selling point of Data Domain of course is the deduplication achieved. If you compare raw Data Domain storage against raw storage in a conventional backup target, the numbers may not favour Data Domain on a TB-by-TB comparison. But as we know, a TB-by-TB comparison isn’t effective; after deduplication and compression, a Data Domain will store considerably more than 1TB within 1TB of space … in that sense deduplication exhibits TARDIS-like qualities.

Beyond the target deduplication levels (and therefore increased storage at a lowered physical footprint) there are the other benefits outlined above, thanks to Client Direct: faster backups, fewer storage nodes, and less processing specification required for those storage nodes. With the vast majority of clients able to communicate directly with the Data Domain, storage nodes step down to being required only for niche activities – acting as a central point for firewall reasons, or supporting clients/modules that can’t work with Boost+Client Direct. (Over time, the second reason will continue to diminish.)

And that’s what you really need to understand about Client Direct.
