Would you buy a dangerbase?

Jun 07 2017

Databases. They’re expensive, aren’t they?

What if I sold you a Dangerbase instead?

What’s a dangerbase!? I’m glad you asked. A dangerbase is functionally almost exactly the same as a database, except it may be a little bit more lax when it comes to controls. Referential integrity might slip. Occasionally an insert might accidentally trigger a background delete. Nothing major though. It’s twenty percent cheaper, with only four times the risk of one of those pesky ‘databases’! (Oh, you might need 15% more infrastructure to run it on, but you don’t have to worry about that until implementation.)

Dangerbases. They’re the next big thing. They have a marketshare that’s doubling every two years! Two years! (Admittedly that means they’re just at 0.54% marketshare at the moment, but that’s double what it was last year!)

A dangerbase is a stupid idea. Who’d trust storing their mission critical data in a dangerbase? The idea is preposterous.

Sadly, dangerbases get considered all too often in the world of data protection.


What’s a dangerbase in the world of data protection? Here are just some examples:

  • Relying solely on an on-platform protection mechanism. Accidents happen. Malicious activities happen. You need to always ensure you’ve got a copy of your data outside of the original production platform it is created and maintained on, regardless of what protection you’ve got in place there. And you should at least have one instance of each copy in a different physical location to the original.
  • Not duplicating your backups. Whether you call it a clone or a copy or a duplication doesn’t matter to me here – it’s the effect we’re looking for, not individual product nomenclature. If your backup isn’t copied, it means your backup represents a single point of failure in the recovery process.
  • Using post-process deduplication. (That’s something I covered in detail recently.)
  • Relying solely on RAID when you’re doing deduplication. Data Invulnerability Architecture (DIA) isn’t just a buzzterm, it’s essential in a deduplication environment.
  • Turning your databases into dangerbases by doing “dump and sweep”. Plugins have existed for decades. Dump and sweep is an expensive waste of primary storage space and introduces a variety of risk into your data protection environment.
  • Not having a data lifecycle policy! Without it, you don’t have control over capacity growth within your environment. Without that, you’re escalating your primary storage costs unnecessarily, and placing strain on your data protection environment – strain that can easily break it.
  • Not having a data protection advocate, or data protection architect, within your organisation. If data is the lifeblood of a company’s operations, and information is money, then failing to have a data protection architect/advocate within the organisation is like not bothering with having finance people.
  • Not having a disaster recovery policy that integrates into a business continuity policy. DR is just one aspect of business continuity, but if it doesn’t actually slot into the business continuity process smoothly, it’s as likely to hinder as to help the company.
  • Not understanding system dependencies. I’ve been talking about system dependency maps or tables for years. Regardless of what structure you use, the net effect is the same: the only way you can properly protect your business services is to know what IT systems they rely on, and what IT systems those IT systems rely on, and so on, until you’re at the root level.
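On that last point, a dependency map doesn’t need to be fancy. As a minimal sketch (the service and system names below are made up purely for illustration), a simple table of what relies on what can be walked until nothing new turns up, which gives you every system a business service needs protected:

```python
from collections import deque

# Hypothetical dependency table: each entry lists what that system relies on directly.
DEPENDS_ON = {
    "online-ordering": ["web-frontend", "orders-db"],
    "web-frontend": ["auth-service", "dns"],
    "orders-db": ["san-array", "dns"],
    "auth-service": ["directory", "dns"],
    "directory": ["dns"],
    "san-array": [],
    "dns": [],
}

def all_dependencies(service, table):
    """Return every system the service relies on, directly or indirectly."""
    seen = set()
    queue = deque(table.get(service, []))
    while queue:
        system = queue.popleft()
        if system in seen:
            continue  # already visited via another path
        seen.add(system)
        queue.extend(table.get(system, []))
    return seen

deps = all_dependencies("online-ordering", DEPENDS_ON)
print(sorted(deps))
```

Systems with no entries of their own (here, the array and DNS) are the root level: if they aren’t protected, nothing above them is either.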

That’s just a few things, but hopefully you understand where I’m coming from.

I’ve been living and breathing data protection for more than twenty years. It’s not just a job, it’s genuinely something I’m passionate about. It’s something everyone in IT needs to be passionate about, because it can literally make the difference between your company surviving or failing in a disaster situation.

In my book, I cover all sorts of considerations and details from a technical side of the equation, but the technology in any data protection solution is just one aspect of a very multi-faceted approach to ensuring data availability. If you want to take data protection within your business up to the next level – if you want to avoid having the data protection equivalent of a dangerbase in your business – check my book out. (And in the book there’s a lot more detail about integrating into IT governance and business continuity, a thorough coverage of how to work out system dependencies, and all sorts of details around data protection advocates and the groups that they should work with.)

May 05 2017

There was a time, comparatively not that long ago, when the biggest governing factor in LAN capacity for a datacentre was not the primary production workloads, but the mechanics of getting a full backup from each host over to the backup media. If you’ve been around in the data protection industry long enough you’ll have had experience of that – for instance, the drive towards 1Gbit networks over Fast Ethernet started more often than not in datacentres I was involved in thanks to backup. Likewise, the first systems I saw being attached directly to 10Gbit backbones in datacentres were the backup infrastructure.

Well architected deduplication can eliminate that consideration. That’s not to say you won’t eventually need 10Gbit, 40Gbit or even more in your datacentre, but if deduplication is architected correctly, you won’t need to deploy that next level up of network performance to meet your backup requirements.

In this blog article I want to take you through an example of why deduplication architecture matters, and I’ll focus on something that amazingly still gets consideration from time to time: post-ingest deduplication.

Before I get started – obviously, Data Domain doesn’t use post-ingest deduplication. Its pre-ingest deduplication ensures the only data written to the appliance is already deduplicated, and it further increases efficiency by pushing deduplication segmentation and processing out to the individual clients (in a NetWorker/Avamar environment) to limit the amount of data flowing across the network.

A post-ingest deduplication architecture, though, has your protection appliance feature two distinct tiers of storage – the landing or staging tier, and the deduplication tier. So that means when it’s time to do a backup, all your clients send all their data across the network to sit, in original sized format, on the staging tier:

Post Process Dedupe 01

In the example above we’ve already had backups run to the post-ingest deduplication appliance; so there’s a heap of deduplicated data sitting in the deduplication tier, but our staging tier has just landed all the backups from each of the clients in the environment. (If it were NetWorker writing to the appliance, each of those backups would be the full sized savesets.)

Now, at some point after the backup completes (usually a preconfigured time), post-processing kicks in. This is effectively a data-migration window in a post-ingest appliance where all the data in the staging tier has to be read and processed for deduplication. Using the example above, we might start with inspecting ‘Backup01’ for commonality to data on the deduplication tier:

Post Process Dedupe 02

So the post-ingest processing engine starts by reading through all the content of Backup01 and constructs a fingerprint analysis of the data that has landed.

Post Process Dedupe 03

As fingerprints are assembled, data can be compared against the data already residing in the deduplication tier. This may result in signature matches or signature misses, indicating new data that needs to be copied into the deduplication tier.

Post Process Dedupe 04

In this it’s similar to regular deduplication – signature matches result in pointers for existing data being updated and extended, and a signature miss results in needing to store new data on the deduplication tier.

Post Process Dedupe 05

Once the first backup file written to the staging tier has been dealt with, we can delete that file from the staging area and move onto the second backup file to start the process all over again. And we keep doing that over and over and over on the staging tier until we’re left with an empty staging tier:

Post Process Dedupe 06

Of course, that’s not the end of the process – then the deduplication tier will have to run its regular housekeeping operations to remove data that’s no longer referenced by anything.
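To make that flow concrete, here’s a toy sketch of the whole post-process cycle. It’s an illustration only, not how any particular appliance is implemented: the fixed 8-byte segments, the SHA-256 fingerprints and the function names are all assumptions I’ve picked to keep the example small.

```python
import hashlib

SEGMENT_SIZE = 8  # tiny fixed-size segments, purely for illustration

def segments(data, size=SEGMENT_SIZE):
    """Split a byte string into fixed-size segments."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def post_process(staging, dedupe_store, recipes):
    """Drain the staging tier, one backup at a time, into the deduplication tier."""
    for name in list(staging):
        recipe = []
        for seg in segments(staging[name]):
            fp = hashlib.sha256(seg).hexdigest()
            if fp not in dedupe_store:   # signature miss: new data must be stored
                dedupe_store[fp] = seg
            recipe.append(fp)            # hit or miss, the backup keeps a pointer
        recipes[name] = recipe
        del staging[name]                # the full-sized staged copy can now go

def housekeeping(dedupe_store, recipes):
    """Remove segments that no backup references any more."""
    live = {fp for recipe in recipes.values() for fp in recipe}
    for fp in list(dedupe_store):
        if fp not in live:
            del dedupe_store[fp]

# Both backups land on the staging tier at full size first.
staging = {
    "Backup01": b"AAAAAAAABBBBBBBBCCCCCCCC",
    "Backup02": b"AAAAAAAABBBBBBBBDDDDDDDD",
}
dedupe_store, recipes = {}, {}
landed = sum(len(d) for d in staging.values())
post_process(staging, dedupe_store, recipes)
stored = sum(len(s) for s in dedupe_store.values())
print(f"landed {landed} bytes, stored {stored} bytes, {len(staging)} backups left in staging")
```

Note what the sketch makes obvious: every byte has to land on the staging tier and then be read back again before a single duplicate is discarded.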

Architecturally, post-ingest deduplication is a kazoo to pre-ingest deduplication’s symphony orchestra. Sure, you might technically get to hear the 1812 Overture, but it’s not really going to be the same, right?

Let’s go through where, architecturally, post-ingest deduplication fails you:

  1. The network becomes your bottleneck again. You have to send all your backup data to the appliance.
  2. The staging tier has to have at least as much capacity available as the size of your biggest backup, assuming it can execute its post-process deduplication within the window between when your previous backup finishes and your next backup starts.
  3. The deduplication process becomes entirely spindle bound. If you’re using spinning disk, that’s a nightmare. If you’re using SSD, that’s $$$.
  4. There’s no way of telling how much space will be occupied on the deduplication tier after deduplication processing completes. This can lead you into very messy situations where say, the staging tier can’t empty because the deduplication tier has filled. (Yes, capacity maintenance is a requirement still on pre-ingest deduplication systems, but it’s half the effort.)

What this means is simple: post-ingest deduplication architectures are asking you to pay for their architectural inefficiencies. That’s where:

  1. You have to pay to increase your network bandwidth to get a complete copy of your data from client to protection storage within your backup window.
  2. You have to pay for both the staging tier storage and the deduplication tier storage. (In fact, the staging tier is often a lot bigger than the size of your biggest backups in a 24-hour window so the deduplication can be handled in time.)
  3. You have to factor the additional housekeeping operations into blackout windows, outages, etc. Housekeeping almost invariably becomes a daily rather than a weekly task, too.

Compare all that to pre-ingest deduplication:

Pre-Ingest Deduplication

Using pre-ingest deduplication, especially Boost based deduplication, the segmentation and hashing happen directly where the data is, and rather than sending the entire data to be protected from the client to the Data Domain, we only send the unique data. Data that already resides on the Data Domain? All we’ll have sent is a tiny fingerprint so the Data Domain can confirm it’s already there (and update its pointers for existing data), then move on. After your first backup, that potentially means that on a day to day basis your network requirements for backup are reduced by 95% or more.
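As a rough sketch of that idea (and only a sketch: this is not the Boost protocol, and the `Appliance` class, the 8-byte segments and the SHA-256 fingerprints are all assumptions for illustration), the client does the segmenting and hashing, asks which fingerprints are missing, and ships only those segments:

```python
import hashlib

SEGMENT_SIZE = 8  # tiny fixed-size segments, purely for illustration

class Appliance:
    """Hypothetical protection storage that only ever stores unique segments."""

    def __init__(self):
        self.store = {}

    def missing(self, fingerprints):
        """Tell the client which fingerprints are not already stored."""
        return {fp for fp in fingerprints if fp not in self.store}

    def write(self, fp, segment):
        self.store[fp] = segment

def client_backup(data, appliance):
    """Segment and hash on the client, then send only the unique segments.

    Returns the number of segment bytes that crossed the 'network'
    (the small fingerprints themselves are not counted here).
    """
    segs = [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]
    by_fp = {hashlib.sha256(s).hexdigest(): s for s in segs}
    sent = 0
    for fp in appliance.missing(by_fp):
        appliance.write(fp, by_fp[fp])
        sent += len(by_fp[fp])
    return sent

appliance = Appliance()
first = client_backup(b"AAAAAAAABBBBBBBBCCCCCCCC", appliance)
second = client_backup(b"AAAAAAAABBBBBBBBDDDDDDDD", appliance)
print(first, second)
```

In this toy run the first backup sends all 24 bytes, while the second, which shares two of its three segments with the first, sends only 8.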

That’s why architecture matters: you’re either doing it right, or you’re paying the price for someone else’s inefficiency.

If you want to see more about how a well architected backup environment looks – technology, people and processes – check out my book, Data Protection: Ensuring Data Availability.

Mar 13 2017

The NetWorker usage report for 2016 is now complete and available here. As with previous years’ surveys, the survey ran from December 1, 2016 through to January 1, 2017.


There were some interesting statistics and trends arising from this survey. The percentage of businesses not using backup to disk in at least some form within their environment fell to just 1% of respondents. That’s 99% of respondents having some form of backup to disk within their environment!

More and more respondents are cloning within their environments – if you’re not cloning in yours, you’re falling behind the curve now in terms of ensuring your backup environment can’t be a single point of failure.

There are plenty of other results and details in the survey report you may be interested in, including:

  • Changes to the number of respondents using dedicated backup administrators
  • Cloud adoption rates
  • Ransomware attacks
  • The likelihood of businesses using or planning to use object storage as part of their backup environment
  • and many more

You can download the survey from the link above.

Just a reminder: “Data Protection: Ensuring Data Availability” is out now, and you can buy it in both paperback and electronic format from Amazon, or in paperback from the publisher, CRC Press. If you’ve enjoyed my blog or found it useful, I’m sure you’ll find value in my latest book, too!

One respondent from this year’s survey will be receiving a signed copy of the book directly from me, too! That winner has been contacted.

Jan 24 2017

In 2013 I undertook the endeavour to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand on them based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was an increasingly common scenario, it was (for the most part) mainly used as a staging activity (“disk to disk to tape”), and backup to disk use was either dumb filesystems or Virtual Tape Libraries (VTL).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, core tenets of Cloud computing that made it so popular (e.g., agility and scalability) have been well and truly adopted as essential tenets of the modern datacentre, as well. Indeed, for on-premises IT to compete against Cloud, on-premises IT has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to their businesses.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book the pinnacle of storage performance was the 15,000 RPM drive, and flash memory storage was something you (primarily) used in digital cameras only, with storage capacities measured in the hundreds of megabytes more than gigabytes (or now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendency – with virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as separate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process and needed things done faster, cheaper, more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and for many businesses, profit too. Flash systems are now offering significantly more IOPS than a traditional array could – Dell EMC for instance can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPS. To achieve ten million IOPS on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability was born. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for a good reason: you must have a grasp on the other bits before you can start considering everything else, otherwise you’re just doing point-solutions, and eventually just doing point-solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability is released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

Data isn’t data isn’t data

Jan 18 2016

An integral part of effective data protection is data awareness. You can’t adequately protect what you don’t know about, but similarly, you can’t adequately protect what you don’t understand, either. Understanding what sort of data you have is critical to understanding how you can protect it – and even more so from a business perspective, how much you may need to spend in order to protect it.

As the title says, Data isn’t Data isn’t Data.

I think this is most striking for me in organisations which have been running with data protection solutions that have organically developed over time (probably since the company was either quite small or operationally quite informal) and are now looking at making major, hopefully long-reaching changes to their data protection strategy.

The scenario works like this: the company asks for proposals on a holistic data protection strategy. Its request tells prospective bidders all about where the data is, what operating systems it sits on, and usually even what the link speeds are between its sites, but gives no detail about the type of data involved. By type, I mean:

  • What percentage of the data is traditional database;
  • What percentage is traditional file/operating system;
  • What percentage is NAS;
  • What percentage is virtual machine images;
  • What percentage of each must be sent or stored in an encrypted format;
  • and so on.

At one time, that information wasn’t necessarily all that relevant: if it were all being sent to tape the biggest headaches came from whether or not there were particularly dense file systems. (You can’t stream tape backups over WAN-speed links so you’d typically not care about the link speed so long as you could deploy sufficient tape infrastructure in each required location.) If data was already compressed or already encrypted before it was backed up, that might reduce the compression ratio achieved on individual tapes in the data protection environment, but what’s a few tapes here and there?*

As data protection gets more efficient and smarter though, this sort of information becomes as important to understanding what will be involved in protecting it as the more traditional questions.

Consider for instance a company that wants to protect 70TB of data using deduplication storage so as to minimise the protection footprint and gain the most efficiencies out of disk based backup strategies. The typical starting questions you’d need to answer for a backup and recovery environment might be, say:

  • How long do you want to keep your daily/weekly backups for?
  • How long do you want to keep monthly fulls for?
  • Do you need long term retention for yearlies or other backups?

For the purposes of simplicity, let’s stick to just those first two questions and provide some basic answers to work with:

  • Daily incrementals and weekly fulls to be kept for 6 weeks
  • Monthly backups to be kept for 12 months

We’ll also assume all data is in one location. In the old world of tape, the above would have been enough information to come up with an approximate configuration to meet the backup capacity requirements for the environment. (Noted: it would certainly not be enough for determining speed requirements.)

But if you want to take advantage of deduplication, data isn’t data isn’t data. Knowing that you have 70TB of data doesn’t allow anyone to make any reliable recommendations about what sort of protection storage you might need if your intent is to drop tape and move to more efficient formats. OK, let’s start providing a few more details and see what happens.

Let’s say you’re told:

  • 70 TB of data
  • Weekly fulls retained for 6 weeks
  • Daily incrementals retained for 6 weeks
  • Monthly fulls retained for 12 months
  • 3.19% average daily change rate

If you’re just going to be backing up to tape, or plain disk, this has now given you enough information to have a stab at coming up with a potential capacity, which would start with:

Size ~= 70 TB x 6 (weekly fulls) + 70 TB x 12 (monthly fulls) + (70 TB x 3.19% x 36 incrementals)

Size ~= 1340.388 TB
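If you’d rather not count on your fingers, the same sum is easy to sketch in a few lines of Python. The function and parameter names are mine, and I’m assuming the 36 incrementals are simply six dailies a week for six weeks:

```python
def conventional_capacity_tb(data_tb, weekly_fulls, monthly_fulls,
                             daily_change, incrementals):
    """Capacity for tape or plain disk: every full and incremental stored whole."""
    fulls = data_tb * (weekly_fulls + monthly_fulls)
    incrs = data_tb * daily_change * incrementals
    return fulls + incrs

# 70 TB, 6 weekly fulls, 12 monthly fulls, 3.19% daily change, 36 incrementals
size = conventional_capacity_tb(70, 6, 12, 0.0319, 36)
print(f"{size:.3f} TB")
```

Almost all of that figure is the fulls (1260 TB); the incrementals only add about 80 TB.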

But is that accurate? Well, no: we don’t have enough information to properly understand the environment. Is it possible, for instance, to work out how much deduplication storage you might need to provide protection for 1340.388TB of backups? What’s the ‘average’ deduplication ratio for any data, regardless of what it is? (Hint: there’s no such thing.)

Coming back to the original point of the article, data isn’t data isn’t data. So let’s start breaking this out a little more and see what happens. That 70TB of data becomes:

  • 10 TB Files
  • 5 TB Databases
  • 5 TB Mail
  • 50 TB VMware

Let’s also assume that because we now know the data types, we also know the per-type change rate rather than relying on an average change rate, and so we actually have:

  • 10 TB Files at 1.75% daily change
  • 5 TB Databases at 6% daily change
  • 5 TB Mail at 3% daily change
  • 50 TB VMware at 2% daily change (within the VMs, not the individual container files – which of course is normally 100% change)

A few things to note here:

  • I’m not suggesting the above change rates are real-world, I’ve just shoved them into a spreadsheet as examples.
  • I’m not factoring in the amount of the same content that changes each day vs unique content that changes each day**.

At this point if we’re still sizing for either tape or conventional disk, we can more accurately come up with the storage capacity required. Based on those figures, our actual required capacity comes down from 1340.388TB to 1318.50TB. Not a substantial difference, but a difference nonetheless. (The quality and accuracy of your calculation always depends on the quality and accuracy of your data, after all.)
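Here’s a rough sketch of that revised calculation, using the per-type sizes and change rates listed above (the variable names are mine). It also hints at why the number drops: 3.19% appears to be the simple mean of the four change rates, whereas weighting each rate by how much data it applies to gives a lower overall figure.

```python
DATASETS = {            # name: (size in TB, daily change rate)
    "Files": (10, 0.0175),
    "Databases": (5, 0.06),
    "Mail": (5, 0.03),
    "VMware": (50, 0.02),
}
FULLS = 6 + 12          # weekly fulls plus monthly fulls
INCREMENTALS = 36

def conventional_capacity_tb(datasets):
    """Tape or plain disk sizing, summed per data type."""
    total = 0.0
    for size_tb, change in datasets.values():
        total += size_tb * FULLS + size_tb * change * INCREMENTALS
    return total

total_tb = sum(size for size, _ in DATASETS.values())
simple_mean = sum(change for _, change in DATASETS.values()) / len(DATASETS)
weighted = sum(size * change for size, change in DATASETS.values()) / total_tb

print(f"capacity: {conventional_capacity_tb(DATASETS):.2f} TB")
print(f"simple mean change: {simple_mean:.2%}, weighted change: {weighted:.2%}")
```

The weighted rate works out at roughly 2.32% a day, because the 50 TB of VMware data at 2% dominates the 5 TB of databases at 6%.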

If we assumed a flat deduplication rate we might have enough data now to come up with a sizing for deduplication storage, but in reality there’s a minimum of three deduplication ratios you want to consider, notably:

  • Deduplication achieved from first full backup
  • Deduplication achieved from subsequent full backups
  • Deduplication achieved for incremental backups

In reality, it’s more complex than that – again, returning to the rate of unique vs non-unique change within the environment. Coming back to data isn’t data isn’t data though, that’ll be different for each data type.

So let’s come up with some basic deduplication ratios – again, I’m just pulling numbers out of my head and these should in no way be seen as ‘average’. Let’s assume the following:

  • File backups have a first full dedupe of 4x, a subsequent full dedupe of 6x, and an incremental dedupe of 3x
  • Database backups have a first full dedupe of 2.5x, a subsequent full dedupe of 3x, and an incremental dedupe of 1.5x
  • Mail backups have a first full dedupe of 3x, a subsequent full dedupe of 4x, and an incremental dedupe rate of 2.5x
  • VMware backups have a first full dedupe of 6x, a subsequent full dedupe of 12x, and an incremental dedupe rate of 6x

If we plug those into a basic spreadsheet (given I still count on my fingers), we might see a sizing and capacity requirement of:

  • Files – 32.93 TB
  • Database – 37.53 TB
  • Mail – 25.08 TB
  • VMware – 85.17 TB
  • Total – 180.71 TB
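For anyone wanting to reproduce those numbers without a spreadsheet, here’s a minimal sketch. The formula is my reconstruction rather than the actual spreadsheet: one first full, the other 17 retained fulls at the subsequent-full ratio, and 36 incrementals at the incremental ratio. It reproduces the 180.71 TB total on two assumptions that are baked into the table below: the database subsequent-full ratio is taken as 3x, and the mail line comes out at 25.08 TB.

```python
FULLS = 6 + 12          # weekly plus monthly fulls retained
INCREMENTALS = 36

# name: (size TB, daily change, first-full ratio, later-full ratio, incremental ratio)
DATASETS = {
    "Files":    (10, 0.0175, 4.0, 6.0, 3.0),
    "Database": (5,  0.06,   2.5, 3.0, 1.5),   # 3x later-full ratio is an assumption
    "Mail":     (5,  0.03,   3.0, 4.0, 2.5),
    "VMware":   (50, 0.02,   6.0, 12.0, 6.0),
}

def dedupe_capacity_tb(size_tb, change, first, later, incr):
    """Deduplicated capacity for one data type under the assumed formula."""
    first_full = size_tb / first
    later_fulls = (FULLS - 1) * size_tb / later
    incrementals = INCREMENTALS * size_tb * change / incr
    return first_full + later_fulls + incrementals

sizes = {name: dedupe_capacity_tb(*spec) for name, spec in DATASETS.items()}
for name, tb in sizes.items():
    print(f"{name}: {tb:.2f} TB")
print(f"Total: {sum(sizes.values()):.2f} TB")
```

Change any one ratio and watch how far the total moves: that sensitivity is exactly why a single ‘average’ deduplication ratio is no use for sizing.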

It’s here that you need to be aware of any gotchas. What happens, for instance, if an environment has some sort of high security requirement for file storage, and all files on fileservers are encrypted before being written to disk? In that scenario, the backup product would be dealing with 10TB of storage that won’t deduplicate at all. That might result in no deduplication at all for each of the backup scenarios (first full, subsequent full and incrementals) for the file data: we’d have a 1:1 storage requirement for those backups. This would mean our file backup storage would require 186.3TB of backup capacity (vs 32.93 TB above), bringing the total storage with deduplication to 334.08 TB.
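The encrypted-fileserver case is quick to check. This sketch assumes the same retention as before (18 fulls and 36 incrementals) and simply stores the file data at 1:1, then swaps the deduplicated file figure out of the earlier total:

```python
FULLS, INCREMENTALS = 6 + 12, 36

def undeduplicated_tb(size_tb, daily_change):
    """Every full and every incremental stored at 1:1."""
    return size_tb * FULLS + size_tb * daily_change * INCREMENTALS

encrypted_files = undeduplicated_tb(10, 0.0175)
# Replace the deduplicated file figure (32.93 TB) within the 180.71 TB total.
new_total = 180.71 - 32.93 + encrypted_files
print(f"files: {encrypted_files:.1f} TB, total: {new_total:.2f} TB")
```

One data type that won’t deduplicate is enough to nearly double the protection storage for the whole environment.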

The example I’ve given is pretty simplistic, and in no way exhaustive, but it should start to elaborate on why the old way of specifying how much data you have just doesn’t cut it any more. Examples of where the above would need further clarification would include:

  • What is the breakdown between virtual machines hosting regular data and database data? (increasingly important as virtualisation loads increase)
  • For each dataset, would there be any data that’s already compressed, already encrypted, or some form of multimedia? (10TB of Word documents will have a completely different storage profile to 10TB of MP4 files, for instance.)

And then, of course, as we look at multi-site environments, it’s then important to understand:

  • What is the breakdown of data per site?
  • What is the link speed between each site?

This is all just for sizing alone. For performance obviously it’s then important to understand so much more – recovery time objectives, recovery point objectives, frequency of recoveries, backup window, and so on … but this brings us back to the title of the article:

Data isn’t data isn’t data.

So if you’re reaching that point where you are perhaps considering deduplication for the first time, remember to get your data classified by type and work with your local supplier or vendor (which I’m hoping will be EMC, of course) to understand what your likely deduplication ratios are.

* Actually, “a few tapes here and there” can add up spectacularly quickly, but that’s another matter.
** By this I mean the difference between a different 1.75% of files being edited each day on the fileserver, the same 1.75% of files being edited each day on the fileserver, or some mix thereof – this plays an important factor that I’m disregarding for simplicity.

Of accidental architectures

Jul 20 2013



EMC’s recent big backup announcements included a variety of core product suite enhancements in the BRS space – Data Domain got a substantial refresh, Avamar jumped up to v7, and NetWorker to 8.1. For those of us who work in the BRS space, it was like Christmas in July*.

Anyone who has read my NetWorker 8.1 overview knows how much I’m going to enjoy working with that release. I’m also certainly looking forward to getting my hands on the new Data Domains, and it’ll be interesting to deep dive into the new features of Avamar 7, but one of the discussion points from EMC caught my attention more than the technology.

Accidental architecture.

Accidental architecture describes incredibly succinctly and completely so many of the mistakes made in enterprise IT, particularly around backup and recovery, archive and storage. It also perfectly encapsulates the net result of siloed groups and teams working independently and at times even at odds with one another, rather than synergistically meeting business requirements.

That sort of siloed development is a macrocosm of course of what I talk about in my book in section – the difference between knowledge-based and person-based groups, viz.:

[T]he best [group] is one where everyone knows at least a little bit about all the systems, and all the work that everyone else does. This is a knowledge-sharing group. Another type … is where everyone does their own thing. Knowledge sharing is at a minimum level and a question from a user about a particular system gets the response, “System X? See Z about that.” This is a person-centric group.

Everyone has seen a person-centric group. They’re rarely the fault of the people in the groups – they speak to a management or organisational failure. Yet, they’re disorganised and dangerous. They promote task isolation and stifle the development of innovative solutions to problems.

Accidental architecture comes when the groups within a business become similarly independent of one another. This happens at two levels – the individual teams within the IT arm, and it can happen at the business group level, too.

EMC’s approach is to work around business dysfunction and provide a seamless BRS experience regardless of who is partaking in the activity. The Data Domain plug-in for RMAN/Boost is a perfect example of this: it’s designed to allow database administrators to take control of their backup processes, writing Oracle backups with a Data Domain as target, completely bypassing whatever backup software is in the field.

Equally, the VMware vCenter plugins that allow provisioning of backup and recovery activities from within vSphere are about trying to work around the silos.

It’s an admirable goal, and I think for a lot of businesses it’s going to be the solution they’re looking for.

I also think it’s a goal that shouldn’t need to exist. EMC’s products help to mitigate the problem, but a permanent solution needs to also come from within business change.

Crossing the ravine

As I mentioned in Rage against the Ravine, a lot of the silo issues that exist within an organisation – effectively, the accidental architectures – result from the storage, virtualisation and backup/data protection teams working too independently. These three critical back-of-house functions are so interdependent that there is rarely any good reason to keep them entirely separate. In small to medium enterprises, they should be one team. In the largest of enterprises there may be a need for independent teams, but they should rotate staff between each other for maximised knowledge sharing, and they should be required to fully collaborate with one another.

In itself, that speaks again to the need for a stronger corporate approach to data protection, which requires the appointment of Data Protection Advisors and, of course, the formation of an Information Protection Advisory Council.

As I’ve pointed out on more than one occasion, technology is rarely the only solution:

Rest of the iceberg

Technology is the tip of the iceberg in an accidental architecture environment, and deploying new technology doesn’t technically solve the problem, it merely masks it.

EMC’s goal of course is admirable – empower each team to achieve their own backup and recovery requirements, and I’ll fully admit there’ll always be situations where it’s necessary, so it was a direction they had to take. That’s not to say they’re looking in the wrong direction – EMC isn’t a management consulting company, after all. A business following the EMC approach does, however, get a critical advantage: breathing space. When accidental architectures have led to a bunch of siloed deployments and groups within an organisation, those groups end up spending most of their time fighting fires rather than proactively planning in a way that suits the entire organisation. Slot the EMC product suite in and those teams can start pulling back from firefighting. They can start communicating, planning and collaborating more effectively.

If you’ve got an accidental architecture for data protection, your first stop is EMC BRS’s enablement of per-technology/team solutions. Then, once you’ve had time to regroup, your next stop is to develop a cohesive and holistic approach at the personnel, process and business function layer.

At that point … boy, will your business fly.

* The term “christmas in July”, if you’re not aware of it, is fairly popular in Australia in some areas. It’s about having a mock christmas party during our coldest part of the year, mimicking in some small way the sorts of christmas those in the Northern Hemisphere get every year.

Apr 21 2012

What’s the ravine?

When we talk about data flow rates into a backup environment, it’s easy to focus on the peak speeds – the maximum write performance you can get to a backup device, for instance.

However, sometimes that peak flow rate is almost irrelevant to the overall backup performance.

Backup ravine

Many hosts will exist within an environment where only a relatively modest percentage of their data can be backed up at peak speed; the vast majority of their data will instead be backed up at suboptimal speeds. For instance, consider the following nsrwatch output:

High Performance

That’s a write speed averaging 200MB/s per tape drive (peaks were actually 265MB/s in the above tests), writing around 1.5-1.6GB/s.

However, unless all your data is highly optimised structured data running on high performance hardware with high performance networking, your real-world experiences will vary considerably on a minute to minute basis. As soon as filesystem overheads become a significant factor in the backup activity (i.e., you hit fileservers, the regular operating system and application areas of a host, etc.), your backup performance is generally going to drop by a substantial margin.

This is easy enough to test in real-world scenarios; take a chunk of a filesystem (at least 2x the memory footprint of the host in question), and compare the time to back up:

  • The actual files;
  • A tar of the files.

You’ll see in that situation that there’s a massive performance difference between the two. If you want to see some real-world examples on this, check out “In-lab review of the impact of dense filesystems”.
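If you don’t want to run two full backups just to see the effect, a rough approximation is to time how long it takes simply to read the data both ways. The sketch below is only that: it assumes raw read time is a reasonable proxy for the read side of a backup, and the function names (read_tree, read_file, make_tar, timed) are my own, not part of any backup product. Keep the tar file outside the directory being measured, and remember the 2x-memory sizing above, otherwise the page cache will flatter both numbers.

```python
import os
import tarfile
import time


def read_tree(root: str, chunk: int = 1 << 20) -> int:
    """Read every regular file under root and return the total bytes read."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, broken links and the like
            total += read_file(path, chunk)
    return total


def read_file(path: str, chunk: int = 1 << 20) -> int:
    """Read a single file sequentially and return the bytes read."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk)
            if not block:
                break
            total += len(block)
    return total


def make_tar(root: str, tar_path: str) -> None:
    """Pack root into an uncompressed tar (tar_path must live outside root)."""
    with tarfile.open(tar_path, "w") as tar:
        tar.add(root, arcname=".")


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Calling timed(read_tree, "/some/dense/area") and then timed(read_file, "/elsewhere/area.tar") gives you two byte counts and two elapsed times to turn into MB/s. Both paths are placeholders.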

Unless pretty much all of your data environment consists of optimised structured data which is optimally available, you’ll likely need to focus your performance tuning activities on the performance ravine – those periods of time where performance is significantly sub-optimal. Or to consider it another way – if absolute optimum performance is 200MB/s, spending a day increasing that to 205MB/s doesn’t seem productive if you also determine that 70% of the time the backup environment is running at less than 100MB/s. At that point, you’re going to achieve much more if you flatten the ravine.
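To put some numbers on that, here’s a small worked sketch. It assumes a deliberately simplified two-speed model: 30% of the backup window at peak speed and 70% in the ravine, with the ravine pegged at the 100MB/s upper bound. The 150MB/s “flattened” figure is purely hypothetical, and effective_throughput is just a name I’ve picked for a time-weighted average.

```python
def effective_throughput(profile):
    """Return the time-weighted average MB/s for a list of
    (fraction_of_time, mb_per_s) pairs whose fractions sum to 1."""
    total_fraction = sum(fraction for fraction, _rate in profile)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(fraction * rate for fraction, rate in profile)


# Assumed two-speed model: 30% of the window at peak, 70% in the ravine.
baseline = effective_throughput([(0.3, 200), (0.7, 100)])
tuned_peak = effective_throughput([(0.3, 205), (0.7, 100)])
flattened = effective_throughput([(0.3, 200), (0.7, 150)])  # hypothetical

print(f"baseline {baseline:.1f} MB/s, tuned peak {tuned_peak:.1f} MB/s, "
      f"flattened ravine {flattened:.1f} MB/s")
```

Under those assumptions the day spent on the peak moves the average from roughly 130MB/s to 131.5MB/s, while lifting the ravine moves it to roughly 165MB/s.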

Looking for a quick fix

There are various ways that you can aim to do this. If we stick purely within the backup realm, then you might look at factoring in some form of source based deduplication as well. Avamar, for instance, can ameliorate some issues associated with unstructured data. Admittedly though, if you don’t already have Avamar in your environment, adding it can be a fairly big spend, so it’s at the upper range of options that may be considered, and even then won’t necessarily always be appropriate, depending on the nature of that unstructured data.

Traditional approaches have included sending multiple streams per filesystem, and (on some occasions) considering block-level backup of filesystem data (e.g., via SnapImage – though increasing virtualisation is further reducing SnapImage’s number of use-cases), or using NDMP if the data layout is more amenable to better handling by a NAS device.

What the performance ravine demonstrates is that backup is not an isolated activity. In many organisations there’s a tendency to have segmentation along the lines of:

  • Operating system administration;
  • Application/database administration;
  • Virtualisation teams;
  • Storage teams;
  • Backup administration.

Looking for the real fix

In reality, fixing the ravine needs significant levels of communication and cooperation between the groups, and, within most organisations, a merger of the final three teams above, viz:

Crossing the ravine

The reason we need such close communication, and even team merger, is that baseline performance improvement can only come when there’s significant synergy between the groups. For instance, consider the classic dense-filesystem issue. Three core ways to solve it are:

  • Ensure the underlying storage supports large numbers of simultaneous IO operations (e.g., a large number of spindles) so that multistream reads can be achieved;
  • Shift the data storage across to NAS, which is able to handle processing of dense filesystems better;
  • Shift the data storage across to NAS, and do replicated archiving of infrequently accessed data to pull the data out of the backup cycle altogether.

If you were hoping this article might be about quick fixes to the slower part of backups, I have to disappoint you: it’s not so simple, and as suggested by the above diagram, is likely to require some other changes within IT.

If merger in itself is too unwieldy to consider, the next option is the forced breakdown of any communication barriers between those three groups.

A ravine of our own making

In some senses, we were spoilt when gigabit networking was introduced; the solution became fairly common – put the backup server and any storage nodes on a gigabit core, and smooth out those ravines by ensuring that multiple savesets would always be running; therefore even if a single server couldn’t keep running at peak performance, there was a high chance that aggregated performance would be within acceptable levels of peak performance.

Yet unstructured data has grown at a rate which quite frankly has outstripped sequential filesystem access capabilities. It might be argued that operating system vendors and third party filesystem developers won’t make real inroads on this until they can determine adequate ways of encapsulating unstructured filesystems in structured databases, but development efforts down that path haven’t as yet yielded any mainstream available options. (And in actual fact just caused massive delays.)

The solution as environments switch over to 10Gbit networking however won’t be so simple – I’d suggest it’s not unusual for an environment with 10TB of used capacity to have a breakdown of data along the lines of:

  • 4 TB filesystem
  • 2 TB database (prod)
  • 3 TB database (Q/A and development)
  • 500 GB mail
  • 500 GB application & OS data

Assuming by “mail” we’ve got “Exchange”, then it’s quite likely that 5.5TB of the 10TB space will back up fairly quickly – the structured components. That leaves 4.5TB hanging around like a bad smell though.
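As a quick sanity check on that split, and on why the smaller half can still dominate the backup window, here’s a back-of-the-envelope sketch. The capacities come from the list above; the two throughput figures (200MB/s for structured data, 40MB/s for unstructured) are illustrative assumptions only, not measurements, and the function names are mine.

```python
# Capacity in TB and whether the data is structured, per the list above.
BREAKDOWN_TB = {
    "filesystem": (4.0, False),
    "database (prod)": (2.0, True),
    "database (Q/A and development)": (3.0, True),
    "mail": (0.5, True),  # assumes Exchange, i.e. structured
    "application & OS data": (0.5, False),
}


def split_by_structure(breakdown):
    """Return (structured_tb, unstructured_tb) for a breakdown mapping."""
    structured = sum(tb for tb, is_structured in breakdown.values() if is_structured)
    unstructured = sum(tb for tb, is_structured in breakdown.values() if not is_structured)
    return structured, unstructured


def backup_hours(tb, mb_per_s):
    """Hours to move tb terabytes at a sustained mb_per_s (decimal units)."""
    return tb * 1_000_000 / mb_per_s / 3600


structured_tb, unstructured_tb = split_by_structure(BREAKDOWN_TB)
# Assumed sustained rates, for illustration only.
print(f"structured:   {structured_tb} TB, {backup_hours(structured_tb, 200):.1f} h")
print(f"unstructured: {unstructured_tb} TB, {backup_hours(unstructured_tb, 40):.1f} h")
```

With those assumed rates, the 4.5TB of unstructured data takes several times longer to move than the 5.5TB of structured data, which is the ravine in a nutshell.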

Unstructured data though actually proves a fundamental point I’ve always maintained – that Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP) are two reasonably independent activities. If they were the same activity, then the resulting synergy would ensure the data were laid out and managed in such a way that data protection would be a doddle. Remember that ILP resembles the following:

Components of ILP

One place where the ravine can be tackled more readily is in the deployment of new systems, which is where that merger of storage, backup and virtualisation comes in, not to mention the close working relationship between OS, Application/DB Admin and the backup/storage/virtualisation groups. Most forms and documents used by organisations when it comes to commissioning new servers will have at most one or two fields for storage – capacity and level of protection. Yet, anyone who works in storage, and equally anyone who works in backup will know that such simplistic questions are the tip of the iceberg for determining performance levels, not only for production access, but also for backup functionality.

The obvious solution to this is service catalogues that cover key factors such as:

  • Capacity;
  • RAID level;
  • Snapshot capabilities;
  • Performance (IOPS) for production activities;
  • Performance (MB/s) for backup/recovery activities (what would normally be quantified under Service Level Agreements, also including recovery time objectives);
  • Recovery point objectives;
  • etc.
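To make that a little more concrete, a service catalogue entry might be captured as a simple record along the following lines. This is a hypothetical sketch: the class name, field names and the restore-time check are my own illustration of the factors listed above, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageServiceTier:
    """One hypothetical service catalogue entry for new storage."""
    name: str
    capacity_gb: float
    raid_level: str
    snapshots_supported: bool
    production_iops: int
    recovery_mb_per_s: float  # sustained backup/recovery throughput
    rto_hours: float          # recovery time objective
    rpo_hours: float          # recovery point objective

    def full_restore_hours(self) -> float:
        """Hours to restore the full capacity at the stated recovery rate."""
        return self.capacity_gb * 1000 / self.recovery_mb_per_s / 3600

    def meets_rto(self) -> bool:
        """True if a full restore fits inside the recovery time objective."""
        return self.full_restore_hours() <= self.rto_hours


# Example values are invented purely for illustration.
tier = StorageServiceTier(
    name="departmental fileserver",
    capacity_gb=4000,
    raid_level="RAID-6",
    snapshots_supported=True,
    production_iops=5000,
    recovery_mb_per_s=100,
    rto_hours=8,
    rpo_hours=24,
)
print(f"{tier.name}: full restore {tier.full_restore_hours():.1f} h, "
      f"meets RTO: {tier.meets_rto()}")
```

The point of writing it down this way is that the backup and recovery numbers sit next to the production numbers at commissioning time, so a mismatch between capacity, recovery rate and RTO shows up before the system is built rather than after.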

But what has all this got to do with the ravine?

I said much earlier in the piece that if you’re looking for a quick solution to the poor-performance ravine within an environment, you’ll be disappointed. In most organisations, once the ravine appears, there’ll need to be at least technical and process changes in order to adequately tackle it – and quite possibly business structural changes too.

Take (as always seems to be the bad smell in the room) unstructured data. Once it’s built up in a standard configuration beyond a certain size, there’s no “easy” fix because it becomes inherently challenging to manage. If you’ve got a 4TB filesystem serving end users across a large department or even an entire company, it’s easy enough to think of a solution to the problem, but thinking about a problem and solving it are two entirely different things, particularly when you’re discussing production data.

It’s here where team merger seems most appropriate; if you take storage in isolation, a storage team will have a very specific approach to configuring a large filesystem for unstructured data access – the focus there is going to be on maximising the number of concurrent IOs and ensuring that standard data protection is in place. That’s not, however, always going to correlate to a configuration that lends itself to traditional backup and recovery operations.

Looking at ILP as a whole though – factoring in snapshot, backup and replication, you can build an entirely different holistic data protection mechanism. Hourly snapshots for 24-48 hours allow for near instantaneous recovery – often user initiated, too. Keeping one of those snapshots per day for say, 30 days, extends this considerably to cover the vast number of recovery requests a traditional filesystem would get. Replication between two sites (including the replication of the snapshots) allows for a form of more traditional backup without yet going to a traditional backup package. For monthly ‘snapshots’ of the filesystem though, regular backup may be used to allow for longer term retention. Suddenly when the ravine only has to be dealt with once a month rather than daily, it’s no longer much of an issue.
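The snapshot side of that scheme boils down to a simple retention rule. Here’s a minimal sketch of it, assuming hourly snapshots are kept for 48 hours and that the snapshot taken in the midnight hour is the one kept as the daily copy for 30 days. Both of those specifics, and the names used, are my assumptions for illustration. Monthly retention is left to regular backup, as described above.

```python
from datetime import datetime, timedelta

HOURLY_WINDOW = timedelta(hours=48)  # keep every hourly snapshot this long
DAILY_WINDOW = timedelta(days=30)    # then keep one per day this long
DAILY_KEEP_HOUR = 0                  # assumption: midnight snapshot is the daily


def keep_snapshot(taken: datetime, now: datetime) -> bool:
    """Decide whether a snapshot taken at `taken` should still exist at `now`."""
    age = now - taken
    if age < timedelta(0):
        raise ValueError("snapshot is dated in the future")
    if age <= HOURLY_WINDOW:
        return True
    if age <= DAILY_WINDOW:
        return taken.hour == DAILY_KEEP_HOUR
    return False  # beyond 30 days, rely on the monthly backup instead


now = datetime(2012, 4, 21, 12, 0)
for taken in (now - timedelta(hours=3),
              datetime(2012, 4, 10, 0, 0),
              datetime(2012, 4, 10, 5, 0),
              datetime(2012, 3, 1, 0, 0)):
    print(taken.isoformat(), "keep" if keep_snapshot(taken, now) else "expire")
```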

Yet, that’s not the only way the problem might be dealt with – what if 80% of that data being backed up is stagnant data that hasn’t been looked at in 6 months? Shouldn’t that then require deleting and archiving? (Remember, first delete, then archive.)
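Finding out whether you’re in that position needn’t be a big project. The sketch below walks a directory tree and reports how many bytes haven’t been accessed in roughly six months (taken here as 180 days, which is my approximation). It relies on file access times, so treat the answer as indicative only: filesystems mounted with options such as noatime or relatime won’t record accesses faithfully. The function name is my own.

```python
import os
import stat
import time


def stale_bytes(root, max_idle_days=180, now=None):
    """Return (stale_bytes, total_bytes) for regular files under root,
    where stale means last accessed more than max_idle_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    stale = total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue
            total += info.st_size
            if info.st_atime < cutoff:
                stale += info.st_size
    return stale, total
```

Dividing the first number by the second gives you the percentage to take to whoever owns the data before anyone starts tuning the backup.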

I’d suggest that a common sequence of problems when dealing with backup performance runs as follows:

  1. Failure to notice: Incrementally increasing backup runtimes over a period of weeks or months often go unnoticed until they’ve already gone from a manageable problem to a serious problem.
  2. Lack of ownership: Is a filesystem backing up slowly the responsibility of the backup administrators or the operating system administrators, or the storage administrators? If they are independent teams, there will very likely be a period where the issue is passed back and forth for evaluation before a cooperative approach (or even if a cooperative approach) is decided upon.
  3. Focus on the technical: The current technical architecture is what got you into the mess – in and of itself, it’s not necessarily going to get you out of the mess. Sometimes organisations focus so strongly on looking for a technical solution that it’s like someone who runs out of fuel on the freeway running to the boot of their car, grabbing a jerry can, then jumping back in the driver’s seat expecting to be able to drive to the fuel station. (Or, as I like to put it: “Loop, infinite: See Infinite Loop; Infinite Loop: See Loop, Infinite”.)
  4. Mistaking backup for recovery: In many cases the problem ends up being solved, but only for the purposes of backup, without attention to the potential impact that may make on either actual recoverability or recovery performance.

The first issue is caused by a lack of centralised monitoring. The second, by a lack of centralised management. The third, by a lack of centralised architecture, and the fourth, by a lack of IT/business alignment.
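Of the four, the first is the easiest to start on. Even a crude trend check over nightly runtimes will catch a slow creep long before anyone notices it by eye. The following is a minimal sketch using a plain least-squares fit; the function names and the 20% threshold are assumptions of mine, and how you extract runtimes from your backup product is left out entirely.

```python
def runtime_slope(minutes):
    """Least-squares slope (minutes per run) of a series of backup runtimes."""
    n = len(minutes)
    if n < 2:
        raise ValueError("need at least two runs to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(minutes) / n
    numerator = sum((i - mean_x) * (m - mean_y) for i, m in enumerate(minutes))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_creeping(minutes, threshold=0.20):
    """True if the fitted runtime grew by more than `threshold` (as a
    fraction of the fitted starting runtime) across the series."""
    n = len(minutes)
    slope = runtime_slope(minutes)
    fitted_start = sum(minutes) / n - slope * (n - 1) / 2
    fitted_growth = slope * (n - 1)
    return fitted_start > 0 and fitted_growth / fitted_start > threshold


# Thirty nightly runs creeping up by two minutes a night (invented data).
nightly = [100 + 2 * night for night in range(30)]
print(runtime_slope(nightly), is_creeping(nightly))
```

Fitting a line rather than comparing last night to the night before is deliberate: a single slow night is noise, while a steady two minutes a night is a ravine forming.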

If you can seriously look at all four of those core issues and say replacing LTO-4 tape drives with LTO-5 tape drives will 100% solve a backup-ravine problem every time, you’re a very, very brave person.

If we consider that backup-performance ravine to be a real, physical one, the only way you’re going to get over it is to build a bridge, and that requires a strong cooperative approach rather than a piecemeal approach that pays scant regard for anything other than the technical.

I’ve got a ravine, what do I do?

If you’re aware you’ve got a backup-performance ravine problem plaguing your backup environment, the first thing you’ve got to do is to pull back from the abyss and stop staring into it. Sure, in some cases, a tweak here or a tweak there may appear to solve the problem, but likely it’s actually just addressing a symptom, instead. One symptom.

Backup-performance ravines should in actual fact be viewed as an opportunity within a business to re-evaluate the broader environment:

  1. Is it time to consider a new technical architecture?
  2. Is it time to consider retrofitting an architecture to the existing environment?
  3. Is it time to evaluate achieving better IT administration group synergy?
  4. Is it time to evaluate better IT/business alignment through SLAs, etc.?

While the problem behind a backup-performance ravine may not be as readily solvable as we’d like, it’s hardly insurmountable – particularly when businesses are keen to look at broader efficiency improvements.

Feb 27 2011

When it comes time to consider refreshing the hardware in your environment, do you want to do it quickly, or properly?

Because here’s the thing: If you want to do it quickly – if you feel rushed, and want to just get it done ASAP, not seeing the point of actually doing a thorough analysis of your sizing and growth requirements, here’s what you do:

  • Guess at the number of clients you’re going to backup.
  • Guess at the amount of data you’ll be backing up from first implementation.
  • Guess at the growth rate you’ll experience over the X years you want the system to last for.
  • Guess at the number of staff you’ll need to manage it.

Then, once you’ve got those numbers down, multiply each one by at least 4.

Then, ask for twice the budget necessary to achieve those numbers – just to be on the safe side.

If you think I’m joking – I’m not; I’m deadly serious. Deciding to skip an architecture phase where you actually review your needs, your growth patterns, your staffing requirements, etc., because you’re in a hurry is a costly and damning mistake to make. So if you’re going to do it, you may as well try to make sure you can survive the budget period.

And if asking for that much budget scares the heck out of you – well, there is an alternative: conduct a proper system architecture phase. Sure, it may take a little longer to get things running, or cost a little more time/money to get the plan done, but once you’ve got that done, it’ll be gold.

Feb 13 2009

Note: It’s 2015, and I now completely disagree with what I wrote below. Feel free to read what I had to say, but then check out Virtualised Servers and Storage Nodes.


When it comes to servers, I love virtualisation. No, not to the point where I’d want to marry virtualisation, but it is something I’m particularly keen about. I even use it at home – I’ve gone from 3 servers, one for databases, one as a fileserver, and one as an internet gateway down to one, thanks to VMware Server.

Done rightly, I think the average datacentre should be able to achieve somewhere in the order of 75% to 90% virtualisation. I’m not talking high performance computing environments – just your standard server farms. Indeed, having recently seen a demo for VMware’s Site Recovery Manager (SRM), and having participated in many site failover tests, I’ve become a bigger fan of the time and efficiency savings available through virtualisation.

That being said, I think backup servers fall into that special category of “servers that shouldn’t be virtualised”. In fact, I’d go so far as to say that even if every other machine in your server environment is virtual, your backup server still shouldn’t be a virtual machine.

There are two key reasons why I think having a virtualised backup server is a Really Bad Idea, and I’ll outline them below:


In the event of a site disaster, your backup server should be at least equally the first server that is rebuilt. That is, you may start the process of getting equipment ready for restoration of data, but the backup server needs to be up and running in order to achieve data recovery.

If the backup server is configured as a guest within a virtual machine server, it’s hardly going to be the first machine to be configured, is it? The virtual machine server will need to be built and configured first, then the backup server after this.

In this scenario, there is a dependency that results in the build of the backup server becoming a bottleneck to recovery.

I realise that we try to avoid scenarios where the entire datacentre needs to be rebuilt, but this still has to remain a factor in mind – what do you want to be spending time on when you need to recover everything?


Most enterprise class virtualisation systems offer the ability to set performance criteria on a per machine basis – that is, in addition to the basics you’d expect such as “this machine gets 1 CPU and 2GB of RAM”, you can also configure options such as limiting the number of MHz/GHz available to each presented CPU, or guaranteeing performance criteria.

Regardless though, when you’re a guest in a virtual environment, you’re still sharing resources. That might be memory, CPU, backplane performance, SAN paths, etc., but it’s still sharing.

That means at some point, you’re sharing performance. The backup server, which is trying to write data out to the backup medium (be that tape or disk), is potentially either competing for, or at least sharing, backplane throughput with the machines it is backing up.

This may not always make a tangible impact. However, debugging such an impact when it does occur becomes much more challenging. (For instance, in my book, I cover off some of the performance implications of having a lot of machines access storage from a single SAN, and how the performance of any one machine during backup is no longer affected just by that machine. The same non-trivial performance implications come into play when the backup server is virtual.)

In Summary

One way or the other, there’s a good reason why you shouldn’t virtualise your backup environment. It may be that for a small environment, the performance impact isn’t an issue and it seems logical to virtualise. However, if you are in a small environment, your failover to another site is likely to be a very manual process, in which case you’ll be far more likely to hit the dependency issue when it comes time for the full site recovery.

Equally, if you’re a large company that has a full failover site, then while the dependency issue may not be as much of a problem (due to say, replication, snapshots, etc.), there’s a very high chance that backup and recovery operations are very time critical, in which case the performance implications of having a backup server share resources with other machines will likely make a virtual backup server an unpalatable solution.

A final request

As someone who has done a lot of support, I’d make one special request if you do decide to virtualise your backup server*.

Please, please make sure that any time you log a support call with your service provider you let them know you’re running a virtual backup server. Please.

* Much as I’d like everyone to do as I suggest, I (a) recognise this would be a tad boring and (b) am unlikely at any point soon or in the future to become a world dictator, and thus wouldn’t be able to issue such an edict anyway, not to mention (c) can occasionally be fallible.
