Feb 07, 2018
 

The world is changing, and data protection is changing with it. (OK, that sounds like an ad for Veridian Dynamics, but I promise I’m serious.)

One of the areas in which data protection is changing is that backup environments are growing in terms of the number of deployments. It’s quite common these days to see multiple backup servers deployed in an environment – whether that’s due to acquisitions, required functionality, network topology or physical locations, the reason doesn’t really matter. What does matter is that as you increase the number of systems providing protection within an environment, you still want to be able to manage and monitor those systems centrally.

Data Protection Central (DPC) was released earlier this month, and it’s designed from the ground up as a modern, HTML5 web-based system to allow you to monitor your Avamar, NetWorker and Data Domain environments, providing health and capacity reporting on systems and backup. (It also builds on the Multi Systems Manager for Avamar to allow you to perform administrative functions within Avamar without leaving the DPC console – and, well, more is to come on that front over time.)

I’ve been excited about DPC for some time. You may remember a recent post of mine talking about Data Domain Management Center (DDMC); DPC isn’t (at the moment at least) a replacement for DDMC, but it’s built in the same spirit of letting administrators have easy visibility over their entire backup and recovery environment.

So, what’s involved?

Well, let’s start with the price. DPC is $0 for NetWorker and Avamar customers. That’s a pretty good price, right? (If you’re looking for the product page on the support website by the way, it’s here.)

You can deploy it in one of two ways: if you’ve got a SLES server deployed within your environment that meets the requirements, you can download a .bin installer to drop DPC onto that system. The other way – and quite a simple way, really – is to download a VMware OVA file, allowing you to easily deploy it within your virtual infrastructure. (Remember, one of the ongoing themes of DellEMC Data Protection is to allow easy virtual deployment wherever possible.)
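(As an aside: if you’d rather script the OVA deployment than click through the vSphere wizard, VMware’s ovftool can push the appliance out with its configuration properties supplied on the command line. The sketch below is indicative only – the --prop keys are hypothetical, so confirm the actual property names by running ovftool against the downloaded OVA first.)

# Indicative only – confirm the real property names with 'ovftool DPC.ova' first
ovftool --acceptAllEulas --name=dpc01 --datastore=LabDS01 --diskMode=thin \
  --prop:network.ip=192.168.1.50 \
  --prop:network.netmask=255.255.255.0 \
  --prop:network.gateway=192.168.1.1 \
  --prop:network.dns=192.168.1.10 \
  DPC.ova 'vi://administrator@vcenter.lab.local/Lab/host/Cluster01'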

So yesterday I downloaded the OVA file and today I did a deployment. From start to finish, including gathering screenshots of its operation, that deployment, configuration and use took me about an hour or so.

When you deploy the OVA file, you’ll get prompted for configuration details so that there’s no post-deployment configuration you have to muck around with:

Deploying DPC as an OVA – Part 1

At this point in the deployment, I’ve already selected where the virtual machine will deploy, and what the disk format is. (If you are deploying into a production environment with a number of systems to manage, you’ll likely want to follow the recommendations for thick provisioning. I chose thin, since I was deploying it into my lab.)

You fill in standard networking properties – IP address, gateway, DNS, etc. As the screenshot below shows, you can also immediately attach DPC to your AD/LDAP environment for enterprise authentication:

DPC Deployment, LDAP

I get into enough trouble at home for IT complexity as it is, so I don’t run LDAP (any more) – which meant there wasn’t anything else for me to do there.

The deployment is quite quick, and after you’re done, you’re ready to power on the virtual machine.

DPC Deployment, ready to power on

In fact, one of the things you’ll want to be aware of is that the initial power-on and configuration is remarkably quick. (After power-on, the system was ready to let me log on within 5 minutes or so.)

It’s an HTML5 interface – that means there’s no Java Web Start or anything like that; you simply point your web browser at the FQDN or IP address of the DPC server to log in and access the system. (The documentation also includes details for changing the SSL certificate.)

DPC Login Screen

DPC follows Dell’s interface guidelines, so it’s quite a crisp and easy-to-navigate interface. The documentation includes details of the initial login ID and password, and of course, following security best practices, you’re prompted to change that default password on first login:

DPC Changing the Default Password

After you’ve logged in, you get to see the initial, default dashboard for DPC:

DPC First Login

Of course, at this point, it looks a wee bit blank. That makes sense – we haven’t added any systems to the environment yet. But that’s easily fixed, by going to System Management in the left-hand column.

DPC System Management

System management is quite straightforward – the icons directly under “Systems” and “Groups” are for add, edit and delete, respectively. (Delete simply removes a system from DPC, it doesn’t un-deploy the system, of course.)

When you click the add button, you’re prompted for the type of system you want to add to DPC. (Make sure you check the version requirements in the documentation, available on the support page.) Adding systems is a very straightforward operation, as well. For instance, for Data Domain:

DPC Adding a Data Domain

Adding an Avamar server is likewise quite simple:

DPC Adding an Avamar Server

And finally, adding a NetWorker server:

DPC Adding a NetWorker Server

Now, you’ll notice here that DPC prompts you that there’s some additional configuration to do on the NetWorker server: configuring the NetWorker RabbitMQ system to communicate with DPC. For now, that’s a manual process. After following the instructions in the documentation, I also added the following to the /etc/rc.d/rc.local file on my Linux-based NetWorker/NMC server to ensure it happens on every reboot, too:

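# Re-issue the DPC 'monitor' registration against NetWorker's RabbitMQ on each boot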
/bin/cat <<EOF | /opt/nsr/nsrmq/bin/nsrmqctl
monitor andoria.turbamentis.int
quit
EOF
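(If your NetWorker server boots via systemd and doesn’t process rc.local, a one-shot unit along the following lines would do the same job. This is a sketch only – the networker.service unit name is an assumption, and the hostname is simply the same one used in the rc.local snippet above – so adjust to suit your environment.)

# /etc/systemd/system/dpc-nsrmq-monitor.service (sketch only – adjust unit names and hostnames)
[Unit]
Description=Re-register DPC monitoring with NetWorker RabbitMQ
Wants=network-online.target
After=network-online.target networker.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c '{ echo "monitor andoria.turbamentis.int"; echo "quit"; } | /opt/nsr/nsrmq/bin/nsrmqctl'

[Install]
WantedBy=multi-user.target

A quick systemctl enable dpc-nsrmq-monitor.service and it’ll run on every startup without needing rc.local at all.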

It’s not just NetWorker, Avamar and Data Domain you can add – check out the list here:

DPC Systems you can add

Once I added all my systems, I went over to look at the Activities > Audit pane, which showed me:

DPC Activity Audit

Look at those times there – it took me all of 8 minutes to change the password on first login, then add 3 Data Domains, an Avamar server and a NetWorker server to DPC. DPC has clearly been designed for rapid deployment and a short time to readiness. And guess how many times I’d used DPC before? None.

Once systems have been added to DPC and it’s had time to poll the various servers you’re monitoring, you start getting the dashboards populated. For instance, shortly after their addition, my lab DDVE systems were getting capacity reporting:

DPC Capacity Reporting (DD)

You can drill into capacity reporting by clicking on the capacity report dashboard element to get a tabular view covering Data Domain and Avamar systems:

DPC Detailed Capacity Reporting

On that detailed capacity view, you see basic capacity details for Data Domains and, as you can see down the right-hand side, details of each MTree on the Data Domain as well. (My Avamar server is reported there, too.)

Under Health, you’ll see a quick view of all the systems you have configured and DPC’s assessment of their current status:

DPC System Health

In this case, I had two systems reported as unhealthy – one of my DDVEs had an email configuration problem I lazily had not gotten around to fixing, and likewise, my NetWorker server had a licensing error I hadn’t bothered to investigate and fix. Shamed by DPC, I jumped onto both and fixed them, pronto! That meant when I went back to the dashboards, I got an all clear for system health:

DPC Detailed Dashboard

I wanted to correct those zeroes, so I fired off a backup in NetWorker, and DPC updated pretty damn quickly to show something was happening:

DPC Detailed Dashboard, Backup Running

Likewise, when the backup completed and cloning started, the dashboard was updated quite promptly:

DPC Detailed Dashboard, Clone Running

You can also see details of what’s been going on via the Activities > System view:

DPC Activities – Systems

Then, with a couple of backup and clone jobs run, the Detailed Dashboard was updated a little more:

DPC, Detailed Dashboard More Use

Now, I mentioned before that DPC takes on some Multi Systems Manager functionality for Avamar, viz.:

DPC, Avamar Systems Management

So that’s back in the Systems Management view. Clicking the horizontal ‘…’ item next to a system lets you launch the individual system management interface, or in the case of Avamar, also manage policy configuration.

DPC, Avamar Policy View

In that policy view, you can create new policies, initiate jobs, and edit existing configuration details – all without having to go into the traditional Avamar interface:

DPC, Avamar Schedule Configuration

DPC, Avamar Retention Configuration

DPC, Avamar Policy Editing

That’s pretty much all I’ve got to say about DPC at this point in time – other than to highlight the groups function in System Management. By defining groups of resources (however you want to), you can then filter dashboard views not only for individual systems, but for groups too, allowing quick and easy review of very specific hosts:

DPC System Management – Groups

In my configuration there I’ve grouped systems by whether they’re associated with an Avamar backup environment or a NetWorker backup environment, but you can configure groups however you need. Maybe you have services broken up by state or country, or maybe you have them distributed by customer or by the service you’re providing. Regardless of how you’d like to group them, you can filter down to them easily in the DPC dashboards.

So there you go – that’s DPC v1.0.1. It’s honestly taken me more time to get this blog article written than it took me to deploy and configure DPC.

Note: Things I didn’t show in this article:

  • Search and Recovery – That’s where you’d add a DP Search system (I don’t have DP-Search deployed in my lab)
  • Reports – That’s where you’d add a DPA server, which I don’t have deployed in my lab either.

Search and Recovery lets you springboard into the awesome DP-Search web interface, and Reports will drill into DPA and extract the most popular reports people tend to access in DPA, all within DPC.

I’m excited about DPC and the potential it holds over time. And if you’ve got an environment with multiple backup servers and Data Domains, you’ll get value out of it very quickly.

Basics: Planning A Recovery Service

Architecture, Basics, Recovery
Jan 30, 2018
 

Introduction

In Data Protection: Ensuring Data Availability, I talk quite a lot about what you need to understand and plan as part of a data protection environment. I’m often reminded of the old saying from clothing and carpentry – “measure twice, cut once”. The lesson in that statement, of course, is that rushing into something headlong may make your work more problematic. Taking the time to properly plan what you’re doing can, in a lot of instances (and data protection is one such instance), make the entire process easier. This post isn’t meant to be a replacement for the various planning chapters in my book – but I’m sure it’ll have some useful tips regardless.

We don’t back up just as something to do; in fact, we don’t protect data just as something to do, either. We protect data to shield our applications and services (and therefore our businesses) from failures, and to ensure we can recover it if necessary. So with that in mind, what are some essential activities in planning a recovery service?

Hard disk and magnifying glass

First: Do you know what the data is?

Data classification isn’t something done during a data protection cycle. Maybe one day it will be, when AI and machine learning are sufficiently advanced; in the interim, though, it requires input from people – IT, the business, and so on. Of course, there’s nothing physically preventing you from planning and implementing a recovery service without performing data classification; I’d go so far as to suggest that an easy majority of businesses do exactly that. That doesn’t mean it’s an ideal approach though.

Data classification is all about understanding the purpose of the data, who cares about it, how it is used, and so on. It’s a collection of seemingly innocuous yet actually highly important questions. It’s something I cover quite a bit in my book, and for the very good reason that I honestly believe a recovery service can be made simpler, cheaper and more efficient if it’s complemented by a data classification process within the organisation.

Second: Does the data need to exist?

That’s right – does it need to exist? This is another essential but oft-overlooked part of achieving a cheaper, simpler and more efficient recovery service: data lifecycle management. Every 1TB you can eliminate from your primary storage systems, for the average business at least, is going to yield anywhere between 10 and 30TB of savings in protection storage (RAID, replication, snapshots, backup and recovery, long term retention, etc.). While for some businesses that number may be smaller, for the majority of mid-sized and larger businesses, that 10-30TB saving is likely to go much, much higher – particularly as the criticality of the data increases.
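To put some purely illustrative numbers against that multiplier: a single 1TB of primary data might be RAID-protected or mirrored (call it 2TB), replicated to a second array (another 1-2TB), covered by a couple of weeks of snapshots (more again, depending on change rate), backed up daily with weekly and monthly copies retained (even with deduplication, that accumulates over time), cloned to a secondary backup site, and then held in long term retention for years. Add each layer up and the 10-30TB figure stops looking like an exaggeration – and every one of those layers disappears the moment that 1TB is legitimately removed from primary storage.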

Without a data lifecycle policy, bad things happen over time:

  • Keeping data becomes habitual rather than based on actual need
  • As ‘owners’ of data disappear (e.g., change roles, leave the company, etc.), reluctance to delete, prune or manage the data tends to increase
  • Apathy or intransigence towards developing a data lifecycle programme increases.

Businesses that avoid data classification and data lifecycle condemn themselves to the torment of Sisyphus – constantly trying to roll a boulder up a hill only to have it fall back down again before they get to the top. This manifests in many ways, of course, but in designing, acquiring and managing a data recovery service it usually hits the hardest.

Third: Does the data need to be protected?

I remain a firm believer that it’s always better to backup too much data than not enough. But that’s a default, catchall position rather than one which should be the blanket rule within the business. Part of data classification and data lifecycle will help you determine whether you need to enact specific (or any) data protection models for a dataset. It may be test database instances that can be recovered at any point from production systems; it might be randomly generated data that has no meaning outside of a very specific use case, or it might be transient data merely flowing from one location to another that does not need to be captured and stored.

Remember the lesson from data lifecycle – every 1TB eliminated from primary storage can eliminate 10-30TB of data from protection storage. The next logical step after that is to be able to accurately answer the question, “do we even need to protect this?”

Fourth: What recovery models are required?

At this point, we’ve not talked about technology. This question gets us a little closer to working out what sort of technology we need, because once we have a fair understanding of the data we need to offer recovery services for, we can start thinking about what types of recovery models will be required.

This will essentially involve determining how recoveries are done for the data, such as:

  • Full or image level recoveries?
  • Granular recoveries?
  • Point in time recoveries?

Some data may not need every type of recovery model deployed for it. For some data, granular recoverability is just as important as complete recoverability; for other types of data, it could be that the only way to recover it is image/full – where granular recoveries would simply leave the data corrupted or useless. Does all data require point in time recovery? Much will, but some may not.

You should also consider, of course, how much users will be involved in recoveries. Self-service for admins? Self-service for end-users? All operator run? Chances are it’ll be a mix, depending on those previous recovery model questions (e.g., you might allow self-service individual email recovery, but a full Exchange recovery is not going to be an end-user initiated task).

Fifth: What SLOs/SLAs are required?

Regardless of whether your business has Service Level Objectives (SLOs) or Service Level Agreements (SLAs), there’s the potential you’ll have to meet a variety of them, depending on the nature of the failure, the criticality and age of the data, and so on. (For the rest of this section, I’ll use ‘SLA’ as a generic term for both SLA and SLO.) In fact, there’ll be up to three different categories of SLAs you have to meet:

  • Online: These types of SLAs are for immediate or near-immediate recoverability from failure; they’re meant to keep the data online rather than having to seek to retrieve it from a copy. This will cover options such as continuous replication (e.g., fully mirrored storage arrays), continuous data protection (CDP), as well as more regular replication and snapshot options.
  • Nearline: This is where backup and recovery, archive, and long term retention (e.g., compliance retention of backups/archives) comes into play. Systems in this area are designed to retrieve the data from a copy (or in the case of archive, a tiered, alternate platform) when required, as opposed to ensuring the original copy remains continuously, or near to continuously available.
  • Disaster: These are your “the chips are down” SLAs, which’ll fall into business continuity and/or isolated recovery. Particularly in the event of business continuity, they may overlap with either online or nearline SLAs – but they can also diverge quite a lot. (For instance, in a business continuity situation, data and systems for ‘tier 3’ and ‘tier 4’ services, which may otherwise require a particular level of online or nearline recoverability during normal operations, might be disregarded entirely until full service levels are restored.)

Not all data may require all three of the above, and even if data does, unless you’re in a hyperconverged or converged environment, it’s quite possible that, as a backup administrator, you only need to consider some of the above, with other aspects being undertaken by storage teams, etc.

Now you can plan the recovery service (and conclusion)

And because you’ve gathered the answers to the above, planning and implementing the recovery service is now the easy bit! Trust me on this – working out what a recovery service should look like for the business when you’ve gathered the above information is a fraction of the effort compared to when you haven’t. Again: “Measure twice, cut once.”

If you want more in-depth information on above, check out chapters in my book such as “Contextualizing Data Protection”, “Data Life Cycle”, “Business Continuity”, and “Data Discovery” – not to mention the specific chapters on protection methods such as backup and recovery, replication, snapshots, continuous data protection, etc.

The Perils of Tape

Architecture, Best Practice, Recovery
Jan 23, 2018
 

In the 90s, if you wanted to define a “disaster recovery” policy for your datacentre, part of that policy revolved around tape. Back then, tape wasn’t just for operational recovery, but also for disaster recovery – and even isolated recovery. In the 90s of course, we weren’t talking about “isolated recovery” in the same way we do now, but depending on the size of the organisation, or its level of exposure to threats, you might occasionally find a company that would make three copies of data – the on-site tape, the off-site tape, and then the “oh heck, if this doesn’t work we’re all ruined” tape.

These days, whatever you might want to use tape for within a datacentre (and yes, I’ll admit that sometimes it may still be necessary, though I’d argue far less necessary than the average tape deployment), something you most definitely don’t want to use it for is disaster recovery.

The reasoning is simple – time. To understand what I mean, let’s look at an example.

Let’s say your business has a used data footprint for all production systems of just 300 TB. I’m not talking total storage capacity here, just literally the occupied space across all DAS, SAN and NAS storage just for production systems. For the purposes of our discussion, I’m also going to say that based on dependencies and business requirements, disaster recovery means getting back all production data. Ergo, getting back 300 TB.

Now, let’s also say that your business is using LTO-7 media. (I know LTO-8 is out, but not everyone using tape updates as soon as a new format becomes available.) The key specifications on LTO-7 are as follows:

  • Uncompressed capacity: 6 TB
  • Max speed without compression: 300 MB/s
  • Typical rated compression: 2.5:1

Now, I’m going to stop there. When it comes to tape, I’ve found the quoted compression ratios to be wildly at odds with what’s actually achieved. Back when everyone used to quote a 2:1 compression ratio, I found that tapes yielding 2:1 or higher were the exception, and the norm was more likely 1.4:1 or 1.5:1. So, I’m not going to assume you’ll get compression in the order of 2.5:1 just because the tape manufacturers say you will. (They also say tapes will reliably be OK on shelves for decades…)

A single LTO-7 tape, if we’re reading or writing at full speed, will take around 5.5 hours to go end-to-end. That’s very generously assuming no shoe-shining and that the read or write can be accomplished as a single process – either via multiplexing, or a single stream. Let’s say you get 2:1 compression on the tape. That’s 12 TB, but it still takes 5.5 hours to complete (assuming a linear increase in tape speed – and at 600 MB/s, that’s getting iffy, but again, bear with me).

Drip, drip, drip.

Now, tape drives are always expensive – so you’re not going to see enough tape drives to load one tape per drive, fill the tape, and bang, your backup is done for your entire environment. Tape libraries exist to allow the business to automatically switch tapes as required, avoiding that 1:1 relationship between backup size and the number of drives used.

So let’s say you’ve got 3 tape drives. That means in 5.5 hours, using 2:1 compression and assuming you can absolutely stream at top speed for that compression ratio, you can recover 36 TB. (If you’ve used tape for as long as me, you’ll probably be chuckling at this point.)

So, that’s 36 TB in the first 5.5 hours. At this point, things are already getting a little absurd, so I’ll not worry about the time it takes to unload and load new media. The next 36 TB is done in another 5.5 hours. If we continue down that path, our first 288 TB will be recovered after 44 hours. We might then assume our final 12TB fitted snugly on a single remaining tape – which will take another 5.5 hours to read, getting us to a disaster recovery time of 49.5 hours.

Drip, drip, drip.

Let me be perfectly frank here: those numbers are ludicrous. The only way you have a hope of achieving that sort of read speed in a disaster recovery for that many tapes is if you are literally recovering a 300 TB filesystem filled almost exclusively with 2GB files that each can be compressed down by 50%. If your business production data doesn’t look like that, read on.

Realistically, in a disaster recovery situation, I’d suggest you’re going to look at at least 1.5x the end-to-end read/write time. For all intents and purposes, that assumes the tapes are written with, say, 2-way multiplexing and we can conduct 2 reads across each tape, where each read is able to use filemark seek forwards (FSFs) to jump over chunks of the tape.

5.5 hours per tape becomes 8.25 hours. (Even that, I’d say, is highly optimistic.) At that point, the 49.5 hour recovery is a slightly more believable 74.25 hour recovery. Slightly. That’s over 3 days. (Don’t forget, we still haven’t factored in tape load, tape rewind, and tape unload times into the overall scope.)
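If you want to play with those assumptions yourself, a back-of-the-envelope calculator along the following lines (a sketch only, using the same figures as the discussion above) reproduces those numbers:

# Rough tape recovery time estimator – figures as per the discussion above
awk 'BEGIN {
  footprint_tb = 300    # used production data to be recovered
  tape_tb      = 6      # LTO-7 native capacity
  comp         = 2      # assumed (generous) compression ratio
  speed_mbs    = 300    # native streaming speed; assumed to scale with compression
  drives       = 3      # number of available tape drives
  overhead     = 1.5    # multiplexing / FSF / non-streaming penalty

  hours_per_tape = (tape_tb * 1000000 / speed_mbs) / 3600
  tb_per_pass    = drives * tape_tb * comp
  passes         = int((footprint_tb + tb_per_pass - 1) / tb_per_pass)

  printf "Best case           : %.1f hours\n", passes * hours_per_tape
  printf "With %.1fx overhead  : %.1f hours\n", overhead, passes * hours_per_tape * overhead
}'

(It reports 50 and 75 hours rather than 49.5 and 74.25 only because 6 TB at 300 MB/s is closer to 5.56 hours per tape than the flat 5.5 I used above – the point stands either way.)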

Drip, drip, drip.

What advantage do we get if we read from disk instead of tape? Well, if we’re going to go down that path, we’re going to get instant access (no load, unload or seek times), and concurrency (highly parallel reads – recovering fifty systems at a time, for instance).

But here’s the real kicker. Backup isn’t disaster recovery. And particularly, tape backup is disaster recovery in the same way that a hermetically sealed, 100% barricaded house is safety from a zombie apocalypse. (Eventually, you’re going to run out of time, i.e., food and water.)

This is, quite frankly, where the continuum comes into play:

More importantly, this is where picking the right data protection methodology for the SLAs (the RPOs and RTOs) is essential, viz.:

Storage Admin vs Backup Admin

Backup and recovery systems might be a fallback position when all other types of disaster recovery have been exhausted, but at the point you’ve decided to conduct a disaster recovery of your entire business from physical tape, you’re in a very precarious position. It might have been a valid option in the 90s, but in a modern data protection environment, there’s a plethora of higher level systems that should be built into a disaster recovery strategy. I’ve literally seen examples of companies that, faced with having to do a complete datacentre recovery from tape, took a month or more to get back to operational. (If ever.)

Like the old saying goes – if all you have is a hammer, everything looks like a nail. “Backup only” companies have a tendency to encourage you to think of backup as a disaster recovery function, or even to think of disaster recovery/business continuity as someone else’s problem. Talking disaster recovery is easiest when you’re able to talk to someone who works across the entire continuum.

Otherwise you may very well find your business perfectly safely barricaded into a hermetically sealed location away from the zombie horde, watching your food and water supplies continue to decrease hour by hour, day by day. Drip, drip, drip.

Planning holistic data protection strategies is more than just backup – and it’s certainly more than just tape. That’s why Data Protection: Ensuring Data Availability looks at the entire data storage protection spectrum. And that’s why I work for DellEMC, too. It’s not just backup. It’s about the entire continuum.

 

The rise of the hyperconverged administrator

Architecture, Best Practice
Dec 14, 2017
 

Hyperconverged is a hot topic in enterprise IT these days, and deservedly so. The normal approach to enterprise IT infrastructure, after all, is a bit like assembling a jigsaw puzzle over a few years. Over successive years it might be time to refresh storage, then compute, then backup, in any combination or order. So you’re piecing together the puzzle over an extended period of time, and wanting to make sure that you end up in a situation where all the pieces – sometimes purchased up to five years apart – will work well together. That’s before you start to deal with operating system and application compatibility considerations, too.

bigStock Roads

Unless your business is actually in infrastructure service delivery, none of this is of value to the business. The value starts when people can start delivering new business services on top of infrastructure – not building the infrastructure itself. I’m going to suggest (perhaps controversially to some IT people) that if your IT floor or office still has signs scattered around the place along the lines of “Unix Admin Team”, “Windows Admin Team”, “Storage Admin Team”, “Network Admin Team”, “Security Team”, “VMware Admin Team”, “Backup Admin Team”, etc., then your IT is premised on an architectural jigsaw approach that will continue to lose relevance to businesses that are seeking true, cloud-like agility within their environment. Building a new system, or delivering a new service, in that silo-based environment is a multi-step process where someone creates a service ticket, which goes for approval; once approved, it goes to the network team to allocate an IP address, before going to the storage team to confirm sufficient storage, then moving to the VMware team to allocate a virtual machine, before that’s passed over to the Unix or Windows team to prepare the system, then the ticket gets migrated to the backup team and … five weeks have passed. If you’re lucky.

Ultimately, a perception of “it’s cheaper” is only part of the reason why businesses look enviously at cloud – it’s also the promise of agility, which is what the business needs more than anything else from IT.

If a business is either investigating, planning or implementing hyperconverged, it’s doing it for efficiency and to achieve cloud-like agility. At the point that hyperconverged lands, those signs in the office have to come down, and the teams have to work together. Administration stops being a single field of expertise and becomes an entire-stack vertical operation.

Let’s look at that (just) from the perspective of data protection. If we think of a typical data protection continuum (excluding data lifecycle management’s archive part), it might look as simple as the following:

Now I’ve highlighted those functions in different colours, for the simple fact that while they’re all data protection, in a classic IT environment, they’re rather differentiated from one another, viz.:

Primary vs Protection Storage

The first three – continuous availability, replication and snapshots – all, for the most part in traditional infrastructure IT shops, fit into the realm of functions of primary storage. On the other hand, operational backup and recovery, as well as long term retention, both fit into models around protection storage (regardless of what your protection storage is). So, as a result, you’ll typically end up with the following administrative separation:

Storage Admin vs Backup Admin

And that’s where the traditional IT model has sat for a long time – storage admins, and backup admins. Both working on different aspects of the same root discipline – data protection. That’s where you might find, in those traditionally silo’d IT shops, a team of storage administrators and a team of backup administrators. Colleagues to be sure, though not part of the same team.

But a funny thing happens when hyperconverged enters the picture. Hyperconverged merges storage and compute, and to a degree, networking, all into the same bucket of resources. This creates a different operational model which the old silo’d approach just can’t handle:

Hyperconverged Administration

Once hyperconverged is in the picture, the traditional delineation between storage administration and backup administration disappears. Storage becomes just one part of the integrated stack. Protection storage might share the hyperconverged environment (depending on the redundancy therein, or the operational budget), or it might be separate. Separate or not though, data protection by necessity is going to become a lot more closely intertwined.

In fact, in a hyperconverged environment, it’s likely to be way more than just storage and backup administration that becomes intertwined – it really does go all the way up the stack. So much so that I can honestly say every environment where I’ve seen hyperconverged deployed and fail is where the traditional silos have remained in place within IT, with processes and intra-group communications effectively handled by service/trouble tickets.

So here’s my tip as we reach the end of the 2017 tunnel and now see 2018 barrelling towards us: if you’re a backup administrator and hyperconverged is coming into your environment, this is your perfect opportunity to make the jump into being a full spectrum data protection administrator (and expert). (This is something I cover in a bit more detail in my book.) Within hyperconverged, you’ll need to understand the entire data protection stack, and this is a great opportunity to grow. It’s also a fantastic opportunity to jump out of pure operational administration and start looking at the bigger picture of service availability, too, since the automation and service catalogue approach to hyperconverged is going to let you free yourself substantially from the more mundane aspects of backup administration.

“Hyperconverged administrator” – it’s something you’re going to hear a lot more of in 2018 and beyond.

 

Talking about Ransomware

Architecture, Backup theory, General thoughts, Recovery, Security
Sep 06, 2017
 

The “WannaCry” ransomware strike saw a particularly large number of systems infected and garnered a great deal of media attention.

Ransomware Image

As you’d expect, many companies discussed ransomware and their solutions for it. There was also backlash from many quarters suggesting people were using a ransomware attack to unethically spruik their solutions. It almost seems to be the IT equivalent of calling lawyers “ambulance chasers”.

We are (albeit briefly, I am sure), between major ransomware outbreaks. So, logically that’ll mean it’s OK to talk about ransomware.

Now, there’s a few things to note about ransomware and defending against it. It’s not as simplistic as “I only have to do X and I’ll solve the problem”. It’s a multi-layered issue requiring user education, appropriate systems patching, appropriate security, appropriate data protection, and so on.

Focusing even on data protection, that’s a multi-layered approach as well. In order to have a data protection environment that can assuredly protect you from ransomware, you need to do the basics, such as operating system level protection for backup servers, storage nodes, etc. That’s just the beginning. The next step is making sure your backup environment itself follows appropriate security protocols. That’s something I’ve been banging on about for several years now. That’s not the full picture though. Once you’ve got operating systems and backup systems secured via best practices, you then need to look at hardening your backup environment. There’s a difference between standard security processes and hardened security processes, and if you’re worried about ransomware this is something you should be thinking about doing. Then, of course, if you really want to ensure you can recover your most critical data from a serious hacktivism and ransomware (or outright data destruction) breach, you need to look at IRS as well.

But let’s step back, because I think it’s important to make a point here about when we can talk about ransomware.

I’ve worked in data protection my entire professional career. (Even when I was a system administrator for the first four years of it, I was the primary backup administrator as well. It’s always been a focus.)

If there’s one thing I’ve observed in my career in data protection, it’s that having a “head in the sand” approach to data loss risk is a lamentably common thing. Even in 2017 I’m still hearing things like “We can’t back this environment up because the project which spun it up didn’t budget for backup”, and “We’ll worry about backup later”. Not to mention the old chestnut, “it’s out of warranty so we’ll do an Icarus support contract”.

Now the flipside of the above paragraph is this: if things go wrong in any of those situations, suddenly there’s a very real interest in talking about options to prevent a future issue.

It may be a career limiting move to say this, but I’m not in sales to make sales. I’m in sales to positively change things for my customers. I want to help customers resolve problems, and deliver better outcomes to their users. I’ve been doing data protection for over 20 years. The only reason someone stays in data protection that long is because they’re passionate about it, and the reason we’re passionate about it is because we are fundamentally averse to data loss.

So why do we want to talk about defending against or recovering from ransomware during a ransomware outbreak? It’s simple. At the point of a ransomware outbreak, there’s a few things we can be sure of:

  • Business attention is focused on ransomware
  • People are talking about ransomware
  • People are being directly impacted by ransomware

This isn’t ambulance chasing. This is about making the best of a bad situation – I don’t want businesses to lose data, or have it encrypted and see them have to pay a ransom to get it back – but if they are in that situation, I want them to know there are techniques and options to prevent it from striking them again. And at that point in time – during a ransomware attack – people are interested in understanding how to stop it from happening again.

Now, we have to still be considerate in how we discuss such situations. That’s a given. But it doesn’t mean the discussion can’t be had.

To me this is also an ethical consideration. Too often the focus on ethics in professional IT is around the basics: don’t break the law (note: law ≠ ethics), don’t be sexist, don’t be discriminatory, etc. That’s not really a focus on ethics, but a focus on professional conduct. Focusing on professional conduct is good, but there must also be a focus on the ethical obligations of protecting data. It’s my belief that if we fail to make the best of a bad situation to get an important message of data protection across, we’re failing our ethical obligations as data protection professionals.

Of course, in an ideal world, we’d never need to discuss how to mitigate or recover from a ransomware outbreak during said outbreak, because everyone would already be protected. But harking back to an earlier point, I’m still being told production systems were installed without consideration for data protection, so I think we’re a long way from that point.

So I’ll keep talking about protecting data from all sorts of loss situations, including ransomware, and I’ll keep having those discussions before, during and after ransomware outbreaks. That’s my job, and that’s my passion: data protection. It’s not gloating, it’s not ambulance chasing, it’s let’s make sure this doesn’t happen again.


On another note, sales are really great for my book, Data Protection: Ensuring Data Availability, released earlier this year. I have to admit, I may have squealed a little when I got my first royalty statement. So, if you’ve already purchased my book: you have my sincere thanks. If you’ve not, that means you’re missing out on an epic story of protecting data in the face of amazing odds. So check it out, it’s in eBook or Paperback format on Amazon (prior link), or if you’d prefer to, you can buy direct from the publisher. And thanks again for being such an awesome reader.

Architecture Matters: Protection in the Cloud (Part 2)

Architecture
Jun 05, 2017
 

(Part 1).

Particularly when we think of IaaS style workloads in the Cloud, there are two key approaches that can be used for data protection.

The first is snapshots. Snapshots fulfil part of a data protection strategy, but we do always need to remember with snapshots that:

  • They’re an inefficient storage and retrieval model for long-term retention
  • Cloud or not, they’re still essentially on-platform

As we know, and something I cover in my book quite a bit – a real data protection strategy will be multi-layered. Snapshots undoubtedly can provide options around meeting fast RTOs and minimal RPOs, but traditional backup systems will deliver a sufficient recovery granularity for protection copies stretching back weeks, months or years.

Stepping back from data protection itself – public cloud is a very different operating model to traditional  in-datacentre infrastructure spending. The classic in-datacentre infrastructure procurement process is an up-front investment designed around 3- or 5-year depreciation schedules. For some businesses that may mean a literal up-front purchase to cover the entire time-frame (particularly so when infrastructure budget is only released for the initial deployment project), and for others with more fluid budget options, there’ll be an investment into infrastructure that can be expanded over the 3- or 5-year solution lifetime to meet systems growth.

Cloud – public Cloud – isn’t costed or sold that way. It’s a much smaller billing window and costing model: use a GB of RAM, pay for a GB of RAM. Use a GHz of CPU, pay for a GHz of CPU. Use a GB of storage, pay for a GB of storage. Public cloud costing models often remind me of Master of the House from Les Miserables, particularly this verse:

Charge ’em for the lice, extra for the mice
Two percent for looking in the mirror twice
Here a little slice, there a little cut
Three percent for sleeping with the window shut
When it comes to fixing prices
There are a lot of tricks I knows
How it all increases, all them bits and pieces
Jesus! It’s amazing how it grows!

Master of the House, Les Miserables.

That’s the Cloud operating model in a nutshell. Minimal (or no) up-front investment, but you pay for every scintilla of resource you use – every day or month.

If you, say, deploy a $30,000 server into your datacentre, you then get to use that as much or as little as you want, without any further costs beyond power and cooling*. With Cloud, you won’t be paying that $30,000 initial fee, but you will pay for every MHz, KB of RAM and byte of storage consumed within every billing period.

If you want Cloud to be cost-effective, you have to be able to optimise – you have to effectively game the system, so to speak. Your in-Cloud services have to be maximally streamlined. We’ve become inured to resource wastage in the datacentre because resources have been cheap for a long time. RAM size/speed grows, CPU speed grows, as does the number of cores, and storage – well, storage seems to have an infinite expansion capability. Who cares if what you’re doing generates 5 TB of logs per day? Information is money, after all.

To me, this is just the next step in the somewhat lost art of programmatic optimisation. I grew up in the days of 8-bit computing**, and we knew back then that CPU, RAM and storage weren’t infinite. This didn’t end with 8-bit computing, though. When I started in IT as a Unix system administrator, swap file sizing, layout and performance was something that formed a critical aspect of your overall configuration, because if – Jupiter forbid – your system started swapping, you needed a fighting chance that the swapping wasn’t going to kill your performance. Swap file optimisation was, to use a Bianca Del Rio line, all about the goal: “Not today, Satan.”

That’s Cloud, now. But we’re not so much talking about swap files as we are resource consumption. Optimisation is critical. A failure to optimise means you’ll pay more. The only time you want to pay more is when what you’re paying for delivers a tangible, cost-recoverable benefit to the business. (I.e., it’s something you get to charge someone else for, either immediately, or later.)

Cloud Cost

If we think about backup, it’s about getting data from location A to location B. In order to optimise it, you want to do two distinct things:

  • Minimise the number of ‘hops’ that data has to make in order to get from A to B
  • Minimise the amount of data that you need to send from A to B.

If you don’t optimise that, you end up in a ‘classic’ backup architecture that we used to rely so much on in the 90s and early 00s, such as:

Cloud Architecture Matters 1

(In this case I’m looking just at backup services that land data into object storage. There are situations where you might want higher performance than what object offers, but let’s stick just with object storage for the time being.)

I don’t think this diagram is actually good at giving the full picture. There’s another way I like to draw the diagram, and it looks like this:

Cloud Architecture Matters 2

In the Cloud, you’re going to pay for the systems you’re running for business purposes no matter what. That’s a cost you have to accept, and the goal is to ensure that whatever services or products you’re on-selling to your customers using those services will pay for the running costs in the Cloud***.

You want to ensure you can protect data in the Cloud, but sticking to architectures designed at the time of on-premises infrastructure – and physical infrastructure at that – is significantly sub-optimal.

Think of how traditional media servers (or in NetWorker parlance, storage nodes) needed to work. A media server is designed to be a high performance system that funnels data coming from client to protection storage. If a backup architecture still heavily relies on media servers, then the cost in the Cloud is going to be higher than you need it – or want it – to be. That gets worse if a media server needs to be some sort of highly specced system encapsulating non-optimised deduplication. For instance, one of NetWorker’s competitors provides details on their website of hardware requirements for deduplication media servers, so I’ve taken these specifications directly from their website. To work with just 200 TB of storage allocated for deduplication, a media server for that product needs:

  • 16 CPU Cores
  • 128 GB of RAM
  • 400 GB SSD for OS and applications
  • 2 TB of SSD for deduplication databases
  • 2 TB of 800 IOPs+ disk (SSD recommended in some instances) for index cache

For every 200 TB. Think on that for a moment. If you’re deploying systems in the Cloud that generate a lot of data, you could very easily find yourself having to deploy multiple systems such as the above to protect those workloads, in addition to the backup server itself and the protection storage that underpins the deduplication system.

Or, on the other hand, you could work with an efficient architecture designed to minimise the number of data hops, and minimise the amount of data transferred:

CloudBoost Workflow

That’s NetWorker with CloudBoost. Unlike that competitor, a single CloudBoost appliance doesn’t just allow you to address 200TB of deduplication storage, but 6 PB of logical object storage. 6 PB, not 200 TB. All that using 4 – 8 CPUs and 16 – 32GB of RAM, and with a metadata sizing ratio of 1:2000 (i.e., every 100 GB of metadata storage allows you to address 200 TB of logical capacity). Yes, there’ll be SSD optimally for the metadata, but noticeably less than the competitor’s media server – and with a significantly greater addressable range.
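To make that concrete using only the figures quoted above: addressing 6 PB of logical capacity with the competitor’s architecture would mean around 30 of those media servers (6 PB ÷ 200 TB) – roughly 480 CPU cores, 3.8 TB of RAM, around 72 TB of SSD and another 60 TB of high-IOPS index cache storage across the fleet – whereas the same 6 PB sits behind a single CloudBoost appliance with 4-8 CPUs, 16-32 GB of RAM and about 3 TB of metadata storage (6 PB ÷ 2,000). That’s rough, back-of-the-envelope arithmetic, but the scale of the difference is the point.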

NetWorker and CloudBoost can do that because the deduplication workflow has been optimised. In much the same way that NetWorker and Data Domain work together, within a CloudBoost environment, NetWorker clients will participate in the segmentation, deduplication, compression (and encryption!) of the data. That’s the first architectural advantage: rather than needing a big server to handle all the deduplication of the protection environment, a little bit of load is leveraged in each client being protected. The second architectural advantage is that the CloudBoost appliance does not pass the data through. Clients send their deduplicated, compressed and encrypted data directly to the object storage, minimising the data hops involved****.

To be sure, there are still going to be costs associated with running a NetWorker+CloudBoost configuration in public cloud – but that will be true of any data protection service. That’s the nature of public cloud – you use it, you pay for it. What you do get with NetWorker+CloudBoost though is one of the most streamlined and optimised public cloud backup options available. In an infrastructure model where you pay for every resource consumed, it’s imperative that the backup architecture be as resource-optimised as possible.

IaaS workloads will only continue to grow in public cloud. If your business uses NetWorker, you can take comfort in being able to still protect those workloads while they’re in public cloud, and doing it efficiently, optimised for maximum storage potential with minimised resource cost. Remember always: architecture matters, no matter where your infrastructure is.


Hey, if you found this useful, don’t forget to check out Data Protection: Ensuring Data Availability.


 


* Yes, I am aware there’ll be other costs beyond power and cooling when calculating a true system management price, but I’m not going to go into those for the purposes of this blog.

** Some readers of my blog may very well recall earlier computing models. But I started with a Vic-20, then the Commodore-64, and both taught me valuable lessons about what you can – and can’t – fit in memory.

*** Many a company has been burnt by failing to cost that simple factor, but in the style of Michael Ende, that is another story, for another time.

**** Linux 64-bit clients do this now. Windows 64-bit clients are supported in NetWorker 9.2, coming soon. (In the interim Windows clients work via a storage node.)

May 23, 2017
 

I’m going to keep this one short and sweet. In Cloud Boost vs Cloud Tier I go through a few examples of where and when you might consider using Cloud Boost instead of Cloud Tier.

One interesting thing I’m noticing of late is a variety of people talking about “VTL in the Cloud”.

BigStock Exhausted

I want to be perfectly blunt here: if your vendor is talking to you about “VTL in the Cloud”, they’re talking to you about transferring your workloads rather than transforming your workloads. When moving to the Cloud, about the worst thing you can do is lift and shift. Even in Infrastructure as a Service (IaaS), you need to closely consider what you’re doing to ensure you minimise the cost of running services in the Cloud.

Is your vendor talking to you about how they can run VTL in the Cloud? That’s old hat. It means they’ve lost the capacity to innovate – or at least, lost interest in it. They’re not talking to you about a modern approach, but just repeating old ways in new locations.

Is that really the best that can be done?

In a coming blog article I’ll talk about the criticality of ensuring your architecture is streamlined for running in the Cloud; in the meantime I just want to make a simple point: talking about VTL in the Cloud isn’t a “modern” discussion – in fact, it’s quite the opposite.

Dell EMC Integrated Data Protection Appliance

Architecture
May 10, 2017
 

Dell EMC World is currently on in Las Vegas, and one of the most exciting announcements to come out of the show (in my opinion) is the Integrated Data Protection Appliance (IDPA).

Hyperconverged is eating into the infrastructure landscape – it’s a significantly growing market for Dell EMC, as evidenced by the VxRail and VxRack product lines. These allow you to deploy fast, efficiently and with a modernised consumption approach thanks to Enterprise Hybrid Cloud.

The next step in that hyperconverged path is hyperconverged data protection, which is where the IDPA comes in.

Dell EMC IDPA

Hyperconverged data protection works on the same rationale as hyperconverged primary production infrastructure: you can go out to market and buy a backup product, data protection storage, systems infrastructure to run it on, etc., then when it all arrives, assemble, test and configure it, or, you could buy a single appliance with the right starting and growth capacity for you, get it delivered on-site pre-built and tested, and a few hours later be up and running your first backup.

The IDPA is an important step in the evolution of data protection, recognising the changing landscape in the IT infrastructure environment, notably:

  • Businesses want to see results realised from their investment as soon as it arrives
  • Businesses don’t want IT staff spending time doing ‘one-off’ installation activities.
  • The silo, ‘communicate via service tickets’ approach to IT is losing ground as the infrastructure administrator becomes a real role within organisations. It’s not just infrastructure becoming hyperconverged – it’s people, too.
  • The value of automation is finally being understood, since it frees IT resources to concentrate on projects and issue resolution, rather than BAU button pressing.
  • Mobile workforces and remote office environments increasingly means you may not have an IT person present on-site to physically make a change, etc.
  • Backup administrators need to become data protection administrators, and data protection architects.

And finally, there’s one more aspect to the IDPA that cannot be overstated in the realm of hyper-virtualised environments: the IDPA provides natural physical separation of your protection data from your operational infrastructure. Consider a traditional protection environment:

Traditional Environment

In a traditional protection environment, you’ll typically have separated protection storage (e.g., Data Domain), but it’s very typical these days, particularly in hyper-virtualised environments, to see the backup services themselves running within the same environment they’re protecting. That means if there is a significant primary systems infrastructure issue, your recovery time may take longer because you’ll have to first get the backup services up and running again.

IDPA provides complete separation though:

IDPA Environment

The backup services and configuration no longer run on your primary systems infrastructure, instead running in a separate appliance. This gives you higher levels of redundancy and protection for your protection environment, decreasing risk within your business.

Top picks for where you should consider an IDPA:

  • When deploying large-scale hyperconverged environments (e.g., VxRack)
  • For remote offices
  • For greenfields computer-rooms
  • For dealing with large new workloads
  • For modernising your approach to data protection
  • Whenever you want a single, turnkey approach to data protection with a single vendor supporting the entire stack

The IDPA can scale with your business; there are models starting as low as 34TB usable (pre-dedupe) and scaling all the way to 1PB usable (and that’s before you consider cloud-tiering).

If you’re wanting to read more about IDPA, check out the official Dell EMC blog post for the release here.

May 05, 2017
 

There was a time, comparatively not that long ago, when the biggest governing factor in LAN capacity for a datacentre was not the primary production workloads, but the mechanics of getting a full backup from each host over to the backup media. If you’ve been around the data protection industry long enough you’ll have had experience of that – for instance, more often than not the drive from Fast Ethernet to 1Gbit networks in datacentres I was involved in started thanks to backup. Likewise, the first systems I saw being attached directly to 10Gbit backbones in datacentres were the backup infrastructure.

Well architected deduplication can eliminate that consideration. That’s not to say you won’t eventually need 10Gbit, 40Gbit or even more in your datacentre, but if deduplication is architected correctly, you won’t need to deploy that next level up of network performance to meet your backup requirements.

In this blog article I want to take you through an example of why deduplication architecture matters, and I’ll focus on something that amazingly still gets consideration from time to time: post-ingest deduplication.

Before I get started – obviously, Data Domain doesn’t use post-ingest deduplication. Its pre-ingest deduplication ensures the only data written to the appliance is already deduplicated, and it further increases efficiency by pushing deduplication segmentation and processing out to the individual clients (in a NetWorker/Avamar environment) to limit the amount of data flowing across the network.

A post-ingest deduplication architecture, though, has your protection appliance featuring two distinct tiers of storage – the landing or staging tier, and the deduplication tier. That means when it’s time to do a backup, all your clients send all their data across the network to sit, at its original size, on the staging tier:

Post Process Dedupe 01

In the example above we’ve already had backups run to the post-ingest deduplication appliance; so there’s a heap of deduplicated data sitting in the deduplication tier, but our staging tier has just landed all the backups from each of the clients in the environment. (If it were NetWorker writing to the appliance, each of those backups would be the full sized savesets.)

Now, at some point after the backup completes (usually a preconfigured time), post-processing kicks in. This is effectively a data-migration window in a post-ingest appliance, where all the data in the staging tier has to be read and processed for deduplication. Continuing the example above, we might start by inspecting ‘Backup01’ for commonality with data on the deduplication tier:

Post Process Dedupe 02

So the post-ingest processing engine starts by reading through all the content of Backup01, constructing fingerprints of the data that has landed.

Post Process Dedupe 03

As fingerprints are assembled, data can be compared against the data already residing in the deduplication tier. This will result in either signature matches, or signature misses – the latter indicating new data that needs to be copied into the deduplication tier.

Post Process Dedupe 04

In this respect it’s similar to any other deduplication: signature matches result in pointers to existing data being updated and extended, while a signature miss means new data has to be stored on the deduplication tier.

Post Process Dedupe 05

Once the first backup file written to the staging tier has been dealt with, we can delete that file from the staging area and move on to the second backup file to start the process all over again. And we keep doing that over and over and over until we’re left with an empty staging tier:

Post Process Dedupe 06

Of course, that’s not the end of the process – then the deduplication tier will have to run its regular housekeeping operations to remove data that’s no longer referenced by anything.
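To make the mechanics concrete, here’s a minimal Python sketch of that post-process cycle. It’s purely illustrative – the fixed segment size, SHA-256 fingerprints and in-memory index are my assumptions for the sake of the example, not how any particular appliance is actually implemented:

```python
import hashlib
import os

SEGMENT_SIZE = 64 * 1024   # assumed fixed 64KB segments, purely for illustration

dedupe_index = {}           # the "deduplication tier": fingerprint -> stored segment
recipes = {}                # fingerprint lists that reconstruct each backup

def post_process(staging_dir):
    """Walk every backup file on the staging tier, fingerprint its segments,
    copy only unseen segments to the dedupe tier, then delete the staging copy."""
    for name in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, name)
        recipe = []
        with open(path, 'rb') as backup:
            while True:
                segment = backup.read(SEGMENT_SIZE)
                if not segment:
                    break
                fingerprint = hashlib.sha256(segment).hexdigest()
                if fingerprint not in dedupe_index:      # signature miss
                    dedupe_index[fingerprint] = segment  # new data copied across
                recipe.append(fingerprint)               # match or miss, reference it
        recipes[name] = recipe
        os.remove(path)   # the backup now lives entirely in the dedupe tier
```

Notice that every byte has to be written to the staging tier at full size, then read back and hashed, before it’s either discarded or written a second time. That double handling is exactly what drives the spindle-bound and capacity problems in the list further down.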

Architecturally, post-ingest deduplication is a kazoo to pre-ingest deduplication’s symphony orchestra. Sure, you might technically get to hear the 1812 Overture, but it’s not really going to be the same, right?

Let’s go through where, architecturally, post-ingest deduplication fails you:

  1. The network becomes your bottleneck again. You have to send all your backup data to the appliance.
  2. The staging tier has to have at least as much capacity available as the size of your biggest backup – and that’s assuming it can complete its post-process deduplication in the window between when your previous backup finishes and your next backup starts.
  3. The deduplication process becomes entirely spindle bound. If you’re using spinning disk, that’s a nightmare. If you’re using SSD, that’s $$$.
  4. There’s no way of telling how much space will be occupied on the deduplication tier until deduplication processing completes. This can lead you into very messy situations where, say, the staging tier can’t empty because the deduplication tier has filled. (Yes, capacity maintenance is still a requirement on pre-ingest deduplication systems, but it’s half the effort.)

What this means is simple: post-ingest deduplication architectures are asking you to pay for their architectural inefficiencies. Specifically:

  1. You have to pay to increase your network bandwidth to get a complete copy of your data from client to protection storage within your backup window. (There’s a rough calculation of what that can look like after this list.)
  2. You have to pay for both the staging tier storage and the deduplication tier storage. (In fact, the staging tier often has to be a lot bigger than your biggest 24-hour backup volume, so that the deduplication can be handled in time.)
  3. You have to factor the additional housekeeping operations into blackout windows, outages, etc. Housekeeping almost invariably becomes a daily rather than a weekly task, too.
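To put some very rough numbers against points 1 and 2, here’s a back-of-the-envelope calculation. The 15 TB nightly landing size and 8 hour window are figures I’ve picked purely for illustration:

```python
backup_tb = 15        # assumed nightly data landing on the staging tier, in TB
window_hours = 8      # assumed backup window

# Bits to move, divided by seconds available, expressed in Gbit/s
required_gbps = (backup_tb * 10**12 * 8) / (window_hours * 3600) / 10**9
print(f"Sustained throughput needed: {required_gbps:.1f} Gbit/s")   # ~4.2 Gbit/s

# And the staging tier needs at least the full landing size, pre-dedupe:
print(f"Minimum staging capacity: {backup_tb} TB (more if post-processing can't keep up)")
```

That’s around 4.2 Gbit/s of sustained throughput before any protocol overhead – the kind of number that forces network upgrades driven purely by backup.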

Compare all that to pre-ingest deduplication:

Pre-Ingest Deduplication

Using pre-ingest deduplication, especially Boost based deduplication, the segmentation and hashing happen directly where the data is, and rather than sending all the data to be protected from the client to the Data Domain, we only send the unique data. Data that already resides on the Data Domain? All we’ll have sent is a tiny fingerprint so the Data Domain can confirm it’s already there (and update its pointers for the existing data), and then we move on. After your first backup, that potentially means that on a day to day basis your network requirements for backup are reduced by 95% or more.
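For contrast, here’s a similarly simplified sketch of that source-side flow: segment and fingerprint on the client, ask the protection storage which fingerprints it already holds, and only send the misses. Again, this is a conceptual illustration of distributed segment processing – the segment size, hash and the toy ‘target’ object are all my assumptions, not the actual Boost protocol or APIs:

```python
import hashlib

SEGMENT_SIZE = 64 * 1024   # assumed segment size, for illustration only

class InMemoryTarget:
    """Stand-in for deduplicating protection storage (illustrative only)."""
    def __init__(self):
        self.segments = {}   # fingerprint -> segment data
        self.recipes = {}    # backup name -> ordered fingerprint list

    def has(self, fingerprint):
        return fingerprint in self.segments

    def store(self, fingerprint, segment):
        self.segments[fingerprint] = segment

    def commit(self, name, recipe):
        self.recipes[name] = recipe

def client_side_backup(path, target):
    """Fingerprint on the client; send only segments the target doesn't hold."""
    recipe = []
    with open(path, 'rb') as source:
        while True:
            segment = source.read(SEGMENT_SIZE)
            if not segment:
                break
            fingerprint = hashlib.sha256(segment).hexdigest()
            recipe.append(fingerprint)
            if not target.has(fingerprint):        # only a tiny fingerprint query crosses the wire
                target.store(fingerprint, segment) # unique data is all that gets sent
    target.commit(path, recipe)   # target reconstructs the backup from references
```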

That’s why architecture matters: you’re either doing it right, or you’re paying the price for someone else’s inefficiency.


If you want to see more about how a well architected backup environment looks – technology, people and processes, check out my book, Data Protection: Ensuring Data Availability.

Mar 222017
 

It’s fair to say I’m a big fan of Queen. They shaped my life – the only band to have even a remotely similar effect on me was ELO. (Yes, I’m an Electric Light Orchestra fan. Seriously, if you haven’t listened to the Eldorado or Time operatic albums in the dark you haven’t lived.)

Queen taught me a lot: the emotional perils of travelling at near-relativistic speeds and returning home, that maybe immortality isn’t what fantasy makes it seem like, and, amongst a great many other things, that you need to take a big leap from time to time to avoid getting stuck in a rut.

But you can find more prosaic meanings in Queen, too, if you want to. One of them deals with long term retention. We get that lesson from one of the choruses for Too much love will kill you:

Too much love will kill you,

Just as sure as none at all

Hang on, you may be asking, what’s that got to do with long term retention?

Replace ‘love’ with ‘data’ and you’ve got it.

Glass

I’m a fan of the saying:

It’s always better to backup a bit too much than not quite enough.

In fact, it’s something I mention again in my book, Data Protection: Ensuring Data Availability. Perhaps more than once. (I’ve mentioned my book before, right? If you like my blog or want to know more about data protection, you should buy the book. I highly recommend it…)

That’s something that works quite succinctly for what I’d call operational backups: your short term retention policies. They’re going to be the backups where you’re keeping, say, weekly fulls and daily incrementals for (typically) between 4 and 6 weeks for most businesses. For those sorts of backups, you definitely want to err on the side of caution when choosing what to backup.

Now, that’s not to say you don’t err on the side of caution when you’re thinking about long term retention, but caution definitely becomes a double-edged sword: the caution of making sure you’re backing up what you are required to, but also the caution of making sure you’re not wasting money.

Let’s start with a simpler example: do you backup your non-production systems? For a lot of environments, the answer is ‘yes’ (and that’s good). So if the answer is ‘yes’, let me ask the follow-up: do you apply the same retention policies for your non-production backups as you do for your production backups? And if the answer to that is ‘yes’, then my final question is this: why? Specifically, are you doing it because it’s (a) habit, (b) what you inherited, or (c) because there’s a mandated and sensible reason for doing so? My guess is that in 90% of scenarios, the answer is (a) or (b), not (c). That’s OK, you’re in the same boat as the rest of the industry.

Let’s say you have 10TB of production data, and 5TB of non-production data. Not worrying about deduplication for the moment, if you’re doing weekly fulls and daily incrementals, with a 3.5% daily change (because I want to hurt my brain with mathematics tonight – trust me, I still count on my fingers, and 3.5 on your fingers is hard) and a 5 week retention period, then you’re generating:

  • 5 x (10+5) TB in full backups
  • 30 x ((10+5) x 0.035) TB in incremental backups

That’s 75 TB (full) + 15.75 TB (incr) of backups generated for 15TB of data over a 5 week period. Yes, we’ll use deduplication (it is, after all, rather popular with NetWorker) and shrink that number quite nicely thank-you, but 90.75 TB of logical backups over 5 weeks for 15TB of data is the end number we arrive at.

But do you really need to generate that many backups? Do you really need to keep five weeks worth of non-production backups? What if instead you kept just two weeks of non-production backups, generating:

  • 5 x 10 TB in full production backups
  • 2 x 5 TB in full non-prod backups
  • 30 x 10 x 0.035 TB in incremental production backups
  • 12 x 5 x 0.035 TB in incremental non-prod backups

That becomes 50TB (full prod) + 10 TB (full non-prod) + 10.5 TB (incr prod) + 2.1 TB (incr non-prod) over any 5 week period, or 72.6 TB instead of 90.75 TB – a saving of 20%.
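If you want to sanity-check those figures, or plug in your own, the arithmetic is trivially scriptable. Here’s a quick Python sketch using the same assumptions as above:

```python
def operational_logical_tb(size_tb, fulls, incrementals, change_rate=0.035):
    """Logical (pre-dedupe) backup volume over the retention window."""
    return fulls * size_tb + incrementals * size_tb * change_rate

# Everything kept for 5 weeks: 5 fulls, 30 incrementals
combined = operational_logical_tb(15, 5, 30)            # 90.75 TB

# Split policies: prod for 5 weeks, non-prod for 2 weeks (2 fulls, 12 incrementals)
split = (operational_logical_tb(10, 5, 30)              # 60.5 TB prod
         + operational_logical_tb(5, 2, 12))            # 12.1 TB non-prod

print(f"{combined:.2f} TB vs {split:.2f} TB: {1 - split / combined:.0%} saving")
```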

(If you’re still pushing your short-term operational backups to tape, your skin is probably crawling at the above suggestion: “I’ll need more tape drives!” Well, yes you would, because tape is inflexible. So using backup to disk means you can start saving on media, because you don’t need to make sure you have enough tape drives for every potential pool that would be written to at any given time.)

A 20% saving on operational backups for 15TB of data might not sound like a lot, but now let’s start thinking about long term retention (LTR).

There are two particular ways we see long term retention data handled: monthlies kept for the entire LTR period, or monthlies kept for 12-13 months with just the end-of-calendar-year (EoCY) and end-of-financial-year (EoFY) backups kept for the full LTR period. I’d suggest that the knee-jerk reaction of many businesses is to keep monthlies for the entire time. That doesn’t necessarily have to be the case though – and this is the sort of thing that should also be investigated: do you legally need to keep all your monthly backups for your LTR, or do you just need to keep those EoCY and EoFY backups for that period? That alone might be a huge saving.

Let’s assume though that you’re keeping those monthly backups for your entire LTR period. We’ll assume you’re also not in engineering, where you need to keep records for the lifetime of the product, or biosciences, where you need to keep records for the lifetime of the patient (and longer), and just stick with the tried-and-trusted 7 year retention period seen almost everywhere.

For LTR, we also have to consider yearly growth. I’m going to cheat and assume 10% year on year growth, with the growth only kicking in once a year. (In reality for many businesses it’s more like true compound annual growth, amortised monthly, which does change things around a bit.)

So let’s go back to those numbers. We’ve already established what we need for operational backups, but what do we need for LTR?

If we’re not differentiating between prod and non-prod (and believe me, that’s common for LTR), then our numbers look like this:

  • Year 1: 12 x 15 TB
  • Year 2: 12 x 16.5 TB
  • Year 3: 12 x 18.15 TB
  • Year 4: 12 x 19.965 TB
  • Year 5: 12 x 21.9615 TB
  • Year 6: 12 x 24.15765 TB
  • Year 7: 12 x 26.573415 TB

Total? 1,707.69 TB of LTR for a 7 year period. (And even as data ages out, that will still grow as the YoY growth continues.)

But again, do you need to keep non-prod backups for LTR? What if we didn’t – what would those numbers look like?

  • Year 1: 12 x 10 TB
  • Year 2: 12 x 11 TB
  • Year 3: 12 x 12.1 TB
  • Year 4: 12 x 13.31 TB
  • Year 5: 12 x 14.641 TB
  • Year 6: 12 x 16.1051 TB
  • Year 7: 12 x 17.71561 TB

That comes down to just 1,138 TB over 7 years – a 33% saving in LTR storage.

We got that saving just by looking at splitting off non-production data from production data for our retention policies. What if we were to do more? Do you really need to keep all of your production data for an entire 7-year LTR period? If we’re talking a typical organisation looking at 7 year retention periods, we’re usually only talking about critical systems that face compliance requirements – maybe some financial databases, one section of a fileserver, and email. What if that was just 1 TB of the production data? (I’d suggest that for many companies, a guesstimate of 10% of production data being the data required – legally required – for compliance retention is pretty accurate.)

Well then your LTR data requirements would be just 113.85 TB over 7 years – a saving of 93% in LTR storage requirements (pre-deduplication) for an initial 15 TB of data.
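And the LTR sums are just as easy to script if you want to test your own growth rate, retention period or compliance subset – a small sketch using the same assumptions (10% growth applied annually, 12 monthlies a year, 7 years):

```python
def ltr_logical_tb(base_tb, years=7, growth=0.10, monthlies_per_year=12):
    """Total logical (pre-dedupe) monthly-full volume retained over the LTR
    period, with growth applied once per year."""
    return sum(monthlies_per_year * base_tb * (1 + growth) ** year
               for year in range(years))

everything = ltr_logical_tb(15)   # prod + non-prod kept for 7 years: ~1,707.7 TB
prod_only  = ltr_logical_tb(10)   # drop non-prod from LTR:           ~1,138.5 TB
compliance = ltr_logical_tb(1)    # only the compliance-mandated 10%: ~113.8 TB

print(f"Savings vs keeping everything: {1 - prod_only / everything:.0%} "
      f"and {1 - compliance / everything:.0%}")
```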

I’m all for backing up a little bit too much rather than not enough, but once we start looking at LTR, we have to take that adage with a grain of salt. (I’ll suggest that, in my experience, it’s an adage that locks a lot of companies into using tape for LTR.)

Too much data will kill you,

Just as sure as none at all

That’s the lesson we get from Queen for LTR.

…Now if you’ll excuse me, now I’ve talked a bit about Queen, I need to go and listen to their greatest song of all time, March of the Black Queen.
