My cup runneth over

Introduction

Note: I first wrote this in November 2016, but since it covers topics that come up regularly with customers, I felt it was time to revisit.

How do you handle data protection storage capacity?

How do you handle growth – regular or unexpected – in your data protection volumes?

Is your business reactive or proactive to data protection capacity requirements?

Glass

In the land of tape, dealing with capacity growth in data protection was both easy and insidiously obfuscated. Tape capacity management is basically a modern version of Hilbert’s Infinite Hotel Paradox – you sort-of, kind-of never run out of capacity because you always just buy another box of tapes. Problem solved, right? (No, more a case of the can kicked down the road.) Problem “solved” and you’ve got 1,000, 10,000, 50,000 tapes in a multitude of media types that you don’t even have tape drives to read any more.

But let’s focus on the real world now: tape isn’t the de facto standard for backup systems any more – it’s disk. Disk gives us great power, but with great power comes great responsibility (sorry, even though I’m not a Spiderman fan, I couldn’t resist. Tape is the opposite: tape gives us no power, and with no power comes no responsibility – yes, I’m also a Kick-Ass fan.)

For businesses that still do disk-to-disk-to-tape, where disk is treated more like a staging area and excess data is written out to tape, the problem is seemingly solved because – you guessed it – you can always just buy another box of tapes and stage more data from disk backup storage to tape storage. Again, that’s kicking the can down the road. I’ve known businesses with company-wide data protection policies mandating up to 3 months of online recoverability from disk that have slipped to two weeks or less of data stored on disk, because the data to be protected kept growing, the storage was never scaled, and – you guessed it – tape was the interim solution.

Aside: When I first joined my first Unix system administration team in 1996, the team had just finished configuring an interim DNS server which they called tmp, because it was going to be quickly replaced by another server, which for the short term was called nc – for new computer. When I left in 2000, tmp and nc were still there; in fact, nnc (yes, new-new-computer) was deployed shortly afterwards to replace nc, and eventually, a year or two after I left, tmp was finally decommissioned.

Interim solutions have a tendency to stick. In fact, it’s a common story – there’s a capacity problem with data protection, so let’s deploy an interim solution and solve it properly later. Later-later. Much later. Much later-later. Ad infinitum. (That’s why way too many companies just keep kicking the can down the road when it comes to data lifecycle management: “this is too hard, we’ll just buy another 100 TB”.)

There is, undoubtedly, a growing maturity in handling data protection storage management and capacity planning coming out of the pure disk and disk/cloud storage formats. While this is driven by necessity, it’s also an important demonstration that IT processes need to mature as the business does.

If you’re new to pure disk-based or disk/cloud-based data protection storage, you might want to stop and think carefully about your data protection policies and procurement processes/cycles so that you’re able to properly meet the requirements of the business. Here are a few tips I’ve learnt over the years…

80% is the new 100%

This one is easy. Don’t think of 100% capacity as being 100% capacity. Think of 80% as 100%. Why? Because you need runway to either procure storage, migrate data or get formal approval for changes to retention and backup policies. If you wait until you’re at 90, 95 or even 100% capacity, you’ve left your run too late and you’re just asking for many late or sleepless nights managing a challenge that could have been proactively dealt with earlier.

Note: This isn’t to say that your data protection storage will stop working at 80%! I’ve seriously been in meetings where people have assumed that’s what I mean. Maybe that’s true of other data protection storage, but it’s certainly not for Data Domain: it’ll keep happily writing away until you hit 100%. But it’s never going to let you write to 101%.

At 80% you have to know what your plan is. Actually, I’ll go one better than that: you should know well in advance what your potential plans are, and at 80% the only work should be to decide which plan you need to invoke. That’s it.
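To make the 80% rule concrete, here’s a minimal sketch of the sort of check you might script – assuming you can already query used and total capacity for your platform. The threshold and the capacity figures are purely illustrative, not vendor defaults.

```python
# Minimal sketch of treating 80% as the "act now" threshold.
# ACTION_THRESHOLD and the example figures are illustrative assumptions.

ACTION_THRESHOLD = 0.80  # the point at which a pre-agreed plan must be invoked

def capacity_status(used_tb: float, total_tb: float) -> str:
    """Return a simple status string for a data protection storage pool."""
    utilisation = used_tb / total_tb
    if utilisation >= 1.0:
        return "FULL: writes will fail"
    if utilisation >= ACTION_THRESHOLD:
        return f"ACT NOW: {utilisation:.0%} used - invoke the expansion plan"
    return f"OK: {utilisation:.0%} used"

if __name__ == "__main__":
    print(capacity_status(used_tb=164.0, total_tb=200.0))  # 82% -> ACT NOW
```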

The key to management is measurement

I firmly believe you can’t manage something that has operational capacity constraints (e.g., “if we hit 100% capacity we can’t do more backups”) if you’re not actively measuring it. That doesn’t mean periodically logging into a console or running a “df -h” or whatever the “at a glance” check is for your data protection storage; it means capturing measurement data and having it available in both reports and dashboards so it’s instantly visible.
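As a rough illustration, here’s a sketch of capturing a daily capacity sample to a CSV file that reports and dashboards can then consume. The get_capacity_tb() function is a hypothetical placeholder for whatever API or CLI query your storage platform actually provides.

```python
# Minimal sketch: append one capacity sample per day to a CSV file.
# get_capacity_tb() is a hypothetical placeholder, not a real platform call.

import csv
from datetime import date
from pathlib import Path

SAMPLES = Path("capacity_samples.csv")

def get_capacity_tb() -> tuple[float, float]:
    """Hypothetical: replace with your storage platform's API or CLI query."""
    return 164.0, 200.0  # used TB, total TB

def record_sample() -> None:
    """Append today's (used, total) figures, writing a header on first run."""
    used, total = get_capacity_tb()
    new_file = not SAMPLES.exists()
    with SAMPLES.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "used_tb", "total_tb"])
        writer.writerow([date.today().isoformat(), used, total])

if __name__ == "__main__":
    record_sample()  # run daily from cron or a scheduler
```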

The key to measurement is trending

You can capture all the data in the world and make it available in a dashboard, but if you don’t perform appropriate trending against that data to analyse it, you’re making your own good self the bottleneck (and weakest link) in the capacity management equation. You need trends produced as part of your reporting processes to understand how capacity is changing over time, and those trends should reflect your own seasonal data variations and be sampled over multiple time periods. Why? Well, if you have disk-based data protection storage in your environment and run a linear forecast on capacity utilisation from day one, you’ll likely get a smoothing effect from the lower figures earlier in the system lifecycle that can obfuscate more recent results. So you want to capture and trend that full history for comparison, but you equally want to capture and trend shorter timeframes to ensure you understand shorter-term changes. Trends based on the last six and three months of usage can be very useful in identifying the sort of capacity management challenges you’ve got from short-term changes in data usage profiles – a few systems, for instance, might be spiking considerably in utilisation, and if you’re still comparing against a 3-year dataset or something along those lines, that more recent profile may not be accurately represented in your forecasts.

In short: measuring over multiple periods gives you the best accuracy.
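For illustration, here’s a minimal sketch of fitting the same capacity history over several windows – full history, roughly six months and roughly three months – and seeing how differently each one forecasts the run-up to 80%. The synthetic growth pattern below stands in for samples like those captured earlier and is invented purely to show the effect of window choice.

```python
# Minimal sketch: linear forecasts over multiple trending windows.
# The synthetic history (slow growth, then a recent spike) is illustrative.

import numpy as np

def forecast_days_to_threshold(used_tb, window_days, total_tb=200.0, threshold=0.80):
    """Fit a linear trend to the last `window_days` samples and estimate the
    number of days until utilisation crosses `threshold` of total capacity."""
    y = np.asarray(used_tb[-window_days:], dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    if slope <= 0:
        return None  # flat or shrinking within this window
    target = threshold * total_tb
    return max((target - y[-1]) / slope, 0.0)

if __name__ == "__main__":
    # ~2 years of slow growth, then a 90-day spike in daily change rate.
    history = list(np.linspace(60, 120, 640)) + list(np.linspace(120, 150, 90))
    for window in (len(history), 180, 90):  # full history, ~6 months, ~3 months
        days = forecast_days_to_threshold(history, window)
        print(f"window={window:>3} days -> ~{days:.0f} days until 80%")
```

The full-history fit suggests far more runway than the three-month fit does, which is exactly the smoothing problem described above.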

Maximum is the new minimum

Linear forecasts of trending information are good if you’re just slowly, continually increasing your storage requirements. But if you’re either staging data (disk as staging) or running garbage collection (e.g., deduplication), it’s quite possible to get increasing sawtooth cycles in capacity utilisation on your data protection storage. And guess what? It doesn’t matter if you have capacity for the average utilisation if you’ll grow beyond the capacity needed on the day before the oldest backups are deleted or garbage collection takes place. So make sure when you’re trending you’re looking at how you’ll meet the changing maximum peaks, not the average sizes.
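A minimal sketch of that idea, assuming a weekly garbage collection cycle (the cycle length and sample data are made up): collapse the daily samples to per-cycle peaks and trend those, rather than the average.

```python
# Minimal sketch: trend per-cycle peaks for sawtoothing capacity utilisation.
# The weekly cycle length and the synthetic samples are illustrative assumptions.

import numpy as np

def peak_per_cycle(used_tb, cycle_days=7):
    """Collapse daily samples to the maximum seen in each complete cycle."""
    samples = np.asarray(used_tb, dtype=float)
    n_cycles = len(samples) // cycle_days
    return samples[:n_cycles * cycle_days].reshape(n_cycles, cycle_days).max(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    base = np.linspace(100, 130, 84)          # 12 weeks of underlying growth
    sawtooth = 15 * (np.arange(84) % 7) / 7   # builds up, then GC resets it
    daily = base + sawtooth + rng.normal(0, 1, 84)
    peaks = peak_per_cycle(daily)
    slope, intercept = np.polyfit(np.arange(len(peaks)), peaks, 1)
    print(f"average utilisation: {daily.mean():.1f} TB")
    print(f"latest weekly peak:  {peaks[-1]:.1f} TB (growing ~{slope:.1f} TB/week)")
```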

10GB is not 10GB

Something I get asked from time to time goes like this: “Hey, I just want to back up 20TB. Can you give me a price for that?” Here’s the short answer: no, I can’t. And that’s not because I’m in presales rather than sales. It’s because with that level of information, I literally can’t give you an answer.

As part of modern capacity management, it’s not good enough to know that you’re going to be adding 10GB, 100GB, 100TB, or any other amount of data to your backup environment: you need to understand what the data is, as well. You also need a rough idea of its daily change rate and growth rate. (Here’s a hint: if someone with a shiny new product tells you they don’t need that information, it’s probably because all they’re doing is compressing rather than deduplicating data. I can give you an average compression ratio in less time than it took me to type this sentence: deduplication – true deduplication – is another thing entirely.)

If you’re backing up to tape, you’ll assume an average of the hardware compression ratio (and hell, another box of tapes is just another box of tapes, right?); if you’re backing up to dumb disk that’s not doing any deduplication, then yes, 10GB is 10GB. But in the world of deduplication, it’s really important to understand: is that filesystem data, virtual machine image data, database data, uncompressible data, encrypted data, and so on? So if you’re looking at understanding capacity and growth rules for a modern data protection environment, you’ll want some estimates of what those different types of data might contribute to capacity – or at least know to double-check. Sometimes the scariest request I get is “We’re adding a new workload, can you quote an expansion tray?” My first answer is always: “What’s this new workload?”
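To illustrate just how much the data type matters, here’s a small sketch using made-up planning ratios. Every number below is an invented assumption; real deduplication rates depend entirely on your data, change rate and retention – which is exactly why the “what’s the workload?” question has to be asked.

```python
# Minimal sketch of why "10GB is not 10GB" once deduplication is involved.
# All ratios below are hypothetical planning assumptions, not measured results.

ASSUMED_DEDUP_RATIO = {
    "filesystem": 10.0,   # hypothetical
    "vm_image": 15.0,     # hypothetical
    "database": 5.0,      # hypothetical
    "encrypted": 1.0,     # pre-encrypted data barely deduplicates at all
}

def estimate_stored_tb(logical_tb: float, data_type: str) -> float:
    """Rough physical capacity estimate for a given logical workload size."""
    return logical_tb / ASSUMED_DEDUP_RATIO[data_type]

if __name__ == "__main__":
    for data_type in ASSUMED_DEDUP_RATIO:
        print(f"20 TB of {data_type:<10} -> ~{estimate_stored_tb(20, data_type):.1f} TB stored")
```

The same 20TB of logical data lands anywhere from a couple of terabytes to the full 20TB of physical capacity, depending purely on what it is.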

Know your windows

There are three types of windows I’m referring to here – change, change freeze, and procurement.

You need to know them all intimately.

You’re at 95% capacity but you anticipated this and additional data protection storage has just arrived in your datacentre’s receiving bay, so you should be right to install it – right? What happens if you then have to wait an extra week to have the change board consider your request for an outage – or datacentre access – to install the extra capacity? Will you be able to hold on that long? That’s knowing your change windows.

You know you’re going to run out of capacity in two months’ time if nothing is done, so you order additional data protection storage and it arrives on December 20. The only problem is that a mandatory company change blackout started on December 19 and you literally cannot install anything until January 20. Do you have enough capacity to survive? That’s knowing your freeze windows.

You know you’re at 80% capacity today and based on the trends you’ll be at 90% capacity in 3 weeks and 95% capacity in 4 weeks. How long does it take to get a purchase order approved? How long does it take the additionally purchased systems to arrive on-site? If it takes you 4 weeks to get purchase approval and another 3 weeks for it to arrive after the purchase order is sent, maybe 70%, not 80%, is your new 100%. That’s knowing your procurement windows.
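As a rough sketch of that arithmetic (the growth rate, lead times and pool size below are illustrative, not a recommendation), you can fold the procurement lead time straight back into your trigger threshold:

```python
# Minimal sketch: fold procurement lead time back into the trigger threshold.
# All figures in the example are illustrative assumptions.

def effective_trigger(total_tb: float,
                      growth_tb_per_week: float,
                      approval_weeks: float,
                      delivery_weeks: float,
                      hard_ceiling: float = 1.0) -> float:
    """Utilisation at which the purchase must start so new capacity arrives
    before utilisation reaches `hard_ceiling` of total capacity."""
    lead_weeks = approval_weeks + delivery_weeks
    headroom_tb = lead_weeks * growth_tb_per_week
    return hard_ceiling - headroom_tb / total_tb

if __name__ == "__main__":
    # Illustrative: 200 TB pool growing ~5 TB/week, 4 weeks for purchase
    # approval, 3 weeks for delivery and installation.
    print(f"to stay under 100%: order by {effective_trigger(200, 5, 4, 3):.0%} utilisation")
    print(f"to stay under 80%:  order by {effective_trigger(200, 5, 4, 3, hard_ceiling=0.80):.0%} utilisation")
```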

Final thoughts

I want to stress – this isn’t a doom and gloom article, even if it seems I’m painting a blunt picture. What I’ve described above are expert tips – not just from myself, but from my customers, and from customers of colleagues and friends, whom I’ve seen manage data protection storage capacity well. If you follow at least the above guidelines, you’re going to have a far more successful – and more relaxed – time of it all.

And maybe you’ll get to spend Thanksgiving, Christmas, Ramadan, Maha Shivaratri, Summer Solstice, Melbourne Cup Day, Labour Day or whatever your local holidays and festivals are with your friends and families, rather than manually managing an otherwise completely manageable situation.
