Mar 222017
 

It’s fair to say I’m a big fan of Queen. They shaped my life – the only band to have even a remotely similar effect on me was ELO. (Yes, I’m an Electric Light Orchestra fan. Seriously, if you haven’t listened to the Eldorado or Time operatic albums in the dark you haven’t lived.)

Queen taught me a lot: the emotional perils of travelling at near-relativistic speeds and returning home, that maybe immorality isn’t what fantasy makes it seem like, and, amongst a great many other things, that you need to take a big leap from time to time to avoid getting stuck in a rut.

But you can find more prosaic meanings in Queen, too, if you want to. One of them deals with long term retention. We get that lesson from one of the choruses for Too much love will kill you:

Too much love will kill you,

Just as sure as none at all

Hang on, you may be asking, what’s that got to do with long term retention?

Replace ‘love’ with ‘data’ and you’ve got it.

Glass

I’m a fan of the saying:

It’s always better to backup a bit too much than not quite enough.

In fact, it’s something I mention again in my book, Data Protection: Ensuring Data Availability. Perhaps more than once. (I’ve mentioned my book before, right? If you like my blog or want to know more about data protection, you should buy the book. I highly recommend it…)

That’s something that works quite succinctly for what I’d call operational backups: your short term retention policies. They’re going to be the backups where you’re keeping say, weekly fulls and daily incrementals for (typically) between 4-6 weeks for most businesses. For those sorts of backups, you definitely want to err on the side of caution when choosing what to backup.

Now, that’s not to say you don’t err on the side of caution when you’re thinking about long term retention, but caution definitely becomes a double-edged sword: the caution of making sure you’re backing up what you are required to, but also the caution of making sure you’re not wasting money.

Let’s start with a simpler example: do you backup your non-production systems? For a lot of environments, the answer is ‘yes’ (and that’s good). So if the answer is ‘yes’, let me ask the follow-up: do you apply the same retention policies for your non-production backups as you do for your production backups? And if the answer to that is ‘yes’, then my final question is this: why? Specifically, are you doing it because it’s (a) habit, (b) what you inherited, or (c) because there’s a mandated and sensible reason for doing so? My guess is that in 90% of scenarios, the answer is (a) or (b), not (c). That’s OK, you’re in the same boat as the rest of the industry.

Let’s say you have 10TB of production data, and 5TB of non-production data. Not worrying about deduplication for the moment, if you’re doing weekly fulls and daily incrementals, with a 3.5% daily change (because I want to hurt my brain with mathematics tonight – trust me, I still count on my fingers, and 3.5 on your fingers is hard) with a 5 week retention period then you’re generating:

  • 5 x (10+5) TB in full backups
  • 30 x ((10+5) x 0.035) TB in incremental backups

That’s 75 TB (full) + 15.75 TB (incr) of backups generated for 15TB of data over a 5 week period. Yes, we’ll use deduplication because it’s so popular with NetWorker and shrink that number quite nicely thank-you, but 90.75 TB of logical backups over 5 weeks for 15TB of data is the end number we get at.

But do you really need to generate that many backups? Do you really need to keep five weeks worth of non-production backups? What if instead you’re generating:

  • 5 x 10 TB in full production backups
  • 2 x 5 TB in full non-prod backups
  • 30 x 10 x 0.035 TB in incremental production backups
  • 12 x 5 x 0.035 TB in incremental non-prod backups

That becomes 50TB (full prod) + 10 TB (full non-prod) + 10.5 TB (incr prod) + 2.1 TB (incr non-prod) over any 5 week period, or 72.6 TB instead of 90.75 TB – a saving of 20%.

(If you’re still pushing your short-term operational backups to tape, your skin is probably crawling at the above suggestion: “I’ll need more tape drives!” Well, yes you would, because tape is inflexible. So using backup to disk means you can start saving on media, because you don’t need to make sure you have enough tape drives for every potential pool that would be written to at any given time.)

A 20% saving on operational backups for 15TB of data might not sound like a lot, but now let’s start thinking about long term retention (LTR).

There’s two particular ways we see long term retention data handled: monthlies kept for the entire LTR period, or keeping monthlies for 12-13 months and just keeping end-of-calendar-year (EoCY) + end-of-financial-year (EoFY) for the LTR period. I’d suggest that the knee-jerk reaction by many businesses is to keep monthlies for the entire time. That doesn’t necessarily have to be the case though – and this is the sort of thing that should also be investigated: do you legally need to keep all your monthly backups for your LTR, or do you just need to keep those EoCY and EoFY backups for that period? That alone might be a huge saving.

Let’s assume though that you’re keeping those monthly backups for your entire LTR period. We’ll assume you’re also not in engineering, where you need to keep records for the lifetime of the product, or biosciences, where you need to keep records for the lifetime of the patient (and longer), and just stick with the tried-and-trusted 7 year retention period seen almost everywhere.

For LTR, we also have to consider yearly growth. I’m going to cheat and assume 10% year on year growth, but the growth only kicks in once a year. (In reality for many businesses it’s more like a true compound annual growth, ammortized monthly, which does change things around a bit.)

So let’s go back to those numbers. We’ve already established what we need for operational backups, but what do we need for LTR?

If we’re not differentiating between prod and non-prod (and believe me, that’s common for LTR), then our numbers look like this:

  • Year 1: 12 x 15 TB
  • Year 2: 12 x 16.5 TB
  • Year 3: 12 x 18.15 TB
  • Year 4: 12 x 19.965 TB
  • Year 5: 12 x 21.9615 TB
  • Year 6: 12 x 24.15765 TB
  • Year 7: 12 x 26.573415 TB

Total? 1,707.69 TB of LTR for a 7 year period. (And even as data ages out, that will still grow as the YoY growth continues.)

But again, do you need to keep non-prod backups for LTR? What if we didn’t – what would those numbers look like?

  • Year 1: 12 x 10 TB
  • Year 2: 12 x 11 TB
  • Year 3: 12 x 12.1 TB
  • Year 4: 12 x 13.31 TB
  • Year 5: 12 x 14.641 TB
  • Year 6: 12 x 16.1051 TB
  • Year 7: 12  17.71561 TB

That comes down to just 1,138 TB over 7 years – a 33% saving in LTR storage.

We got that saving just by looking at splitting off non-production data from production data for our retention policies. What if we were to do more? Do you really need to keep all of your production data for an entire 7-year LTR period? If we’re talking a typical organisation looking at 7 year retention periods, we’re usually only talking about critical systems that face compliance requirements – maybe some financial databases, one section of a fileserver, and email. What if that was just 1 TB of the production data? (I’d suggest that for many companies, a guesstimate of 10% of production data being the data required – legally required – for compliance retention is pretty accurate.)

Well then your LTR data requirements would be just 113.85 TB over 7 years, and that’s a saving of 93% of LTR storage requirements (pre-deduplication) over a 7 year period for an initial 15 TB of data.

I’m all for backing up a little bit too much than not enough, but once we start looking at LTR, we have to take that adage with a grain of salt. (I’ll suggest that in my experience, it’s something that locks a lot of companies into using tape for LTR.)

Too much data will kill you,

Just as sure as none at all

That’s the lesson we get from Queen for LTR.

…Now if you’ll excuse me, now I’ve talked a bit about Queen, I need to go and listen to their greatest song of all time, March of the Black Queen.

Mar 132017
 

The NetWorker usage report for 2016 is now complete and available here. Per previous years surveys, the survey ran from December 1, 2016 through to January 1, 2017.

Survey

There were some interesting statistics and trends arising from this survey. The percentages of businesses not using backup to disk in at least some form within their environment fell to just 1% of respondents. That’s 99% of respondents having some form of backup to disk within their environment!

More and more respondents are cloning within their environments – if you’re not cloning in yours, you’re falling behind the curve now in terms of ensuring your backup environment can’t be a single point of failure.

There’s plenty of other results and details in the survey report you may be interested in, including:

  • Changes to the number of respondents using dedicated backup administrators
  • Cloud adoption rates
  • Ransomware attacks
  • The likelihood of businesses using or planning to use object storage as part of their backup environment
  • and many more

You can download the survey from the link above.

Just a reminder: “Data Protection: Ensuring Data Availability” is out now, and you can buy it in both paperback and electronic format from Amazon, or in paperback from the publisher, CRC Press. If you’ve enjoyed or found my blog useful, I’m sure you’ll find value in my latest book, too!

One respondent from this year’s survey will be receiving a signed copy of the book directly from me, too! That winner has been contacted.

Mar 102017
 

In 2008 I published “Enterprise Systems Backup and Recovery: A corporate insurance policy”. It dealt pretty much exclusively, as you might imagine, with backup and recovery concepts. Other activities like snapshots, replication, etc., were outside the scope of the book. Snapshots, as I recall, were mainly covered as an appendix item.

Fast forward almost a decade and there’s a new book on the marketplace, “Data Protection: Ensuring Data Availability” by yours truly, and it is not just focused on backup and recovery. There’s snapshots, replication, continuous data protection, archive, etc., all covered. Any reader of my blogs will know though that I don’t just think of the technology: there’s the business aspects to it as well, the process, training and people side of the equation. There’s two other titles I bandied with: “Backup is dead, long live backup”, and “Icarus Fell: Understanding risk in the modern IT environment”.

You might be wondering why in 2017 there’s a need for a book dedicated to data protection.

Puzzle Pieces

We’ve come a long way in data protection, but we’re now actually teetering on an interesting precipice, one which we need to understand and manage very carefully. In fact, one which has resulted in significant data loss situations for many companies world-wide.

IT has shifted from the datacentre to – well, anywhere. There’s still a strong datacentre focus. The estimates from various industry analysts is that around 70% of IT infrastructure spend is still based in the datacentre. That number is shrinking, but IT infrastructure is not; instead, it’s morphing. ‘Shadow IT’ is becoming more popular – business units going off on their own and deploying systems without necessarily talking to their IT departments. To be fair, Shadow IT always existed – it’s just back in the 90s and early 00s, it required the business units to actually buy the equipment. Now they just need to provide a credit card to a cloud provider.

Businesses are also starting to divest themselves of IT activities that aren’t their “bread and butter”, so to speak. A financial company or a hospital doesn’t make money from running an email system, so they outsource that email – and increasingly it’s to someone like Microsoft via Office 365.

Simply put, IT has become significantly more commoditised, accessible and abstracted over the past decade. All of this is good for the business, except it brings the business closer to that precipice I mentioned before.

What precipice? Risk. We’re going from datacentres where we don’t lose data because we’re deploying on highly resilient systems with 5 x 9s availability, robust layers of data protection and formal processes into situations where data is pushed out of the datacentre, out of the protection of the business. The old adage, “never assume, you make an ass out of u and me” is finding new ground in this modern approach to IT. Business groups trying to do a little data analytics rent a database at an hourly rate from a cloud provider and find good results, so they start using it more and more. But don’t think about data protection because they’ve never had to before. That led to things like the devastating data losses encountered by MongoDB users. Startups with higher level IT ideas are offering services without any understanding of the fundamental requirements of infrastructure protection. Businesses daily are finding that because they’ve spread their data over such a broad area, the attack vector has staggeringly increased, and hackers are turning that into a profitable business.

So returning to one of my first comments … you might be wondering why in 2017 there’s a need for a book dedicated to data protection? It’s simple: the requirement for data protection never goes away, regardless of whose infrastructure you’re using, or where your data resides. IT is standing on the brink of a significant evolution in how services are offered and consumed, and in so many situations it’s like a return to the early 90s. “Oh yeah, we bought a new server for a new project, it’s gone live. Does anyone know how we back it up?” It’s a new generation of IT and business users that need to be educated about data protection. Business is also demanding a return on investment for as much IT spend as possible, and that means data protection also needs to evolve to offer something back to the business other than saving you when the chips are down.

That’s why I’ve got a new book out about data protection: because the problem has not gone away. IT has evolved, but so has risk. That means data protection technology, data protection processes, and the way that we talk about data protection has to evolve as well. Otherwise we, as IT professionals, have failed in our professional duties.

I’m a passionate believer that we can always find a way to protect data. We think of it as business data, but it’s also user data. Customer data. If you work in IT for an airline it’s not just a flight bookings database you’re protecting, but the travel plans, the holiday plans, the emergency trips to sick relatives or getting to a meeting on time that you’re protecting, too. If you work in IT at a university, you’re not just protecting details that can be used for student billing, but also the future hopes and dreams of every student to pass through.

Let’s be passionate about data protection together. Let’s have that conversation with the business and help them understand how data protection doesn’t go away just because infrastructure it evolving. Let’s help the business understand that data protection isn’t a budget sink-hole, but it can improve processes and deliver real returns to the business. Let’s make sure that data, no matter where it is, is adequately protected and we can avoid that precipice.

“Data Protection: Ensuring Data Availability” is available now from the a variety of sellers, including my publisher and Amazon. Come on a journey with me and discover why backup is dead, long live backup.

Build vs Buy

 Architecture, Backup theory, Best Practice  Comments Off on Build vs Buy
Feb 182017
 

Converged, and even more so, hyperconverged computing, is all premised around the notion of build vs buy. Are you better off having your IT staff build your infrastructure from the ground up, managing it in silos of teams, or are you do you want to buy tightly integrated kit, land it on the floor and start using it immediately?

Dell-EMC’s team use the analogy – do you build your car, or do you buy it? I think this is a good analogy: it speaks to how the vast majority of car users consume vehicle technology. They buy a complete, engineered car as a package, and drive it off the car sales lot complete. Sure, there’s tinkerers who might like to build a car from scratch, but they’re not the average consumer. For me it’s a bit like personal computing – I gave up years ago wanting to build my own computers. I’m not interested in buying CPUs, RAM, motherboards, power supplies, etc., dealing with the landmines of compatibility, drivers and physical installation before I can get a usable piece of equipment.

This is where many people believe IT is moving, and there’s some common sense in it – it’s about time to usefulness.

A question I’m periodically posed is – what has backup got to do with the build vs buy aspect of hyperconverged? For one, it’s not just backup – it’s data protection – but secondly, it has everything to do with hyperconverged.

If we return to that build vs buy example of – would you build a car or buy a car, let me ask a question of you as a car consumer – a buyer rather than a builder of a car. Would you get airbags included, or would you search around for third party airbags?

Airbags

To be honest, I’m not aware of anyone who buys a car, drives it off the lot, and starts thinking, “Do I go to Airbags R Us, or Art’s Airbag Emporium to get my protection?”

That’s because the airbags come built-in.

For me at least, that’s the crux of the matter in the converged and hyper-converged market. Do you want third party airbags that you have to install and configure yourself, and hope they work with that integrated solution you’ve got bought, or do you want airbags included and installed as part of the purchase?

You buy a hyperconverged solution because you want integrated virtualisation, integrated storage, integrated configuration, integrated management, integrated compute, integrated networking. Why wouldn’t you also want integrated data protection? Integrated data protection that’s baked into the service catalogue and part of the kit as it lands on your floor. If it’s about time to usefulness it doesn’t stop at the primary data copy – it should also include the protection copies, too.

Airbags shouldn’t be treated as optional, after-market extras, and neither should data protection.

Feb 122017
 

On January 31, GitLab suffered a significant issue resulting in a data loss situation. In their own words, the replica of their production database was deleted, the production database was then accidentally deleted, then it turned out their backups hadn’t run. They got systems back with snapshots, but not without permanently losing some data. This in itself is an excellent example of the need for multiple data protection strategies; your data protection should not represent a single point of failure within the business, so having layered approaches to achieve a variety of retention times, RPOs, RTOs and the potential for cascading failures is always critical.

To their credit, they’ve published a comprehensive postmortem of the issue and Root Cause Analysis (RCA) of the entire issue (here), and must be applauded for being so open with everything that went wrong – as well as the steps they’re taking to avoid it happening again.

Server on Fire

But I do think some of the statements in the postmortem and RCA require a little more analysis, as they’re indicative of some of the challenges that take place in data protection.

I’m not going to speak to the scenario that led to the production, rather than replica database, being deleted. This falls into the category of “ooh crap” system administration mistakes that sadly, many of us will make in our careers. As the saying goes: accidents happen. (I have literally been in the situation of accidentally deleting a production database rather than its replica, and I can well and truly sympathise with any system or application administrator making that mistake.)

Within GitLab’s RCA under “Problem 2: restoring GitLab.com took over 18 hours”, several statements were made that irk me as a long-term data protection specialist:

Why could we not use the standard backup procedure? – The standard backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6.

As evidenced by a later statement (see the next RCA statement below), the procedure did not fail silently; instead, GitLab chose to filter the output of the backup process in a way that they did not monitor. There is, quite simply, a significant difference between fail silently and silently ignored results. The latter is a far more accurate statement than the former. A command that fails silently is one that exits with no error condition or alert. Instead:

Why did the backup procedure fail silently? – Notifications were sent upon failure, but because of the Emails being rejected there was no indication of failure. The sender was an automated process with no other means to report any errors.

The pg_dump command didn’t fail silently, as previously asserted. It generated output which was silently ignored due to a system configuration error. Yes, a system failed to accept the emails, and a system therefore failed to send the emails, but at the end of the day, a human failed to see or otherwise check as to why the backup reports were not being received. This is actually a critical reason why we need zero error policies – in data protection, no error should be allowed to continue without investigation and rectification, and a change in or lack of reporting or monitoring data for data protection activities must be treated as an error for investigation.

Why were Azure disk snapshots not enabled? – We assumed our other backup procedures were sufficient. Furthermore, restoring these snapshots can take days.

Simple lesson: If you’re going to assume something in data protection, assume it’s not working, not that it is.

Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing the procedure.

There are two sections of the answer that should serve as a dire warning: “there was no ownership”, “nobody was responsible”. This is a mistake many businesses make, but I don’t for a second believe there was no ownership. Instead, there was a failure to understand ownership. Looking at the “Team | GitLab” page, I see:

  • Dmitriy Zaporozhets, “Co-founder, Chief Technical Officer (CTO)”
    • From a technical perspective the buck stops with the CTO. The CTO does own the data protection status for the business from an IT perspective.
  • Sid Sijbrandij, “Co-founder, Chief Executive Officer (CEO)”
    • From a business perspective, the buck stops with the CEO. The CEO does own the data protection status for the business from an operational perspective, and from having the CTO reporting directly up.
  • Bruce Armstrong and Villi Iltchev, “Board of Directors”
    • The Board of Directors is responsible for ensuring the business is running legally, safely and financially securely. They indirectly own all procedures and processes within the business.
  • Stan Hu, “VP of Engineering”
    • Vice-President of Engineering, reporting to the CEO. If the CTO sets the technical direction of the company, an engineering or infrastructure leader is responsible for making sure the company’s IT works correctly. That includes data protection functions.
  • Pablo Carranza, “Production Lead”
    • Reporting to the Infrastructure Director (a position currently open). Data protection is a production function.
  • Infrastructure Director:
    • Currently assigned to Sid (see above), as an open position, the infrastructure director is another link in the chain of responsibility and ownership for data protection functions.

I’m not calling these people out to shame them, or rub salt into their wounds – mistakes happen. But I am suggesting GitLab has abnegated its collective responsibility by simply suggesting “there was no ownership”, when in fact, as evidenced by their “Team” page, there was. In fact, there was plenty of ownership, but it was clearly not appropriately understood along the technical lines of the business, and indeed right up into the senior operational lines of the business.

You don’t get to say that no-one owned the data protection functions. Only that no-one understood they owned the data protection functions. One day we might stop having these discussions. But clearly not today.

 

Jan 242017
 

In 2013 I undertook the endeavour to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand it based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was an increasingly common scenario, it was (for the most part) mainly used as a staging activity (“disk to disk to tape”), and backup to disk use was either dumb filesystems or Virtual Tape Libraries (VTL).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, core tenets of Cloud computing that made it so popular (e.g., agility and scaleability) have been well and truly adopted as essential tenets of the modern datacentre, as well. Indeed, for on-premises IT to compete against Cloud, on-premises IT has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to their businesses.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book the pinnacle of storage performance was the 15,000 RPM drive, and flash memory storage was something you (primarily) used in digital cameras only, with storage capacities measured in the hundreds of megabytes more than gigabytes (or now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendency – with virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as seperate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process and needed things done faster, cheaper, more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and for many businesses, profit too. Flash systems are now offering significantly more IOPs than a traditional array could – Dell EMC for instance can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPs. To achieve ten million IOPs on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability is born from. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for a good reason: you must have a grasp on the other bits before you can start considering everything else, otherwise you’re just doing point-solutions, and eventually just doing point-solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability is released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

Jan 132017
 

Introduction

There’s something slightly deceptive about the title for my blog post. Did you spot it?

It’s: vs. It’s a common mistake to think that Cloud Boost and Cloud Tier compete with one another. That’s like suggesting a Winnebago and a hatchback compete with each other. Yes, they both can have one or more people riding in them and they can both be used to get you around, but the actual purpose of each is typically quite different.

It’s the same story when you look at Cloud Boost and Cloud Tier. Of course, both can move data from A to B. But the reason behind each, the purpose for each is quite different. (Does that mean there’s no overlap? Not necessarily. If you need to go on a 500km holiday and sleep in the car, you can do that in a hatchback or a Winnebago, too. You can often get X to do Y even if it wasn’t built with that in mind.)

So let’s examine them, and look at their workflows as well as a few usage examples.

Cloud Boost

First off, let’s consider Cloud Boost. Version 1 was released in 2014, and since then development has continued to the point where CloudBoost now looks like the following:

CloudBoost Workflow

Cloud Boost Workflow

Cloud Boost exists to allow NetWorker (or NetBackup or Avamar) to write deduplicated data out to cloud object storage, regardless of whether that’s on-premises* in something like ECS, or writing out to a public cloud’s object storage system, like Virtustream Storage or Amazon S3. When Cloud Boost was first introduced back in 2014, the Cloud Boost appliance was also a storage node and data had to be cloned from another device to the Cloud Boost storage node, which would push data out to object. Fast forward a couple of years, and with Cloud Boost 2.1 introduced in the second half of 2016, we’re now at the point where there’s a Cloud Boost API sitting in NetWorker clients allowing full distributed data processing, with each client talking directly to the object storage – the Cloud Boost appliance now just facilitates the connection.

In the Cloud Boost model, regardless of whether we’re backing up in a local datacentre and pushing to object, or whether all the systems involved in the backup process are sitting in public cloud, the actual backup data never lands on conventional block storage – after it is deduplicated, compressed and encrypted it lands first and only in object storage.

Cloud Tier

Cloud Tier is new functionality released in the Data Domain product range – it became available with Data Domain OS v6, released in the second half of 2016. The workflow for Cloud Tier looks like the following:

CloudTier Workflow

CloudTier Workflow

Data migration with Cloud Tier is handled as a function of the Data Domain operating system (or controlled by a fully integrated application such as NetWorker or Avamar); the general policy process is that once data has reached a certain age on the Active Tier of the Data Domain, it is migrated to the Cloud Tier without any need for administrator or user involvement.

The key for the differences – and the different use cases – between Cloud Boost and Cloud Tier is in the above sentence: “once data has reached a certain age on the Active Tier”. In this we’re reminded of the primary use case for Cloud Tier – supporting Long Term Retention (LTR) in a highly economical format and bypassing any need for tape within an environment. (Of course, the other easy differentiator is that Cloud Tier is a Data Domain feature – depending on your environment that may form part of the decision process.)

Example use cases

To get a feel for the differences in where you might deploy Cloud Boost or Cloud Tier, I’ve drawn up a few use cases below.

Cloning to Cloud

You currently backup to disk (Data Domain or AFTD) within your environment, and have been cloning to tape. You want to ensure you’ve got a second copy of your data, and you want to keep that data off-site. Instead of using tape, you want to use Cloud object storage.

In this scenario, you might look at replacing your tape library with a Cloud Boost system instead. You’d backup to your local protection storage, then when it’s time to generate your secondary copy, you’d clone to your Cloud Boost device which would push the data (compressed, deduplicated and encrypted) up into object storage. At a high level, that might result in a workflow such as the following:

CloudBoost Clone To Cloud

CloudBoost Clone To Cloud

Backing up to the Cloud

You’re currently backing up locally within your datacentre, but you want to remove all local backup targets.  In this scenario, you might replace your local backup storage with a Cloud Boost appliance, connected to an object store, and backup via Cloud Boost (via client direct), landing data immediately off-premises and into object storage at a cloud provider (public or hosted).

At a high level, the workflow for this resembles the following:

CloudBoost Backup to Cloud

CloudBoost Backup to Cloud

Backing up in Cloud

You’ve got some IaaS systems sitting in the Cloud already. File, web and database servers sitting in say, Amazon, and you need to ensure you can protect the data they’re hosting. You want greater control than say, Amazon snapshots, and since you’re using a NetWorker Capacity license or a DPS capacity license, you know you can just spin up another NetWorker server without an issue – sitting in the cloud itself.

In that case, you’d spin up not only the NetWorker server but a Cloud Boost appliance as well – after all, Amazon love NetWorker + Cloud Boost:

“The availability of Dell EMC NetWorker with CloudBoost on AWS is a particularly exciting announcement for all of the customers who have come to depend on Dell EMC solutions for data protection in their on-premises environments,” said Bill Vass, Vice President, Technology, Amazon Web Services, Inc. “Now these customers can get the same data protection experience on AWS, providing seamless operational backup and recovery, and long-term retention across all of their environments.”

That’ll deliver the NetWorker functionality you’ve come to use on a daily basis, but in the Cloud and writing directly to object storage.

The high level view of the backup workflow here is effectively the same as the original diagram used to introduce Cloud Boost.

Replacing Tape for Long Term Retention

You’ve got a Data Domain in each datacentre; the backups at each site go to the local Data Domain then using Clone Controlled Replication are copied to the other Data Domain as soon as each saveset finishes. You’d like to replace tape for your long term retention, but since you’re protecting a lot of data, you want to push data you rarely need to recover from (say, older than 2 months) out to object storage. When you do need to recover that data, you want to absolutely minimise the amount of data that needs to be retrieved from the Cloud.

This is a definite Cloud Tier solution. Cloud Tier can be used to automatically extend the Data Domain storage, providing a storage tier for long term retention data that’s very cheap and highly reliable. Cloud Tier can be configured to automatically migrate data older than 2 months out to object storage, and the great thing is, it can do it automatically for anything written to the Data Domain. So if you’ve got some databases using DDBoost for Enterprise Apps writing directly, you can setup migration policies for them, too. Best of all, when you do need to recall data from Cloud Tier, Boost for Enterprise Apps and NetWorker can handle that recall process automatically for you, and the Data Domain only ever recalls the delta between deduplicated data already sitting on the active tier and what’s out in the Cloud.

The high level view of the workflow for this use case will resemble the following:

Cloud Tier to LTR NSR+DDBEA

Cloud Tier to LTR for NetWorker and DDBEA

…Actually, you hear there’s an Isilon being purchased and the storage team are thinking about using Cloud Pools to tier really old data out to object storage. Your team and the storage team get to talking and decide that by pooling the protection and storage budget, you get Isilon, Cloud Tier and ECS, providing oodles of cheap object storage on-site at a fraction of the cost of a public cloud, and with none of the egress costs or cloud vendor lock-in.

Wrapping Up

Cloud Tier and Cloud Boost are both able to push data into object storage, but they don’t have exactly the same use cases. There’s good, clear reasons why you would work with one in particular, and hopefully the explanation and examples above has helped to set the scene on their use cases.


* Note, ‘on-premise’ would mean ‘on my argument’. The correct term is ‘on-premises’ 🙂

Why do I need eCDM?

 Architecture  Comments Off on Why do I need eCDM?
Jan 052017
 

Why do I need eCDM? Your first question to me might be what is eCDM? Well, that’s a fair point – it’s a relatively new term, and it’s also an eTLA – an extended Three Letter Acronym. eCDM refers to enterprise copy data management. Now this isn’t necessarily a backup topic, but backup is one of the components that fit into copy data, so it’s a sideline topic and one which will get increasing attention in the coming years.

You see, data growth continues to increase, but the primary data set, the original dataset is typically growing at a fairly straight-forward rate. It’s definitely growing, but not to the same degree as the occupancy by copies of the data. The numbers as such (timescale or size) don’t really matter, but for a lot of businesses, the disparity between original dataset and copies of the dataset if mapped will look something along the lines of the following:

original size vs copies

Your mileage depending on what your original data is, its RPOs and RTOs, and what you use it for will vary of course, and that’ll have an impact on the number of copies you keep. But that sort of comparison will look similar to a lot of people when they sit down to think about it. In fact, when the topic of eCDM first started coming up in (at the time) EMC, a colleague and I got to talking about our experiences with it. I related a few of my experiences from my prior life as a Unix system administrator, and my colleague mentioned in a previous job they’d actually done a thought experiment to calculate the average number of copies being retained for databases in their environment. The number? 26. And his first thought when he said that number was “…it was probably too low”.

If you’re not sure about this, let’s run our own experiment, and start with a 1TB database.

Original Copy

We all know though, right, that 1TB isn’t 1TB isn’t 1TB when it comes to enterprise storage? For a start, that’s not going to occupy 1TB of disk storage – there’s going to be RAID involved. (I’m not going to bother with RAID here, that’s an overhead no matter what.)

For important systems within our environments requiring fast recovery times with minimised data loss, we’ll be looking at maintaining snapshots. Now, these days snapshots are usually logical copies rather than complete 1:1 copies, but there’s still a data overhead as changes occur, and there’s still a management cost (even if only at a human level) for the snapshots. So our 1TB database becomes something along the following lines:

Original Copy + Snapshots

But we don’t just keep a local instance of important data – it gets replicated:

Original + Snapshots + Replica

And of course, when we’re replicating, we typically need to keep snapshots in the replica storage as well to ensure a replication of corruption doesn’t destroy our recoverability in a disaster:

original + snapshots + replica + replsnaps

Of course, so far we’ve just been looking at primary copies, so we also need to think about having a backup copy:

primary snap repl replsnap backup

But we know a single backup isn’t much of a backup, so we’re going to be keeping multiple backups:

primary snap repl replsnap backup buretent

And of course, a backup should never be a single point of failure within an environment, so that’s more likely going to look like the following:

primary snap repl replsnap backup clone

They’re just our regular retention backups though. Most companies, whether we’d encourage them otherwise or not, will use the backup environment to maintain long term retention (LTR) copies for compliance purposes, so we need to consider those copies as well:

primary through to LTR

And again, LTR backups shouldn’t be a single source of failure either, so:

primary through to LTR clone

Phew! Regardless of whether you’re deduplicating, regardless of whether your snapshots are copy-on-first-write or some other space saving technique, that’s a lot of copies being built up, and as changes occur, that’s still a growing increase in the amount of primary and protection storage required for just 1 TB of original data.

Yet we’re not done. For businesses facing new threats, there’ll potentially be copies of the database sitting in an isolated recovery site, too:

primary through to IRS

Even more copies. Admittedly your IRS copies shouldn’t be manageable via conventional means, but they’re there, and if you’ve got to manage them on top of all the other copies, you’re compounding your work again.

Yet we’re still not done. So far I’ve covered an indicative set of copies for the original use case of the original data. That’s not the end of it!

primary through to IRS and other uses

So let’s return to the original question: why do I need eCDM? By now that should be answered: why don’t you need eCDM!? You’ve got copies … and copies and copies and copies … sitting in your environment, occupying primary storage, replication storage, protection storage, and in time even possibly in cloud storage too.

This is where eCDM comes in – having a system that can give you focus on the copies of data being generated in your environment. In version 1 of eCDM, we’ve focused on giving power over ProtectPoint related copies, but that’s a starting point. There’s more to follow over time, of course.

To quote the eCDM product page, eCDM:

  • Discovers copies non-disruptively across the enterprise for global oversight
  • Automates SLO compliance and efficient copy creation
  • Optimizes IT operations based on actionable insight

eCDM is a new tier of data management to help organisations deal with copy data, and if you sit down and think about the number of copies you might have (particularly of critical data sets in your organisation), you’ll probably find yourself wondering why you don’t have it installed already.

My cup runneth over

 Architecture, Backup theory, Best Practice  Comments Off on My cup runneth over
Nov 242016
 

How do you handle data protection storage capacity?

How do you handle growth – regular or unexpected – in your data protection volumes?


Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!


Is your business reactive or proactive to data protection capacity requirements?

Glass

In the land of tape, dealing with capacity growth in data protection was both easy and insidiously obfuscated. Tape capacity management is basically a modern version of Hilbert’s Infinite Hotel Paradox – you sort-of, kind-of never run out of capacity because you always just buy another box of tapes. Problem solved, right? (No, more a case of the can kicked down the road.) Problem “solved” and you’ve got 1,000, 10,000, 50,000 tapes in a multitude of media types that you don’t even have tape drives to read any more.

Yet we like to focus on the real world now and tape isn’t the defacto standard any more for backup systems: it’s disk. Disk gives us great power, but with great power comes great responsibility (sorry, even though I’m not a Spiderman fan, I couldn’t resist. Tape is the opposite: tape gives us no power, and with no power comes no responsibility – yes, I’m also a Kickass fan.)

For businesses that still do disk-to-disk-to-tape, where disk is treated more like a staging area and excess data is written out to tape, the problem is seemingly solved because – you guessed it – you can always just buy another box of tapes and stage more data from disk backup storage to tape storage. Again, that’s kicking the can down the road. I’ve known businesses who have had company-wide data protection policies mandating up to 3 months of online recoverability from disk getting down to two weeks or less of data stored on disk because the data to be protected has continued to grow, no scaling has been done on the storage, and – you guessed it – tape was the interim solution.

Aside: When I first joined my first Unix system administration team in 1996, my team had just recently configuring an interim DNS server which they called tmp, because it was going to be quickly replaced by another server, which for the short term was called nc for new computer. When I left in 2000, tmp and nc were still there; in fact, nnc (yes, new-new-computer) was deployed shortly thereafter to replace nc and eventually, a year or two after I left, tmp was finally decommissioned.

Interim solutions have a tendency to stick. In fact, it’s a common story – capacity problem with data protection so let’s deploy an interim solution and solve it later. Later-later. Much later. Much later-later. Ad-infinitum.

There is, undoubtedly, a growing maturity in handling data protection storage management and capacity planning coming out of the pure disk and disk/cloud storage formats. While this is driven by necessity, it’s also and important demonstration that IT processes need to mature as the business matures as well.

If you’re new to pure disk based, or disk/cloud based data protection storage, you might want to stop and think carefully about your data protection policies and procurement processes/cycles so that you’re able to properly meet the requirements of the business. Here are a few tips I’ve learnt over the years…

80% is the new 100%

This one is easy. Don’t think of 100% capacity as being 100% capacity. Think of 80% as 100%. Why? Because you need runway to either procure storage, migrate data or get formal approval for changes to retention and backup policies. If you wait until you’re at 90, 95 or even 100% capacity, you’ve left your run too late and you’re just asking for many late or sleepless nights managing a challenge that could have been proactively dealt with earlier.

The key to management is measurement

I firmly believe you can’t manage something that has operational capacity restraints (e.g., “if we hit 100% capacity we can’t do more backups”) if you’re not actively measuring it. That doesn’t mean periodically logging into a console or running a “df -h” or whatever the “at a glance” look is for your data protection storage, it means capturing measurement data and having it available in both reports and dashboards so it is instantly visible.

The key to measurement is trending

You can capture all the data in the world and make it available in a dashboard, but if you don’t perform appropriate localised trending against that data to analyse it, you’re making your own good self the bottleneck (and weakest link) in the capacity management equation. You need to have trends produced as part of your reporting processes to understand how capacity is changing over time. These trends should be reflective of your own seasonal data variations or sampled over multiple time periods. Why? Well, if you have disk based data protection storage in your environment and do a linear forecast on capacity utilisation from day one, you’ll likely get a smoothing based on lower figures from earlier in the system lifecycle that could actually obfuscate more recent results. So you want to capture and trend that information for comparison, but you equally want to capture and trend shorter timeframes to ensure you have an understanding of shorter term changes. Trends based on the last six and three months usage profiles can be very useful in identifying what sort of capacity management challenges you’ve got based on short term changes in data usage profiles – a few systems for instance might be considerably spiking in utilisation, and if you’re still comparing against a 3-year timeframe dataset or something along those lines, the more recent profile may not be accurately represented in forecasts.

In short: measuring over multiple periods gives you the best accuracy.

Maximum is the new minimum

Linear forecasts of trending information are good if you’re just slowly, continually increasing your storage requirements. But if you’re either staging data (disk as staging) or running garbage collection (e.g., deduplication), it’s quite possible to get increasing sawtooth cycles in capacity utilisation on your data protection storage. And guess what? It doesn’t matter if your capacity requirements for the average utilisation are met if you’ll grow beyond the capacity requirements of the day before the oldest backups are deleted or garbage collection takes place. So make sure when you’re trending you’re looking at how you meet the changing maximum peaks, not the average sizes.

Know your windows

There’s three types of windows I’m referring to here – change, change freeze, and procurement.

You need to know them all intimately.

You’re at 95% capacity but you anticipated this and additional data protection storage has just arrived in your datacentre’s receiving bay, so you should be right to install it – right? What happens if you then have a week’s wait to have the change board consider your request for an outage – or datacentre access – to install the extra capacity? Will you be able to hold on that long? That’s knowing your change windows.

You know you’re going to run out of capacity in two months time if nothing is done, so you order additional data protection storage and it arrives on December 20. The only problem is a mandatory company change blackout started on December 19 and you literally cannot install anything, until January 20. Do you have enough capacity to survive? That’s knowing your freeze windows.

You know you’re at 80% capacity today and based on the trends you’ll be at 90% capacity in 3 weeks and 95% capacity in 4 weeks. How long does it take to get a purchase order approved? How long does it take the additionally purchased systems to arrive on-site? If it takes you 4 weeks to get purchase approval and another 3 weeks for it to arrive after the purchase order is sent, maybe 70%, not 80%, is your new 100%. That’s knowing your procurement windows.

Final thoughts

I want to stress – this isn’t a doom and gloom article, even if it seems I’m painting a blunt picture. What I’ve described above is expert tips – not from myself, but from my customers, and customers of colleagues and friends, whom I’ve seen manage data protection storage capacity well. If you follow at least the above guidelines, you’re going to have a far more successful – and more relaxed – time of it all.

And maybe you’ll get to spend Thanksgiving, Christmas, Ramadan, Maha Shivaratri, Summer Solstice, Melbourne Cup Day, Labour Day or whatever your local holidays and festivals are with your friends and families, rather than manually managing an otherwise completely manageable situation.


Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!


 

Nov 162016
 

Years ago when NetWorker Management Console was first introduced, Australians (and no doubt people in other countries with a similarly named tax law) found themselves either amused or annoyed having to type commands such as:

# /etc/init.d/gst start

Who would want to start a goods and services tax, after all? In the case of NetWorker, GST didn’t stand for a tax on purchases, but the master control software for NMC.

It’s amusing then to be back in the realm of using an overloaded three letter acronym which for many (in this case) US citizens refers to the tax-man – IRS. In this case though, IRS stands for Isolated Recovery Site.

Our view of ‘disaster recovery’ situations by and large hasn’t changed much over the two decades I’ve been working in IT. While we’ve moved from active/passive datacentres to active/active datacentres as being the norm, the types of situations that might lead to invoking disaster recovery and transitioning services from one location to another have remained largely the same, such as:

  • Site loss
  • Site access loss
  • Catastrophic hardware failure
  • Disaster recovery testing

In fact, they’re pretty much the key four reasons why we need to invoke DR – either granularly or for an entire datacentre.

The concept of an IRS is not to provide assistance in any of the above four situations. (In theory it could be utilised partly for any of the above, in practice that’s what your normal disaster recovery datacentre is about, regardless of whether it’s in an active/active or active/passive configuration with your primary.)

Hactivism

Deploying an IRS solution within your environment is about protecting you from modern threat vectors. It represents a business maturity that accepts any, many or all of the following scenarios:

  • Users not understanding what they’re doing represent a threat vector that can no longer be casually protected against by using anti-virus software and firewalls
  • Administrators can make mistakes – not just little ‘boo-boos’, but catastrophic mistakes
  • On-platform protection should only form part of a holistic data protection environment
  • It is no longer a case of keeping malicious individuals out of your IT infrastructure, but also recognising they may already be inside
  • Protests are no longer confined to letter writing campaigns, boycotts and demonstrations

Before I explain some of those situations, it would be helpful to provide a high level overview of what one kind of IRS layout might look like:

Basic High Level IRS

The key things to understand in an IRS configuration such as the above are:

  • Your tertiary data copy (the IRS copy) is not, in the conventional sense of the word, connected to your network
  • You either use physical network separation (with periodic plugging of cables in) or automated control of network separation, with control accessible only within the IRS bunker
  • The Data Domain in your IRS bunker will optimally be configured with governance and retention lock
  • Your primary backup environment will not be aware of the tertiary Data Domain

IRS is not for traditional Business As Usual (BAU) or disaster recovery. You will still run those standard recovery operations out of your primary and/or disaster recovery sites in the same way as you would normally.

So what are some of the examples where you might resort to an IRS copy?

  • Tired/or disgruntled admin triggers deletion of primary and DR storage, including snapshots
  • Ransomware infects a primary file server, encrypting data and flooding the snapshot pool to the point the system can’t be recovered from
  • Hactivists penetrate the network and prior to deleting production system data, delete backup data.

These aren’t ‘example’ use cases, they’ve happened. In the first two if you’re using off-platform protection, you’re probably safe – but if you’re not, you’ve lost data. In the third example, there have been several examples over the last few years where this penetration has successfully been carried out by hactivists.

Maybe you feel your environment is not of interest from hactivists. If you work in the finance industry, you’re wrong. If you work in government, you’re wrong. OK, maybe you don’t work in either of those areas.

With the increasing availability of tools and broader surface area for malicious individuals or groups to strike with, hactivism isn’t limited to just the ‘conventional’ high profile industry verticals. Maybe you’re a pharmaceutical company that purchased the patent on a cheap drug then enraged communities by increasing prices by 400 times. Maybe you’re a theatre chain showing a movie a certain group has taken significant offence at. Maybe you’re a retail company selling products containing palm oil, or toilet paper not sourced from environmentally sustainable forests. Maybe you’re an energy company. Maybe you’re a company doing a really good job but have a few ex-employees with an axe to grind. If you’ve ever read an online forum thread, you’ll probably recognise that some people are trollish enough to do it just for the fun of it.

Gone are the days where you worried about hactivism if you happened to be running a nuclear enrichment programme.

IRS is about protecting you from those sorts of scenarios. By keeping at least a core of your critical data on a tertiary, locked down Data Domain that’s not accessible via the network, you’re not only leveraging the industry leading Data Invulnerability Architecture (DIA) to ensure what’s written is what’s read, you’re also ensuring that tertiary copy is off platform to the rest of your environment.

And the great thing is, products like NetWorker are basically designed from the ground up to be used in an IRS configuration. NetWorker’s long and rich history of command automation means you can build into that Control & Verification service area whatever you need to take read-write snapshots of replicated data, DR an isolated NetWorker server and perform automated test recoveries.

One last point – something I’ve discussed with a few customers recently – you might be having an ahah! moment and point to a box of tapes somewhere and say “There’s my IRS solution!” I can answer that with one simple question: If you went to your business and said you could scrap a disaster recovery site and instead rely on tape to perform all the required recoveries, what would they say? Tape isn’t an IRS option except perhaps for the most lackadaisical data protection environments. (I’d suggest it’d even be an Icarus IRS solution – trusting that wax won’t melt when you fly your business too close to the sun.)

There’s some coverage of IRS in my upcoming book, Data Protection: Ensuring Data Availability, and of course, you can read up on Dell EMC’s IRS offerings too. A good starting point is this solution overview. If you’re in IT – Infrastructure or Security – have a chat to your risk officers and ask them what they think about those sorts of challenges outlined above. Chances are they’re already worried about them, and you could very well be bringing them the solution that’ll let everyone sleep easily at night. You might one day find yourself saying “I love the IRS”.