Basics – Understanding NetWorker Dependency Tracking

Backup theory, NetWorker
Sep 16, 2017

Dependency tracking is an absolutely essential feature within a backup product. It’s there to ensure you can recover data through the entire specified retention period for your backups, regardless of what mix of full, differential and/or incremental backups you do. It’s staggering to think there are some backup products out there (*cough* net *cough* ‘backup’) that treat backup retention with such contempt that they don’t bother to enforce dependency preservation.

Without dependency tracking, you’ve always got the risk that a recovery you want to do on the edge of your specified retention period might fail.

NetWorker does dependency tracking by default. In fact, it only does dependency tracking. To understand how dependency tracking works, and what that means for protecting your backups, check out my video below. (Make sure to switch it into High Definition – it’s not about being able to see more of my beard, but it is to make sure you can see all the screen content!)
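
To make the idea concrete, here’s a minimal sketch in Python – not NetWorker’s actual logic, just an illustration of the principle – showing why a full backup can’t simply be recycled the moment it passes its retention date: incrementals that are still within retention depend on it, so dependency tracking holds the whole chain together.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Backup:
    run_date: date
    level: str                          # "full" or "incr"
    depends_on: Backup | None = None

def expirable(candidate: Backup, chain: list[Backup], retention: timedelta, today: date) -> bool:
    """A backup may only expire once it's past retention AND nothing still
    inside the retention window depends on it, directly or indirectly."""
    if today - candidate.run_date < retention:
        return False
    for other in chain:
        dependency = other.depends_on
        while dependency is not None:
            if dependency is candidate and today - other.run_date < retention:
                return False            # a newer backup still needs this one
            dependency = dependency.depends_on
    return True

# A week of backups: Sunday full, then daily incrementals.
full = Backup(date(2017, 9, 3), "full")
chain = [full]
for offset in range(1, 7):
    chain.append(Backup(date(2017, 9, 3) + timedelta(days=offset), "incr", chain[-1]))

# The full is already older than the 5-day retention, but it is held because
# incrementals inside the retention window still depend on it.
print(expirable(full, chain, retention=timedelta(days=5), today=date(2017, 9, 9)))   # False
```

That’s (roughly) why a full can appear to outlive its nominal retention setting: something newer still needs it in order to remain recoverable.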


Dependency tracking is such an important feature in data protection that you’ll find it’s also covered in my book, Data Protection: Ensuring Data Availability.


On another note, I’m starting a new project. I may work in IT, but I’ve always been a fan of philosophy, too. The new project is called Fools Rush In, and it’s going to be an ongoing weekly exploration of topics relating to ethics in IT and modern technology. It’s going to be long-form in its approach – the perfect thing to sit down and read over a cup of coffee or tea. This’ll be an exciting journey, and I’d love it if you joined me on it. The introductory article is …where angels fear to tread, and the latest post, What is Ethics? gives a bit of a primer on schools of ethical thought and how we can start approaching ethics in IT/technology.

Talking about Ransomware

Architecture, Backup theory, General thoughts, Recovery, Security
Sep 6, 2017

The “WannaCry” ransomware strike saw a particularly large number of systems infected and garnered a great deal of media attention.

Ransomware Image

As you’d expect, many companies discussed ransomware and their solutions for it. There was also backlash from many quarters suggesting people were using a ransomware attack to unethically spruik their solutions. It almost seems to be the IT equivalent of calling lawyers “ambulance chasers”.

We are (albeit briefly, I am sure) between major ransomware outbreaks. So, logically, that means it’s OK to talk about ransomware.

Now, there are a few things to note about ransomware and defending against it. It’s not as simple as “I only have to do X and I’ll solve the problem”. It’s a multi-layered issue requiring user education, appropriate systems patching, appropriate security, appropriate data protection, and so on.

Focusing even on data protection, that’s a multi-layered approach as well. In order to have a data protection environment that can assuredly protect you from ransomware, you need to do the basics, such as operating system level protection for backup servers, storage nodes, etc. That’s just the beginning. The next step is making sure your backup environment itself follows appropriate security protocols. That’s something I’ve been banging on about for several years now. That’s not the full picture though. Once you’ve got operating systems and backup systems secured via best practices, you then need to look at hardening your backup environment. There’s a difference between standard security processes and hardened security processes, and if you’re worried about ransomware this is something you should be thinking about doing. Then, of course, if you really want to ensure you can recover your most critical data from a serious hacktivism and ransomware (or outright data destruction) breach, you need to look at IRS as well.

But let’s step back, because I think it’s important to make a point here about when we can talk about ransomware.

I’ve worked in data protection my entire professional career. (Even when I was a system administrator for the first four years of it, I was the primary backup administrator as well. It’s always been a focus.)

If there’s one thing I’ve observed in my career in data protection, it’s that a “head in the sand” approach to data loss risk is lamentably common. Even in 2017 I’m still hearing things like “We can’t back this environment up because the project which spun it up didn’t budget for backup”, and “We’ll worry about backup later”. Not to mention the old chestnut, “it’s out of warranty so we’ll do an Icarus support contract”.

Now the flipside of the above paragraph is this: if things go wrong in any of those situations, suddenly there’s a very real interest in talking about options to prevent a future issue.

It may be a career limiting move to say this, but I’m not in sales to make sales. I’m in sales to positively change things for my customers. I want to help customers resolve problems, and deliver better outcomes to their users. I’ve been doing data protection for over 20 years. The only reason someone stays in data protection that long is because they’re passionate about it, and the reason we’re passionate about it is because we are fundamentally averse to data loss.

So why do we want to talk about defending against or recovering from ransomware during a ransomware outbreak? It’s simple. At the point of a ransomware outbreak, there’s a few things we can be sure of:

  • Business attention is focused on ransomware
  • People are talking about ransomware
  • People are being directly impacted by ransomware

This isn’t ambulance chasing. This is about making the best of a bad situation – I don’t want businesses to lose data, or have it encrypted and see them have to pay a ransom to get it back – but if they are in that situation, I want them to know there are techniques and options to prevent it from striking them again. And at that point in time – during a ransomware attack – people are interested in understanding how to stop it from happening again.

Now, we still have to be considerate in how we discuss such situations. That’s a given. But it doesn’t mean the discussion can’t be had.

To me this is also an ethical consideration. Too often the focus on ethics in professional IT is around the basics: don’t break the law (note: law ≠ ethics), don’t be sexist, don’t be discriminatory, etc. That’s not really a focus on ethics, but a focus on professional conduct. Focusing on professional conduct is good, but there must also be a focus on the ethical obligations of protecting data. It’s my belief that if we fail to make the best of a bad situation to get an important message of data protection across, we’re failing our ethical obligations as data protection professionals.

Of course, in an ideal world, we’d never need to discuss how to mitigate or recover from a ransomware outbreak during said outbreak, because everyone would already be protected. But harking back to an earlier point, I’m still being told production systems were installed without consideration for data protection, so I think we’re a long way from that point.

So I’ll keep talking about protecting data from all sorts of loss situations, including ransomware, and I’ll keep having those discussions before, during and after ransomware outbreaks. That’s my job, and that’s my passion: data protection. It’s not gloating, it’s not ambulance chasing – it’s “let’s make sure this doesn’t happen again”.


On another note, sales are really great for my book, Data Protection: Ensuring Data Availability, released earlier this year. I have to admit, I may have squealed a little when I got my first royalty statement. So, if you’ve already purchased my book: you have my sincere thanks. If you’ve not, that means you’re missing out on an epic story of protecting data in the face of amazing odds. So check it out, it’s in eBook or Paperback format on Amazon (prior link), or if you’d prefer to, you can buy direct from the publisher. And thanks again for being such an awesome reader.

Mar 10, 2017

In 2008 I published “Enterprise Systems Backup and Recovery: A corporate insurance policy”. It dealt pretty much exclusively, as you might imagine, with backup and recovery concepts. Other activities like snapshots, replication, etc., were outside the scope of the book. Snapshots, as I recall, were mainly covered as an appendix item.

Fast forward almost a decade and there’s a new book on the marketplace, “Data Protection: Ensuring Data Availability” by yours truly, and it is not just focused on backup and recovery. There’s snapshots, replication, continuous data protection, archive, etc., all covered. Any reader of my blogs will know though that I don’t just think of the technology: there are the business aspects to it as well – the process, training and people side of the equation. There were two other titles I bandied about: “Backup is dead, long live backup”, and “Icarus Fell: Understanding risk in the modern IT environment”.

You might be wondering why in 2017 there’s a need for a book dedicated to data protection.

Puzzle Pieces

We’ve come a long way in data protection, but we’re now actually teetering on an interesting precipice, one which we need to understand and manage very carefully. In fact, one which has resulted in significant data loss situations for many companies world-wide.

IT has shifted from the datacentre to – well, anywhere. There’s still a strong datacentre focus. The estimates from various industry analysts are that around 70% of IT infrastructure spend is still based in the datacentre. That number is shrinking, but IT infrastructure is not; instead, it’s morphing. ‘Shadow IT’ is becoming more popular – business units going off on their own and deploying systems without necessarily talking to their IT departments. To be fair, Shadow IT always existed – it’s just that back in the 90s and early 00s, it required the business units to actually buy the equipment. Now they just need to provide a credit card to a cloud provider.

Businesses are also starting to divest themselves of IT activities that aren’t their “bread and butter”, so to speak. A financial company or a hospital doesn’t make money from running an email system, so they outsource that email – and increasingly it’s to someone like Microsoft via Office 365.

Simply put, IT has become significantly more commoditised, accessible and abstracted over the past decade. All of this is good for the business, except it brings the business closer to that precipice I mentioned before.

What precipice? Risk. We’re going from datacentres where we don’t lose data – because we’re deploying on highly resilient systems with 5 x 9s availability, robust layers of data protection and formal processes – into situations where data is pushed out of the datacentre, out of the protection of the business. The old adage, “never assume, you make an ass out of u and me”, is finding new ground in this modern approach to IT. Business groups trying to do a little data analytics rent a database at an hourly rate from a cloud provider and find good results, so they start using it more and more – but they don’t think about data protection, because they’ve never had to before. That led to things like the devastating data losses encountered by MongoDB users. Startups with higher level IT ideas are offering services without any understanding of the fundamental requirements of infrastructure protection. Businesses daily are finding that because they’ve spread their data over such a broad area, the attack surface has increased staggeringly, and hackers are turning that into a profitable business.

So returning to one of my first comments … you might be wondering why in 2017 there’s a need for a book dedicated to data protection? It’s simple: the requirement for data protection never goes away, regardless of whose infrastructure you’re using, or where your data resides. IT is standing on the brink of a significant evolution in how services are offered and consumed, and in so many situations it’s like a return to the early 90s. “Oh yeah, we bought a new server for a new project, it’s gone live. Does anyone know how we back it up?” It’s a new generation of IT and business users that need to be educated about data protection. Business is also demanding a return on investment for as much IT spend as possible, and that means data protection also needs to evolve to offer something back to the business other than saving you when the chips are down.

That’s why I’ve got a new book out about data protection: because the problem has not gone away. IT has evolved, but so has risk. That means data protection technology, data protection processes, and the way that we talk about data protection has to evolve as well. Otherwise we, as IT professionals, have failed in our professional duties.

I’m a passionate believer that we can always find a way to protect data. We think of it as business data, but it’s also user data. Customer data. If you work in IT for an airline it’s not just a flight bookings database you’re protecting, but the travel plans, the holiday plans, the emergency trips to sick relatives or getting to a meeting on time that you’re protecting, too. If you work in IT at a university, you’re not just protecting details that can be used for student billing, but also the future hopes and dreams of every student to pass through.

Let’s be passionate about data protection together. Let’s have that conversation with the business and help them understand how data protection doesn’t go away just because infrastructure is evolving. Let’s help the business understand that data protection isn’t a budget sink-hole, but it can improve processes and deliver real returns to the business. Let’s make sure that data, no matter where it is, is adequately protected and we can avoid that precipice.

“Data Protection: Ensuring Data Availability” is available now from a variety of sellers, including my publisher and Amazon. Come on a journey with me and discover why backup is dead, long live backup.

Build vs Buy

Architecture, Backup theory, Best Practice
Feb 18, 2017

Converged, and even more so, hyperconverged computing, is all premised around the notion of build vs buy. Are you better off having your IT staff build your infrastructure from the ground up, managing it in silos of teams, or do you want to buy tightly integrated kit, land it on the floor and start using it immediately?

Dell-EMC’s team use the analogy – do you build your car, or do you buy it? I think this is a good analogy: it speaks to how the vast majority of car users consume vehicle technology. They buy a complete, engineered car as a package and drive it off the sales lot ready to go. Sure, there are tinkerers who might like to build a car from scratch, but they’re not the average consumer. For me it’s a bit like personal computing – I gave up years ago wanting to build my own computers. I’m not interested in buying CPUs, RAM, motherboards, power supplies, etc., dealing with the landmines of compatibility, drivers and physical installation before I can get a usable piece of equipment.

This is where many people believe IT is moving, and there’s some common sense in it – it’s about time to usefulness.

A question I’m periodically posed is – what has backup got to do with the build vs buy aspect of hyperconverged? For one, it’s not just backup – it’s data protection – but secondly, it has everything to do with hyperconverged.

If we return to that build vs buy example – would you build a car or buy a car? – let me ask a question of you as a car consumer, a buyer rather than a builder of a car: would you get airbags included, or would you search around for third party airbags?

Airbags

To be honest, I’m not aware of anyone who buys a car, drives it off the lot, and starts thinking, “Do I go to Airbags R Us, or Art’s Airbag Emporium to get my protection?”

That’s because the airbags come built-in.

For me at least, that’s the crux of the matter in the converged and hyper-converged market. Do you want third party airbags that you have to install and configure yourself, and hope they work with that integrated solution you’ve bought, or do you want airbags included and installed as part of the purchase?

You buy a hyperconverged solution because you want integrated virtualisation, integrated storage, integrated configuration, integrated management, integrated compute, integrated networking. Why wouldn’t you also want integrated data protection? Integrated data protection that’s baked into the service catalogue and part of the kit as it lands on your floor. If it’s about time to usefulness it doesn’t stop at the primary data copy – it should also include the protection copies, too.

Airbags shouldn’t be treated as optional, after-market extras, and neither should data protection.

Feb 12, 2017

On January 31, GitLab suffered a significant issue resulting in a data loss situation. In their own words, the replica of their production database was deleted, the production database was then accidentally deleted, then it turned out their backups hadn’t run. They got systems back with snapshots, but not without permanently losing some data. This in itself is an excellent example of the need for multiple data protection strategies; your data protection should not represent a single point of failure within the business, so having layered approaches that cover a variety of retention times, RPOs and RTOs – and allow for the potential of cascading failures – is always critical.

To their credit, they’ve published a comprehensive postmortem and Root Cause Analysis (RCA) of the entire issue (here), and must be applauded for being so open with everything that went wrong – as well as the steps they’re taking to avoid it happening again.

Server on Fire

But I do think some of the statements in the postmortem and RCA require a little more analysis, as they’re indicative of some of the challenges that take place in data protection.

I’m not going to speak to the scenario that led to the production, rather than replica database, being deleted. This falls into the category of “ooh crap” system administration mistakes that sadly, many of us will make in our careers. As the saying goes: accidents happen. (I have literally been in the situation of accidentally deleting a production database rather than its replica, and I can well and truly sympathise with any system or application administrator making that mistake.)

Within GitLab’s RCA under “Problem 2: restoring GitLab.com took over 18 hours”, several statements were made that irk me as a long-term data protection specialist:

Why could we not use the standard backup procedure? – The standard backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6.

As evidenced by a later statement (see the next RCA statement below), the procedure did not fail silently; instead, GitLab chose to filter the output of the backup process in a way that they did not monitor. There is, quite simply, a significant difference between fail silently and silently ignored results. The latter is a far more accurate statement than the former. A command that fails silently is one that exits with no error condition or alert. Instead:

Why did the backup procedure fail silently? – Notifications were sent upon failure, but because of the Emails being rejected there was no indication of failure. The sender was an automated process with no other means to report any errors.

The pg_dump command didn’t fail silently, as previously asserted. It generated output which was silently ignored due to a system configuration error. Yes, a system failed to accept the emails, and a system therefore failed to send the emails, but at the end of the day, a human failed to see or otherwise check as to why the backup reports were not being received. This is actually a critical reason why we need zero error policies – in data protection, no error should be allowed to continue without investigation and rectification, and a change in or lack of reporting or monitoring data for data protection activities must be treated as an error for investigation.
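
As a hedged illustration – this isn’t GitLab’s actual script, and the database name, paths and policy choices below are assumptions for the example – a backup wrapper that honours a zero error policy treats pg_dump’s exit status and stderr as monitoring data in their own right, rather than relying solely on email delivery:

```python
import subprocess
import sys
from datetime import datetime

def run_backup(dbname: str, dump_path: str) -> None:
    """Run a logical PostgreSQL backup and refuse to let it fail (or be ignored) silently:
    a non-zero exit status or anything on stderr is treated as an error."""
    result = subprocess.run(
        ["pg_dump", "--format=custom", f"--file={dump_path}", dbname],
        capture_output=True,
        text=True,
    )
    # Zero error policy: record the outcome somewhere that is actively monitored,
    # so a missing or failed report is itself visible as an error.
    with open("/var/log/db-backup-status.log", "a") as log:
        log.write(f"{datetime.now().isoformat()} {dbname} rc={result.returncode}\n")
        if result.stderr:
            log.write(result.stderr)
    if result.returncode != 0 or result.stderr:
        # Don't just email and hope: exit non-zero so the scheduler flags the job as failed.
        sys.exit(f"pg_dump for {dbname} failed (rc={result.returncode}): {result.stderr.strip()}")

if __name__ == "__main__":
    run_backup("gitlabhq_production", "/backups/gitlabhq_production.dump")
```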

Why were Azure disk snapshots not enabled? – We assumed our other backup procedures were sufficient. Furthermore, restoring these snapshots can take days.

Simple lesson: If you’re going to assume something in data protection, assume it’s not working, not that it is.

Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing the procedure.

There are two sections of the answer that should serve as a dire warning: “there was no ownership”, “nobody was responsible”. This is a mistake many businesses make, but I don’t for a second believe there was no ownership. Instead, there was a failure to understand ownership. Looking at the “Team | GitLab” page, I see:

  • Dmitriy Zaporozhets, “Co-founder, Chief Technical Officer (CTO)”
    • From a technical perspective the buck stops with the CTO. The CTO does own the data protection status for the business from an IT perspective.
  • Sid Sijbrandij, “Co-founder, Chief Executive Officer (CEO)”
    • From a business perspective, the buck stops with the CEO. The CEO does own the data protection status for the business from an operational perspective, not least because the CTO reports directly to them.
  • Bruce Armstrong and Villi Iltchev, “Board of Directors”
    • The Board of Directors is responsible for ensuring the business is running legally, safely and financially securely. They indirectly own all procedures and processes within the business.
  • Stan Hu, “VP of Engineering”
    • Vice-President of Engineering, reporting to the CEO. If the CTO sets the technical direction of the company, an engineering or infrastructure leader is responsible for making sure the company’s IT works correctly. That includes data protection functions.
  • Pablo Carranza, “Production Lead”
    • Reporting to the Infrastructure Director (a position currently open). Data protection is a production function.
  • Infrastructure Director:
    • An open position currently assigned to Sid (see above); the Infrastructure Director is another link in the chain of responsibility and ownership for data protection functions.

I’m not calling these people out to shame them, or rub salt into their wounds – mistakes happen. But I am suggesting GitLab has abnegated its collective responsibility by simply suggesting “there was no ownership”, when in fact, as evidenced by their “Team” page, there was. In fact, there was plenty of ownership, but it was clearly not appropriately understood along the technical lines of the business, and indeed right up into the senior operational lines of the business.

You don’t get to say that no-one owned the data protection functions. Only that no-one understood they owned the data protection functions. One day we might stop having these discussions. But clearly not today.

 

Jan 24, 2017

In 2013 I undertook the endeavour to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand it based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was an increasingly common scenario, it was (for the most part) mainly used as a staging activity (“disk to disk to tape”), and backup to disk used either dumb filesystems or Virtual Tape Libraries (VTLs).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, core tenets of Cloud computing that made it so popular (e.g., agility and scalability) have been well and truly adopted as essential tenets of the modern datacentre, as well. Indeed, for on-premises IT to compete against Cloud, on-premises IT has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to their businesses.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book the pinnacle of storage performance was the 15,000 RPM drive, and flash memory storage was something you (primarily) used in digital cameras only, with storage capacities measured in the hundreds of megabytes more than gigabytes (or now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendancy – with virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as separate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process and needed things done faster, cheaper, more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and for many businesses, profit too. Flash systems are now offering significantly more IOPS than a traditional array could – Dell EMC for instance can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPS. To achieve ten million IOPS on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability is born from. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for a good reason: you must have a grasp on the other bits before you can start considering everything else, otherwise you’re just doing point-solutions, and eventually just doing point-solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability is released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

How many copies do I need?

Architecture, Backup theory, Data loss
May 24, 2016

So you’ve got your primary data stored on one array and it replicates to another array. How many backup copies do you need?

Copies

There’s no doubt we’re spawning more and more copies and pseudo-copies of our data. So much so that EMC’s new Enterprise Copy Data Management (eCDM) product was announced at EMC World. (For details on that, check out Chad’s blog here.)

With many production data sets spawning anywhere between 4 and 10 copies, and sometimes a lot more, a question that gets asked from time to time is: why would I need to duplicate my backups?

It seems a fair question if you’re using array to array replication, but let’s stop for a moment and think about the different types of data protection being applied in this scenario:

Replication without Cloning

Let’s say we’ve got two sites, production and disaster recovery, and for the sake of simplicity, a single SAN at each site. The two SANs replicate between one another. Backups are taken at one of the sites – in this example, the production site. There’s no duplication of the backups.

Replication is definitely a form of data protection, but its primary purpose is to provide a degree of fault tolerance – not true fault tolerance of course (that requires more effort), but the idea is that if the primary array is destroyed, there’s a copy of the data on the secondary array and it can take over production functions. Replication can also factor into maintenance activities – if you need to repair, update or even replace the primary array, you can failover operations to the secondary array, work on the primary, then fail back when you’re ready.

In the world of backups there’s an old saying however: nothing corrupts faster than a mirror. The same applies to replication…

“Ahah!”, some interject at this point, “What if the replication is asynchronous? That means if corruption happens in the source array we can turn off replication between the arrays! Problem solved!”

Over a decade ago I met an IT manager who felt the response to a virus infecting his network would be to have an operator run into the computer room and use an axe to quickly chop all the network connections away from the core switches. That might actually be more successful than relying on noticing corruption ahead of asynchronous replication windows and disconnecting replication links.

So if there’s corruption in the primary array that infects the secondary array – that’s no cause for concern, right? After all there’s a backup copy sitting there waiting and ready to be used. The answer is simple – replication isn’t just for minor types of fault tolerance or being able to switch production during maintenance operations, it’s also for those really bad disasters, such as something taking out your datacentre.

At this point it’s common to ‘solve’ the problem by moving the backups onto the secondary site (even if they run cross-site), creating a configuration like the following:

Replication, cross site backup

The thinking goes like this: if there’s a disaster at the primary site, the disaster recovery site not only takes over, but all our backups are there waiting to be used. If there’s a disaster at the disaster recovery site instead, then no data has been lost because all the data is still sitting on the production array.

Well, in only one very special circumstance: if you only need to keep backups for one day.

Backups typically offer reasonably poor RPO and RTO compared to things like replication, continuous data protection, continuous availability, snapshots, etc. But they do offer historical recoverability often essential to meet compliance requirements. Having to provide a modicum of recoverability for 7 years is practically the default these days – medical organisations typically have to retain data for the life of the patient, engineering companies for the lifespan of the construction, and so on. That’s not all backups of course – depending on your industry you’ll likely generate your long term backups either from your monthlies or your yearlies.

Aside: The use of backups to facilitate long term retention is a discussion that’s been running for the 20 years I’ve been working in data protection, and that will still be going in a decade or more. There are strong, valid arguments for using archive to achieve long term retention, but archive requires a data management policy, something many companies struggle with. Storage got cheap and the perceived cost of doing archive created a strong sense of apathy that we’re still dealing with today. Do I agree with that apathy? No, but I still have to deal with the reality of the situation.

So let’s revisit those failure scenarios again that can happen with off-site backups but no backup duplication:

  • If there’s a disaster at the primary site, the disaster recovery site takes over, and all backups are preserved
  • If there’s a disaster at the secondary site, the primary site is unaffected but the production replica data and all backups are lost: short term operational recovery backups and longer term compliance/legal retention backups
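
A minimal sketch of those two scenarios – illustrative Python only, with each copy modelled as a simple site-tagged label – makes the asymmetry obvious:

```python
# Illustrative only: which copies survive if an entire site is lost?
copies = {
    "production data":     "primary",
    "replica":             "secondary",
    "operational backups": "secondary",
    "compliance backups":  "secondary",   # the monthlies/yearlies kept for years
}

for lost_site in ("primary", "secondary"):
    survivors = [name for name, site in copies.items() if site != lost_site]
    print(f"Lose the {lost_site} site -> left with: {', '.join(survivors)}")

# Lose the primary site   -> the replica and every backup survive
# Lose the secondary site -> only the production copy survives; every backup,
#                            short term and long term, is gone
```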

Is that a risk worth taking? I had a friend move interstate recently. The day after he moved in, his neighbour’s house burnt down. The fire spread to his house and destroyed most of his possessions. He’d been planning on getting his contents insurance updated the day of the fire.

Bad things happen. Taking the risk that you won’t lose your secondary site isn’t really operational planning, it’s casting your fate to the winds and relying on luck. The solution below though doesn’t rely on luck at all:

Replication and Duplicated Backups

There’s undoubtedly a cost involved; each copy of your data has a tangible cost regardless of whether that’s a primary copy or a secondary copy. Are there some backups you won’t copy? That depends on your requirements: there may, for instance, be test systems you need to back up but don’t need a secondary copy of; such decisions still have to be made on a risk vs cost basis.

Replication is all well and good, but it’s not a get-out-of-gaol card for avoiding cloned backups.

Melbourne DPUG and VMware Data Protection

Backup theory, Data Domain, VBA
Sep 20, 2015

Recently a colleague and I initiated the Melbourne Data Protection User Group (DPUG).

Pug in a pile of backup tapes

If you’re interested in joining and participating and based in Melbourne, you can find details for the user group over at Meetup.

Our first presentation was on Wednesday 9 September, and EMC Melbourne were kind enough to provide the office space for the session. That being said, DPUG is not about EMC products – it’s designed to be a vendor neutral community forum to discuss techniques, strategies and best practices relating to data protection.

Starting DPUG was a healthy reminder that data protection is an overloaded term in the IT industry. To those of us who work within data storage and more broadly, IT infrastructure, data protection covers concepts such as backup and recovery, continuous availability, continuous data protection, replication, snapshots and so on. For people who work at the application layer or communication layer though, data protection is almost invariably interpreted to be something like security, data privacy or intrusion detection/threat mitigation. Data protection is a term we share with other areas of the industry. In the end it’s all data protection, but it has two very different areas of focus.

Our first session was about VMware Data Protection. We’re now seeing a very high percentage of virtualisation within most businesses – it’s not uncommon to see 80% or 90% virtualisation now, and many companies are continuing to pursue a strategy of achieving 100% system and infrastructure virtualisation.

In the VMware Data Protection presentation I walked the audience through a history of how the industry overall has protected virtual machines since their inception in the midrange space. First, we started with treating virtual machines like regular hosts – installing agents on each virtual machine and backing it up as if it were no different from a physical host. That provides a high degree of granularity and flexibility, but as we know, virtualisation is about cooperative resource sharing, whereas traditional backups are about minimising the time it takes to get data from the client into the protection storage. There’s not a lot of compatibility between “cooperative resource sharing” and “minimising the time it takes to get data from the client…”, and a poorly designed backup strategy using in-guest backup agents can bring virtual infrastructure to a screaming halt – even today.

The next attempt to provide a comprehensive solution for backing up virtual machines saw businesses installing backup agent software on the hypervisors, and writing custom scripts to snapshot virtual machines prior to copying them to protection storage. This was usually error prone, and when you stop to think about how virtual machines are usually just very big files, it meant that a single change within a virtual machine would trigger a new full backup every time. Once technology such as vMotion became available these techniques became difficult if not impossible to maintain – you could not really predict where a virtual machine would be for backups at any given time. What’s more, hypervisors are a bit like NAS appliances – they’re designed to do one thing really well, and you shouldn’t be trying to install third party software on them.

The solution was an API-based approach, of course. While different in practice, you can equate the API approach of VMware backups to the NDMP approach of NAS. The virtualisation system provides an integration point for backup software to use, and leveraging that, backup products are able to streamline the data protection process with image level backups and file level recoveries from those image level backups.
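
Conceptually, the API-driven flow looks something like the sketch below. Every call in it is a hypothetical placeholder – this is not the actual VMware (VADP) or NetWorker API – but it shows the shape of an image level backup with changed block tracking:

```python
def image_level_backup(vm, protection_storage, last_change_id=None):
    """Conceptual flow of an API-driven virtual machine backup.
    Every object and method here is a hypothetical placeholder for the real vendor APIs."""
    snapshot = vm.create_snapshot(quiesce=True)          # consistent point-in-time image
    try:
        for disk in snapshot.disks:
            if last_change_id is None:
                extents = disk.all_extents()             # first backup: read the full image
            else:
                # Changed block tracking: only regions modified since the previous
                # backup need to be read from the disk and sent to protection storage.
                extents = disk.changed_extents(since=last_change_id)
            for extent in extents:
                protection_storage.write(disk.id, extent.offset, disk.read(extent))
        return snapshot.change_id                        # remembered for the next incremental
    finally:
        snapshot.delete()                                # never leave backup snapshots behind
```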

This is something that NetWorker for instance has been doing for some time – most recently with VBA. VBA is something I’ve covered a few times over the last twelve months (Current state of Virtual Machine Backups in NetWorker, NetWorker 8.2 and VBA Instant-Access, and Testing and Debugging an Emergency Restore, for instance).

VMware offers its own version of VBA as well so that businesses (particularly smaller ones) can still protect their environments. It used to be split into VDP and VDP/A, but as of vSphere 6 Essentials, those options have been combined into a single (free) VDP. VDP can’t do everything VBA can do – for example, VDP can’t:

  • Perform instant-access to a virtual machine (powering on from Data Domain storage)
  • Perform tape-out
  • Write to storage other than Data Domain or internal storage

As a means of demonstrating some of the advantages of virtual machine image level backups though, VDP is useful, and that’s what I used in the DPUG session earlier this month. And now, after taking the plunge and investing in some screen recording software, I’ve made three of the demos from the DPUG session available for viewing. If you’re using VBA already you’ll be familiar with all of these. However, if you’ve not yet taken the plunge in utilising VBA for your backup environment, check them out – while the demos show the VMware Data Protection Appliance (VDP) in use, they’re equally applicable and in fact it’s the same process for a VBA install in each situation.

Creating and executing a protection policy:

Executing an image level recovery that makes use of changed block tracking:

Executing a file level recovery from an image level backup:

Don’t forget, if you’re in Melbourne and want to participate in DPUG, you’re more than welcome – regardless of whether you use EMC products or not. We want this to be an open group and look forward to seeing a broad spectrum of regular companies, integrators and vendors participating!

Also, if you’re interested in seeing screencasts for NetWorker related topics on this blog, let me know.

One target to rule them all

Architecture, Backup theory
May 13, 2015

Introduction

It’s true there are some data types that broadly aren’t suitable for sending to Data Domain – any more than they’re suitable for sending to any other deduplication appliance or system within any environment. Large imaging data and video files will yield minimal deduplication except over successive backups (assuming static data), and compressed and/or encrypted data aren’t well suited either.

But the majority of data within most organisations is suited for writing to Data Domain systems.

Years ago when EMC purchased Data Domain, I don’t think anyone anticipated just what they had in mind for the appliance. I certainly didn’t – and I’d been involved in the backup industry for probably 15 years at that point. Deduplication had been kicking around for several years, but it hadn’t been mainstreamed to the degree EMC has achieved.

The numbers practically speak for themselves. Data Domain represents an overwhelming lion’s share of the deduplication appliance space – but I’m not going to quote numbers here. I’m going to talk about the architectural vision of Data Domain.

As a target-only appliance, Data Domain represents considerable advantage to any business that deploys it, but that’s just the tip of the iceberg. The real magic happens when we start to consider the simple fact that a Data Domain is not a dumb appliance. EMC have chosen to harness the platform to deliver maximum bang for buck for any company that walks down that path.

May the source be with you

Target-based deduplication works brilliantly for drastically reducing the total amount of data stored, but it still results in that data being sent. Avamar demonstrates this overwhelmingly – its source-based deduplication backup process is unbelievably efficient and is a powerfully attractive choice for many businesses, particularly those in the XaaS industry.

Data Domain’s Boost functionality extends its deduplication technology up to the origin of the data. For products like NetWorker, Avamar and VDP/VDPA, this goes right to the source. (For third party products such as NetBackup, it covers the media servers.)

If Boost had stopped at NetWorker and Avamar integration, it would have been a remarkably powerful efficiency hook for many businesses, but there’s more power to be had. The extension of Data Domain Boost to include support for enterprise applications such as Oracle, SQL Server, SAP, etc., provides unparalleled extensibility in the backup space to organisations. It also means that businesses who have deployed other backup technologies but leverage the power of Data Domain deduplication in their data protection strategy can get direct client deduplication performance for what is often their most mission critical systems and applications.
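
As a rough conceptual sketch – this is not the DD Boost protocol, just the general idea behind source-side deduplication – the client fingerprints each segment and only ships the segments the target doesn’t already hold, compressing the unique data before it goes over the wire:

```python
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024    # real systems use variable-length segments; fixed size here for brevity

def send_with_source_dedup(data: bytes, target_index: set) -> int:
    """Return the number of bytes that actually cross the network.
    `target_index` stands in for the target's knowledge of segments it already stores."""
    sent = 0
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        fingerprint = hashlib.sha256(segment).digest()
        if fingerprint in target_index:
            sent += len(fingerprint)                # already stored: send a tiny reference only
        else:
            sent += len(zlib.compress(segment))     # unique data is compressed before sending
            target_index.add(fingerprint)
    return sent

# Back up the same data twice: the second pass sends little more than fingerprints.
target = set()
payload = b"database block contents " * 100_000
print(send_with_source_dedup(payload, target))      # first backup: unique segments are sent
print(send_with_source_dedup(payload, target))      # subsequent backup: references only
```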

I’m the first to admit that I’ve spent years trying to convince DBAs to hand over control of their application backups to NetWorker administrators. It’s a discussion I’ve won as much as I’ve lost, but the Data Domain plugins for databases have proven one key lesson: when I’ve ‘lost’ that discussion it’s not been through lack of conviction, but through lack of process. DBAs are all for efficiencies in the backup process, but given the enterprise criticality of databases in so many organisations, much of the push back on backup centralisation has been from a lack of control of the process.

The Boost application plugins get past that by allowing a business to integrate their application backups into centralised backup storage, while leaving control of the backup process with the agreed and trusted scheduling methods that offer considerably more granular and flexible controls. Backup products offer scheduling, of course, but they’re not meant to be the bee’s knees of scheduling that you’ll find in products devoted solely to that purpose. That’s what DBAs have mostly resisted. (This, for what it’s worth, is the difference between app-centric aspects to backup and recovery and a decentralised backup ‘system’.)

Here’s where we’re at with Data Domain – it now sits at a nexus in the Data Centre for data protection and nearline archival storage:

May the source be with you

(Yes, it’s even very well suited for archival workloads.)

NetWorker, Avamar, VDP/VDPA, Client Direct, Enterprise Apps – I could go on – Data Domain sits at the centre ready to receive the data you want to send to it.

But that diagram isn’t quite complete. To truly get the maximised efficiency out of Data Domain, the picture really should look more like this:

Protecting the Protection

That’s right – logically, a Data Domain solution will have at least two Data Domains in it, so that whatever you’re protecting via the Data Domain will itself be protected. Now, by itself, Data Domain offers excellent protection for the data you’re storing, but unlike what most people think of on this front, RAID-6 storage protection is just the tip of the iceberg. RAID-6 is nice – it protects you from two drive failures at any point. On top of that though, you have the Data Invulnerability Architecture that you’ll hear EMC folks talk about quite regularly – that’s the magic sauce. The Data Domain doesn’t just sit there storing your data: it stores it, it checks it, it reads it again, and it checks it as part of regular verification. (If you want to compare it to tape, imagine having a tape library big enough to store every tape you keep for retention, one that constantly sits there loading all the tapes and confirming all the data can be read back.)
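
As a loose illustration of that style of ongoing verification – not the actual Data Invulnerability Architecture, just the general pattern of continually re-reading stored data and re-checking it against recorded checksums – the idea looks something like this (the storage object and its methods are hypothetical):

```python
import hashlib

def verify_stored_segments(storage) -> list:
    """Re-read every stored segment and confirm it still matches the checksum recorded
    when it was written. `storage` is a hypothetical object exposing segment_ids(),
    read() and recorded_checksum() -- this shows the pattern, not the product."""
    suspect = []
    for segment_id in storage.segment_ids():
        data = storage.read(segment_id)
        if hashlib.sha256(data).hexdigest() != storage.recorded_checksum(segment_id):
            suspect.append(segment_id)   # flag for repair from parity/replica before it's ever needed
    return suspect
```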

But we all know in the data protection world that you still need that added protection of keeping a second copy of that data, regardless of whether that’s for compliance or for true disaster protection. In terms of absolute efficiency, the absolute best way you’ll get a secondary copy of that data is via the global deduplicated replication offered between two Data Domains. (For what it’s worth, that’s where some companies make the mistake of deploying tape as their secondary copy from an original backup target of Data Domain: what’s the point of deploying efficient deduplication if the first thing you’re going to do is rehydrate all the content again?)

Aside: Coming back to encryption and compression

Earlier I said that compressed and encrypted workloads aren’t necessarily suited to Data Domain. That’s true, but that usually reflects an opportunity to revisit the process and thinking behind those workloads.

Compression is typically used in a data streaming activity for data protection because of a requirement to minimise the amount of data going across the network. Boost eliminates that need by doing something better than compression at the client side – deduplication. Deduplication doesn’t just compress the original data, but it substantially reduces the original data by not even bothering to send data that already exists at the target. For instance, if I turn my attention to Oracle, the two most common reasons why DBAs will create compressed Oracle backups are:

(a) They’re writing them to primary storage and trying to minimise the footprint, or

(b) They’re writing them to NAS or some other form of network storage, and want to minimise the amount of data sent over busy links.

Both of those are squarely addressed by Data Domain:

  • For (a), the footprint is automatically reduced by writing it in uncompressed format to the Data Domain. It handles the deduplication automatically. In fact, it’ll be a lot more space efficient than say, the three most recent database backups being written to Tier-1/Primary storage.
  • For (b), because only unique data is sent over the network, and that data is compressed by Boost before it’s sent over the network, you’re still ending up with a more efficient network transfer than writing a compressed copy over the network.

Encryption might be considered a trickier subject, but it’s not really. There’s two types of encryption a business might require – at rest, or in-flight. Data Domain has supported encryption at rest for quite a long time, and the recent support for in-flight encryption has completed that piece of the puzzle. (That in-flight encryption is integrated in such a way that it still allows for local/source deduplication and associated pre-send compression, too.)

What all this means

When EMC first acquired Data Domain, they acquired a solid product that had already established excellent customer trust built from high reliability and performance. While both of those features have continued to grow (not to mention capacity … have you seen the specs on the Data Domain 9500?), those features alone don’t make for a highly extensible product (just a reliable big bucket of storage). The extensibility comes from the vertical integration right up into the application stack, and the horizontal integration across a multitude of use cases.

Last year’s survey results revealed a very high number of NetWorker environments leveraging Data Domain, but what we see if we step back a little from a single-product focus is that Data Domain is a strategic investment in the enterprise, able to be utilised for a plethora of scenarios across the board.

So there are two lessons – one for those with Data Domain already, and one for those preparing to jump into deduplication. If you’ve already got Data Domain in your environment, start looking at its integration points and talking to either EMC or your supplier about where else Data Domain can offer synergies; if you’re looking at deploying, keep in mind that it’s a highly flexible appliance capable of fitting into multiple workloads.

Either way, that’s how you achieve an excellent return on investment.

World backup day misses the point

Backup theory
Mar 30, 2015


It’s fair to say I’m a big fan of backup and recovery. So much so that a substantial part of the last 19 years of my career has been devoted to it in some form or another.

Yet here’s the rub: World backup day (March 31) is full of good intentions but has entirely the wrong focus. By that I don’t just mean it should be World Recovery Day (although that would be a nice change); instead, it places emphasis on just one aspect of data protection, and these days there’s no such thing as a data protection strategy that only leverages a single aspect.

Data protection – or Information Lifecycle Protection (ILP), as I like to think of it – starts well before the first backup is taken, and extends into a variety of fields: storage, operating systems and virtualisation. You might say that, at bare minimum, ILP comprises the following:

Components of ILP

(It’s also impossible to have a truly effective Information Lifecycle Protection strategy without also having a data lifecycle management strategy – i.e., be comfortable with archival and pruning of data.)

It would be easy to look at the above diagram and assume it’s all about storage, but there’s more to it than that. Smart companies are starting to take an application-centric approach to their data protection. That’s not to suggest decentralisation of data protection, but more decentralised integration with intelligent centralised reporting, capacity management and policy management. For sure, storage is one aspect of what we need to protect, but if you look at an average enterprise now there are whole realms of data protection functions that have made their way up into higher layers – VMware’s SRM, vMotion, etc., are perfect examples of data-protection concepts applied at a higher level to provide more functional protection.

By application-centric approach, I’m not talking about “MSSQL Initiated” or “Oracle Initiated” (though I’ll admit that plays a part in a centralised policy/decentralised integration approach), but more a consideration of how enterprise IT needs to work in an evolving – indeed, evolved – landscape. It’s time in IT we stop thinking about backup and recovery or data protection being about a list of hosts and databases that need protection, and instead think about data protection in terms of the business functions and business applications that need to be protected. From the business perspective, the fact that the hosts cyclops, medusa and cerberus run the database fipr00 is meaningless – the business wants to know that the financial planning system is being protected. As cloud based approaches to IT take hold and introduce a consumer-based, service-centric view of IT, IT must adjust to think of data protection from a service, application or business function perspective.
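
As a hedged illustration of that shift – the service names, hosts and numbers here are invented for the example – a protection catalogue keyed by business function rather than by hostname might look something like this:

```python
# Illustrative only: a service-centric view of protection requirements.
protection_catalogue = {
    "Financial planning system": {
        "components": ["cyclops", "medusa", "cerberus", "db:fipr00"],
        "rpo_hours": 1,
        "rto_hours": 4,
        "retention": {"operational": "6 weeks", "compliance": "7 years"},
        "protection": ["snapshots", "replication", "backup + clone"],
    },
    "Customer web portal": {
        "components": ["web01", "web02", "db:portal"],
        "rpo_hours": 0.25,
        "rto_hours": 1,
        "retention": {"operational": "4 weeks"},
        "protection": ["continuous availability", "backup"],
    },
}

def report(catalogue: dict) -> None:
    """Report protection posture the way the business asks about it: by service, not by host."""
    for service, policy in catalogue.items():
        print(f"{service}: RPO {policy['rpo_hours']}h, RTO {policy['rto_hours']}h, "
              f"protected by {', '.join(policy['protection'])}")

report(protection_catalogue)
```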

Celebrate world backup day by all means, but let’s keep in mind it’s at just one quadrant in the information lifecycle protection approach.
