Basics: Planning A Recovery Service

Jan 30, 2018
 

Introduction

In Data Protection: Ensuring Data Availability, I talk quite a lot about what you need to understand and plan as part of a data protection environment. I’m often reminded of the old saying from clothing and carpentry – “measure twice, cut once”. The lesson in that statement, of course, is that rushing headlong into something may make your work more problematic, while taking the time to properly plan what you’re doing can in many instances (and data protection is one such instance) make the entire process easier. This post isn’t meant to be a replacement for the various planning chapters in my book – but I’m sure it’ll have some useful tips regardless.

We don’t back up just as something to do; in fact, we don’t protect data just as something to do, either. We protect data both to shield our applications and services (and therefore our businesses) from failures, and to ensure we can recover it when necessary. So with that in mind, what are some essential activities in planning a recovery service?


First: Do you know what the data is?

Data classification isn’t something done during the data protection cycle itself. Maybe one day it will be, when AI and machine learning are sufficiently advanced; in the interim, though, it requires input from people – IT, the business, and so on. Of course, there’s nothing physically preventing you from planning and implementing a recovery service without performing data classification; I’d go so far as to suggest that an easy majority of businesses do exactly that. That doesn’t mean it’s an ideal approach, though.

Data classification is all about understanding the purpose of the data, who cares about it, how it is used, and so on. It’s a collection of seemingly innocuous yet actually highly important questions. It’s something I cover quite a bit in my book, and for the very good reason that I honestly believe a recovery service can be made simpler, cheaper and more efficient if it’s complemented by a data classification process within the organisation.

Second: Does the data need to exist?

That’s right – does it need to exist? This is another essential but oft-overlooked part of achieving a cheaper, simpler and more efficient recovery service: data lifecycle management. Every 1TB you can eliminate from your primary storage systems, for the average business at least, is going to yield anywhere between 10 and 30TB of savings in protection storage (RAID, replication, snapshots, backup and recovery, long term retention, etc.). For some businesses that number may be smaller, but for the majority of mid-sized and larger businesses that 10-30TB saving is likely to go much, much higher – particularly as the criticality of the data increases.
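To make the multiplier concrete, here’s a minimal sketch tallying the protection copies a single terabyte of primary data can accrue. Every copy count below is an illustrative assumption, not a figure from any particular environment:

```python
# Rough sketch: how 1 TB of primary data multiplies into protection
# storage. All copy counts below are illustrative assumptions.
PRIMARY_TB = 1.0

protection_copies_tb = {
    "RAID/mirroring overhead": 1.0,                    # e.g., a RAID-1 second copy
    "replica at a secondary site": 1.0,
    "snapshots (aggregate delta)": 0.5,
    "short-term backups (30 days, dedupe-adjusted)": 4.0,
    "long-term retention (compliance copies)": 6.0,
}

total_protection_tb = PRIMARY_TB * sum(protection_copies_tb.values())
print(f"1 TB primary -> ~{total_protection_tb:.1f} TB protection storage")
```

With these assumed figures the total lands at 12.5 TB – squarely inside the 10-30TB band – and every terabyte deleted from primary storage removes that entire multiple from protection storage as well.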

Without a data lifecycle policy, bad things happen over time:

  • Keeping data becomes habitual rather than based on actual need
  • As ‘owners’ of data disappear (e.g., change roles, leave the company, etc.), reluctance to delete, prune or manage the data tends to increase
  • Apathy or intransigence towards developing a data lifecycle programme increases.

Businesses that avoid data classification and data lifecycle condemn themselves to the torment of Sisyphus – constantly trying to roll a boulder up a hill only to have it fall back down again before they get to the top. This manifests in many ways, of course, but in designing, acquiring and managing a data recovery service it usually hits the hardest.

Third: Does the data need to be protected?

I remain a firm believer that it’s always better to back up too much data than not enough. But that’s a default, catch-all position rather than one which should be the blanket rule within the business. Part of the data classification and data lifecycle work will help you determine whether you need to enact specific (or any) data protection models for a dataset. It may be test database instances that can be recovered at any point from production systems; it might be randomly generated data that has no meaning outside of a very specific use case; or it might be transient data merely flowing from one location to another that does not need to be captured and stored.

Remember the lesson from data lifecycle – every 1TB eliminated from primary storage can eliminate 10-30TB of data from protection storage. The next logical step after that is to be able to accurately answer the question, “do we even need to protect this?”

Fourth: What recovery models are required?

At this point, we’ve not talked about technology. This question gets us a little closer to working out what sort of technology we need, because once we have a fair understanding of the data we need to offer recovery services for, we can start thinking about what types of recovery models will be required.

This will essentially involve determining how recoveries are done for the data, such as:

  • Full or image level recoveries?
  • Granular recoveries?
  • Point in time recoveries?

Some data may not need every type of recovery model deployed for it. For some data, granular recoverability is as important as complete recoverability; for other types of data, it could be that the only way to recover it is image/full – where granular recovery would simply leave the data corrupted or useless. Does all data require point in time recovery? Much will, but some may not.

You should also consider, of course, how involved users will be in recoveries. Self-service for admins? Self-service for end-users? All operator-run? Chances are it’ll be a mix, depending on the answers to those previous recovery model questions (e.g., you might allow self-service recovery of individual emails, but full Exchange recovery is not going to be an end-user initiated task).
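One way to capture the answers to these recovery model questions is a simple per-dataset matrix. A minimal sketch follows – the dataset names and choices are hypothetical, not recommendations; a real matrix would come out of your data classification exercise:

```python
# Hypothetical recovery-model matrix; dataset names and settings are
# invented for illustration.
recovery_matrix = {
    "exchange-mailboxes": {
        "granular": True, "image_full": True, "point_in_time": True,
        "self_service": "end-user (individual email only)",
    },
    "erp-database": {
        # Granular recovery of a database would leave it inconsistent,
        # so only image/full and point-in-time recovery apply.
        "granular": False, "image_full": True, "point_in_time": True,
        "self_service": "operator only",
    },
    "file-shares": {
        "granular": True, "image_full": True, "point_in_time": True,
        "self_service": "admin",
    },
}

for dataset, models in recovery_matrix.items():
    needed = [m for m in ("granular", "image_full", "point_in_time") if models[m]]
    print(f"{dataset}: {', '.join(needed)} (self-service: {models['self_service']})")
```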

Fifth: What SLOs/SLAs are required?

Regardless of whether your business has Service Level Objectives (SLOs) or Service Level Agreements (SLAs), you’ll potentially have to meet a variety of them depending on the nature of the failure, the criticality and age of the data, and so on. (For the rest of this section, I’ll use ‘SLA’ as a generic term for both.) In fact, there’ll be up to three different categories of SLAs you have to meet:

  • Online: These types of SLAs are for immediate or near-immediate recoverability from failure; they’re meant to keep the data online rather than having to seek to retrieve it from a copy. This will cover options such as continuous replication (e.g., fully mirrored storage arrays), continuous data protection (CDP), as well as more regular replication and snapshot options.
  • Nearline: This is where backup and recovery, archive, and long term retention (e.g., compliance retention of backups/archives) comes into play. Systems in this area are designed to retrieve the data from a copy (or in the case of archive, a tiered, alternate platform) when required, as opposed to ensuring the original copy remains continuously, or near to continuously available.
  • Disaster: These are your “the chips are down” SLAs, which’ll fall into business continuity and/or isolated recovery. Particularly in the event of business continuity, they may overlap with either online or nearline SLAs – but they can also diverge quite a lot. (For instance, in a business continuity situation, data and systems for ‘tier 3’ and ‘tier 4’ services, which may otherwise require a particular level of online or nearline recoverability during normal operations, might be disregarded entirely until full service levels are restored.)

Not all data may require all three of the above, and even if data does, unless you’re in a hyperconverged or converged environment, it’s quite possible if you’re a backup administrator, you only need to consider some of the above, with other aspects being undertaken by storage teams, etc.
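As a sketch of how that tier-by-tier division might be recorded, consider the mapping below – the tier names and assignments are entirely hypothetical assumptions for illustration:

```python
# Hypothetical mapping of service tiers to the three SLA categories
# described above. Tier names and assignments are assumptions only.
SLA_CATEGORIES = ("online", "nearline", "disaster")

tier_slas = {
    "tier-1 (mission critical)": {"online", "nearline", "disaster"},
    "tier-2 (business critical)": {"nearline", "disaster"},
    "tier-3 (business operational)": {"nearline"},
    "tier-4 (test/dev)": set(),   # might be rebuilt rather than recovered
}

for tier, categories in tier_slas.items():
    applicable = [c for c in SLA_CATEGORIES if c in categories] or ["none"]
    print(f"{tier}: {', '.join(applicable)}")
```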

Now you can plan the recovery service (and conclusion)

And because you’ve gathered the answers to the above, planning and implementing the recovery service is now the easy bit! Trust me on this – working out what a recovery service should look like for the business when you’ve gathered the above information is a fraction of the effort compared to when you haven’t. Again: “Measure twice, cut once.”

If you want more in-depth information on the above, check out chapters in my book such as “Contextualizing Data Protection”, “Data Life Cycle”, “Business Continuity”, and “Data Discovery” – not to mention the specific chapters on protection methods such as backup and recovery, replication, snapshots, continuous data protection, etc.

The Perils of Tape

Jan 23, 2018
 

In the 90s, if you wanted to define a “disaster recovery” policy for your datacentre, part of that policy revolved around tape. Back then, tape wasn’t just for operational recovery, but also for disaster recovery – and even isolated recovery. In the 90s of course, we weren’t talking about “isolated recovery” in the same way we do now, but depending on the size of the organisation, or its level of exposure to threats, you might occasionally find a company that would make three copies of data – the on-site tape, the off-site tape, and then the “oh heck, if this doesn’t work we’re all ruined” tape.

These days, whatever you might want to use tape for within a datacentre (and yes, I’ll admit that sometimes it may still be necessary, though I’d argue far less necessary than the average tape deployment), something you most definitely don’t want to use it for is disaster recovery.

The reasoning is simple – time. To understand what I mean, let’s look at an example.

Let’s say your business has a used data footprint for all production systems of just 300 TB. I’m not talking total storage capacity here – just, literally, the occupied space across all DAS, SAN and NAS storage for production systems. For the purposes of our discussion, I’m also going to say that based on dependencies and business requirements, disaster recovery means getting back all production data. Ergo, getting back 300 TB.

Now, let’s also say that your business is using LTO-7 media. (I know LTO-8 is out, but not everyone using tape updates as soon as a new format becomes available.) The key specifications on LTO-7 are as follows:

  • Uncompressed capacity: 6 TB
  • Max speed without compression: 300 MB/s
  • Typical rated compression: 2.5:1

Now, I’m going to stop there. When it comes to tape, I’ve found the compression ratios quoted by vendors to be wildly at odds with reality. Back when everyone used to quote a 2:1 compression ratio, I found that tapes yielding 2:1 or higher were the extreme exception, and the norm was more likely 1.4:1 or 1.5:1. So, I’m not going to assume you’re going to get compression in the order of 2.5:1 just because the tape manufacturers say you will. (They also say tapes will reliably be OK on shelves for decades…)

A single LTO-7 tape, if we’re reading or writing at full speed, will take around 5.5 hours to go end-to-end. That’s very generously assuming no shoe-shining, and that the read or write can be accomplished as a single process – either via multiplexing or a single stream. Let’s say you get 2:1 compression on the tape. That’s 12 TB, but it still takes 5.5 hours to complete (assuming a linear increase in tape speed – and at 600MB/s, that’s getting iffy, but bear with me).
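That 5.5 hour figure falls straight out of the published specifications:

```python
# End-to-end time for a single LTO-7 tape at its native specifications.
capacity_tb = 6        # uncompressed capacity
speed_mb_s = 300       # maximum uncompressed transfer rate

seconds = capacity_tb * 1_000_000 / speed_mb_s   # 6 TB = 6,000,000 MB
hours = seconds / 3600
print(f"End-to-end: {hours:.1f} hours")          # ~5.6 hours

# At 2:1 compression both the capacity (12 TB) and the required
# streaming rate (600 MB/s) double, so the end-to-end time is unchanged.
```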

Drip, drip, drip.

Now, tape drives are always expensive – so you’re not going to see enough tape drives to load one tape per drive, fill the tape, and bang, your backup is done for your entire environment. Tape libraries exist to allow the business to automatically switch tapes as required, avoiding that 1:1 relationship between backup size and the number of drives used.

So let’s say you’ve got 3 tape drives. That means in 5.5 hours, using 2:1 compression and assuming you can absolutely stream at top speed for that compression ratio, you can recover 36 TB. (If you’ve used tape for as long as me, you’ll probably be chuckling at this point.)

So, that’s 36 TB in the first 5.5 hours. At this point, things are already getting a little absurd, so I’ll not worry about the time it takes to unload and load new media. The next 36 TB is done in another 5.5 hours. If we continue down that path, our first 288 TB will be recovered after 44 hours. We might then assume our final 12TB fitted snugly on a single remaining tape – which will take another 5.5 hours to read, getting us to a disaster recovery time of 49.5 hours.

Drip, drip, drip.

Let me be perfectly frank here: those numbers are ludicrous. The only way you have a hope of achieving that sort of read speed in a disaster recovery for that many tapes is if you are literally recovering a 300 TB filesystem filled almost exclusively with 2GB files that each can be compressed down by 50%. If your business production data doesn’t look like that, read on.

Realistically, in a disaster recovery situation, I’d suggest you’re going to look at at least 1.5x the end-to-end read/write time. For all intents and purposes, that assumes the tapes are written with, say, 2-way multiplexing and we can conduct 2 reads across each, where each read is able to use filemark-seek-forwards (FSFs) to jump over chunks of the tape.

5.5 hours per tape becomes 8.25 hours. (Even that, I’d say, is highly optimistic.) At that point, the 49.5 hour recovery is a slightly more believable 74.25 hour recovery. Slightly. That’s over 3 days. (Don’t forget, we still haven’t factored in tape load, tape rewind, and tape unload times into the overall scope.)
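Pulling the whole worked example together (the drive count, compression ratio, per-pass time and 1.5x multiplier are all the assumptions stated above):

```python
# Worked example: 300 TB recovered via 3 LTO-7 drives, assuming 2:1
# compression (12 TB per tape) and a 5.5 hour end-to-end pass per tape.
data_tb = 300
drives = 3
tape_tb = 12
pass_hours = 5.5

tb_per_pass = drives * tape_tb                       # 36 TB per pass
full_passes, remainder_tb = divmod(data_tb, tb_per_pass)
passes = full_passes + (1 if remainder_tb else 0)    # 9 passes in total
best_case_hours = passes * pass_hours                # 49.5 hours
realistic_hours = best_case_hours * 1.5              # 74.25 hours

print(f"Best case: {best_case_hours} h, realistic: {realistic_hours} h")
```

And even that still omits tape load, rewind and unload times.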

Drip, drip, drip.

What advantage do we get if we read from disk instead of tape? Well, if we’re going to go down that path, we’re going to get instant access (no load, unload or seek times), and concurrency (highly parallel reads – recovering fifty systems at a time, for instance).

But here’s the real kicker. Backup isn’t disaster recovery. And particularly, tape backup is disaster recovery in the same way that a hermetically sealed, 100% barricaded house is safety from a zombie apocalypse. (Eventually, you’re going to run out of time, i.e., food and water.)

This is, quite frankly, where the data protection continuum comes into play. More importantly, this is where picking the right data protection methodology for the SLAs (the RPOs and RTOs) is essential.

Backup and recovery systems might be a fallback position when all other types of disaster recovery have been exhausted, but at the point you’ve decided to conduct a disaster recovery of your entire business from physical tape, you’re in a very precarious position. It might have been a valid option in the 90s, but in a modern data protection environment there’s a plethora of higher level systems that should be built into a disaster recovery strategy. I’ve literally seen examples of companies that, faced with having to do a complete datacentre recovery from tape, took a month or more to get back to operational. (If ever.)

Like the old saying goes – if all you have is a hammer, everything looks like a nail. “Backup only” companies have a tendency to encourage you to think of backup as a disaster recovery function, or even to think of disaster recovery/business continuity as someone else’s problem. Talking disaster recovery is easiest when you can talk to someone across the entire continuum.

Otherwise you may very well find your business perfectly safely barricaded into a hermetically sealed location away from the zombie horde, watching your food and water supplies continue to decrease hour by hour, day by day. Drip, drip, drip.

Planning holistic data protection strategies is more than just backup – and it’s certainly more than just tape. That’s why Data Protection: Ensuring Data Availability looks at the entire data storage protection spectrum. And that’s why I work for DellEMC, too. It’s not just backup. It’s about the entire continuum.

 

Dec 30, 2017
 

With just a few more days of 2017 left, I thought it opportune to make the last post of the year a summary of some of what we’ve seen in the field of data protection in 2017.

2017 Summary

It’s been a big year, in a lot of ways, particularly at DellEMC.

Towards the end of 2016, but definitely leading into 2017, NetWorker 9.1 was released. That meant 2017 started with a bang, courtesy of the new NetWorker Virtual Proxy (NVP, or vProxy) backup system. This replaced VBA, allowing substantial performance improvements and some architectural simplification as well. I was able to generate some great stats right out of the gate with NVP under NetWorker 9.1, and that applied not just to Windows virtual machines but also to Linux ones, too. NetWorker 9.1 with NVP allows you to recover tens of thousands of files or more from an image level backup in just a few minutes.

In March I released the NetWorker 2016 usage survey report – the survey ran from December 1, 2016 to January 31, 2017. That reminds me – the 2017 Usage Survey is still running, so you’ve still got time to provide data to the report. I’ve been compiling these reports now for 7 years, so there’s a lot of really useful trends building up. (The 2016 report itself was a little delayed in 2017; I normally aim for it to be available in February, and I’ll do my best to ensure the 2017 report is out in February 2018.)

Ransomware and data destruction made some big headlines in 2017 – repeatedly. GitLab hit 2017 running with a massive data loss in January, which they subsequently blamed on a backup failure, when in actual fact it was a staggering process and people failure. It reminds one of the old manager #101 credo, “If you ASSuME, you make an ASS out of U and ME”. GitLab’s issue may have, at a very small level, been a ‘backup failure’, but only in the sense that everyone in the house thinking it was someone else’s turn to fill the tank of the car, and running out of petrol, is a ‘car failure’.

But it wasn’t just Gitlab. Next generation database users around the world – specifically, MongoDB – learnt the hard way that security isn’t properly, automatically enabled out of the box. Large numbers of MongoDB administrators around the world found their databases encrypted or lost as default security configurations were exploited on databases left accessible in the wild.

In fact, Ransomware became such a common headache in 2017 that it fell prey to IT’s biggest meme – the infographic. Do a quick Google search for “Ransomware Timeline” for instance, and you’ll find a plethora of options around infographics about Ransomware. (And who said Ransomware couldn’t get any worse?)

Appearing in February 2017 was Data Protection: Ensuring Data Availability. Yes, that’s right – I’m calling the release of my second book on data protection a big event in the realm of data storage protection in 2017. Why? This is a topic which is insanely critical to business success. If you don’t have a good data protection process and strategy within your business, you could literally lose everything that defines the operational existence of your business. There are three defining aspects I see in data protection considerations now:

  • Data is still growing
  • Product capability is still expanding to meet that growth
  • Too many businesses see data protection as a series of silos, unconnected – storage, virtualisation, databases, backup, cloud, etc. (Hint: They’re all connected.)

So on that basis, I do think a new book whose focus is to give a complete picture of the data storage protection landscape is important to anyone working in infrastructure.

And on the topic of stripping the silos away from data protection, 2017 well and truly saw DellEMC cement its lead in what I refer to as convergent data protection. That’s the notion of combining data protection techniques from across the continuum to provide new methods of ensuring SLAs are met, impact is eliminated, and data hops are minimised. ProtectPoint was first introduced to the world in 2015, and has evolved considerably since then. ProtectPoint allows primary storage arrays to integrate with data protection storage (e.g., VMAX3 to Data Domain) so that those really huge databases (think 10TB as a typical starting point) can have instantaneous, incremental-forever backups performed – all application integrated, but no impact on the database server itself. ProtectPoint though was just the starting position. In 2017 we saw the release of Hypervisor Direct, which draws a line in the sand on what Convergent Data Protection should be and do. Hypervisor direct is there for your big, virtualised systems with big databases, eliminating any risk of VM-stun during a backup (an architectural constraint of VMware itself) by integrating RecoverPoint for Virtual Machines with Data Domain Boost, all while still being fully application integrated. (Mark my words – hypervisor direct is a game changer.)

Ironically, in a world where target-based deduplication should be a “last resort”, we saw tech journalists get irrationally excited about a company heavy on marketing but light on functionality promoting its exclusively target-deduplication data protection technology as somehow novel or innovative. Apparently, combining target based deduplication with the need to scale to potentially hundreds of 10Gbit Ethernet ports is both! (In the same way that releasing a 3-wheeled Toyota Corolla for use by the trucking industry would be both ‘novel’ and ‘innovative’.)

Between VMworld and DellEMC World, though, there were some huge new releases by DellEMC this year. The Integrated Data Protection Appliance (IDPA) was announced at DellEMC World. IDPA is a hyperconverged backup environment – you get delivered to your datacentre a combined unit with data protection storage, control, reporting, monitoring, search and analytics that can be stood up and ready to start protecting your workloads in just a few hours. As part of the support programme you don’t have to worry about upgrades – it’s done as an atomic function of the system. And there’s no need to worry about software licensing vs hardware capacity: it’s all handled as a single, atomic function, too. For sure, you can still build your own backup systems, and many people will – but for businesses who want to hit the ground running in a new office or datacentre, or maybe replace some legacy three-tier backup architecture that’s limping along and costing hundreds of thousands a year just in servicing media servers (AKA “data funnel$”), IDPA is an ideal fit.

At DellEMC World, VMware running in AWS was announced – imagine that, just seamlessly moving virtual machines from your on-premises environment out to the world’s biggest public cloud as a simple operation, and managing the two seamlessly. That became a reality later in the year, and NetWorker and Avamar were the first products to support actual hypervisor level backup of VMware virtual machines running in a public cloud.

Thinking about public cloud, Data Domain Virtual Edition (DDVE) became available in both the Azure and AWS marketplaces for easy deployment. Just spin up a machine and get started with your protection. That being said, if you’re wanting to deploy backup in public cloud, make sure you check out my two-part article on why Architecture Matters: Part 1, and Part 2.

And still thinking about cloud – this time specifically about cloud object storage, you’ll want to remember the difference between Cloud Boost and Cloud Tier. Both can deliver exceptional capabilities to your backup environment, but they have different use cases. That’s something I covered off in this article.

There were some great announcements at re:Invent, AWS’s yearly conference, as well. Cloud Snapshot Manager was released, providing enterprise grade control over AWS snapshot policies. (Check out what I had to say about CSM here.) Also released in 2017 was DellEMC’s Data Domain Cloud Disaster Recovery, something I need to blog about ASAP in 2018 – that’s where you can actually have your on-premises virtual machine backups replicated out into a public cloud and instantiate them as a DR copy with minimal resources running in the cloud (e.g., no in-Cloud DDVE required).

2017 also saw the release of Enterprise Copy Data Analytics – imagine having a single portal that tracks your Data Domain fleet world wide, and provides predictive analysis to you about system health, capacity trending and insights into how your business is going with data protection. That’s what eCDA is.

NetWorker 9.2 and 9.2.1 came out as well during 2017 – that saw functionality such as integration with Data Domain Retention Lock, database integrated virtual machine image level backups, enhancements to the REST API, and a raft of other updates. Tighter integration with vRealize Automation, support for VMware image level backup in AWS, optimised object storage functionality and improved directives – the list goes on and on.

I’d be remiss if I didn’t mention a little bit of politics before I wrap up. Australia got marriage equality – I, myself, am finally now blessed with the challenge of working out how to plan a wedding (my boyfriend and I are intending to marry on our 22nd anniversary in late 2018 – assuming we can agree on wedding rings, of course), and more broadly, politics again around the world managed to remind us of the truth to that saying by the French Philosopher, Albert Camus: “A man without ethics is a wild beast loosed upon this world.” (OK, I might be having a pointed glance at Donald Trump over in America when I say that, but it’s still a pertinent thing to keep in mind across the political and geographic spectrums.)

2017 wasn’t just about introducing converged data protection appliances and convergent data protection, but it was also a year where more businesses started to look at hyperconverged administration teams as well. That’s a topic that will only get bigger in 2018.

The DellEMC data protection family got a lot of updates across the board that I haven’t had time to cover this year – Avamar 7.5, Boost for Enterprise Applications 4.5, Enterprise Copy Data Management (eCDM) 2, and DDOS 6.1! Now that I sit back and think about it, my January could be very busy just catching up on things I haven’t had a chance to blog about this year.

I saw some great success stories with NetWorker in 2017, something I hope to cover in more detail into 2018 and beyond. You can see some examples of great success stories here.

I also started my next pet project – reviewing ethical considerations in technology. It’s certainly not going to be just about backup. You’ll see the start of the project over at Fools Rush In.

And that’s where I’m going to leave 2017. It’s been a big year and I hope, for all of you, a successful year. 2018, I believe, will be even bigger again.

Basics – Prior Recovery Details

Dec 19, 2017
 

If you need to find out details about what has recently been recovered with a NetWorker server, there are a few different ways to achieve it.

NMC, of course, offers recovery reports. These are particularly good if you’ve got admins (e.g., security/audit people) who only access NetWorker via NMC – and they’re good as a baseline report for auditing teams. Remember that NMC will retain records for reporting for a user-defined period of time, via the Reports | Reports | Data Retention setting:

Reports Data Retention Menu

Data Retention Setting

The actual report you’ll usually want to run for recovery details is the ‘Recover Details’ report:

NMC Recover Details Report

The other way you can go about it is to use the command nsrreccomp, which retrieves details about completed recoveries from the NetWorker jobs database. Now, the jobs database is typically pruned every 72 hours in a default install (you can change the length of time on the jobs database). Getting a list of completed recoveries (successful or otherwise) is as simple as running the command nsrreccomp -L:

[root@orilla tmp]# nsrreccomp -L
name, start time, job id, completion status
recover, Tue Dec 19 10:28:25 2017(1513639705), 838396, succeeded
DNS_Serv_Rec_20171219, Tue Dec 19 10:39:19 2017(1513640359), 838404, failed
DNS_Server_Rec_20171219_2, Tue Dec 19 10:41:48 2017(1513640508), 838410, failed
DNS_Server_Rec_20171219_3, Tue Dec 19 10:43:55 2017(1513640635), 838418, failed
DNS_Server_Rec_20171219, Tue Dec 19 10:47:43 2017(1513640863), 838424, succeeded

You can see in the above list that I made a few recovery attempts that generated failures (deliberately picking a standalone ESX server that didn’t have a vProxy to service it as a recovery target, then twice forgetting to change my recovery destination) so that we could see the list includes successful and failed jobs.
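Because the listing is comma-separated, it’s also straightforward to post-process – for example, to pull out the job IDs of the failed recoveries for closer inspection. The sketch below simply embeds the output shown above as sample text; in practice you’d capture the stdout of nsrreccomp -L instead:

```python
# Parse `nsrreccomp -L` style output and list the failed recovery jobs.
# The sample text is the listing shown above, embedded for illustration.
sample = """\
name, start time, job id, completion status
recover, Tue Dec 19 10:28:25 2017(1513639705), 838396, succeeded
DNS_Serv_Rec_20171219, Tue Dec 19 10:39:19 2017(1513640359), 838404, failed
DNS_Server_Rec_20171219_2, Tue Dec 19 10:41:48 2017(1513640508), 838410, failed
DNS_Server_Rec_20171219_3, Tue Dec 19 10:43:55 2017(1513640635), 838418, failed
DNS_Server_Rec_20171219, Tue Dec 19 10:47:43 2017(1513640863), 838424, succeeded
"""

failed_jobs = []
for line in sample.splitlines()[1:]:                 # skip the header row
    fields = [f.strip() for f in line.split(",")]
    name, job_id, status = fields[0], fields[-2], fields[-1]
    if status == "failed":
        failed_jobs.append((name, job_id))

# Each failed job ID can then be fed to `nsrreccomp -R <jobid>`.
for name, job_id in failed_jobs:
    print(f"{name} failed - inspect with: nsrreccomp -R {job_id}")
```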

You’ll note the second-last field in each line of output is the Job ID associated with the recovery. You can use this with nsrreccomp to retrieve the full output of an individual job, viz.:

[root@orilla tmp]# nsrreccomp -R 838396
 Recovering 1 file from /tmp/ into /tmp/recovery
 Volumes needed (all on-line):
 Backup.01 at Backup_01
 Total estimated disk space needed for recover is 8 KB.
 Requesting 1 file(s), this may take a while...
 Recover start time: Tue 19 Dec 2017 10:28:25 AEDT
 Requesting 1 recover session(s) from server.
 129290:recover: Successfully established direct file retrieve session for save-set ID '1345653443' with adv_file volume 'Backup.01'.
 ./tostage.txt
 Received 1 file(s) from NSR server `orilla'
 Recover completion time: Tue 19 Dec 2017 10:28:25 AEDT

This job-output retrieval can be used for any recovery performed. For a virtual machine recovery it’ll be quite verbose – a snippet is as follows:

[root@orilla tmp]# nsrreccomp -R 838424
Initiating virtual machine 'krell_20171219' recovery on vCenter caprica.turbamentis.int using vProxy langara.turbamentis.int.
152791:nsrvproxy_recover: vProxy Log Begins ===============================================
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] Logging to '/opt/emc/vproxy/runtime/logs/vrecoverd/RecoverVM-5e9719e5-61bf-4e56-9b68-b4931e2af5b2.log' on host 'langara'.
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] Release: '2.1.0-17_1', Build number: '1', Build date: '2017-06-21T21:02:28Z'
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] Changing log level from INFO to TRACE.
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 INFO: [@(#) Build number: 67] Created RecoverVM session for "krell_20171219", logging to "/opt/emc/vproxy/runtime/logs/vrecoverd/RecoverVM-5e9719e5-61bf-4e56-9b68-b4931e2af5b2.log"...
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] Starting restore of VM "krell_20171219". Logging at level "TRACE" ...
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] RecoverVmSessions supplied by client:
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] {
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] "Config": {
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] "SessionId": "5e9719e5-61bf-4e56-9b68-b4931e2af5b2",
159373:nsrvproxy_recover: vProxy Log: 2017/12/19 10:47:19 NOTICE: [@(#) Build number: 67] "LogTag": "@(#) Build number: 67",

Now, if you’d like more than 72 hours retention in your jobs database, you can set it to a longer period of time (though don’t get too excessive – you’re better off capturing jobs database details periodically and saving them to an alternate location than trying to get the NetWorker server to retain a huge jobs database) via the server properties. This can be done using nsradmin or NMC – NMC shown below for reference:

NMC Server Properties

NMC Server Properties JobsDB Retention
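If you do take the suggested approach of periodically capturing jobs database details to an alternate location, the idea can be sketched in a few lines of Python. This is an illustration only – snapshot_jobsdb is a hypothetical helper, and the nsradmin invocation (including the -p nsrexec flag) is an assumption that may vary by NetWorker release:

```python
import datetime
import pathlib
import subprocess

def snapshot_jobsdb(server, out_dir, runner=subprocess.run):
    """Capture jobs database details to a timestamped file in an alternate
    location, per the advice above. The nsradmin program and flags used here
    are assumptions - check the command reference for your NetWorker release."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    out = pathlib.Path(out_dir) / f"jobsdb-{stamp}.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("wb") as f:
        # Query the service holding the jobs database and persist the output
        runner(["nsradmin", "-s", server, "-p", "nsrexec"], stdout=f, check=True)
    return out
```

Run from cron or Task Scheduler, something like this keeps a rolling archive of job results without requiring the NetWorker server to hold a huge jobs database itself.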

There’s not much else to grabbing recovery results out of NetWorker – it’s certainly useful to know of both the NMC and command line options available to you. (Of course, if you want maximum flexibility on recovery reporting, you should definitely check out Data Protection Advisor (DPA), which is available automatically under any of the Data Protection Suite licensing models.)

 

Talking about Ransomware

 Architecture, Backup theory, General thoughts, Recovery, Security  Comments Off on Talking about Ransomware
Sep 06 2017
 

The “Wannacry” Ransomware strike saw a particularly large number of systems infected and garnered a great deal of media attention.


As you’d expect, many companies discussed ransomware and their solutions for it. There was also backlash from many quarters suggesting people were using a ransomware attack to unethically spruik their solutions. It almost seems to be the IT equivalent of calling lawyers “ambulance chasers”.

We are (albeit briefly, I am sure), between major ransomware outbreaks. So, logically that’ll mean it’s OK to talk about ransomware.

Now, there are a few things to note about ransomware and defending against it. It’s not as simple as “I only have to do X and I’ll solve the problem”. It’s a multi-layered issue requiring user education, appropriate systems patching, appropriate security, appropriate data protection, and so on.

Focusing even on data protection, that’s a multi-layered approach as well. In order to have a data protection environment that can assuredly protect you from ransomware, you need to do the basics, such as operating system level protection for backup servers, storage nodes, etc. That’s just the beginning. The next step is making sure your backup environment itself follows appropriate security protocols. That’s something I’ve been banging on about for several years now. That’s not the full picture though. Once you’ve got operating systems and backup systems secured via best practices, you need to then look at hardening your backup environment. There’s a difference between standard security processes and hardened security processes, and if you’re worried about ransomware this is something you should be thinking about doing. Then, of course, if you really want to ensure you can recover your most critical data from a serious hactivism and ransomware (or outright data destruction) breach, you need to look at IRS as well.

But let’s step back, because I think it’s important to make a point here about when we can talk about ransomware.

I’ve worked in data protection my entire professional career. (Even when I was a system administrator for the first four years of it, I was the primary backup administrator as well. It’s always been a focus.)

If there’s one thing I’ve observed in my career in data protection is that having a “head in the sand” approach to data loss risk is a lamentably common thing. Even in 2017 I’m still hearing things like “We can’t back this environment up because the project which spun it up didn’t budget for backup”, and “We’ll worry about backup later”. Not to mention the old chestnut, “it’s out of warranty so we’ll do an Icarus support contract“.

Now the flipside of the above paragraph is this: if things go wrong in any of those situations, suddenly there’s a very real interest in talking about options to prevent a future issue.

It may be a career limiting move to say this, but I’m not in sales to make sales. I’m in sales to positively change things for my customers. I want to help customers resolve problems, and deliver better outcomes to their users. I’ve been doing data protection for over 20 years. The only reason someone stays in data protection that long is because they’re passionate about it, and the reason we’re passionate about it is because we are fundamentally averse to data loss.

So why do we want to talk about defending against or recovering from ransomware during a ransomware outbreak? It’s simple. At the point of a ransomware outbreak, there’s a few things we can be sure of:

  • Business attention is focused on ransomware
  • People are talking about ransomware
  • People are being directly impacted by ransomware

This isn’t ambulance chasing. This is about making the best of a bad situation – I don’t want businesses to lose data, or have it encrypted and see them have to pay a ransom to get it back – but if they are in that situation, I want them to know there are techniques and options to prevent it from striking them again. And at that point in time – during a ransomware attack – people are interested in understanding how to stop it from happening again.

Now, we have to still be considerate in how we discuss such situations. That’s a given. But it doesn’t mean the discussion can’t be had.

To me this is also an ethical consideration. Too often the focus on ethics in professional IT is around the basics: don’t break the law (note: law ≠ ethics), don’t be sexist, don’t be discriminatory, etc. That’s not really a focus on ethics, but a focus on professional conduct. Focusing on professional conduct is good, but there must also be a focus on the ethical obligations of protecting data. It’s my belief that if we fail to make the best of a bad situation to get an important message of data protection across, we’re failing our ethical obligations as data protection professionals.

Of course, in an ideal world, we’d never need to discuss how to mitigate or recover from a ransomware outbreak during said outbreak, because everyone would already be protected. But harking back to an earlier point, I’m still being told production systems were installed without consideration for data protection, so I think we’re a long way from that point.

So I’ll keep talking about protecting data from all sorts of loss situations, including ransomware, and I’ll keep having those discussions before, during and after ransomware outbreaks. That’s my job, and that’s my passion: data protection. It’s not gloating, it’s not ambulance chasing, it’s let’s make sure this doesn’t happen again.


On another note, sales are really great for my book, Data Protection: Ensuring Data Availability, released earlier this year. I have to admit, I may have squealed a little when I got my first royalty statement. So, if you’ve already purchased my book: you have my sincere thanks. If you’ve not, that means you’re missing out on an epic story of protecting data in the face of amazing odds. So check it out, it’s in eBook or Paperback format on Amazon (prior link), or if you’d prefer to, you can buy direct from the publisher. And thanks again for being such an awesome reader.

Jul 21 2017
 

I want to try something different with this post. Rather than the usual post with screen shots and descriptions, I wanted instead to do a demo video showing just how easy it is to do file level recovery (FLR) from NetWorker VMware Image Level Backup thanks to the new NVP or vProxy system in NetWorker 9.

The video below steps you through the entire FLR process for a Linux virtual machine. (If your YouTube settings don’t default to it, be sure to switch the video to High Def (720) or otherwise the text on the console and within NMC may be difficult to read.)

Don’t forget – if you find the information on the NetWorker Blog useful, I’m sure you’ll get good value out of my latest book, Data Protection: Ensuring Data Availability.

Jul 11 2017
 

NetWorker 9 modules for SQL, Exchange and SharePoint now make use of ItemPoint to support granular recovery.

ItemPoint leverages NetWorker’s ability to live-mount a database or application backup from compatible media, such as Advanced File Type devices or Data Domain Boost.

I thought I’d step through the process of performing a table level recovery out of a SQL Server backup – as you’ll see below, it’s remarkably straightforward to run granular recoveries in the new configuration. For my lab setup, I installed the Microsoft 180 day evaluation* edition of Windows Server 2012 R2 and, in the same spirit, the 180 day evaluation edition of SQL Server 2014 (Standard).

Next off, I created a database and within that database, a table. I grabbed a list of English-language dictionary words and populated a table with rows consisting of the words and a unique ID key – just for something simple to test with.
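For anyone wanting to replicate the test data, here’s a minimal sketch. The original used SQL Server, but SQLite is shown purely so it runs anywhere; the table name and word list are illustrative stand-ins:

```python
import sqlite3

# Sketch of the test table: rows of dictionary words with a unique ID key.
# (The original was built in SQL Server; SQLite is used here for portability.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ImportantData (id INTEGER PRIMARY KEY, word TEXT NOT NULL)")
words = ["aardvark", "abacus", "zephyr", "zygote"]  # stand-in for a full dictionary list
conn.executemany("INSERT INTO ImportantData (word) VALUES (?)", [(w,) for w in words])
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM ImportantData").fetchone()[0]
```

A real test load would use a full dictionary file, giving you hundreds of thousands of rows to recover – enough to make the later table-level restore meaningful.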

Installing NetWorker on the Client

After getting the database server and a database ready, the next process was to install the NetWorker client within the Windows instance in order to do backup and recovery. After installing the standard NetWorker filesystem client using the base NetWorker for Windows installer, I went on to install the NetWorker Module for Microsoft Applications, choosing the SQL option.

In case you haven’t installed a NMM v9 plugin yet, I thought I’d annotate/show the install process below.

After you’ve unpacked the NMM zip file, you’ll want to run the appropriate setup file – in this case, NWVSS.

NMM SQL Install 01

You’ll have to do the EULA acceptance, of course.

NMM SQL Install 02

After you’ve agreed and clicked Next, you’ll get to choose what options in NMM you want to install.

NMM SQL Install 03

I chose to run the system configuration checker, and you definitely should too. This is an absolute necessity in my mind – the configuration checker will tell you if something isn’t going to work. It works through a gamut of tests to confirm that the system you’re attempting to install NMM on is compatible, and provides guidance if any of those tests aren’t passed. Obviously as well, since I wanted to do SQL backup and recovery, I also selected the Microsoft SQL option. After this, you click Check to start the configuration check process.

Depending on the size and scope of your system, the configuration checker may take a few minutes to run, but after it completes, you’ll get a summary report, such as below.

NMM SQL Install 04

Make sure to scroll through the summary and confirm there are no errors reported. (Errors will have a result of ‘ERROR’ and will be shown in red.) If there is an error reported, you can click the ‘Open Detailed Report…’ button to open up the full report and see what actions may be available to rectify the issue. In this case, the check was successful, so it was just a case of clicking ‘Next >’ to continue.

NMM SQL Install 05

Next you have to choose whether to configure the Windows firewall. If you’re using a third party firewall product, you’ll typically want to do the firewall configuration manually and choose ‘Do not configure…’. Choose the appropriate option for your environment and click ‘Next >’ to continue again.

NMM SQL Install 06

Here’s where you get to the additional options for the plugin install. I chose to enable the SQL Granular Recovery option, and enabled all the SQL Server Management Studio options, per the above. You’ll get a warning when you click Next here to ensure you’ve got a license for ItemPoint.

NMM SQL Install 07

I verified I did have an ItemPoint license and clicked Yes to continue. If you’re going with granular recovery, you’ll be prompted next for the mount point directories to be used for those recoveries.

NMM SQL Install 08

In this, I was happy to accept the default options and actually start the install by clicking the ‘Install >’ button.

NMM SQL Install 09

The installer will then do its work, and when it completes you’ll get a confirmation window.

NMM SQL Install 10

That’s the install done – the next step of course is configuring a client resource for the backup.

Configuring the Client in NMC

The next step is to create a client resource for the SQL backups. Within NMC, go into the configuration panel, right-click on Clients and choose to create a new client via the wizard. The sequence I went through was as follows.

NMM SQL Config 01

Once you’ve typed the client name in, NetWorker is going to be able to reach out to the client daemons to coordinate configuration. My client was ‘win02’, and as you can see from the client type, a ‘Traditional’ client was the one to pick. Clicking ‘Next >’, you get to choose what sort of backup you want to configure.

NMM SQL Config 02

At this point the NetWorker server has contacted the client nsrexecd process and identified what backup/recovery options there are installed on the client. I chose ‘SQL Server’ from the available applications list. ‘Next >’ to continue.

NMM SQL Config 03

I didn’t need to change any options here (I wanted to configure a VDI backup rather than a VSS backup, so I left ‘Block Based Backup’ disabled). Clicking ‘Next >’ from here lets you choose the databases you want to backup.

NMM SQL Config 04

I wanted to backup everything – the entire WIN02 instance, so I left WIN02 selected and clicked ‘Next >’ to continue the configuration.

NMM SQL Config 05

Here you’ll be prompted for the access credentials for the SQL backups. Since I don’t run Active Directory at home, I used Windows authentication – in this case, the ‘Administrator’ username and password – but you can change this to whatever your environment requires. Once you’ve got the correct authentication details entered, ‘Next >’ to continue.

NMM SQL Config 06

Here’s where you get to choose SQL specific options for the backup. I elected to skip simple databases for incremental backups, and enabled 6-way striping for backups. ‘Next >’ to continue again.

NMM SQL Config 07

The Wizard then prompts you to confirm your configuration options, and I was happy with them, so I clicked ‘Create’ to actually have the client resource created in NetWorker.

NMM SQL Config 08

The resource was configured without issue, so I was able to click Finish to complete the wizard. After this, it was just a case of adding the client to an appropriate policy and then running that policy from within NMC’s monitoring tab.

NMM SQL Config 09

And that was it – module installed, client resource configured, and backup completed. Next – recovery!

Doing a Granular Recovery

To do a granular recovery – a table recovery – I jumped across via remote desktop to the Windows host and launched SQL Management Studio. First thing, of course, was to authenticate.

NMM SQL GLR 01

Once I’d logged on, I clicked the NetWorker plugin option, highlighted below:

NMM SQL GLR 02

That brought up the NetWorker plugin dialog, and I went straight to the Table Restore tab.

NMM SQL GLR 03

In the table restore tab, I chose the NetWorker server, the SQL server host, the SQL instance, then picked the database I wanted to restore from, as well as the backup. (Because there was only one backup, that was a pretty simple choice.) Next was to click Run to initiate the recovery process. Don’t worry – the Run here refers to running the mount; nothing is actually recovered yet.

NMM SQL GLR 04

While the mounting process runs you’ll get output of the process as it is executing. As soon as the database backup is mounted, the ItemPoint wizard will be launched.

NMM SQL GLR 05

When ItemPoint launches, it’ll prompt via the Data Wizard for the source of the recovery. In this case, work with the NetWorker defaults, as the source type (Folder) and Source Folder will be automatically populated as a result of the mount operation previously performed.

NMM SQL GLR 06

You’ll be prompted to provide the SQL Server details here and whether you want to connect to a single database or the entire server. In this case, I went with just the database I wanted – the Silence database. Clicking Finish then opens up the data browser for you.

NMM SQL GLR 07

You’ll see the browser interface is pretty straight forward – expand the backup down to the Tables area so you can select the table you want to restore.

NMM SQL GLR 08

Within ItemPoint, you don’t so much restore a table as copy it out of the backup region. So you literally can right-click on the table you want and choose ‘Copy’.

NMM SQL GLR 09

Logically then the next thing you do is go to the Target area and choose to paste the table.

NMM SQL GLR 10

Because that table still existed in the database, I was prompted to confirm what the pasted table would be called – in this case, just dbo.ImportantData2. Clicking OK then kicks off the data copy operation.

NMM SQL GLR 11

Here you can see the progress indicator for the copy operation. It keeps you up to date on how many rows have been processed, and the amount of time it’s taken so far.

NMM SQL GLR 12

At the end of the copy operation, you’ll have details provided about how many rows were processed, when it was finished and how long it took to complete. In this case I pulled back 370,101 rows in 21 seconds. Clicking Close will return you to the NetWorker Plugin where the backup will be dismounted.

NMM SQL GLR 13

And there you have it. Clicking “Close” will close down the plugin in SQL Management Studio, and your table level recovery has been completed.

ItemPoint GLR for SQL Server is really quite straight forward, and I heartily recommend the investment in the ItemPoint aspect of the plugin so as to get maximum benefit out of your SQL, Exchange or SharePoint backups.


* I have to say, it really irks me that Microsoft don’t have any OS pricing for “non-production” use. I realise the why – that way too many licenses would be finagled into production use. But it makes maintaining a home lab environment a complete pain in the posterior. Which is why, folks, most of my posts end up being around Linux, since I can run CentOS for free. I’d happily pay a couple of hundred dollars for Windows server licenses for a lab environment, but $1000-$2000? Ugh. I only have limited funds for my home lab, and it’s no good exhausting your budget on software if you then don’t have hardware to run it on…

NetWorker 9.1 FLR Web Interface

 NVP, Recovery, vProxy  Comments Off on NetWorker 9.1 FLR Web Interface
Apr 04 2017
 

Hey, don’t forget, my new book is available. Jam packed with information about protecting across all types of RPOs and RTOs, as well as helping out on the procedural and governance side of things. Check it out today on Amazon! (Kindle version available, too.)


In my introductory NetWorker 9.1 post, I covered file level recovery (FLR) from VMware image level backup via NMC. I felt at the time that it was worthwhile covering FLR from within NMC as the VMware recovery integration in NMC was new with 9.1. But at the same time, the FLR Web interface for NetWorker has also had a revamp, and I want to quickly run through that now.

First, the most important aspect of FLR from the new NetWorker Virtual Proxy (NVP, aka “vProxy”) is not something you do by browsing to the Proxy itself. In this updated NetWorker architecture, the proxies are very much dumb appliances, completely disposable, with all the management intelligence coming from the NetWorker server itself.

Thus, to start a web based FLR session, you actually point your browser to:

https://nsrServer:9090/flr

The FLR web service now runs on the NetWorker server itself. (In this sense quite similarly to the FLR service for Hyper-V.)
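A trivial helper makes the point explicit – the URL is always built from the NetWorker server’s name, never a proxy’s. (flr_url is a hypothetical convenience function; 9090 is the port noted above.)

```python
def flr_url(nsr_server, port=9090):
    """Build the FLR web UI URL. It targets the NetWorker server itself,
    not the (dumb, disposable) vProxy appliances."""
    return f"https://{nsr_server}:{port}/flr"
```

For example, flr_url("orilla") would give you the address to paste into any browser with network access to the NetWorker server – which, as noted below, need not be a host in the datazone at all.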

The next major change is you no longer have to use the FLR interface from a system currently getting image based backups. In fact, in the example I’m providing today, I’m doing it from a laptop that isn’t even a member of the NetWorker datazone.

When you get to the service, you’ll be prompted to login:

01 Initial Login

For my test, I wanted to access via the Administration interface, so I switched to ‘Admin’ and logged on as the NetWorker owner:

02 Logging In as Administrator

After you login, you’re prompted to choose the vCenter environment you want to restore from:

03 Select vCenter

Selecting the vCenter server of course lets you then choose the protected virtual machine in that environment to be recovered:

04 Select VM and Backup

(Science fiction fans will perhaps be able to intuit my host naming convention for production systems in my home lab based on the first three virtual machine names.)

Once you’ve selected the virtual machine you want to recover from, you then get to choose the backup to recover – you’ll see a list of backups, and of clones if you’re cloning. In the above example I’ve got no clones of the specific virtual machine that’s been protected. Clicking ‘Next’ after you’ve selected the backup will prompt you for access credentials for the virtual machine, so that the FLR agent can mount the backup:

05 Provide Credentials for VM

Once you provide the login credentials (and they don’t have to be local – they can be an AD specified login by using the domain\account syntax), the backup will be mounted, then you’ll be prompted to select where you want to recover to:

06 Select Recovery Location

In this case I selected the same host, recovering back to C:\tmp.

Next you obviously need to select the file(s) and folder(s) you want to recover. In this case I just selected a single file:

07 Select Content to Recover

Once you’ve selected the file(s) and folder(s) you want to recover, click the Restore button to start the recovery. You’ll be prompted to confirm:

08 Confirm Recovery

The restore monitor is accessible via the bottom of the FLR interface – click the upward-pointing arrowhead to expand it. This gives you a view of a running or, in this case, completed restore; since it was only a single file, it took very little time to complete:

09 Recovery Success

My advice generally is that if you want to recover thousands or tens of thousands of files, you’re better off using the NMC interface (particularly if the NetWorker server doesn’t have a lot of RAM allocated to it), but for smaller collections of files the FLR web interface is more than acceptable.

And Flash-free, of course.

There you have it, the NetWorker 9.1 VMware FLR interface.




 

What to do on world backup day

 Backup theory, Best Practice, Recovery  Comments Off on What to do on world backup day
Mar 30 2017
 

World backup day is approaching. (A few years ago now, someone came up with the idea of designating one day of the year to recognise backups.) Funnily enough, I’m not a fan of world backup day, simply because we don’t backup for the sake of backing up, we backup to recover.

Every day should, in fact, be world backup day.

Something that isn’t done enough – isn’t celebrated enough, isn’t tested enough – is recovery. For many organisations, recovery tests consist of simply doing a recovery when requested, and things like long term retention backups are rarely tested, and even more rarely recovered from.


So this Friday, March 31, I’d like to suggest you don’t treat as World Backup Day, but World Recovery Test Day. Use the opportunity to run a recovery test within your organisation (following proper processes, of course!) – preferably a recovery that you don’t normally run in terms of day to day operations. People only request file recoveries? Sounds like a good reason to run an Exchange, SQL or Oracle recovery to me. Most recoveries are Exchange mail level recoveries? Excellent, you know they work, let’s run a recovery of a complete filesystem somewhere.

All your recoveries are done within a 30 day period of the backup being taken? That sounds like an excellent idea to do the recovery from an LTR backup written 2+ years ago, too.

Part of running a data protection environment is having routine tests to validate ongoing successful operations, and be able to confidently report back to the business that everything is OK. There’s another, personal and selfish aspect to it, too. It’s one I learnt more than a decade ago when I was still an on-call system administrator: having well-tested recoveries means that you can sleep easily at night, knowing that if the pager or mobile phone does shriek you into blurry-eyed wakefulness at 1am, you can in fact log onto the required server and run the recovery without an issue.
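One way to make such tests routine is to rotate through recovery categories on a schedule, so that nothing – especially LTR recoveries – goes permanently untested. A minimal sketch, where the categories and weekly cadence are purely illustrative:

```python
RECOVERY_TESTS = [
    "single file recovery",
    "complete filesystem recovery",
    "application recovery (Exchange/SQL/Oracle)",
    "long term retention (2+ year old) recovery",
]

def recovery_test_for_week(date):
    """Pick a recovery test category for the given week, cycling through
    all categories so each one is exercised regularly."""
    week = date.isocalendar()[1]  # ISO week number
    return RECOVERY_TESTS[week % len(RECOVERY_TESTS)]
```

The exact rotation matters far less than the discipline: a documented schedule means recovery testing happens by design rather than only when someone requests a restore.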

So this World Backup Day, do a recovery test.


The need to have an efficient and effective testing system is something I cover in more detail in Data Protection: Ensuring Data Availability. If you want to know more, feel free to check out the book on Amazon or CRC Press. Remember that it doesn’t matter how good the technology you deploy is if you don’t have the processes and training to use it.

Feb 12 2017
 

On January 31, GitLab suffered a significant issue resulting in a data loss situation. In their own words, the replica of their production database was deleted, the production database was then accidentally deleted, then it turned out their backups hadn’t run. They got systems back with snapshots, but not without permanently losing some data. This in itself is an excellent example of the need for multiple data protection strategies; your data protection should not represent a single point of failure within the business, so having layered approaches to achieve a variety of retention times, RPOs, RTOs and the potential for cascading failures is always critical.

To their credit, they’ve published a comprehensive postmortem of the issue and Root Cause Analysis (RCA) of the entire issue (here), and must be applauded for being so open with everything that went wrong – as well as the steps they’re taking to avoid it happening again.


But I do think some of the statements in the postmortem and RCA require a little more analysis, as they’re indicative of some of the challenges that take place in data protection.

I’m not going to speak to the scenario that led to the production, rather than replica database, being deleted. This falls into the category of “ooh crap” system administration mistakes that sadly, many of us will make in our careers. As the saying goes: accidents happen. (I have literally been in the situation of accidentally deleting a production database rather than its replica, and I can well and truly sympathise with any system or application administrator making that mistake.)

Within GitLab’s RCA under “Problem 2: restoring GitLab.com took over 18 hours”, several statements were made that irk me as a long-term data protection specialist:

Why could we not use the standard backup procedure? – The standard backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6.

As evidenced by a later statement (see the next RCA answer below), the procedure did not fail silently; instead, GitLab filtered the output of the backup process in a way that was not monitored. There is, quite simply, a significant difference between a command failing silently and its results being silently ignored – the latter is the far more accurate description here. A command that fails silently is one that exits with no error condition or alert. Instead:

Why did the backup procedure fail silently? – Notifications were sent upon failure, but because of the Emails being rejected there was no indication of failure. The sender was an automated process with no other means to report any errors.

The pg_dump command didn’t fail silently, as previously asserted. It generated output which was silently ignored due to a system configuration error. Yes, a system failed to accept the emails, and a system therefore failed to send the emails, but at the end of the day, a human failed to see or otherwise check as to why the backup reports were not being received. This is actually a critical reason why we need zero error policies – in data protection, no error should be allowed to continue without investigation and rectification, and a change in or lack of reporting or monitoring data for data protection activities must be treated as an error for investigation.
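A zero error policy can be partly enforced mechanically: capture all backup output, persist it somewhere durable, and treat any non-zero exit as a hard error demanding investigation. A minimal Python sketch – run_backup and its paths are hypothetical, not GitLab’s actual tooling:

```python
import subprocess

def run_backup(argv, log_path):
    """Run a backup command without silently ignoring its results: capture
    all output, persist it, and treat a non-zero exit as a hard error -
    a minimal 'zero error policy' enforcement sketch."""
    result = subprocess.run(argv, capture_output=True, text=True)
    with open(log_path, "w") as f:
        f.write(result.stdout)
        f.write(result.stderr)
    if result.returncode != 0:
        raise RuntimeError(f"backup exited {result.returncode}; see {log_path}")
    return log_path
```

The point is that the failure path raises an exception a scheduler or monitoring system will see, rather than relying on an email that may never be delivered.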

Why were Azure disk snapshots not enabled? – We assumed our other backup procedures were sufficient. Furthermore, restoring these snapshots can take days.

Simple lesson: If you’re going to assume something in data protection, assume it’s not working, not that it is.

Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing the procedure.

There are two sections of the answer that should serve as a dire warning: “there was no ownership”, “nobody was responsible”. This is a mistake many businesses make, but I don’t for a second believe there was no ownership. Instead, there was a failure to understand ownership. Looking at the “Team | GitLab” page, I see:

  • Dmitriy Zaporozhets, “Co-founder, Chief Technical Officer (CTO)”
    • From a technical perspective the buck stops with the CTO. The CTO does own the data protection status for the business from an IT perspective.
  • Sid Sijbrandij, “Co-founder, Chief Executive Officer (CEO)”
    • From a business perspective, the buck stops with the CEO. The CEO does own the data protection status for the business from an operational perspective, and from having the CTO reporting directly up.
  • Bruce Armstrong and Villi Iltchev, “Board of Directors”
    • The Board of Directors is responsible for ensuring the business is running legally, safely and financially securely. They indirectly own all procedures and processes within the business.
  • Stan Hu, “VP of Engineering”
    • Vice-President of Engineering, reporting to the CEO. If the CTO sets the technical direction of the company, an engineering or infrastructure leader is responsible for making sure the company’s IT works correctly. That includes data protection functions.
  • Pablo Carranza, “Production Lead”
    • Reporting to the Infrastructure Director (a position currently open). Data protection is a production function.
  • Infrastructure Director:
    • Currently assigned to Sid (see above), as an open position, the infrastructure director is another link in the chain of responsibility and ownership for data protection functions.

I’m not calling these people out to shame them, or rub salt into their wounds – mistakes happen. But I am suggesting GitLab has abdicated its collective responsibility by simply stating “there was no ownership”, when in fact, as evidenced by their “Team” page, there was. There was plenty of ownership; it was simply not understood along the technical lines of the business, nor up through the senior operational lines of the business.

You don’t get to say that no-one owned the data protection functions. Only that no-one understood they owned the data protection functions. One day we might stop having these discussions. But clearly not today.

 
