Mar 13 2017

The NetWorker usage report for 2016 is now complete and available here. As with previous years' surveys, this one ran from December 1, 2016 through to January 1, 2017.


There were some interesting statistics and trends arising from this survey. The percentage of businesses not using backup to disk in at least some form fell to just 1% of respondents. That’s 99% of respondents having some form of backup to disk within their environment!

More and more respondents are cloning within their environments – if you’re not cloning in yours, you’re falling behind the curve now in terms of ensuring your backup environment can’t be a single point of failure.

There’s plenty of other results and details in the survey report you may be interested in, including:

  • Changes to the number of respondents using dedicated backup administrators
  • Cloud adoption rates
  • Ransomware attacks
  • The likelihood of businesses using or planning to use object storage as part of their backup environment
  • and many more

You can download the survey from the link above.

Just a reminder: “Data Protection: Ensuring Data Availability” is out now, and you can buy it in both paperback and electronic format from Amazon, or in paperback from the publisher, CRC Press. If you’ve enjoyed or found my blog useful, I’m sure you’ll find value in my latest book, too!

One respondent from this year’s survey will be receiving a signed copy of the book directly from me, too! That winner has been contacted.

Build vs Buy

 Architecture, Backup theory, Best Practice
Feb 18 2017

Converged, and even more so hyperconverged, computing is premised around the notion of build vs buy. Are you better off having your IT staff build your infrastructure from the ground up, managing it in silos of teams, or do you want to buy tightly integrated kit, land it on the floor and start using it immediately?

Dell-EMC’s team use the analogy – do you build your car, or do you buy it? I think this is a good analogy: it speaks to how the vast majority of car users consume vehicle technology. They buy a complete, engineered car as a package, and drive it off the car sales lot complete. Sure, there are tinkerers who might like to build a car from scratch, but they’re not the average consumer. For me it’s a bit like personal computing – I gave up years ago wanting to build my own computers. I’m not interested in buying CPUs, RAM, motherboards, power supplies, etc., and dealing with the landmines of compatibility, drivers and physical installation before I can get a usable piece of equipment.

This is where many people believe IT is moving, and there’s some common sense in it – it’s about time to usefulness.

A question I’m periodically posed is – what has backup got to do with the build vs buy aspect of hyperconverged? For one, it’s not just backup – it’s data protection – but secondly, it has everything to do with hyperconverged.

If we return to that build vs buy example – would you build a car or buy a car? – let me ask a question of you as a car consumer, a buyer rather than a builder. Would you get airbags included, or would you search around for third party airbags?


To be honest, I’m not aware of anyone who buys a car, drives it off the lot, and starts thinking, “Do I go to Airbags R Us, or Art’s Airbag Emporium to get my protection?”

That’s because the airbags come built-in.

For me at least, that’s the crux of the matter in the converged and hyperconverged market. Do you want third party airbags that you have to install and configure yourself, and hope they work with that integrated solution you’ve bought, or do you want airbags included and installed as part of the purchase?

You buy a hyperconverged solution because you want integrated virtualisation, integrated storage, integrated configuration, integrated management, integrated compute, integrated networking. Why wouldn’t you also want integrated data protection? Integrated data protection that’s baked into the service catalogue and part of the kit as it lands on your floor. If it’s about time to usefulness it doesn’t stop at the primary data copy – it should also include the protection copies, too.

Airbags shouldn’t be treated as optional, after-market extras, and neither should data protection.

Feb 12 2017

On January 31, GitLab suffered a significant issue resulting in a data loss situation. In their own words, the replica of their production database was deleted, the production database was then accidentally deleted, and then it turned out their backups hadn’t run. They got systems back with snapshots, but not without permanently losing some data. This in itself is an excellent example of the need for multiple data protection strategies; your data protection should not represent a single point of failure within the business, so having layered approaches to cover a variety of retention times, RPOs and RTOs – and to guard against cascading failures – is always critical.

To their credit, they’ve published a comprehensive postmortem of the issue and Root Cause Analysis (RCA) of the entire issue (here), and must be applauded for being so open with everything that went wrong – as well as the steps they’re taking to avoid it happening again.


But I do think some of the statements in the postmortem and RCA require a little more analysis, as they’re indicative of some of the challenges that take place in data protection.

I’m not going to speak to the scenario that led to the production, rather than replica database, being deleted. This falls into the category of “ooh crap” system administration mistakes that sadly, many of us will make in our careers. As the saying goes: accidents happen. (I have literally been in the situation of accidentally deleting a production database rather than its replica, and I can well and truly sympathise with any system or application administrator making that mistake.)

Within GitLab’s RCA under “Problem 2: restoring took over 18 hours”, several statements were made that irk me as a long-term data protection specialist:

Why could we not use the standard backup procedure? – The standard backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6.

As evidenced by a later statement (see the next RCA statement below), the procedure did not fail silently; instead, GitLab chose to filter the output of the backup process in a way that they did not monitor. There is, quite simply, a significant difference between failing silently and having output silently ignored. The latter is the far more accurate description here. A command that fails silently is one that exits with no error condition or alert. Instead:

Why did the backup procedure fail silently? – Notifications were sent upon failure, but because of the Emails being rejected there was no indication of failure. The sender was an automated process with no other means to report any errors.

The pg_dump command didn’t fail silently, despite the earlier assertion. It generated output which was silently ignored due to a system configuration error. Yes, a system failed to accept the emails, and a system therefore failed to send the emails, but at the end of the day, a human failed to notice, or otherwise check, why the backup reports were not being received. This is actually a critical reason why we need zero error policies – in data protection, no error should be allowed to continue without investigation and rectification, and a change in, or absence of, reporting or monitoring data for data protection activities must be treated as an error for investigation.
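
To make that zero error policy idea concrete, here’s a minimal “dead man’s switch” style sketch: rather than waiting for a failure email that may never arrive, it treats the absence of a recent backup report as an error in its own right. The report directory, file naming and age threshold are all hypothetical – in practice you’d point something like this at whatever your backup product actually produces.

```python
#!/usr/bin/env python3
"""Minimal sketch of a zero error policy check: the *absence* of a backup
report is itself treated as an error. Paths and naming are hypothetical."""

import datetime
import pathlib
import sys

REPORT_DIR = pathlib.Path("/var/reports/backups")   # hypothetical report drop directory
MAX_AGE_HOURS = 26                                   # daily backups plus a little slack


def newest_report_age_hours(report_dir: pathlib.Path) -> float:
    """Return the age (in hours) of the most recent report file, or infinity."""
    if not report_dir.is_dir():
        return float("inf")
    reports = list(report_dir.glob("*.report"))
    if not reports:
        return float("inf")
    newest = max(r.stat().st_mtime for r in reports)
    age_seconds = datetime.datetime.now().timestamp() - newest
    return age_seconds / 3600.0


if __name__ == "__main__":
    age = newest_report_age_hours(REPORT_DIR)
    if age > MAX_AGE_HOURS:
        # No report (or a stale one) is an error in its own right: a human
        # must investigate, rather than assuming no news is good news.
        print(f"ERROR: no backup report received in {age:.0f} hours", file=sys.stderr)
        sys.exit(1)
    print(f"OK: latest backup report is {age:.1f} hours old")
```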

Why were Azure disk snapshots not enabled? – We assumed our other backup procedures were sufficient. Furthermore, restoring these snapshots can take days.

Simple lesson: If you’re going to assume something in data protection, assume it’s not working, not that it is.

Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing the procedure.

There are two phrases in the answer that should serve as a dire warning: “there was no ownership”, “nobody was responsible”. This is a mistake many businesses make, but I don’t for a second believe there was no ownership. Instead, there was a failure to understand ownership. Looking at the “Team | GitLab” page, I see:

  • Dmitriy Zaporozhets, “Co-founder, Chief Technical Officer (CTO)”
    • From a technical perspective the buck stops with the CTO. The CTO does own the data protection status for the business from an IT perspective.
  • Sid Sijbrandij, “Co-founder, Chief Executive Officer (CEO)”
    • From a business perspective, the buck stops with the CEO. The CEO does own the data protection status for the business from an operational perspective, not least because the CTO reports directly to the CEO.
  • Bruce Armstrong and Villi Iltchev, “Board of Directors”
    • The Board of Directors is responsible for ensuring the business is running legally, safely and financially securely. They indirectly own all procedures and processes within the business.
  • Stan Hu, “VP of Engineering”
    • Vice-President of Engineering, reporting to the CEO. If the CTO sets the technical direction of the company, an engineering or infrastructure leader is responsible for making sure the company’s IT works correctly. That includes data protection functions.
  • Pablo Carranza, “Production Lead”
    • Reporting to the Infrastructure Director (a position currently open). Data protection is a production function.
  • Infrastructure Director:
    • Currently an open position, with its responsibilities falling back to Sid (see above); the Infrastructure Director is another link in the chain of responsibility and ownership for data protection functions.

I’m not calling these people out to shame them, or rub salt into their wounds – mistakes happen. But I am suggesting GitLab has abnegated its collective responsibility by simply suggesting “there was no ownership”, when in fact, as evidenced by their “Team” page, there was. In fact, there was plenty of ownership, but it was clearly not appropriately understood along the technical lines of the business, and indeed right up into the senior operational lines of the business.

You don’t get to say that no-one owned the data protection functions. Only that no-one understood they owned the data protection functions. One day we might stop having these discussions. But clearly not today.


Jan 11 2017

There are currently a significant number of vulnerable MongoDB databases being attacked by ransomware attackers, and even though the attacks are ongoing, it’s worth taking a moment or two to reflect on some key lessons that can be drawn from them.

If you’ve not heard of it, you may want to check out some of the details linked to above. The short summary though is that MongoDB’s default deployment model has been a rather insecure one, and it’s turned out there are a lot of unsecured public-facing databases out there. A lot of them have been hit by hackers recently, with the contents of the databases deleted and the owners told to pay a ransom to get their data back. Whether paying will actually get their data back is, of course, another issue.


The first lesson of course is that data protection is not a single topic. More so than a lot of other data loss situations, the MongoDB scenario points to the simple, root lesson for any IT environment: data protection is also a data security factor.


For the most part, when I talk about Data Protection I’m referring to storage protection – backup and recovery, snapshots, replication, continuous data protection, and so on. That’s the focus of my next book, as you might imagine. But a sister process in data protection has been and will always be data security. So in the first instance in the MongoDB attacks, we’re seeing the incoming threat vector entirely from the simple scenario of unsecured systems. A lackadaisical approach to security is exactly what’s happened – for developers and deployers alike – in the MongoDB space, and the result to date is estimated to be around 93TB of data wiped. That number will only go up.

The next lesson though is that backups are still needed. In The MongoDB attacks: 93 terabytes of data wiped out (linked again from above), Dissent writes that of 1138 victims analysed:

Only 13 report that they had recently backed up the now-wiped database; the rest reported no recent backups

That number is awful. Just over 11% of impacted sites had recent backups. That’s not data protection, that’s data recklessness. (And as the report mentions, 73% of the databases were flagged as being production.) In one instance:

A French healthcare research entity had its database with cancer research wiped out. They reported no recent backup.

That’s another lesson there: data protection isn’t just about bits and bytes, it’s about people’s lives. If we maintain data, we have an ethical obligation to protect it. What if that cancer data above held some clue, some key, to saving someone’s life? Data loss isn’t just data loss: it can lead to loss of money, loss of livelihood, or perhaps even loss of life.

Those details are from a sample of 118 sourced from a broader category of 27,000 hit systems.

So the next lesson is that even now, in 2017, we’re still having to talk about backup as if it’s a new thing. During the late 90s I thought there was a light at the end of the tunnel for discussions about “do I need backup?”; I’ve long since resigned myself to the fact I’ll likely still be having those conversations up until the day I retire. But it’s a chilling reminder of the ease with which systems can now be deployed without adequate protection. One of the common justifications you’ll hear for “we can’t back this up” – particularly for larger databases – is the time taken to complete a backup. That’s something Dell EMC has been focused on for a while now. There’s storage integrated data protection via ProtectPoint, and more recently there’s BoostFS for Data Domain, bringing distributed segment processing directly onto the database server for high speed deduplicated backups. (And yes, MongoDB was one of the systems in mind when BoostFS was developed.) If you’ve not heard of BoostFS yet, it was included in DDOS 6, released last year.

It’s not just backup though – for systems with higher criticality there should be multi-layered protection strategies: backups will give you potentially longer term retention and off-platform protection, but if you need really fast recovery times, with very low RPOs and RTOs, your system will likely need replication and snapshots as well. Data protection isn’t a “one size fits all” scenario that some might try to preach; it’s multi-layered and it can encompass a broad range of technology. (And if the data is super business critical you might even want to go the next level and add IRS protection for it, protecting yourself not only from conventional data loss, but also from situations where your business is hacked.)

The fallout and the data loss from the MongoDB attacks will undoubtedly continue for some time. If one thing comes out of it, I’m hoping it’ll be a stronger understanding from businesses in 2017 that data protection is still a very real topic.


A speculative lesson: what percentage of these MongoDB deployments fall under the banner of ‘Shadow IT’ – that is, systems deployed outside of IT by developers, other business groups and so on within organisations? Does this also serve as a reminder of the risks that can be introduced when non-IT groups deploy IT systems without appropriate processes and rigour? We may never know the breakdown between IT-led and Shadow IT-led deployments, but it’s certainly food for thought.

My cup runneth over

 Architecture, Backup theory, Best Practice
Nov 24 2016

How do you handle data protection storage capacity?

How do you handle growth – regular or unexpected – in your data protection volumes?

Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!

Is your business reactive or proactive to data protection capacity requirements?


In the land of tape, dealing with capacity growth in data protection was both easy and insidiously obfuscated. Tape capacity management is basically a modern version of Hilbert’s Infinite Hotel Paradox – you sort-of, kind-of never run out of capacity because you always just buy another box of tapes. Problem solved, right? (No, more a case of the can kicked down the road.) Problem “solved” and you’ve got 1,000, 10,000, 50,000 tapes in a multitude of media types that you don’t even have tape drives to read any more.

But let’s focus on the real world now: tape isn’t the de facto standard for backup systems any more – disk is. Disk gives us great power, but with great power comes great responsibility (sorry, even though I’m not a Spiderman fan, I couldn’t resist. Tape is the opposite: tape gives us no power, and with no power comes no responsibility – yes, I’m also a Kickass fan.)

For businesses that still do disk-to-disk-to-tape, where disk is treated more like a staging area and excess data is written out to tape, the problem is seemingly solved because – of course – you can always just buy another box of tapes and stage more data from disk backup storage to tape storage. Again, that’s kicking the can down the road. I’ve known businesses with company-wide data protection policies mandating up to 3 months of online recoverability from disk that have slipped to two weeks or less of data stored on disk, because the data to be protected kept growing, the storage was never scaled, and – you guessed it – tape was the interim solution.

Aside: When I joined my first Unix system administration team in 1996, the team had just recently configured an interim DNS server which they called tmp, because it was going to be quickly replaced by another server – called, for the short term, nc, for ‘new computer’. When I left in 2000, tmp and nc were still there; in fact, nnc (yes, new-new-computer) was deployed shortly thereafter to replace nc and eventually, a year or two after I left, tmp was finally decommissioned.

Interim solutions have a tendency to stick. In fact, it’s a common story – capacity problem with data protection so let’s deploy an interim solution and solve it later. Later-later. Much later. Much later-later. Ad-infinitum.

There is, undoubtedly, a growing maturity in handling data protection storage management and capacity planning coming out of the pure disk and disk/cloud storage formats. While this is driven by necessity, it’s also an important demonstration that IT processes need to mature as the business does.

If you’re new to pure disk based, or disk/cloud based data protection storage, you might want to stop and think carefully about your data protection policies and procurement processes/cycles so that you’re able to properly meet the requirements of the business. Here are a few tips I’ve learnt over the years…

80% is the new 100%

This one is easy. Don’t think of 100% capacity as being 100% capacity. Think of 80% as 100%. Why? Because you need runway to either procure storage, migrate data or get formal approval for changes to retention and backup policies. If you wait until you’re at 90, 95 or even 100% capacity, you’ve left your run too late and you’re just asking for many late or sleepless nights managing a challenge that could have been proactively dealt with earlier.

The key to management is measurement

I firmly believe you can’t manage something that has operational capacity constraints (e.g., “if we hit 100% capacity we can’t do more backups”) if you’re not actively measuring it. That doesn’t mean periodically logging into a console or running a “df -h” or whatever the “at a glance” look is for your data protection storage; it means capturing measurement data and having it available in both reports and dashboards so it is instantly visible.
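
As a rough illustration of what “capturing measurement data” can mean at its simplest, here’s a sketch of a daily capacity sampler that appends one row per day to a CSV file. The mount point and output path are hypothetical, and a real environment would more likely pull utilisation figures from the protection storage’s own API or reporting rather than a local filesystem view – but the principle of recording a regular sample somewhere a report or dashboard can read it is the same.

```python
#!/usr/bin/env python3
"""Sketch of a daily capacity sampler. The mount point and output file are
hypothetical; a real system would query the protection storage's own API or
reporting rather than a local filesystem view."""

import csv
import datetime
import shutil
from pathlib import Path

MOUNT_POINT = "/backup/storage"                       # hypothetical protection storage mount
SAMPLES_CSV = Path("/var/reports/capacity_samples.csv")


def record_sample(mount_point: str, samples_csv: Path) -> None:
    """Append today's total/used/percentage figures to the sample file."""
    usage = shutil.disk_usage(mount_point)
    pct_used = 100.0 * usage.used / usage.total
    new_file = not samples_csv.exists()
    with samples_csv.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["date", "total_bytes", "used_bytes", "pct_used"])
        writer.writerow([datetime.date.today().isoformat(),
                         usage.total, usage.used, f"{pct_used:.2f}"])


if __name__ == "__main__":
    record_sample(MOUNT_POINT, SAMPLES_CSV)
```

Run daily from a scheduler, that gives you a growing dataset that reports, dashboards and the trending below can all feed from.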

The key to measurement is trending

You can capture all the data in the world and make it available in a dashboard, but if you don’t perform appropriate localised trending against that data to analyse it, you’re making your own good self the bottleneck (and weakest link) in the capacity management equation. You need to have trends produced as part of your reporting processes to understand how capacity is changing over time. These trends should reflect your own seasonal data variations, and they should be sampled over multiple time periods. Why? Well, if you have disk based data protection storage in your environment and do a linear forecast on capacity utilisation from day one, you’ll likely get a smoothing based on lower figures from earlier in the system lifecycle that could actually obfuscate more recent results. So you want to capture and trend that information for comparison, but you equally want to capture and trend shorter timeframes to ensure you have an understanding of shorter term changes. Trends based on the last six and three months of usage can be very useful in identifying what sort of capacity management challenges you’ve got based on short term changes in data usage profiles – a few systems, for instance, might be spiking considerably in utilisation, and if you’re still comparing against a 3-year dataset or something along those lines, the more recent profile may not be accurately represented in forecasts.

In short: measuring over multiple periods gives you the best accuracy.
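
By way of a sketch of what that multi-window trending might look like, the following fits a simple least-squares line to the utilisation samples over a few different windows and estimates how long each trend gives you before the 80% mark. It assumes the CSV layout from the sampler sketch above; the window lengths and the 80% threshold are purely illustrative.

```python
#!/usr/bin/env python3
"""Sketch: fit a linear trend to capacity samples over several windows and
estimate when each trend crosses the 80% 'soft full' mark. Assumes the CSV
from the sampler sketch above (date, total_bytes, used_bytes, pct_used)."""

import csv
from datetime import date
from pathlib import Path

SAMPLES_CSV = Path("/var/reports/capacity_samples.csv")   # hypothetical sample file
SOFT_FULL_PCT = 80.0


def load_samples(samples_csv: Path):
    """Return a list of (day_number, pct_used) tuples, oldest first."""
    rows = []
    with samples_csv.open() as fh:
        for row in csv.DictReader(fh):
            day = date.fromisoformat(row["date"]).toordinal()
            rows.append((day, float(row["pct_used"])))
    return sorted(rows)


def linear_fit(points):
    """Ordinary least squares: return (slope per day, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var = sum((x - mean_x) ** 2 for x, _ in points)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov / var if var else 0.0
    return slope, mean_y - slope * mean_x


def days_until(points, target_pct):
    """Days from the most recent sample until the trend crosses target_pct."""
    slope, intercept = linear_fit(points)
    if slope <= 0:
        return None                        # flat or shrinking: no crossing forecast
    latest_day = points[-1][0]
    return max(0.0, (target_pct - intercept) / slope - latest_day)


if __name__ == "__main__":
    samples = load_samples(SAMPLES_CSV)
    if not samples:
        raise SystemExit("no capacity samples recorded yet")
    for window_days in (90, 180, 365):     # short, medium and long term views
        window = [p for p in samples if p[0] >= samples[-1][0] - window_days]
        eta = days_until(window, SOFT_FULL_PCT)
        if eta is None:
            print(f"{window_days}-day trend: no growth trend detected")
        else:
            print(f"{window_days}-day trend: ~{eta:.0f} days until {SOFT_FULL_PCT:.0f}% used")
```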

Maximum is the new minimum

Linear forecasts of trending information are good if you’re just slowly, continually increasing your storage requirements. But if you’re either staging data (disk as staging) or running garbage collection (e.g., deduplication), it’s quite possible to get increasing sawtooth cycles in capacity utilisation on your data protection storage. And guess what? It doesn’t matter whether your capacity meets the average utilisation if the peak – the day before the oldest backups are deleted or garbage collection takes place – grows beyond it. So make sure when you’re trending you’re looking at how you meet the changing maximum peaks, not the average sizes.
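
A tiny sketch of that point: collapse the daily figures into per-cycle peaks before you trend or compare against capacity. The seven-day cycle and the sample numbers below are purely illustrative – the average looks comfortable, but the peaks are what actually have to fit.

```python
#!/usr/bin/env python3
"""Sketch: reduce a sawtooth utilisation series to per-cycle peaks before
trending, so the forecast tracks the maxima rather than the average.
The 7-day cycle and the sample data are purely illustrative."""

CYCLE_DAYS = 7   # assumed staging / garbage collection cycle length


def cycle_peaks(daily_pct_used, cycle_days=CYCLE_DAYS):
    """Collapse daily utilisation figures into the peak value of each cycle."""
    peaks = []
    for start in range(0, len(daily_pct_used), cycle_days):
        cycle = daily_pct_used[start:start + cycle_days]
        peaks.append(max(cycle))
    return peaks


if __name__ == "__main__":
    # Illustrative sawtooth: usage climbs during the week, drops after cleanup.
    daily = [61, 63, 66, 68, 71, 73, 62,
             63, 65, 68, 70, 73, 76, 64,
             66, 68, 71, 73, 76, 79, 67]
    print("Daily average:", round(sum(daily) / len(daily), 1))   # looks comfortable
    print("Cycle peaks:  ", cycle_peaks(daily))                  # these are what must fit
```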

Know your windows

There are three types of windows I’m referring to here – change, change freeze, and procurement.

You need to know them all intimately.

You’re at 95% capacity but you anticipated this and additional data protection storage has just arrived in your datacentre’s receiving bay, so you should be right to install it – right? What happens if you then have a week’s wait to have the change board consider your request for an outage – or datacentre access – to install the extra capacity? Will you be able to hold on that long? That’s knowing your change windows.

You know you’re going to run out of capacity in two months’ time if nothing is done, so you order additional data protection storage and it arrives on December 20. The only problem is a mandatory company change blackout started on December 19 and you literally cannot install anything until January 20. Do you have enough capacity to survive? That’s knowing your freeze windows.

You know you’re at 80% capacity today and based on the trends you’ll be at 90% capacity in 3 weeks and 95% capacity in 4 weeks. How long does it take to get a purchase order approved? How long does it take the additionally purchased systems to arrive on-site? If it takes you 4 weeks to get purchase approval and another 3 weeks for it to arrive after the purchase order is sent, maybe 70%, not 80%, is your new 100%. That’s knowing your procurement windows.
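
That procurement arithmetic can be reduced to a one-line calculation. Here’s a sketch, with illustrative figures, of working out the utilisation level at which the purchase has to start, given your growth rate and total lead time:

```python
#!/usr/bin/env python3
"""Sketch: the utilisation level at which you must start the purchase, given
the observed growth rate and the total procurement + change lead time.
All figures are illustrative."""


def order_trigger_pct(growth_pct_per_week: float, lead_time_weeks: float,
                      full_pct: float = 100.0) -> float:
    """Utilisation at which an order must be raised so new capacity lands in time."""
    return full_pct - growth_pct_per_week * lead_time_weeks


if __name__ == "__main__":
    # Roughly the example in the text: ~4% growth per week,
    # 4 weeks for purchase approval + 3 weeks for delivery.
    print(order_trigger_pct(growth_pct_per_week=4, lead_time_weeks=7))
    # -> 72.0: in the same ballpark as the "maybe 70%, not 80%" figure above.
```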

Final thoughts

I want to stress – this isn’t a doom and gloom article, even if it seems I’m painting a blunt picture. What I’ve described above are expert tips – not from myself, but from my customers, and customers of colleagues and friends, whom I’ve seen manage data protection storage capacity well. If you follow at least the above guidelines, you’re going to have a far more successful – and more relaxed – time of it all.

And maybe you’ll get to spend Thanksgiving, Christmas, Ramadan, Maha Shivaratri, Summer Solstice, Melbourne Cup Day, Labour Day or whatever your local holidays and festivals are with your friends and families, rather than manually managing an otherwise completely manageable situation.

Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!


Aug 09 2016

I’ve recently been doing some testing around Block Based Backups, and specifically recoveries from them. This has acted as an excellent reminder of two things for me:

  • Microsoft killing Technet is a real PITA.
  • You backup to recover, not backup to backup.

The first is just a simple gripe: running up an eval Windows server every time I want to run a simple test is a real crimp in my style, but $1,000+ licenses for a home lab just can’t be justified. (A “hey this is for testing only and I’ll never run a production workload on it” license would be really sweet, Microsoft.)

The second is the real point of the article: you don’t backup for fun. (Unless you’re me.)


You ultimately backup to be able to get your data back, and that means deciding your backup profile based on your RTOs (recovery time objectives), RPOs (recovery point objectives) and compliance requirements. As a general rule of thumb, this means you should design your backup strategy to meet at least 90% of your recovery requirements as efficiently as possible.

For many organisations this means backup requirements can come down to something like the following: “All daily/weekly backups are retained for 5 weeks, and are accessible from online protection storage”. That’s why a lot of smaller businesses in particular get Data Domains sized for say, 5-6 weeks of daily/weekly backups and 2-3 monthly backups before moving data off to colder storage.

But while online is online is online, we have to think of local requirements, SLAs and flow-on changes for LTR/Compliance retention when we design backups.

This is something we can consider with things even as basic as the humble filesystem backup. These days there are all sorts of things that can be done to improve the performance of dense (and dense-like) filesystem backups – by dense I’m referring to very large numbers of files in relatively small storage spaces. That’s regardless of whether it’s in local knots on the filesystem (e.g., a few directories that are massively oversubscribed in terms of file counts), or whether it’s just a big, big filesystem in terms of file count.

We usually think of dense filesystems in terms of the impact on backups – and this is not a NetWorker problem; this is an architectural problem that operating system vendors have not solved. Filesystems struggle to scale their operational performance for sequential walking of directory structures when the number of files starts exponentially increasing. (Case in point: Cloud storage is efficiently accessed at scale when it’s accessed via object storage, not file storage.)

So there’s a number of techniques that can be used to speed up filesystem backups. Let’s consider the three most readily available ones now (in terms of being built into NetWorker):

  • PSS (Parallel Save Streams) – Dynamically builds multiple concurrent sub-savestreams for individual savesets, speeding up the backup process by having multiple walking/transfer processes.
  • BBB (Block Based Backup) – Bypasses the filesystem entirely, performing a backup at the block level of a volume.
  • Image Based Backup – For virtual machines, a VBA based image level backup reads the entire virtual machine at the ESX/storage layer, bypassing the filesystem and the actual OS itself.

So which one do you use? The answer is a simple one: it depends.

It depends on how you need to recover, how frequently you might need to recover, what your recovery requirements are from longer term retention, and so on.

For virtual machines, VBA is usually the method of choice as it’s the most efficient backup method you can get, with very little impact on the ESX environment. It can recover a sufficient number of files in a single session for most use requirements – particularly if file services have been pushed (where they should be) into dedicated systems like NAS appliances. You can do all sorts of useful things with VBA backups – image level recovery, changed block tracking recovery (very high speed in-place image level recovery), instant access (when using a Data Domain), and of course file level recovery. But if your intent is to recover tens of thousands of files in a single go, VBA is not really what you want to use.

It’s the recovery that matters.

For compatible operating systems and volume management systems, Block Based Backups work regardless of whether you’re in a virtual machine or whether you’re on a physical machine. If you’re needing to backup a dense filesystem running on Windows or Linux that’s less than ~63TB, BBB could be a good, high speed method of achieving that backup. Equally, BBB can be used to recover large numbers of files in a single go, since you just mount the image and copy the data back. (I recently did a test where I dropped ~222,000 x 511 byte text files into a single directory on Windows 2008 R2 and copied them back from BBB without skipping a beat.)

BBB backups aren’t readily searchable though – there’s no client file index constructed. They work well for systems where content is of a relatively known quantity and users aren’t going to be asking for those “hey I lost this file somewhere in the last 3 weeks and I don’t know where I saved it” recoveries. It’s great for filesystems where it’s OK to mount and browse the backup, or where there’s known storage patterns for data.

It’s the recovery that matters.

PSS is fast, but in any smack-down test BBB and VBA backups will beat it for backup speed. So why would you use it? For a start, it’s available on a wider range of platforms – VBA requires ESX virtualised backups, BBB requires Windows or Linux and ~63TB or smaller filesystems, while PSS will pretty much work on everything other than OpenVMS – and its recovery options work with any protection storage as well. Both BBB and VBA are optimised for online protection storage and being able to mount the backup. PSS is an extension of the classic filesystem agent and is less specific.

It’s the recovery that matters.

So let’s revisit that earlier question: which one do you use? The answer remains: it depends. You pick your backup model not on the basis of “one size fits all” (always a flawed approach in data protection), but on your requirements around questions like:

  • How long will the backups be kept online for?
  • Where are you storing longer term backups? Online, offline, nearline or via cloud bursting?
  • Do you have more flexible SLAs for recovery from Compliance/LTR backups vs Operational/BAU backups? (Usually the answer will be yes, of course.)
  • What’s the required recovery model for the system you’re protecting? (You should be able to form broad groupings here based on system type/function.)
  • Do you have any externally imposed requirements (security, contractual, etc.) that may impact your recovery requirements?

Remember there may be multiple answers. Image level backups like BBB and VBA may be highly appropriate for operational recoveries, but for long term compliance your business may have needs that trigger filesystem/PSS backups for those monthlies and yearlies. (Effectively that comes down to making the LTR backups as robust in terms of future infrastructure changes as possible.) That sort of flexibility of choice is vital for enterprise data protection.

One final note: the choices, once made, shouldn’t stay rigidly inflexible. As a backup administrator or data protection architect, your role is to constantly re-evaluate changes in the technology you’re using to see how and where they might offer improvements to existing processes. (When it comes to release notes: constant vigilance!)

Betting the company

 Backup theory, Best Practice, Databases, General Technology
Jun 15 2016

Short of networking itself, backup and recovery systems touch more of your infrastructure than anything else. So it’s pretty common for any backup and recovery specialist to be asked how we can protect a ten or sometimes even twenty year old operating system or application.

Sure you can backup Windows 2012, but what about NT 4?

Sure you can backup Solaris 11, but what about Tru64 v5?

Sure you can backup Oracle 12, but what about Oracle 8?

These really are questions we get asked.

I get these questions. I even have an active Windows 2003 SMB server sitting in my home lab running as an RDP jump-point. My home lab.


So it’s probably time for me to admit: I’m not really speaking to backup administrators with this article, but the broader infrastructure teams and, probably more so, the risk officers within companies.

Invariably we get asked if we can backup AncientOS 1.1 or DefunctDatabase 3.2 because those systems are still in use – and inevitably, still in production use – within a business. Sometimes they’re even running pseudo-mission critical services, but more often than not they’re simply running essential services the business has deemed too costly to migrate to another platform.

I’m well aware of this. In 1999 I was the primary system administrator involved in a Y2K remediation project for a SAP deployment. The system as deployed was running on an early version of Oracle 8 as I recall (it might have been Oracle 7 – it was 17 years ago…), sitting on Tru64 with an old (even for then) version of SAP. The version of the operating system, the version of Oracle, the version of SAP and even things like the firmware in the DAS enclosures attached were all unsupported by the various vendors for Y2K.

The remediation process was tedious and slow because we had to do piecemeal upgrades of everything around SAP and beg for Y2K compliance exceptions from Oracle and Digital for specific components. Why? When the business had deployed SAP two years before, they’d spent $5,000,000 or so customising it to the nth degree, and upgrading it would require a similarly horrifically expensive remediation and customisation project. It was, quite simply, easier and cheaper to risk peripheral upgrades around the application.

It worked. (As I recall, the only system in the company that failed over the Y2K transition was the Access database put together at the last minute by some tech-boffin-project manager designed to track any Y2K incidents over the entire globe for the company. I’ve always found there to be beautiful irony in that.)

This is how these systems limp along within organisations. It costs too much to change them. It costs too much to upgrade them. It costs too much to replace them.

And so day by day, month by month, year by year, the business continues to bet that bad things won’t happen. And what’s the collateral for the bet? Well, it could be the company itself. If it costs that much to change them, upgrade them or replace them, what’s the cost going to be if they fail completely? There’s an old story of a CEO and a CIO talking, and the CIO says: “Why are you paying all this money to train people? What if you train them and they leave?” To which the CEO responds, “What if we don’t train them and they stay?” I think this is a similar situation.

I understand. I sympathise – even empathise, but we’ve got to find a better way to resolve this problem, because it’s a lot more than just a backup problem. It’s even more than a data protection problem. It’s a data integrity problem, and that creates an operational integrity problem.

So why is the question “do you support X?” asked when the original vendor for X doesn’t even support it any more – and may not have done for a decade or more?

The question is not really whether we can supply backup agents or backup modules old enough to work with these systems unsupported by their vendor of origin, or whether you can get access to a knowledge-base that stretches back far enough to include details of those systems. Supply? Yes. Officially support? How much official support do you get from the vendor of origin?

I always think in these situations there’s a broader conversation to be had. Those legacy applications and operating systems are a sea anchor to your business at a time when you increasingly have to be able to steer and move the ship faster and with greater agility. Those scenarios where you’re reliant on technology so old it’s no longer supported are exactly those sorts of scenarios that are allowing startups and younger, more agile competitors to swoop in and take customers from you. And it’s those scenarios that also leave you exposed to an old 10GB ATA drive failing, or a random upgrade elsewhere in the company finally and unexpectedly resulting in that critical or essential system no longer being able to access the network.

So how do we solve the problem?

Sometimes there’s a simple workaround – virtualisation. If it’s an old x86 based platform, particularly Windows, there’s a good chance the system can be virtualised so it can at least run on modern hardware. That doesn’t solve the ‘supported’ problem, but it does mean greater protection: image level backups regardless of whether there’s an agent for the internal virtual machine, and snapshots and replication to reduce the likelihood of ever having to consider a BMR. Being old, such systems usually hold minimal amounts of data, so that type of protection is not an issue.

But the real solution comes from being able to modernise the workload. We talk about platforms 1, 2 and 3 – platform 1 is the old mainframe approach to the world, platform 2 is the classic server/desktop architecture we’ve been living with for so long, and platform 3 is the new, mobile and cloud approach to IT. Some systems even get classified as platform ‘2.5’ – that interim step between the current and the new. What’s the betting that old curmudgeonly system that’s holding your business back from modernising is more like platform 1.5?

One way you can modernise is to look at getting innovative with software development. Increasing requirements for agility will drive more IT departments back to software development for platform 3 environments, so why not look at this as an opportunity to grow that development environment within your business? That’s where the EMC Federation can really swing in to help: Pivotal Labs is premised on new approaches to software development. Agile may seem like a buzz-word, but if you can cut software development down from 12-24 months to 6-12 weeks (or less!), doesn’t that mitigate many of the cost reasons to avoid dealing with the legacy platforms?

The other way of course is with traditional consulting approaches. Maybe there’s a way that legacy application can be adapted, or archived, in such a way that the business functions can be continued but the risk substantially reduced and the platform modernised. That’s where EMC’s consultancy services come in, where our content management services come in, and where our broad experience across hundreds of thousands of customer environments comes in. Because I’ll be honest: your problems aren’t actually unique; you’re not the only business that’s dealing with legacy system components, and while there may be industry-specific or even customer-specific aspects that are tricky, there’s a very, very good chance that somewhere, someone has gone through the same situation. The solution could very well be tailored specifically for your business, but the processes and tools that get used to get you to your solution don’t necessarily have to be bespoke.

It’s time to stop thinking only about whether those ancient and unsupported operating systems and applications can be backed up, and start thinking about how they can be modernised so they stop holding the business back.

The Importance of Being Earnestly Automated

 Architecture, Best Practice, General Technology
Apr 13 2016

It was not long after I started in IT that I got the most important advice of my career. It came from a senior Unix system administrator in the team I’d just joined, and it shaped my career. In just eight words it stated the purpose of the system administrator, and I think IT as a whole:

The best system administrator is a lazy one.

On the face of it, it seems inappropriate advice: be lazy; yet that’s just the superficial reading of it. The real intent was this:

Automate everything you have to repeatedly do.


One of the reasons I was originally so blasé about Cloud was that it was old-hat. The same way that mainframe jockeys yawned and rolled their eyes when midrange people started talking about the wonders of virtualisation, I listened to people in IT extolling Cloud and found myself rolling my eyes – not just over the lack of data protection in early Cloud solutions, but at the stories about how Cloud was agile. And there are no prizes for guessing where agility comes from: automation.

It surprises me twenty years on that the automation debate is still going on, and some people remain unconvinced.

There are three fundamental results of automation:

  • Repeatability
  • Reliability
  • Verifiability

When something is properly automated, it can be repeated easily and readily. That’s a fundamental tenet driving Cloud agility: you click on a button on a portal and hey presto!, a virtual machine is spun up and you receive an IP address to access it from. Or you click on a button on a portal and suddenly you’ve got yourself a SQL database or Exchange server or CRM system or any one of hundreds of different applications or business functions. If there’s human intervention at the back-end between when you click the button and when you get your service it’s not agile. It’s not Cloud. And it’s certainly not automated. Well, not fully or properly.

With repeatability comes reliability – accuracy. It doesn’t matter whether the portal has been up for 1 hour or 1000 hours, it doesn’t matter whether it’s 01:00 or 13:00, and it doesn’t matter how many requests the portal has got: it’s not prone to error, it won’t miss a check-box because it’s rushed or tired or can’t remember what the correct option is. It doesn’t matter whether the computer doing the work in the background has never done it before because it’s just been added to the resource pool, or whether it’s done the process a million times before. Automation isn’t just about repeatability, it’s about reliable repeatability.

Equally as importantly, with automation – with repeatability – there comes verifiability. Not only can you reliably repeat the same activity time and time again, but every time it’s executed you can verify it was executed. You can monitor, measure and report. This can be from the simplest – verifying it was performed successfully or throwing an exception for a human to investigate – to the more complex, such as tracking and reporting the trends on how long it takes automated processes to complete, so you can keep an eye on how the system is scaling.
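
As a sketch of what that verifiability can look like in practice, here’s a minimal wrapper that times an automated task, logs every run, and escalates any failure to a human rather than letting it pass silently. The task command and log location are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: wrap an automated task so every execution is verified, timed and
logged -- success is recorded, failure is escalated to a human. The command
and log location are hypothetical."""

import datetime
import subprocess
import sys
import time

TASK = ["/usr/local/bin/provision-vm", "--profile", "standard"]   # hypothetical task
LOG = "/var/log/automation/provision.log"                          # hypothetical log file


def run_and_verify(task, log_path):
    """Run the task, record its result and duration, and raise on failure."""
    start = time.monotonic()
    result = subprocess.run(task, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        log.write(f"{stamp} rc={result.returncode} elapsed={elapsed:.1f}s\n")
    if result.returncode != 0:
        # Verifiability: a failed run is surfaced for a human, never ignored.
        print(result.stderr, file=sys.stderr)
        raise RuntimeError(f"automated task failed (rc={result.returncode})")
    return elapsed


if __name__ == "__main__":
    print(f"completed in {run_and_verify(TASK, LOG):.1f}s")
```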

Once you’ve got automation in place, you’ve freed up your IT staff from boring and repetitive duties. That’s not to remove them from their jobs, but to let the humans in your staff do the jobs humans do best: those involving dealing with the unexpected, or thinking of new solutions. Automated, repeatable tasks are best left to scripts and processes and even robots (when it comes to production). The purpose of being a lazy system administrator was not so that you could sit at your desk doing nothing all day, but so you could spend time handling exceptions and errors, designing new systems, working on new projects, and yes, automating new systems.

Automation is not just a Cloud thing. Automation is not just a system administration thing. Or a database/application administration thing. Or a build thing. Or a…

Automation is everything in IT, particularly in the infrastructure space. Cloud has well and truly raised the profile of automation, but the fundamental concept is not new. I’d go so far as to say that if your business isn’t focused on automation, you’re doing IT wrong.

Mar 09 2016

I’ve been working with backups for 20 years, and if there’s been one constant in 20 years I’d say that application owners (i.e., DBAs) have traditionally been reluctant to have other people (i.e., backup administrators) in control of the backup process for their databases. This leads to some environments where the DBAs maintain control of their backups, and others where the backup administrators maintain control of the database backups.


So the question that many people end up asking is: which way is the right way? The answer, in reality, is a little fuzzy: it depends.

When we were primarily backing up to tape, there was a strong argument for backup administrators to be in control of the process. Tape drives were a rare commodity needing to be used by a plethora of systems in a backup environment, and with big demands placed on them. The sensible approach was to fold all database backups into a common backup scheduling system so resources could be apportioned efficiently and fairly.

Traditional backups to tape via a backup server

With limited tape resources and a variety of systems to protect, backup administrators needed to exert reasonably strong controls over what backed up when, and so in a number of organisations it was common to have database backups controlled within the backup product (e.g., NetWorker), with scheduling negotiated between the backup and database administrators. Where such processes have been established, they often continue – backups are, of course, a reasonably habitual process (and for good cause).

For some businesses though, DBAs might feel they didn’t have enough control over the backup process – a view that might be justified by the mission criticality of the applications running on top of the database, or by the perceived licensing costs of using a plugin or module from the backup product to back up the database. So in these situations, if a tape library or drives weren’t allocated directly to the database, the “dump and sweep” approach became quite common, viz.:

Dump and Sweep

One of the most pervasive results of the “dump and sweep” methodology however is the amount of primary storage it uses. Because disk is much faster than tape, database administrators would often get significantly larger areas of storage – particularly as storage became cheaper – to conduct their dumps to. Instead of one or two days, it became increasingly common to have anywhere from 3-5 days of database dumps sitting on primary storage, being swept up nightly by a filesystem backup agent.

Dump and sweep of course poses problems: in addition to needing large amounts of primary storage, the first backup for the database is on-platform – there’s no physical separation. That means the timing of getting the database backup completed before the filesystem sweep starts is critical. However, the timing for the dump is controlled by the DBA and dependent on the database load and the size of the database, whereas the timing of the filesystem backup is controlled by the backup administrator. This would see many environments spring up where over time the database grew to a size it wouldn’t get an off-platform backup for 24 hours – until the next filesystem backup happened. (E.g., a dump originally taking an hour to complete would be started at 19:00. The backup administrators would start the filesystem backup at 20:30, but over time the database backups would grow and wouldn’t complete until say, 21:00. Net result could be a partial or failed backup of the dump files the first night, with the second night being the first successful backup of the dump.)
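
One simple way to take the guesswork out of that timing race is for the dump to leave a completion marker that the sweep checks before it runs. Here’s a rough sketch of the idea – the paths, database name and pg_dump invocation are purely illustrative, and a real environment would hook the pre-check into whatever pre-command facility the filesystem backup offers:

```python
#!/usr/bin/env python3
"""Sketch of coordinating 'dump and sweep': the dump writes a completion
marker, and the sweep's pre-check refuses to run against a dump area whose
marker is missing or stale. Paths and commands are hypothetical."""

import datetime
import pathlib
import subprocess
import sys

DUMP_DIR = pathlib.Path("/dbdumps/sales")                       # hypothetical dump area
MARKER = DUMP_DIR / "DUMP_COMPLETE"
DUMP_CMD = ["pg_dump", "--format=custom",
            "--file", str(DUMP_DIR / "sales.dump"), "sales"]    # hypothetical database


def run_dump() -> None:
    """DBA side: remove the marker, dump, then re-create the marker on success."""
    MARKER.unlink(missing_ok=True)
    subprocess.run(DUMP_CMD, check=True)
    MARKER.write_text(datetime.datetime.now().isoformat())


def sweep_precheck(max_age_hours: float = 24.0) -> bool:
    """Backup side: only sweep if a marker exists and is recent."""
    if not MARKER.exists():
        return False
    written = datetime.datetime.fromisoformat(MARKER.read_text().strip())
    age_hours = (datetime.datetime.now() - written).total_seconds() / 3600
    return age_hours <= max_age_hours


if __name__ == "__main__":
    if sys.argv[1:] == ["dump"]:
        run_dump()
    elif not sweep_precheck():
        print("Dump incomplete or stale -- do not sweep yet", file=sys.stderr)
        sys.exit(1)
    else:
        print("Dump complete -- safe to sweep")
```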

Over time, backup to disk grew in popularity to overcome the overnight operational challenges of tape, and eventually the market expanded to include deduplication storage, purpose built backup appliances and even what I’d normally consider to be integrated data protection appliances – ones where the intelligence (e.g., deduplication functionality) is extended out from the appliance to the individual systems being protected. That’s what we get, for instance, with Data Domain: the Boost functionality embedded in APIs on the client systems leverages distributed segment processing so that everything being backed up participates in its own deduplication. The net result is one that scales better than the traditional 3-tier “client/server/{media server|storage node}” environment, because we’re scaling where it matters: out at the hosts being protected and up at protection storage, rather than adding a series of servers in the middle to manage bottlenecks. (I.e., we remove the bottlenecks.)

Even as large percentages of businesses switched to deduplicated storage – Data Domains mostly from a NetWorker perspective – and had the capability of leveraging distributed deduplication processes to speed up the backups, that legacy “dump and sweep” approach, if it had been in the business, often remained in the business.

We’re far enough into this now that I can revisit the two key schools of thought within data protection:

  • Backup administrators should schedule and control backups regardless of the application being backed up
  • Subject Matter Experts (SMEs) should have some control over their application backup process because they usually deeply understand how the business functions leveraging the application work

I’d suggest that the smaller the business, the more correct the first option is – or rather, when an environment is such that DBAs are contracted or outsourced in particular, having the backup administrator in charge of the backup process is probably more important to the business. But that creates a requirement for the backup administrator to know the ins and outs of backing up and recovering the application/database almost as deeply as a DBA themselves.

As businesses grow in size and as the number of mission critical systems sitting on top of databases/applications grow, there’s equally a strong opinion the second argument is correct: the SMEs need to be intimately involved in the backup and recovery process. Perhaps even more so, in a larger backup environment, you don’t want your backup administrators to actually be bottlenecks in a disaster situation (and they’d usually agree to this as well – it’s too stressful).

With centralised disk based protection storage – particularly deduplicating protection storage – we can actually get the best of both worlds now though. The backup administrators can be in control of the protection storage and set broad guidance on data protection at an architectural and policy level for much of the environment, but the DBAs can leverage that same protection storage and fold their backups into the overall requirements of their application. (This might be to even leverage third party job control systems to only trigger backups once batch jobs or data warehousing tasks have completed.)

Backup Process With Data Domain and Backup Server

That particular flow is great for businesses that have maintained centralised control over the backup process of databases and applications, but what about those where dump and sweep has been the design principle, and there’s a desire to keep a strong form of independence on the backup process, or where the overriding business goal is to absolutely limit the number of systems database administrators need to learn so they can focus on their job? They’re definitely legitimate approaches – particularly so in larger environments with more mission critical systems.

That’s why there’s the Data Domain Boost plugins for Applications and Databases – covering SAP, DB2, Oracle, SQL Server, etc. That gives a slightly different architecture, viz.:

DB Backups with Boost Plugin

In that model, the backup server (e.g., NetWorker) still controls and coordinates the majority of the backups in the environment, but the Boost Plugin for Databases/Applications is used on the database servers instead to allow complete integration between the DBA tools and the backup process.

So returning to the initial question – which way is right?

Well, that comes down to the real question: which way is right for your business? Pull any emotion or personal preferences out of the question and look at the real architectural requirements of the business, particularly relating to mission critical applications. Which way is the right way? Only your business can decide.

Here’s a thought I’ll leave you with though: there are two critical components to being able to make the choice completely based on business requirements:

  • You need centralised protection storage where there aren’t the traditional (tape-inherited) limitations on concurrent device access
  • You need a data protection framework approach rather than a data protection monolith approach

The former allows you to make decisions without being impeded by arbitrary practical/physical limitations (e.g., “I can’t read from a tape and write to it at the same time”), and more importantly, the latter lets you build an adaptive data protection strategy using best of breed components at the different layers rather than squeezing everything into one box and making compromises at every step of the way. (NetWorker, as I’ve mentioned before, is a framework based backup product – but I’m talking more broadly here: framework based data protection environments.)

Happy choosing!

Client Load: Filesystem and Database Backups

 Backup theory, Best Practice, Databases
Feb 03 2016

A question I get asked periodically is “can I backup my filesystem and database at the same time?”

As is often the case, the answer is: “it depends”.

Or, to put it another way: it depends on what the specific client can handle at the time.

For the most part, backup products have a fairly basic design requirement: get the data from the source (let’s say “the client”, ignoring options like ProtectPoint for the moment) to the destination (protection storage) as quickly as possible. The faster the better, in fact. So if we want backups done as fast as possible, wouldn’t it make sense to backup the filesystem and any databases on the client at the same time? Well – the answer is “it depends”, and it comes down to the impact it has on the client and the compatibility of the client to the process.

First, let’s consider compatibility – if both the filesystem and database backup process use the same snapshot mechanism for instance, and only one can have a snapshot operational at any given time, that immediately rules out doing both at once. That’s the most obvious scenario, but the more subtle one almost comes back to the age-old parallelism problem: how fast is too fast?

If we’re conducting a complete filesystem read (say, in the case of a full backup) and simultaneously reading an entire database, and the database and filesystem we’re reading from both reside on the same physical LUN, there is the potential the two reads will be counter-productive: if the underlying physical LUN is in fact a single disk, you’re practically guaranteed that’s the case. We wouldn’t normally want RAID-less storage for pretty much anything in production, but just slipping RAID into the equation doesn’t guarantee we can achieve both reads simultaneously without impact to the client – particularly if the client is already doing other things. Production things.

Virtualisation doesn’t write a blank cheque, either; image level backup with databases in the image is a bit of a holy grail in the backup industry, but even in those situations where it may be supported, it’s not supported for every database type. So it’s still more common than not to see situations where you have virtual/image level backups for the guest for crash consistency on the file and operating system components, and then an in-guest database agent running for that true guaranteed database recoverability. Do you want a database and image based backup happening at the same time? Your hypervisor is furiously reading the image file while the in-guest agent is furiously reading the database.

In each case, that’s just at a per-client level. Zooming out a bit, in a datacentre with hundreds or thousands of hosts all accessing shared storage via shared networking – usually via shared compute resources as well – “how long is a piece of string?” becomes an exponentially harder question as the number of shared resources, and of items sharing those resources, comes into play.

Unless you have an overflow of compute resources and SSD offering more IO than your systems can ever need, “can I backup my filesystem and databases at the same time?” is very much a non-trivial question. In fact, it becomes a bit of an art, as does all performance tuning. So rather than directly answering the question, I’ll make a few suggestions to be considered along the way as you answer the question for your environment:

  • Recommendation: Particularly for traditional filesystem agent + traditional database agent backups, never start the two within five minutes, and preferably give half an hour gap between starts. I.e., overlap is OK, concurrency for starting should be avoided where possible.
  • Recommendation: Make sure the two functions can be concurrently executed. I.e., if one blocks the other from running at the same time, you have your answer.
  • Remember: It’s all parallelism. Rather than a former CEO leaping around stage shouting “developers, developers, developers!” imagine me leaping around shouting “parallelism, parallelism, parallelism!”* – at the end of the day each concurrent filesystem backup uses a unit of parallelism and each concurrent database backup uses a unit of parallelism, so if you exceed what the client can naturally do based on memory, CPU resources, network resources or disk resources, you have your answer. (There’s a small sanity-check sketch after this list.)
  • Remember: Backup isn’t ABC, it’s CDE – Compression, Deduplication, Encryption. Each function will adjust the performance characteristics of the host you’re backing up – sometimes subtly, sometimes not so. Compression and encryption are easier to understand: if you’re doing either as a client-CPU function you’re likely going to be hammering the host. Deduplication gets trickier of course – you might be doing a bit more CPU processing on the host, but over a shorter period of time if the net result is a 50-99% reduction in the amount of data you’re sending.
  • Remember: You need the up-close and big picture view. It’s rare we have systems so isolated any more that you can consider this in the perspective of a single host. What’s the rest of the environment doing or likely to be doing?
  • Remember: ‘More magic’ is better than ‘magic’. (OK, it’s unrelated, but it’s always a good story to tell.)
  • Most importantly: Test. Once you’ve looked at your environment, once you’ve worked out the parallelism, once you’re happy the combined impact of a filesystem and database backup won’t go beyond the operational allowances on the host – particularly on anything remotely approaching mission critical – test it.
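
To illustrate the parallelism point from the list above, here’s a trivially small sanity check: every concurrent filesystem stream and database channel costs a unit of parallelism, and the combined total has to fit within whatever the client can comfortably sustain. The figures are illustrative – the real budget depends on the client’s CPU, memory, disk and network headroom.

```python
#!/usr/bin/env python3
"""Sketch: sanity check that combined filesystem and database backup streams
stay within a client's parallelism budget. All figures are illustrative."""


def within_budget(fs_streams: int, db_channels: int, client_parallelism: int) -> bool:
    """Every concurrent filesystem stream and database channel costs one unit."""
    return fs_streams + db_channels <= client_parallelism


if __name__ == "__main__":
    # e.g. 4 parallel save streams + 4 database channels against a budget of 6
    fs, db, budget = 4, 4, 6
    if within_budget(fs, db, budget):
        print("OK to overlap the backups")
    else:
        print(f"Over budget by {fs + db - budget} stream(s): stagger the starts or reduce channels")
```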

If you were hoping there was an easy answer, the only one I can give you is “don’t” – but that’s just making a blanket assumption that you can never, or should never, do it. It’s the glib/easy answer – the real answer is: only you can answer the question.

But trust me: when you do, it’s immensely satisfying.

On another note: I’m pleased to say I made it into the EMC Elect programme for another year – that’s every year since it started! If you’re looking for some great technical people within the EMC community (partners, employees, customers) to keep an eye on, make sure you check out the announcement page.

* Try saying “parallelism, parallelism, parallelism!” three times fast when you had a speech impediment as a kid. It doesn’t look good.