May 23, 2017
 

Introduction

A seemingly straightforward question, “what constitutes a successful backup?” may not engender the same response from everyone you ask. On the surface, you might suggest the answer is simply “a backup that completes without error”, and that’s part of the answer, but it’s not the complete answer.


Instead, I’m going to suggest there are actually at least ten factors that go into making a successful backup, and explain why each one of them is important.

The Rules

One – It finishes without a failure

This is the simplest definition of a successful backup: one that literally finishes successfully. It makes sense, and it should be a given. If a backup fails to transfer the data it is meant to transfer during the process, it’s obviously not successful.

Now, there’s a caveat here, something I need to cover off. Sometimes you might encounter situations where a backup completes successfully but triggers or produces a spurious error as it finishes – i.e., you’re told it failed, but it actually succeeded. Is that a successful backup? No, not in a useful way, because it’s either encouraging you to ignore errors or demanding manual cross-checking.

Two – Any warnings produced are acceptable

Sometimes warnings will be thrown during a backup. It could be that a file had to be re-read, or a file was opened at the time of backup (e.g., on a Unix/Linux system) and could only be partially read.

Some warnings are acceptable, some aren’t, and some warnings that are acceptable on one system may not be acceptable on another. Take, for instance, log files. On a lot of systems, if a log file is being actively written to when the backup is running, a warning about an incomplete capture of the file may be acceptable. If the host is a security logging system and compliance/auditing requirements dictate that all security logs are to be recoverable, however, an open-file warning won’t be acceptable.

Three – The end-state is captured and reported on

I honestly can’t count the number of times over the years I’ve heard of situations where a backup was assumed to have been running successfully, then when a recovery is required there’s a flurry of activity to determine why the recovery can’t work … only to find the backup hadn’t been completing successfully for days, weeks, or even months. I really have dealt with support cases in the past where critical data that had to be recovered was unrecoverable due to a recurring backup failure – one that had been going on, reported in logs and completion notifications, day-in, day-out, for months.

So, a successful backup is also a backup where the end-state is captured and reported on. The logical result is that if the backup does fail, someone knows about it and is able to choose an action for it.

When I first started dealing with NetWorker, that meant checking the savegroup completion reports in the GUI. As I learnt more about the importance of automation, and systems scaled (my system administration team had a rule: “if you have to do it more than once, automate it”), I built parsers to automatically interpret savegroup completion results and provide emails that would highlight backup failures.
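If you’re wondering what that sort of parser might look like, here’s a minimal sketch in Python. The log location, line format and mail relay are purely illustrative assumptions – they are not NetWorker’s actual savegroup output – but the shape of the idea is the same: parse the completion results, and always send a report that highlights the failures.

```python
# Hypothetical sketch: parse a backup completion log and email a failure summary.
# The log format, paths and SMTP relay below are illustrative assumptions only.
import re
import smtplib
from email.mime.text import MIMEText

LOG_FILE = "/nsr/logs/savegroup_completion.log"   # assumed location
SMTP_RELAY = "mail.example.com"                   # assumed relay

def parse_failures(path):
    """Return (client, saveset, status) tuples for anything that didn't succeed."""
    # Assumed line format: "<client>:<saveset> <status>"
    pattern = re.compile(r"^(?P<client>\S+):(?P<saveset>\S+)\s+(?P<status>\S+)")
    failures = []
    with open(path) as log:
        for line in log:
            match = pattern.match(line.strip())
            if match and match.group("status").lower() != "succeeded":
                failures.append((match.group("client"),
                                 match.group("saveset"),
                                 match.group("status")))
    return failures

def mail_report(failures):
    body = "\n".join("{0}:{1} -> {2}".format(*f) for f in failures) \
        or "All backups succeeded."
    msg = MIMEText(body)
    msg["Subject"] = "Backup completion report: {0} failure(s)".format(len(failures))
    msg["From"] = "backup-reports@example.com"
    msg["To"] = "backup-admins@example.com"
    with smtplib.SMTP(SMTP_RELAY) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    mail_report(parse_failures(LOG_FILE))
```

The important design point is that the report is sent every time, success or failure – something that, as you’ll see further down, matters a great deal.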

As an environment scales further, automated parsing needs to scale as well – hence the need for products like Data Protection Advisor, where you get everything from simple dashboards for overnight success ratios, with drill-downs and root cause analysis, all the way up to SLA adherence reports and beyond.

In short, a backup needs to be reported on to be successful.

Four – The backup method allows for a successful recovery

A backup exists for one reason alone – to allow the retrieval and reconstruction of data in the event of loss or corruption. If the way in which the backup is run doesn’t allow for a successful recovery, then the backup should not be counted as a successful backup, either.

Open files are a good example of this – particularly if we move into the realm of databases. For instance, on a regular Linux filesystem (e.g., XFS or EXT4), it would be perfectly possible to configure a filesystem backup of an Oracle server. No database plugin, no communication with RMAN, just a rolling sweep of the filesystem, writing all content encountered to the backup device(s).

But it wouldn’t be recoverable. It’s a crash-consistent backup, not an application-consistent backup. So, a successful backup must be a backup that can be successfully recovered from, too.

Five – If an off-site/redundant copy is required, it is successfully performed

Ideally, every backup should get a redundant copy – a clone. Practically, this may not always be the case. The business may decide, for instance, that ‘bronze’ tiered backups – say, of dev/test systems – do not require backup replication. Ultimately this becomes a risk decision for the business, and so long as the right role(s) have signed off against the risk, and it’s deemed to be a legally acceptable risk, then there may not be copies made of specific types of backups.

But for the vast majority of businesses, there will be backups for which there is a legal/compliance requirement for backup redundancy. As I’ve said before, your backups should not be a single point of failure within your data protection environment.

So, if a backup succeeds but its redundant copy fails, the backup should, to a degree, be considered to have failed. This doesn’t mean you have to necessarily do the backup again, but if redundancy is required, it means you do have to make sure the copy gets made. That then hearkens back to requirement three – the end state has to be captured and reported on. If you’re not capturing/reporting on end-state, it means you won’t be aware if the clone of the backup has succeeded or not.

Six – The backup completes within the required timeframe

You have a flight to catch at 9am. Because of heavy traffic, you don’t arrive at the airport until 1pm. Did you successfully make it to the airport?

It’s the same with backups. If, for compliance reasons, you’re required to have backups complete within 8 hours, but they take 16 to run, have they successfully completed? They might exit without an error condition, but if SLAs have been breached, or legal requirements have not been met, it technically doesn’t matter that they finished without error. The time it took them to finish was, in fact, the error condition. Saying it’s a successful backup at this point is sophistry.

Seven – The backup does not prevent the next backup from running

This can happen in one of two different ways. The first is actually a special condition of rule six – even if there are no compliance considerations, if a backup meant to run once a day takes longer than 24 hours to complete, then by extension it’s going to prevent the next backup from running. This becomes a double failure – not only does the earlier backup overrun its window, but the next backup doesn’t run at all because the earlier backup is blocking it.

The second way is not necessarily related to backup timing – this is where a backup completes, but leaves the system in a state that prevents the next backup from running. This isn’t a common thing, but I have seen situations where, for whatever reason, the way a backup finished prevented the next backup from running. Again, that becomes a double failure.

Eight – It does not require manual intervention to complete

There are two effective categories of backups – those that are started automatically, and those that are started manually. A backup may in fact be started manually (e.g., in the case of an ad-hoc backup), but it should still be able to complete without manual intervention.

As soon as manual intervention is required in the backup process, there’s a much greater risk of the backup not completing successfully, or within the required time-frame. This is, effectively, about designing the backup environment to reduce risk by eliminating human intervention. Think of it as one step removed from the classic challenge that if your backups are required but don’t start without human intervention, they likely won’t run. (A common problem with ‘strategies’ around laptop/desktop self-backup requirements.)

There can be workarounds for this – for example, if you need to trigger a database dump as part of the backup process (e.g., for a database without a plugin), it could be that a password needs to be entered, and the dump tool only accepts passwords interactively. Rather than having someone actually enter the password by hand, the dump command could instead be automated with tools such as Expect.
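To make that concrete, here’s a minimal sketch of that kind of workaround using Python’s pexpect module (a close cousin of classic Expect). The dbdump command, its password prompt and the credential file are all hypothetical – substitute whatever your dump tool actually presents:

```python
# Sketch: drive an interactive database dump tool without a human at the keyboard.
# The "dbdump" command, its prompt text and the credential file are hypothetical.
import pexpect

def run_dump(dump_cmd="dbdump --all --output /backup/staging/db.dmp",
             password_file="/etc/backup/dbdump.pw"):
    with open(password_file) as fh:        # keep the secret out of the script itself
        password = fh.read().strip()

    child = pexpect.spawn(dump_cmd, timeout=3600)
    child.expect("Password:")              # assumed interactive prompt
    child.sendline(password)
    child.expect(pexpect.EOF)              # wait for the dump to run to completion
    child.close()

    if child.exitstatus != 0:              # surface failure so it can be reported on
        raise RuntimeError("Dump failed with exit status %s" % child.exitstatus)

if __name__ == "__main__":
    run_dump()
```

Wrapped up like this, the dump can be triggered as a pre-backup command with no one sitting at a keyboard, and a failure becomes a visible error rather than a hung prompt.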

Nine – It does not unduly impact access to the data it is protecting

(We’re in the home stretch now.)

A backup should be as light-touch as possible. Perhaps the best example of a ‘heavy touch’ backup is a cold database backup, where the database is shut down for the duration of the backup – a perfect example of a backup directly impacting/impeding access to the data being protected. Sometimes it’s more subtle though – high performance systems may have limited IO and system resources available to handle the streaming of a backup, for instance. If system performance is degraded by the backup, then the backup should be considered unsuccessful.

I liken this to uptime vs availability. A server might be up, but if the performance of the system is so poor that users consider the service it offers unusable, it’s not really available. That’s where, for instance, systems like ProtectPoint can be so important – in high performance environments it’s not just about getting a high speed backup, but about limiting the load on the database server during the backup process.

Ten – It is predictably repeatable

Of course, there are ad-hoc backups that might only ever need to be run once, or backups that you may never need to run again (e.g., pre-decommissioning backup).

The vast majority of backups within an environment though will be repeated daily. Ideally, the result of each backup should be predictably repeatable. If the backup succeeds today, and there are absolutely no changes to the systems or environment, for instance, then it should be reasonable to expect the backup will succeed tomorrow. That doesn’t obviate the requirement for end-state capturing and reporting; it does mean though that the backup results shouldn’t effectively be random.

In Summary

It’s easy to understand why the simplest answer (“it completes without error”) can be so easily assumed to be the whole answer to “what constitutes a successful backup?” There’s no doubt it forms part of the answer, but if we think beyond the basics, there are definitely a few other contributing factors to achieving really successful backups.

Consistency, impact, recovery usefulness and timeliness, as well as all the other rules outlined above, also come into how we define a truly successful backup. And remember, it’s not about making more work for ourselves; it’s about preventing future problems.


If you’ve thought the above was useful, I’d suggest you check out my book, Data Protection: Ensuring Data Availability. Available in paperback and Kindle formats.

What to do on world backup day

Mar 30, 2017
 

World backup day is approaching. (A few years ago now, someone came up with the idea of designating one day of the year to recognise backups.) Funnily enough, I’m not a fan of world backup day, simply because we don’t backup for the sake of backing up, we backup to recover.

Every day should, in fact, be world backup day.

Something that isn’t done enough – isn’t celebrated enough, isn’t tested enough – is recoveries. For many organisations, recovery tests consist of actually doing a recovery when requested, and things like long term retention backups are rarely tested, and even more rarely recovered from.


So this Friday, March 31, I’d like to suggest you don’t treat it as World Backup Day, but as World Recovery Test Day. Use the opportunity to run a recovery test within your organisation (following proper processes, of course!) – preferably a recovery that you don’t normally run in terms of day to day operations. People only request file recoveries? Sounds like a good reason to run an Exchange, SQL or Oracle recovery to me. Most recoveries are Exchange mail level recoveries? Excellent, you know they work – let’s run a recovery of a complete filesystem somewhere.

All your recoveries are done within a 30 day period of the backup being taken? That sounds like an excellent reason to do a recovery from an LTR backup written 2+ years ago, too.

Part of running a data protection environment is having routine tests to validate ongoing successful operations, and being able to confidently report back to the business that everything is OK. There’s another, personal and selfish aspect to it, too – one I learnt more than a decade ago when I was still an on-call system administrator: having well-tested recoveries means you can sleep easily at night, knowing that if the pager or mobile phone does shriek you into blurry-eyed wakefulness at 1am, you can in fact log onto the required server and run the recovery without an issue.

So this World Backup Day, do a recovery test.


The need to have an efficient and effective testing system is something I cover in more detail in Data Protection: Ensuring Data Availability. If you want to know more, feel free to check out the book on Amazon or CRC Press. Remember that it doesn’t matter how good the technology you deploy is if you don’t have the processes and training to use it.

Mar 13, 2017
 

The NetWorker usage report for 2016 is now complete and available here. As per previous years’ surveys, this survey ran from December 1, 2016 through to January 1, 2017.


There were some interesting statistics and trends arising from this survey. The percentage of businesses not using backup to disk in at least some form within their environment fell to just 1% of respondents. That’s 99% of respondents having some form of backup to disk within their environment!

More and more respondents are cloning within their environments – if you’re not cloning in yours, you’re falling behind the curve now in terms of ensuring your backup environment can’t be a single point of failure.

There are plenty of other results and details in the survey report that you may be interested in, including:

  • Changes to the number of respondents using dedicated backup administrators
  • Cloud adoption rates
  • Ransomware attacks
  • The likelihood of businesses using or planning to use object storage as part of their backup environment
  • and many more

You can download the survey from the link above.

Just a reminder: “Data Protection: Ensuring Data Availability” is out now, and you can buy it in both paperback and electronic format from Amazon, or in paperback from the publisher, CRC Press. If you’ve enjoyed or found my blog useful, I’m sure you’ll find value in my latest book, too!

One respondent from this year’s survey will be receiving a signed copy of the book directly from me, too! That winner has been contacted.

Build vs Buy

Feb 18, 2017
 

Converged, and even more so hyperconverged, computing is all premised around the notion of build vs buy. Are you better off having your IT staff build your infrastructure from the ground up, managing it in silos of teams, or do you want to buy tightly integrated kit, land it on the floor and start using it immediately?

Dell-EMC’s team use the analogy – do you build your car, or do you buy it? I think this is a good analogy: it speaks to how the vast majority of car users consume vehicle technology. They buy a complete, engineered car as a package, and drive it off the car sales lot complete. Sure, there’s tinkerers who might like to build a car from scratch, but they’re not the average consumer. For me it’s a bit like personal computing – I gave up years ago wanting to build my own computers. I’m not interested in buying CPUs, RAM, motherboards, power supplies, etc., dealing with the landmines of compatibility, drivers and physical installation before I can get a usable piece of equipment.

This is where many people believe IT is moving, and there’s some common sense in it – it’s about time to usefulness.

A question I’m periodically posed is – what has backup got to do with the build vs buy aspect of hyperconverged? For one, it’s not just backup – it’s data protection – but secondly, it has everything to do with hyperconverged.

If we return to that build vs buy example – would you build a car or buy a car? – let me ask a question of you as a car consumer, a buyer rather than a builder: would you get airbags included, or would you search around for third party airbags?


To be honest, I’m not aware of anyone who buys a car, drives it off the lot, and starts thinking, “Do I go to Airbags R Us, or Art’s Airbag Emporium to get my protection?”

That’s because the airbags come built-in.

For me at least, that’s the crux of the matter in the converged and hyperconverged market. Do you want third party airbags that you have to install and configure yourself, and hope they work with that integrated solution you’ve bought, or do you want airbags included and installed as part of the purchase?

You buy a hyperconverged solution because you want integrated virtualisation, integrated storage, integrated configuration, integrated management, integrated compute, integrated networking. Why wouldn’t you also want integrated data protection? Integrated data protection that’s baked into the service catalogue and part of the kit as it lands on your floor. If it’s about time to usefulness it doesn’t stop at the primary data copy – it should also include the protection copies, too.

Airbags shouldn’t be treated as optional, after-market extras, and neither should data protection.

Feb 12, 2017
 

On January 31, GitLab suffered a significant issue resulting in a data loss situation. In their own words, the replica of their production database was deleted, the production database was then accidentally deleted, and then it turned out their backups hadn’t run. They got systems back with snapshots, but not without permanently losing some data. This in itself is an excellent example of the need for multiple data protection strategies; your data protection should not represent a single point of failure within the business, so having layered approaches to achieve a variety of retention times, RPOs and RTOs, and to guard against cascading failures, is always critical.

To their credit, they’ve published a comprehensive postmortem of the issue and Root Cause Analysis (RCA) of the entire issue (here), and must be applauded for being so open with everything that went wrong – as well as the steps they’re taking to avoid it happening again.


But I do think some of the statements in the postmortem and RCA require a little more analysis, as they’re indicative of some of the challenges that take place in data protection.

I’m not going to speak to the scenario that led to the production database, rather than the replica, being deleted. This falls into the category of “ooh crap” system administration mistakes that, sadly, many of us will make in our careers. As the saying goes: accidents happen. (I have literally been in the situation of accidentally deleting a production database rather than its replica, and I can well and truly sympathise with any system or application administrator making that mistake.)

Within GitLab’s RCA under “Problem 2: restoring GitLab.com took over 18 hours”, several statements were made that irk me as a long-term data protection specialist:

Why could we not use the standard backup procedure? – The standard backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6.

As evidenced by a later statement (see the next RCA statement below), the procedure did not fail silently; instead, GitLab chose to filter the output of the backup process in a way that they did not monitor. There is, quite simply, a significant difference between failing silently and having results silently ignored. The latter is a far more accurate description than the former. A command that fails silently is one that exits with no error condition or alert. Instead:

Why did the backup procedure fail silently? – Notifications were sent upon failure, but because of the Emails being rejected there was no indication of failure. The sender was an automated process with no other means to report any errors.

The pg_dump command didn’t fail silently, as previously asserted. It generated output which was silently ignored due to a system configuration error. Yes, a system failed to accept the emails, and a system therefore failed to send the emails, but at the end of the day, a human failed to see or otherwise check as to why the backup reports were not being received. This is actually a critical reason why we need zero error policies – in data protection, no error should be allowed to continue without investigation and rectification, and a change in or lack of reporting or monitoring data for data protection activities must be treated as an error for investigation.
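As a hedged illustration of the difference, here’s the shape of a wrapper that doesn’t let pg_dump “fail silently”: it checks the exit status and stderr directly, and records a status file that monitoring can poll – so the absence of a recent, successful status record is itself an alertable error. The paths, status-file convention and database name are assumptions for illustration, not GitLab’s actual tooling:

```python
# Sketch: run pg_dump and treat any non-zero exit - and any missing or stale
# status record - as an error in its own right. Paths and names are illustrative.
import json
import subprocess
import time

STATUS_FILE = "/var/run/backup/pg_dump_status.json"   # assumed, polled by monitoring
DUMP_TARGET = "/backup/staging/production.dump"       # assumed output path

def run_pg_dump(database="production_db"):            # hypothetical database name
    result = subprocess.run(
        ["pg_dump", "--format=custom", "--file", DUMP_TARGET, database],
        capture_output=True, text=True)

    status = {
        "timestamp": time.time(),
        "returncode": result.returncode,
        "stderr": result.stderr.strip(),
        "ok": result.returncode == 0,
    }
    with open(STATUS_FILE, "w") as fh:                 # monitoring alerts if this file
        json.dump(status, fh)                          # is stale or not ok

    if not status["ok"]:
        # Don't rely on email alone: raise so the scheduler records a hard failure too.
        raise RuntimeError("pg_dump failed: %s" % status["stderr"])

if __name__ == "__main__":
    run_pg_dump()
```

The point isn’t the dozen lines of Python; it’s the zero error policy they encode – every run produces a result that someone (or something) is obliged to look at, and “no result” is treated the same as “failed”.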

Why were Azure disk snapshots not enabled? – We assumed our other backup procedures were sufficient. Furthermore, restoring these snapshots can take days.

Simple lesson: If you’re going to assume something in data protection, assume it’s not working, not that it is.

Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing the procedure.

There are two sections of the answer that should serve as a dire warning: “there was no ownership”, “nobody was responsible”. This is a mistake many businesses make, but I don’t for a second believe there was no ownership. Instead, there was a failure to understand ownership. Looking at the “Team | GitLab” page, I see:

  • Dmitriy Zaporozhets, “Co-founder, Chief Technical Officer (CTO)”
    • From a technical perspective the buck stops with the CTO. The CTO does own the data protection status for the business from an IT perspective.
  • Sid Sijbrandij, “Co-founder, Chief Executive Officer (CEO)”
    • From a business perspective, the buck stops with the CEO. The CEO does own the data protection status for the business from an operational perspective, and from having the CTO reporting directly up.
  • Bruce Armstrong and Villi Iltchev, “Board of Directors”
    • The Board of Directors is responsible for ensuring the business is running legally, safely and financially securely. They indirectly own all procedures and processes within the business.
  • Stan Hu, “VP of Engineering”
    • Vice-President of Engineering, reporting to the CEO. If the CTO sets the technical direction of the company, an engineering or infrastructure leader is responsible for making sure the company’s IT works correctly. That includes data protection functions.
  • Pablo Carranza, “Production Lead”
    • Reporting to the Infrastructure Director (a position currently open). Data protection is a production function.
  • Infrastructure Director:
    • Currently an open position assigned to Sid (see above), the Infrastructure Director is another link in the chain of responsibility and ownership for data protection functions.

I’m not calling these people out to shame them, or to rub salt into their wounds – mistakes happen. But I am suggesting GitLab has abnegated its collective responsibility by simply stating “there was no ownership”, when in fact, as evidenced by their “Team” page, there was. There was plenty of ownership, but it was clearly not appropriately understood along the technical lines of the business, and indeed right up into the senior operational lines of the business.

You don’t get to say that no-one owned the data protection functions. Only that no-one understood they owned the data protection functions. One day we might stop having these discussions. But clearly not today.

 

Jan 11, 2017
 

There are currently a significant number of vulnerable MongoDB databases being targeted by ransomware attackers, and even though the attacks are ongoing, it’s worth taking a moment or two to reflect on some key lessons that can be drawn from them.

If you’ve not heard of it, you may want to check out some of the details linked to above. The short summary though is that MongoDB’s default deployment model has been a rather insecure one, and it’s turned out there are a lot of unsecured public-facing databases out there. A lot of them have been hit by hackers recently, with the contents of the databases deleted and the owners told to pay a ransom to get their data back. Whether that will actually get them their data back is, of course, another issue.


The first lesson of course is that data protection is not a single topic. More so than a lot of other data loss situations, the MongoDB scenario points to the simple, root lesson for any IT environment: data protection is also a data security factor.


For the most part, when I talk about data protection I’m referring to storage protection – backup and recovery, snapshots, replication, continuous data protection, and so on. That’s the focus of my next book, as you might imagine. But a sister process in data protection has been, and will always be, data security. So in the first instance in the MongoDB attacks, we’re seeing the incoming threat vector arise entirely from the simple scenario of unsecured systems. A lackadaisical approach to security is exactly what’s happened – for developers and deployers alike – in the MongoDB space, and the result to date is estimated to be around 93TB of data wiped. That number will only go up.

The next lesson though is that backups are still needed. In The MongoDB attacks: 93 terabytes of data wiped out (linked again from above), Dissent writes that of the 118 victims analysed:

Only 13 report that they had recently backed up the now-wiped database; the rest reported no recent backups

That number is awful. Just over 11% of impacted sites had recent backups. That’s not data protection, that’s data recklessness. (And as the report mentions, 73% of the databases were flagged as being production.) In one instance:

A French healthcare research entity had its database with cancer research wiped out. They reported no recent backup.

That’s another lesson there: data protection isn’t just about bits and bytes, it’s about people’s lives. If we maintain data, we have an ethical obligation to protect it. What if that cancer data above held some clue, some key, to saving someone’s life? Data loss isn’t just data loss: it can lead to loss of money, loss of livelihood, or perhaps even loss of life.

Those details are from a sample of 118 sourced from a broader category of 27,000 hit systems.

So the next lesson is that even now, 2017, we’re still having to talk about backup as if it’s a new thing. During the late 90s I thought there was a light at the end of the tunnel for discussions about “do I need backup?”, and I’ve long since resigned myself to the fact I’ll likely still be having those conversations up until the day I retire, but it’s a chilling reminder of the ease at which systems can now be deployed without adequate protection. One of the common responses you’ll see to “we can’t back this up”, particularly in larger databases, is the time taken to complete a backup. That’s something Dell EMC has been focused on for a while now. There’s storage integrated data protection via ProtectPoint, and more recently, there’s BoostFS for Data Domain, giving systems distributed segment processing directly onto the database server for high speed deduplicated backups. (And yes, MongoDB was one of the systems in mind when BoostFS was developed.) If you’ve not heard of BoostFS yet, it was included in DDOS 6, released last year.

It’s not just backup though – for systems with higher criticality there should be multi-layered protection strategies: backups will give you potentially longer term retention, and off-platform protection, but if you need really fast recovery times with very low RPOs and RTOs, your system will likely need replication and snapshots as well. Data protection isn’t a “one size fits all” scenario that some might try to preach; it’s multi-layered and it can encompass a broad range of technology. (And if the data is super business critical you might even want to go the next level and add IRS protection for it, protecting yourself not only from conventional data loss, but also situations where your business is hacked as well.)

The fallout and the data loss from the MongoDB attacks will undoubtedly continue for some time. If one thing comes out of it, I’m hoping it’ll be a stronger understanding from businesses in 2017 that data protection is still a very real topic.

[Edit/Addendum]

A speculative lesson: what percentage of these MongoDB deployments fall under the banner of ‘Shadow IT’ – i.e., non-IT deployments of systems by developers, other business groups and so on within organisations? Does this also serve as a reminder of the risks that can be introduced when non-IT groups deploy IT systems without appropriate processes and rigour? We may never know the percentage breakdown between IT-led and Shadow IT-led deployments, but it’s certainly food for thought.

My cup runneth over

Nov 24, 2016
 

How do you handle data protection storage capacity?

How do you handle growth – regular or unexpected – in your data protection volumes?


Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!


Is your business reactive or proactive to data protection capacity requirements?


In the land of tape, dealing with capacity growth in data protection was both easy and insidiously obfuscated. Tape capacity management is basically a modern version of Hilbert’s Infinite Hotel Paradox – you sort-of, kind-of never run out of capacity because you always just buy another box of tapes. Problem solved, right? (No, more a case of the can kicked down the road.) Problem “solved” and you’ve got 1,000, 10,000, 50,000 tapes in a multitude of media types that you don’t even have tape drives to read any more.

Yet we like to focus on the real world now, and tape isn’t the de facto standard for backup systems any more: it’s disk. Disk gives us great power, but with great power comes great responsibility (sorry – even though I’m not a Spiderman fan, I couldn’t resist). Tape is the opposite: tape gives us no power, and with no power comes no responsibility (yes, I’m also a Kickass fan).

For businesses that still do disk-to-disk-to-tape, where disk is treated more like a staging area and excess data is written out to tape, the problem is seemingly solved because – you guessed it – you can always just buy another box of tapes and stage more data from disk backup storage to tape storage. Again, that’s kicking the can down the road. I’ve known businesses with company-wide data protection policies mandating up to 3 months of online recoverability from disk end up with two weeks or less of data stored on disk, because the data to be protected has continued to grow, no scaling has been done on the storage, and – you guessed it – tape was the interim solution.

Aside: When I first joined my first Unix system administration team in 1996, the team had just recently configured an interim DNS server which they called tmp, because it was going to be quickly replaced by another server, which for the short term was called nc, for new computer. When I left in 2000, tmp and nc were still there; in fact, nnc (yes, new-new-computer) was deployed shortly thereafter to replace nc, and eventually, a year or two after I left, tmp was finally decommissioned.

Interim solutions have a tendency to stick. In fact, it’s a common story – capacity problem with data protection so let’s deploy an interim solution and solve it later. Later-later. Much later. Much later-later. Ad-infinitum.

There is, undoubtedly, a growing maturity in handling data protection storage management and capacity planning coming out of the pure disk and disk/cloud storage formats. While this is driven by necessity, it’s also an important demonstration that IT processes need to mature as the business matures.

If you’re new to pure disk based, or disk/cloud based data protection storage, you might want to stop and think carefully about your data protection policies and procurement processes/cycles so that you’re able to properly meet the requirements of the business. Here are a few tips I’ve learnt over the years…

80% is the new 100%

This one is easy. Don’t think of 100% capacity as being 100% capacity. Think of 80% as 100%. Why? Because you need runway to either procure storage, migrate data or get formal approval for changes to retention and backup policies. If you wait until you’re at 90, 95 or even 100% capacity, you’ve left your run too late and you’re just asking for many late or sleepless nights managing a challenge that could have been proactively dealt with earlier.

The key to management is measurement

I firmly believe you can’t manage something that has operational capacity constraints (e.g., “if we hit 100% capacity we can’t do more backups”) if you’re not actively measuring it. That doesn’t mean periodically logging into a console or running a “df -h” or whatever the “at a glance” look is for your data protection storage; it means capturing measurement data and having it available in both reports and dashboards so it is instantly visible.
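Even something as trivially simple as the following sketch gets you from “glancing at a console” to data you can chart, report on and trend. The mount point and CSV location are assumptions – use whatever your protection storage and reporting stack actually look like:

```python
# Sketch: capture a daily utilisation sample for the data protection storage mount
# into a CSV that dashboards and trend reports can consume. Paths are illustrative.
import csv
import shutil
import time

MOUNT_POINT = "/backup"                            # assumed protection storage mount
SAMPLES_CSV = "/var/log/capacity/backup_util.csv"  # assumed sample history

def record_sample():
    usage = shutil.disk_usage(MOUNT_POINT)
    percent_used = 100.0 * usage.used / usage.total
    with open(SAMPLES_CSV, "a", newline="") as fh:
        csv.writer(fh).writerow([time.strftime("%Y-%m-%d"), round(percent_used, 1)])

if __name__ == "__main__":
    record_sample()    # run daily from cron or your scheduler of choice
```

(If you’re using a purpose-built reporting product, it’s already doing the equivalent for you – the point is simply that the samples have to exist before any of the trending below is possible.)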

The key to measurement is trending

You can capture all the data in the world and make it available in a dashboard, but if you don’t perform appropriate localised trending against that data to analyse it, you’re making your own good self the bottleneck (and weakest link) in the capacity management equation. You need to have trends produced as part of your reporting processes to understand how capacity is changing over time. These trends should be reflective of your own seasonal data variations or sampled over multiple time periods. Why? Well, if you have disk based data protection storage in your environment and do a linear forecast on capacity utilisation from day one, you’ll likely get a smoothing based on lower figures from earlier in the system lifecycle that could actually obfuscate more recent results. So you want to capture and trend that information for comparison, but you equally want to capture and trend shorter timeframes to ensure you have an understanding of shorter term changes. Trends based on the last six and three months usage profiles can be very useful in identifying what sort of capacity management challenges you’ve got based on short term changes in data usage profiles – a few systems for instance might be considerably spiking in utilisation, and if you’re still comparing against a 3-year timeframe dataset or something along those lines, the more recent profile may not be accurately represented in forecasts.

In short: measuring over multiple periods gives you the best accuracy.
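To make the multi-window idea concrete, here’s a minimal sketch that fits a linear trend over several windows of the daily samples collected above and estimates how long until the 80% threshold is crossed. It’s deliberately simplistic – a straight numpy polyfit, with window sizes that are assumptions rather than recommendations:

```python
# Sketch: fit linear trends over multiple windows of daily utilisation samples and
# estimate days until the 80% threshold is crossed. Windows are illustrative only.
import numpy as np

def days_until_threshold(samples, threshold=80.0):
    """samples: daily utilisation percentages, oldest first."""
    days = np.arange(len(samples))
    slope, intercept = np.polyfit(days, samples, 1)   # simple linear trend
    if slope <= 0:
        return None                                   # flat or shrinking: no forecast
    return (threshold - samples[-1]) / slope          # days from "today"

def report(samples):
    for label, window in (("3 months", 90), ("6 months", 180), ("lifetime", len(samples))):
        eta = days_until_threshold(samples[-window:])
        print("{0:>9}: {1}".format(
            label, "no growth trend" if eta is None else "~{0} days to 80%".format(int(eta))))
```

If the three-month forecast says three weeks and the lifetime forecast says six months, believe the three-month one – that’s exactly the smoothing effect described above.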

Maximum is the new minimum

Linear forecasts of trending information are good if you’re just slowly, continually increasing your storage requirements. But if you’re either staging data (disk as staging) or running garbage collection (e.g., deduplication), it’s quite possible to get increasing sawtooth cycles in capacity utilisation on your data protection storage. And guess what? It doesn’t matter if your capacity is sufficient for the average utilisation if you’ll run out of space on the day before the oldest backups are deleted or garbage collection takes place. So make sure that when you’re trending, you’re looking at how you meet the changing maximum peaks, not the average sizes.
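Continuing the earlier sketch, trending the peaks rather than the averages is a small change: reduce the daily samples to per-cycle maximums before fitting the trend. The seven day cycle length here is an assumption – use whatever your staging or garbage collection cycle actually is:

```python
# Sketch: reduce a sawtooth utilisation series to per-cycle peaks before trending,
# so the forecast tracks the maximums rather than the (misleading) average.
import numpy as np

def cycle_peaks(samples, cycle_days=7):
    """Return the maximum utilisation seen in each cycle, oldest first."""
    trimmed = samples[:len(samples) - len(samples) % cycle_days]
    return np.array(trimmed).reshape(-1, cycle_days).max(axis=1)
```

Feed the output of cycle_peaks() into the days_until_threshold() function from the previous sketch and you get a forecast against the peaks instead of the averages – just remember the answer then comes back in cycles rather than days.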

Know your windows

There’s three types of windows I’m referring to here – change, change freeze, and procurement.

You need to know them all intimately.

You’re at 95% capacity but you anticipated this and additional data protection storage has just arrived in your datacentre’s receiving bay, so you should be right to install it – right? What happens if you then have a week’s wait to have the change board consider your request for an outage – or datacentre access – to install the extra capacity? Will you be able to hold on that long? That’s knowing your change windows.

You know you’re going to run out of capacity in two months time if nothing is done, so you order additional data protection storage and it arrives on December 20. The only problem is a mandatory company change blackout started on December 19 and you literally cannot install anything, until January 20. Do you have enough capacity to survive? That’s knowing your freeze windows.

You know you’re at 80% capacity today and based on the trends you’ll be at 90% capacity in 3 weeks and 95% capacity in 4 weeks. How long does it take to get a purchase order approved? How long does it take the additionally purchased systems to arrive on-site? If it takes you 4 weeks to get purchase approval and another 3 weeks for it to arrive after the purchase order is sent, maybe 70%, not 80%, is your new 100%. That’s knowing your procurement windows.
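The arithmetic behind that last example is worth making explicit, because it’s how you work out your real trigger threshold. A toy version, with purely illustrative growth and lead-time figures:

```python
# Sketch: given a weekly growth rate and the total lead time to get new capacity
# installed (approval + delivery + change window), work out the utilisation level
# at which the purchase process must start. The figures are illustrative only.
def trigger_threshold(ceiling=100.0, weekly_growth_pct=4.0, lead_time_weeks=7):
    return ceiling - (weekly_growth_pct * lead_time_weeks)

print(trigger_threshold())   # 4%/week growth, 7 weeks lead time -> act at 72%
```

Run the same sum with your own growth rate and procurement reality and you’ll quickly see why 70% – or even lower – might be your new 100%.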

Final thoughts

I want to stress – this isn’t a doom and gloom article, even if it seems I’m painting a blunt picture. What I’ve described above are expert tips – not from myself, but from my customers, and customers of colleagues and friends, whom I’ve seen manage data protection storage capacity well. If you follow at least the above guidelines, you’re going to have a far more successful – and more relaxed – time of it all.

And maybe you’ll get to spend Thanksgiving, Christmas, Ramadan, Maha Shivaratri, Summer Solstice, Melbourne Cup Day, Labour Day or whatever your local holidays and festivals are with your friends and families, rather than manually managing an otherwise completely manageable situation.


Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!


 

Aug 9, 2016
 

I’ve recently been doing some testing around Block Based Backups, and specifically recoveries from them. This has acted as an excellent reminder of two things for me:

  • Microsoft killing Technet is a real PITA.
  • You backup to recover, not backup to backup.

The first is just a simple gripe: running up an eval Windows server every time I want to run a simple test is a real crimp in my style, but $1,000+ licenses for a home lab just can’t be justified. (A “hey this is for testing only and I’ll never run a production workload on it” license would be really sweet, Microsoft.)

The second is the real point of the article: you don’t backup for fun. (Unless you’re me.)


You ultimately backup to be able to get your data back, and that means deciding your backup profile based on your RTOs (recovery time objectives), RPOs (recovery point objectives) and compliance requirements. As a general rule of thumb, this means you should design your backup strategy to meet at least 90% of your recovery requirements as efficiently as possible.

For many organisations this means backup requirements can come down to something like the following: “All daily/weekly backups are retained for 5 weeks, and are accessible from online protection storage”. That’s why a lot of smaller businesses in particular get Data Domains sized for say, 5-6 weeks of daily/weekly backups and 2-3 monthly backups before moving data off to colder storage.

But while online is online is online, we have to think of local requirements, SLAs and flow-on changes for LTR/Compliance retention when we design backups.

This is something we can consider with things even as basic as the humble filesystem backup. These days there’s all sorts of things that can be done to improve the performance of dense (and dense-like) filesystem backups – by dense I’m referring to very large numbers of files in relatively small storage spaces. That’s regardless of whether it’s in local knots on the filesystem (e.g., a few directories that are massively oversubscribed in terms of file counts), or whether it’s just a big, big filesystem in terms of file count.

We usually think of dense filesystems in terms of the impact on backups – and this is not a NetWorker problem; this is an architectural problem that operating system vendors have not solved. Filesystems struggle to scale their operational performance for sequential walking of directory structures when the number of files starts exponentially increasing. (Case in point: Cloud storage is efficiently accessed at scale when it’s accessed via object storage, not file storage.)

So there’s a number of techniques that can be used to speed up filesystem backups. Let’s consider the three most readily available ones now (in terms of being built into NetWorker):

  • PSS (Parallel Save Streams) – Dynamically builds multiple concurrent sub-savestreams for individual savesets, speeding up the backup process by having multiple walking/transfer processes.
  • BBB (Block Based Backup) – Bypasses the filesystem entirely, performing a backup at the block level of a volume.
  • Image Based Backup – For virtual machines, a VBA based image level backup reads the entire virtual machine at the ESX/storage layer, bypassing the filesystem and the actual OS itself.

So which one do you use? The answer is a simple one: it depends.

It depends on how you need to recover, how frequently you might need to recover, what your recovery requirements are from longer term retention, and so on.

For virtual machines, VBA is usually the method of choice as it’s the most efficient backup method you can get, with very little impact on the ESX environment. It can recover a sufficient number of files in a single session for most use requirements – particularly if file services have been pushed (where they should be) into dedicated systems like NAS appliances. You can do all sorts of useful things with VBA backups – image level recovery, changed block tracking recovery (very high speed in-place image level recovery), instant access (when using a Data Domain), and of course file level recovery. But if your intent is to recover tens of thousands of files in a single go, VBA is not really what you want to use.

It’s the recovery that matters.

For compatible operating systems and volume management systems, Block Based Backups work regardless of whether you’re in a virtual machine or whether you’re on a physical machine. If you’re needing to backup a dense filesystem running on Windows or Linux that’s less than ~63TB, BBB could be a good, high speed method of achieving that backup. Equally, BBB can be used to recover large numbers of files in a single go, since you just mount the image and copy the data back. (I recently did a test where I dropped ~222,000 x 511 byte text files into a single directory on Windows 2008 R2 and copied them back from BBB without skipping a beat.)

BBB backups aren’t readily searchable though – there’s no client file index constructed. They work well for systems where content is of a relatively known quantity and users aren’t going to be asking for those “hey I lost this file somewhere in the last 3 weeks and I don’t know where I saved it” recoveries. It’s great for filesystems where it’s OK to mount and browse the backup, or where there’s known storage patterns for data.

It’s the recovery that matters.

PSS is fast, but in any smack-down test BBB and VBA backups will beat it for backup speed. So why would you use them? For a start, they’re available on a wider range of platforms – VBA requires ESX virtualised backups, BBB requires Windows or Linux and ~63TB or smaller filesystems, PSS will pretty much work on everything other than OpenVMS – and its recovery options work with any protection storage as well. Both BBB and VBA are optimised for online protection storage and being able to mount the backup. PSS is an extension of the classic filesystem agent and is less specific.

It’s the recovery that matters.

So let’s revisit that earlier question: which one do you use? The answer remains: it depends. You pick your backup model not on the basis of “one size fits all” (a flawed approach always in data protection), but your requirements around questions like:

  • How long will the backups be kept online for?
  • Where are you storing longer term backups? Online, offline, nearline or via cloud bursting?
  • Do you have more flexible SLAs for recovery from Compliance/LTR backups vs Operational/BAU backups? (Usually the answer will be yes, of course.)
  • What’s the required recovery model for the system you’re protecting? (You should be able to form broad groupings here based on system type/function.)
  • Do you have any externally imposed requirements (security, contractual, etc.) that may impact your recovery requirements?

Remember there may be multiple answers. Image level backups like BBB and VBA may be highly appropriate for operational recoveries, but for long term compliance your business may have needs that trigger filesystem/PSS backups for those monthlies and yearlies. (Effectively that comes down to making the LTR backups as robust in terms of future infrastructure changes as possible.) That sort of flexibility of choice is vital for enterprise data protection.

One final note: the choices, once made, shouldn’t stay rigidly inflexible. As a backup administrator or data protection architect, your role is to constantly re-evaluate changes in the technology you’re using to see how and where they might offer improvements to existing processes. (When it comes to release notes: constant vigilance!)

Betting the company

Jun 15, 2016
 

Short of networking itself, backup and recovery systems touch more of your infrastructure than anything else. So it’s pretty common for any backup and recovery specialist to be asked how we can protect a ten or sometimes even twenty year old operating system or application.

Sure you can backup Windows 2012, but what about NT 4?

Sure you can backup Solaris 11, but what about Tru64 v5?

Sure you can backup Oracle 12, but what about Oracle 8?

These really are questions we get asked.

I get these questions. I even have an active Windows 2003 SMB server sitting in my home lab running as an RDP jump-point. My home lab.


So it’s probably time for me to admit: I’m not really speaking to backup administrators with this article, but the broader infrastructure teams and, probably more so, the risk officers within companies.

Invariably we get asked if we can backup AncientOS 1.1 or DefunctDatabase 3.2 because those systems are still in use within a business, and inevitably that’s because they’re in production use within a company. Sometimes they’re even running pseudo-mission critical services, but more often than not they’re just simply running essential services the business has deemed too costly to migrate to another platform.

I’m well aware of this. In 1999 I was the primary system administrator involved in a Y2K remediation project for a SAP deployment. The system as deployed was running on an early version of Oracle 8 as I recall (it might have been Oracle 7 – it was 17 years ago…), sitting on Tru64 with an old (even for then) version of SAP. The version of the operating system, the version of Oracle, the version of SAP and even things like the firmware in the DAS enclosures attached were all unsupported by the various vendors for Y2K.

The remediation process was tedious and slow because we had to do piecemeal upgrades of everything around SAP and beg for Y2K compliance exceptions from Oracle and Digital for specific components. Why? When the business had deployed SAP two years before, they’d spent $5,000,000 or so customizing it to the nth degree, and upgrading it would require a similarly horrifically expensive remediation customization project. It was, quite simply, easier and cheaper to risk periphery upgrades around the application.

It worked. (As I recall, the only system in the company that failed over the Y2K transition was the Access database put together at the last minute by some tech-boffin-project manager designed to track any Y2K incidents over the entire globe for the company. I’ve always found there to be beautiful irony in that.)

This is how these systems limp along within organisations. It costs too much to change them. It costs too much to upgrade them. It costs too much to replace them.

And so day by day, month by month, year by year, the business continues to bet that bad things won’t happen. And what’s the collateral for the bet? Well it could be the company itself. If it costs that much to change them, upgrade them or to replace them, what’s the cost going to be if they fail completely? There’s an old adage of a CEO and a CIO talking, and the CIO says: “Why are you paying all this money to train people? What if you train them and they leave?” To which the CEO responds, “What if we don’t train them and they stay?” I think this is a similar situation.

I understand. I sympathise – even empathise, but we’ve got to find a better way to resolve this problem, because it’s a lot more than just a backup problem. It’s even more than a data protection problem. It’s a data integrity problem, and that creates an operational integrity problem.

So why is the question “do you support X?” asked when the original vendor for X doesn’t even support it any more – and may not have done for a decade or more?

The question is not really whether we can supply backup agents or backup modules old enough to work with these systems unsupported by their vendor of origin, or whether you can get access to a knowledge-base that stretches back far enough to include details of those systems. Supply? Yes. Officially support? How much official support do you get from the vendor of origin?

I always think in these situations there’s a broader conversation to be had. Those legacy applications and operating systems are a sea anchor to your business at a time when you increasingly have to be able to steer and move the ship faster and with greater agility. Those scenarios where you’re reliant on technology so old it’s no longer supported are exactly those sorts of scenarios that are allowing startups and younger, more agile competitors to swoop in and take customers from you. And it’s those scenarios that also leave you exposed to an old 10GB ATA drive failing, or a random upgrade elsewhere in the company finally and unexpectedly resulting in that critical or essential system no longer being able to access the network.

So how do we solve the problem?

Sometimes there’s a simple workaround – virtualisation. If it’s an old x86 based platform, particularly Windows, there’s a good chance the system can be virtualised so it can at least run on modern hardware. That doesn’t solve the ‘supported’ problem, but it does mean greater protection: image level backups regardless of whether there’s an agent for the internal virtual machine, and snapshots and replication to reduce the requirement to ever have to consider a BMR. Usually, being old, the amount of data on those systems is minimal, so that type of protection is not an issue.

But the real solution comes from being able to modernise the workload. We talk about platforms 1, 2 and 3 – platform 1 is the old mainframe approach to the world, platform 2 is the classic server/desktop architecture we’ve been living with for so long, and platform 3 is the new, mobile and cloud approach to IT. Some systems even get classified as platform ‘2.5’ – that interim step between the current and the new. What’s the betting that old curmudgeonly system that’s holding your business back from modernising is more like platform 1.5?

One way you can modernise is to look at getting innovative with software development. Increasing requirements for agility will drive more IT departments back to software development for platform 3 environments, so why not look at this as an opportunity to grow that development environment within your business? That’s where the EMC Federation can really swing in to help: Pivotal Labs is premised on new approaches to software development. Agile may seem like a buzz-word, but if you can cut software development down from 12-24 months to 6-12 weeks (or less!), doesn’t that mitigate many of the cost reasons to avoid dealing with the legacy platforms?

The other way of course is with traditional consulting approaches. Maybe there’s a way that legacy application can be adapted, or archived, in such a way that the business functions can be continued but the risk substantially reduced and the platform modernised. That’s where EMC’s consultancy services come in, where our content management services come in, and where our broad experience to hundreds of thousands of customer environments come in. Because I’ll be honest: your problems aren’t actually unique; you’re not the only business that’s dealing with legacy system components and while there may be industry-specific or even customer-specific aspects that are tricky, there’s a very, very good chance that somewhere, someone has gone through the same situation. The solution could very well be tailored specifically for your business, but the processes and tools that get used to get you to your solution don’t necessarily have to be bespoke.

It’s time to start thinking beyond whether those ancient and unsupported operating systems and applications can be backed up, but how they can be modernised so they stop holding the business back.

The Importance of Being Earnestly Automated

Apr 13, 2016
 

It was not long after I started in IT that I got the most important advice of my career. It came from a senior Unix system administrator in the team I’d just joined, and it shaped my career. In just eight words it stated the purpose of the system administrator, and I think IT as a whole:

The best system administrator is a lazy one.

On the face of it, it seems inappropriate advice: be lazy; yet that’s just the superficial reading of it. The real intent was this:

Automate everything you have to repeatedly do.


One of the reasons I was originally so blasé about Cloud was that it was old-hat. The same way that mainframe jockeys yawned and rolled their eyes when midrange people started talking about the wonders of virtualisation, I listened to people in IT extolling Cloud and found myself rolling my eyes – not just over the lack of data protection in early Cloud solutions – but to the stories about how Cloud was agile. And there’s no prizes for guessing where agility comes from: automation.

It surprises me twenty years on that the automation debate is still going on, and some people remain unconvinced.

There are three fundamental results of automation:

  • Repeatability
  • Reliability
  • Verifiability

When something is properly automated, it can be repeated easily and readily. That’s a fundamental tenet driving Cloud agility: you click on a button on a portal and hey presto!, a virtual machine is spun up and you receive an IP address to access it from. Or you click on a button on a portal and suddenly you’ve got yourself a SQL database or Exchange server or CRM system or any one of hundreds of different applications or business functions. If there’s human intervention at the back-end between when you click the button and when you get your service it’s not agile. It’s not Cloud. And it’s certainly not automated. Well, not fully or properly.

With repeatability becomes reliability – accuracy. It doesn’t matter whether the portal has been up for 1 hour or 1000 hours, it doesn’t matter whether it’s 01:00 or 13:00, and it doesn’t matter how many requests the portal has got: it’s not prone to error, it won’t miss a check-box because it’s rushed or tired or can’t remember what the correct option is. It doesn’t matter whether the computer doing the work in the background has never done it before because it’s just been added to the resource pool, or whether it’s done the process a million times before. Automation isn’t just about repeatability, it’s about reliable repeatability.

Equally as importantly, with automation – with repeatability – there comes verifiability. Not only can you reliably repeat the same activity time and time again, but every time it’s executed you can verify it was executed. You can monitor, measure and report. This can range from the simplest – verifying it was performed successfully or throwing an exception for a human to investigate – to the more complex, such as tracking and reporting trends on how long automated processes take to complete, so you can keep an eye on how the system is scaling.
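As a trivial, hedged illustration of all three properties in one place, here’s the shape of a wrapper that makes any automated task repeatable, reliable and verifiable. The task name, log location and the script it runs are placeholders – the pattern is the point:

```python
# Sketch: wrap an automated task so every run is verifiable - the outcome and duration
# are recorded, and failures raise an exception for a human to investigate.
import json
import subprocess
import time

RUN_LOG = "/var/log/automation/runs.jsonl"    # assumed append-only audit/metrics trail

def run_task(name, command):
    start = time.time()
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "task": name,
        "started": start,
        "duration_s": round(time.time() - start, 2),
        "returncode": result.returncode,
        "ok": result.returncode == 0,
    }
    with open(RUN_LOG, "a") as fh:            # every run leaves a record - run times
        fh.write(json.dumps(record) + "\n")   # can then be trended over time
    if not record["ok"]:
        raise RuntimeError("Task {0!r} failed:\n{1}".format(name, result.stderr))
    return record

if __name__ == "__main__":
    run_task("nightly-housekeeping", ["/usr/local/bin/housekeeping.sh"])  # hypothetical
```

The same pattern scales from a single cron job to a full orchestration portal: repeat it, trust it, and prove it ran.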

Once you’ve got automation in place, you’ve freed up your IT staff from boring and repetitive duties. That’s not to remove them from their jobs, but to let the humans in your staff do the jobs humans do best: those involving dealing with the unexpected, or thinking of new solutions. Automated, repeatable tasks are best left to scripts and processes and even robots (when it comes to production). The purpose of being a lazy system administrator was not so that you could sit at your desk doing nothing all day, but so you could spend time handling exceptions and errors, designing new systems, working on new projects, and yes, automating new systems.

Automation is not just a Cloud thing. Automation is not just a system administration thing. Or a database/application administration thing. Or a build thing. Or a…

Automation is everything in IT, particularly in the infrastructure space. Cloud has well and truly raised the profile of automation, but the fundamental concept is not new. I’d go so far as to say that if your business isn’t focused on automation, you’re doing IT wrong.