Would you buy a dangerbase?

 Backup theory, Policies  Comments Off on Would you buy a dangerbase?
Jun 072017

Databases. They’re expensive, aren’t they?

What if I sold you a Dangerbase instead?

What’s a dangerbase!? I’m glad you asked. A dangerbase is functionally almost exactly the same as a database, except it may be a little bit more lax when it comes to controls. Referential integrity might slip. Occasionally an insert might accidentally trigger a background delete. Nothing major though. It’s twenty percent less of the cost with only four times the risk of one of those pesky ‘databases’! (Oh, you might need 15% more infrastructure to run it on, but you don’t have to worry about that until implementation.)

Dangerbases. They’re the next big thing. They have a marketshare that’s doubling every two years! Two years! (Admittedly that means they’re just at 0.54% marketshare at the moment, but that’s double what it was last year!)

A dangerbase is a stupid idea. Who’d trust storing their mission critical data in a dangerbase? The idea is preposterous.

Sadly, dangerbases get considered all too often in the world of data protection.

Destroyed Bridge

What’s a dangerbase in the world of data protection? Here’s just some examples:

  • Relying solely on an on-platform protection mechanism. Accidents happen. Malicious activities happen. You need to always ensure you’ve got a copy of your data outside of the original production platform it is created and maintained on, regardless of what protection you’ve got in place there. And you should at least have one instance of each copy in a different physical location to the original.
  • Not duplicating your backups. Whether you call it a clone or a copy or a duplication doesn’t matter to me here – it’s the effect we’re looking for, not individual product nomenclature. If your backup isn’t copied, it means your backup represents a single point of failure in the recovery process.
  • Using post-process deduplication. (That’s something I covered in detail recently.)
  • Relying solely on RAID when you’re doing deduplication. Data Invulnerability Architecture (DIA) isn’t just a buzzterm, it’s essential in a deduplication environment.
  • Turning your databases into dangerbases by doing “dump and sweep”. Plugins have existed for decades. Dump and sweep is an expensive waste of primary storage space and introduces a variety of risk into your data protection environment.
  • Not having a data lifecycle policy! Without it, you don’t have control over capacity growth within your environment. Without that, you’re escalating your primary storage costs unnecessarily, and placing strain on your data protection environment – strain that can easily break it.
  • Not having a data protection advocate, or data protection architect, within your organisation. If data is the lifeblood of a company’s operations, and information is money, then failing to have a data protection architect/advocate within the organisation is like not bothering with having finance people.
  • Not having a disaster recovery policy that integrates into a business continuity policy. DR is just one aspect of business continuity, but if it doesn’t actually slot into the business continuity process smoothly, it’s as likely going to hinder than help the company.
  • Not understanding system dependencies. I’ve been talking about system dependency maps or tables for years. Regardless of what structure you use, the net effect is the same: the only way you can properly protect your business services is to know what IT systems they rely on, and what IT systems those IT systems rely on, and so on, until you’re at the root level.

That’s just a few things, but hopefully you understand where I’m coming from.

I’ve been living and breathing data protection for more than twenty years. It’s not just a job, it’s genuinely something I’m passionate about. It’s something everyone in IT needs to be passionate about, because it can literally make the difference between your company surviving or failing in a disaster situation.

In my book, I cover all sorts of considerations and details from a technical side of the equation, but the technology in any data protection solution is just one aspect of a very multi-faceted approach to ensuring data availability. If you want to take data protection within your business up to the next level – if you want to avoid having the data protection equivalent of a dangerbase in your business – check my book out. (And in the book there’s a lot more detail about integrating into IT governance and business continuity, a thorough coverage of how to work out system dependencies, and all sorts of details around data protection advocates and the groups that they should work with.)

May 232017


A seemingly straight-forward question, what constitutes a successful backup may not engender the same response from everyone you ask. On the surface, you might suggest the answer is simply “a backup that completes without error”, and that’s part of the answer, but it’s not the complete answer.


Instead, I’m going to suggest there’s actually at least ten factors that go into making up a successful backup, and explain why each one of them is important.

The Rules

One – It finishes without a failure

This is the most simple explanation of a successful backup. One that literally finishes successfully. It makes sense, and it should be a given. If a backup fails to transfer the data it is meant to transfer during the process, it’s obviously not successful.

Now, there’s a caveat here, something I need to cover off. Sometimes you might encounter situations where a backup completes successfully  but triggers or produces a spurious error as it finishes. I.e., you’re told it failed, but it actually succeeded. Is that a successful backup? No. Not in a useful way, because it’s encouraging you to ignore errors or demanding manual cross-checking.

Two – Any warnings produced are acceptable

Sometimes warnings will be thrown during a backup. It could be that a file had to be re-read, or a file was opened at the time of backup (e.g., on a Unix/Linux system) and could only be partially read.

Some warnings are acceptable, some aren’t. Some warnings that are acceptable on one system may not be acceptable on another. Take for instance, log files. On a lot of systems, if a log file is being actively written to when the backup is running, it could be that the warning of an incomplete capture of the file is acceptable. If the host is a security logging system and compliance/auditing requirements dictate all security logs are to be recoverable, an open-file warning won’t be acceptable.

Three – The end-state is captured and reported on

I honestly can’t say the number of times over the years I’ve heard of situations where a backup was assumed to have been running successfully, then when a recovery is required there’s a flurry of activity to determine why the recovery can’t work … only to find the backup hadn’t been completing successfully for days, weeks, or even months. I really have dealt with support cases in the past where critical data that had to be recovered was unrecoverable due to a recurring backup failure – and one that had been going on, being reported in logs and completion notifications, day-in, day-out, for months.

So, a successful backup is also a backup here the end-state is captured and reported on. The logical result is that if the backup does fail, someone knows about it and is able to choose an action for it.

When I first started dealing with NetWorker, that meant checking the savegroup completion reports in the GUI. As I learnt more about the importance of automation, and systems scaled (my system administration team had a rule: “if you have to do it more than once, automate it”), I built parsers to automatically interpret savegroup completion results and provide emails that would highlight backup failures.

As an environment scales further, automated parsing needs to scale as well – hence the necessity of products like Data Protection Advisor, where you not only get simple dashboards for overnight success ratios with drill-downs, root cause analysis, and all the way up to SLA adherence reports and beyond.

In short, a backup needs to be reported on to be successful.

Four – The backup method allows for a successful recovery

A backup exists for one reason alone – to allow the retrieval and reconstruction of data in the event of loss or corruption. If the way in which the backup is run doesn’t allow for a successful recovery, then the backup should not be counted as a successful backup, either.

Open files are a good example of this – particularly if we move into the realm of databases. For instance, on a regular Linux filesystem (e.g., XFS or EXT4), it would be perfectly possible to configure a filesystem backup of an Oracle server. No database plugin, no communication with RMAN, just a rolling sweep of the filesystem, writing all content encountered to the backup device(s).

But it wouldn’t be recoverable. It’s a crash-consistent backup, not an application-consistent backup. So, a successful backup must be a backup that can be successfully recovered from, too.

Five – If an off-site/redundant copy is required, it is successfully performed

Ideally, every backup should get a redundant copy – a clone. Practically, this may not always be the case. The business may decide, for instance, that ‘bronze’ tiered backups – say, of dev/test systems, do not require backup replication. Ultimately this becomes a risk decision for the business and so long as the right role(s) have signed off against the risk, and it’s deemed to be a legally acceptable risk, then there may not be copies made of specific types of backups.

But for the vast majority of businesses, there will be backups for which there is a legal/compliance requirement for backup redundancy. As I’ve said before, your backups should not be a single point of failure within your data protection environment.

So, if a backup succeeds but its redundant copy fails, the backup should, to a degree, be considered to have failed. This doesn’t mean you have to necessarily do the backup again, but if redundancy is required, it means you do have to make sure the copy gets made. That then hearkens back to requirement three – the end state has to be captured and reported on. If you’re not capturing/reporting on end-state, it means you won’t be aware if the clone of the backup has succeeded or not.

Six – The backup completes within the required timeframe

You have a flight to catch at 9am. Because of heavy traffic, you don’t arrive at the airport until 1pm. Did you successfully make it to the airport?

It’s the same with backups. If, for compliance reasons you’re required to have backups complete within 8 hours, but they take 16 to run, have they successfully completed? They might exit without an error condition, but if SLAs have been breached, or legal requirements have not been met, it technically doesn’t matter that they finished without error. The time it took them to exit was, in fact, the error condition. Saying it’s a successful backup at this point is sophistry.

Seven – The backup does not prevent the next backup from running

This can happen one of two different ways. The first is actually a special condition of rule six – even if there are no compliance considerations, if a backup meant to run once a day takes longer than 24 hours to complete, then by extension, it’s going to prevent the next backup from running. This becomes a double failure – not only does the next backup run, but the next backup doesn’t run because the earlier backup is blocking it.

The second way is not necessarily related to backup timing – this is where a backup completes, but it leaves system in state that prevents next backup from running. This isn’t necessarily a common thing, but I have seen situations where for whatever reason, the way a backup finished prevented the next backup from running. Again, that becomes a double failure.

Eight – It does not require manual intervention to complete

There’s two effective categories of backups – those that are started automatically, and those that are started manually. A backup may in fact be started manually (e.g., in the case of an ad-hoc backup), but should still be able to complete without manual intervention.

As soon as manual intervention is required in the backup process, there’s a much greater risk of the backup not completing successfully, or within the required time-frame. This is, effectively, about designing the backup environment to reduce risk by eliminating human intervention. Think of it as one step removed from the classic challenge that if your backups are required but don’t start without human intervention, they likely won’t run. (A common problem with ‘strategies’ around laptop/desktop self-backup requirements.)

There can be workarounds for this – for example, if you need to trigger a database dump as part of the backup process (e.g., for a database without a plugin), then it could be a password needs to be entered, and the dump tool only accepts passwords interactively. Rather than having someone actually manually enter the password, the dump command could instead be automated with tools such as Expect.

Nine – It does not unduly impact access to the data it is protecting

(We’re in the home stretch now.)

A backup should be as light-touch as possible. The best example perhaps of a ‘heavy touch’ backup is a cold database backup. That’s where the database is shutdown for the duration of the backup, and it’s a perfect situation of a backup directly impacting/impeding access to the data being protected. Sometimes it’s more subtle though – high performance systems may have limited IO and system resources to handle the steaming of a backup, for instance. If system performance is degraded by the backup, then it should be considered the case the backup is unsuccessful.

I liken this to uptime vs availability. A server might be up, but if the performance of the system is so poor that users consider the service offered by the system, it’s not usable. That’s where, for instance, systems like ProtectPoint can be so important – in high performance systems it’s not just about getting a high speed backup, but limiting the load of the database server during the backup process.

Ten – It is predictably repeatable

Of course, there are ad-hoc backups that might only ever need to be run once, or backups that you may never need to run again (e.g., pre-decommissioning backup).

The vast majority of backups within an environment though will be repeated daily. Ideally, the result of each backup should be predictably repeatable. If the backup succeeds today, and there’s absolutely no changes to the systems or environment, for instance, then it should be reasonable to expect the backup will succeed tomorrow. That doesn’t ameliorate the requirement for end-state capturing and reporting; it does mean though that the backup results shouldn’t effectively be random.

In Summary

It’s easy to understand why the simplest answer (“it completes without error”) can be so easily assumed to be the whole answer to “what constitutes a successful backup?” There’s no doubt it forms part of the answer, but if we think beyond the basics, there are definitely a few other contributing factors to achieving really successful backups.

Consistency, impact, recovery usefulness and timeliness, as well as all the other rules outlined above also come into how we can define a truly successful backup. And remember, it’s not about making more work for us, it’s about preventing future problems.

If you’ve thought the above was useful, I’d suggest you check out my book, Data Protection: Ensuring Data Availability. Available in paperback and Kindle formats.

What to do on world backup day

 Backup theory, Best Practice, Recovery  Comments Off on What to do on world backup day
Mar 302017

World backup day is approaching. (A few years ago now, someone came up with the idea of designating one day of the year to recognise backups.) Funnily enough, I’m not a fan of world backup day, simply because we don’t backup for the sake of backing up, we backup to recover.

Every day should, in fact, be world backup day.

Something that isn’t done enough – isn’t celebrated enough, isn’t tested enough – are recoveries. For many organisations, recovery tests consist of actually doing a recovery when requested, and things like long term retention backups are never tested, and even more rarely recovered from.

bigStock Rescue

So this Friday, March 31, I’d like to suggest you don’t treat as World Backup Day, but World Recovery Test Day. Use the opportunity to run a recovery test within your organisation (following proper processes, of course!) – preferably a recovery that you don’t normally run in terms of day to day operations. People only request file recoveries? Sounds like a good reason to run an Exchange, SQL or Oracle recovery to me. Most recoveries are Exchange mail level recoveries? Excellent, you know they work, let’s run a recovery of a complete filesystem somewhere.

All your recoveries are done within a 30 day period of the backup being taken? That sounds like an excellent idea to do the recovery from an LTR backup written 2+ years ago, too.

Part of running a data protection environment is having routine tests to validate ongoing successful operations, and be able to confidently report back to the business that everything is OK. There’s another, personal and selfish aspect to it, too. It’s one I learnt more than a decade ago when I was still an on-call system administrator: having well-tested recoveries means that you can sleep easily at night, knowing that if the pager or mobile phone does shriek you into blurry-eyed wakefulness at 1am, you can in fact log onto the required server and run the recovery without an issue.

So this World Backup Day, do a recovery test.

The need to have an efficient and effective testing system is something I cover in more detail in Data Protection: Ensuring Data Availability. If you want to know more, feel free to check out the book on Amazon or CRC Press. Remember that it doesn’t matter how good the technology you deploy is if you don’t have the processes and training to use it.

Mar 222017

It’s fair to say I’m a big fan of Queen. They shaped my life – the only band to have even a remotely similar effect on me was ELO. (Yes, I’m an Electric Light Orchestra fan. Seriously, if you haven’t listened to the Eldorado or Time operatic albums in the dark you haven’t lived.)

Queen taught me a lot: the emotional perils of travelling at near-relativistic speeds and returning home, that maybe immorality isn’t what fantasy makes it seem like, and, amongst a great many other things, that you need to take a big leap from time to time to avoid getting stuck in a rut.

But you can find more prosaic meanings in Queen, too, if you want to. One of them deals with long term retention. We get that lesson from one of the choruses for Too much love will kill you:

Too much love will kill you,

Just as sure as none at all

Hang on, you may be asking, what’s that got to do with long term retention?

Replace ‘love’ with ‘data’ and you’ve got it.


I’m a fan of the saying:

It’s always better to backup a bit too much than not quite enough.

In fact, it’s something I mention again in my book, Data Protection: Ensuring Data Availability. Perhaps more than once. (I’ve mentioned my book before, right? If you like my blog or want to know more about data protection, you should buy the book. I highly recommend it…)

That’s something that works quite succinctly for what I’d call operational backups: your short term retention policies. They’re going to be the backups where you’re keeping say, weekly fulls and daily incrementals for (typically) between 4-6 weeks for most businesses. For those sorts of backups, you definitely want to err on the side of caution when choosing what to backup.

Now, that’s not to say you don’t err on the side of caution when you’re thinking about long term retention, but caution definitely becomes a double-edged sword: the caution of making sure you’re backing up what you are required to, but also the caution of making sure you’re not wasting money.

Let’s start with a simpler example: do you backup your non-production systems? For a lot of environments, the answer is ‘yes’ (and that’s good). So if the answer is ‘yes’, let me ask the follow-up: do you apply the same retention policies for your non-production backups as you do for your production backups? And if the answer to that is ‘yes’, then my final question is this: why? Specifically, are you doing it because it’s (a) habit, (b) what you inherited, or (c) because there’s a mandated and sensible reason for doing so? My guess is that in 90% of scenarios, the answer is (a) or (b), not (c). That’s OK, you’re in the same boat as the rest of the industry.

Let’s say you have 10TB of production data, and 5TB of non-production data. Not worrying about deduplication for the moment, if you’re doing weekly fulls and daily incrementals, with a 3.5% daily change (because I want to hurt my brain with mathematics tonight – trust me, I still count on my fingers, and 3.5 on your fingers is hard) with a 5 week retention period then you’re generating:

  • 5 x (10+5) TB in full backups
  • 30 x ((10+5) x 0.035) TB in incremental backups

That’s 75 TB (full) + 15.75 TB (incr) of backups generated for 15TB of data over a 5 week period. Yes, we’ll use deduplication because it’s so popular with NetWorker and shrink that number quite nicely thank-you, but 90.75 TB of logical backups over 5 weeks for 15TB of data is the end number we get at.

But do you really need to generate that many backups? Do you really need to keep five weeks worth of non-production backups? What if instead you’re generating:

  • 5 x 10 TB in full production backups
  • 2 x 5 TB in full non-prod backups
  • 30 x 10 x 0.035 TB in incremental production backups
  • 12 x 5 x 0.035 TB in incremental non-prod backups

That becomes 50TB (full prod) + 10 TB (full non-prod) + 10.5 TB (incr prod) + 2.1 TB (incr non-prod) over any 5 week period, or 72.6 TB instead of 90.75 TB – a saving of 20%.

(If you’re still pushing your short-term operational backups to tape, your skin is probably crawling at the above suggestion: “I’ll need more tape drives!” Well, yes you would, because tape is inflexible. So using backup to disk means you can start saving on media, because you don’t need to make sure you have enough tape drives for every potential pool that would be written to at any given time.)

A 20% saving on operational backups for 15TB of data might not sound like a lot, but now let’s start thinking about long term retention (LTR).

There’s two particular ways we see long term retention data handled: monthlies kept for the entire LTR period, or keeping monthlies for 12-13 months and just keeping end-of-calendar-year (EoCY) + end-of-financial-year (EoFY) for the LTR period. I’d suggest that the knee-jerk reaction by many businesses is to keep monthlies for the entire time. That doesn’t necessarily have to be the case though – and this is the sort of thing that should also be investigated: do you legally need to keep all your monthly backups for your LTR, or do you just need to keep those EoCY and EoFY backups for that period? That alone might be a huge saving.

Let’s assume though that you’re keeping those monthly backups for your entire LTR period. We’ll assume you’re also not in engineering, where you need to keep records for the lifetime of the product, or biosciences, where you need to keep records for the lifetime of the patient (and longer), and just stick with the tried-and-trusted 7 year retention period seen almost everywhere.

For LTR, we also have to consider yearly growth. I’m going to cheat and assume 10% year on year growth, but the growth only kicks in once a year. (In reality for many businesses it’s more like a true compound annual growth, ammortized monthly, which does change things around a bit.)

So let’s go back to those numbers. We’ve already established what we need for operational backups, but what do we need for LTR?

If we’re not differentiating between prod and non-prod (and believe me, that’s common for LTR), then our numbers look like this:

  • Year 1: 12 x 15 TB
  • Year 2: 12 x 16.5 TB
  • Year 3: 12 x 18.15 TB
  • Year 4: 12 x 19.965 TB
  • Year 5: 12 x 21.9615 TB
  • Year 6: 12 x 24.15765 TB
  • Year 7: 12 x 26.573415 TB

Total? 1,707.69 TB of LTR for a 7 year period. (And even as data ages out, that will still grow as the YoY growth continues.)

But again, do you need to keep non-prod backups for LTR? What if we didn’t – what would those numbers look like?

  • Year 1: 12 x 10 TB
  • Year 2: 12 x 11 TB
  • Year 3: 12 x 12.1 TB
  • Year 4: 12 x 13.31 TB
  • Year 5: 12 x 14.641 TB
  • Year 6: 12 x 16.1051 TB
  • Year 7: 12  17.71561 TB

That comes down to just 1,138 TB over 7 years – a 33% saving in LTR storage.

We got that saving just by looking at splitting off non-production data from production data for our retention policies. What if we were to do more? Do you really need to keep all of your production data for an entire 7-year LTR period? If we’re talking a typical organisation looking at 7 year retention periods, we’re usually only talking about critical systems that face compliance requirements – maybe some financial databases, one section of a fileserver, and email. What if that was just 1 TB of the production data? (I’d suggest that for many companies, a guesstimate of 10% of production data being the data required – legally required – for compliance retention is pretty accurate.)

Well then your LTR data requirements would be just 113.85 TB over 7 years, and that’s a saving of 93% of LTR storage requirements (pre-deduplication) over a 7 year period for an initial 15 TB of data.

I’m all for backing up a little bit too much than not enough, but once we start looking at LTR, we have to take that adage with a grain of salt. (I’ll suggest that in my experience, it’s something that locks a lot of companies into using tape for LTR.)

Too much data will kill you,

Just as sure as none at all

That’s the lesson we get from Queen for LTR.

…Now if you’ll excuse me, now I’ve talked a bit about Queen, I need to go and listen to their greatest song of all time, March of the Black Queen.

Mar 102017

In 2008 I published “Enterprise Systems Backup and Recovery: A corporate insurance policy”. It dealt pretty much exclusively, as you might imagine, with backup and recovery concepts. Other activities like snapshots, replication, etc., were outside the scope of the book. Snapshots, as I recall, were mainly covered as an appendix item.

Fast forward almost a decade and there’s a new book on the marketplace, “Data Protection: Ensuring Data Availability” by yours truly, and it is not just focused on backup and recovery. There’s snapshots, replication, continuous data protection, archive, etc., all covered. Any reader of my blogs will know though that I don’t just think of the technology: there’s the business aspects to it as well, the process, training and people side of the equation. There’s two other titles I bandied with: “Backup is dead, long live backup”, and “Icarus Fell: Understanding risk in the modern IT environment”.

You might be wondering why in 2017 there’s a need for a book dedicated to data protection.

Puzzle Pieces

We’ve come a long way in data protection, but we’re now actually teetering on an interesting precipice, one which we need to understand and manage very carefully. In fact, one which has resulted in significant data loss situations for many companies world-wide.

IT has shifted from the datacentre to – well, anywhere. There’s still a strong datacentre focus. The estimates from various industry analysts is that around 70% of IT infrastructure spend is still based in the datacentre. That number is shrinking, but IT infrastructure is not; instead, it’s morphing. ‘Shadow IT’ is becoming more popular – business units going off on their own and deploying systems without necessarily talking to their IT departments. To be fair, Shadow IT always existed – it’s just back in the 90s and early 00s, it required the business units to actually buy the equipment. Now they just need to provide a credit card to a cloud provider.

Businesses are also starting to divest themselves of IT activities that aren’t their “bread and butter”, so to speak. A financial company or a hospital doesn’t make money from running an email system, so they outsource that email – and increasingly it’s to someone like Microsoft via Office 365.

Simply put, IT has become significantly more commoditised, accessible and abstracted over the past decade. All of this is good for the business, except it brings the business closer to that precipice I mentioned before.

What precipice? Risk. We’re going from datacentres where we don’t lose data because we’re deploying on highly resilient systems with 5 x 9s availability, robust layers of data protection and formal processes into situations where data is pushed out of the datacentre, out of the protection of the business. The old adage, “never assume, you make an ass out of u and me” is finding new ground in this modern approach to IT. Business groups trying to do a little data analytics rent a database at an hourly rate from a cloud provider and find good results, so they start using it more and more. But don’t think about data protection because they’ve never had to before. That led to things like the devastating data losses encountered by MongoDB users. Startups with higher level IT ideas are offering services without any understanding of the fundamental requirements of infrastructure protection. Businesses daily are finding that because they’ve spread their data over such a broad area, the attack vector has staggeringly increased, and hackers are turning that into a profitable business.

So returning to one of my first comments … you might be wondering why in 2017 there’s a need for a book dedicated to data protection? It’s simple: the requirement for data protection never goes away, regardless of whose infrastructure you’re using, or where your data resides. IT is standing on the brink of a significant evolution in how services are offered and consumed, and in so many situations it’s like a return to the early 90s. “Oh yeah, we bought a new server for a new project, it’s gone live. Does anyone know how we back it up?” It’s a new generation of IT and business users that need to be educated about data protection. Business is also demanding a return on investment for as much IT spend as possible, and that means data protection also needs to evolve to offer something back to the business other than saving you when the chips are down.

That’s why I’ve got a new book out about data protection: because the problem has not gone away. IT has evolved, but so has risk. That means data protection technology, data protection processes, and the way that we talk about data protection has to evolve as well. Otherwise we, as IT professionals, have failed in our professional duties.

I’m a passionate believer that we can always find a way to protect data. We think of it as business data, but it’s also user data. Customer data. If you work in IT for an airline it’s not just a flight bookings database you’re protecting, but the travel plans, the holiday plans, the emergency trips to sick relatives or getting to a meeting on time that you’re protecting, too. If you work in IT at a university, you’re not just protecting details that can be used for student billing, but also the future hopes and dreams of every student to pass through.

Let’s be passionate about data protection together. Let’s have that conversation with the business and help them understand how data protection doesn’t go away just because infrastructure it evolving. Let’s help the business understand that data protection isn’t a budget sink-hole, but it can improve processes and deliver real returns to the business. Let’s make sure that data, no matter where it is, is adequately protected and we can avoid that precipice.

“Data Protection: Ensuring Data Availability” is available now from the a variety of sellers, including my publisher and Amazon. Come on a journey with me and discover why backup is dead, long live backup.

Build vs Buy

 Architecture, Backup theory, Best Practice  Comments Off on Build vs Buy
Feb 182017

Converged, and even more so, hyperconverged computing, is all premised around the notion of build vs buy. Are you better off having your IT staff build your infrastructure from the ground up, managing it in silos of teams, or are you do you want to buy tightly integrated kit, land it on the floor and start using it immediately?

Dell-EMC’s team use the analogy – do you build your car, or do you buy it? I think this is a good analogy: it speaks to how the vast majority of car users consume vehicle technology. They buy a complete, engineered car as a package, and drive it off the car sales lot complete. Sure, there’s tinkerers who might like to build a car from scratch, but they’re not the average consumer. For me it’s a bit like personal computing – I gave up years ago wanting to build my own computers. I’m not interested in buying CPUs, RAM, motherboards, power supplies, etc., dealing with the landmines of compatibility, drivers and physical installation before I can get a usable piece of equipment.

This is where many people believe IT is moving, and there’s some common sense in it – it’s about time to usefulness.

A question I’m periodically posed is – what has backup got to do with the build vs buy aspect of hyperconverged? For one, it’s not just backup – it’s data protection – but secondly, it has everything to do with hyperconverged.

If we return to that build vs buy example of – would you build a car or buy a car, let me ask a question of you as a car consumer – a buyer rather than a builder of a car. Would you get airbags included, or would you search around for third party airbags?


To be honest, I’m not aware of anyone who buys a car, drives it off the lot, and starts thinking, “Do I go to Airbags R Us, or Art’s Airbag Emporium to get my protection?”

That’s because the airbags come built-in.

For me at least, that’s the crux of the matter in the converged and hyper-converged market. Do you want third party airbags that you have to install and configure yourself, and hope they work with that integrated solution you’ve got bought, or do you want airbags included and installed as part of the purchase?

You buy a hyperconverged solution because you want integrated virtualisation, integrated storage, integrated configuration, integrated management, integrated compute, integrated networking. Why wouldn’t you also want integrated data protection? Integrated data protection that’s baked into the service catalogue and part of the kit as it lands on your floor. If it’s about time to usefulness it doesn’t stop at the primary data copy – it should also include the protection copies, too.

Airbags shouldn’t be treated as optional, after-market extras, and neither should data protection.

Jan 242017

In 2013 I undertook the endeavour to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand it based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was an increasingly common scenario, it was (for the most part) mainly used as a staging activity (“disk to disk to tape”), and backup to disk use was either dumb filesystems or Virtual Tape Libraries (VTL).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, core tenets of Cloud computing that made it so popular (e.g., agility and scaleability) have been well and truly adopted as essential tenets of the modern datacentre, as well. Indeed, for on-premises IT to compete against Cloud, on-premises IT has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to their businesses.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book the pinnacle of storage performance was the 15,000 RPM drive, and flash memory storage was something you (primarily) used in digital cameras only, with storage capacities measured in the hundreds of megabytes more than gigabytes (or now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendency – with virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as seperate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process and needed things done faster, cheaper, more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and for many businesses, profit too. Flash systems are now offering significantly more IOPs than a traditional array could – Dell EMC for instance can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPs. To achieve ten million IOPs on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability is born from. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for a good reason: you must have a grasp on the other bits before you can start considering everything else, otherwise you’re just doing point-solutions, and eventually just doing point-solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability is released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

Nov 302016

Folks, it’s that time of the year again! Each year I run a survey to gauge NetWorker usage patterns – how many clients you’ve got, what plugins you’re using, whether you’re using deduplication, and a plethora of other questions. The survey runs from December 1 (ish) through to January 31 the next year. (This year I’m kicking it off on November 30, just because I have time.)

Take the survey!

That gets assembled into a report in February of the following year, reporting trends across the various years the NetWorker survey has been conducted. If you’d like to see what the reports look like, you can view last year’s report here.

You can fill out the survey anonymously if you’d like, but if you submit your email address at the end you’ll be in the running for a copy of my upcoming book, Data Protection: Ensuring Data Availability, due out February 2017. (Last year’s winner hasn’t been forgotten – the book just got delayed.) If you submit your email address, it will not be used for any purpose other than to notify you if you’re the winner.

The survey is closed now. Results will be published in February 2017.

My cup runneth over

 Architecture, Backup theory, Best Practice  Comments Off on My cup runneth over
Nov 242016

How do you handle data protection storage capacity?

How do you handle growth – regular or unexpected – in your data protection volumes?

Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!

Is your business reactive or proactive to data protection capacity requirements?


In the land of tape, dealing with capacity growth in data protection was both easy and insidiously obfuscated. Tape capacity management is basically a modern version of Hilbert’s Infinite Hotel Paradox – you sort-of, kind-of never run out of capacity because you always just buy another box of tapes. Problem solved, right? (No, more a case of the can kicked down the road.) Problem “solved” and you’ve got 1,000, 10,000, 50,000 tapes in a multitude of media types that you don’t even have tape drives to read any more.

Yet we like to focus on the real world now and tape isn’t the defacto standard any more for backup systems: it’s disk. Disk gives us great power, but with great power comes great responsibility (sorry, even though I’m not a Spiderman fan, I couldn’t resist. Tape is the opposite: tape gives us no power, and with no power comes no responsibility – yes, I’m also a Kickass fan.)

For businesses that still do disk-to-disk-to-tape, where disk is treated more like a staging area and excess data is written out to tape, the problem is seemingly solved because – you guessed it – you can always just buy another box of tapes and stage more data from disk backup storage to tape storage. Again, that’s kicking the can down the road. I’ve known businesses who have had company-wide data protection policies mandating up to 3 months of online recoverability from disk getting down to two weeks or less of data stored on disk because the data to be protected has continued to grow, no scaling has been done on the storage, and – you guessed it – tape was the interim solution.

Aside: When I first joined my first Unix system administration team in 1996, my team had just recently configuring an interim DNS server which they called tmp, because it was going to be quickly replaced by another server, which for the short term was called nc for new computer. When I left in 2000, tmp and nc were still there; in fact, nnc (yes, new-new-computer) was deployed shortly thereafter to replace nc and eventually, a year or two after I left, tmp was finally decommissioned.

Interim solutions have a tendency to stick. In fact, it’s a common story – capacity problem with data protection so let’s deploy an interim solution and solve it later. Later-later. Much later. Much later-later. Ad-infinitum.

There is, undoubtedly, a growing maturity in handling data protection storage management and capacity planning coming out of the pure disk and disk/cloud storage formats. While this is driven by necessity, it’s also and important demonstration that IT processes need to mature as the business matures as well.

If you’re new to pure disk based, or disk/cloud based data protection storage, you might want to stop and think carefully about your data protection policies and procurement processes/cycles so that you’re able to properly meet the requirements of the business. Here are a few tips I’ve learnt over the years…

80% is the new 100%

This one is easy. Don’t think of 100% capacity as being 100% capacity. Think of 80% as 100%. Why? Because you need runway to either procure storage, migrate data or get formal approval for changes to retention and backup policies. If you wait until you’re at 90, 95 or even 100% capacity, you’ve left your run too late and you’re just asking for many late or sleepless nights managing a challenge that could have been proactively dealt with earlier.

The key to management is measurement

I firmly believe you can’t manage something that has operational capacity restraints (e.g., “if we hit 100% capacity we can’t do more backups”) if you’re not actively measuring it. That doesn’t mean periodically logging into a console or running a “df -h” or whatever the “at a glance” look is for your data protection storage, it means capturing measurement data and having it available in both reports and dashboards so it is instantly visible.

The key to measurement is trending

You can capture all the data in the world and make it available in a dashboard, but if you don’t perform appropriate localised trending against that data to analyse it, you’re making your own good self the bottleneck (and weakest link) in the capacity management equation. You need to have trends produced as part of your reporting processes to understand how capacity is changing over time. These trends should be reflective of your own seasonal data variations or sampled over multiple time periods. Why? Well, if you have disk based data protection storage in your environment and do a linear forecast on capacity utilisation from day one, you’ll likely get a smoothing based on lower figures from earlier in the system lifecycle that could actually obfuscate more recent results. So you want to capture and trend that information for comparison, but you equally want to capture and trend shorter timeframes to ensure you have an understanding of shorter term changes. Trends based on the last six and three months usage profiles can be very useful in identifying what sort of capacity management challenges you’ve got based on short term changes in data usage profiles – a few systems for instance might be considerably spiking in utilisation, and if you’re still comparing against a 3-year timeframe dataset or something along those lines, the more recent profile may not be accurately represented in forecasts.

In short: measuring over multiple periods gives you the best accuracy.

Maximum is the new minimum

Linear forecasts of trending information are good if you’re just slowly, continually increasing your storage requirements. But if you’re either staging data (disk as staging) or running garbage collection (e.g., deduplication), it’s quite possible to get increasing sawtooth cycles in capacity utilisation on your data protection storage. And guess what? It doesn’t matter if your capacity requirements for the average utilisation are met if you’ll grow beyond the capacity requirements of the day before the oldest backups are deleted or garbage collection takes place. So make sure when you’re trending you’re looking at how you meet the changing maximum peaks, not the average sizes.

Know your windows

There’s three types of windows I’m referring to here – change, change freeze, and procurement.

You need to know them all intimately.

You’re at 95% capacity but you anticipated this and additional data protection storage has just arrived in your datacentre’s receiving bay, so you should be right to install it – right? What happens if you then have a week’s wait to have the change board consider your request for an outage – or datacentre access – to install the extra capacity? Will you be able to hold on that long? That’s knowing your change windows.

You know you’re going to run out of capacity in two months time if nothing is done, so you order additional data protection storage and it arrives on December 20. The only problem is a mandatory company change blackout started on December 19 and you literally cannot install anything, until January 20. Do you have enough capacity to survive? That’s knowing your freeze windows.

You know you’re at 80% capacity today and based on the trends you’ll be at 90% capacity in 3 weeks and 95% capacity in 4 weeks. How long does it take to get a purchase order approved? How long does it take the additionally purchased systems to arrive on-site? If it takes you 4 weeks to get purchase approval and another 3 weeks for it to arrive after the purchase order is sent, maybe 70%, not 80%, is your new 100%. That’s knowing your procurement windows.

Final thoughts

I want to stress – this isn’t a doom and gloom article, even if it seems I’m painting a blunt picture. What I’ve described above is expert tips – not from myself, but from my customers, and customers of colleagues and friends, whom I’ve seen manage data protection storage capacity well. If you follow at least the above guidelines, you’re going to have a far more successful – and more relaxed – time of it all.

And maybe you’ll get to spend Thanksgiving, Christmas, Ramadan, Maha Shivaratri, Summer Solstice, Melbourne Cup Day, Labour Day or whatever your local holidays and festivals are with your friends and families, rather than manually managing an otherwise completely manageable situation.

Hey, just as an aside, the NetWorker 2016 Usage Survey is up and running. If you can spare 5 minutes to complete it at the end of this article, that would be greatly appreciated!


Nov 162016

Years ago when NetWorker Management Console was first introduced, Australians (and no doubt people in other countries with a similarly named tax law) found themselves either amused or annoyed having to type commands such as:

# /etc/init.d/gst start

Who would want to start a goods and services tax, after all? In the case of NetWorker, GST didn’t stand for a tax on purchases, but the master control software for NMC.

It’s amusing then to be back in the realm of using an overloaded three letter acronym which for many (in this case) US citizens refers to the tax-man – IRS. In this case though, IRS stands for Isolated Recovery Site.

Our view of ‘disaster recovery’ situations by and large hasn’t changed much over the two decades I’ve been working in IT. While we’ve moved from active/passive datacentres to active/active datacentres as being the norm, the types of situations that might lead to invoking disaster recovery and transitioning services from one location to another have remained largely the same, such as:

  • Site loss
  • Site access loss
  • Catastrophic hardware failure
  • Disaster recovery testing

In fact, they’re pretty much the key four reasons why we need to invoke DR – either granularly or for an entire datacentre.

The concept of an IRS is not to provide assistance in any of the above four situations. (In theory it could be utilised partly for any of the above, in practice that’s what your normal disaster recovery datacentre is about, regardless of whether it’s in an active/active or active/passive configuration with your primary.)


Deploying an IRS solution within your environment is about protecting you from modern threat vectors. It represents a business maturity that accepts any, many or all of the following scenarios:

  • Users not understanding what they’re doing represent a threat vector that can no longer be casually protected against by using anti-virus software and firewalls
  • Administrators can make mistakes – not just little ‘boo-boos’, but catastrophic mistakes
  • On-platform protection should only form part of a holistic data protection environment
  • It is no longer a case of keeping malicious individuals out of your IT infrastructure, but also recognising they may already be inside
  • Protests are no longer confined to letter writing campaigns, boycotts and demonstrations

Before I explain some of those situations, it would be helpful to provide a high level overview of what one kind of IRS layout might look like:

Basic High Level IRS

The key things to understand in an IRS configuration such as the above are:

  • Your tertiary data copy (the IRS copy) is not, in the conventional sense of the word, connected to your network
  • You either use physical network separation (with periodic plugging of cables in) or automated control of network separation, with control accessible only within the IRS bunker
  • The Data Domain in your IRS bunker will optimally be configured with governance and retention lock
  • Your primary backup environment will not be aware of the tertiary Data Domain

IRS is not for traditional Business As Usual (BAU) or disaster recovery. You will still run those standard recovery operations out of your primary and/or disaster recovery sites in the same way as you would normally.

So what are some of the examples where you might resort to an IRS copy?

  • Tired/or disgruntled admin triggers deletion of primary and DR storage, including snapshots
  • Ransomware infects a primary file server, encrypting data and flooding the snapshot pool to the point the system can’t be recovered from
  • Hactivists penetrate the network and prior to deleting production system data, delete backup data.

These aren’t ‘example’ use cases, they’ve happened. In the first two if you’re using off-platform protection, you’re probably safe – but if you’re not, you’ve lost data. In the third example, there have been several examples over the last few years where this penetration has successfully been carried out by hactivists.

Maybe you feel your environment is not of interest from hactivists. If you work in the finance industry, you’re wrong. If you work in government, you’re wrong. OK, maybe you don’t work in either of those areas.

With the increasing availability of tools and broader surface area for malicious individuals or groups to strike with, hactivism isn’t limited to just the ‘conventional’ high profile industry verticals. Maybe you’re a pharmaceutical company that purchased the patent on a cheap drug then enraged communities by increasing prices by 400 times. Maybe you’re a theatre chain showing a movie a certain group has taken significant offence at. Maybe you’re a retail company selling products containing palm oil, or toilet paper not sourced from environmentally sustainable forests. Maybe you’re an energy company. Maybe you’re a company doing a really good job but have a few ex-employees with an axe to grind. If you’ve ever read an online forum thread, you’ll probably recognise that some people are trollish enough to do it just for the fun of it.

Gone are the days where you worried about hactivism if you happened to be running a nuclear enrichment programme.

IRS is about protecting you from those sorts of scenarios. By keeping at least a core of your critical data on a tertiary, locked down Data Domain that’s not accessible via the network, you’re not only leveraging the industry leading Data Invulnerability Architecture (DIA) to ensure what’s written is what’s read, you’re also ensuring that tertiary copy is off platform to the rest of your environment.

And the great thing is, products like NetWorker are basically designed from the ground up to be used in an IRS configuration. NetWorker’s long and rich history of command automation means you can build into that Control & Verification service area whatever you need to take read-write snapshots of replicated data, DR an isolated NetWorker server and perform automated test recoveries.

One last point – something I’ve discussed with a few customers recently – you might be having an ahah! moment and point to a box of tapes somewhere and say “There’s my IRS solution!” I can answer that with one simple question: If you went to your business and said you could scrap a disaster recovery site and instead rely on tape to perform all the required recoveries, what would they say? Tape isn’t an IRS option except perhaps for the most lackadaisical data protection environments. (I’d suggest it’d even be an Icarus IRS solution – trusting that wax won’t melt when you fly your business too close to the sun.)

There’s some coverage of IRS in my upcoming book, Data Protection: Ensuring Data Availability, and of course, you can read up on Dell EMC’s IRS offerings too. A good starting point is this solution overview. If you’re in IT – Infrastructure or Security – have a chat to your risk officers and ask them what they think about those sorts of challenges outlined above. Chances are they’re already worried about them, and you could very well be bringing them the solution that’ll let everyone sleep easily at night. You might one day find yourself saying “I love the IRS”.