May 23 2017


“What constitutes a successful backup?” is a seemingly straightforward question, but it may not draw the same response from everyone you ask. On the surface, you might suggest the answer is simply “a backup that completes without error”, and that’s part of the answer, but it’s not the complete answer.


Instead, I’m going to suggest there are at least ten factors that go into making up a successful backup, and explain why each one of them is important.

The Rules

One – It finishes without a failure

This is the simplest explanation of a successful backup: one that literally finishes successfully. It makes sense, and it should be a given. If a backup fails to transfer the data it is meant to transfer during the process, it’s obviously not successful.

Now, there’s a caveat here that I need to cover. Sometimes you might encounter situations where a backup completes successfully but triggers or produces a spurious error as it finishes. I.e., you’re told it failed, but it actually succeeded. Is that a successful backup? No. Not in a useful way, because it either encourages you to ignore errors or demands manual cross-checking.

Two – Any warnings produced are acceptable

Sometimes warnings will be thrown during a backup. It could be that a file had to be re-read, or a file was opened at the time of backup (e.g., on a Unix/Linux system) and could only be partially read.

Some warnings are acceptable, some aren’t. Some warnings that are acceptable on one system may not be acceptable on another. Take for instance, log files. On a lot of systems, if a log file is being actively written to when the backup is running, it could be that the warning of an incomplete capture of the file is acceptable. If the host is a security logging system and compliance/auditing requirements dictate all security logs are to be recoverable, an open-file warning won’t be acceptable.
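One way to picture this rule is as a per-host policy describing which warnings can be tolerated. The sketch below is purely illustrative: the warning labels, host classes and function name are invented for the example and are not NetWorker identifiers.

```python
# Hypothetical warning labels and host classes, invented for illustration.
ACCEPTABLE_WARNINGS = {
    "general": {"file_reread", "open_file_partial"},
    # A security logging host must capture every log in full,
    # so open-file warnings are not tolerated there.
    "security_logging": {"file_reread"},
}


def unacceptable_warnings(host_class, warnings):
    """Return the warnings this class of host is not allowed to produce."""
    allowed = ACCEPTABLE_WARNINGS.get(host_class, set())
    return sorted(set(warnings) - allowed)


seen = ["open_file_partial", "file_reread"]
print(unacceptable_warnings("general", seen))
print(unacceptable_warnings("security_logging", seen))
```

An unknown host class tolerates nothing here, which errs on the side of someone having to make a decision about it.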

Three – The end-state is captured and reported on

I honestly can’t say the number of times over the years I’ve heard of situations where a backup was assumed to have been running successfully, then when a recovery is required there’s a flurry of activity to determine why the recovery can’t work … only to find the backup hadn’t been completing successfully for days, weeks, or even months. I really have dealt with support cases in the past where critical data that had to be recovered was unrecoverable due to a recurring backup failure – and one that had been going on, being reported in logs and completion notifications, day-in, day-out, for months.

So, a successful backup is also a backup where the end-state is captured and reported on. The logical result is that if the backup does fail, someone knows about it and is able to choose an action for it.

When I first started dealing with NetWorker, that meant checking the savegroup completion reports in the GUI. As I learnt more about the importance of automation, and systems scaled (my system administration team had a rule: “if you have to do it more than once, automate it”), I built parsers to automatically interpret savegroup completion results and provide emails that would highlight backup failures.
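To give a feel for that kind of parser, here is a minimal sketch. It assumes a made-up, simplified report format of one "client: status" pair per line; real savegroup completion reports are laid out differently, and the function names are mine, so treat this as an outline of the idea rather than something to drop in.

```python
import re

# Assumed, simplified report format: one "client: status" pair per line.
# Real savegroup completion reports are laid out differently.
LINE = re.compile(r"^\s*(?P<client>[\w.-]+)\s*:\s*(?P<status>\w+)\s*$")


def parse_report(report_text):
    """Return a {client: status} dict for every line that matches LINE."""
    results = {}
    for line in report_text.splitlines():
        match = LINE.match(line)
        if match:
            results[match.group("client")] = match.group("status").lower()
    return results


def email_subject(report_text):
    """Build a subject line that makes failures hard to miss."""
    results = parse_report(report_text)
    if not results:
        # No parsable results is itself a problem worth reporting.
        return "BACKUP STATUS UNKNOWN: no results parsed"
    failed = sorted(c for c, s in results.items() if s != "succeeded")
    if failed:
        return "BACKUP FAILURES: " + ", ".join(failed)
    return "All backups succeeded"
```

Note the "status unknown" branch: an empty or unreadable report should never be quietly treated as success.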

As an environment scales further, automated parsing needs to scale as well – hence the necessity of products like Data Protection Advisor, where you get not only simple dashboards for overnight success ratios with drill-downs, but also root cause analysis and everything up to SLA adherence reports and beyond.

In short, a backup needs to be reported on to be successful.

Four – The backup method allows for a successful recovery

A backup exists for one reason alone – to allow the retrieval and reconstruction of data in the event of loss or corruption. If the way in which the backup is run doesn’t allow for a successful recovery, then the backup should not be counted as a successful backup, either.

Open files are a good example of this – particularly if we move into the realm of databases. For instance, on a regular Linux filesystem (e.g., XFS or EXT4), it would be perfectly possible to configure a filesystem backup of an Oracle server. No database plugin, no communication with RMAN, just a rolling sweep of the filesystem, writing all content encountered to the backup device(s).

But it wouldn’t be recoverable. It’s a crash-consistent backup, not an application-consistent backup. So, a successful backup must be a backup that can be successfully recovered from, too.

Five – If an off-site/redundant copy is required, it is successfully performed

Ideally, every backup should get a redundant copy – a clone. Practically, this may not always be the case. The business may decide, for instance, that ‘bronze’ tiered backups – say, of dev/test systems, do not require backup replication. Ultimately this becomes a risk decision for the business and so long as the right role(s) have signed off against the risk, and it’s deemed to be a legally acceptable risk, then there may not be copies made of specific types of backups.

But for the vast majority of businesses, there will be backups for which there is a legal/compliance requirement for backup redundancy. As I’ve said before, your backups should not be a single point of failure within your data protection environment.

So, if a backup succeeds but its redundant copy fails, the backup should, to a degree, be considered to have failed. This doesn’t mean you have to necessarily do the backup again, but if redundancy is required, it means you do have to make sure the copy gets made. That then hearkens back to requirement three – the end state has to be captured and reported on. If you’re not capturing/reporting on end-state, it means you won’t be aware if the clone of the backup has succeeded or not.
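As a rough sketch of that logic, the backup's overall state can be derived from both the backup result and the clone result. The status names and the function are hypothetical, chosen only to illustrate the rule.

```python
def effective_status(backup_ok, clone_required, clone_ok):
    """Combine backup and clone results into one overall state.

    The three status strings are invented for this example.
    """
    if not backup_ok:
        return "failed"
    if clone_required and not clone_ok:
        # The data was captured, but the redundancy requirement is unmet:
        # the copy still has to be made, even if the backup is not re-run.
        return "copy-pending"
    return "successful"


print(effective_status(backup_ok=True, clone_required=True, clone_ok=False))
```

A ‘bronze’ backup with no clone requirement would simply pass `clone_required=False` and be judged on the backup alone.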

Six – The backup completes within the required timeframe

You have a flight to catch at 9am. Because of heavy traffic, you don’t arrive at the airport until 1pm. Did you successfully make it to the airport?

It’s the same with backups. If, for compliance reasons you’re required to have backups complete within 8 hours, but they take 16 to run, have they successfully completed? They might exit without an error condition, but if SLAs have been breached, or legal requirements have not been met, it technically doesn’t matter that they finished without error. The time it took them to exit was, in fact, the error condition. Saying it’s a successful backup at this point is sophistry.
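Expressed as a check, the 8-hour example looks something like the sketch below. The function name and the timestamps are made up for illustration; only the 8 and 16 hour figures come from the example above.

```python
from datetime import datetime, timedelta


def within_window(started, finished, allowed):
    """True only if the backup finished inside its permitted duration."""
    return finished - started <= allowed


started = datetime(2017, 5, 22, 20, 0)       # illustrative start time
finished = started + timedelta(hours=16)      # the backup ran for 16 hours
print(within_window(started, finished, timedelta(hours=8)))  # False
```

A clean exit code would not change that result: the elapsed time is evaluated on its own.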

Seven – The backup does not prevent the next backup from running

This can happen one of two different ways. The first is actually a special condition of rule six – even if there are no compliance considerations, if a backup meant to run once a day takes longer than 24 hours to complete, then by extension, it’s going to prevent the next backup from running. This becomes a double failure – not only does the first backup overrun its window, but the next backup doesn’t run because the earlier backup is blocking it.

The second way is not necessarily related to backup timing – this is where a backup completes, but leaves the system in a state that prevents the next backup from running. This isn’t necessarily a common thing, but I have seen situations where, for whatever reason, the way a backup finished prevented the next backup from running. Again, that becomes a double failure.
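The first, timing-related case is easy to detect after the fact. The sketch below lists the scheduled start times a long-running backup would have blocked; it assumes a fixed interval between runs, and the function name and dates are invented for the example.

```python
from datetime import datetime, timedelta


def missed_starts(started, finished, interval):
    """List the scheduled start times that fell while this run was still going."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    missed = []
    due = started + interval
    while due < finished:
        missed.append(due)
        due += interval
    return missed


started = datetime(2017, 5, 22, 20, 0)
finished = started + timedelta(hours=29)      # a daily backup that ran 29 hours
print(missed_starts(started, finished, timedelta(days=1)))
```

An empty list means the run left room for its successor; anything else is the double failure described above.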

Eight – It does not require manual intervention to complete

There are two effective categories of backups – those that are started automatically, and those that are started manually. A backup may in fact be started manually (e.g., in the case of an ad-hoc backup), but should still be able to complete without manual intervention.

As soon as manual intervention is required in the backup process, there’s a much greater risk of the backup not completing successfully, or within the required time-frame. This is, effectively, about designing the backup environment to reduce risk by eliminating human intervention. Think of it as one step removed from the classic challenge that if your backups are required but don’t start without human intervention, they likely won’t run. (A common problem with ‘strategies’ around laptop/desktop self-backup requirements.)

There can be workarounds for this – for example, if you need to trigger a database dump as part of the backup process (e.g., for a database without a plugin), then it could be a password needs to be entered, and the dump tool only accepts passwords interactively. Rather than having someone actually manually enter the password, the dump command could instead be automated with tools such as Expect.

Nine – It does not unduly impact access to the data it is protecting

(We’re in the home stretch now.)

A backup should be as light-touch as possible. The best example perhaps of a ‘heavy touch’ backup is a cold database backup. That’s where the database is shut down for the duration of the backup, and it’s a perfect example of a backup directly impacting/impeding access to the data being protected. Sometimes it’s more subtle though – high performance systems may have limited IO and system resources to handle the streaming of a backup, for instance. If system performance is degraded by the backup, then the backup should be considered unsuccessful.

I liken this to uptime vs availability. A server might be up, but if the performance of the system is so poor that users consider the service it offers unusable, it’s not really available. That’s where, for instance, systems like ProtectPoint can be so important – in high performance systems it’s not just about getting a high speed backup, but limiting the load on the database server during the backup process.

Ten – It is predictably repeatable

Of course, there are ad-hoc backups that might only ever need to be run once, or backups that you may never need to run again (e.g., pre-decommissioning backup).

The vast majority of backups within an environment though will be repeated daily. Ideally, the result of each backup should be predictably repeatable. If the backup succeeds today, and there are absolutely no changes to the systems or environment, for instance, then it should be reasonable to expect the backup will succeed tomorrow. That doesn’t remove the requirement for end-state capturing and reporting; it does mean though that the backup results shouldn’t effectively be random.

In Summary

It’s easy to understand why the simplest answer (“it completes without error”) can be so easily assumed to be the whole answer to “what constitutes a successful backup?” There’s no doubt it forms part of the answer, but if we think beyond the basics, there are definitely a few other contributing factors to achieving really successful backups.

Consistency, impact, recovery usefulness and timeliness, as well as all the other rules outlined above also come into how we can define a truly successful backup. And remember, it’s not about making more work for us, it’s about preventing future problems.

If you’ve thought the above was useful, I’d suggest you check out my book, Data Protection: Ensuring Data Availability. Available in paperback and Kindle formats.

Jan 11 2017

There are currently a significant number of vulnerable MongoDB databases which are being attacked by ransomware attackers, and even though the attacks are ongoing, it’s worth taking a moment or two to reflect on some key lessons that can be drawn from it.

If you’ve not heard of it, you may want to check out some of the details linked to above. The short summary though is that MongoDB’s default deployment model has been a rather insecure one, and it’s turned out there’s a lot of unsecured public-facing databases out there. A lot of them have been hit by hackers recently, with the contents of the databases deleted and the owners being told to pay a ransom to get their data back. As to whether that will get them their data back is of course, another issue.


The first lesson of course is that data protection is not a single topic. More so than a lot of other data loss situations, the MongoDB scenario points to the simple, root lesson for any IT environment: data protection is also a data security factor.

For the most part, when I talk about Data Protection I’m referring to storage protection – backup and recovery, snapshots, replication, continuous data protection, and so on. That’s the focus of my next book, as you might imagine. But a sister process in data protection has been and will always be data security. So in the first instance in the MongoDB attacks, we’re seeing the incoming threat vector entirely from the simple scenario of unsecured systems. A lackadaisical approach to security is exactly what’s happened – for developers and deployers alike – in the MongoDB space, and the result to date is estimated to be around 93TB of data wiped. That number will only go up.

The next lesson though is that backups are still needed. In The MongoDB attacks: 93 terabytes of data wiped out (linked again from above), Dissent writes that of 118 victims analysed:

Only 13 report that they had recently backed up the now-wiped database; the rest reported no recent backups

That number is awful. Just over 11% of impacted sites had recent backups. That’s not data protection, that’s data recklessness. (And as the report mentions, 73% of the databases were flagged as being production.) In one instance:

A French healthcare research entity had its database with cancer research wiped out. They reported no recent backup.

That’s another lesson there: data protection isn’t just about bits and bytes, it’s about people’s lives. If we maintain data, we have an ethical obligation to protect it. What if that cancer data above held some clue, some key, to saving someone’s life? Data loss isn’t just data loss: it can lead to loss of money, loss of livelihood, or perhaps even loss of life.

Those details are from a sample of 118 sourced from a broader category of 27,000 hit systems.
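For anyone who wants to check the arithmetic behind that “just over 11%”, it comes from the 13 sites with recent backups out of the sample of 118. The function name below is mine; the two figures are the ones quoted above.

```python
def share_with_recent_backups(with_backups, sample_size):
    """Fraction of sampled victims that reported a recent backup."""
    return with_backups / sample_size


share = share_with_recent_backups(13, 118)
print(f"{share:.1%}")  # 11.0%
```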

So the next lesson is that even now, in 2017, we’re still having to talk about backup as if it’s a new thing. During the late 90s I thought there was a light at the end of the tunnel for discussions about “do I need backup?”. I’ve long since resigned myself to the fact I’ll likely still be having those conversations up until the day I retire, but it’s a chilling reminder of the ease with which systems can now be deployed without adequate protection.

One of the common reasons you’ll hear for “we can’t back this up”, particularly with larger databases, is the time taken to complete a backup. That’s something Dell EMC has been focused on for a while now. There’s storage integrated data protection via ProtectPoint, and more recently, there’s BoostFS for Data Domain, giving systems distributed segment processing directly on the database server for high speed deduplicated backups. (And yes, MongoDB was one of the systems in mind when BoostFS was developed.) If you’ve not heard of BoostFS yet, it was included in DDOS 6, released last year.

It’s not just backup though – for systems with higher criticality there should be multi-layered protection strategies: backups will give you potentially longer term retention, and off-platform protection, but if you need really fast recovery times with very low RPOs and RTOs, your system will likely need replication and snapshots as well. Data protection isn’t the “one size fits all” scenario that some might try to preach; it’s multi-layered and it can encompass a broad range of technology. (And if the data is super business critical you might even want to go to the next level and add IRS (Isolated Recovery Solution) protection for it, protecting yourself not only from conventional data loss, but also from situations where your business is hacked.)

The fallout and the data loss from the MongoDB attacks will undoubtedly continue for some time. If one thing comes out of it, I’m hoping it’ll be a stronger understanding from businesses in 2017 that data protection is still a very real topic.


A speculative lesson: what percentage of these MongoDB deployments fall under the banner of ‘Shadow IT’, i.e., systems deployed by developers, other business groups and so on within organisations, rather than by IT? Does this also serve as a reminder of the risks that can be introduced when non-IT groups deploy IT systems without appropriate processes and rigour? We may never know the percentage breakdown between IT-led deployments and Shadow IT-led deployments, but it’s certainly food for thought.

May 11 2016

Backing up data from an NFS mount-point is not ideal, but sometimes we don’t have a choice.


There are a few reasons you might end up in this situation – you might need to back up data on a particularly old system that no longer has a NetWorker client available (or perhaps never did), or you might need to back up a consumer-grade NAS that doesn’t support NDMP.

In this case, it’s the latter I’m doing, having rejigged my home test lab. Having real data to test with is always good, and rather than using my filesystem generator tool I decided to back up my Synology NAS over NFS, with the fileshares directly mounted on the backup server. A backup is all well and good, but being able to recover the data is always important. While I’m not worried about ACLs/etc, I did want to know I was successfully backing up the data, so I ran a recovery test and was reminded of an old chestnut in how recoveries work.

[root@orilla Documents]# recover -s orilla
4181:recover: Path /synology/pmdg/Documents is within othalla:/volume1/pmdg
53362:recover: Cannot start session with server orilla: Client '' is not properly configured on the NetWorker Server or ''(if not a virtual host) is not in the aliases list for client ''.
88866:nsrd: Client '' is not properly configured on the NetWorker Server
or ''(if not a virtual host) is not in the aliases list for client ''.

Basically what the recovery error is saying is that NetWorker has detected the path we’re sitting on/trying to recover from actually resides on a different host, and that host doesn’t appear to be a valid NetWorker client. Luckily, there’s a simple solution. (While the best solution might be a budget request with the home change board to buy a small Unity system, I’d just spent my remaining budget on home lab server upgrades, so I felt it best not to file that request.)
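If you want to see for yourself which host a path really lives on, the mount table tells you. The sketch below takes text in the style of /proc/mounts and returns the NFS source for a path; it is only an illustration of the idea, not how NetWorker does its own detection, and the /synology/pmdg mount point in the sample is an assumption based on the session above.

```python
def nfs_source(path, mounts_text):
    """Return 'host:/export' if path sits under an NFS mount, else None.

    mounts_text uses the /proc/mounts layout:
    device, mount point, filesystem type, options, ...
    """
    best = None  # (device, mount_point) of the longest matching NFS mount
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[2].startswith("nfs"):
            continue
        device, mount_point = fields[0], fields[1].rstrip("/")
        if path == mount_point or path.startswith(mount_point + "/"):
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point)
    return best[0] if best else None


# Sample mount table; the mount point is assumed for the example.
sample = (
    "/dev/sda1 / ext4 rw 0 0\n"
    "othalla:/volume1/pmdg /synology/pmdg nfs rw 0 0\n"
)
print(nfs_source("/synology/pmdg/Documents", sample))
```

On a live Linux host you could feed it the contents of /proc/mounts instead of the sample string.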

In this case the NFS mount was on the NetWorker server itself, so all I had to do was to tell NetWorker I wanted to recover from the NetWorker client:

[root@orilla Documents]# recover -s orilla -c orilla
Current working directory is /synology/pmdg/Documents/
recover> add "Stop, Collaborate and Listen.pdf"
1 file(s) marked for recovery
recover> relocate /tmp
recover> recover
Recovering 1 file from /synology/pmdg/Documents/ into /tmp
Volumes needed (all on-line):
  Backup.01 at Backup_01
Total estimated disk space needed for recover is 1532 KB.
Requesting 1 file(s), this may take a while...
Recover start time: Sun 08 May 2016 18:28:46 AEST
Requesting 1 recover session(s) from server.
129290:recover: Successfully established direct file retrieve session for save-set ID '2922310001' with adv_file volume 'Backup.01'.
./Stop, Collaborate and Listen.pdf
Received 1 file(s) from NSR server `orilla'
Recover completion time: Sun 08 May 2016 18:28:46 AEST
recover> quit

And that’s how simple the process is.

While ideally we shouldn’t be doing this sort of backup (a double network transfer is hardly bandwidth efficient), it’s always good to keep it in your repertoire just in case you need it.

Oct 27 2015

NetWorker 9 introduces a new action that can be incorporated into workflows, Check Connectivity. You can use this prior to a backup action to confirm that you have connectivity to a host before starting the backup. Now, you may think this is a little odd, since NetWorker effectively checks connectivity as part of the backup process, but that’s if you’re looking at Check Connectivity on a per-host basis. Used optimally, Check Connectivity allows you to easily streamline the process of confirming that all hosts are available before starting the backup.

This option is important when we consider multi-host applications and services within environments where it’s actually deemed critical that the backup either run for everything or nothing. That way you can’t (in theory) capture logically inconsistent backups of the environment – for example, getting a backup of an application server but not the database that runs in conjunction with it.

In the example policy below I’ve created a simple workflow that does the following:

  • Checks client connectivity
  • If that’s successful:
    • Executes a backup of the hosts in question to the AFTD_Backup pool
    • Clones those backups to the AFTD_Clone pool

Multihost Workflow and Policy

I’ll step through the check connectivity activity so you can see what it looks like:

Check Connectivity Action Screen 1

Check Connectivity Action Screen 2

This is probably the most important option in the check connectivity action: “Succeed only after all clients succeed” – in other words, the action will fail if any of the clients we want to backup can’t be contacted.
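The all-or-nothing behaviour is simple to reason about if you sketch it out. The code below is my own illustration, not NetWorker’s implementation: the TCP probe is an assumed stand-in for a real client check, and port 7937 is assumed as the client service port, so substitute whatever suits your environment.

```python
import socket


def tcp_probe(host, port=7937, timeout=3.0):
    """Assumed probe: can a TCP connection be opened to the given port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_connectivity(hosts, probe=tcp_probe):
    """Succeed only if every host answers; also return the ones that did not."""
    unreachable = [host for host in hosts if not probe(host)]
    return (not unreachable, unreachable)
```

Called as `check_connectivity(["test01", "test02"])`, it returns a success flag plus the list of hosts that failed, and the backup would only be started when the flag is true. Passing a different `probe` function makes it easy to swap the check or to test the logic without a network.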

Check Connectivity Action Screen 3

Check Connectivity Action Screen 4

It’s a pretty simple action, as you can see.

Zooming in a little on the workflow visualisation, you’ll see it in more detail here:

Multihost Workflow Visualisation

By the way, I’m loving the option to edit components of the workflow and actions from the visualisation, i.e.:

Multihost Workflow Visualisation Pool Edit

In order to test and demonstrate the check connectivity action, I configured 6 backup clients:

  • test01
  • test02
  • test03
  • test04
  • test05
  • test06

On the first test, I made sure NetWorker was running on all 6 clients, and the backup/clone actions were permitted to execute after a successful connectivity test:

Multihost Workflow Executing Successfully

Now, after that finished, I shut down the NetWorker services on one of the clients, test06, to see how this would impact the check connectivity action:

Stopping NetWorker on a Client

With NetWorker stopped, the workflow failed as a result of the connectivity check failing for one of the hosts. The high level failure looked like this:

Multihost Workflow Failure

Double-clicking on the check connectivity action results in the Monitoring view of NMC showed me the following:

Check Connectivity Error Dialog

To see the messages in more detail I just copied and pasted them into Notepad, which revealed the full details of the connectivity testing:

Multihost Workflow Check Connectivity Results

And there you have it. For sure, I’ve done this sort of multi-host connectivity testing in the past using NetWorker 8 and 7 (actually, even using NetWorker 6), but it’s always required nesting savegroups, where the parent savegroup executes a pre-command to check via rpcinfo the availability of each host in the child savegroup before using nsradmin to invoke the child savegroup. It’s a somewhat messy approach and requires executing at least some form of backup in the parent savegroup (otherwise NetWorker declares the parent group a failure). The new functionality is simple, straightforward and easily incorporated into a workflow.

If you have the requirement in your environment to ensure that either all or none of the clients in a group are backed up, this is an excellent reason to upgrade to NetWorker 9. If you’re already on NetWorker 9, keep an eye out for where you can incorporate this into your policies and workflows.

Oct 26 2015

As mentioned in my introductory post about it, NetWorker 9 introduces the option to perform Block Based Backups (BBB) for Linux systems. (This was introduced in NetWorker 8 for Windows, and has actually had its functionality extended for Windows in v9 as well, with the option to now perform BBB for Hyper-V and Exchange systems.)

BBB is a highly efficient mechanism for backing up without worrying about the cost of walking the filesystem. Years ago I showed just how much filesystem density can have a massive detrimental impact on the performance of a backup. While often the backup product is blamed for being “slow”, the fault sits completely with operating system and filesystem vendors for having not produced structures that scale sufficiently.

BBB gets us past that problem by side-stepping the filesystem and reading directly from the underlying disk or LUN. Instead of walking files, we just have to traverse the blocks. In cases where filesystems are really dense, the cost of walking the filesystem can increase the run-time of the backup by an order of magnitude or more. Taking that out of the picture allows businesses to protect these filesystems much faster than via conventional means.

Since BBB needs to integrate at a reasonably low level within a system structure in order to successfully operate, NetWorker currently supports only the following systems:

  • CentOS 7
  • RedHat Enterprise Linux v5 and higher
  • SLES Linux 11 SP1 and higher

In all cases, you need to be running LVM2 or Veritas Volume Manager (VxVM), and be using ext3 or ext4 filesystems.

To demonstrate the benefits of BBB in Linux, I’ve set up a test SLES 11 host and used my genfs2 utility on it to generate a really (nastily) dense filesystem. I actually aborted the utility when I had 1,000,000+ files on the filesystem – consuming just 11GB of space:

genfs2 run

I then configured a client resource and policy/workflow to do a conventional backup of the /testfs filesystem. That’s without any form of performance enhancement. From NetWorker’s perspective, this resulted in about 8.5GB of backup, and with 1,178,358 files (and directories) total took 36 minutes and 37 seconds to backup. (That’s actually not too bad, all things considered – but my lab environment was pretty much quiesced other than the test components.)

Conventional Backup Performance

Next, I switched over to parallel savestreams – which has become more capable in NetWorker 9 given NetWorker will now dynamically rebalance remaining backups all the way through to the end of the backup. (Previously the split was effectively static, meaning you could have just one or two savestreams left running by themselves after others had completed. I’ll cover dynamic parallel savestreams in more detail in a later post.)

With dynamic parallel savestreams in play, the backup time dropped by over ten minutes – a total runtime of 23 minutes and 46 seconds:

Dynamic Parallel Savestream Runtime

The next test, of course, involves enabling BBB for the backup. So long as you’ve met the compatibility requirements, this is just a trivial checkbox selection:

Enabling Block Based Backup

With BBB enabled the workflow executed in just 6 minutes and 48 seconds:

Block Based Backup Performance

That’s a substantially shorter runtime – the backups have dropped from over 36 minutes for a single savestream to under 7 minutes using BBB and bypassing the filesystem. While Dynamic Parallel Savestreams did make a substantial difference (shaving just over a third from the backup time), BBB was the undisputed winner for maximising backup performance.
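For reference, here is the arithmetic on the three runtimes quoted in this post. The helper names are mine; the timings are the ones measured above.

```python
def seconds(minutes, secs):
    """Convert a minutes/seconds runtime into total seconds."""
    return minutes * 60 + secs


def reduction(runtime, baseline):
    """Fraction of the baseline runtime that was saved."""
    return 1 - runtime / baseline


runtimes = {
    "conventional": seconds(36, 37),
    "dynamic parallel savestreams": seconds(23, 46),
    "block based backup": seconds(6, 48),
}
baseline = runtimes["conventional"]

for name, runtime in runtimes.items():
    saved = reduction(runtime, baseline)
    speedup = baseline / runtime
    print(f"{name}: {runtime}s, {saved:.1%} saved, {speedup:.1f}x")
```

That works out to roughly 35% saved with dynamic parallel savestreams and roughly 81% saved (a bit over five times faster) with BBB.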

One final point – if you’re doing BBB to Data Domain, NetWorker now automatically executes a synthetic full (using the Data Domain virtual synthetic full functionality) at the end of every incremental BBB backup you perform:

Automatic virtual synthetic full

The advantage of this is that recovery from BBB is trivial – just point your recovery process (either command line, or via NMC) at the date you want to recover from, and you have visibility of the entire filesystem at that time. If you’re wondering what FLR from BBB looks like on Linux, by the way, it’s pretty straightforward. Once you identify the saveset (based on date – remember, it’ll contain everything), you can just fire up the recovery utility and get:



Logging in using another terminal session, it’s just a simple case of browsing to the directory indicated above and copying the files/data you want:

BBB FLR directory listing

And there you have it. If you’ve got highly dense Linux filesystems, you might want to give serious thought towards upgrading to NetWorker 9 so you can significantly increase the performance of their backup. NetWorker + Linux + BBB is a winning combination.

Basics – Running VMware Protection Policies from the Command Line

Mar 10 2015

If you’ve been adopting VMware Protection Policies via VBA in your environment (like so many businesses have been!), you’ll likely reach a point where you want to be able to run a protection policy from the command line. Two immediate example scenarios would be:

  • Quick start of a policy via remote access*
  • External scheduler control

(* May require remote command line access. You can tell I’m still a Unix fan, right?)

Long-term users of NetWorker will know a group can be initiated from the backup server by using the savegrp command. When EMC introduced VMware Protection Policies, they also introduced a new command, nsrpolicy.

The simplest way to invoke a policy is as follows:

# nsrpolicy -p policyName

For example:

[root@centaur ~]# nsrpolicy -p SqueezeProtect
99528:nsrpolicy: Starting Vmware Protection Policy 'SqueezeProtect'.
97452:nsrpolicy: Starting action 'SqueezeProtect/SqueezeBackup' with command: 'nsrvba_save -s centaur -j 544001 -L incr -p SqueezeProtect -a SqueezeBackup'.
97457:nsrpolicy: Action 'SqueezeProtect/SqueezeBackup's log will be in /nsr/logs/policy/SqueezeProtect/544002.
97461:nsrpolicy: Action 'SqueezeProtect/SqueezeBackup' succeeded.
99529:nsrpolicy: Vmware Protection Policy 'SqueezeProtect' succeeded.

There you go – it’s that easy.
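For the external scheduler scenario, a thin wrapper around that command might look like the sketch below. The only thing taken from above is the `nsrpolicy -p policyName` invocation; the function names are mine, and treating a zero exit status as success is an assumption you should verify in your own environment.

```python
import subprocess


def nsrpolicy_command(policy_name):
    """Build the command line shown above: nsrpolicy -p policyName."""
    return ["nsrpolicy", "-p", policy_name]


def run_policy(policy_name, runner=subprocess.run):
    """Run the policy and return (succeeded, output).

    Assumption: a zero exit status means the policy succeeded.
    """
    result = runner(nsrpolicy_command(policy_name),
                    capture_output=True, text=True)
    return result.returncode == 0, result.stdout
```

A scheduler job would call `run_policy("SqueezeProtect")` on the backup server and raise an alert whenever the first value comes back false. The `runner` parameter exists purely so the logic can be exercised without NetWorker installed.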

Jan 10 2015

One of the great new features in NetWorker is the integration of Instant Access, whereby virtual machines backed up with the VBA appliance to Data Domain systems may be instantly accessed from the Data Domain without needing to actually recover them. This allows you to quickly start up a failed service even as you’re migrating the virtual machine to a production datastore, or pull one or two essential files out of the virtual machine without needing to resort to a file level recovery.

To see this in action, I configured a lab virtual machine for backups then did an Instant Access operation on it.


In the above screen shot, I picked a VM that hadn’t been used for VBA backups previously, test03, and added it to the Data Domain backup policy, DDBackup. I was then able to run the policy to get a brand new backup of the VM:


Of course, because the virtual machine was a plain CentOS Linux install, much like other Linux VMs that had been backed up, the first full backup was still remarkably quick. Once that was completed, the bulk of the work shifted across to the vSphere Web Client:


You’ll need to follow your standard enterprise operational practices for logon, obviously. In this case being a lab server I’m using the virtual vCenter Appliance, and I logged on as the root user. Next stop, the EBR plugin:


Once logged in, go to Restore and drill down to the virtual machine backup instance you want to recover:


With the virtual machine backup instance selected, if the backup target was a Data Domain running the right DDOS (5.4 or higher), you’ll be able to initiate the Instant Access option:


The Instant Access wizard is pretty straightforward and doesn’t really require much thought, other than what the ‘restored’ virtual machine will be named and where in the cluster it’ll be made available:


Having nominated the name and location, you can continue on to final confirmation of the operation:


When ready, you can click Finish and before you know it, you’ll see this:


Now, here’s the kicker. By the time you’ve clicked OK and switched back to, say, the vSphere Windows client, your VM will likely be waiting for you:


There it is in the ‘Test Clients’ pool. It really takes almost no time at all: Instant access is not a lie. You can see the temporary datastore that the VBA appliance has provided for the recovery if you go up to your storage resources, too:


In this case, because the virtual machine I ‘restored’ wasn’t running any services that publish their presence, it was safe to run both virtual machines at the same time, since the ‘restored’ virtual machine gets reconfigured to use DHCP, thus getting a different IP address to the original:


In the above, the top console is for the original virtual machine, and the bottom console is for the one made available via Instant Access.

At this point, you’ve got a couple of options – you can either pull out the files you want from the virtual machine using normal operating system access techniques, or you can keep the virtual machine running and migrate it to a production datastore. The migration works the same way as any normal VMware migration, so in this case I just powered down the virtual machine and removed it from Inventory:


Once you’ve done that, your only other task is to drop the temporary datastore so that VBA cleans up after itself. I’ve found the simplest way to do this is to switch back to the Web GUI and go to do another instant restore of the same virtual machine. This will trigger the following prompt:


At that point, you can just hit Unmount, then subsequently cancel the operation.

And there you have it – Instant Access. It really is that quick and simple.



Oct 23 2014

Long-term NetWorker administrators may remember that NetWorker used to have a somewhat odd mechanism of dealing with renamed directories. Nowadays the default option for any new client is to enable backup renamed directories, and this is a good thing, even though it might end up using a bit more backup media.

To explain the difference between then and now, and why the new default is so much better, I first have to set up a scenario.

Consider a client that has a directory called /renaming/backup, and underneath that directory there’s another directory called /renaming/backup/alpha. The named saveset for this client will be /renaming/backup, which will capture all subdirectories.

In our scenario, there will first be a full backup of /renaming/backup, then the alpha directory will be renamed to beta, and a new backup taken.

Temporarily reinstating the old mechanism by turning off “backup renamed directories” for this client, there’s a considerable difference between the full backup done with a /renaming/backup/alpha subdirectory and the subsequent incremental with alpha renamed to beta. First, the full:

Backup Renamed Directories Off, Full Backup

After that backup was taken, I renamed the alpha directory to beta and re-ran the backup. Here’s what the savegroup completion looked like:

Backup renamed directories off, incremental backup after directory rename

You’ll note there that the size of the incremental backup of /renaming/backup was just 2KB, which is roughly what you’d expect from backing up the details of a changed directory, but nothing underneath that directory.

And that was sort of the problem with the old method, right there. A recovery following that second backup of /renaming/backup would yield odd results:

[root@centaur backup]# recover
Current working directory is /renaming/backup/
recover> add beta
1 file(s) marked for recovery
recover> relocate /renaming/recovery/backup-rename-off
recover> recover
Recovering 1 file from /renaming/backup/ into /renaming/recovery/backup-rename-off
Volumes needed (all on-line):
        centaur.003 at AFTD-01
Total estimated disk space needed for recover is 36 KB.
Requesting 1 file(s), this may take a while...
Recover start time: Mon 20 Oct 2014 19:52:42 AEDT
Requesting 1 recover session(s) from server.
91651:recover: Successfully established AFTD DFA session for recovering save-set ID '4114926770'.
Received 1 file(s) from NSR server `centaur'
Recover completion time: Mon 20 Oct 2014 19:52:42 AEDT

[root@centaur backup]# cd ..
[root@centaur renaming]# ls
backup  recovery
[root@centaur renaming]# cd recovery
[root@centaur recovery]# ls
[root@centaur recovery]# cd backup-rename-off/
[root@centaur backup-rename-off]# ls
[root@centaur backup-rename-off]# cd beta
[root@centaur beta]# ls
<crickets chirping>

To recover the contents of the beta directory, one had to instead switch to a browse time before the rename happened, and recover the old directory name. As you might imagine, this required the recovery operator to have rather intimate knowledge of when directories had been renamed.

Jumping forward to now, we have a much more agreeable mechanism. After a delete of all the backups for the client, the backup process takes up a bit more space, but results in a simpler, more reliable recovery. First, I renamed /renaming/backup/beta back to /renaming/backup/alpha. Then, the full backup:

Full backup with backup renamed directories on

After that backup completed, I renamed /renaming/backup/alpha to /renaming/backup/beta and re-ran the backup – again, as an incremental:

Backup renamed directories on, incr backup after renaming a directory

You’ll notice in this scenario the incremental is as big as the previous full, since the alpha (or beta) directory is the only subdirectory of /renaming/backup.

However, that little hit on the backup space is more than made up for by a simplified recovery process. Executing a recovery after the second backup completes yields the following results:

[root@centaur backup]# recover
Current working directory is /renaming/backup/
recover> ls
recover> add beta
11 file(s) marked for recovery
recover> relocate /renaming/recovery/backup-renamed-on
recover> recover
Recovering 11 files within /renaming/backup/ into /renaming/recovery/backup-renamed-on
Volumes needed (all on-line):
        centaur.003 at AFTD-01
Total estimated disk space needed for recover is 1047 MB.
Requesting 11 file(s), this may take a while...
Recover start time: Mon 20 Oct 2014 20:03:29 AEDT
Requesting 1 recover session(s) from server.
91651:recover: Successfully established AFTD DFA session for recovering save-set ID '4014264195'.
Received 11 file(s) from NSR server `centaur'
Recover completion time: Mon 20 Oct 2014 20:03:36 AEDT
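The difference between the two behaviours can be sketched with a toy model. This is only an illustration under stated assumptions, not how NetWorker actually implements the feature: it assumes an incremental selects entries purely on modification time, that a rename updates only the parent directory's timestamp, and that with the option on, any directory path not seen in the previous backup is captured in full. The `Entry` class, the `incremental` function and the timestamps are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str
    mtime: int          # made-up modification time
    is_dir: bool = False


def incremental(entries, last_backup, previous_paths, backup_renamed_dirs):
    """Return the paths a toy incremental backup would capture."""
    # Directories whose path was not present in the previous backup.
    new_dirs = [e.path for e in entries
                if e.is_dir and e.path not in previous_paths]
    selected = []
    for e in entries:
        changed = e.mtime > last_backup
        under_new_dir = any(e.path == d or e.path.startswith(d + "/")
                            for d in new_dirs)
        if changed or (backup_renamed_dirs and under_new_dir):
            selected.append(e.path)
    return selected


# State captured by the full backup, taken at time 100.
previous = {"/renaming/backup", "/renaming/backup/alpha",
            "/renaming/backup/alpha/file1", "/renaming/backup/alpha/file2"}

# State after renaming alpha to beta: only the parent's mtime moved.
current = [
    Entry("/renaming/backup", 200, is_dir=True),
    Entry("/renaming/backup/beta", 50, is_dir=True),
    Entry("/renaming/backup/beta/file1", 50),
    Entry("/renaming/backup/beta/file2", 50),
]

print(incremental(current, 100, previous, backup_renamed_dirs=False))
# ['/renaming/backup']
print(incremental(current, 100, previous, backup_renamed_dirs=True))
# all four paths, including everything under beta
```

With the option off, the toy incremental captures only the changed parent directory, which mirrors the tiny incremental and the empty beta directory on recovery. With it on, the whole renamed tree is captured again, which mirrors the larger incremental and the straightforward recovery.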

If the version of NetWorker you’re using was set up more recently, it’s more than likely clients you’ve created have backup renamed directories already turned on. If you’re working with an older version of NetWorker, or a NetWorker server that has been in use since version 7.4 or older, it’s possible legacy clients still have the option turned off.

I heartily recommend all filesystem clients always have backup renamed directories enabled.

May 31 2013


The other day I stumbled across a link to an article, Why you should stop buying servers. The title was interesting enough to grab my attention so I had a quick peruse through it. It’s an article about why you should start using the cloud rather than buying local infrastructure.

While initially I was reasonably skeptical of cloud, that view has been tempered over time. When handled correctly, cloud or cloud-like services will definitely be a part of the business landscape for some time to come. (I personally suspect we’ll see pendulum swings on cloud services in pretty much exactly the same way as we see pendulum swings on outsourcing.)

The linchpin in that statement above though is when handled correctly; in this case, I was somewhat concerned at the table showing the merits of cloud servers vs local servers when it came to backup:

Cloud vs Local Servers

This comparison to me shows a key question people aren’t yet asking of cloud services companies:

Do you understand backup?

It’s not a hard question, but it does deserve hard answers.

To say that a remote snapshot of a virtual server represents an offsite backup in a single instance may be technically true (minus fine print on whether or not application/database consistent recovery can be achieved), but it’s hardly the big picture on backup policies and processes. In fact, it’s about as atomic as you can get.

I had the pleasure of working with an IaaS company last year to help formulate their backup strategy; their intent was clear: to make sure they were offering business suitable and real backup policies for potential customers. So, to be blunt: it can be done.

As someone who has worked in backup my entire professional career, the above table scares me. In a single instance it might be accurate (might); as part of a full picture, it doesn’t even scratch the surface. Perhaps what best sums up my concerns with this sort of information is this rollover at the top of the table:

Sponsor content rollover

Several years back now, I heard an outsourcer manager crowing about getting an entire outsourcing deal signed, with strict requirements for backup and penalties for non-conformance that didn’t once mention the word recovery. It’s your data, it’s your business, you have a right and an obligation to ask a cloud services provider:

Do you understand backup?


Healthy paranoia

Jun 01 2012


Are your backup administrators people who are naturally paranoid?

What about your Data Protection Advocate?

What about the members of your Information Protection Advisory Council?

There’s healthy paranoia, and then there’s crazy paranoia. (Or as is trendy to say these days, “cray cray”.)

Being a facet of Information Lifecycle Protection, backup is about having healthy paranoia. It’s about behaving both as a cynic and a realist:

  • The realist will understand that IT is not immune to failures, and
  • The cynic will expect that cascading or difficult failures will occur.

Driven by a healthy sense of paranoia, part of the challenge of being involved in backup is an ability to plan for bad situations. If you’re involved in backup, you should be used to asking “But what if…?”

As I say in my book, backup is a game of risk vs cost:

  1. What’s the risk of X happening?
  2. What’s the cost of protecting against it?
  3. What’s the cost of not protecting against it?

Paranoia, in the backup game, is being able to quantify the types of risk and exposure the business has – item 1 in the above list. Ultimately, items 2 and 3 become business decisions, but item 1 is almost entirely the domain of the core backup participants.

As such, those involved in backup (the backup administrators, the DPA, the IPAC) need to be responsible for development and maintenance of a risk register. This should be a compilation of potential data loss (and potentially data availability loss*) situations, along with:

  • Probabilities of the event occurring (potentially just as “High”, “Low”, etc.);
  • Current mitigation techniques;
  • Preferred or optimal mitigation techniques;
  • Whether the risk is a primary risk (i.e., one that can happen in and of itself), or a secondary risk (i.e., can only happen after another failure);
  • RPO and RTO.
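As a rough illustration of what such a register might look like when kept in a structured form, here is a minimal Python sketch. The class and field names, the three-level probability scale, the sort order and the sample entries are all assumptions made for the example; they simply mirror the bullet points above and are not a prescribed format.

```python
from dataclasses import dataclass
from enum import Enum


class Probability(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class RiskEntry:
    description: str
    probability: Probability
    current_mitigation: str
    preferred_mitigation: str
    primary: bool        # True: can happen by itself; False: only after another failure
    rpo_hours: float     # recovery point objective
    rto_hours: float     # recovery time objective


def by_priority(register):
    """Order entries by probability (highest first), primary risks before secondary."""
    return sorted(register, key=lambda r: (-r.probability.value, not r.primary))


# Hypothetical entries, purely for illustration.
register = [
    RiskEntry("Backup server failure during a recovery", Probability.LOW,
              "None", "Replicated backup storage", False, 24, 48),
    RiskEntry("Accidental file deletion by a user", Probability.HIGH,
              "Nightly backup", "Nightly backup plus hourly snapshots", True, 24, 4),
    RiskEntry("Primary storage array failure", Probability.MEDIUM,
              "Nightly backup", "Array replication to a second site", True, 24, 8),
]

for entry in by_priority(register):
    kind = "primary" if entry.primary else "secondary"
    print(f"{entry.probability.name:<6} {kind:<9} {entry.description}")
```

Keeping the current and preferred mitigation side by side in each entry makes it easier to see where the gaps are when the register is reviewed.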

This register then gets fed back first to the broader IT department to answer question two in the risk vs cost list (“What’s the cost of protecting against it?”), but following that, it gets fed back to the business as a whole to answer the third question in the risk vs cost list (“What’s the cost of not protecting against it?”).

Finally, it’s important to differentiate between healthy paranoia and paranoia:

  • Healthy paranoia comes from acknowledging risks, prioritising their potential, and coming up with mitigation plans before deciding a response;
  • Paranoia (or unhealthy paranoia) happens when risks are identified, but mitigation is attempted before the risk is formally evaluated.

A backup administrator, given carte blanche over the company budget, could spend all of it for 5 years and still not protect against every potential failure the company could ever conceivably have. That’s unhealthy paranoia. Healthy paranoia is correctly identifying and prioritising risk so as to provide maximum appropriate protection for the business within reasonable budgetary bounds.

* Arguably, data availability loss is a broader topic that should also have significant involvement by other technical teams and business groups.