Mar 30 2017

World backup day is approaching. (A few years ago now, someone came up with the idea of designating one day of the year to recognise backups.) Funnily enough, I’m not a fan of world backup day, simply because we don’t back up for the sake of backing up; we back up to recover.

Every day should, in fact, be world backup day.

Something that isn’t done enough – isn’t celebrated enough, isn’t tested enough – is recovery. For many organisations, recovery tests consist of actually doing a recovery when requested, and things like long term retention backups are rarely tested, and even more rarely recovered from.

bigStock Rescue

So I’d like to suggest you treat this Friday, March 31, not as World Backup Day, but as World Recovery Test Day. Use the opportunity to run a recovery test within your organisation (following proper processes, of course!) – preferably a recovery you don’t normally run in terms of day to day operations. People only request file recoveries? Sounds like a good reason to run an Exchange, SQL or Oracle recovery to me. Most recoveries are Exchange mail level recoveries? Excellent – you know they work, so run a recovery of a complete filesystem somewhere.

All your recoveries are done within 30 days of the backup being taken? That sounds like an excellent reason to test a recovery from an LTR backup written 2+ years ago, too.

Part of running a data protection environment is having routine tests to validate ongoing successful operations and to be able to confidently report back to the business that everything is OK. There’s another, personal and selfish aspect to it, too. It’s one I learnt more than a decade ago when I was still an on-call system administrator: having well-tested recoveries means that you can sleep easily at night, knowing that if the pager or mobile phone does shriek you into blurry-eyed wakefulness at 1am, you can in fact log onto the required server and run the recovery without an issue.

So this World Backup Day, do a recovery test.

The need to have an efficient and effective testing system is something I cover in more detail in Data Protection: Ensuring Data Availability. If you want to know more, feel free to check out the book on Amazon or CRC Press. Remember that it doesn’t matter how good the technology you deploy is if you don’t have the processes and training to use it.

Jan 11 2017

There are currently a significant number of vulnerable MongoDB databases being attacked and held to ransom, and even though the attacks are ongoing, it’s worth taking a moment or two to reflect on some key lessons that can be drawn from them.

If you’ve not heard of it, you may want to check out some of the details linked to above. The short summary though is that MongoDB’s default deployment model has been a rather insecure one, and it’s turned out there’s a lot of unsecured public-facing databases out there. A lot of them have been hit by hackers recently, with the contents of the databases deleted and the owners told to pay a ransom to get their data back. Whether paying will actually get them their data back is, of course, another issue.

Ransomware Image

The first lesson of course is that data protection is not a single topic. More so than a lot of other data loss situations, the MongoDB scenario points to the simple, root lesson for any IT environment: data protection is also a data security factor:

Data Protection

For the most part, when I talk about Data Protection I’m referring to storage protection – backup and recovery, snapshots, replication, continuous data protection, and so on. That’s the focus of my next book, as you might imagine. But a sister process in data protection has been and will always be data security. So in the first instance, the incoming threat vector in the MongoDB attacks comes entirely from the simple scenario of unsecured systems. A lackadaisical approach to security is exactly what we’ve seen – from developers and deployers alike – in the MongoDB space, and the result to date is estimated to be around 93TB of data wiped. That number will only go up.

The next lesson though is that backups are still needed. In The MongoDB attacks: 93 terabytes of data wiped out (linked again from above), Dissent writes that of 118 victims analysed:

Only 13 report that they had recently backed up the now-wiped database; the rest reported no recent backups

That number is awful. Just over 11% of impacted sites had recent backups. That’s not data protection, that’s data recklessness. (And as the report mentions, 73% of the databases were flagged as being production.) In one instance:

A French healthcare research entity had its database with cancer research wiped out. They reported no recent backup.

That’s another lesson there: data protection isn’t just about bits and bytes, it’s about people’s lives. If we maintain data, we have an ethical obligation to protect it. What if that cancer data above held some clue, some key, to saving someone’s life? Data loss isn’t just data loss: it can lead to loss of money, loss of livelihood, or perhaps even loss of life.

Those details are from a sample of 118 sourced from a broader category of 27,000 hit systems.

So the next lesson is that even now, in 2017, we’re still having to talk about backup as if it’s a new thing. During the late 90s I thought there was a light at the end of the tunnel for discussions about “do I need backup?”; I’ve long since resigned myself to the fact I’ll likely still be having those conversations up until the day I retire. But it’s a chilling reminder of the ease with which systems can now be deployed without adequate protection. One of the common responses you’ll see to “we can’t back this up”, particularly for larger databases, is the time taken to complete a backup. That’s something Dell EMC has been focused on for a while now. There’s storage integrated data protection via ProtectPoint, and more recently, there’s BoostFS for Data Domain, bringing distributed segment processing directly onto the database server for high speed deduplicated backups. (And yes, MongoDB was one of the systems in mind when BoostFS was developed.) If you’ve not heard of BoostFS yet, it was included in DDOS 6, released last year.

It’s not just backup though – for systems with higher criticality there should be multi-layered protection strategies: backups will give you potentially longer term retention, and off-platform protection, but if you need really fast recovery times with very low RPOs and RTOs, your system will likely need replication and snapshots as well. Data protection isn’t a “one size fits all” scenario that some might try to preach; it’s multi-layered and it can encompass a broad range of technology. (And if the data is super business critical you might even want to go the next level and add IRS protection for it, protecting yourself not only from conventional data loss, but also situations where your business is hacked as well.)

The fallout and the data loss from the MongoDB attacks will undoubtedly continue for some time. If one thing comes out of it, I’m hoping it’ll be a stronger understanding from businesses in 2017 that data protection is still a very real topic.


A speculative lesson: what percentage of these MongoDB deployments fall under the banner of ‘Shadow IT’ – i.e., systems deployed outside of IT, by developers or other business groups within organisations? Does this also serve as a reminder of the risks that can be introduced when non-IT groups deploy IT systems without appropriate processes and rigour? We may never know the breakdown between IT-led and Shadow IT-led deployments, but it’s certainly food for thought.

Aug 09 2016

I’ve recently been doing some testing around Block Based Backups, and specifically recoveries from them. This has acted as an excellent reminder of two things for me:

  • Microsoft killing Technet is a real PITA.
  • You backup to recover, not backup to backup.

The first is just a simple gripe: running up an eval Windows server every time I want to run a simple test is a real crimp in my style, but $1,000+ licenses for a home lab just can’t be justified. (A “hey this is for testing only and I’ll never run a production workload on it” license would be really sweet, Microsoft.)

The second is the real point of the article: you don’t backup for fun. (Unless you’re me.)

iStock Racing

You ultimately backup to be able to get your data back, and that means deciding your backup profile based on your RTOs (recovery time objectives), RPOs (recovery point objectives) and compliance requirements. As a general rule of thumb, this means you should design your backup strategy to meet at least 90% of your recovery requirements as efficiently as possible.
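To make the RPO side of that concrete, here’s a toy sketch – hypothetical numbers, not a NetWorker facility: with periodic backups, the worst-case data loss equals the backup interval, so the interval must not exceed the RPO.

```python
# Toy illustration (not a NetWorker tool): with periodic backups, the
# worst-case data loss is the backup interval, so the interval must not
# exceed the RPO. The figures below are hypothetical examples.

def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Worst case, a failure occurs just before the next backup runs."""
    return backup_interval_hours <= rpo_hours

# A daily (24-hour) backup cycle satisfies a 24-hour RPO...
print(meets_rpo(24, 24))   # True
# ...but not a 4-hour RPO - that needs replication/snapshots as well.
print(meets_rpo(24, 4))    # False
```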

For many organisations this means backup requirements can come down to something like the following: “All daily/weekly backups are retained for 5 weeks, and are accessible from online protection storage”. That’s why a lot of smaller businesses in particular get Data Domains sized for, say, 5-6 weeks of daily/weekly backups and 2-3 monthly backups before moving data off to colder storage.
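As a hedged back-of-envelope sketch of that sizing model – every number here (front-end size, change rate, dedupe ratio) is a hypothetical assumption for illustration, not a Dell EMC sizing formula:

```python
# Rough capacity sketch for the retention model described above: roughly
# 5 weeks of daily/weekly backups plus a few monthlies on deduplicated
# storage. All inputs are hypothetical assumptions, not vendor guidance.

def rough_capacity_tb(front_end_tb, weeks=5, monthlies=3,
                      daily_change=0.02, dedupe=10.0):
    # One full plus daily incrementals across the retention window,
    # plus extra fulls for the monthlies, divided by the dedupe ratio.
    logical = front_end_tb * (1 + daily_change * weeks * 7) \
              + front_end_tb * monthlies
    return logical / dedupe

# e.g. 50TB front-end at 2% daily change and 10:1 dedupe:
print(round(rough_capacity_tb(50), 1))   # 23.5
```

Real sizing exercises obviously involve far more variables (compression, growth, clone copies), which is why vendors have proper sizing tools.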

But while online storage is online storage, we still have to think of local requirements, SLAs and flow-on changes for LTR/Compliance retention when we design backups.

This is something we can consider with things even as basic as the humble filesystem backup. These days there’s all sorts of things that can be done to improve the performance of dense (and dense-like) filesystem backups – by dense I’m referring to very large numbers of files in relatively small storage spaces. That’s regardless of whether the density is in local knots on the filesystem (e.g., a few directories that are massively oversubscribed in terms of file counts), or whether it’s just a big, big filesystem in terms of file count.

We usually think of dense filesystems in terms of the impact on backups – and this is not a NetWorker problem; this is an architectural problem that operating system vendors have not solved. Filesystems struggle to scale their operational performance for sequential walking of directory structures when the number of files starts exponentially increasing. (Case in point: Cloud storage is efficiently accessed at scale when it’s accessed via object storage, not file storage.)
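To make “sequentially walking” concrete, here’s a minimal sketch of what a traditional filesystem backup has to do: visit every directory entry one at a time. Runtime grows with the file count, which is exactly why dense filesystems hurt.

```python
# A sequential filesystem walk in miniature: every entry must be visited
# one at a time, so runtime scales with file count - the core reason
# dense filesystems are slow for traditional file-by-file backups.
import os

def walk_count(root: str) -> int:
    files = 0
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            files += walk_count(entry.path)   # recurse into subdirectories
        else:
            files += 1
    return files
```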

So there’s a number of techniques that can be used to speed up filesystem backups. Let’s consider the three most readily available ones now (in terms of being built into NetWorker):

  • PSS (Parallel Save Streams) – Dynamically builds multiple concurrent sub-savestreams for individual savesets, speeding up the backup process by having multiple walking/transfer processes.
  • BBB (Block Based Backup) – Bypasses the filesystem entirely, performing a backup at the block level of a volume.
  • Image Based Backup – For virtual machines, a VBA based image level backup reads the entire virtual machine at the ESX/storage layer, bypassing the filesystem and the actual OS itself.

So which one do you use? The answer is a simple one: it depends.

It depends on how you need to recover, how frequently you might need to recover, what your recovery requirements are from longer term retention, and so on.

For virtual machines, VBA is usually the method of choice as it’s the most efficient backup method you can get, with very little impact on the ESX environment. It can recover a sufficient number of files in a single session for most use requirements – particularly if file services have been pushed (where they should be) into dedicated systems like NAS appliances. You can do all sorts of useful things with VBA backups – image level recovery, changed block tracking recovery (very high speed in-place image level recovery), instant access (when using a Data Domain), and of course file level recovery. But if your intent is to recover tens of thousands of files in a single go, VBA is not really what you want to use.

It’s the recovery that matters.

For compatible operating systems and volume management systems, Block Based Backups work regardless of whether you’re in a virtual machine or on a physical machine. If you need to back up a dense filesystem running on Windows or Linux that’s less than ~63TB, BBB could be a good, high speed method of achieving that backup. Equally, BBB can be used to recover large numbers of files in a single go, since you just mount the image and copy the data back. (I recently did a test where I dropped ~222,000 x 511 byte text files into a single directory on Windows 2008 R2 and copied them back from BBB without skipping a beat.)
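If you want to reproduce a similar dense-directory test yourself, a scaled-down generator looks something like this (the 511-byte size matches the test above; the count is reduced so the sketch runs quickly):

```python
# Generate a dense-directory test set: many small files of a fixed size
# in a single directory. The 511-byte payload mirrors the test described
# in the text; scale the count up to stress a real backup.
import os

def make_dense_dir(path: str, count: int, size: int = 511) -> None:
    os.makedirs(path, exist_ok=True)
    payload = b"x" * size
    for i in range(count):
        with open(os.path.join(path, "file%06d.txt" % i), "wb") as f:
            f.write(payload)
```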

BBB backups aren’t readily searchable though – there’s no client file index constructed. They work well for systems where content is of a relatively known quantity and users aren’t going to be asking for those “hey I lost this file somewhere in the last 3 weeks and I don’t know where I saved it” recoveries. It’s great for filesystems where it’s OK to mount and browse the backup, or where there’s known storage patterns for data.

It’s the recovery that matters.

PSS is fast, but in any smack-down test BBB and VBA backups will beat it for backup speed. So why would you use it? For a start, it’s available on a wider range of platforms – VBA requires ESX virtualised backups, BBB requires Windows or Linux and ~63TB or smaller filesystems, while PSS will work on pretty much everything other than OpenVMS – and its recovery options work with any protection storage as well. Both BBB and VBA are optimised for online protection storage and being able to mount the backup; PSS is an extension of the classic filesystem agent and is less specific.
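The selection logic can be codified as a sketch. The rules below are my reading of the constraints described here (VBA for ESX VMs, BBB for Windows/Linux volumes under ~63TB, PSS as the broadly compatible fallback), not an official decision matrix:

```python
# Candidate backup methods for a client, per the constraints in the text.
# This is an illustrative decision helper, not an official EMC matrix.

def candidate_methods(is_esx_vm: bool, os_name: str, fs_tb: float) -> list:
    methods = []
    if is_esx_vm:
        methods.append("VBA")                 # ESX virtualised backups only
    if os_name in ("Windows", "Linux") and fs_tb < 63:
        methods.append("BBB")                 # block based, ~63TB limit
    if os_name != "OpenVMS":
        methods.append("PSS")                 # works almost everywhere
    return methods

print(candidate_methods(False, "Linux", 40))   # ['BBB', 'PSS']
```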

It’s the recovery that matters.

So let’s revisit that earlier question: which one do you use? The answer remains: it depends. You pick your backup model not on the basis of “one size fits all” (a flawed approach always in data protection), but your requirements around questions like:

  • How long will the backups be kept online for?
  • Where are you storing longer term backups? Online, offline, nearline or via cloud bursting?
  • Do you have more flexible SLAs for recovery from Compliance/LTR backups vs Operational/BAU backups? (Usually the answer will be yes, of course.)
  • What’s the required recovery model for the system you’re protecting? (You should be able to form broad groupings here based on system type/function.)
  • Do you have any externally imposed requirements (security, contractual, etc.) that may impact your recovery requirements?

Remember there may be multiple answers. Image level backups like BBB and VBA may be highly appropriate for operational recoveries, but for long term compliance your business may have needs that trigger filesystem/PSS backups for those monthlies and yearlies. (Effectively that comes down to making the LTR backups as robust in terms of future infrastructure changes as possible.) That sort of flexibility of choice is vital for enterprise data protection.

One final note: the choices, once made, shouldn’t stay rigidly inflexible. As a backup administrator or data protection architect, your role is to constantly re-evaluate changes in the technology you’re using to see how and where they might offer improvements to existing processes. (When it comes to release notes: constant vigilance!)

Recovery survey

Jul 13 2015

Back in 2012, I ran a survey to gauge some basic details about recovery practices within organisations. (The report from that survey can be downloaded here.)

Recovery survey

It’s been a few years and it seems worthwhile coming back to that topic and seeing how things have changed within NetWorker environments. I’ve asked mostly the same questions as before, but this time I’ve expanded the survey to ask a few extra questions about what you’re recovering as well.

I’d really appreciate if you can take a few minutes to complete the survey using your best estimates. I’ll be running this survey until 31 August and will publish the results by mid-September.

The survey has now closed. Thanks to everyone who participated. Results coming in September.

Testing (and debugging) an emergency restore

Feb 25 2015

A few days ago I had some spare time up my sleeve, and I decided to test out the Emergency Restore function in NetWorker VBA/EBR. After all, you never want to test out emergency recovery procedures for the first time in an emergency, so I wanted to be prepared.

If you’ve not seen it, the Emergency Restore panel is accessed from your EBR appliance (https://applianceName:8580/ebr-configure) and looks like the following:

EBR Emergency Restore Panel

The goal of the Emergency Restore function is simple: you have a virtual machine you urgently need to restore, but the vCenter server is also down. Of course, in an ideal scenario, you should never need to use the Emergency Restore function, but ideal and reality don’t always converge with 100% overlap.

In this scenario, to simulate my vCenter server being down, I went into vCenter, selected the ESX server I wanted to recover a virtual machine for (c64), and disconnected it. To all intents and purposes, as far as the ESX server was concerned, vCenter was down – at least, enough to satisfy VBA that I really needed to use the Emergency Restore function.

Once you’ve selected the VM, and the backup of the VM you want to restore, you click the Restore button to get things underway. The first prompt looks like the following:

EBR ESX Connection Prompt

(Yes, my ESX server is named after the Commodore 64. For what it’s worth, my vCenter server is c128 and a smaller ESX server I’ve got configured is plus4.)

Entering the ESX server details and login credentials, you click OK to jump through to the recovery options (including the name of the new virtual machine):

EBR - Recovery Options

After you fill in the new virtual machine name and choose the datastore you want to recover to, it’s as simple as clicking Restore and the ball is rolling. Except…

EBR Emergency Restore Error

After about 5 minutes, it failed, and the error I got was:

Restore failed.

Server could not create a restore task at this time. Please ensure your ESX host is resolvable by your DNS server. In addition, as configuration changes may take a few minutes to become effective, please try again at a later time.

From a cursory inspection, I couldn’t find any reference to the error on the support website, so I initially thought I must have done something wrong. Having re-read the Emergency Restore section of the VMware Integration Guide a few times, I was confident I hadn’t missed anything, so I figured the ESX server might have been taking a few minutes to be sufficiently standalone after the disconnection, and gave it a good ten or fifteen minutes before reattempting, but got the same error.

So I went through and did a bit of digging on the actual EBR server itself, diving into the logs there. I eventually re-ran the recovery while tailing the EBR logs, and noticed it attempting to connect to a Data Domain system I knew was down at the time … and had my aha! moment.

You see I’d previously backed up the virtual machine to one Data Domain, but when I needed to run some other tests, changed my configuration and started backing up the virtual infrastructure to another Data Domain. EBR needed both online to complete the recovery, of course!

Once I had the original Data Domain powered up and running, the Emergency Restore went without a single hitch, and I was pleased to see this little message:

Successful submission of restore job

Before too long I was seeing good progress on the restore:

Emergency Restore Progress

And not long after that, I saw the sort of message you always want to see in an emergency recovery:

EBR Emergency Recovery Complete

There you have it – the Emergency Restore function tested well away from any emergency situation, and a bit of debugging while I was at it.

I’m sure you’ll hope you never need to use the Emergency Restore feature within your virtual environment, but knowing it’s there – and knowing how simple the process is – might help you avoid serious problems in an emergency.



A locale problem

Dec 01 2014

I had a doozy of a problem a short while ago – NetWorker 8.2 in a big environment, and every now and then the NMC Recovery interface would behave oddly. By oddly, I mean:

  • Forward/Back buttons might stop working when choosing between specific backups in the file browser
  • Manually entering a date/time might jump you to a different date/time
  • Backups that were executed extremely closely to each other (e.g., <15 minutes apart) might take a while to show up in NMC

Oddly enough, it actually looked like a DNS issue in the environment. Windows nslookups could often time out for 2 x 2 seconds before returning successfully, and just occasionally the gstd.raw log file on the NMC server would report name resolution oddities. This seemed borne out by the fact that recoveries executed directly from clients using the old winworkr interface or the CLI would work – with a separate NMC and NetWorker server, the name resolution paths between the two types of recoveries were guaranteed to be different.

(Just a quick interrupt. The NetWorker Usage Survey is happening again. Every year I ask readers to participate and tell me a bit about their environment. It’s short – I promise! – you only need around 5 minutes to answer the questions. When you’re finished reading this article, I’d really appreciate if you could jump over and do the survey.) 

But it was an interesting one. Over the years I’ve seen a few oddities in the way NMC behaves, and I wasn’t inclined to completely let NMC off the hook. So while we were digging down on the DNS scenarios, I was also talking to the support and eventually engineering teams about it from an NMC perspective.

It turned out to be a locale problem. A very locale problem. It also eventually made sense why I couldn’t reproduce it in a lab. You see, I’m a bit of a lazy Windows system builder – I do the install, patch it and then get down to work. I certainly don’t do customisation of the languages on the systems or anything like that.

But the friendly engineer assigned to the case did do just that, and it became obvious that the problems were only reproducible when the regional display format on a Windows host was set to either “English (Australian)” or “English (New Zealand)”.

By Windows host, I mean the machine that the NMC Java application was being run on – not the NMC server, not the NetWorker server, but the NMC client.

So, with the regional format set to “English (Australian)” or “English (New Zealand)”, NMC could behave oddly. But with the format set to “English (United States)”, the NMC recovery interface would purr like a kitten.
It’s certainly something worth keeping in mind if you’re using the recovery interface in NMC a lot – if something looks like it’s not quite right, flick your regional formats setting across to “English (United States)” and see whether that makes a big difference.
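I can only speculate about what NMC was doing internally, but short-date ambiguity is the classic way this class of locale bug bites – the same string means two different dates under en-AU and en-US conventions:

```python
# Illustration of locale-driven date ambiguity (not the actual NMC code):
# the same short-date string parses to two different dates depending on
# whether day/month (en-AU) or month/day (en-US) order is assumed.
from datetime import datetime

s = "03/01/2017"
as_au = datetime.strptime(s, "%d/%m/%Y")   # day/month/year -> 3 January
as_us = datetime.strptime(s, "%m/%d/%Y")   # month/day/year -> 1 March
print(as_au.month, as_us.month)   # 1 3
```

Any code that formats a date under one convention and parses it under another will quietly jump to the wrong date whenever the day is 12 or less – which is exactly the sort of “manually entering a date/time jumps you elsewhere” symptom described above.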

(Hey, now you’ve finished reading this article, just a friendly reminder: the NetWorker Usage Survey is happening again. Every year I ask readers to participate and tell me a bit about their environment. It’s short – I promise! – you only need around 5 minutes to answer the questions, and I’d really appreciate it if you could jump over and do the survey.)


Oct 07 2012

Thanks to everyone who provided responses into the recovery survey. It proved to be a very interesting insight into some of the recovery profiles businesses experience. It confirmed some generally held views about recovery, but it also highlighted some differences. For example, consider the results for recovery frequencies:

Recovery frequencies

The full report, in PDF format, is available from the reports section of the NetWorker Hub.

Sep 16 2012

We talk a lot about backups, but we all know that backups are done for one reason – to recover when necessary. To that end, I’d like to get an understanding of how recoveries work in the broader backup community. I’ve already got a good amount of exposure to how my customers tend to run recoveries within their environments, but I’d like to collate, then publish the data as a broader reference point.

To that end, I’d be most grateful if you could complete this short survey.

I’ll keep the survey up and active until 30 September, and publish the results during October.

NOTE: The survey form I use may ask for an email address. All responses are treated as anonymous, and your email address will not be used for any purpose if provided, but if you’d prefer not to supply an email address, feel free to leave it blank.


The survey has closed. Results will be posted soon. Thanks to everyone for their participation.

Jul 12 2012

As mentioned in my main NetWorker 8 introduction article, one of the biggest architectural enhancements in NetWorker v8 is a complete overhaul of the backup to disk architecture. This isn’t an update to AFTDs, it’s a complete restart – the architecture that was is no longer, and it’s been replaced by something newer, fresher, and more scalable.

In order to understand what’s involved in the changes, we first need to step back and consider how the AFTD architecture works in NetWorker v7.x. For the purposes of my examples, I’m going to consider a theoretical 10TB of disk backup capacity available to a NetWorker server. Now, under v7.x, you’d typically end up with filesystems and AFTDs that look like the following:

AFTD changes - v7.x AFTD

(In the above diagram, and those to follow, a red line/arrow indicates a write session coming into the AFTD, and a green line/arrow indicates a read session coming out of the AFTD.)

That is, you’d slice and dice that theoretical 10TB of disk capacity into a bunch of smaller sized filesystems, with typically one AFTD per filesystem. In 7.x, there are two nsrmmd processes per AFTD – one to handle write operations (the main/real path to the AFTD), and one to handle read operations from the AFTD device – the shadow or _AF_readonly path on the volume.

So AFTDs under this scenario delivered simultaneous backup and recovery operations by a bit of sleight-of-hand; in some ways, NetWorker was tricked into thinking it was dealing with two different volumes. In fact, that’s what led to there being two instances of a saveset in the media database for any saveset created on AFTD – one for the read/write volume, and one for the read-only volume, with a clone ID of 1 less than the instance on the read/write volume. This didn’t double the storage; the “trick” was largely maintained in the media database, with just a few meta files maintained in the _AF_readonly path of the device.

Despite the improved backup options offered by v7.x AFTDs, there were several challenges introduced that somewhat limited the applicability of AFTDs in larger backup scenarios. Much as I’ve never particularly liked virtual tape libraries (seeing them as a solution to a problem that shouldn’t exist), I found myself typically recommending a VTL for disk backup in NetWorker ahead of AFTDs. The challenges in particular, as I saw them, were:

  • Limits on concurrency between staging, cloning and recovery from AFTDs meant that businesses often struggled to clone and reclaim space non-disruptively. Despite the inherent risks, this led to many decisions not to clone data first, meaning only one copy was ever kept;
  • Because of those limits, disk backup units would need to be sliced into smaller allotments – such as the 5 x 2TB devices cited in the above diagram, so that space reclamation would be for smaller, more discrete chunks of data, but spread across more devices simultaneously;
  • A saveset could never exceed the amount of free space on an AFTD – NetWorker doesn’t support continuing a saveset from one full AFTD to another;
  • Larger savesets would be manually sliced up by the administrator to fit on an AFTD, introducing human error, or would be sent direct to tape, potentially introducing shoe-shining back into the backup configuration.

As a result of this, a lot of AFTD layouts saw them more being used as glorified staging units, rather than providing a significant amount of near-line recoverability options.

Another, more subtle problem from this architecture was that nsrmmd itself is not a process geared towards a high amount of concurrency; while based on its name (media multiplexor daemon) we know that it’s always been designed to deal with a certain number of concurrent streams for tape based multiplexing, there are limits to how many savesets an nsrmmd process can handle simultaneously before it starts to dip in efficiency. This was never so much an issue with physical tape – as most administrators would agree, using multiplexing above 4 for any physical tape will continue to work, but may result in backups which are much slower to recover from if a full filesystem/saveset recovery is required, rather than a small random chunk of data.

NetWorker administrators who have been using AFTD for a while will equally agree that pointing a large number of savesets at an AFTD doesn’t guarantee high performance – while disk doesn’t shoe-shine, the drop-off in nsrmmd efficiency per saveset would start to be quite noticeable at a saveset concurrency of around 8 if client and network performance were not the bottleneck in the environment.

This further encouraged slicing and dicing disk backup capacity – why allocate 10TB of capacity to a single AFTD if you’d get better concurrency out of 5 x 2TB AFTDs? To minimise the risk of any individual AFTD filling while others still had capacity, you’d configure the AFTDs to each have target sessions of 1 – effectively round-robining saveset starts across all the units.

I think that pretty much provides a good enough overview of AFTDs under v7.x that we can talk about AFTDs in NetWorker v8.

Keeping that 10TB of disk backup capacity in play, under NetWorker v8, you’d optimally end up with the following configuration:

AFTD changes - v8.x AFTD

You’re seeing that right – multiple nsrmmd processes for a single AFTD – and I don’t mean a read/write nsrmmd and a shadow-volume read-only nsrmmd as per v7.x. In fact, the entire concept and implementation of the shadow volume goes away under NetWorker v8. It’s not needed any longer. Huzzah!

Assuming dynamic nsrmmd spawning is enabled, and up to certain limits, NetWorker will now spawn one nsrmmd process for an AFTD each time it hits the target sessions setting for the AFTD volume. That raises one immediate change – for a v8.x AFTD configuration, bump up the target sessions for the devices; otherwise you’ll end up spawning a lot more nsrmmd processes than are appropriate. Based on feedback from EMC, it would seem that the optimum target sessions setting for a consolidated AFTD is 4. Assuming linear growth, this would mean that you’d have the following spawning rate:

  • 1 saveset, 1 x nsrmmd
  • 2 savesets, 1 x nsrmmd
  • 3 savesets, 1 x nsrmmd
  • 4 savesets, 1 x nsrmmd
  • 5 savesets, 2 x nsrmmd
  • 6 savesets, 2 x nsrmmd
  • 7 savesets, 2 x nsrmmd
  • 8 savesets, 2 x nsrmmd
  • 9 savesets, 3 x nsrmmd

Your actual rate may vary of course, depending on cloning, staging and recovery operations also being performed at the same time. Indeed, for the time being at least, NetWorker dedicates a single nsrmmd process to any nsrclone or nsrstage operation that is run (either manually or as a scheduled task), and yes, you can actually simultaneously recover, stage and clone all at the same time. NetWorker handles staging in that equation by blocking capacity reclamation while a process is reading from the AFTD – this prevents a staging operation removing a saveset that is needed for recovery or cloning. In this situation, a staging operation will report as per:

[root@tara usr]# nsrstage -b Default -v -m -S 4244365627
Obtaining media database information on server tara.pmdg.lab
80470:nsrstage: Following volumes are needed for cloning
80471:nsrstage:         AFTD-B-1 (Regular)
5874:nsrstage: Automatically copying save sets(s) to other volume(s)
Starting migration operation for Regular save sets...
6217:nsrstage: ...from storage node: tara.pmdg.lab
81542:nsrstage: Successfully cloned all requested Regular save sets (with new cloneid)
79629:nsrstage: Clones were written to the following volume(s) for Regular save sets:
6359:nsrstage: Deleting the successfully cloned save set 4244365627
Recovering space from volume 15867081 failed with the error 'volume 
mounted on ADV_FILE disk AFTD1 is reading'.
Refer to the NetWorker log for details.
89216:nsrd: volume mounted on ADV_FILE disk AFTD1 is reading

When a space reclamation is subsequently run (either as part of the overnight reclamation, or an explicit execution of nsrim -X), something along the following lines will be logged:

nsrd NSR info Index Notice: nsrim has finished crosschecking the media db 
nsrd NSR info Media Info: No space was recovered from device AFTD2 since there was no saveset eligible for deletion. 
nsrd NSR info Media Info: No space was recovered from device AFTD4 since there was no saveset eligible for deletion. 
nsrd NSR info Media Info: Deleted 87 MB from save set 4244365627 on volume AFTD-B-1 
nsrd NSR info Media Info: Recovered 87 MB by deleting 1 savesets from device AFTD1. 
nsrsnmd NSR warning volume (AFTD-B-1) size is now set to 2868 MB
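
If you wanted to tally up those "Recovered … MB" messages per device, a trivial filter over the rendered log text will do. This is purely illustrative – the regex is my own devising against the message format shown above, not an official parsing interface:

```python
import re

# Matches rendered nsrd messages like:
#   "Recovered 87 MB by deleting 1 savesets from device AFTD1."
RECOVERED = re.compile(r"Recovered (\d+) MB by deleting (\d+) savesets? from device ([\w-]+)")

def reclamation_summary(lines):
    """Sum the space recovered per device from rendered nsrim/nsrd log lines."""
    totals = {}
    for line in lines:
        m = RECOVERED.search(line)
        if m:
            mb, device = int(m.group(1)), m.group(3)
            totals[device] = totals.get(device, 0) + mb
    return totals
```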

Thus, you shouldn’t need to worry about concurrency any longer.

Another change introduced to the AFTD architecture is that the device name is now divorced from the path, which is stored in a separate field. For example:

AFTD path/device name separation

In the above example, the AFTD device has a device name of AFTD1, and the path to it is /d/backup1 – or to be more accurate, the first path to it is /d/backup1. I’ll get to the real import of that in a moment.

There is a simpler management benefit: previously, if you set up an AFTD and later needed to change the path it was mounted from, you had to do the following:

  1. Unmount the disk backup unit within NetWorker
  2. Delete the AFTD definition within NetWorker
  3. Create a new AFTD definition pointing to the new location
  4. At the OS, remount the AFTD filesystem at the new location
  5. Mount the AFTD volume from its new location in NetWorker

This was a tedious process, and the notion of “deleting” an AFTD, even though it had no bearing on the actual data stored on it, did not appeal to a lot of administrators.

However, the first path specified to an AFTD in the “Device Access Information” field refers to the mount point on the owner storage node, so under v8, all you’ve got to do in order to relocate the AFTD is:

  1. Unmount the disk backup unit within NetWorker.
  2. Remount the AFTD filesystem in its new location.
  3. Adjust the first entry in the “Device Access Information” field.
  4. Remount the AFTD volume in its new location.

This may not seem like a big change, but it’s both useful and more logical. Obviously another benefit of this is that you no longer have to remember device paths when performing manual nsrmm operations against AFTDs – you just specify the volume name. So your nsrmm commands would go from:

# nsrmm -u -f /d/backup1

to, in the above example:

# nsrmm -u -f AFTD1

The real benefit of this though is what I've been alluding to by talking about the first mount point specified. You can specify alternate mount points for the device. However, these aren't additional volumes on the server – they're mount points as seen by clients. This allows the client to bypass nsrmmd when writing to a disk backup unit, and instead write via whatever operating system mount mechanism it uses (CIFS or NFS).

In this configuration, your disk backup environment can start to look like the following:

AFTD changes - v8.x AFTD with NFS

You may be wondering what advantage this offers, given the backup is still written across the network – well, beyond an initial negotiation with nsrmmd on the storage node/server for the file to write to, the client handles the rest of the write itself, without bothering nsrmmd again. In other words, so long as the performance of the underlying filesystem can scale, the number of streams you can write to an AFTD can expand beyond the maximum number of nsrmmd processes an AFTD or storage node can handle.

At this point, the “Device Access Information” is used as a multi-line field, with each subsequent line representing an alternate path the AFTD is visible at:

AFTD multimount

So the backup process works like this: if the pool allows the specified AFTD to be used for backup, and the AFTD volume is visible on one of the specified paths to a client with the "Client direct" setting enabled (a new option in the Client resource), then the client will negotiate access to the device through a device-owner nsrmmd process, and go on to write the backup itself.

Note that this isn’t designed for sharing AFTDs between storage nodes – just between a server/storage node and clients.

Also, in case you’re skeptical, if your OS supports gathering statistics from the network-mount mechanism in use, you can fairly readily see whether NetWorker is honouring the direct access option. For instance, in the above configuration, I had a client called ‘nimrod’ mounting the disk backup unit via NFS from the backup server; before the backup started on a new, fresh mount, nfsstat showed no activity. After the backup, nfsstat yielded the following:

[root@nimrod ~]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
834        0          0
Client nfs v3:
null         getattr       setattr      lookup       access       readlink 
0         0% 7          0% 3         0% 4         0% 9         1% 0         0% 
read         write         create       mkdir        symlink      mknod 
0         0% 790       94% 1         0% 0         0% 0         0% 0         0% 
remove       rmdir         rename       link         readdir      readdirplus 
1         0% 0          0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo        pathconf     commit 
10        1% 2          0% 0         0% 6         0%

NetWorker also reports in the savegroup completion report when a client has successfully negotiated direct file access (DFA) for a backup:

* nimrod:All savefs nimrod: succeeded.
* nimrod:/boot 86705:save: Successfully established DFA session with adv_file device for save-set ID '1694225141' (nimrod:/boot).
V nimrod: /boot                    level=5,          0 KB 00:00:01      0 files
* nimrod:/boot completed savetime=1341899894
* nimrod:/ 86705:save: Successfully established DFA session with adv_file device for save-set ID '1677447928' (nimrod:/).
V nimrod: /                        level=5,         25 MB 00:01:16    327 files
* nimrod:/ completed savetime=1341899897
* nimrod:index 86705:save: Successfully established DFA session with adv_file device for save-set ID '1660670793' (nox:index:65768fd6-00000004-4fdfcfef-4fdfd013-00041a00-3d2a4f4b).
 nox: index:nimrod                 level=5,        397 KB 00:00:00     25 files
* nimrod:index completed savetime=1341903689

Finally, the server's daemon.raw also logs an indication that a DFA save has been negotiated:

91787 07/10/2012 05:00:05 PM  1 5 0 1626601168 4832 0 nox nsrmmd NSR notice Save-set ID '1694225141' 
(nimrod:/boot) is using direct file save with adv_file device 'AFTD1'.
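
If you wanted to pull every DFA save out of a day's worth of those messages, a short scan over the rendered log text works. Again, this is my own illustrative sketch against the message wording shown above (note daemon.raw itself isn't plain text – render it first, e.g. with nsr_render_log):

```python
import re

# Matches rendered notices like:
#   "Save-set ID '1694225141' (nimrod:/boot) is using direct file save ..."
# \s* tolerates the message being wrapped across lines, as in the log above.
DFA_NOTICE = re.compile(
    r"Save-set ID '(\d+)'\s*\(([^)]+)\)\s*is using direct file save"
)

def find_dfa_savesets(text):
    """Return (ssid, saveset) pairs for save sets that negotiated DFA."""
    return DFA_NOTICE.findall(text)
```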

The net result of all these architectural changes to AFTDs is a significant improvement over v7.x AFTD handling, performance and efficiency.

(As noted previously though, this doesn’t apply just to AFTDs, for what it’s worth – that client direct functionality also applies to Data Domain Boost devices, which allows a Data Domain system to integrate even more closely into a NetWorker environment. Scaling, scaling, scaling: it’s all about scaling.)

In order to get the full benefit, I believe sites currently using AFTDs will probably go through the most pain; those who have been using VTLs may have a cost involved in the transition, but they'll be able to move to an optimal architecture almost immediately. Sites with multiple smaller AFTDs, however, won't see the full benefits of the new architecture until they redesign their backup-to-disk environment, increasing the size of backup volumes. That being said, the change pain will be worth it for the enhancements received.

Jun 272012

[Edit- 2015: I must get around to writing a refutation to the article below. Keeping it for historical purposes, but I’d now argue I was approaching the problem from a reasonably flawed perspective.]

Don’t get me wrong – I’m quite the fan of deduplication, and not just because it’s really interesting technology. It has potential to allow a lot more backup data to be kept online for much longer periods of time.

Having more backups immediately available for recovery is undoubtedly great.

I wrote previously about 7 problems with deduplication, but they're management problems, not functional problems. Yet there's one core problem with deduplication: it's a backup solution.

Deduplication is about backup.

It’s not about recovery.

Target deduplication? If it’s inline, like with Data Domain products, it’s stellar. Source deduplication? It massively reduces the amount of data you have to stream across your network.

When it comes to recovery though, deduplication isn’t a shining knight. That data has to be rehydrated, and unless you’re doing something really intelligent in terms of matching non-corrupt blocks, or maintaining massive deduplication caches on a client, you’re going to be rehydrating at the target rest point and streaming the full data back across the network.

That 1TB database at a remote site you've been backing up over an ADSL link after initial seeding, thanks to source based deduplication? How long can you afford the recovery to take if the data has to stream back across that ADSL link?
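
To put rough numbers on it – the link speed and the zero-overhead assumption below are mine, purely for illustration:

```python
def restore_days(data_tb, link_mbit_per_s):
    """Best-case days to stream a full restore across a WAN link,
    ignoring protocol overhead and rehydration time at the target."""
    bits = data_tb * 1e12 * 8              # decimal TB -> bits
    seconds = bits / (link_mbit_per_s * 1e6)
    return seconds / 86400

# 1 TB back over a 20 Mbit/s link: about 4.6 days, and that's the best case.
```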

I’m not saying to avoid using deduplication. I think it’s likely to become a standard feature of backup solutions within 5 years. By itself though, it’s unlikely to speed up your recoveries. In short: if you’re deploying a data deduplication solution, after you’ve done all your sizing tests, sit down and map out what systems may present challenges during recovery from deduplicated systems (hint: it’s almost always going to be the remote ones), and make sure you have a strategy for them. Always have a strategy.

Always have a recovery strategy. After all, if you don’t, you don’t have a backup system. You’ve just got a bunch of backups.

[Edit- 2015: I must get around to writing a refutation to the article above. Keeping it for historical purposes, but I’d now argue I was approaching the problem from a reasonably flawed perspective.]


PS: Thanks to Siobhán for prodding me on this topic.