Sep 25 2010
 

Introduction

I was lucky enough to get to work on the beta testing programme for NetWorker 7.6 SP1. While there are a bunch of new features in NW 7.6 SP1 (with the one most discussed by EMC being the Data Domain integration), I want to talk about three new features that I think are quite important, long term, in core functionality within the product for the average Joe.

These are:

  • Scheduled Cloning
  • AFTD enhancements
  • Checkpoint restarts

Each of these on their own represents significant benefit to the average NetWorker site, and I’d like to spend some time discussing the functionality they bring.

[Edit/Aside: You can find the documentation for NetWorker 7.6 SP1 available in the updated Documentation area on nsrd.info]

Scheduled Cloning

In some senses, cloning has been the bane of the NetWorker administrator’s life. Up until NW 7.6 SP1, NetWorker has had two forms of cloning:

  • Per-group cloning, immediately following completion of backups;
  • Scripted cloning.

A lot of sites use scripted cloning, simply due to the device/media contention frequently caused by per-group cloning. I know this well; since I started working with NetWorker in 1996 I’ve written countless NetWorker cloning scripts, and I’m currently the primary programmer for IDATA Tools, which includes what I can only describe as a cloning utility on steroids (‘sslocate’).
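[Aside: for anyone who hasn’t scripted cloning before, the core of such a script is usually nothing more than an mminfo query feeding nsrclone. A minimal sketch follows – the group and pool names here are placeholders, and a real script needs error handling, logging and locking wrapped around it:

#!/bin/sh
# Minimal nightly clone sketch: find yesterday's savesets for the "Daily"
# group and clone them to the "Offsite Clone" pool.
# (sed strips the header line mminfo prints.)
SSIDS=`mminfo -q "group=Daily,savetime>=yesterday" -r ssid | sed 1d | sort -u`
for ssid in $SSIDS
do
	nsrclone -b "Offsite Clone" -S $ssid
done

Add media contention handling, reporting and retry logic and you quickly see how these scripts balloon into full-blown utilities.]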

Will those scripts and utilities go away with scheduled cloning? Well, I don’t think they’ll go away entirely – but I do think they’ll be treated more as utilities than as core code at the average site, since scheduled cloning will be able to meet most companies’ cloning requirements.

I had heard that scheduled cloning was on the way long before the 7.6 SP1 beta, thanks mainly to one day getting a cryptic email along the lines of “if we were to do scheduled cloning, what would you like to see in it…” – so it was pleasing, when it arrived, to see that much of my wish list had made it in there. As a first-round implementation of the process, it’s fabulous.

So, let’s look at how we configure scheduled clones. First off, in NMC, you’ll notice a new menu item in the configuration section:

Scheduled Cloning Resource, in menu

This will undoubtedly bring joy to the heart of many a NetWorker administrator. If we then choose to create a new scheduled clone resource, we can create a highly refined schedule:

Scheduled Clone Resource, 1 of 2

Let’s look at those options first before moving onto the second tab:

  • Name and comment are pretty self-explanatory – nothing to see there.
  • Browse and retention – define, for the clone schedule, both the browse and retention time of the savesets that will be cloned.
  • Start time – Specify exactly what time the cloning is to start.
  • Schedule period – Weekly allows you to specify which days of the week the cloning is to run. Monthly allows you to specify which dates of the month the cloning will run.
  • Storage node – Allows you to specify which storage node the clone will write to. Great for situations where you have, say, an off-site storage node and you want the data streamed directly across to it.
  • Clone pool – Which pool you want to write the clones to – fairly expected.
  • Continue on save set error – This is a big help. Standard cloning scripts will fail if one of the savesets selected for cloning has an error (whether that’s a read error, the saveset disappearing – e.g., being staged off – before it gets cloned, and so on) and you haven’t used the ‘-F’ option. Click this check box and the cloning will at least continue and finish all the savesets it can hit in one session.
  • Limit number of save set clones – By default this is 1, meaning NetWorker won’t create more than one copy of the saveset in the target pool. This can be increased to a higher number if you want multiple clones, or it can be set to zero (for unlimited), which I wouldn’t see many sites having a need for.

Once you’ve got the basics of how often and when the scheduled clone runs, etc., you can move on to selecting what you want cloned:

Scheduled Clone Resource, 2 of 2

I’ve only just refreshed my lab server, so a bit of imagination is required with the above screen shot to flesh out what this may look like at a normal site. But you can choose savesets to clone based on:

  • Group
  • Client
  • Source Pool
  • Level
  • Name

or

  • Specific saveset ID/clone IDs

When specifying savesets based on group/client/level/etc., you can also specify how far back NetWorker is to look for savesets to clone. This avoids a situation whereby you might, say, enable scheduled cloning and suddenly have media from 3 years ago requested.

You might wonder about the practicality of being able to schedule a clone for specific SSID/CloneID combos. I can imagine this being particularly useful if you need to do ad-hoc cloning of a particularly large saveset. E.g., if you’ve got a saveset that’s, say, 10TB, you might want to configure a schedule that starts cloning it specifically at 1am Saturday morning, with the intent of deleting the scheduled clone once it’s finished. In other words, it replaces having to set up a cron or at job just for a single clone.
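[Aside: if you do go down that path, the SSID/CloneID pairs to plug into the schedule are easily pulled from mminfo with a query along these lines – the client and saveset names here are purely illustrative:

# mminfo -avot -q "client=bigfileserver,name=/research" -r "ssid,cloneid,totalsize,savetime,volume"

Pick the instance you want copied and drop its SSID/CloneID into the resource.]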

Once configured, and enabled, scheduled cloning runs like a dream. In fact, it was one of the first things I tackled while doing beta testing, and almost every subsequent day found myself thinking at 3pm “why is my lab server suddenly cloning? – ah yes, that’s why…”

AFTD Enhancements

There’s not a huge amount to cover in terms of AFTD enhancements – they’re effectively the same enhancements that were rolled into NetWorker 7.5 SP3, which I’ve previously covered here. So, that means there’s a better volume selection algorithm for AFTD backups, but we don’t yet have backups being able to continue from one AFTD device to another. (That’s still in the pipeline and being actively worked on, so it will come.)

Even this one key change – the way in which volumes are picked in AFTDs for new backups – will be a big boon for a lot of NetWorker sites. It will allow administrators to focus less on the “AFTD data shuffle”, as I like to call it, and more on higher level administration of the backup environment.

(These changes are effectively “under the hood”, so there’s not much I can show in the way of screen-shots.)

Checkpoint Restarting

When I first learned NetBackup, I immediately saw the usefulness of checkpoint restarting, and have been eager to see it appear in NetWorker since that point. I’m glad to say it’s appeared in (what I consider to be) a much more usable form. So what is checkpoint restarting? If you’re not familiar with the term, it’s where the backup product has regular points at which it can restart from, rather than having to restart an entire backup. Previously NetWorker has only done this at the saveset level, but that’s not really what the average administrator would think of when ‘checkpoints’ are discussed. NetBackup, last I looked at it, does this at periodic intervals – e.g., every 15 minutes or so.

Like almost everything in NetWorker, we get more than one way to run a checkpoint:

Checkpoint restart options

Within any single client instance you can choose to enable checkpoint restarting, with the restart options being:

  • Directory – If a backup failure occurs, restart from the last directory that NetWorker had started processing.
  • File – If a backup failure occurs, restart from the last file NetWorker had started processing.

Now, the documentation warns that with checkpoint enabled, you’ll get a slight performance hit on the backup process. However, that performance hit is nothing compared to the performance and potentially media hit you’d take if you’re 9.8TB through a 10TB saveset and the client is accidentally rebooted!

Furthermore, in my testing (which admittedly focused on savesets smaller than 10GB), I consistently found that with either file or directory level checkpointing enabled, the backup actually ran faster than the normal backup. So maybe it depends on the hardware you’re running on, or maybe that performance hit doesn’t come in until you’re backing up millions of files; either way, I’m not prepared to say it’s going to be a huge performance hit for anyone yet.

Note what I said earlier though – this is configured per client instance. That lets you tailor checkpoint restarting to the data structure, even on an individual client instance level (a configuration sketch follows the list below). For example, let’s consider a fileserver that offers both departmental/home directories and research areas:

  • The departmental/home directories will have thousands and thousands of directories – have a client instance for this area, set for directory level checkpointing.
  • The research area might feature files that are hundreds of gigabytes each – use file level checkpointing here.
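While NMC is the obvious place to set this, it should be just as doable from nsradmin against the two client instances. The attribute names below (‘checkpoint enabled’ and ‘checkpoint granularity’) are as I recall them from the beta, and the server, client and saveset names are placeholders, so treat this as a sketch rather than gospel:

# nsradmin -s backupserver
nsradmin> . type: NSR client; name: fileserver; save set: /home
nsradmin> update checkpoint enabled: Yes; checkpoint granularity: directory
nsradmin> . type: NSR client; name: fileserver; save set: /research
nsradmin> update checkpoint enabled: Yes; checkpoint granularity: file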

When I’d previously done a blog entry wishing for checkpoint restarts (“Sub saveset checkpointing would be good”), I’d envisaged the checkpointing being done via continuation savesets – e.g., “C:”, “<1>C:”, “<2>C:”, etc. It hasn’t been implemented this way; instead, each time the saveset is restarted, a new saveset of the same level is generated, covering whatever was backed up in that pass. On reflection, I’m not the slightest bit hung up over how it’s been implemented – I’m just overjoyed to see that it has been implemented.

Now you’re probably wondering – does the checkpointing work? Does it create any headaches for recoveries? Yes, and no. As in, yes it works, and no, throughout all my testing, I wasn’t able to create any headaches in the recovery process. I would feel very safe about having checkpoint restarts enabled on production filesystems right now.

Bonus: Mac OS X 10.6 (“Snow Leopard”) Support

For some time, NetWorker has had issues supporting Mac OS X 10.6, and that has certainly caused problems for various customers as the operating system continues to gain market share in the enterprise. I was particularly pleased during a beta refresh to see an updated Mac OS X client for NetWorker. It works excellently for backup, recovery, installation, uninstallation, etc. On the basis of the testing I did, I’d suggest that any site with OS X should immediately upgrade those clients to at least 7.6 SP1.

In Summary

The one glaring question for me, looking at NetWorker 7.6 SP1, is the obvious one: with so many updates and so many new features – way more than we’d see in a normal service pack – why the heck isn’t it NetWorker v8?

First thoughts – NetWorker 7.5 SP3 AFTD Enhancements

Aug 23 2010
 

On the weekend it came out, being your typical backup geek, I downloaded NetWorker 7.5 SP3, installed a new virtual machine, and started kicking the tyres on the new release. My primary focus was on the AFTD changes. These, as you may recall, comprise some core changes that EMC had previously touted for 7.6 SP1.

I’m quite glad these were pushed down into 7.5 SP3.

The change that was introduced was to give NetWorker a new volume selection algorithm for AFTD devices. Up until 7.5 SP3, NetWorker’s volume selection logic for AFTDs has been:

  • If a new backup starts and a device is currently writing (or “writing, idle”):
    • If that device’s target sessions have not been met, write to that device.
    • If that device’s target sessions has been met, move onto the device that has mounted the least recently labelled volume and start writing there.
  • If a new backup starts and no device is currently writing:
    • Pick the device whose volume was least recently labelled, and start writing to that.

The net result of this of course is what I like to call the AFTD-data-shuffle. Administrators invariably found that certain disk backup devices within their environment would fill sooner than others, and as a result of this, they’d either continually have to stage from those frequently hit AFTDs to other AFTDs, or stage that data out to tape.

Like the changes announced for 7.6 SP1, 7.5 SP3 now applies a more intelligent volume selection algorithm, instead picking the volume that has the least data written to it.

This is, of course, a VGT enhancement – or without the TLA, it’s a Very Good Thing.
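If you want to watch the new behaviour in action, the easiest way to eyeball how evenly your disk backup volumes are filling is the standard media summary, e.g.:

# mminfo -m -q "family=disk"

which reports the amount written and percentage used for each ADV_FILE volume; under 7.5 SP3 new backups should gravitate to whichever volume is least full.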

There was one minor catch, which prevented me from doing this blog piece until now. As a result of this change, there was an unanticipated reversal in the volume selection criteria for physical tape – i.e., when there’s no appendable tape with data on it, NetWorker will pick the most recently labelled tape to write to, rather than the least recently labelled one.

The good news though is that this is fixed in NetWorker 7.5 SP3 cumulative release 1 (7.5.3.1), which was released last week. So, if you’re after an enhanced volume selection algorithm for AFTD that doesn’t impact physical volume selection, you’ll want to check out 7.5.3.1.

May 04 2010
 

There is a bug with the way NetWorker 7.5.2 handles ADV_FILE devices in relation to disk evacuation. I.e., in a situation where you use NetWorker 7.5.2 to completely stage all savesets from an ADV_FILE device, the subsequent behaviour of NetWorker is contrary to normal operations.

If, following the disk evacuation, either the standard overnight volume/saveset recycling checks run or an nsrim -X is explicitly called before any new savesets are written to the ADV_FILE device, NetWorker will flag the depopulated volume as recyclable. The net result is that NetWorker will not permit new savesets to be written to the volume until it is relabelled, or flagged as not recyclable.
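(If you do get bitten by this before you can upgrade, the manual workaround is simply to clear the flag again – e.g., for a hypothetical volume name:

# nsrmm -o notrecyclable BigDBU.001

That makes the volume writeable again without relabelling it, though presumably nsrim could flag it again if it runs before any new savesets land on the volume.)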

When a colleague asked me to investigate this for a customer, I honestly thought it had to be some mistake, but I ran up the tests and dutifully confirmed that NetWorker under v7.5.2 was indeed doing it. However, it just didn’t seem right in comparison to previous known NetWorker behaviour, so I stepped my lab server back to 7.4.5, and NetWorker didn’t mangle the volume after it was evacuated. I then stepped up to 7.5.1, and again, NetWorker didn’t mangle the volume after it was evacuated.

This led me to review the cumulative patch cluster notes for 7.5.2.1 – while a more recent version has been released, I didn’t have it handy at the time. Nothing in the notes seemed to relate to this issue, but since I’d got the test process down to a <15 minute activity, I replaced the default 7.5.2 install with 7.5.2.1 and re-ran the tests.

Under 7.5.2.1, NetWorker behaved exactly as expected; no matter how many times “nsrim -X” was run after evacuating a disk backup unit volume, NetWorker did not mark the volume in question as recyclable.

My only surmise therefore is that one of the documented fixes in the 7.5.2.1 cumulative build, while not explicitly referring to the issue at hand, happened to resolve it as a side-effect.

To cut a long story short though, I would advise that if you’re backing up to ADV_FILE devices using NetWorker 7.5.2 that you strongly consider moving to 7.5.2 cumulative patch cluster 1 – i.e., 7.5.2.1.

Apr 26 2010
 

As I mentioned in an earlier post, EMC have announced on their community forum that there are some major changes on the way for ADV_FILE devices. In this post, I want to outline in a little more detail why these changes are important.

Volume selection criteria

One of the easiest changes to describe is the new volume selection criteria that will be applied. Currently, regardless of whether it is backing up to tape, virtual tape, or ADV_FILE disk devices, NetWorker uses the same volume selection algorithm – whenever there are multiple volumes that could be chosen, it always picks volumes to write to in order of labelled date, from oldest to most recent. For tapes (and even virtual tapes), this selection logic makes perfect sense. For disk backup units though, it’s seen administrators constantly “fighting” NetWorker to reclaim space from disk backup volumes in that same labelling order.

If we look at say, four disk backup units, with the used capacity shown in red, this means that NetWorker currently writes to volumes in the following order:

Current volume selection criteria

So it doesn’t matter that the first volume picked also has the highest used capacity – in actual fact, the entire selection approach is geared around trying to fill volumes in sequence. Again, that works wonderfully for tapes, but it’s terrible when it comes to ADV_FILE devices.

The new selection criteria for ADV_FILE devices, according to EMC, are going to look like the following:

Improved volume selection criteria

So, recognising that it’s sub-optimal to fill disk backup units, NetWorker will instead write to volumes in order of least used capacity. This change alone will remove a lot of the day to day management headaches of ADV_FILE devices from backup administrators.

Dealing with full volumes

The next major change coming is dealing with full volumes – or alternatively, you may wish to think of it as dealing with savesets whose size exceeds that of the available space on a disk backup unit.

Currently, if a disk backup unit fills during the backup process, whatever saveset is being written to that unit just stays right there, hung, waiting for NetWorker staging to kick in and free space before it will continue writing. This resembles the following:

Dealing with full volumes

As every NetWorker administrator who has worked with ADV_FILE devices will tell you, the above process is extremely irritating as well as extremely disruptive. Further, it only works in situations where you’re not writing one huge saveset that literally exceeds the entire formatted capacity of your disk backup unit. In short, if you’ve previously wanted to back up a 6TB saveset, you’ve had to have disk backup units that were more than 6TB in size, even if you would naturally prefer to have a larger number of 2TB disk backup units. (In fact, the general practice when backing up to ADV_FILE devices has been to ensure that every volume can fit at least two of your largest savesets on it, plus another 10%, if you’re using the devices for anything other than just intermediate staging.)

Thankfully the coming change will see what we’ve been wanting in ADV_FILE devices for a long time – the ability for a saveset to just span from one volume it has filled across to another. This means you’ll get backups like:

Disk backup unit spanning

This will avoid situations where the backup process is effectively halted for the duration of staging operations, and it will allow for disk backup units that are smaller than the size of the largest savesets to be backed up. This in turn will allow backup administrators to very easily schedule in disk defragmentation (or reformatting) operations on those filesystems that suffer performance degradation over time from the mass write/read/delete operations seen by ADV_FILE devices.

Other changes

The other key changes outlined by EMC on the community forum are:

  • Change of target sessions:
    • Disk backup units currently have a default target parallelism of 4, and a maximum target parallelism setting of 512. These will be reduced to 1 and 32 respectively (and of course can be changed by the administrator as required), so as to better enforce round-robining of capacity usage across all disk backup units. This is something most administrators end up setting manually anyway, but it’s a welcome change for new installs (a sketch of making the change by hand follows this list).
  • Full thresholds:
    • The ability to define a %full threshold at which point NetWorker will cease writing to one disk backup unit and start writing to another. Some question whether this is useful, but I can see at least a couple of usage scenarios: first, as a way of allowing different pools to share the same filesystem, making better use of capacity; and secondly, situations where a disk backup unit can’t be a dedicated filesystem.
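For existing installs, of course, nothing stops you applying the same session settings yourself today. The device attributes involved are ‘target sessions’ and ‘max sessions’, so the change by hand would look something like the following (the server and device names are placeholders):

# nsradmin -s backupserver
nsradmin> . type: NSR device; name: /d/backup01
nsradmin> update target sessions: 1; max sessions: 32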

When we add all these changes up, ADV_FILE type devices are going to be back in a position where they’ll give VTLs a run for their money on cost vs features. (With the possible exception being the relative ease of device sharing under VTLs compared to the very manual process of SAN/NAS sharing of ADV_FILE devices.)

Apr 20 2010
 

I had been aware for a while from an NDA conversation that these changes were on the way, but of course have not been able to discuss them.

However, with EMC opening up discussion on the EMC Community Forum – i.e., out in public – I now feel that I can at least discuss how excited I am about the coming ADV_FILE changes.

For some time I’ve railed against architectural failings in ADV_FILE devices, and explained why those failings have led me to advocate the use of VTLs over ADV_FILE devices. As announced on this thread in the forums by Paul Meighan, many of those architectural limitations are soon going to be relegated to the software evolutionary junkpile. In particular, EMC have stated in the forum article that the following changes are on the way:

  1. Volume selection criteria becomes intelligent. NetWorker currently uses the same volume selection criteria for disk backup as it does for tapes. This means that the oldest labelled volume with free space on it always gets picked first, and subsequent volumes get picked following this strategy. This has meant that backup administrators have continually fought a running battle to keep the original disk backup units staged more regularly than others. Instead, NetWorker will now pick ADV_FILE volumes in order of maximum capacity free, which will free a lot of backup administrators from the overall pain of day to day capacity management.
  2. Savesets can span advanced file type devices. Finally, the gloves are off! With the ability to have savesets cease writing to one disk backup unit and move over to another, NetWorker ADV_FILE devices will be able to serve as a scalable and transparent storage pool; backups will flow from one device to another in exactly the way they always should have.
  3. Session changes. To reflect round-robining best practices, the default target sessions for disk backup units will drop from 4 to 1.

When we add together the first two changes, we get powerful enhancements in NetWorker’s disk backup functionality. Do other products already do this? Yes – I’m not suggesting that NetWorker is the first to do it – but it’s fantastic to finally see this functionality coming into play.

Until this point, disk backup in NetWorker has suffered from two continual challenges: constant administrative overheads, and having to plan the best possible space allocation for disk backup filesystems in advance. Once these changes come into play: no more challenge on either of these fronts.

Folks, this is big. Yes, these changes should have come a long time ago, but I’m not going to let the delay get in the way of being damn grateful that they’re finally coming.

Apr 14 2010
 

In the previous article, I covered the first five of ten reasons why tape is still important. Now, let’s consider the other five reasons.

6. Tape is greener for storage

Offline storage of tape is cheap, from an environmental perspective. Depending on your locality, you may not even have to keep the storage area air-conditioned.

Disk arrays and replicated backup server clusters don’t really have the notion of offline options. Even if they’re using MAID, the power consumption for the pseudo-offline part of the storage will be higher than that for unpowered, inactive tape.

7. Replicated tape is cheaper than replicated disk

And by “replicated tape” I mean cloning. Having clones of your tapes is a cheaper option than running a system with full replication. Full replication requires similar hardware configurations on both sides of the replica; cloning a tape requires – another tape. That’s a lot cheaper, before you even look at any link costs.

8. Done right, tapes are the best form of thin provisioning you’ll get

Thin provisioning is big, since it’s an inherent part of the “cloud” meme at the moment. Time your purchases correctly and tape will be the best form of thin provisioning within your enterprise environment.

9. Tape is more fault tolerant than an array

Oh, I know you’ve got the chuckles now, and you think I’ve gone nuts. Arrays are highly fault tolerant – looking at RAID alone, if your disk backup environment is a suite of RAID-6 LUNs, then for each LUN you can withstand two disk failures. But let’s look at longer term backups – those files that you’ve backed up multiple times. Some would argue that these shouldn’t be backed up multiple times, but that’s an argument that doesn’t translate well down into the smaller enterprises and corporates. Sure, big and rich companies can afford deduplicated archiving solutions, but smaller companies have to make do with the traditional weekly fulls kept for 5 or 6 weeks and monthly fulls kept for anywhere between 1 and 10 years – and as a result they have the luxury of a potentially large number of copies of any individual file. The net result? Perhaps as much as 50% of longer term recoveries will be extremely fault tolerant – if the March tape fails, go back to the February tape, or the January tape, or the December tape, etc. This isn’t something you really want to rely on, but it’s always worth keeping in mind regardless.

10. Tape is ideally suited for lesser RTO/RPOs

Sure if you have RTOs and RPOs that demand near instant recovery with minimum data loss, you’re going to need disk. But when we look at the cheapness of tape, and practically all of the other items we’ve discussed, the cost of deploying a disk backup system to meet non-urgent RPOs and RTOs seems at best a case of severe overkill.

Apr 12 2010
 

Various companies will spin you their “tape is dead” story, and I’m the first to admit that the use pattern for tapes is evolving, but to anyone who claims that tape has lost its relevance, I’ll argue otherwise.

This is part 1 of a 2 part article, and we’ll cover reasons 1 through 5 here.

1. Tape is cheap

Comparatively tape is still significantly cheaper than disk. In AUD, from end-resellers you can buy individual LTO-4 cartridges (800GB native) for $50. Even at a discount price, in Australia you’ll still pay around $90 to $110 for a 1TB drive (the closest comparison).

2. Tape is offline

If your backup server is using traditional backup to disk and is infected by a destructive virus or trojan, you can lose days, weeks, months or perhaps even years of backups.

No software, no matter how destructive (unless we’re talking Skynet levels of destruction) is going to be able to reach out from your infected computers and destroy media that’s sitting outside of your tape libraries. It’s just not going to happen. There’s a tonne of more likely scenarios that you’d need to worry about first before getting down to this scenario.

3. You can run a tape out of a burning building

Say you’ve bought the “tape is dead” argument, and all your backups are either on a VTL, a standard array for disk backup, or some multi-cluster centralised storage system (e.g., a RAIN as per Avamar). But you’re comparatively a small site, and so you have to buy the replication system in a future budget.

Then your datacentre catches on fire. Good luck with grabbing your array or cluster of backup servers and running out the building with them. On the other hand, that nearby company that also caught fire but stuck with tape had their administrator snatch last night’s backup tape out of the library and run out of their building.

Sure, the follow-up response is that you should have replicated VTLs or replicated arrays or replicated dedupe clusters, etc., but it’s not uncommon to see smaller sites buy into the “tape is dead” solution and not do any replication – planning to get budget for it in, say, the second year of deployment, or when that colocation datacentre goes ahead in the “sometime later” timeframe.

4. Tapes have better offline bandwidth

Need to get a considerable amount of data (e.g., hundreds of terabytes, maybe even petabytes) from one location to another? Unless you can stream data across your links at hundreds of megabytes per second (still a while away for any reasonable corporate entity), you’re going to have better luck sending your data via tapes rather than disks. Tape is lighter and more compact than disks, let alone disk arrays, so your capacity per cubic metre is going to be considerably higher with tape than it is with disk.

Think I’m joking? Let’s look at the numbers. Say you’ve got a cubic metre of shipping space available, let’s see which option – tape or disk – gives you the most capacity.

An LTO cartridge is 10.2cm x 10.54cm x 2.15cm. That means in 100cm x 100cm x 100cm, you can fit 9 x 9 x 46 cartridges, which comes to a grand total of 3,726 units of media. Using LTO-5 for our calculations, that’s a native capacity of 5,589 TB per cubic metre. Of course, that’s without squeezing additional media in the remaining space, but I’ll leave that up to people with more math skills than I.

A typical 3.5″ internal form-factor drive (using the 1.5TB Seagate Barracuda drive for comparison) is 10.2cm x 14.7cm x 2.6cm. In a cubic metre, you’ll fit 9 x 6 x 38 disk drives, or 2,052 drives. Using 2TB drives (currently the highest capacity), you’ll get 4,104 TB per cubic metre.

So on the TB per cubic metre front, tape wins by almost 1,500 TB.

Looking at weight – we start to see some big differences here too. The average LTO cartridge (using LTO-4 from IBM as our “average”) is 200 grams, so a cubic metre of them will be 745.2 KG. The Seagate Barracuda I quoted before though weighs in at 920 grams – so for a cubic metre of disk drive capacity, you’re looking at around 1,887.8 KG. There’s a tonne of difference there!

Tape wins on that sort of high capacity offline data transfer without a doubt.

5. Storage capacity of a tape system is not limited by physical datacentre footprint

If you’ve got a disk array, there’s an absolute limit to how much data you can store in it that (as much as anything) is determined by its physical footprint. If you fill it and need to add more storage, you need to expand its footprint.

Tape? Remove some cartridges, put some more in. Your offline physical footprint will grow of course – but if we’re talking datacentres, we’re talking a real, tangible cost per cubic metre of space. Your tape library of course will occupy a certain amount of space, but its storage capabilities are practically limitless regardless of its size, since all you have to do is pull full media out and replace it with empty media. Offline storage space will usually cost up to an order of magnitude less than datacentre space, so disk arrays just can’t keep up on this front.

Reasons 6 through 10 will be published soon.

Mar 03 2010
 

While I touched on this in the second blog posting I made (Instantiating Savesets), it’s worthwhile revisiting this topic more directly.

Using ADV_FILE devices can play havoc with conventional tape rotation strategies; if you aren’t aware of these implications, it could cause operational challenges when it comes time to do recovery from tape. Let’s look at the lifecycle of a saveset in a disk backup environment where a conventional setup is used. It typically runs like this:

  1. Backup to disk
  2. Clone to tape
  3. (Later) Stage to tape
  4. (At rest) 2 copies on tape
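(For reference, steps 2 and 3 can be done by hand as well as via group cloning and staging policies – manually they boil down to something like the following, with the pool names and saveset IDs purely illustrative:

# nsrclone -b "Offsite Clone" -S ssid
# nsrstage -b "Staging Pool" -m -S ssid/cloneid

The interesting part, though, is what happens to the saveset instances and their CloneIDs along the way.)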

Looking at each stage of this, we have:

Saveset on ADV_FILE device

The saveset, once written to an ADV_FILE volume, has two instances. The instance recorded as being on the read-only part of the volume will have an SSID/CloneID of X/Y. The instance recorded as being on the read-write part of the volume will have an SSID/CloneID of X/Y+1. This higher CloneID on the read-write instance is what causes NetWorker, upon a recovery request, to seek the “instance” on the read-only volume. Of course, there’s only one actual instance (hence why I object so strongly to the ‘validcopies’ field introduced in 7.6 reporting 2) – the two instances reported are “smoke and mirrors” to allow simultaneous backup to and recovery from an ADV_FILE volume.

The next stage sees the saveset cloned:

ADV_FILE + Tape Clone

This leaves us with 3 ‘instances’ – 2 physical, one virtual. Our SSID/CloneIDs are:

  • ADV_FILE read-only: X/Y
  • ADV_FILE read-write: X/Y+1
  • Tape: X/Y+n, where n > 1.
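(You can see the instances for yourself at any point with a simple query – the SSID below is made up, of course:

# mminfo -q "ssid=4294967295" -r "ssid,cloneid,volume,sumflags"

which lists one line per instance, including the “phantom” read-only instance on the ADV_FILE volume.)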

At this point, any recovery request will still call for the instance on the read-only part of the ADV_FILE volume, so as to help ensure the fastest recovery initiation.

At some future point, as disk capacity starts to run out on the ADV_FILE device, the saveset will typically be staged out:

ADV_FILE staging to tape

At the conclusion of the staging operation, the physical + virtual instances of the saveset on the ADV_FILE device are removed, leaving us with:

Savesets on tape only

So, at this point, we end up with:

  • A saveset instance on a clone volume with SSID/CloneID of: X/Y+n.
  • A saveset instance on (typically) a non-clone volume with SSID/CloneID of: X/Y+n+m, where m > 0.

So, where does this leave us? (Or if you’re not sure where I’ve been heading yet, you may be wondering what point I’m actually trying to make.)

Note what I’ve been saying each time – NetWorker, when it needs to read from a saveset for recovery purposes, will want to pick the saveset instance with the lowest CloneID. At the point where we’ve got a clone copy and a staged copy, both on tape, the clone copy will have the lowest CloneID.

The net result is that NetWorker will, in these circumstances, when neither tape is online, request the clone volume for recovery – even though in a great many cases, this will be the volume that’s offsite.

For NetWorker versions 7.3.1 and lower, there was only one solution to this – you had to hunt down the actual clone saveset instances NetWorker was asking for, mark them as suspect, and reattempt the recovery. If you managed to mark them all as suspect, then you’d be able to ‘force’ NetWorker into facilitating the recovery from the volume(s) that had been staged to. However, after the recovery you had to make sure you backed out of those changes, so that both the clones and the staged copies would be considered not-suspect.
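In practice, that meant commands along these lines for each clone instance (the SSID/CloneID here is hypothetical):

# nsrmm -o suspect -S 4294967295/1273728194

and then, once the recovery was done, backing the change out with:

# nsrmm -o notsuspect -S 4294967295/1273728194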

Some companies, in this situation, would instigate a tape rotation policy such that clone volumes would be brought back from off-site before savesets were likely to be staged out, with the subsequently staged media sent offsite instead. This has the dangerous side-effect of temporarily leaving all copies of backups on-site, jeopardising disaster recovery situations, and hence it’s something that I couldn’t in any way recommend.

The solution introduced around 7.3.2 however is far simpler – a media database flag called offsite. This isn’t to be confused with the convention of setting a volume’s location field to ‘offsite’ when the media is removed from site. Annoyingly, this flag remains unqueryable; you can set it, and NetWorker will use it, but you can’t, say, search for volumes with the ‘offsite’ flag set.

The offsite flag has to be manually set, using the command:

# nsrmm -o offsite volumeName

(where volumeName typically equals the barcode).

Once this is set, then NetWorker’s standard saveset (and therefore volume) selection behaviour is subtly adjusted. Normally if there are no online instances of a saveset, NetWorker will request the instance with the lowest CloneID. However, saveset instances that are on volumes with the offsite flag set will be deemed ineligible and NetWorker will look for a saveset instance that isn’t flagged as being offsite.

The net result is that when following a traditional backup model with ADV_FILE disk backup (backup to disk, clone to tape, stage to tape), it’s very important that tape offsiting procedures be adjusted to set the offsite flag on clone volumes as they’re removed from the system.
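A trivial wrapper makes that easy to fold into the media handling procedure – e.g., something as simple as the following (the script name is made up), run with the list of volume names or barcodes being sent off-site:

#!/bin/sh
# flag-offsite.sh - mark each named volume as offsite so NetWorker won't
# request it for recoveries while it's out of the library.
for vol in "$@"
do
	nsrmm -o offsite "$vol"
done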

The good news is that you don’t normally have to do anything when it’s time to pull the tape back onsite. The flag is automatically cleared* for a volume as soon as it’s put back into an autochanger and detected by NetWorker. So when the media is recycled, the flag will be cleared.

If you come from a long-term NetWorker site and the convention is still to mark savesets as suspect in this sort of recovery scenario, I’d suggest that you update your tape rotation policies to instead use the offsite flag. If, on the other hand, you’re about to implement an ADV_FILE based backup to disk policy, I’d strongly recommend you plan in advance to configure a tape rotation policy that uses the offsite flag as cloned media is sent away from the primary site.


* If you did need to explicitly clear the flag, you can run:

# nsrmm -o notoffsite volumeName

which would turn the flag back off for the given volumeName.

Nov 30 2009
     

With their recent acquisition of Data Domain, some people at EMC have become table thumping experts overnight on why it’s absolutely imperative that you back up to Data Domain boxes as disk backup over NAS, rather than to a fibre-channel connected VTL.

Their argument seems to come from the numbers – the wrong numbers.

The numbers constantly quoted are the sales numbers of disk backup Data Domain vs VTL Data Domain. That is, some EMC and Data Domain reps will confidently assert that, by the numbers, a significantly higher percentage of Data Domain for Disk Backup has been sold than Data Domain with VTL. That’s like saying that Windows is superior to Mac OS X because it sells more. Or, to pick a perhaps less controversial topic, it’s like saying that DDS is better than LTO because there have been more DDS drives and tapes sold than there have ever been LTO drives and tapes.

I.e., an argument by those numbers doesn’t wash. It rarely has, it rarely will, and nor should it. (Otherwise we’d all be afraid of sailing too far from shore because that’s how it had always been done before…)

Let’s look at the reality of how disk backup currently stacks up in NetWorker. And let’s preface this by saying that if backup products actually started using disk backup properly tomorrow, I would be the first to shout “Don’t let the door hit your butt on the way out” to every VTL on the planet. As a concept, I wish VTLs didn’t have to exist, but in the practical real world, I recognise their need and their current ascendancy over ADV_FILE. I have, almost literally at times, been dragged kicking and screaming to that conclusion.

Disk Backup, using ADV_FILE type devices in NetWorker:

  • Can’t move a saveset from a full disk backup unit to a non-full one; you have to clear the space first.
  • Can’t simultaneously clone from, stage from, back up to and recover from a disk backup unit. No, you can’t do that with tape either, but when disk backup units are typically in the order of several terabytes, and virtual tapes are in the order of maybe 50-200 GB, that’s a heck of a lot less contention time for any one backup.
  • Use tape/tape drive selection algorithms for deciding which disk backup unit gets used in which order, resulting in worst case capacity usage scenarios in almost all instances.
  • Can’t accept a saveset bigger than the disk backup unit. (It’s like, “Hello, AMANDA, I borrowed some ideas from you!”)
  • Can’t be part-replicated between sites. If you’ve got two VTLs and you really need to do back-end replication, you can replicate individual pieces of media between sites – again, significantly smaller than entire disk backup units. When you define disk backup units in NetWorker, that’s the “smallest” media you get.
  • Are traditionally space wasteful. NetWorker’s limited staging routines encourage clumps of disk backup space by destination pool – e.g., “here’s my daily disk backup units, I use them 30 days out of 31, and those over there that occupy the same amount of space (practically) are my monthly disk backup units, I use them 1 day out of 31. The rest of the time they sit idle.”
  • Have poor staging options (I’ll do another post this week on one way to improve on this).

If you get a table thumping sales person trying to tell you that you should buy Data Domain for Disk Backup for NetWorker, I’d suggest thumping the table back – you want the VTL option instead, and you want EMC to fix ADV_FILE.

Honestly EMC, I’ll lead the charge once ADV_FILE is fixed. I’ll champion it until I’m blue in the face, then suck from an oxygen tank and keep going – like I used to, before the inadequacies got too much. Until then though, I’ll keep skewering that argument of superiority by sales numbers.

Nov 25 2009

 

Everyone who has worked with ADV_FILE devices knows this situation: a disk backup unit fills, and the saveset(s) being written hang until you clear up space, because as we know savesets in progress can’t be moved from one device to another:

Savesets hung on full ADV_FILE device until space is cleared

Honestly, what makes me really angry (I’m talking Marvin the Martian really angry here) is that if a tape device fills and another tape of the same pool is currently mounted, NetWorker will continue writing the saveset on the next available device:

Saveset moving from one tape device to another

What’s more, if it fills and there’s another drive that currently has a tape mounted, NetWorker will mount a new tape in that drive and continue the backup, in preference to dismounting the full tape and reloading a volume in the current drive.

There’s an expression for the behavioural discrepancy here: That sucks.

If anyone wonders why I say VTLs shouldn’t need to exist, but I still go and recommend them and use them, that’s your number one reason.
