Yesterday I experienced one of those weird NetWorker issues that is such an odd combination of factors that I felt it had to be discussed.

Here’s the scenario. A customer was:

  • Previously running NetWorker 7.4.2 on their backup server.
  • Upgraded the server to 7.5.1.
  • Had a bunch of Windows clients and one Unix client.
  • The Unix client was configured for filesystem backups and Oracle backups.
  • All clients were running 7.4.2(ish). The Oracle module was 4.5.
  • Once the upgrade was done, Unix filesystem backups continued to work but the Oracle backups would fail with:
client:RMAN:/path/to/script.rman 1 retry attempted
client:RMAN:/path/to/script.rman off
client:RMAN:/path/to/script.rman /path/to/nsrnmo[291]: -l:  not found
client:RMAN:/path/to/script.rman nsrnmostart returned status of 127
client:RMAN:/path/to/script.rman /path/to/nsrnmo exiting.

My first thought when a colleague asked me to have a look at it was that there was just enough of an incompatibility between 7.5.x and NMO 4.5 that some argument carried over from an earlier version of NMO was causing problems talking to a 7.5.x server. This wasn’t the case. (Yes, I knew that the two versions are meant to be compatible, and when I’ve installed and used them they have been, but that doesn’t mean you can’t have one single setting somewhere that tickles a coding error across versions.)

I went back and forth with a few other checks with the customer, noting that there were various issues reported in the NMO applogs, but none specific enough to nail the problem. So since everything looked OK I agreed with the customer that a WebEx would probably help us solve the issue faster.

Even though the customer had given me the client resource, and I hadn’t found anything wrong with the backup command or the save set name, out of curiosity I asked the customer when we started the WebEx to show me the client details. The saveset looked fine, so we jumped across to the backup command, and that also looked fine. But then, underneath the backup command, there was the “save operations” field, and that field held:

VSS:*=off

It hadn’t been recently added. It had been there since before the upgrade, and before the upgrade the backups had been working. But as we know, on pre-VSS Windows systems invoking that will cause backup failures, so I asked the customer to remove that entry and start the backup. Neither of us really thought that this would solve the problem, given the filesystem backups were still working, but lo and behold, with that removed the Oracle RMAN backups started properly working.

In retrospect, this of course was definitely the problem, but working it out was a bit more challenging. The reason was that the configuration shouldn’t have worked under a NetWorker 7.4.x server either, but for some reason it did. The 7.4.x NetWorker server was likely not sending through the VSS directive to the Unix client and the Unix Oracle module, but having upgraded to 7.5.x, the new install stopped “filtering the error” and started causing the problem to manifest. Or alternatively, 7.4.x and 7.5.x both send the save operations setting, but just differently enough to be dangerous.

I wouldn’t exactly say this was NetWorker’s fault – those VSS options are only designed for use with Windows 2003 and higher clients, and I’d guess that the VSS:*=off was just applied to every single client on the customer site without considering the 1 x Unix client.

In retrospect, the following line now completely makes sense:

client:RMAN:/path/to/script.rman off

That was our only “hint” as to the cause of the problem in the savegroup completion. It wasn’t enough by a long stretch. Sometimes, and this is the challenging bit – sometimes you can have configuration errors even if you haven’t changed the actual resource configuration. Different versions of NetWorker will react differently to an incorrect configuration – so the upgrade didn’t cause the problem, it just allowed the problem to appear.
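A check along the following lines could flag this sort of blanket setting before an upgrade exposes it. This is a minimal sketch only: it assumes the client resources have already been exported into plain Python dictionaries, and the keys used here (name, os, save_operations) are hypothetical names chosen for illustration, not NetWorker's own attribute names.

```python
def find_misapplied_vss(clients):
    """Return the names of non-Windows clients whose save operations mention VSS.

    `clients` is a list of dicts with hypothetical keys:
    'name', 'os' and 'save_operations'.
    """
    flagged = []
    for client in clients:
        is_windows = client.get("os", "").lower().startswith("windows")
        save_ops = client.get("save_operations", "")
        # VSS directives only make sense on Windows clients.
        if not is_windows and "VSS:" in save_ops.upper():
            flagged.append(client["name"])
    return flagged


# Illustrative data only; these client names are made up.
clients = [
    {"name": "winclient01", "os": "Windows 2003", "save_operations": "VSS:*=off"},
    {"name": "unixclient01", "os": "Unix", "save_operations": "VSS:*=off"},
    {"name": "unixclient02", "os": "Unix", "save_operations": ""},
]
print(find_misapplied_vss(clients))
```

With the sample data above, only the Unix client carrying the VSS setting is reported.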

 

You’d practically have to be Amish in order to not know that Apple today finally released their much anticipated tablet. Called the iPad, it’s a new approach to tablet computing.

Early today on Twitter, someone remarked that the iPad will be a failure because it’s not really a mobile computing platform. To me, this comment is actually indicative of why overall tablet computing hasn’t taken off. In short, both the netbook market and the tablet market have struggled for the same reason: portable computing is not about being able to run 5 VMware instances in the palm of your hand, factoring 100,000,000 digit primes, or complete desktop replacement strategies.

Portable computing is about portability.

There’s an old saying: you can have fast, cheap or good, pick 2, you can’t have all 3.

This sums up portable computing thus far. Portable computing in the form of conventional tablets and netbooks has failed because it has tried to do all three. You can’t have fast, cheap and good in a single device when you try to do the same thing as a desktop. It just doesn’t work.

The way to make it work is to redefine what you use the device for.

Let’s consider my computing requirements. I have an iPhone, a (work) Mac Book Pro, and a (home) Mac Pro.

There are things I do on each of these platforms that I wouldn’t attempt to do on the others. I can encode video at 300 frames per second on my Mac Pro whilst running 8 virtual machines. I wouldn’t even consider doing that on my laptop. I can multitask on my laptop, having Twitter, browsing, Adium, etc., open and all available at once. I wouldn’t want to consider that on my iPhone.

On my iPhone, I can check the news and check email regardless of whether I’m lying in bed, eating out at a restaurant or even (heaven forbid) in the loo. I wouldn’t attempt that with my laptop, and I sure as hell wouldn’t attempt it with my Mac Pro.

So why the hell is it logical to expect that I should be able to do all of the above on a portable computing device that falls somewhere between a Smart Phone and a Laptop? There is no logic in such a requirement. It’s based on false expectations, on a false sense of entitlement.

By comparison, one of the beefs that I constantly have with Linux is that you don’t design consumer products with geeks in mind. I am not a regular computer user, yet I think I’m perhaps a more realistic desktop user than a lot of geeks. I don’t want to customise my desktop to the nth degree, or skin every single app in an entirely different window dressing, or have a choice between 3,134 personal media players. I just want the desktop environment to work.

Linux, in short, is designed for a programmer or other highly technical user who values customisation over productivity. This, I’d argue, is the core failing of Linux at the desktop, and also mirrors the core failing of conventional tablets and netbook designs.

Don’t get me wrong. I love Linux at the server level. But I’d sooner be tasered in a bath repeatedly than subject myself to Linux on the desktop these days. I’m after productivity, not customisation.

So, after that digression, let me return to the original topic – the iPad.

Apple have, in my mind, quite rightly determined that you can’t make a portable computing device that either is (a) a big version of a Smart Phone with no additional features (other than screen) OR (b) attempts to be a complete laptop or desktop replacement.

In short, there’s room for another platform level there.

So, that’s my assumption in this entire piece. (Bear in mind, based on their last financials, I’d suggest to you that the odds are with me in backing Apple for knowing what they’re doing as opposed to with the nay-sayers who are saying the same thing about the iPad as was said about the iPhone.)

My assumption is that portable computing is not about replicating an entire desktop experience, but giving an über-portable experience. It’s about picking up your device and carrying it around with you, still doing stuff while you’re standing in an elevator or you’ve got five minutes before the meeting starts. Or it’s about taking your troubleshooting session with you from your desk into the computer room without losing a moment’s connectivity.

I predict that iPad will be huge in the enterprise realm. Not for email. That’s passé. Not for eBooks. That’s nice, but largely irrelevant. Not for instructional videos, meeting notes or portable presentations. That’s all sales and consulting. These are all good, but represent only a small portion of the enterprise realm.

What I’m talking about is enterprise systems management – the core of any IT group within any company.

I looked at the iPad and I started to drool not because of any of the advertised features – not even from thinking about how fantastic it would work in hospitals and medical circles – but because I was thinking of all the different enterprise systems I’ve worked with over the years – NetWorker, NetBackup, a multitude of databases and ERP systems, HDS arrays, EMC arrays, NetApp arrays, system console services, DRAC, etc., and I imagined an iPad interface for every single one. Sure I’ve wistfully thought about such management apps for the iPhone/iPod Touch in the past, but each time I’ve acknowledged that screen real estate would be a real killer.

Not with the iPad.

So let me make a prediction: any enterprise vendor that didn’t look at Apple’s presentation today and see the future of enterprise management interfaces is stupid. It’s not about desktop apps or Java apps or web portals – well, maybe it is about web portals, but only 100% HTML5 compliant ones that don’t rely on Java or flash.

The future of enterprise systems management is in portable computing devices. It’s about backup administrators who are able to kick off a recovery from wherever they are in the building. It’s about storage administrators being able to bind a LUN or manage a snapshot while they’re waiting for a meeting to start. It’s about network administrators being able to open up an additional port while they’re waiting for the lift to arrive, or react instantly to a notification that an unapproved system has attached itself to the network. It’s about system administrators being able to continue their ssh or RDP session as they move from their desk to the computer room as they prepare to power cycle a machine having problems. It’s about the IT manager being able to access the global dashboard to check up to date service statuses when she’s in a meeting with the board.

You get where I’m going here.

The iPad will make it into enterprises not because of any of the advertised features, but because of the portable management functionality it allows. Since Apple has not tried to make the iPad all things to all people, it will do what it does superbly, and one of those things will be enterprise management apps.

Mark my words: if your enterprise vendor is not forming a team right now to develop management apps for their software or hardware for the iPad, they’ve got their heads stuck in the sand, and are asking their competitors to take advantage of them.

Ask your vendor tomorrow: when will the iPad version of their management app arrive?

 

I had to briefly run up NetWorker 6.x this week in order to confirm that cross platform directed recovery really did work back then, and that I wasn’t going nuts.

The short answer is yes, it did.

The longer answer is that it made me profoundly grateful that we’ve moved beyond 6.x:

[Image: Older NetWorker GUI]

Sure, the Unix GUI was at least serviceable – for watching backups, at least. But it sucked for creating any new resources. The basic form approach with endless scrolling, etc., was painful. The old Windows native GUI at least had some advantages there.

But it was all the other things about 6.x that you just take for granted now in 7.x that I noticed, such as:

  • (On Linux at least) What? What happened to inquire and sjirdtag? Honestly, I’ve been using these so heavily that when they didn’t install as part of my 6.1.3 install I removed and re-added the packages just to make sure I wasn’t going crazy.
  • ADV_FILE. OK, so I complain about the inadequacies of ADV_FILE a lot, but the 7.x series of NetWorker has been around for so long that I automatically tried to create an ADV_FILE device for my testing. Nope, had to create a file type device instead.
  • Urgh – monolithic resource files (nsr.res, nsrjb.res, nsrla.res). Once again I’m grateful to have the split resource database system that came in with NetWorker 7.0.

On top of those points, and the many areas where there have been enhancements to NetWorker over the years, there was one memory that particularly struck me with 6.x: dynamic drive sharing. Introduced in the 6.x era, and way too oversold at a time when SANs were only just beginning to be designed to work properly with tape, DDS gave me more project overruns and nightmare support scenarios than anything else. While DDS hasn’t gone away in 7.x, it’s become easier with TapeAlert, better management generally, and better SANs.

Oh, but one thing we lost: cross platform recovery capabilities. Please EMC engineering, bring it back. At least for the folks that are migrating Novell NetWare to OES on SLES.

 

Over at The Backup Blog, Scott Waterhouse offers an alternate perspective on why the announcement by IBM of an in-lab tape technology that fits 35TB per cartridge is largely irrelevant to a doomed market.

I respectfully disagree with Scott’s assessment. I also swear that even though I absolutely loathe the song “Killing me Softly”, naming the blog post after that song had nothing to do with my disagreement on his assessment.

Scott takes two arguments:

  1. It seems a lot like previous announcements by Sun that they were going to release $10M+ servers that were just servers, then later come up with a model that allows the development of servers that do the same job at one twentieth to one fortieth of the cost.
  2. That there already is a serious decline in tape, and this will trigger a terminal decline.

You may recall that a while ago I linked to a fairly astute piece by Drew Robb over at Server Watch titled “Tape vs Disk: Tape Refuses to be Evicted“. What was most interesting in Drew’s article was this quote:

How are tape sales? IDC references several studies. Tape overall is down, although the slide is mainly at the lower end. Robert Amatruda, a tape analyst for IDC, said that the market for tape automation products below 100 tape cartridges would suffer most. Another IDC study on Asia-Pacific sales from last year showed automated tape libraries to be up 15 percent for the year, while tape drives fell 19 percent. Cheryl Ganesan-Lim, an IDC analyst, noted that disk storage allows better recovery speeds, thus making it suitable for Tier 1 and Tier 2 storage. Tape, on the other hand, is better for deep archiving of rarely accessed data. She expected tape library sales to rise slightly over the next five years.

So tape is down in lower-end, smaller-scale and more immediate data recovery categories, but it is largely holding its own at the high end. It looks like tape’s death isn’t imminent.

A lot of people are quick to jump on the notion that tape sales are declining. What I take from Drew’s article is the logical fact that at the low end of the market, tape is well and truly dropping off. Pretty much every small business that I’m aware of at an IT level has shifted its backup operations from tape to disk (removable or otherwise) in the last 5 years. I don’t see this trend reversing.

But equally, I’m not seeing tape “dying” at the enterprise level. I recently wrote an article titled “Direct to Tape is Dead: Long Live Tape“. The title was quite intentional – I do see that at an enterprise level the reasons for backing up to tape directly have been falling for years, and this will be the decade where that is well and truly finished off as a “standard” backup practice. However, that doesn’t mean the death of tape in backup circles.

Scott and I disagree usually when it comes to deduplication. My preference for a start is target based deduplication so that it slots into an existing solution, and he raises alternate arguments that moving to source based deduplication is a good thing. Neither argument is 100% correct, and neither argument is 100% incorrect; they’re just different ways of looking at the same problem.

Scott argues that because IBM has come up with a staggering increase in the capacity of tape, they’re going to struggle to sell sufficient numbers of units in comparison to say, LTO-4 media – and they’re going to be unable to raise the price of their products to match the 40 fold increase in capacity:

But I would be willing to bet my last dollar that there will not be any similar increase in cost or in units shipped to offset this. No tape cartridge is going to cost $2000 (roughly 40x what a current LTO cartridge costs). And they sure aren’t going to sell 40x as may of them.

Looking at a cost perspective, I’m not convinced. When we compare say, even a theoretical cost of $2000 per cartridge for IBM über-dense tape capable of holding 35TB uncompressed, and the actual cost of a Data Domain 32TB dedupe solution, the numbers speak fairly heavily towards buying a bunch of 35TB tapes. Even at that price for the media, there will be orders of magnitude difference between the cost of magnetic tape and the cost of fully specced dedupe solutions. (Particularly when accounting for the need for replication – hence, two such units.)
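To put the media side of that comparison into numbers, here is a rough sketch of the cost per terabyte. It uses only figures from this discussion: the theoretical $2000 cartridge holding 35TB uncompressed, and a current LTO-4 cartridge at roughly one fortieth of that price holding 800GB uncompressed. Both prices are approximations for illustration, not quotes.

```python
# Theoretical $2000 cartridge at 35TB uncompressed, against an LTO-4
# cartridge at roughly 1/40th of that price holding 800GB uncompressed.
dense_price, dense_tb = 2000.0, 35.0
lto4_price, lto4_tb = 2000.0 / 40, 0.8


def cost_per_tb(price, capacity_tb):
    """Media cost per terabyte of uncompressed capacity."""
    return price / capacity_tb


print(f"35TB cartridge: ${cost_per_tb(dense_price, dense_tb):.2f} per TB")
print(f"LTO-4 cartridge: ${cost_per_tb(lto4_price, lto4_tb):.2f} per TB")
print(f"Capacity ratio: {dense_tb / lto4_tb:.2f}x")
```

Under those assumptions, even the theoretical $2000 cartridge works out slightly cheaper per terabyte than LTO-4 media.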

What I’m going to suggest is that we’re seeing an evolution in the datacentre which is splitting off a high end portion – maybe 5% to 10% of the datacentres of the world. There’s an incorrect assumption, I believe, that everyone can solve all their backup and data storage issues with deduplication. I’d argue that a significant portion of the datacentre will continue to work with tape on a day to day basis, and will continue to upgrade to the tape technologies that give higher capacity. That follows from the relative costs of these technologies at the moment, the inherent need they currently create for replication (which at times effectively doubles the price), and the relatively huge CapEx of doubling those purchases compared with the relatively small ongoing OpEx of media.

I’d go so far as to diagram it as follows:

[Diagram: Disk and tape usage in backup]

Obviously I’m not trying to make the above diagram scientifically accurate. What I’m trying to highlight is the top 5-10% of businesses in the enterprise arena, who will more than likely ditch tape altogether in the backup arena. (I will make no predictions on archive.) I fully agree that there’s an evolutionary trend for this ditching of tape entirely in certain datacentres, but only in the biggest.

What I’m increasingly seeing is that there’s a marked difference between what a small percentage of high end enterprises do and what the rest of the companies that are classified as “enterprises” do when it comes to backup and recovery. This is driven by cost, availability and complexity. Like relativity and quantum physics/mechanics, neither the “dedupe and replicate” nor the “disk and tape” arguments hold true for the entire picture. When looking at the available scenarios from one perspective, it’s clear dedupe and replicate is the way to go. When looking at the available solutions from another perspective, it’s clear disk+tape is the way to go.

My argument simply is that we’re still only at the point where 5-10% of the enterprises out there are suitable for the dedupe only+replicate solutions, and the majority of the rest will still fall into a category of requiring disk and tape. Again, neither argument is wrong, it’s just we’ve seen an evolutionary split in the datacentre between types of enterprises, and those types of enterprises need to be handled differently.

 

A borked LaCie 2TB BigDisk Extreme has reminded me of the role of backup and recovery within disaster recovery itself. By disaster recovery, I mean total “system” failure, whether that system is an entire server, an entire datacentre, or in my case, a large drive.

What is the difference between a regular failure and a disaster? I think it’s one of those things that’s entirely the perspective of the organisation or person who experiences it.

As for my current disaster, I’ve got a 2TB drive with just 34GB free. I’ve got up-to-date backups for this drive which I can restore from, and in the event of a catastrophe, I could actually regenerate the data, given that it’s all my media files. It’s also operational, so long as I don’t power it off again. (This time it took more than 30 minutes to become operational after a shutdown. It’s been getting worse and worse.)

So I’ve got a backup, I’ve got a way of regenerating the data if I have to, and my storage is still operational. Why is it a disaster? Here’s a few reasons, which I’ll then use to explain what makes for a disaster more generally, and why backup/recovery is only a small part of disaster recovery:

  1. I don’t have spares. Much as I’d love to have a 10 or 20TB array at home running on RAID-6 or something like that, I don’t have that luxury. For me, if a drive fails, I have to go out and buy a replacement drive. That’s budget – capital expenditure, if you will. What’s more, it’s usually unexpected capital expenditure.
  2. Not all my storage is high speed. Being a home user, a chunk of my storage is either USB-2 or FireWire 400/800. None of these formats offer blistering data transfer speeds. The 2TB drive is hooked up to Firewire 800, and I backup to Firewire 400, which means I’m bound to a maximum of around 30-35MB/s throughput for either running the backup or recovering from it.
  3. The failure constrains me. Until I get the drive replaced, I have to be particularly careful about any situation that would see the drive powered off.
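To give the second point some scale, the sketch below estimates how long a full recovery of this drive would take at those speeds. It assumes decimal units (1TB = 1000GB, 1GB = 1000MB) and a sustained transfer rate with no overhead, so treat the result as a lower bound.

```python
def transfer_hours(data_gb, throughput_mb_s):
    """Hours needed to move data_gb gigabytes at throughput_mb_s (decimal units)."""
    seconds = (data_gb * 1000) / throughput_mb_s
    return seconds / 3600


used_gb = 2000 - 34  # 2TB drive with 34GB free
for rate in (30, 35):
    print(f"{rate}MB/s: {transfer_hours(used_gb, rate):.1f} hours")
```

That works out to roughly 15 to 18 hours of sustained transfer just to get the data back.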

So there’s three factors there that constitute a “disaster”:

  1. Tangible cost.
  2. Time to repair.
  3. Interruptive.

A regular failure will often have one or two of the above, but all three are needed to turn it into a disaster. This is why a disaster is highly specific to the location where it happens – it’s not any specific thing, but a combination of the situation, the impact locally and the required response that render a disaster from a failure.
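The rule can be written down very simply. In this sketch the class and field names are my own shorthand for the three factors above, and reducing each factor to a yes/no flag is a simplification for illustration.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    tangible_cost: bool
    time_to_repair: bool
    interruptive: bool


def classify(failure):
    """A failure only counts as a disaster when all three factors are present."""
    factors = sum([failure.tangible_cost, failure.time_to_repair, failure.interruptive])
    return "disaster" if factors == 3 else "regular failure"


print(classify(Failure(True, True, True)))
print(classify(Failure(True, False, True)))
```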

There’s of course varying levels of disasters too, even at an individual level. Having a borked media drive is a disaster, but it’s not a “primary” disaster for me, because the core of what I do on my computer I can still get done. The same applies with corporations – it could be that losing both a primary fileserver and a manually controlled archive fileserver would constitute a “disaster”, but the first is always likely to be a far more serious disaster. That’s because it generates higher spikes in one or more of the factors – cost and interruption.

So, returning to the topic of the post – let’s consider why backup/recovery only forms a fraction of disaster recovery. When we consider a regular failure requiring recovery, it’s clear that the backup/recovery process forms not only the nexus to the activity, but likely the longest or most “costly” component (usually in terms of staff time).

In a disaster recovery situation, that’s no longer guaranteed to be the case. While the actual act of recovery is likely to take some time within a disaster recovery situation, there’s usually going to be a heap of other activities. There’ll be:

  • Personnel issues – getting human resources allocated to fixing the problem, and the impact of the failure on a number of people. Typically you don’t find (in a business world) that a disaster is something that only affects a single user within the organisation. It’s going to impact a significant number of workers – hence the tangible cost and the interruptive nature of disasters.
  • Fault resolution time – If you can seamlessly failover from an event, it’s unlikely it will be treated as a disaster. Sure, it may be a major issue, but a disaster is something that is going to take real time to fix. A disaster will see staff needing to work nigh-continuously in order to get the system operational. That will include:
    • Time taken to assess the situation,
    • Time taken to get replacement systems ready,
    • Time taken to recover,
    • Time taken to mop up/finalise access,
    • Time taken to repair original failure,
    • Time taken to revert services and
    • Time taken to report.
  • Post recovery exercises – in a good organisation, disaster recovery operations don’t just stop when the last byte of data has been recovered. As alluded to in the above bullet point, there needs to be a formal evaluation of the circumstances that led up to the disaster, the steps required to rectify it, any issues that might have occurred, and plans to avoid it (or mitigate it) in future. For some staff, this exercise may be the longest part of the disaster recovery process.
  • Post disaster upgrades – if, as a result of the disaster and the post recovery exercises it’s determined that new systems must be put into place (e.g., adding a new cluster, or changing the way business continuity is handled), then it can be fairly stated that all of the work involved in such upgrades is still attributed to the original disaster recovery situation.

All of these factors (and many more – it will vary, site by site) lead to the inevitable conclusion that it’s insufficient to consider that disaster recovery is just a logical extension of a regular backup and recovery process. It’s more costly in terms of either direct staff time or a variety of other factors, and it’s far more interruptive – both to individuals within the organisation, and the organisation as a whole.

As such, the response to a disaster recovery situation should not be driven directly by the IT department. IT of course will play a valuable and critical role in the recovery process, but the response must be driven by a team with oversight against all affected areas, and the post-recovery processes must equally be driven by a team whose purview extends beyond just the IT department.

We can’t possibly prepare for every disaster. To do so would require unlimited budget and unlimited resources. (It would also be reminiscent of the Brittas Empire.)

Instead, what we can plan for is that disasters will, inevitably, happen. By acknowledging that there is always a risk of a disaster, organisations can prepare for them by:

  • Determining “levels” of disaster – quantifying what tier of disaster a situation will be by say, percentage of affected employees, loss of ability to perform primary business functions, etc.
  • Determining role based involvement in disaster response teams for each of those levels of disaster.
  • Determining procedures for:
    • Communication throughout the disaster recovery process.
    • Activating disaster response teams.
    • Documenting the disaster.
    • Reporting on the disaster.
    • Post-disaster meetings.
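As an illustration of the first item, a tiering rule might look something like the sketch below. The tier numbers and the 50% and 10% thresholds are placeholders I have made up for the example; every organisation would need to set its own.

```python
def disaster_tier(affected_pct, primary_functions_lost):
    """Return a tier from 1 (most severe) to 3 (least severe).

    Thresholds are hypothetical placeholders, not recommendations.
    """
    if primary_functions_lost or affected_pct >= 50:
        return 1
    if affected_pct >= 10:
        return 2
    return 3


print(disaster_tier(affected_pct=5, primary_functions_lost=False))
print(disaster_tier(affected_pct=20, primary_functions_lost=False))
print(disaster_tier(affected_pct=20, primary_functions_lost=True))
```

Each tier can then be mapped to the response team and procedures agreed for that level.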

Good preparation of the above will not prevent a disaster, but it’ll at least considerably reduce the risk of a disaster becoming a complete catastrophe.

Don’t just assume that disaster recovery is a standard backup and recovery process. It’s not – not by a long shot. Making this assumption puts the business very much at risk.

 

Another update that happened to NetWorker while I was on holiday was 7.5.1.9 – 7.5 SP1 cumulative patch cluster 9. This includes the bug fixes from previous cumulative patch clusters for 7.5.1, as well as just a couple of additional fixes. If you’re already on 7.5.1.8, there may not be much incentive to update. (If you’re on 7.5.1 vanilla though, this may be a good sign that 7.5.1 is getting quite stable, and upgrading to 7.5.1.9 may be worthwhile.)

As is usual for cumulative patch clusters, you’ll need to chat with your EMC NetWorker support provider to get access to the patches.

I’ve not yet started to test 7.5.1.9 – focusing first on 7.6.0.1, and then I’ll get to 7.5.1.9 once I reallocate a virtual lab machine.

 

There’s been a cumulative patch release for NetWorker 7.6. This isn’t a service pack, but a bunch of patches on top of the vanilla 7.6 implementation. (To be more accurate, this happened last week, but I was on holiday and didn’t get around to retrieving it until now.)

There’s only a few fixes in it, but it seems to be recommended for servers and storage nodes running 7.6.

I’m currently running some tests against it and not having any issues as yet. If you’re after the cumulative patch details or a download link, please contact your EMC NetWorker support provider.

 

According to a press release, IBM have come up with a tape format which is so dense that it’ll fit about 35TB of uncompressed data on it.

Obviously this is a “just in the lab” technology and it’s going to be a while away from hitting the market. It remains, however, a remarkable feat – by comparison LTO-4 manages a “measly” 800 GB of uncompressed data, and the soon to be released LTO-5 manages 1.5TB of uncompressed data.

The critical question of course will remain – how fast will you have to pump data at this beast in order to get it streaming? I’m guessing it will be a seriously high speed. As we continue to see tape getting faster and faster, I’ll continue to say: this is the decade where direct to tape backup models will die, long live tape.

 

I don’t like having to do this, particularly since I’m on holidays and only logged into my work email to send a message rather than read any, but I noticed an email come in on a support case that I’ve been keenly dealing with, and wanted to check what the latest update from EMC on it was.

But on this case, I’ve been passed a response from EMC NetWorker engineering which is so boneheaded and stupid that I can’t help but have a short rant about it.

(I’ll qualify one thing here: I’m talking EMC NetWorker engineering – the back-end people, not the support people.)

In short, as of 7.6, there’s a new media database field called ‘validcopies’, which, according to the man page is:

The number of successful copies (instances or clones) of the save set, all with the same save time and save set identifier.

Now, digging a little bit further, we’ve got the release notes for 7.6, which state:

mminfo changed to allow query for valid save set copies in order to prevent data loss

There was no convenient method to query for save sets with valid clone copies on other volumes using mminfo. This made certain tasks more difficult to perform, such as determining if space could be cleared on the EDLs.

(Italicised emphasis mine, bold from the release notes.)

Now, in addition to validcopies initially being entirely FUBAR as a reporting mechanism (I’m happy with the patch I’ve been testing, and I’m hoping it will get into the first service pack for 7.6), I noted in the support case that I didn’t think it was appropriate for NetWorker to return 2 ‘validcopies’ for savesets on ADV_FILE devices. (I.e., one for the read-only volume, one for the read-write volume.) Sure, in the classic use of the ‘copies’ flag, we’re used to this, but ‘validcopies’, being something new, and being about preventing data loss, should have only reported 1 valid copy per entire disk backup unit, not 2.

Instead, EMC NetWorker engineering have adamantly said that it will report 2 valid copies per disk backup unit: one for the read-only device, one for the read-write device.

This is boneheaded. If the validcopies flag is all about preventing data loss, then it must be accurate as to the number of distinct, usable copies.
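What I’d have expected can be sketched as follows. This is not how NetWorker calculates anything; it’s a stand-alone illustration that assumes the save set instances have been pulled out as (ssid, volume name) pairs, and that the read-only side of an ADV_FILE device appears as the same volume name with a “.RO” suffix. Both the data layout and the naming are assumptions for the sake of the example.

```python
from collections import defaultdict


def distinct_copies(instances):
    """Count usable copies per save set, treating a volume and its
    '.RO' twin as one physical copy.

    `instances` is an iterable of (ssid, volume_name) pairs.
    """
    units = defaultdict(set)
    for ssid, volume in instances:
        # Assumption: 'name.RO' is the read-only view of volume 'name'.
        base = volume[:-3] if volume.endswith(".RO") else volume
        units[ssid].add(base)
    return {ssid: len(vols) for ssid, vols in units.items()}


# Made-up sample data: 1001 exists only on disk, 1002 has also been cloned to tape.
instances = [
    ("1001", "disk01"), ("1001", "disk01.RO"),
    ("1002", "disk01"), ("1002", "disk01.RO"), ("1002", "TAPE042"),
]
print(distinct_copies(instances))
```

Counted this way, the uncloned save set has one copy, and only the save set that has actually been cloned to another volume shows two.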

If engineering is so confident that a backup to ADV_FILE represents two distinct valid copies for the purposes of preventing data loss if a copy is lost, let’s see them delete a whole bunch of uncloned savesets from the read-write ADV_FILE devices on EMC’s production backups and then recover. What? You can’t do that? But you said you had two valid copies, and you only deleted one of them? Boo-hoo to you too.

I’ll end my grumpy rant with the following advice: don’t say or do something stupid that might allow a customer to do something stupid that might result in data loss. Haven’t you read this, after all?

 

Any regular reader knows that I don’t for a minute believe that tape is dead. However, it is time to address the changing use for tape within the enterprise datacentre, and what we’re going to see in the coming decade.

To start with, let’s examine the traditional role of tape within enterprise backup and recovery. Long term backup users “grew up” with one of the two following backup strategies:

  1. Each server (or critical server) had a tape drive (or drives) directly attached, and wrote data to the media in locally attached drives, or
  2. A central backup server received network backups and pushed them directly out to tape storage locally attached to the backup server.

Over time, as backup and recovery grew up, we saw the first model continually fail until it has become almost universally derided as the antithesis of best practices. The second model though, the centralised backup model, has effectively formed the absolute nexus of enterprise backup and recovery best practices.

The effect of the evolution of the centralised backup model has been a continual tug of war between network and data throughput to tape, and the performance characteristics of tape.

I sincerely doubt that this will be the decade that tape will die. However, this is the decade where direct to tape will die. To be perfectly honest, it’s fair to say we exited the noughties with the direct to tape model on life-support.

What’s wrong, specifically, with the direct to tape model? A primary reason is that tape is getting too fast. For a while in the noughties we were in a period where it was relatively straightforward to performance tune a backup environment to be able to keep data streaming relatively well to tape. This was around the LTO-1 and LTO-2 mark. However, LTO-3 started to cause the edifice to groan, LTO-4 to creak and crumble, and LTO-5 will just finish the job.
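A rough back-of-the-envelope sketch shows why each generation made this harder. The drive rates below are the nominal native (uncompressed) figures for each LTO generation; the 10MB/s per-client rate is purely a hypothetical number for illustration, and real clients will vary wildly in both directions.

```python
import math

# Nominal native (uncompressed) transfer rates in MB/s for each LTO generation.
LTO_NATIVE_MBPS = {"LTO-1": 20, "LTO-2": 40, "LTO-3": 80, "LTO-4": 120, "LTO-5": 140}

GIGE_MBPS = 125          # theoretical ceiling of a single 1Gb/s Ethernet link
CLIENT_STREAM_MBPS = 10  # hypothetical sustained rate delivered by one client

def streams_needed(drive_mbps, stream_mbps=CLIENT_STREAM_MBPS):
    """Concurrent client streams required to feed one drive at its native rate."""
    return math.ceil(drive_mbps / stream_mbps)

for generation, rate in LTO_NATIVE_MBPS.items():
    share = rate / GIGE_MBPS
    print(f"{generation}: {streams_needed(rate):2d} streams, {share:.0%} of one gigabit link")
```

On these assumptions a single drive goes from needing a couple of well-behaved clients to needing more than a whole gigabit link can carry, and that is before compression raises the rate the drive wants to be fed at.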

The rest of the environment quite simply hasn’t kept up with tape. We need high capacity tape for green, long term storage of backups or archives, but getting the data out to it is becoming increasingly difficult via a multi-pronged delivery system. Consider, for instance, an environment with just 50 machines, a NAS, and a SAN, where 34 of those machines use storage on the SAN, and two machines use storage from the NAS in addition to the NAS presenting storage direct to end users. 4 of the machines are actually ESX servers, with the remaining 30 of the 34 SAN connected machines being guests. The number of areas where performance tuning comes into play is significant:

  • How many SAN connected machines will be backed up at once?
  • What are the performance characteristics of the SAN under heavy simultaneous read load across all defined LUNs?
  • What are the performance characteristics of the SAN under heavy simultaneous read load across all defined LUNs while doing a RAID-5 reconstruction or undergoing a RAID-5 failure? (etc, etc.)
  • How many hosts on the SAN use wide striping? How many of these will be simultaneously backed up?
  • How many hot spares are there on the SAN?
  • What are the ongoing operational performance requirements of the SAN while heavy simultaneous read is occurring across all defined LUNs?
  • What are the performance characteristics of the SAN when significant spikes of primary production activity occur during a backup and all LUNs are busy with reads, and then key LUNs also become extremely busy with writes?
  • How many machines that are SAN connected will get copy-on-write snapshot backups, and how many will have non-snapshot backups?
  • What are the performance characteristics of the SAN snapshot pools?
  • What’s the impact of doing an NDMP backup of the NAS server as well as hosts using its storage? (Assuming for instance that those two other hosts have iSCSI access.)
  • How many simultaneous NDMP backups does the NAS server support?
  • What are the performance characteristics of the NAS host doing multiple NDMP backups whilst simultaneously supporting primary production access?
  • How many virtualised machines will be backed up at once? How many are likely to be on any one ESX server at any given time?
  • Will VCBs/etc be used for VMware guest backups? (Only for Windows of course. Let’s mess things up and say that 20 of the virtualised systems are running Linux.)
  • Will the tape library share access to the SAN?
  • What’s the speed of the SAN? 2Gb/s? 4Gb/s? This (obviously) significantly impacts throughput when we start talking about high speed tape.
  • For each client in the backup environment, what are the optimum client parallelism settings for the backup? For SAN connected and virtual clients, do these per-client optimum client parallelism settings impact other hosts? (It’s like the prisoner’s dilemma.)
  • Then there’s all the actual/traditional backup server (/storage node) questions:
    • What’s the base network speed?
    • How many network ports does the backup server have?
    • What are the backplane characteristics of the backup server?
    • What impact will filesystem density make on individual client performance?
    • etc, etc, etc.

In even a relatively small environment now, performance tuning of the entire environment to focus on one item – e.g., keeping tape streaming – is just completely impractical. The entire environment has to be evaluated in a more holistic way with a focus on overall performance for primary production, not tape streaming speed.

Of course, that’s not the only issue facing tape in an enterprise environment. Drives are relatively expensive, yet you need as many as possible so you can balance backup and restore objectives. However, media sizes are becoming so large that your chances of needing to read from a tape that you’re still writing to continue to grow with each generation, placing physical roadblocks in the way of backup and recovery performance. Then you’ve got the meta-access times: load times and seek times are relatively poor compared to using disk, meaning that SLAs requiring minimum times between a recovery request and the recovery commencing can’t readily be met with tape.

In short, we’ve hit the wall when it comes to the direct-to-tape backup model. I’m not the first backup consultant to say this, and I won’t be the last. This isn’t even the first time I’ve said it – I’ve been advising customers for years that they need <disk> inserted between the backup process and the tape, either as a simple buffer (for the smaller environments), or as a high speed/nearline recovery area for the larger environments.

The performance tuning advantages alone of migrating away from direct-to-tape are immense. Instead of worrying about how every single one of those questions above (and probably 3x as many more) will affect tape, and having to practically guess on a day to day basis on how streaming will be affected, you can instead focus tape streaming performance on just a few hosts within the environment – the backup server and any additional storage nodes you have. Get those hosts beefed up so that they can stream large chunks of data out to tape. Rather than having to “muscle up” the entire environment, you instead just have to get the performance and power out of a few select hosts. This can be a huge cost saving, and provides better, more guaranteed streaming speed to tape, since you move from dealing with all the above issues to just simple ones: how fast can you send very, very large chunks of data from the <disk> connected to the backup server/storage nodes to physical tape?

We still need tape. I do not accept the long term reliability of any solution that intends to keep everything on disk (VTL, ADV_FILE, etc) for the entire lifespan of a backup environment. Certainly not as a “blanket rule”, anyway – i.e., if you’re looking at making a broad statement, the broad statement is “tape is still needed” rather than “tape isn’t necessary”. Nothing equals tape when it comes to:

  • Long term recoverability;
  • Media that is guaranteed “offline”, completely immune to viruses and malware;
  • Green credibility; and
  • Cost per GB.

The movement away from the direct to tape model is not actually about “killing tape”, but instead it’s about reorienting business practices to suit business requirements rather than molding business requirements to suit backup media characteristics. Larger companies will of course look at designing their architecture to eliminate the need for day to day cloning to tape, focusing instead on say, cloning monthly backups only to tape, with the rest being replicated between multiple datacentres, etc. But that’s not the way it will be for the majority of the enterprise. Regardless though of whether you only clone monthly backups and use replication instead, or whether you still do daily cloning, tape stays part of the overall strategy. It just isn’t the primary focus of backup any longer.

This is the decade where we stop worrying about silly terms such as D2D2T and instead work with the changed playing field. The change is that we back up to <disk>, then get copies out to physical tape.

Direct to tape is dead, long live tape.

© 2012 The NetWorker Blog