Are your service level agreements and your backup software support contracts in alignment?

A lot of companies will make the decision to run with “business hours” backup support – 9 to 5, or some variant like that, Monday to Friday. This is seen as a cheaper option, and for some companies, depending on their requirements, it can be a perfectly acceptable arrangement too. That’s usually the case where there are no SLAs, or smaller environments where the business is geared to being able to operate for protracted periods with minimal IT.

What can sometimes be forgotten in attempts to restrain budgets is whether reduced support for production support systems has any impact on meeting business requirements relating to service level agreements. If, for instance, you have to start getting data flowing back within 2 hours of a failure, and a system fails at midnight and the subsequent recovery has issues, your chances of being able to hit your service level agreement start to plummet if you don’t have a support contract that guarantees you access to help at that point in time.
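To put numbers on that scenario, here is a minimal Python sketch that works out how long a failure would wait before contracted support can even be reached. The 9 to 5, Monday to Friday window and the 2 hour target come from the example above; the function names and the specific date are assumptions chosen purely for illustration.

```python
from datetime import datetime, time, timedelta

# Assumed "business hours" contract: 09:00-17:00, Monday to Friday.
SUPPORT_START = time(9, 0)
SUPPORT_END = time(17, 0)
SLA_HOURS = 2  # data must start flowing back within 2 hours of a failure


def support_available(moment: datetime) -> bool:
    """Return True if contracted support can be reached at this moment."""
    is_weekday = moment.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = SUPPORT_START <= moment.time() < SUPPORT_END
    return is_weekday and in_hours


def hours_until_support(failure: datetime) -> float:
    """Hours from the failure until support is next reachable."""
    moment = failure
    while not support_available(moment):
        moment += timedelta(minutes=1)  # step forward a minute at a time
    return (moment - failure).total_seconds() / 3600


# Hypothetical failure: midnight on Wednesday 17 March 2010.
failure = datetime(2010, 3, 17, 0, 0)
wait = hours_until_support(failure)
print(f"Support reachable after {wait:.1f} h; SLA allows {SLA_HOURS} h")
print("SLA at risk" if wait > SLA_HOURS else "SLA achievable")
```

With a midnight failure, the wait for support alone is longer than the whole recovery window, before any actual troubleshooting starts.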

A common response to this from management – be it IT, or financial – is “we’ll buy per-incident support if we need to”. In other words, the service level agreements the business has established necessitate a better support contract than is budgeted for, so it is ‘officially’ planned to “wing it” in the event of a serious issue.

I describe that as an Icarus Support Contract.

Icarus, as you may remember, is from Greek mythology. His father Daedalus fashioned wings out of feathers and wax so that he and Icarus could escape from prison. They escaped, but Icarus, enjoying the sensation of flight so much, disregarded his father’s warnings about flying too high. The higher he got, the closer he was to the sun. Then, eventually, the sun melted the wax, his wings fell off, and he fell to his death into the sea.

Planning to buy per-incident support is effectively building a contingency plan based on unbooked, unallocated resources.

It’s also about as safe as relying on wings held together by wax when flying high. Sure, if you’re lucky, you’ll sneak through it; but do you really want to trust data recovery and SLAs to luck? What if those unbooked resources are already working on something for someone who does have a 24×7 contract? There’s a creek for that – and a paddle too.

In a previous job, I once discussed disaster recovery preparedness with an IT manager at a financial institution. Their primary site and their DR site were approximately 150 metres away from one another, leaving them with very little wiggle room in the event of a major catastrophe in the city. (Remember, the site being inaccessible can be just as deadly to business as the site being destroyed – and while there are far fewer things that may destroy two city blocks, there are plenty more things that might cut off two city blocks from human access for days.)

When questioned about the proximity of the two sites, he wasn’t concerned. Why? They were a big financial institution, they had emergency budget, and they were a valued customer of a particular server/storage manufacturer. Quite simply, if something happened and they lost both sites, they’d just go and buy or rent a truckload of new equipment and get themselves back operational again via backups. I always found this a somewhat dubious preparedness strategy – it’s definitely an example of an Icarus support contract.

I’ve since talked to account managers at multiple server/storage vendors, including the one used in this scenario, and all of them, in this era of shortened inventory streams, have scoffed at the notion of being able to instantly drop in 200+ servers and appropriate storage at the drop of a hat – especially in a situation where there’s a disaster and there’s a run on such equipment. (In Australia for instance, a lot of high end storage kit usually takes 3-6 weeks to arrive since it’s normally shipped in from overseas.)

Icarus was a naïve fool who got lost in the excitement of the moment. The fable of Icarus teaches us the perils of ignoring danger and enjoying the short-term too much. In this case, relying on future unbooked resources in the event of an issue in order to save a few dollars here and there in the now isn’t all that reliable. It’s like the age-old tape cost-cutting: if you manage to shave 10% off the backup media budget by deciding not to back up certain files or certain machines, you may very well get thanked for it. However, no-one will remember congratulating you when there’s butt-kicking to be done if it turns out that data no longer being backed up actually needed recovery.

So what is an Icarus support contract? Well, it’s a contract where you rely on luck. It’s a gamble – that in the event of a serious problem, you can buy immediate assistance at the drop of a hat. Just how bad can planning on being lucky get? Well, consider that over the last 18 months the entire world has been dealing with Icarus financial contracts – they were officially called Sub-Prime Mortgages, but the net result was the same – they were contracts and financial agreements built around the principle of luck.

Do your business a favor, and avoid Icarus support contracts. That’s the real way to get lucky in business – to not factor luck into your equations.

 

Sometimes NetWorker may not want to cooperate when it comes to moving media in and out of drives, or around in a tape library. While nsrjb -n will do the trick for some media load operations where you don’t want to mount media, it’s not always available. Sometimes you will need to do a media move operation without NetWorker – either in situations where NetWorker isn’t running, or at times when NetWorker is disagreeing with the output of sjirdtag.

In these cases, you want to work with sjimm.

The usage for sjimm is:

[root@tara ~]# sjimm
1642:sjimm: usage: sjimm jukebox {drive|slot|inlt|mt} src {drive|slot|inlt|mt} dst

In this case, ‘jukebox’ will be the x.y.z component of the SCSI device ID as output from inquire (or as determined by checking the control port field for the jukebox.)

For instance, on my lab system, inquire shows me:

[root@tara ~]# inquire -l | grep Autochanger
scsidev@0.0.0:SPECTRA PYTHON          5500|Autochanger (Jukebox), /dev/sg6

So I know then that the jukebox component of my sjimm command will be 0.0.0.
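If you need to pick that address out in a script, something like the following Python sketch would do it. It only parses text in the format shown above; the function name is made up for illustration, and the sample line is copied from my lab output rather than obtained by running inquire.

```python
import re
from typing import List

# Sample line copied from the inquire output shown above.
INQUIRE_OUTPUT = (
    "scsidev@0.0.0:SPECTRA PYTHON          5500|Autochanger (Jukebox), /dev/sg6\n"
)

# Matches "scsidev@x.y.z:" at the start of a line describing an autochanger.
AUTOCHANGER_LINE = re.compile(r"^scsidev@(\d+\.\d+\.\d+):.*\|Autochanger")


def find_jukeboxes(inquire_text: str) -> List[str]:
    """Return the x.y.z address of every Autochanger line in the text."""
    found = []
    for line in inquire_text.splitlines():
        match = AUTOCHANGER_LINE.match(line)
        if match:
            found.append(match.group(1))
    return found


print(find_jukeboxes(INQUIRE_OUTPUT))  # ['0.0.0']
```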

So say I wanted to move the tape in slot 23 into the first drive in my autochanger. I’d use the command:

[root@tara ~]# sjimm 0.0.0 slot 23 drive 1

Note though that this doesn’t mount the tape. If I then run nsrjb, for my drive area I see:

drive 1 (/dev/nst0) slot   :   
drive 2 (/dev/nst1) slot   :   
drive 3 (/dev/nst2) slot   :   
drive 4 (/dev/nst3) slot   :   
drive 5 (/dev/nst4) slot   :   
drive 6 (/dev/nst5) slot   :

Note too that I didn’t give the drive to load the tape into as an operating system device, but instead a device number as per the autochanger’s definition. (I’ll get to tracing that in a minute.)

I can verify that there is a tape in the given drive at this point by running the command:

[root@tara ~]# nsrmm -p -f /dev/nst0
Verified LTO Ultrium-4 tape 800823L4 on /dev/nst0

When you’re done with the tape, you can then move it back:

[root@tara ~]# sjimm 0.0.0 drive 1 slot 23

Note that depending on the drive type, it may be necessary before issuing the above command to issue the mt command to take the media “offline”, which usually issues an eject command to the drive – e.g.:

[root@tara ~]# mt -f /dev/nst0 rewoff

Other than that, there’s actually not a lot to sjimm. You can move tapes from slots to drives, slots to CAP slots, drives to slots, slots to slots, etc.
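If you end up scripting these moves, a small helper that assembles and sanity-checks the arguments can save a typo at a bad moment. The Python sketch below only builds the argument list, following the usage line shown earlier; the helper name is hypothetical, and nothing here actually talks to a library.

```python
from typing import List

# Element types accepted by sjimm, taken from its usage message above.
VALID_ELEMENTS = {"drive", "slot", "inlt", "mt"}


def sjimm_args(jukebox: str, src_type: str, src: int,
               dst_type: str, dst: int) -> List[str]:
    """Build the argument list for: sjimm jukebox src_type src dst_type dst."""
    for element in (src_type, dst_type):
        if element not in VALID_ELEMENTS:
            raise ValueError(f"unknown element type: {element!r}")
    return ["sjimm", jukebox, src_type, str(src), dst_type, str(dst)]


# Load slot 23 into drive 1, then the reverse move to put it back.
print(" ".join(sjimm_args("0.0.0", "slot", 23, "drive", 1)))
print(" ".join(sjimm_args("0.0.0", "drive", 1, "slot", 23)))
```

On a host with NetWorker installed, the returned list could be handed to subprocess.run(); I’ve left that out so the sketch is safe to run anywhere.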

However, I did mention that I’d help you work out what drive number corresponds to what operating system device. Obviously if you’ve got the library configured, you can just use nsrjb’s output to see the autochanger device <-> OS device path mapping. If you don’t yet have a tape library configured in NetWorker, or the issue is determining which drive is currently mapped to which path (after say, a tape drive replacement), you need to do a little more digging.

So, in this case you’d run sjisn – which is designed to report serial numbers and device details for tape library components. Like sjimm, sjisn takes the control port of the tape library we want to communicate with – e.g.:

[root@tara ~]# sjisn 0.0.0

Serial Number data for 0.0.0 (SPECTRA  PYTHON          ):
Library:
Serial Number: XYZZY
SCSI-3 Device Identifiers:
ATNN=SPECTRA PYTHON          XYZZY
WWNN=11223344ABCDEF00
Drive at element address 1:
SCSI-3 Device Identifiers:
ATNN=ZF7584364
Drive at element address 2:
SCSI-3 Device Identifiers:
ATNN=ZF7584366
Drive at element address 3:
SCSI-3 Device Identifiers:
ATNN=ZF7584368
Drive at element address 4:
SCSI-3 Device Identifiers:
ATNN=ZF7584370
Drive at element address 5:
SCSI-3 Device Identifiers:
ATNN=ZF7584372
Drive at element address 6:
SCSI-3 Device Identifiers:
ATNN=ZF7584374

The number given in the “Drive at element address” line for each drive represents, literally, the drive number according to the tape library itself. I.e., when it refers to drive 1, it means the drive with serial number ZF7584364.

Moving on, we can then run inquire -l to provide the device details so as to align internal library drive numbers to operating system paths, cross-referencing by the serial numbers (or WWNs when using a fibre-channel tape library). In this case, I’ll just show the details for two of the tape drives:

scsidev@0.3.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst2
                                           S/N:    ZF7584368
                                           ATNN=IBM     ULT3580-TD4     ZF7584368
                                           WWNN=11223344ABCDEF03
scsidev@0.4.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst3
                                           S/N:    ZF7584370
                                           ATNN=IBM     ULT3580-TD4     ZF7584370
                                           WWNN=11223344ABCDEF04

So, you can see from the above that we can map the drives as follows:

  • The drive known to the OS as /dev/nst2, which has a serial number of ZF7584368, maps to the library device number 3.
  • The drive known to the OS as /dev/nst3, which has a serial number of ZF7584370, maps to the library device number 4.

So this would give us the drive numbers to use in sjimm if we needed to move tapes in or out of those drives without using NetWorker’s NMC or nsrjb.
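For a library with more than a handful of drives, doing that cross-reference by eye gets tedious. The following Python sketch shows one way it could be automated, assuming sjisn and inquire output laid out like the samples above. The function names are my own invention, the sample text is a trimmed copy of the output shown earlier, and the sketch matches on serial numbers only (not WWNs).

```python
import re
from typing import Dict

# Trimmed copies of the sjisn and inquire output shown above.
SJISN_SAMPLE = """\
Library:
Serial Number: XYZZY
SCSI-3 Device Identifiers:
ATNN=SPECTRA PYTHON          XYZZY
Drive at element address 3:
SCSI-3 Device Identifiers:
ATNN=ZF7584368
Drive at element address 4:
SCSI-3 Device Identifiers:
ATNN=ZF7584370
"""

INQUIRE_SAMPLE = """\
scsidev@0.3.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst2
                                           S/N:    ZF7584368
scsidev@0.4.0:IBM     ULT3580-TD4     5500|Tape, /dev/nst3
                                           S/N:    ZF7584370
"""


def drives_by_serial(sjisn_text: str) -> Dict[str, int]:
    """Map drive serial number -> library drive number, from sjisn output."""
    drives, current = {}, None
    for line in sjisn_text.splitlines():
        header = re.match(r"\s*Drive at element address (\d+):", line)
        if header:
            current = int(header.group(1))
            continue
        ident = re.search(r"ATNN=(.+)$", line)
        if ident and current is not None:  # skip the library's own ATNN
            drives[ident.group(1).split()[-1]] = current
    return drives


def paths_by_serial(inquire_text: str) -> Dict[str, str]:
    """Map drive serial number -> OS device path, from inquire -l output."""
    paths, current_path = {}, None
    for line in inquire_text.splitlines():
        if line.startswith("scsidev@"):
            tape = re.search(r"\|Tape, (\S+)", line)
            current_path = tape.group(1) if tape else None
            continue
        serial = re.search(r"S/N:\s*(\S+)", line)
        if serial and current_path is not None:
            paths[serial.group(1)] = current_path
    return paths


def map_drives(sjisn_text: str, inquire_text: str) -> Dict[str, int]:
    """Map OS device path -> library drive number via shared serials."""
    paths = paths_by_serial(inquire_text)
    return {paths[serial]: number
            for serial, number in drives_by_serial(sjisn_text).items()
            if serial in paths}


print(map_drives(SJISN_SAMPLE, INQUIRE_SAMPLE))
```

For the sample text this reports /dev/nst2 as drive 3 and /dev/nst3 as drive 4, matching the manual mapping above.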

As a side-note, that’s also how you’d go about identifying the correct device order for a manual jbconfig operation when the library device order is out of sync with the operating system devices – cross-checking via sjisn and inquire.

 

I’m interested in gathering a basic understanding of the NetWorker deployment profiles amongst people who are reading this blog. I’m hoping that the survey is short enough that you won’t mind quickly filling it in. My purpose in collecting this is to get an understanding of the sorts of topics I might cover in future.

Note – if you have multiple NetWorker servers, select each version of NetWorker you’re running as a server in the first question, and continue to aggregate the answers for all your datazones throughout.

I’ll be keeping this survey running until Saturday 21 March 2010, Australian time.

[Edit, 2010-03-20]

The survey has now closed. Results will be posted in the week starting 22 March 2010, and a link to the survey results will be provided here once they are published.

[Edit, 2010-03-27]

You can download the survey results from here.

 

While initially I had some success with NetWorker on Mac OS X Snow Leopard, I’m increasingly finding that it’s just boiling down to being too random for reliable backups. So far problems mainly seem to occur after a machine has gone to sleep and woken up multiple times – or had its network location changed multiple times. Thus it mainly seems (for the moment) to affect laptops or machines that frequently sleep.

The net result is that you’ll get into situations where several errors will start to happen and you’ll eventually need to reinstall the NetWorker client, reboot, and then potentially reinstall the NetWorker client another time. Note that complete cold restarts do not seem to fix (or even temporarily work around) the issues as reliably as the reinstall/reboot/reinstall method does.

Error 1

Attempts to connect from the server to the client will fail – e.g.,

[root@nox ~]# nsradmin -p 390113 -s archon
39078:nsradmin: RPC error: Remote system error
There does not appear to be a NetWorker nsrexecd server running on archon.

Error 2

Stopping and restarting the NetWorker services on the client fails:

root@archon ~
$ SystemStarter stop NetWorker
Stopping NetWorker Client.
root@archon ~
$ ps -eaf | grep nsr
0  5381  5230   0   0:00.00 ttys001    0:00.00 grep nsr
root@archon ~
$ SystemStarter start NetWorker
Starting NetWorker Client.
/Library/StartupItems/NetWorker/NetWorker: line 10:  5389 Illegal instruction     /usr/sbin/nsrexecd

Error 3

I’m finding that directives are getting confused over directories and paths too:

* archon:/ 70340:savepnpc: ignoring directory specification for `/Users/preston/Library/Application Support/Yojimbo/' in
* archon:/ `/Users/preston/Library/Application Support/Yojimbo/.nsr' - not contained within directory `/users/preston/Library/Application Support/Yojimbo/'
* archon:/ 70340:savepnpc: ignoring directory specification for `/Users/preston/Library/Parallels/' in
* archon:/ `/Users/preston/Library/Parallels/.nsr' - not contained within directory `/users/preston/Library/Parallels/'

It seems to be a spurious error – usually when this happens the directives are still processed.

Error 4

On some backups – usually fulls – I get hundreds of malloc errors in the savegroup completion – e.g.,

* archon:/ savepnpc(668,0xa0a01500) malloc: *** error for object 0x20: pointer being freed was not allocated
* archon:/ *** set a breakpoint in malloc_error_break to debug
* archon:/ savepnpc(668,0xa0a01500) malloc: *** error for object 0x20: pointer being freed was not allocated
* archon:/ *** set a breakpoint in malloc_error_break to debug

What I’m doing

I’ve currently got a question case open with EMC asking when we’ll get official support for Snow Leopard. I’ll update this blog with details when I can.

[Update] There are existing escalations to get Snow Leopard support. The current tentative schedule, I’m told, is for support in NetWorker 7.6 SP1. There are apparently escalations against 7.5.x as well – personally, if I were a betting person, I’d be betting we’ll more likely get support in just 7.6 via SP1 rather than both 7.6 and the 7.5 tree.

 

I want to start this article by saying that I’m bound by NDAs all over the place. The company that I work for, being partners with a variety of companies, has NDAs in place for each vendor that result in me being under an NDA as well. Thus, I’m not going to:

  1. Break any NDAs
  2. Advocate violating NDAs

I’m bound by those NDAs in what I write on this blog – I attend partner briefing con-calls/presentations etc., periodically, and get told about upcoming features or more generally roadmaps going up to 2 years out. I’m involved in beta testing – version and feature – and so I get to see things before a lot of other people. I also get to talk directly to product management at vendors too. So to any vendor reading this, I hope you’ll understand that I’ll still follow all your NDA processes.

Just because I’m bound by NDAs doesn’t mean I can’t talk about where I think they’re wrong.

There’s a growing chorus of “NDAs suck” at the moment, and I’m not laying claim to the idea of blogging about the suck-value of NDAs on my own. I’ve reached the point of wanting to blog about it based on the previous efforts of Grumpy Storage in “Show me the Money (Information)”, and more recently Matthew Yeager’s “First, execute with urgency. The rest is commentary”. (Incidentally, those are two people you really should be following on Twitter: ianhf and mpyeager respectively.)

Over at Grumpy Storage, Ian, as an end-customer, wrote:

I need electronic copies of any & all materials discussed or presented – no exceptions, without this I can’t use it as reference material in my internal strategy planning. If you hide behind “it’s beyond NDA”, or “NDA prohibits” then I’ll interpret that as “you don’t trust me personally or respect me professionally” and the relationship will be difficult from then on.

This is a pretty damning comment on Ian’s part, and realistically represents how a lot of customers feel about NDAs – and this may be the surprising part – how a lot of suppliers and system integrators feel about them too. (I think he’s wrong about where the trust issue lies, and I’ll get to that soon.)

Matthew drew up an excellent summary of how NDAs protect intent over execution, and some possible solutions to this, and I’d suggest you consider reading both Ian’s and Matthew’s articles in full before continuing with what I’m going to say.

My argument is that NDAs themselves don’t suck. However, I do feel that the vast majority of instances in which NDAs are applied do, indeed, suck.

Trusted partners/suppliers are often “piggy in the middle” when it comes to NDAs. Where we frequently add value is by being closely aligned to our customers (who we prefer to also call partners), working at understanding their business requirements and delivering solutions and information that are tailored to suit those requirements. We recognise that time is precious, attention is a currency, and that the work of IT managers and staff isn’t to be sold to by a business, but to deliver to the business. By having the time to work directly with businesses, we offer a value-add that bungee-vendor sales rarely if ever can. That’s why a lot of companies choose to work with integrators and suppliers rather than vendors directly. As such, perhaps more than end-customers, as an integrator I can look at the various pieces of information I know that are locked away under NDA and really, really regret that I can’t readily share them with my customers to help them with their forward planning.

So in that sense, NDAs are a constant case of “Here’s some really good information! But. You. Can’t. Tell. Anyone.”

Now, my beef with NDAs is not that they exist – I’m a fierce proponent of intellectual property protection. My beef is in where NDAs are applied. Or perhaps to be more succinct – in the frequency with which NDAs are applied. It’s too often. It’s across the board on a range of things where it logically makes no sense, and it’s often for the wrong reasons.

Ian at Grumpy Storage sees NDAs as a trust issue. I agree, but I think he’s (understandably) missing where the trust issue really exists. You see, in big companies – and most vendors fall into this category – few people have “authority”. In this case, by authority, I’m talking about authority to discuss information on unreleased products or features with non-employees. This goes to the heart of corporate secrecy, and if companies should understand anything by now it’s that social networking is eroding this. So it’s trust alright, but the trust issue is in companies mistrusting their staff to make sensible judgment calls, or mistrusting the market to such a degree that the wrong disclosure decisions are made.

Recently, a senior vendor employee told me the following in relation to consulting:

“giving away info” is exactly what consultants need to do — controversial, but effective

Here’s the rub: the same applies to most situations where NDAs are pulled out. That is, in places where information is currently bartered (“I’ll tell you, but only if you sign this document that says I can sue you if you tell anyone else”), it should be flowing freely. (Call it the next step in the Cluetrain Manifesto if you will.) This is something that’s imperative to turn around. It’s already important with this generation, but just think of how important it’s going to be in a business environment saturated with Gen-Y’ers, all of whom thrive on interchange and connectivity. (I’ve not said it so succinctly before, but I think Gen-Y is going to cause one of the biggest upheavals ever experienced in business communications, practices and procedures.)

I’d wager that the following two reasons sum up most of the times that NDAs are waved around:

  1. Vendor employees are insufficiently empowered to make a judgment call that the people they are speaking to can be trusted. Lacking this empowerment, they must take the safe approach. (Hey, they need jobs too.)
  2. Vendor management and legal frequently resort to the knee jerk reaction (sometimes due to a lack of empowerment themselves) of trying to hide as much information as possible.

These, of course, are on top of the actual valid reasons why we have NDAs – to protect key components of intellectual property. However, those valid reasons are definitely in the minority. If a picture helps, I’d suggest the following breakdown is fairly indicative of why vendors ask people to sign NDAs:

[Chart: Reasons behind NDAs]

The net result is that within the IT industry overall we’re awash with NDAs. It reminds me of the Glorious Loyalty Oath Crusade, from my favourite book, Catch-22:

Almost overnight the Glorious Loyalty Oath Crusade was in full flower, and Captain Black was enraptured to discover himself spearheading it. He had really hit on something. All the enlisted men and officers on combat duty had to sign a loyalty oath to get their map cases from the intelligence tent, a second loyalty oath to receive their flak suits and parachutes from the parachute tent, a third loyalty oath for Lieutenant Balkington, the motor vehicle officer, to be allowed to ride from the squadron to the airfield in one of the trucks. Every time they turned around there was another loyalty oath to be signed. They signed a loyalty oath to get their pay from the finance officer, to obtain their PX supplies, to have their hair cut by the Italian barbers. To Captain Black, every officer who supported his Glorious Loyalty Oath Crusade was a competitor, and he planned and plotted twenty-four hours a day to keep one step ahead. He would stand second to none in his devotion to country. When other officers had followed his urging and introduced loyalty oaths of their own, he went them one better by making every son of a bitch who came to his intelligence tent sign two loyalty oaths, then three, then four; then he introduced the pledge of allegiance, and after that “The Star-Spangled Banner,” one chorus, two choruses, three choruses, four choruses. Each time Captain Black forged ahead of his competitors, he swung upon them scornfully for their failure to follow his example. Each time they followed his example, he retreated with concern and racked his brain for some new stratagem that would enable him to turn upon them scornfully again.

Sometimes it seems we’re stuck in the middle of a Great NDA Crusade, and just like in Catch-22, we need a Major –– de Coverley, who can say:

“Gimme eat.”

Instead of eat, Corporal Snark gave Major –– de Coverley a loyalty oath to sign. Major –– de Coverley swept it away with mighty displeasure the moment he recognized what it was, his good eye flaring up blindingly with fiery disdain and his enormous old corrugated face darkening in mountainous wrath.

“Gimme eat, I said,” he ordered loudly in harsh tones that rumbled ominously through the silent tent like claps of distant thunder.

Corporal Snark turned pale and began to tremble. He glanced toward Milo pleadingly for guidance. For several terrible seconds there was not a sound. Then Milo nodded.

“Give him eat,” he said.

Corporal Snark began giving Major –– de Coverley eat. Major –– de Coverley turned from the counter with his tray full and came to a stop. His eyes fell on the groups of other officers gazing at him in mute appeal, and, with righteous belligerence, he roared:

“Give everybody eat!”

“Give everybody eat!” Milo echoed with joyful relief, and the Glorious Loyalty Oath Crusade came to an end.

(Catch-22, ISBN 978-0-999-47046-5, Joseph Heller, First Published in Great Britain in 1962. Thanks also to The Sheila Variations website, which saved me from retyping those sections by having already quoted them.)

I want a vendor who will be the Major –– de Coverley of the industry. A vendor who will stand up and say “enough is enough” to frivolous NDAs that do nothing more than stifle discussion.

I’m not calling for an end to NDAs. There are some NDAs that should be preserved. For instance, I’d never argue for the cessation of NDAs when it comes to alpha/beta testing. I’d also suggest that long term forecasts should fall under the realm of NDAs too. (Those are two examples of where the “20%” or so of NDAs that I estimate to be valid come from.)

But what’s long term? That’s a year out, at least. Within that time frame? You should be confident enough in your development programme that you can talk about it to everyone, not just people under NDA. Hell, even if you want to bring this back to only six months, there should be a “forward looking” period that vendors are comfortable talking about without NDA shields. After all, let’s face it: everything published under an NDA still starts with various comments such as:

The items discussed in this document contain forward-looking statements that reflect … blah blah blah … it is our aim to get there … blah blah blah … but don’t hold us to anything if we don’t get there.

So it’s not as if the information discussed in NDAs is so rock solid that you can take bets on it anyway! So then … make those same caveats then pull out the useful information about upcoming features!

For information about features and products that are going to come out within 6-12 months, there’s no point in keeping it under NDA. In fact, it does more harm than good, especially when you’re talking to a company that wants to buy something, but needs to know where it’s heading. It leads to situations where a product is, say, disqualified from consideration because it doesn’t have a feature yet: even though the feature will be available by the time the purchase decision is made, it’s so tightly bound up in an NDA that the message doesn’t get heard.

I know there’s the argument that new features, or perhaps more importantly, upcoming features, need to be protected from competitors. Does anyone seriously think NDAs shield anyone from this? Employees routinely shift from vendor to vendor, and while they’re usually under non-compete clauses, and clauses that restrain them from discussing products and features they were working on, those clauses only last so long – in most cases seemingly limited to 12 weeks or so. In short – if vendor A wants to know what vendor B is up to, they poach staff, or watch who they’re purchasing and make educated guesses.

Not only that, every vendor that has a clue has fairly heavily populated product development strategies ranging from 6 months to 2 years out, and just hearing that someone is going to implement some technology doesn’t mean that a competitor can instantly slot in development resources to ape that functionality too. (Assuming they don’t already have the technology – it can be a case of “catch up” sometimes.)

So, would much change for vendors if less were kept under NDA? It seems bloody unlikely.

“Ah”, some would say, “It’s not just the competitors. It’s also the risk of being sued by a company if they purchase X on the basis of us implementing some feature A that we’ve talked about, but for some reason we don’t get around to it in the specified timeframe.”

“Um, so what?” would be my response to this. There are two very important rejoinders to the above arguments:

  1. Make forward looking statements with the standard caveats that are already heavily applied to NDAs anyway; i.e., it works for an NDA situation, so why won’t it work for an ordinary situation?
  2. Only talk about things that are well within development scope – again, we’re talking about that period of up to 6 or 12 months out from now. That should be things that you’re reasonably confident of achieving.

“Ah”, some would say, “Then there’s stymieing by proxy – even if competitors don’t intend to implement the same thing we’re doing, they’ll just talk about doing it to convince people to stick with them, or buy them instead.”

To this I would say: Companies that repeatedly talk about products or features they then don’t go on to release in time (or at all) quickly get a reputation for vaporware. So don’t get too hung up about that – the market usually deals with vaporware vendors very efficiently.

“Ah”, some would say, “But what about the Osborne Effect?” To this I’d say that, particularly with mature product ranges, there shouldn’t regularly be an upcoming update that’s so earth shattering that it would cause someone to hold off buying until that is released. If someone needs a backup product now, or an array now, or a tape library now, they won’t keep on indefinitely putting it off just because there are bigger and better things around the corner. Guess what? We’re all in IT here – we all know that products have a fairly defined ride between superiority, regularity and obsolescence. Or as the old saying goes: if you keep waiting for the best computer to be released before you buy, you’ll never buy a computer.

In situations where there’s potential upheaval, have a clear upgrade strategy that clearly states and amortizes the cost appropriately – most companies will thank you. On the other hand, what they won’t thank you for is a situation where they buy a product from you that gets end of lifed or shelved shortly thereafter without any advance warning or clear roadmap of a way forward. I’ve seen multiple instances where vendors have permanently soured relationships with managers at customer sites. This makes the technical person at the site that recommended the purchase look bad, or worry about looking bad. And it also makes the manager who authorised the purchase worry that they “look bad”. Such issues don’t remain at that customer site – unresolved failures in customer satisfaction roll forward into every site that a person moves on to. Trust me – I’ve seen it, I know managers who refuse to buy products from vendor X for exactly that reason, and they’ve carried it through as policy on sites they’ve moved on to.

Being upfront, on the other hand, encourages customers to believe you have their best interest at heart. For instance, companies are still happily buying LTO-4 tape libraries, particularly from vendors offering free LTO-5 drive swap-ins, or even in situations where they know there’ll be a (relatively) small fee.

What we need is for the vendors to start to frankly evaluate where they’re slapping NDAs about. Sometimes it’s like navigating through a sea of pamphlet wielders at a train station – or a voting booth.

Come on vendors – reappraise where and how frequently you’re throwing NDAs around and prove to us that you actually live in the same information-rich world that you want to supply products to. Tone the NDAs down and use them appropriately, and use them sparingly. If you want another analogy – it’s becoming a bit too “boy who cried wolf”, quite frankly.

     

Over at a website called ignore the code, there’s a fascinating and insightful piece at the moment about removing features.

This is often a controversial topic in software design and development, and Lukas Mathis handles the topic in his typically excellent style. In particular, the summation of the problem through illustrations of two “Swiss Army Knives” demonstrates the issue quite well.

So what does this have to do with NetWorker, you might ask? Well, quite a bit. In light of the recent release of NetWorker 7.5 SP2 I thought it relevant to spend a little time ruminating about the software development process, relating it to NetWorker, and asking EMC product management some questions about their processes.

    Within any software development model, there are four requirements:

    1. Adding new features.
    2. Refining existing features.
    3. Removing obsolete features.
    4. Fixing bugs.

    It’s a challenging problem – any one or two of these requirements can be readily accommodated without much fuss. The challenge that faces all vendors though is balancing all four software development processes. Personally, I don’t envy the juggling process that faces product managers and product support managers on a daily basis. Why? All four requirements combined create clashing priorities and schedules that make for a very challenging environment. (It’s not unique to NetWorker of course – it applies pretty equally to just about every software product.)

    In most situations, it’s easiest to add new features. This can be a double-edged sword. On the positive side, it can be a key factor in enticing potential customers to become actual customers, and it can equally be a key factor in enticing existing customers to remain customers rather than moving to the competition. On the negative side, it can lead to software bloating – a primary criticism of companies like Microsoft and Adobe. (Thankfully, I don’t think you can accuse NetWorker of being too ‘bloated’; in the 14 or so years I’ve been using it, the install footprint has of course gone up, but there’s not really been any “why the hell did they do that?” new features, and overall the footprint is well within the bounds for backup and recovery software.)

    Like any good backup product, NetWorker’s development history is full of new features being added to it, such as the following:

    1. Storage nodes added in v5.x.
    2. Dynamic drive sharing added in v6.
    3. Advanced File Type Devices (ADV_FILE) added in v7.
    4. Jobs database introduced in v7.3.
    5. Virtualisation visualisation in v7.5.
    6. and so on.

    Without new features being regularly added, companies leave themselves open to being overtaken by the competition. So when we see a vendor respond to market forces (or try to push the market in a new direction), we should accept, even if we aren’t particularly fond of the new feature, that adding new features is inevitable in software development.

    Equally, NetWorker history is rife with examples of existing features being refined, such as the following:

    1. Support for dedicated storage nodes.
    2. Enhancing the index system in v6 to overcome previous design limitations.
    3. Enhancing the resource configuration database in v7 to overcome previous design limitations.
    4. Frequent enhancement of all the database and application backup modules.
    5. Pool based retention.
    6. and so on.

    You could say that feature refinement is all about evolutionary growth of the product. It’s never specifically about introducing entirely new features – these are existing features that have grown between releases, usually in response to changing requirements in customer environments. (For instance, the previous resource configuration database worked well so long as you had smallish environments. Over time as environments became more complex, with more clients, and increased configuration requirements, it could no longer cut the mustard, triggering the redesign.)

    The more challenging aspect for enterprise backup software is the notion of removing features – if doing so affects legacy recoverability options, it could cause issues for long-term users of the products, and so we usually see usability features removed rather than core support features. A few of the features that have been removed over time are:

    1. Support for the old GUIs (networkr.exe from Windows, nwadmin from Unix).
    2. Support for browsing indices via NFS mounts. (This was even before my time with NetWorker. It looks like it would have been fun to play with, but it wasn’t exactly cross-platform compatible!)
    3. Support for cross platform recoveries.
    4. Support for defunct tape formats (e.g., VHS).

    I’d argue that it’s rarely the case that decisions to remove functionality are taken lightly. Usually it will be for one of three reasons:

    • The feature was ‘fragile’ and fixing it would take too much effort.
    • The feature is no longer required after a change in direction for the product.
    • The feature is no longer being used by a sufficient number of users and its continued presence would hamper new directions/features for the product.

    None of these, I’d argue, are easy decisions.

    Finally we have the bugs – or “unanticipated features”, as we sometimes like to call them. Any vendor that tells you their software is 100% bug free is either lying, or their ‘product’ is no more complex than /bin/true. Bugs are practically unavoidable, so the focus must be on solid testing, identification and containment. I’ll be the first to admit that there have been spotty patches in the past where testing in NetWorker has seemed to be lacking, but having been on the last couple of betas, I’m seeing a roaring return to rigorous testing in 7.5 and 7.6. Did these pick up all bugs? No – again, see my point about no software ever being 100% bug free.

    I’ll hand on my heart say that I can’t cite a single company that has had a spotless record when it comes to bug control – this isn’t easy. Enterprise class backup software introduces new levels of complexity into the equation, and it’s worthwhile considering why. You can take exactly the same piece of enterprise backup software and install it into 50 different companies and I’ll bet that you’ll get a significant number of “unique” situations in addition to the core/standard user experience. Backup software touches on practically every part of an IT environment, and so is affected by a myriad of environment and configuration issues that normal software rarely has to contend with. Or to put it better: while another piece of software may have to contend with one or two isolated areas of environment/configuration uniqueness, backup software will usually have to contend with all of them, and remain as stable as possible throughout.

    This isn’t easy. I may periodically get exasperated over bugs, etc., but I recognise the inevitability that I’ll be continuing to deal with bugs in any software I’m using for the rest of my life – so it’s hardly a NetWorker-specific issue. (I’m going on the basis here that quantum computing won’t suddenly deliver universal Turing machines capable of simulating every possible situation and input for software and hardware.)

    While I was writing this article, I thought it would be worthwhile to get some feedback from EMC NetWorker product management on this, and I’m pleased to include my questions to them, as well as their answers, below. These answers come from product management and engineering, and I’m presenting them unedited in their complete form.

    Question 1

    I’ve been told that EMC has taken considerable steps to speed up the RFE process. Can you briefly summarise the improvements that have been made and the buy-in from product management and engineering on this?

    Answer:

    With the large size of the NetWorker installed base, we receive many RFEs per month. These requests range in nature from architectural changes to relatively small operational enhancements. We have made great strides in organizing the RFE pool in such a manner so that at the front end of the release planning process we can look back over hundreds of discreet requests and digest those requests into an achievable number of specific and prioritized product requirements.

    RFEs come in to the product team through three sources. We take RFEs on PowerLink (EMC’s information portal), through the Support organization, and in face to face meetings with customers and partners. NetWorker Product Management has a central database so that we can consolidate the RFE pool and apply a standard process for scrubbing and categorizing the requests. This is a time consuming process, but it provides us with the capabilities to track the areas of the product that are receiving the most requests and. That allows us to establish goals for a particular release and include RFEs accordingly. An example might be improved back up to disk workflows. The ability to quickly drill down to the requests most relevant to our high-level priorities allows us to efficiently write requirements that directly incorporate end-user feedback.

    More customer requests for enhancement will be implemented in 2010 than ever before.  We will address some of the big changes that customers have been calling for, and will also look to implement some bonus enhancements; small changes that won’t make the marketing slides but will make NetWorker operations easier on backup administrators who interact with the product on a daily basis.

    Question 2

    One challenge with any software vendor is integrating patches (or hot fixes) into stable development trees. How would EMC rate itself with this in relation to NetWorker?

    Answer:

    We maintain a high level of discipline in maintaining our active code branches.  Hot fixes typically flow into our bug-fix service packs, (such as 7.5 SP1) which then flow back into the main code branch. Any code change made to an active branch must also be applied to the development branch, which builds on a regular basis. Build failures in development are taken very seriously by Engineering, and we engage resources to actively troubleshoot and resolve these issues.

    Question 3

    Currently we’re seeing cumulative patch cluster releases for most of the supported versions of NetWorker. E.g., NetWorker 7.5 SP1 is now up to cumulative patch cluster 8. These patch clusters currently remain available only via EMC support or partner support programs, and aren’t readily downloadable via standard PowerLink sources. With the projects currently being worked on to improve PowerLink, will we see this change, or is the rationale to not readily provide these cumulative patches a support one?

    Answer:

    When we post to PowerLink, we want to be sure that anyone who downloads code from EMC knows exactly what they’re getting. If we posted all of the clusters within today’s PowerLink framework, the result would be a confusing PowerLink experience for customers.  We consider the patch cluster process to be an improvement on earlier practices and look forward to continued improvements in this area.

    Question 4

    What feature are you most pleased to have seen integrated into either NetWorker 7.5 or 7.6?

    Answer:

    We are very pleased with the NetWorker Management Console work that has done over the course of 7.5 and 7.6. Visualization of virtual environments (introduced in 7.5) has been very well received by customers, and we believe that the improvements in 7.6 around customization and performance will also be greatly appreciated as customers move to 7.6+ releases.

    Question 5

    One RFE process advocated is to have product management vet RFEs and submit them to a public forum to be voted on by community users. Advocates of this model say that it allows better community involvement and has products evolve to meet existing user requirements. Those who disagree with this model usually suggest that existing user feature suggestions don’t always accommodate design changes that would help boost market share. Is this a model which EMC has considered, or is it seeking to informally do this via the various EMC Community Forums that have been established?

    Answer:

    A closed loop is ideally what our enterprise customers who submit RFEs look for i.e. to enter an RFE, track it, see if it is relevant and will be seriously considered.  Capturing and allowing other users to vote is an option we are actively exploring. We would have to put some infrastructure in place to do so, but it is under investigation. The first audience for such an option would be our recently launched EMC community for NetWorker. The NetWorker user community is quite sophisticated, and we value their input tremendously. While it is true that some users take a narrow view of how NetWorker should evolve, others take a broader and more market-centric view. Our RFEs run the full spectrum.

     

    While I touched on this in the second blog posting I made (Instantiating Savesets), it’s worthwhile revisiting this topic more directly.

    Using ADV_FILE devices can play havoc with conventional tape rotation strategies; if you aren’t aware of these implications, it could cause operational challenges when it comes time to do recovery from tape. Let’s look at the lifecycle of a saveset in a disk backup environment where a conventional setup is used. It typically runs like this:

    1. Backup to disk
    2. Clone to tape
    3. (Later) Stage to tape
    4. (At rest) 2 copies on tape

      Looking at each stage of this, we have:

      [Figure: Saveset on ADV_FILE device]

      The saveset, once written to an ADV_FILE volume, has two instances. The instance recorded as being on the read-only part of the volume will have an SSID/CloneID of X/Y. The instance recorded as being on the read-write part of the volume will have an SSID/CloneID of X/Y+1. This higher CloneID is what causes NetWorker, upon a recovery request, to seek the “instance” on the read-only volume. Of course, there’s only one actual instance (which is why I object so strongly to the ‘validcopies’ field introduced in 7.6 reporting 2) – the two instances reported are “smoke and mirrors” to allow simultaneous backup to and recovery from an ADV_FILE volume.

      The next stage sees the saveset cloned:

      [Figure: ADV_FILE + Tape Clone]

      This leaves us with 3 ‘instances’ – 2 physical, one virtual. Our SSID/CloneIDs are:

      • ADV_FILE read-only: X/Y
      • ADV_FILE read-write: X/Y+1
      • Tape: X/Y+n, where n > 1.

      At this point, any recovery request will still call for the instance on the read-only part of the ADV_FILE volume, so as to help ensure the fastest recovery initiation.

      At some future point, as disk capacity starts to run out on the ADV_FILE device, the saveset will typically be staged out:

      [Figure: ADV_FILE staging to tape]

      At the conclusion of the staging operation, the physical + virtual instances of the saveset on the ADV_FILE device are removed, leaving us with:

      [Figure: Savesets on tape only]

      So, at this point, we end up with:

      • A saveset instance on a clone volume with SSID/CloneID of: X/Y+n.
      • A saveset instance on (typically) a non-clone volume with SSID/CloneID of: X/Y+n+m, where m > 0.

      So, where does this leave us? (Or if you’re not sure where I’ve been heading yet, you may be wondering what point I’m actually trying to make.)

      Note what I’ve been saying each time – NetWorker, when it needs to read from a saveset for recovery purposes, will want to pick the saveset instance with the lowest CloneID. At the point where we’ve got a clone copy and a staged copy, both on tape, the clone copy will have the lowest CloneID.

      The net result is that when neither tape is online, NetWorker will request the clone volume for recovery – even though in the vast majority of cases, this will be the volume that’s offsite.
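
To make the lifecycle above concrete, here is a minimal Python sketch of the CloneID bookkeeping. It is only a model of the behaviour described in this post, not NetWorker code: the `Instance` class, the `lowest_clone_id` helper and the numeric CloneIDs (a base of 1000, with n = 5 and m = 4) are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    location: str  # where this instance of the saveset lives
    clone_id: int  # illustrative integer standing in for a CloneID


def lowest_clone_id(instances):
    """Return the instance with the lowest CloneID, as described in the text."""
    return min(instances, key=lambda inst: inst.clone_id)


Y = 1000  # hypothetical base CloneID

# 1. Backup to disk: one physical saveset, reported as two instances.
after_backup = [
    Instance("ADV_FILE read-only", Y),
    Instance("ADV_FILE read-write", Y + 1),
]

# 2. Clone to tape: the clone gets CloneID Y+n, with n > 1 (here n = 5).
after_clone = after_backup + [Instance("clone tape", Y + 5)]

# 3. Stage to tape: both ADV_FILE instances are removed and the staged
#    copy gets CloneID Y+n+m, with m > 0 (here m = 4).
after_stage = [i for i in after_clone if not i.location.startswith("ADV_FILE")]
after_stage.append(Instance("staged tape", Y + 9))

print(lowest_clone_id(after_clone).location)  # ADV_FILE read-only
print(lowest_clone_id(after_stage).location)  # clone tape
```

Once staging has happened, the lowest remaining CloneID belongs to the clone, which is exactly the tape most likely to be sitting offsite.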

      For NetWorker versions 7.3.1 and lower, there was only one solution to this – you had to hunt down the actual clone saveset instances NetWorker was asking for, mark them as suspect, and reattempt the recovery. If you managed to mark them all as suspect, then you’d be able to ‘force’ NetWorker into facilitating the recovery from the volume(s) that had been staged to. However, after the recovery you had to make sure you backed out of those changes, so that both the clones and the staged copies would be considered not-suspect.

      Some companies, in this situation, would instigate a tape rotation policy such that clone volumes would be brought back from off-site before savesets were likely to be staged out, with subsequently staged media sent offsite. This has a dangerous side-effect of temporarily leaving all copies of backups on-site, jeopardising disaster recovery situations, and hence it’s something that I couldn’t in any way recommend.

      The solution introduced around 7.3.2, however, is far simpler – a volume flag called offsite, set with nsrmm. This isn’t to be confused with the convention of setting a volume location field to ‘offsite’ when the media is removed from site. Annoyingly, the flag remains unqueryable; you can set it, and NetWorker will use it, but you can’t, say, search for volumes with the ‘offsite’ flag set.

      The offsite flag has to be manually set, using the command:

      # nsrmm -o offsite volumeName

      (where volumeName typically equals the barcode).

      Once this is set, NetWorker’s standard saveset (and therefore volume) selection criteria are subtly adjusted. Normally if there are no online instances of a saveset, NetWorker will request the saveset instance with the lowest CloneID. However, saveset instances that are on volumes with the offsite flag set will be deemed ineligible and NetWorker will look for a saveset instance that isn’t flagged as being offsite.
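
The adjusted selection rule can be sketched in a few lines of Python. Again, this is a model of the behaviour described here rather than anything taken from NetWorker itself. The names (`Instance`, `volume_to_request`), the barcodes and the CloneIDs are hypothetical. Two details are my assumptions: that an online instance is always preferred, and that the function returns `None` when every instance is flagged offsite (a case this post doesn’t cover).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instance:
    volume: str            # barcode of the volume holding this instance
    clone_id: int          # illustrative integer standing in for a CloneID
    online: bool = False   # volume is currently available to NetWorker
    offsite: bool = False  # volume has the offsite flag set


def volume_to_request(instances: Sequence[Instance]) -> Optional[str]:
    """Pick the volume a recovery would ask for, per the rules in the text."""
    online = [i for i in instances if i.online]
    if online:
        # Assumption: any online instance beats an offline one.
        return min(online, key=lambda i: i.clone_id).volume
    # Nothing online: instances on offsite-flagged volumes are ineligible.
    eligible = [i for i in instances if not i.offsite]
    if not eligible:
        return None  # not covered by the text; left undefined in this sketch
    return min(eligible, key=lambda i: i.clone_id).volume


staged = Instance("STG001", clone_id=1009)

# Without the flag, the clone's lower CloneID wins.
print(volume_to_request([Instance("CLN001", clone_id=1005), staged]))  # CLN001

# With the flag set on the clone volume, the staged copy is requested instead.
print(volume_to_request([Instance("CLN001", clone_id=1005, offsite=True), staged]))  # STG001
```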

      The net result is that when following a traditional backup model with ADV_FILE disk backup (backup to disk, clone to tape, stage to tape), it’s very important that tape offsiting procedures be adjusted to set the offsite flag on clone volumes as they’re removed from the system.
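
If your offsiting procedure already produces a list of barcodes, setting the flag is easy to script. The sketch below is one hedged way to do it in Python: it wraps only the `nsrmm -o offsite volumeName` command shown above and defaults to a dry run that prints the commands. The function names and barcodes are made up, and it assumes nsrmm is on the PATH and runs without prompting, which you should check on your own version before turning the dry run off.

```python
import shlex
import subprocess


def offsite_commands(barcodes):
    """Build one 'nsrmm -o offsite <volume>' argument list per barcode."""
    return [["nsrmm", "-o", "offsite", b.strip()] for b in barcodes if b.strip()]


def mark_offsite(barcodes, dry_run=True):
    """Print the commands (dry run) or execute them on the NetWorker server."""
    for cmd in offsite_commands(barcodes):
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops at the first volume nsrmm rejects.
            subprocess.run(cmd, check=True)


# Hypothetical barcodes for the clone volumes leaving site today.
mark_offsite(["CLN001", "CLN002"])
```

Defaulting to a dry run is deliberate: the flag can’t be queried afterwards, so it’s worth reviewing the list before it’s applied.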

      The good news is that you don’t normally have to do anything when it’s time to pull the tape back onsite. The flag is automatically cleared* for a volume as soon as it’s put back into an autochanger and detected by NetWorker. So when the media is recycled, the flag will be cleared.

      If you come from a long-term NetWorker site and the convention is still to mark savesets as suspect in this sort of recovery scenario, I’d suggest that you update your tape rotation policies to instead use the offsite flag. If on the other hand, you’re about to implement an ADV_FILE based backup to disk policy, I’d strongly recommend you plan in advance to configure a tape rotation policy that uses the offsite flag as cloned media is sent away from the primary site.


      * If you did need to explicitly clear the flag, you can run:

      # nsrmm -o notoffsite volumeName

      Which would turn the flag back off for the given volumeName.

       

      My boss, on his blog, has raised a pertinent question – if it’s so important, according to some vendors, that backup and archive are all achieved through the same product interface, then how many companies out there assign the role of archive administrator to the backup administrator? (Or vice versa).

      I like this question; it’s kind of like the old conundrum of whether the dog wags the tail, or whether the tail wags the dog. That is, are companies that heavily push an integrated backup and archive interface:

      • Responding to the needs of IT to meet current desired business functionality, or,
      • Are they trying to drive IT in a way that perhaps doesn’t meet desired business functionality?

      (Or indeed, something else entirely).

      [Edit, further thoughts, 2010-03-03] I’ve been thinking more about this, and I have to say I can’t think of a single customer environment off-hand where the backup administrator is also responsible for archiving. Archiving seems to remain primarily the purview of the storage administration teams in sites that I’m aware of, so it does raise the question – how beneficial is an integrated backup and archive administration process?

      [Original wrap-up] So if you’ve got any thoughts on the integration of backup and archive administration, either at the software or the human resources layer, I’d encourage you to jump across to Mike’s blog and make your voice heard.

      (As a first, I’ve disabled comments on this blog posting, so as to encourage discussion to remain in one location – the source article.)

       

      Close enough together that I have to declare them a tie, the top stories for February were:

      It’s fair to say that Carry a jukebox with you is remaining a big hit all the time – a bit like the “NSR peer information” story, and so February will be the last month that it gets included in consideration for top articles.

      Towards the end of the month, with the release of NetWorker 7.5 SP2, there was quite a lot of interest in the articles “NetWorker 7.5.2 released” and “NetWorker 7.5.2 – What’s it got?“. Obviously if you’ve got Windows 2008 or Windows 7 clients that you need to back up, 7.5 SP2 is almost a no-brainer – you’ll really need to be using it. So far, based on my testing on Linux, 7.5 SP2 is looking fairly good for that platform too. As always, everyone should read the release notes before deciding whether to upgrade their environments.

      © 2012 The NetWorker Blog