While initially I had some success with Snow Leopard and Mac OS X, I’m increasingly finding that it’s just boiling down to being too random for reliable backups. So far problems mainly seem to occur after a machine has gone to sleep and woken up multiple times – or had its network location changed multiple times. Thus it mainly seems (for the moment) to affect laptops or machines that frequently sleep.
The net result is that you’ll get into situations where several errors will start to happen and you’ll need to eventually reinstall the NetWorker client, reboot, and then potentially reinstall the NetWorker client another time. Note that complete cold restarts do not seem to fix the issues (or even offer a temporary workaround) as reliably as the reinstall/reboot/reinstall method does.
Error 1
Attempts to connect from the server to the client will fail – e.g.,
[root@nox ~]# nsradmin -p 390113 -s archon
39078:nsradmin: RPC error: Remote system error
There does not appear to be a NetWorker nsrexecd server running on archon.
Error 2
Stopping and restarting the NetWorker services on the client fails:
root@archon ~
$ SystemStarter stop NetWorker
Stopping NetWorker Client.
root@archon ~
$ ps -eaf | grep nsr
0 5381 5230 0 0:00.00 ttys001 0:00.00 grep nsr
root@archon ~
$ SystemStarter start NetWorker
Starting NetWorker Client.
/Library/StartupItems/NetWorker/NetWorker: line 10: 5389 Illegal instruction /usr/sbin/nsrexecd
Error 3
I’m finding that directives are getting confused over directories and paths too:
* archon:/ 70340:savepnpc: ignoring directory specification for `/Users/preston/Library/Application Support/Yojimbo/' in
* archon:/ `/Users/preston/Library/Application Support/Yojimbo/.nsr' - not contained within directory `/users/preston/Library/Application Support/Yojimbo/'
* archon:/ 70340:savepnpc: ignoring directory specification for `/Users/preston/Library/Parallels/' in
* archon:/ `/Users/preston/Library/Parallels/.nsr' - not contained within directory `/users/preston/Library/Parallels/'
It seems to be a spurious error – usually when this happens the directives are still processed.
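One possible reading of the messages above is that the directive's location (`/Users/...`) is being compared case-sensitively against a lower-cased copy of the same path (`/users/...`). On a case-insensitive, case-preserving file system both spellings name the same directory, so the complaint would be spurious. The sketch below only illustrates that guess; it is not NetWorker's actual logic, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def contained_case_sensitive(path: str, directory: str) -> bool:
    """Return True if `path` lies inside `directory`, comparing names exactly."""
    p, d = PurePosixPath(path), PurePosixPath(directory)
    return p.parts[:len(d.parts)] == d.parts


def contained_case_insensitive(path: str, directory: str) -> bool:
    """Same check after folding case, as a case-insensitive file system would."""
    return contained_case_sensitive(path.casefold(), directory.casefold())


# Paths taken from the savepnpc messages above.
directive = "/Users/preston/Library/Parallels/.nsr"
reported_dir = "/users/preston/Library/Parallels/"

print(contained_case_sensitive(directive, reported_dir))    # False
print(contained_case_insensitive(directive, reported_dir))  # True
```

If that guess is right, the warning would say more about how the path is normalised before comparison than about the directive itself, which would fit with the directives still being processed.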
Error 4
On some backups – usually fulls – I get hundreds of malloc errors in the savegroup completion – e.g.,
* archon:/ savepnpc(668,0xa0a01500) malloc: *** error for object 0x20: pointer being freed was not allocated
* archon:/ *** set a breakpoint in malloc_error_break to debug
* archon:/ savepnpc(668,0xa0a01500) malloc: *** error for object 0x20: pointer being freed was not allocated
* archon:/ *** set a breakpoint in malloc_error_break to debug
What I’m doing
I’ve currently got a question case open with EMC asking when we’ll get official support for Snow Leopard. I’ll update this blog with details when I can.
[Update] There are existing escalations to get Snow Leopard support. The current tentative schedule, I’m told, is for support in NetWorker 7.6 SP1. There are apparently escalations against 7.5.x as well – personally, if I were a betting person, I’d be betting we’ll more likely get support in just 7.6 via SP1 rather than both 7.6 and the 7.5 tree.
I want to start this article by saying that I’m bound by NDAs all over the place. The company that I work for, being partners with a variety of companies, has NDAs in place for each vendor that results in me being under an NDA as well. Thus, I’m not going to:
- Break any NDAs
- Advocate violating NDAs
I’m bound by those NDAs in what I write on this blog – I attend partner briefing con-calls/presentations etc., periodically, and get told about upcoming features or more generally roadmaps going up to 2 years out. I’m involved in beta testing – version and feature – and so I get to see things before a lot of other people. I also get to talk directly to product management at vendors too. So to any vendor reading this, I hope you’ll understand that I’ll still follow all your NDA processes.
Just because I’m bound by NDAs doesn’t mean I can’t talk about where I think they’re wrong.
There’s a growing chorus of “NDAs suck” at the moment, and I’m not laying claim to the idea of blogging about the suck-value of NDAs on my own. I’ve reached the point of wanting to blog about it based on the previous efforts of Grumpy Storage in “Show me the Money (Information)“, and more recently in Matthew Yeager’s “First, execute with urgency. The rest is commentary“. (Incidentally, that’s two people you really should be following on Twitter – ianhf and mpyeager respectively.)
Over at Grumpy Storage, Ian, as an end-customer, wrote:
I need electronic copies of any & all materials discussed or presented – no exceptions, without this I can’t use it as reference material in my internal strategy planning. If you hide behind “it’s beyond NDA”, or “NDA prohibits” then I’ll interpret that as “you don’t trust me personally or respect me professionally” and the relationship will be difficult from then on.
This is a pretty damning comment on Ian’s part, and realistically represents how a lot of customers feel about NDAs – and this may be the surprising part – how a lot of suppliers and system integrators feel about them too. (I think he’s wrong about where the trust issue lies, and I’ll get to that soon.)
Matthew drew up an excellent summary of how NDAs protect intent over execution, and some possible solutions to this, and I’d suggest you consider reading both Ian’s and Matthew’s articles in full before continuing with what I’m going to say.
My argument is that NDAs themselves don’t suck. However, I do feel that the vast majority of instances in which NDAs are applied do, indeed, suck.
Trusted partners/suppliers are often “piggy in the middle” when it comes to NDAs. Where we frequently add value is by being closely aligned to our customers (who we prefer to also call partners), working at understanding their business requirements and delivering solutions and information that are tailored to suit those requirements. We recognise that time is precious, attention is a currency, and that the work of IT managers and staff isn’t to be sold to by a business, but to deliver to the business. By having the time to work directly with businesses, we offer a value-add that bungee-vendor sales rarely if ever can. That’s why a lot of companies choose to work with integrators and suppliers rather than vendors directly. As such, perhaps more than end-customers, as an integrator I can look at the various pieces of information I know that are locked away under NDA and really, really regret that I can’t readily tell my customers to help them with their forward planning.
So in that sense, NDAs are a constant case of “Here’s some really good information! But. You. Can’t. Tell. Anyone.”
Now, my beef with NDAs is not that they exist – I’m a fierce proponent of intellectual property protection. My beef is in where NDAs are applied. Or perhaps to be more succinct – in the frequency with which NDAs are applied. It’s too often. It’s across the board on a range of things where it logically makes no sense, and it’s often for the wrong reasons.
Ian at Grumpy Storage sees NDAs as a trust issue. I agree, but I think he’s (understandably) missing where the trust issue really exists. You see, in big companies – and most vendors fall into this category – few people have “authority”. In this case, by authority, I’m talking about authority to discuss information on unreleased products or features with non-employees. This goes to the heart of corporate secrecy, and if companies should understand anything by now it’s that social networking is eroding this. So it’s trust alright, but the trust issue is in companies mistrusting their staff to make sensible judgment calls, or mistrusting the market to such a degree that the wrong disclosure decisions are made.
Recently, a senior vendor employee told me the following in relation to consulting:
“giving away info” is exactly what consultants need to do — controversial, but effective
Here’s the rub: the same applies to most situations where NDAs are pulled out. That is, in places where information is currently bartered (“I’ll tell you, but only if you sign this document that says I can sue you if you tell anyone else”), it should be flowing freely. (Call it the next step in the Cluetrain Manifesto if you will.) This is something that’s imperative to turn around. It’s already important with this generation, but just think of how important it’s going to be in a business environment saturated with Gen-Y’ers, all of whom thrive on interchange and connectivity. (I’ve not said it so succinctly before, but I think Gen-Y is going to cause one of the biggest upheavals ever experienced in business communications, practices and procedures.)
I’d wager that the following two reasons sum up most of the times that NDAs are waved around:
- Vendor employees are insufficiently empowered as to be able to make a judgment call that the people they are speaking to can be trusted. Lacking this empowerment, they must take the safe approach. (Hey, they need jobs too.)
- Vendor management and legal frequently resort to the knee jerk reaction (sometimes due to a lack of empowerment themselves) of trying to hide as much information as possible.
These, of course, are on top of the actual valid reasons why we have NDAs – to protect key components of intellectual property. However, those valid reasons are definitely in the minority. If a picture helps, I’d suggest the following breakdown is fairly indicative of why vendors ask people to sign NDAs:

The net result is that within the IT industry overall we’re awash with NDAs. It reminds me of the Glorious Loyalty Oath Crusade, from my favourite book, Catch-22:
Almost overnight the Glorious Loyalty Oath Crusade was in full flower, and Captain Black was enraptured to discover himself spearheading it. He had really hit on something. All the enlisted men and officers on combat duty had to sign a loyalty oath to get their map cases from the intelligence tent, a second loyalty oath to receive their flak suits and parachutes from the parachute tent, a third loyalty oath for Lieutenant Balkington, the motor vehicle officer, to be allowed to ride from the squadron to the airfield in one of the trucks. Every time they turned around there was another loyalty oath to be signed. They signed a loyalty oath to get their pay from the finance officer, to obtain their PX supplies, to have their hair cut by the Italian barbers. To Captain Black, every officer who supported his Glorious Loyalty Oath Crusade was a competitor, and he planned and plotted twenty-four hours a day to keep one step ahead. He would stand second to none in his devotion to country. When other officers had followed his urging and introduced loyalty oaths of their own, he went them one better by making every son of a bitch who came to his intelligence tent sign two loyalty oaths, then three, then four; then he introduced the pledge of allegiance, and after that “The Star-Spangled Banner,” one chorus, two choruses, three choruses, four choruses. Each time Captain Black forged ahead of his competitors, he swung upon them scornfully for their failure to follow his example. Each time they followed his example, he retreated with concern and racked his brain for some new stratagem that would enable him to turn upon them scornfully again.
Sometimes it seems we’re stuck in the middle of a Great NDA Crusade, and just like in Catch-22, we need a Major –– de Coverley, who can say:
“Gimme eat.”
Instead of eat, Corporal Snark gave Major –– de Coverley a loyalty oath to sign. Major –– de Coverley swept it away with mighty displeasure the moment he recognized what it was, his good eye flaring up blindingly with fiery disdain and his enormous old corrugated face darkening in mountainous wrath.
“Gimme eat, I said,” he ordered loudly in harsh tones that rumbled ominously through the silent tent like claps of distant thunder.
Corporal Snark turned pale and began to tremble. He glanced toward Milo pleadingly for guidance. For several terrible seconds there was not a sound. Then Milo nodded.
“Give him eat,” he said.
Corporal Snark began giving Major –– de Coverley eat. Major –– de Coverley turned from the counter with his tray full and came to a stop. His eyes fell on the groups of other officers gazing at him in mute appeal, and, with righteous belligerence, he roared:
“Give everybody eat!”
“Give everybody eat!” Milo echoed with joyful relief, and the Glorious Loyalty Oath Crusade came to an end.
(Catch-22, ISBN 978-0-099-47046-5, Joseph Heller, First Published in Great Britain in 1962. Thanks also to The Sheila Variations website, which saved me from retyping those sections by having already quoted them.)
I want a vendor who will be the Major –– de Coverley of the industry. A vendor who will stand up and say “enough is enough” to frivolous NDAs that do nothing more than stifle discussion.
I’m not calling for an end to NDAs. There are some NDAs that should be preserved. For instance, I’d never argue for the cessation of NDAs when it comes to alpha/beta testing. I’d also suggest that long term forecasts should fall under the realm of NDAs too. (Those are two examples of the “20%” or so of NDAs that I estimate are valid.)
But what’s long term? That’s a year out, at least. Within that time frame? You should be confident enough in your development programme that you can talk about it to everyone, not just people under NDA. Hell, even if you want to bring this back to only six months, there should be a “forward looking” period that vendors are comfortable talking about without NDA shields. After all, let’s face it: everything published under an NDA still starts with various comments such as:
The items discussed in this document contain forward-looking statements that reflect … blah blah blah … it is our aim to get there … blah blah blah … but don’t hold us to anything if we don’t get there.
So it’s not as if the information discussed in NDAs is so rock solid that you can take bets on it anyway! So then … make those same caveats then pull out the useful information about upcoming features!
For information about features and products that are going to come out within 6-12 months, there’s no point for that to be under NDA. In fact, it does more harm than good, especially when you’re talking to a company that wants to buy something, but needs to know where it’s heading. It leads to situations where products are say, disqualified for consideration because they don’t have a feature yet, but because it’s so tightly bound up in an NDA, even though it will be available by the time the purchase decision is made, the message doesn’t get heard.
I know there’s the argument that new features, or perhaps more importantly, upcoming features, need to be protected from competitors. Does anyone seriously think NDAs shield anyone from this? Employees routinely shift from vendor to vendor, and while they’re usually under non-compete clauses, and clauses that restrain them from discussing products and features they were working on, those clauses only last so long – in most cases seemingly limited to 12 weeks or so. In short – if vendor A wants to know what vendor B is up to, they poach staff, or watch who they’re purchasing and make educated guesses.
Not only that, every vendor that has a clue has fairly heavily populated product development strategies ranging from 6 months to 2 years out, and just hearing that someone is going to implement some technology doesn’t mean that a competitor can instantly slot in development resources in order to ape that functionality too. (Assuming they don’t already have the technology – it can be a case of “catch up” sometimes.)
So, would much change for competitors if less information were locked away under NDAs? It seems bloody unlikely.
“Ah“, some would say, “It’s not just the competitors. It’s also the risk of being sued by a company if they purchase X on the basis of us implementing some feature A that we’ve talked about, but for some reason we don’t get around to it in the specified timeframe.”
“Um, so what?” would be my response to this. There are two very important rejoinders to the above arguments:
- Make forward looking statements with the standard caveats that are already heavily applied to NDAs anyway; i.e., it works for an NDA situation, so why won’t it work for an ordinary situation?
- Only talk about things that are well within development scope – again, we’re talking about that period of up to 6 or 12 months out from now. That should be things that you’re reasonably confident of achieving.
“Ah“, some would say, “Then there’s stymieing by proxy – even if competitors don’t intend to implement the same thing we’re doing, they’ll just talk about doing it to convince people to stick with them, or buy them instead.”
To this I would say: Companies that repeatedly talk about products or features they then don’t go on to release in time (or at all) quickly get a reputation for vaporware. So don’t get too hung up about that – the market usually deals with vaporware vendors very efficiently.
“Ah“, some would say, “But what about the Osborne Effect?” To this I’d say that particularly with mature product ranges, there shouldn’t regularly be an upcoming update that’s so earth shattering that it would cause someone to hold off buying until that is released. If someone needs a backup product now, or an array now, or a tape library now, they won’t keep on indefinitely putting it off just because there are bigger and better things around the corner. Guess what? We’re all in IT here – we all know that products have a fairly defined ride between superiority, regularity and obsolescence. Or as the old saying goes: if you keep waiting for the best computer to be released before you buy, you’ll never buy a computer.
In situations where there’s potential upheaval, have a clear upgrade strategy that clearly states and amortizes the cost appropriately – most companies will thank you. On the other hand, what they won’t thank you for is a situation where they buy a product from you that gets end of lifed or shelved shortly thereafter without any advance warning or clear roadmap of a way forward. I’ve seen multiple instances where vendors have permanently soured relationships with managers at customer sites. This makes the technical person at the site that recommended the purchase look bad, or worry about looking bad. And it also makes the manager who authorised the purchase worry that they “look bad”. Such issues don’t remain at that customer site – unresolved failures in customer satisfaction roll forward into every site that a person moves on to. Trust me – I’ve seen it, I know managers who refuse to buy products from vendor X for exactly that reason, and they’ve carried it through as policy on sites they’ve moved on to.
Being upfront on the other hand encourages customers to believe you have their best interest at heart. For instance, companies are still happily buying LTO-4 tape libraries, particularly from vendors offering free LTO-5 drive swap-ins, or even in situations where they know there’ll be a (relatively) small fee.
What we need is for the vendors to start to frankly evaluate where they’re slapping NDAs about. Sometimes it’s like navigating through a sea of pamphlet wielders at a train station – or a voting booth.
Come on vendors – reappraise where and how frequently you’re throwing NDAs around and prove to us that you actually live in the same information-rich world that you want to supply products to. Tone the NDAs down and use them appropriately, and use them sparingly. If you want another analogy – it’s becoming a bit too “boy who cried wolf”, quite frankly.
Over at a website called ignore the code, there’s a fascinating and insightful piece at the moment about removing features.
This is often a controversial topic in software design and development, and Lukas Mathis handles the topic in his typically excellent style. In particular, the summation of the problem through illustrations of two “Swiss Army Knives” demonstrates the issue quite well.
So what does this have to do with NetWorker, you might ask? Well, quite a bit. In light of the recent release of NetWorker 7.5 SP2 I thought it relevant to spend a little time ruminating about the software development process, relating it to NetWorker, and asking EMC product management some questions about their processes.
Within any software development model, there are four requirements:
- Adding new features.
- Refining existing features.
- Removing obsolete features.
- Fixing bugs.
It’s a challenging problem – any one or two of these requirements can be readily accommodated without much fuss. The challenge that faces all vendors though is balancing all four requirements. Personally, I don’t envy the juggling process that faces product managers and product support managers on a daily basis. Why? All four requirements combined create clashing priorities and schedules that make for a very challenging environment. (It’s not unique to NetWorker of course – it applies pretty equally to just about every software product.)
In most situations, it’s easiest to add new features. This can be a double-edged sword. On the positive side, it can be a key factor in enticing potential customers to become actual customers, and it can equally be a key factor in enticing existing customers to remain customers rather than moving to the competition. On the negative side, it can lead to software bloating – a primary criticism of companies like Microsoft and Adobe. (Thankfully, I don’t think you can accuse NetWorker of being too ‘bloated’; in the 14 or so years I’ve been using it, the install footprint has of course gone up, but there’s not really been any “why the hell did they do that?” new features, and overall the footprint is well within the bounds for backup and recovery software.)
Like any good backup product, NetWorker’s development history is full of new features being added to it, such as the following:
- Storage nodes added in v5.x.
- Dynamic drive sharing added in v6.
- Advanced File Type Devices (ADV_FILE) added in v7.
- Jobs database introduced in v7.3.
- Virtualisation visualisation in v7.5.
- and so on.
Without new features being regularly added, companies leave themselves open to having the competition overtake them, and so periodically when we see a vendor respond to market forces (or try to push the market in a new direction), we should, even if we aren’t particularly fond of the new feature, accept that adding new features is inevitable in software development.
Equally, NetWorker history is rife with examples of existing features being refined, such as the following:
- Support for dedicated storage nodes.
- Enhancing the index system in v6 to overcome previous design limitations.
- Enhancing the resource configuration database in v7 to overcome previous design limitations.
- Frequent enhancement of all the database and application backup modules.
- Pool based retention.
- and so on.
You could say that feature refinement is all about evolutionary growth of the product. It’s never specifically about introducing entire new features – these are existing features that have grown between releases – usually in response to changing requirements in customer environments. (For instance, the previous resource configuration database worked well so long as you had smallish environments. Over time as environments became more complex, with more clients, and increased configuration requirements, it could no longer cut the mustard, triggering the redesign.)
The more challenging aspect for enterprise backup software is the notion of removing features – if doing so affects legacy recoverability options, it could cause issues for long-term users of the products, and so we usually see usability features removed rather than core support features. A few of the features that have been removed over time are:
- Support for the old GUIs (networkr.exe from Windows, nwadmin from Unix).
- Support for browsing indices via NFS mounts. (This was even before my time with NetWorker. It looks like it would have been fun to play with, but it wasn’t exactly cross-platform compatible!)
- Support for cross platform recoveries.
- Support for defunct tape formats (e.g., VHS).
I’d argue that it’s rarely the case that decisions to remove functionality are taken lightly. Usually it will be for one of three reasons:
- The feature was ‘fragile’ and fixing it would take too much effort.
- The feature is no longer required after a change in direction for the product.
- The feature is no longer being used by a sufficient number of users and its continued presence would hamper new directions/features for the product.
None of these, I’d argue, are easy decisions.
Finally we have the bugs – or “unanticipated features”, as we sometimes like to call them. Any vendor that tells you their software is 100% bug free is either lying, or their ‘product’ is no more complex than /bin/true. Bugs are practically unavoidable, so the focus must be on solid testing, identification and containment. I’ll be the first to admit that there have been spotty patches in the past where testing in NetWorker has seemed to be lacking, but having been on the last couple of betas, I’m seeing a roaring return to rigorous testing in 7.5 and 7.6. Did these pick up all bugs? No – again, see my point about no software ever being 100% bug free.
Hand on heart, I can’t cite a single company that has had a spotless record when it comes to bug control – this isn’t easy. Enterprise class backup software introduces new levels of complexity into the equation, and it’s worthwhile considering why. You can take exactly the same piece of enterprise backup software and install it into 50 different companies and I’ll bet that you’ll get a significant number of “unique” situations in addition to the core/standard user experience. Backup software touches on practically every part of an IT environment, and so is affected by a myriad of environment and configuration issues that normal software rarely has to contend with. Or to put it better: while another piece of software may have to contend with one or two isolated areas of environment/configuration uniqueness, backup software will usually have to contend with all of them, and remain as stable as possible throughout.
This isn’t easy. I may periodically get exasperated over bugs, etc., but I recognise the inevitability that I’ll be continuing to deal with bugs in any software I’m using for the rest of my life – so it’s hardly a NetWorker specific issue. (I’m going on the basis here that quantum computing won’t suddenly deliver universal Turing machines capable of simulating every possible situation and input for software and hardware.)
While I was writing this article, I thought it would be worthwhile to get some feedback from EMC NetWorker product management on this, and I’m pleased to include my questions to them, as well as their answers, below. These answers come from product management and engineering, and I’m presenting them unedited in their complete form.
Question 1
I’ve been told that EMC has taken considerable steps to speed up the RFE process. Can you briefly summarise the improvements that have been made and the buy-in from product management and engineering on this?
Answer:
With the large size of the NetWorker installed base, we receive many RFEs per month. These requests range in nature from architectural changes to relatively small operational enhancements. We have made great strides in organizing the RFE pool in such a manner so that at the front end of the release planning process we can look back over hundreds of discreet requests and digest those requests into an achievable number of specific and prioritized product requirements.
RFEs come in to the product team through three sources. We take RFEs on PowerLink (EMC’s information portal), through the Support organization, and in face to face meetings with customers and partners. NetWorker Product Management has a central database so that we can consolidate the RFE pool and apply a standard process for scrubbing and categorizing the requests. This is a time consuming process, but it provides us with the capabilities to track the areas of the product that are receiving the most requests and. That allows us to establish goals for a particular release and include RFEs accordingly. An example might be improved back up to disk workflows. The ability to quickly drill down to the requests most relevant to our high-level priorities allows us to efficiently write requirements that directly incorporate end-user feedback.
More customer requests for enhancement will be implemented in 2010 than ever before. We will address some of the big changes that customers have been calling for, and will also look to implement some bonus enhancements; small changes that won’t make the marketing slides but will make NetWorker operations easier on backup administrators who interact with the product on a daily basis.
Question 2
One challenge with any software vendor is integrating patches (or hot fixes) into stable development trees. How would EMC rate itself with this in relation to NetWorker?
Answer:
We maintain a high level of discipline in maintaining our active code branches. Hot fixes typically flow into our bug-fix service packs, (such as 7.5 SP1) which then flow back into the main code branch. Any code change made to an active branch must also be applied to the development branch, which builds on a regular basis. Build failures in development are taken very seriously by Engineering, and we engage resources to actively troubleshoot and resolve these issues.
Question 3
Currently we’re seeing cumulative patch cluster releases for most of the supported versions of NetWorker. E.g., NetWorker 7.5 SP1 is now up to cumulative patch cluster 8. These patch clusters currently remain available only via EMC support or partner support programs, and aren’t readily downloadable via standard PowerLink sources. With the projects currently being worked on to improve PowerLink, will we see this change, or is the rationale to not readily provide these cumulative patches a support one?
Answer:
When we post to PowerLink, we want to be sure that anyone who downloads code from EMC knows exactly what they’re getting. If we posted all of the clusters within today’s PowerLink framework, the result would be a confusing PowerLink experience for customers. We consider the patch cluster process to be an improvement on earlier practices and look forward to continued improvements in this area.
Question 4
What feature are you most pleased to have seen integrated into either NetWorker 7.5 or 7.6?
Answer:
We are very pleased with the NetWorker Management Console work that has done over the course of 7.5 and 7.6. Visualization of virtual environments (introduced in 7.5) has been very well received by customers, and we believe that the improvements in 7.6 around customization and performance will also be greatly appreciated as customers move to 7.6+ releases.
Question 5
One RFE process advocated is to have product management vet RFEs and submit them to a public forum to be voted on by community users. Advocates of this model say that it allows better community involvement and has products evolve to meet existing user requirements. Those who disagree with this model usually suggest that existing user feature suggestions don’t always accommodate design changes that would help boost market share. Is this a model which EMC has considered, or is it seeking to informally do this via the various EMC Community Forums that have been established?
Answer:
A closed loop is ideally what our enterprise customers who submit RFEs look for, i.e., to enter an RFE, track it, see if it is relevant and will be seriously considered. Capturing and allowing other users to vote is an option we are actively exploring. We would have to put some infrastructure in place to do so, but it is under investigation. The first audience for such an option would be our recently launched EMC community for NetWorker. The NetWorker user community is quite sophisticated, and we value their input tremendously. While it is true that some users take a narrow view of how NetWorker should evolve, others take a broader and more market-centric view. Our RFEs run the full spectrum.
While I touched on this in the second blog posting I made (Instantiating Savesets), it’s worthwhile revisiting this topic more directly.
Using ADV_FILE devices can play havoc with conventional tape rotation strategies; if you aren’t aware of these implications, it could cause operational challenges when it comes time to do recovery from tape. Let’s look at the lifecycle of a saveset in a disk backup environment where a conventional setup is used. It typically runs like this:
- Backup to disk
- Clone to tape
- (Later) Stage to tape
- (At rest) 2 copies on tape
Looking at each stage of this, we have:
The saveset, once written to an ADV_FILE volume, has two instances. The instance recorded as being on the read-only part of the volume will have an SSID/CloneID of X/Y. The instance recorded as being on the read-write part of the volume will have an SSID/CloneID of X/Y+1. This higher CloneID is what causes NetWorker, upon a recovery request, to seek the “instance” on the read-only volume. Of course, there’s only one actual instance (hence why I object so strongly to the ‘validcopies’ field introduced in 7.6 reporting 2) – the two instances reported are “smoke and mirrors” to allow simultaneous backup to and recovery from an ADV_FILE volume.
The next stage sees the saveset cloned:
This leaves us with 3 ‘instances’ – 2 physical, one virtual. Our SSID/CloneIDs are:
- ADV_FILE read-only: X/Y
- ADV_FILE read-write: X/Y+1
- Tape: X/Y+n, where n > 1.
At this point, any recovery request will still call for the instance on the read-only part of the ADV_FILE volume, so as to help ensure the fastest recovery initiation.
At some future point, as disk capacity starts to run out on the ADV_FILE device, the saveset will typically be staged out:
At the conclusion of the staging operation, the physical + virtual instances of the saveset on the ADV_FILE device are removed, leaving us with:

So, at this point, we end up with:
- A saveset instance on a clone volume with SSID/CloneID of: X/Y+n.
- A saveset instance on (typically) a non-clone volume with SSID/CloneID of: X/Y+n+m, where m > 0.
So, where does this leave us? (Or if you’re not sure where I’ve been heading yet, you may be wondering what point I’m actually trying to make.)
Note what I’ve been saying each time – NetWorker, when it needs to read from a saveset for recovery purposes, will want to pick the saveset instance with the lowest CloneID. At the point where we’ve got a clone copy and a staged copy, both on tape, the clone copy will have the lowest CloneID.
The net result is that NetWorker will, in these circumstances, when both tapes aren’t online, request the clone volume for recovery – even though in the vast majority of cases, this will be the volume that’s offsite.
For NetWorker versions 7.3.1 and lower, there was only one solution to this – you had to hunt down the actual clone saveset instances NetWorker was asking for, mark them as suspect, and reattempt the recovery. If you managed to mark them all as suspect, then you’d be able to ‘force’ NetWorker into facilitating the recovery from the volume(s) that had been staged to. However, after the recovery you had to make sure you backed out of those changes, so that both the clones and the staged copies would be considered not-suspect.
Some companies, in this situation, would instigate a tape rotation policy such that clone volumes would be brought back from off-site before savesets were likely to be staged out, with subsequently staged media sent offsite. This has a dangerous side-effect of temporarily leaving all copies of backups on-site, jeopardising disaster recovery situations, and hence it’s something that I couldn’t in any way recommend.
The solution introduced around 7.3.2, however, is far simpler – a mminfo flag called offsite. This isn’t to be confused with the convention of setting a volume location field to ‘offsite’ when the media is removed from site. Annoyingly, this remains unqueryable; you can set it, and NetWorker will use it, but you can’t, say, search for volumes with the ‘offsite’ flag set.
The offsite flag has to be manually set, using the command:
# nsrmm -o offsite volumeName
(where volumeName typically equals the barcode).
Once this is set, NetWorker’s standard saveset (and therefore volume) selection criteria are subtly adjusted. Normally if there are no online instances of a saveset, NetWorker will request the saveset with the lowest CloneID. However, saveset instances that are on volumes with the offsite flag set will be deemed ineligible and NetWorker will look for a saveset instance that isn’t flagged as being offsite.
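To make that selection rule concrete, here’s a toy Python model of it. This is only a sketch of the behaviour as I’ve described it in this post, not NetWorker’s actual logic: the Instance class, its field names and the pick_instance function are hypothetical names made up for illustration, and the volume names and CloneIDs are invented. I’ve also assumed that nothing is requested if every instance is flagged offsite, which is a case this post doesn’t cover.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """One saveset instance (hypothetical model, not a NetWorker structure)."""
    volume: str            # volume label, typically the barcode
    clone_id: int          # CloneID of this instance
    online: bool = False   # is the volume currently available for reading?
    offsite: bool = False  # has the offsite flag been set on the volume?


def pick_instance(instances: List[Instance]) -> Optional[Instance]:
    """Return the instance a recovery would ask for under the rules above."""
    # Instances on online volumes are used before any volume is requested.
    online = [i for i in instances if i.online]
    if online:
        return min(online, key=lambda i: i.clone_id)
    # Nothing online: ignore instances on volumes flagged as offsite...
    eligible = [i for i in instances if not i.offsite]
    # ...and request the lowest CloneID among whatever is left.
    # Assumption: if everything is flagged offsite, nothing is returned.
    return min(eligible, key=lambda i: i.clone_id) if eligible else None


if __name__ == "__main__":
    clone = Instance("CLONE01", clone_id=1002)    # cloned first: lower CloneID
    staged = Instance("STAGE01", clone_id=1005)   # staged later: higher CloneID
    print(pick_instance([clone, staged]).volume)  # CLONE01
    clone.offsite = True                          # flag set when sent offsite
    print(pick_instance([clone, staged]).volume)  # STAGE01
```

With neither tape online, the model asks for the clone volume first; once the clone volume carries the offsite flag, it asks for the staged volume instead.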
The net result is that when following a traditional backup model with ADV_FILE disk backup (backup to disk, clone to tape, stage to tape), it’s very important that tape offsiting procedures be adjusted to set the offsite flag on clone volumes as they’re removed from the system.
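If your offsiting procedure already produces a list of barcodes being removed, generating the matching nsrmm commands is easy to script. The following is a minimal Python sketch that only builds the command lines (it doesn’t run them), so they can be reviewed before being pasted into a shell on the NetWorker server. The function name offsite_commands is hypothetical; the only NetWorker syntax used is the nsrmm -o offsite and nsrmm -o notoffsite forms shown in this post.

```python
import shlex
from typing import Iterable, List


def offsite_commands(volumes: Iterable[str], clear: bool = False) -> List[str]:
    """Build one nsrmm command line per volume, without executing anything."""
    flag = "notoffsite" if clear else "offsite"
    commands = []
    for name in volumes:
        name = name.strip()
        if not name:
            continue  # skip blank entries in the list
        # shlex.quote guards against odd characters in a volume name.
        commands.append(f"nsrmm -o {flag} {shlex.quote(name)}")
    return commands


if __name__ == "__main__":
    # Barcodes of the clone volumes leaving site today (example values).
    for line in offsite_commands(["800801L4", "800802L4"]):
        print(line)
```

Keeping this as a dry run is deliberate: the list of volumes is the thing most likely to be wrong, and it’s easier to check before the flag is set than after.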
The good news is that you don’t normally have to do anything when it’s time to pull the tape back onsite. The flag is automatically cleared* for a volume as soon as it’s put back into an autochanger and detected by NetWorker. So when the media is recycled, the flag will be cleared.
If you come from a long-term NetWorker site and the convention is still to mark savesets as suspect in this sort of recovery scenario, I’d suggest that you update your tape rotation policies to instead use the offsite flag. If on the other hand, you’re about to implement an ADV_FILE based backup to disk policy, I’d strongly recommend you plan in advance to configure a tape rotation policy that uses the offsite flag as cloned media is sent away from the primary site.
–
* If you did need to explicitly clear the flag, you can run:
# nsrmm -o notoffsite volumeName
Which would turn the flag back off for the given volumeName.
My boss, on his blog, has raised a pertinent question – if it’s so important, according to some vendors, that backup and archive are all achieved through the same product interface, then how many companies out there assign the role of archive administrator to the backup administrator? (Or vice versa).
I like this question; it’s kind of like the old conundrum of whether the dog wags the tail, or whether the tail wags the dog. That is, are companies that heavily push an integrated backup and archive interface:
- Responding to the needs of IT to meet current desired business functionality, or,
- Are they trying to drive IT in a way that perhaps doesn’t meet desired business functionality?
(Or indeed, something else entirely).
[Edit, further thoughts, 2010-03-03] I’ve been thinking more about this, and I have to say I can’t think of a single customer environment off-hand where the backup administrator is also responsible for archiving. Archiving seems to remain primarily the purview of the storage administration teams in sites that I’m aware of, so it does raise the question – how beneficial is an integrated backup and archive administration process?
[Original wrap-up] So if you’ve got any thoughts on the integration of backup and archive administration, either at the software or the human resources layer, I’d encourage you to jump across to Mike’s blog and make your voice heard.
(As a first, I’ve disabled comments on this blog posting, so as to encourage discussion to remain in one location – the source article.)
Close enough together that I have to declare them a tie, the top stories for February were:
It’s fair to say that Carry a jukebox with you remains a consistently big hit – a bit like the “NSR peer information” story – and so February will be the last month that it gets included in consideration for top articles.
Towards the end of the month, with the release of NetWorker 7.5 SP2, there was quite a lot of interest in the articles “NetWorker 7.5.2 released” and “NetWorker 7.5.2 – What’s it got?”. Obviously if you’ve got Windows 2008 or Windows 7 clients that you need to back up, 7.5 SP2 is almost a no-brainer – you’ll really need to be using it. So far, based on my testing on Linux, 7.5 SP2 is looking fairly good for that platform too. As always, everyone should read the release notes before deciding whether to upgrade their environments.
While this is pertinent to all versions of NetWorker, it particularly seems relevant mentioning now, since as of 7.5.2, we’re now seeing revised messaging from NetWorker when a tape becomes prematurely full. These new messages now state:
nsrd media notice: LTO Ultrium-4 tape 800814L4 used 2039 MB of 800 GB capacity
nsrd media notice: NetWorker media: (Warning) 800814L4 marked full prematurely.
Verify possible error on the device /dev/nst4, advertised capacity is 800 GB
marked full at 2039 MB
Now, it’s worth noting here that normally, if a tape fills up this soon, there probably is an issue, and this version of the message, while only subtly different, is certainly more informative, which is a good thing. When we consider VTLs, however, it’s a different story. In a virtual tape library, we normally want to use much smaller media sizes than the drive type we’re configured for. That way you’re writing virtual volumes that are 50GB or 100GB rather than 800GB. In my case, referring to the above, my lab VTL uses virtual media sizes of 1GB (with compression).
So, how do you go about this? Well, it’s easiest to accomplish when you first set up the environment. You need to change the “Volume Default Capacity” of each virtual device to suit the allocated media sizes. To do this, in NMC turn on View->Diagnostic Mode, then when viewing device properties, enter the appropriate size in gigabytes (followed by “G” or “GB”) in the “Volume default capacity” field of the Configuration tab, shown below:

Now, if you can do that on your VTL devices before you start labelling volumes, you’re done and dusted. However, if you’ve previously labelled your media, you either have to relabel the currently blank virtual media or wait until NetWorker gets around to recycling the currently used media.
You can query mminfo to see what the default capacity is registered at – e.g.,
[root@tara ~]# mminfo -m
state  volume     written  (%)   expires      read  mounts  capacity
       800801L4   2254 MB  full  02/26/2011   0 KB       5    800 GB
       800802L4      0 KB    0%       undef   0 KB       5   1000 MB
       800804L4      0 KB    0%       undef   0 KB       5   1000 MB
       800805L4      0 KB    0%       undef   0 KB       3    800 GB
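If you have more than a handful of virtual volumes, it can be handy to pick out the ones still carrying the old default capacity. Here’s a rough Python sketch that does this against the mminfo -m output shown above. It assumes the layout seen in that sample, namely that the volume name is the first field on each line and the capacity is the last two fields; a report with a populated state column would need different handling. The function names capacities and still_default are hypothetical.

```python
from typing import Dict, List

# The mminfo -m output from above, pasted in as sample input.
SAMPLE = """\
state volume written (%) expires read mounts capacity
800801L4 2254 MB full 02/26/2011 0 KB 5 800 GB
800802L4 0 KB 0% undef 0 KB 5 1000 MB
800804L4 0 KB 0% undef 0 KB 5 1000 MB
800805L4 0 KB 0% undef 0 KB 3 800 GB
"""


def capacities(report: str) -> Dict[str, str]:
    """Map each volume name to its registered capacity, e.g. '800 GB'."""
    result = {}
    for line in report.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "state":
            continue
        # Assumption: first field is the volume, last two are the capacity.
        result[fields[0]] = " ".join(fields[-2:])
    return result


def still_default(report: str, wanted: str) -> List[str]:
    """List volumes whose registered capacity is not the wanted value."""
    return [vol for vol, cap in capacities(report).items() if cap != wanted]


if __name__ == "__main__":
    print(still_default(SAMPLE, "1000 MB"))  # ['800801L4', '800805L4']
```

The volumes it reports are the ones that either need relabelling (if blank) or will pick up the new capacity when they’re next recycled.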
Now, what effect does this have on how much you can write to the volumes? The short answer is none. All you’re doing is adjusting the default capacity assigned to new volumes that are labelled in these (virtual) tape drives – and we can see what happens when NetWorker breaches the default volume capacity all the time in relation to physical tape – it just keeps writing until it hits end of physical tape. Nothing more, nothing less. So this means when you fill up your virtual media, NetWorker doesn’t complain at all:
nsrd media notice: LTO Ultrium-4 tape 800802L4 on /dev/nst3 is full
nsrd media notice: LTO Ultrium-4 tape 800802L4 used 2793 MB of 1000 MB capacity
nsrd media info: WORM capable for device /dev/nst3 has been set
Is this something you must do? Well, no, not technically. However, remembering that I advocate a zero error policy, the above is something I’d definitely strongly recommend for virtual devices. Doing so will eliminate what would otherwise be false errors on the virtual tapes within the NetWorker daemon logs. That means if you have to search for media issues, or refer your daemon logs to your support provider for analysis, they won’t be seeing bunches of “tape filled prematurely” issues.
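To see why the capacity setting matters when you’re trawling logs, here’s a small Python sketch that reads the “used X of Y capacity” notices quoted earlier and works out how full each tape was relative to its registered capacity. The message format is taken from the samples in this post only; the function names, the 50% threshold and the use of 1024-based units are my own assumptions for illustration, not anything NetWorker itself defines.

```python
import re
from typing import List, Tuple

# Notices copied from the examples earlier in this post.
SAMPLE_LOG = """\
nsrd media notice: LTO Ultrium-4 tape 800814L4 used 2039 MB of 800 GB capacity
nsrd media notice: LTO Ultrium-4 tape 800802L4 on /dev/nst3 is full
nsrd media notice: LTO Ultrium-4 tape 800802L4 used 2793 MB of 1000 MB capacity
"""

USED_RE = re.compile(
    r"tape (?P<vol>\S+) used (?P<used>\d+) (?P<uu>[KMGT]B) "
    r"of (?P<cap>\d+) (?P<cu>[KMGT]B) capacity"
)
# Unit sizes in KB (assumption: 1024-based).
UNITS = {"KB": 1, "MB": 1024, "GB": 1024 ** 2, "TB": 1024 ** 3}


def fill_ratios(log_text: str) -> List[Tuple[str, float]]:
    """Return (volume, used / registered capacity) for each 'used' notice."""
    ratios = []
    for m in USED_RE.finditer(log_text):
        used = int(m["used"]) * UNITS[m["uu"]]
        cap = int(m["cap"]) * UNITS[m["cu"]]
        ratios.append((m["vol"], used / cap))
    return ratios


def premature(log_text: str, threshold: float = 0.5) -> List[str]:
    """Volumes that filled below the (hypothetical) threshold fraction."""
    return [vol for vol, ratio in fill_ratios(log_text) if ratio < threshold]


if __name__ == "__main__":
    print(premature(SAMPLE_LOG))  # ['800814L4']
```

The volume registered at 800 GB looks like a fault at well under 1% full, while the one registered at 1000 MB doesn’t get flagged at all, which is exactly the noise reduction you’re after.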
As I mentioned in a post yesterday, NetWorker 7.5.2 (or NetWorker 7.5 SP2) has been released, and with it comes a bunch of feature enhancements as well as a slew of bug fixes.
One of the criticisms of EMC’s development process for NetWorker for a while was that a new service pack would come out with few, if any of the bug fixes added as hot-fixes and cumulative patch clusters for the previous service pack. This is something that EMC have clearly been improving on, because reading through the release notes with the almost 150 bug fixes cited, I see many “familiar” issues that were addressed in various cumulative patch clusters for 7.5.1. On this alone, I’ll give 7.5.2 high praise.
I’ll be running up 7.5.2 in my lab today and looking at a few test cases, etc., but so far of the improvements that have been made in this new service pack, I’m pretty stoked about the following:
- Auto-addition of the update enabler. Starting with this version, if you go up to a version of NetWorker that requires an update enabler, NetWorker will create it automatically for you. You still have to get it authorised of course, but this saves smaller sites from the hassle of upgrading without checking for the enabler and then hitting problems.
- Support for Windows 7 clients.
- Support for Windows 2008 R2 as a client, storage node, and server. The company I work for, IDATA, got involved in the beta testing for this and were pleased with the results.
- DFS-R Granular Recovery. A few months ago, I had an issue where a customer’s SYSTEM STATE: saveset was 18GB, due to DFS-R replication. The release notes indicate this shouldn’t be the case any more – this data should now be broken out of the SYSTEM STATE: saveset into regular file backup/recovery operations.
- VCB support for ESX v4. I know I should mention this, but I remain overall unexcited about VCB because of the lack of granular Linux support. vSphere API backups, when they come, will grab far more of my attention, I hope.
- Client parallelism on new clients is reduced from the previous (increase to) 12, back to the original 4. Client parallelism for the NetWorker server’s client instance on initial bootstrap/creation remains 12, and I’m fine with this. A reversion to client parallelism of 4 however will make performance tuning in new environments at least a little more sane.
Overall I have high hopes for 7.5 SP2. If you’re currently needing to back up Windows 2008 R2 or Windows 7 hosts, this is probably going to be a no-brainer: you’ll likely want to upgrade at least those clients to it straight away.
Before I recommend 7.5.2 more generally of course, I want to run it through its paces. I will reiterate though – even on first glance, it seems very promising. As is always the case, you should make sure that you read the release notes before you contemplate upgrading – and have a clear downgrade path if you need to. This means that if you’ve been supplied with any hot fixes, or cumulative patch clusters, you need to make sure you still have these available as you’re planning the upgrade.
Typical that it happens on a day when I’m travelling, but NetWorker 7.5 SP2 (aka NetWorker 7.5.2) has been released today.
There’s a bunch of updates and bug fixes in this release – I actually have really high hopes for it based on discussions I’ve had with various folks at EMC. I’m about to start an hour long train trip where I’ll be reading the release notes – if you have PowerLink you can access them from here.
I’ll aim to have a summary posting of new features and bug fixes in the next 24 hours. In the interim, let me just say that I’m over the moon to see that one of the first new features is a reversion to default client parallelism of 4 rather than the previous change to 12. This is very good to see changed. (Oh, not to mention support for Windows 2008 R2.)
[Edit, 2010-02-25]
An overview of new features, etc., can be found here.
The scenario:
- A clone or stage operation has aborted (or otherwise failed)
- It has been restarted
- It hangs waiting for a new volume even though there’s a partially written volume available.
This is a relatively easy problem to explain. Let’s first look at the log messages that appear. To generate this error, I started cloning some data to the “Default Clone” pool, with only one volume in the pool, then aborted. Shortly thereafter I tried to run the clone again, and when NetWorker wouldn’t write to the volume I unmounted and remounted it – a common thing that newer administrators will try in this scenario. This is where you’ll hit the following error in the logs:
media notice: Volume `800829L4' ineligible for this operation; Need a different volume
from pool `Default Clone'
media info: Suggest manually labeling a new writable volume for pool 'Default Clone'
So, what’s the cause of this problem? It’s actually relatively easy to explain.
A core component in NetWorker’s media database design is that a saveset can only ever have one instance on a piece of media. This applies equally to failed and complete saveset instances.
The net result is that this error/situation will occur because it’s meant to – NetWorker doesn’t permit more than one instance of a saveset to appear on the same piece of physical media.
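The rule is simple enough to model in a few lines. Below is a toy Python sketch, assuming nothing more than the rule as stated: a volume is ineligible for a saveset if it already holds any instance of that saveset, failed or complete. The function name eligible_volumes, the SSID value and the second volume label are hypothetical and only there for illustration.

```python
from typing import Dict, List, Set


def eligible_volumes(ssid: str, volumes: Dict[str, Set[str]]) -> List[str]:
    """Volumes that hold no instance (complete or failed) of the saveset."""
    return [name for name, held in volumes.items() if ssid not in held]


if __name__ == "__main__":
    # One volume in the pool, holding the aborted clone of saveset 4021.
    pool = {"800829L4": {"4021"}}
    print(eligible_volumes("4021", pool))  # [] -> a different volume is needed

    # Add a fresh volume to the pool and the retry has somewhere to go.
    pool["800830L4"] = set()
    print(eligible_volumes("4021", pool))  # ['800830L4']
```

With only the partially written volume in the pool, the list of candidates is empty, which matches the “Need a different volume” message in the log.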
So what do you do when this error comes up?
- If you’re backing up to disk, an aborted saveset should normally be cleared up automatically by NetWorker after the operation is aborted. However, in certain instances this may not be the case. For NetWorker 7.5 vanilla and 7.5.1.1/7.5.1.2, this should be done by expiring the saveset instance – using nsrmm to flag the instance as having an expiry date within a few minutes or seconds. For all other versions of NetWorker, you should just be able to delete the saveset instance.
- When working with tape (virtual or physical), the most recommended approach would be to move on to another tape, or if the instance is the only instance on that tape, relabel the tape. (Some would argue that you can use nsrmm to delete the saveset instance from the tape and then re-attempt the operation, but since NetWorker is so heavily designed to prevent multiple instances of a saveset on a piece of media, I’d strongly recommend against this.)
Overall it’s a fairly simple issue, but knowing how to recognise it lets you resolve it quickly and painlessly.