Writing, 2147483647 sessions

Feb 06, 2011
 

Somewhere along the way in NetWorker 7.6 SP1, an annoying reporting issue crept in where a device would sometimes report that it was writing 2147483647 sessions – the maximum value of a signed 32-bit integer, which suggests an overflowed or uninitialised counter rather than any real workload. The problem didn’t appear to be consistently reproducible, and in practice it wasn’t a major issue, but still, as cosmetic bugs go it wasn’t desirable.

In a recent reply on the NetWorker mailing list, Skip Hanson, a senior member of the NetWorker usability team, advised that this issue has been resolved in 7.6 SP1 Cumulative Release 3 (i.e., 7.6.1.3).

If you’re experiencing this bug, you may wish to review the update notes for 7.6.1.3 and see whether it’s appropriate to apply it at your site.

You can always find the latest update notes for cumulative releases at the NetWorker Product Support page on Powerlink:

NetWorker Support

[Note: 2011-02-19] It would appear this issue does actually still exist in NetWorker 7.6.1.3.

The top 10 for 2009

Jan 06, 2010
 

Looking at the stats for both this new site and the previous site, I’ve compiled a list of the top 10 most-read articles on The NetWorker Blog for 2009. The top 3, of course, match the three articles that routinely turn out to be the most popular in any given month, which says something about their relevance to the average NetWorker administrator.

(Note: I’ve excluded non-article pages from the top 10.)

Number 10 – Instantiating Savesets

The very first article on the blog, Instantiating Savesets, detailed the importance of distinguishing between all instances of a saveset and a specific instance of a saveset.

This distinction between using just the saveset ID, and using a saveset ID/clone ID combination, becomes particularly important when staging from disk backup units. If clones exist and you stage using just the saveset ID, then when NetWorker cleans up at the end of the staging operation it will remove the references to the clones as well as deleting the original from the disk backup unit. (Something you really don’t want to have happen.)

Recommendation to EMC: Perhaps it would be worthwhile requiring a “-y” argument to nsrstage if staging savesets from disk backup units and specifying only the saveset ID.

Recommendation to NetWorker administrators: Always be careful when staging to specify both the saveset ID and the clone ID.
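To illustrate, here’s a minimal sketch of staging a specific instance – the ssid/cloneid values, pool name and paths are hypothetical, so check them against your own media database with mminfo first:

# List every instance of the saveset so you can pick the clone ID of the
# copy sitting on the disk backup unit (ssid below is hypothetical).
mminfo -a -q "ssid=4025671234" -r "ssid,cloneid,volume,sumsize"

# Stage that specific instance (ssid/cloneid) to the destination pool,
# rather than staging by ssid alone.
nsrstage -b "Staging Tape" -m -S 4025671234/1254887412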

Number 9 – Basics – Important mminfo fields

In May I wrote about a few key mminfo fields – notably:

  • savetime
  • sscreate
  • ssinsert
  • sscomp
  • ssaccess

Sadly, I didn’t get the result I wanted from EMC on ssaccess. Although it’s documented as being updated whenever a saveset fragment is accessed for backup or recovery, the most I could get was an acknowledgement that it was currently broken and a suggestion to lodge an RFE to get it fixed. (The alternative was to have the documentation changed to take out the reference to read operations – something I didn’t want to have happen!)

Recommendation to EMC: ssaccess would be a particularly useful mminfo field, particularly when analysing recovery statistics for NetWorker. Please fix it.
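For reference, those fields can all be pulled in a single report – a minimal sketch, with the client name hypothetical:

# Report the creation, insertion, completion and (nominally) last-access
# times for each saveset belonging to the client.
mminfo -q "client=orilla.example.com" \
       -r "name,savetime,sscreate,ssinsert,sscomp,ssaccess"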

Number 8 – Basics – Listing files in a backup

Want to know what files were backed up as part of the creation of a saveset? If you do, you’re not unique – this has remained a very popular article since it was written in January.

Recommendation to EMC: This information can be retrieved via a combination of mminfo/nsrinfo, but it would be handy if NMC supported drilling down into a saveset to provide a file listing.
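As a hedged sketch of that mminfo/nsrinfo combination (server, client and saveset names hypothetical): use mminfo to find the saveset’s time, then ask nsrinfo for the client file index contents for that backup.

# Find the saveset and note its full save time.
mminfo -s backupserver -q "client=orilla.example.com,name=/home" \
       -r "savetime(25),ssid,level"

# List the files recorded in the client file index for that backup time.
nsrinfo -s backupserver -t "01/25/2009 13:07:15" orilla.example.com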

Number 7 – Using yum to install NetWorker on Linux

The NetWorker client packages in particular need dependency resolution when being installed on Linux, and that drew a lot of people to this article – yum resolves those dependencies automatically, where a bare rpm install does not.
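The gist, as a hedged sketch (package file names vary by NetWorker release and architecture):

# yum localinstall pulls the required libraries from your configured
# repositories while installing the local NetWorker client RPMs.
yum localinstall lgtoclnt-7.6-1.x86_64.rpm lgtoman-7.6-1.x86_64.rpm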

Number 6 – Basics – mminfo, savetime, and greater than/less than

This article explained why NetWorker uses the greater than and less than signs in mminfo in a way that newcomers to the product might find backwards. If you’re not aware of why mminfo works the way it does for specifying savetimes, you should be.
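Without re-treading that article, here’s a neutral sketch of bounding a query by savetime (dates hypothetical) – it’s working out which side of the comparison does what that trips newcomers up:

# Savesets created on or after 1 Jan 2009 but before 8 Jan 2009.
mminfo -q "savetime>=01/01/2009,savetime<01/08/2009" \
       -r "client,name,savetime,sumsize"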

Number 5 – 7.5(.1) changed behaviour – deleting savesets from adv_file devices

This was a particularly unpleasant bug introduced into NetWorker 7.5, thankfully now resolved in the cumulative service releases and NetWorker 7.6.

The gist of it is that in NetWorker 7.5/7.5.1 (aka 7.5 SP1), if you deleted a saveset on a disk backup unit, NetWorker would suffer a serious failure: from that point on it would fail to clean up regularly expired savesets from the disk backup unit, and would insist that the unit had major problems. The primary error would manifest as:

nsrd adv_file warning: Failed to fetch the saveset(ss_t) structure for ssid 1890993582

This was fixed in 7.5.1.2, thankfully.

Recommendation to EMC: Never let this bug see the light of day again, please. (So far you’re doing an excellent job, by the way.)

Number 4 – NetWorker 7.5.1 Released

I’ve recently noticed a disturbing trend among many vendors, EMC included, where as soon as a new release of a product is made, sales and account staff become overly enthusiastic about recommending it. This comes on top of their not really having any technical expertise. (Please be patient, I’m trying to put this as diplomatically as possible.)

One of the worst instances I’ve seen of this in the last few years was the near-hysterical pumping of 7.5, thanks to some useful features to do with virtualisation in particular. I’ll admit that my articles on the integration between Oracle Module 5 and NetWorker 7.5, as well as Probe Based Backups, may have added to this. However, there was something of a stampede to 7.5 when it came out, and consequently, when it had some issues, there was strong enthusiasm for the release of 7.5.1.

This, by the way, is why IDATA maintains a recommended versions list for its support customers – one that is not automatically updated whenever new versions of products come out.

Recommendation to EMC: Remind your sales staff that existing users already have the product, and not to just go blindly convincing them to upgrade. Otherwise you’ll eventually start sounding like this.

Number 3 – Carry a jukebox with you (if you’re using Linux)

During 2009, Mark Harvey first got his open source LinuxVTL project working with NetWorker in a single-drive configuration, then eventually in multi-drive configurations. (Mark assures me, by the way, that patches are coming real soon to allow multiple robots on the same storage node/server.)

Lesson for me: With the LinuxVTL configured on multiple lab servers in my environment, I’ve really taken to VTLs this year, and considerably changed my attitude on using them. (I’ll say again: I still resent that they’re needed, but I now respect them a lot more than I previously did.)

Lesson for others: Even Mark himself says that the open source VTL shouldn’t be used for production backups. Don’t be cheap with your backup system: this is an excellent tool for lab setups, training, diagnostics, etc., but it is not a replacement for a production-ready VTL system. If you want a VTL, buy a VTL.

Number 2 – Basics – Parallelism in NetWorker

Some would say that the high popularity of an article about parallelism in NetWorker indicates that it’s not sufficiently documented.

I’m not entirely convinced that’s the case. But it does go to show that it’s an important topic when it comes to performance tuning, and summary articles about how the various types of parallelism interact are obviously popular.

Lesson for everyone: Now that the performance tuning guide has been updated and made more relevant in NetWorker 7.6, I’d recommend that anyone wanting an official overview of the parallelism options check it out in addition to the article above.
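For the curious, the main knobs interact at the server, client and device level – here’s a minimal sketch of inspecting them with nsradmin (hostnames and device name hypothetical; verify the attribute names against your release):

nsradmin -s backupserver
nsradmin> . type: NSR
nsradmin> show parallelism
nsradmin> print
nsradmin> . type: NSR client; name: bigclient.example.com
nsradmin> print
nsradmin> . type: NSR device; name: rd=storagenode:/dev/nst0
nsradmin> show target sessions
nsradmin> print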

Number 1 – Basics – Fixing “NSR peer information” errors

Goodness, this was a popular article in 2009 – detailing how to fix the “NSR peer information” errors that can come up from time to time in the NetWorker logs. If you’re not familiar with this error yet, as a NetWorker administrator you’ll likely see something like it eventually:

39078 02/02/2009 09:45:13 PM  0 0 2 1152952640 5095 0 nox nsrexecd SYSTEM error: There is already a machine using the name: “faero”. Either choose a different name for your machine, or delete the “NSR peer information” entry for “faero” on host: “nox”

Recommendation for EMC: Users shouldn’t really need to be Googling for a solution to this problem. Let’s see an update to NetWorker Management Console where these errors/warnings are reported in the monitoring log, with the administrator being able to right click on them and choose to clear the peer information after confirming that they’re confident no nefarious activity is happening.
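Until then, the usual manual fix – shown here as a hedged sketch using the hostnames from the error above – is to delete the stale peer entry from the client daemon’s (nsrexecd) resource database on the complaining host:

nsradmin -s nox -p nsrexec
nsradmin> . type: NSR peer information; name: faero
nsradmin> print
nsradmin> delete
nsradmin> quit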

Wrapping Up

I have to say, it was a fantastically satisfying year writing the blog, and I’m looking forward to seeing what 2010 brings in terms of most useful articles.

Dec 14, 2009
 

Over the years I’ve dealt with a lot of different environments, and a lot of different usage requirements for backup products. Most of these fall into the “appropriate business use” categories. Some fall into the “hmmm, why would you do that?” category. Others fall into the “please excuse my brain it’s just scuttled off into the corner to hide – tell me again” category.

This is not about the people, or the companies, but about the crazy ideas that sometimes take hold within companies and that should be watched for. While I could have expanded this list to cover a raft of other things outside of backups, I’ve forced myself to keep it to the backup process.

In no particular order then, these are the crazy things I never want to hear again:

  1. After the backups, I delete all the indices, because I maintain a spreadsheet showing where files are, and that’s much more efficient than proprietary databases.
  2. We just backup /etc/passwd on that machine.
  3. But what about /etc/shadow? (My stupid response to the above statement, blurted out after my brain stalled in response to statement #2.)
  4. Oh, hadn’t thought about that (In response to #3).
  5. Can you fax me some cleaning cartridge barcodes?
  6. To save money on barcodes at the end of every week we take them off the tapes in the autochanger and put them on the new ones about to go in.
  7. We only put one tape in the autochanger each night. We don’t want <product> to pick the wrong tape.
  8. We need to upgrade our tape drives. All our backups don’t fit on a single tape any more. (By same company that said #7.)
  9. What do you mean if we don’t change the tape <product> won’t automatically overwrite it? (By same company that said #7 and #8.)
  10. Why would I want to match barcode labels to tape labels? That’s crazy!
  11. That’s being backed up. I emailed Jim a week ago and asked him to add it to the configuration. (Shouted out from across the room: “Jim left last month, remember?”)
  12. We put disk quotas on our academics, but due to government law we can’t do that to their mail. So when they fill up their home directories, they zip them up and email it to themselves then delete it all.
  13. If a user is dumb enough to delete their file, I don’t care about getting it back.
  14. Every now and then on a Friday afternoon my last boss used to delete a filesystem and tell us to have it back by Monday as a test of the backup system.
  15. What are you going to do to fix the problem? (Final question asked by an operations manager after explaining (a) robot was randomly dropping tapes when picking them from slots; (b) tapes were covered in a thin film of oily grime; (c) oh that was probably because their data centre was under the area of the flight path where planes are advised to dump excess fuel before landing; (d) fuel is not being scrubbed by air conditioning system fully and being sucked into data centre; (e) me reminding them we just supported the backup software.)

I will say that numbers #1 and #15 are my personal favourites for crazy statements.

Nov 25, 2009
 

Everyone who has worked with ADV_FILE devices knows this situation: a disk backup unit fills, and the saveset(s) being written hang until you clear up space, because as we know savesets in progress can’t be moved from one device to another:

Savesets hung on full ADV_FILE device until space is cleared

Honestly, what makes me really angry (I’m talking Marvin the Martian really angry here) is that if a tape device fills and another tape of the same pool is currently mounted, NetWorker will continue to write the saveset on the next available device:

Saveset moving from one tape device to another

What’s more, if a tape fills and there’s another drive that doesn’t currently have a tape mounted, NetWorker will mount a new tape in that drive and continue the backup there, in preference to dismounting the full tape and loading a fresh volume in the current drive.

There’s an expression for the behavioural discrepancy here: That sucks.

If anyone wonders why I say VTLs shouldn’t need to exist, but I still go and recommend them and use them, that’s your number one reason.

Quibbles – Why can’t I rename clients in the GUI?

Nov 16, 2009
 

For what it’s worth, I believe that the continuing lack of support for renaming clients as a function within NMC (as opposed to the current, highly manual process) represents an annoying and non-trivial gap in functionality – one that causes administrators headaches and undue work.

For me, this was highlighted most recently when a customer of mine needed to shift their primary domain, and all clients had been created using the fully qualified domain name. All 500 clients. Not 5, not 50, but 500.

The current mechanism for renaming clients may be “OK” if you only rename one client a year, but more and more often I’m seeing sites renaming up to 5 clients a year as a regular course of action. If most of my customers are doing it, they’re surely not unique.

Renaming clients in NetWorker is a pain. And I don’t mean an “oops, I just trod on a pin” style pain, but an “oh no, I just impaled my foot on a 6 inch rusty nail” style pain. It typically involves:

  • Taking care to note the client ID
  • Recording the client configuration for all instances of the client
  • Deleting all instances of the client
  • Renaming the index directory
  • Recreating all instances of the client, being sure on first instance creation to include the original client ID

(If the client is explicitly named in pool resources, they have to be updated as well, first clearing the client from those pools and then re-adding the newly “renamed” client.)

This is not fun stuff. Further, the chance for human error in the above list is substantial, and when we’re playing with indices, human error can result in situations where it becomes very problematic to either facilitate restores or ensure that backup dependencies have appropriate continuity.
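As a hedged outline of what that process looks like in practice (server name, client names and placeholder values are all hypothetical):

# Record the existing configuration, especially the client ID, before deleting.
nsradmin -s backupserver
nsradmin> . type: NSR client; name: oldname.example.com
nsradmin> print
nsradmin> delete

# On the backup server, rename the client file index directory.
mv /nsr/index/oldname.example.com /nsr/index/newname.example.com

# Recreate the client, supplying the original client ID on the first instance.
nsradmin> create type: NSR client; name: newname.example.com; client id: <original client id>; save set: All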

Now, I know that facilitating a client rename from within a GUI isn’t easy, particularly since the NMC server may not be on the same host as the NetWorker server. There’s a bunch of potential pool changes, client resource changes and filesystem changes involved, plus the need to put in appropriate rollback code so that if the process aborts half-way through it can at least revert to the old client name.

As I’ve argued in the past though, just because something isn’t easy doesn’t mean it shouldn’t be done.

Oct 27, 2009
 

NetWorker has an irritating quirk where it doesn’t allow you to clone or stage incomplete savesets. I can understand the rationale behind it – the data isn’t completely usable – but that rationale is wrong.

If you don’t think this is the case, all you have to do to test is start a backup, cancel it mid-way through a saveset, then attempt to clone that saveset. Here’s an example:

[root@tara ~]# save -b Big -q -LL /usr
Oct 25 13:07:15 tara logger: NetWorker media: (waiting) Waiting for 1
writable volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
<backup running, CTRL-C pressed>
(interrupted), exiting
[root@tara ~]# mminfo -q "volume=BIG995S3"
 volume        client       date      size   level  name
BIG995S3       tara.pmdg.lab 10/25/2009 175 MB manual /usr
[root@tara ~]# mminfo -q "volume=BIG995S3" -avot
 volume        client           date     time         size ssid      fl   lvl name
BIG995S3       tara.pmdg.lab 10/25/2009 01:07:15 PM 175 MB 14922466  ca manual /usr
[root@tara ~]# nsrclone -b Default -S 14922466
5876:nsrclone: skipping aborted save set 14922466
5813:nsrclone: no complete save sets to clone

Now, you may be wondering why I’m hung up on not being able to clone or stage this sort of data. The answer is simple: sometimes the only backup you have is a broken backup. You shouldn’t be punished for this!

Overall, NetWorker has a fairly glowing pedigree in terms of enforced data viability:

  • It doesn’t recycle savesets until all dependent savesets are also recyclable;
  • It’s damn aggressive at making sure you have current backups of the backup server’s bootstrap information;
  • If there’s any index issue it’ll end up forcing a full backup for savesets, even if it has backed them up before;
  • It won’t overwrite data on recovery unless you explicitly tell it to;
  • It lets you recover from incomplete savesets via scanner/uasm (see the sketch below)!

and so on.
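On that last point, here’s a hedged sketch of what the scanner/uasm recovery looks like – the ssid is taken from the example above, while the device path and relocation target are hypothetical:

# Read the aborted saveset straight off the volume and unpack it with uasm,
# relocating /usr to a scratch area rather than writing over the live files.
scanner -S 14922466 /d/backup01 | uasm -rv -m /usr=/tmp/usr-recovered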

So, logically, it makes little sense to refuse to clone or stage incomplete savesets.

There may be programmatic reasons why NetWorker doesn’t permit cloning/staging incomplete savesets, but these aren’t sufficient reasons. NetWorker’s pedigree of extreme focus on recoverability remains tarnished by this inability.

Oct 20, 2009
 

A while ago, I made a posting about a long-running annoyance I have with directive management in NetWorker.

This time I want to expand slightly upon it, thanks mainly to some recent discussions with customers that pointed out an obvious and annoying additional lack of flexibility in directive management.

That’s to do with the complete inability to apply directives – particularly the skip or null directive – against “special” savesets. By “special” I’m referring to savesets that aren’t part of standard filesystem backups, yet are still effectively just a bunch of files.

Such as say:

  • SYSTEM FILES:
  • SYSTEM STATE:
  • ASR:

(And so on.)

In short, NetWorker provides no way of skipping these savesets while still using the “All” special saveset for Windows clients. You can’t do any of the following:

  • Hand craft a server-side directive
  • Hand craft a client-side directive
  • Use the directive management option in the client GUI (winworkr) to create a directive to skip these styles of savesets.

OK, the last point is just slightly inaccurate. Yes, you can create the directive using this method – but:

  • The created directive is not honoured, either when left as-is or when transferred to a more standard directive;
  • The created directive is “lost” when you next load winworkr’s directive management option. Clearly it lets you create directives that aren’t valid and that it subsequently won’t deal with.

Why does this suck? For a very important reason – in some situations you don’t want to back these savesets up, or you simply can’t. For instance, on certain OS levels and architectures in clustered configurations, you will get an error if you try to back up the ASR: saveset.

This creates a requirement to either:

  1. Accept that you’ll get an error every day in your backup report (completely unacceptable)
  2. Switch from exclusionary backups to inclusionary backups (highly unpalatable and risky)

Clearly, then, the option is the second, not the first. This, though, is effectively removing an error by introducing poor backup systems management into the environment.
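For completeness, a hedged sketch of what that inclusionary switch looks like – the client name and drive letters are hypothetical, and quoting/escaping of backslashes may differ depending on how you drive nsradmin:

nsradmin -s backupserver
nsradmin> . type: NSR client; name: winclust01.example.com
nsradmin> update save set: "C:\\", "D:\\", "SYSTEM STATE:", "SYSTEM FILES:", "SYSTEM DB:"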

It would be nice if this problem “went away”.

Sep 22, 2009
 

When it comes to backup and data protection, I like to think of myself as being somewhat of a stickler for accuracy. After all, without accuracy, you don’t have specificity, and without specificity, you can’t reliably say that you have what you think you have.

So, on the basis of wanting vendors to be more accurate, I really do wish vendors would stop talking about archive when they actually mean hierarchical storage management (HSM). It confuses journalists, technologists, managers and storage administrators, and (I must admit to some level of cynicism here) appears to be driven mainly by some thinking that “HSM” sounds either too scary or too complex.

HSM is neither scary nor complex – it’s just a variant of tiered storage, which is something that any site with 3+ TB of presented primary production data should be at least aware of, if not actively implementing and using. (Indeed, one might argue that HSM is the original form of tiered storage.)

By “presented primary production”, I’m referring to available-to-the-OS high speed, high cost storage presented in high performance LUN configurations. At this point, storage costs are high enough that tiered storage solutions start to make sense. (Bear in mind that 3+ TB of presented storage in such configurations may represent between 6 and 10TB of raw high speed, high cost storage. Thus, while it may not sound all that expensive initially, the disk-to-data ratio increases the cost substantially.) It should be noted that whether that tiering is done with a combination of different speeds of disks and levels of RAID, or with disk vs tape, or some combination of the two, is largely irrelevant to the notion of HSM.

Not only is HSM easy to understand and nothing to be afraid of, but the difference between HSM and archive is equally easy to understand. It can even be explained with diagrams.

Here’s what archive looks like:

The archive process and subsequent data access


So, when we archive files, we first copy them out to archive media, then delete them from the source. Thus, if we need to access the archived data, we must read it back directly from the archive media. There is no reference left to the archived data on the filesystem, and data access must be managed independently from previous access methods.

On the other hand, here’s what the HSM process looks like:

The HSM process and subsequent data access


So when we use HSM on files, we first copy them out to HSM media, then delete (or truncate) the original file but put in its place a stub file. This stub file has the same file name as the original file, and should a user attempt to access the stub, the HSM system silently and invisibly retrieves the original file from the HSM media, providing it back to the end user. If the user saves the file back to the same source, the stub is replaced with the original+updated data; if the user doesn’t save the file, the stub is left in place.

Or if you’re looking for an even simpler distinction: archive deletes, HSM leaves a stub. If a vendor talks to you about archive, but their product leaves a stub, you can know for sure that they actually mean HSM.

Honestly, these two concepts aren’t difficult, and they aren’t the same. In the never-ending quest to save bytes, you’d think vendors would appreciate that it’s cheaper to refer to HSM as HSM rather than Archive. Honestly, that’s a 4 byte saving alone, every time the correct term is used!

[Edit – 2009-09-23]

OK, so it’s been pointed out by Scott Waterhouse that the official SNIA definition for archive doesn’t mention having to delete the source files, so I’ll accept that I was being stubbornly NetWorker-centric in this article. I’ll accept that I’m wrong and (grudgingly, yes) be prepared to refer to HSM as archive. But I won’t like it. Is that a fair compromise? 🙂

I won’t give up on ILP though!

Sep 03, 2009
 

I’m a big fan of careful management of directives – for instance, I always go by the axiom that it’s better to backup a little bit too much and waste some tape than it is to not backup enough and not be able to recover.

That being said, I’m also a big fan of correct use of directives within NetWorker – skipping files that are 100% not required, adjusting how particular files are backed up (e.g., with logasm), etc., is quite important to getting well-running backups.

So needless to say it bugs the hell out of me that after all this time, you still can’t include a directive within a directive.

Or rather, you can, but it’s through a method called “copy and paste”, which as we know, doesn’t lend itself too well to auto updating functionality.

So the current directive format is:

<< path >>
[+]asm: criteria

For example, you might want directives for a system such as:

<< / >>
+skip: *.mp3

<< /home/academics >>
forget

<< /home/students >>
+skip: *.mov *.m4v *.wma *.dv

Now, it could be that for every Unix system, you also want to include Unix Standard Directives. Currently the only way to do this is to create a new directive where you’ve copied and pasted in all the Unix Standard Directives then added in your above criteria.

This, to use the appropriate technical term, is a dog’s breakfast.

The only logical way – the way which obviously hasn’t been developed yet for NetWorker, but which falls into the category of “why the hell not?” – would be support for include statements. That way, an existing directive could be embedded into another directive.

For example, what I’m talking about is that we should be able to do the following:

<< / >>
include: Unix Standard Directives

<< / >>
+skip: *.mp3

<< /home/academics >>
forget

<< /home/students >>
+skip: *.mov *.m4v *.wma *.dv

Now wouldn’t that be nice? Honestly, how hard could it be?

NB: The correct answer to “how hard could it be?” is actually “I don’t care.” That is, there are some things that should be done regardless of whether they’re easy to do.

Quibbles – Cloning and Staging

Aug 13, 2009
 

For the most part, cloning and staging within NetWorker are pretty satisfactory, particularly when viewed as a combination of automated and manual operations. However, one thing that constantly drives me nuts is the inane reporting of status for cloning and staging.

Honestly, how hard can it be to design cloning and staging to accurately report the following at all times:

Cloning X of Y savesets, W GB of Z GB

or

Staging X of Y savesets, W GB of Z GB

While there have been various updates to cloning and staging reporting, and sometimes it at least updates how many savesets it has done, it continually breaks when dealing with the total amount staged/cloned, in as much as the counter resets whenever the destination volume is changed.

Oh, and while I’m begging for this change, I’ll request one other: whenever cloning/staging occurs, include in daemon.raw the full list of ssid/cloneids that have been cloned or staged, and the client/saveset details for each one – not just minimal details when a failure occurs. It’s called auditing.
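In the meantime, a hedged workaround sketch for the audit gap – query the media database after the fact for recently created savesets that have more than one instance (the date expression is illustrative):

# Savesets created since yesterday that have more than one instance,
# with the volume each instance landed on.
mminfo -q "copies>1,savetime>=yesterday" \
       -r "client,name,ssid,cloneid,volume,sumsize"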
