Looking at the stats both for this new site and the previous site, I’ve compiled a list of the top 10 most-read articles on The NetWorker Blog for 2009. The top 3 of course match the three articles that routinely turn out to be the most popular in any given month, which says something about their relevance to the average NetWorker administrator.

(Note: I’ve excluded non-article pages from the top 10.)

Number 10 – Instantiating Savesets

The very first article on the blog, Instantiating Savesets detailed the importance of distinguishing between all instances of a saveset and a specific instance of a saveset.

This distinction between using just the saveset ID, and using a saveset ID/clone ID combination, becomes particularly important when staging from disk backup units. If clones exist and you stage using just the saveset ID, when NetWorker cleans up at the end of the staging operation it will remove the references to the clones as well as delete the original from the disk backup unit. (Something you really don’t want to have happen.)

Recommendation to EMC: Perhaps it would be worthwhile requiring a “-y” argument to nsrstage if staging savesets from disk backup units and specifying only the saveset ID.

Recommendation to NetWorker administrators: Always be careful when staging that you specify both the saveset and the clone ID.

Number 9 – Basics – Important mminfo fields

In May I wrote about a few key mminfo fields – notably:

  • savetime
  • sscreate
  • ssinsert
  • sscomp
  • ssaccess

Sadly, I didn’t get the result I wanted with EMC on ssaccess. The field is documented as being updated whenever a saveset fragment is accessed for backup and recovery, but the most I could get was an acknowledgement that it is currently broken, along with advice to lodge an RFE to get it fixed. (The alternative was to have the documentation changed to take out the reference to read operations – something I didn’t want to have happen!)

Recommendation to EMC: ssaccess would be a very useful mminfo field, particularly when analysing recovery statistics for NetWorker. Please fix it.

Number 8 – Basics – Listing files in a backup

Want to know what files were backed up as part of the creation of a saveset? If you do, you’re not unique – this has remained a very popular article since it was written in January.

Recommendation to EMC: This information can be retrieved via a combination of mminfo/nsrinfo, but it would be handy if NMC supported drilling down into a saveset to provide a file listing.

Number 7 – Using yum to install NetWorker on Linux

NetWorker’s need for dependency resolution on Linux for installation of the client packages in particular drew a lot of people to this article.

Number 6 – Basics – mminfo, savetime, and greater than/less than

This article explained why NetWorker uses the greater than and less than signs in mminfo in a way that newcomers to the product might find backwards. If you’re not aware of why mminfo works the way it does for specifying savetimes, you should be.

Number 5 – 7.5(.1) changed behaviour – deleting savesets from adv_file devices

This was a particularly unpleasant bug introduced in NetWorker 7.5, thankfully now resolved in the cumulative service releases and in NetWorker 7.6.

The gist of it is that in NetWorker 7.5/7.5.1 (aka 7.5 SP1), deleting a saveset on a disk backup unit caused a serious failure: from that point on, NetWorker would have issues cleaning regular expired savesets from the disk backup unit, and would insist that the disk backup unit had major issues. The primary error would manifest as:

nsrd adv_file warning: Failed to fetch the saveset(ss_t) structure for ssid 1890993582

This was fixed in 7.5.1.2, thankfully.

Recommendation to EMC: Never let this bug see the light of day again, please. (So far you’re doing an excellent job, by the way.)

Number 4 – NetWorker 7.5.1 Released

I’ve recently noticed a disturbing trend among many vendors, EMC included, where once a new release is made of a product, sales and account staff become overly enthusiastic about recommending new releases. This comes on top of not really having any technical expertise. (Please be patient, I’m trying to put this as diplomatically as possible.)

One of the worst instances I’ve seen of this in the last few years was the near-hysterical pumping of 7.5 thanks to some useful features to do with virtualisation in particular. I’ll admit that my articles on the integration between Oracle Module 5 and NetWorker 7.5, as well as Probe Based Backups may have added to this. However, there was somewhat of a stampede to 7.5 when it came out, and consequently, when it had some issues, there was strong enthusiasm for the release of 7.5.1.

This is why, by the way, IDATA maintains for its support customers a recommended versions list that is not automatically updated when new versions of products come out.

Recommendation to EMC: Remind your sales staff that existing users already have the product, and not to just go blindly convincing them to upgrade. Otherwise you’ll eventually start sounding like this.

Number 3 – Carry a jukebox with you (if you’re using Linux)

During 2009, Mark Harvey’s LinuxVTL project first got the open source LinuxVTL working with NetWorker in a single drive configuration, then eventually, in multi-drive configurations. (Mark assures me, by the way, that patches are coming real soon to allow multiple robots on the same storage node/server.)

Lesson for me: With the LinuxVTL configured on multiple lab servers in my environment, I’ve really taken to VTLs this year, and considerably changed my attitude on using them. (I’ll say again: I still resent that they’re needed, but I now respect them a lot more than I previously did.)

Lesson for others: Even Mark himself says that the open source VTL shouldn’t be used for production backups. Don’t be cheap with your backup system: this is an excellent tool for lab setups, training, diagnostics, etc., but it is not a replacement for a production-ready VTL system. If you want a VTL, buy a VTL.

Number 2 – Basics – Parallelism in NetWorker

Some would say that the high popularity of an article about parallelism in NetWorker indicates that it’s not sufficiently documented.

I’m not entirely convinced that’s the case. But it does go to show that it’s an important topic when it comes to performance tuning, and summary articles about how the various types of parallelism interact are obviously popular.

Lesson for everyone: Now that the performance tuning guide has been updated and made more relevant in NetWorker 7.6, I’d recommend that people wanting an official overview of some of the parallelism options check that out in addition to the article above.

Number 1 – Basics – Fixing “NSR peer information” errors

Goodness, this was a popular article in 2009 – detailing how to fix the “NSR peer information” errors that can come up from time to time in the NetWorker logs. If you’re not familiar with this error yet, it’s likely that as a NetWorker administrator you will eventually see an error such as:

39078 02/02/2009 09:45:13 PM  0 0 2 1152952640 5095 0 nox nsrexecd SYSTEM error: There is already a machine using the name: “faero”. Either choose a different name for your machine, or delete the “NSR peer information” entry for “faero” on host: “nox”

Recommendation for EMC: Users shouldn’t really need to be Googling for a solution to this problem. Let’s see an update to NetWorker Management Console where these errors/warnings are reported in the monitoring log, with the administrator being able to right click on them and choose to clear the peer information after confirming that they’re confident no nefarious activity is happening.

Wrapping Up

I have to say, it was a fantastically satisfying year writing the blog, and I’m looking forward to seeing what 2010 brings in terms of most useful articles.

 

So this morning I was looking through the stats for this blog, and I generated the list of most popular posts thus far. I can’t say any of the results surprised me. Every single one of the top 5 comes from the “Basics” series.

Number 5 on that list was Basics – Listing Files in a backup. There are a lot of people out there who want to know how to use nsrinfo in general, and specifically want to know about pulling file lists for savesets. Net result? I think it would be greatly beneficial if in NMC users could double-click on browsable savesets and get a complete listing of files therein.

Number 4 was Basics – mminfo, savetime and greater than/less than. Now, I’m not going to pretend that every person who visited that article was looking for details about how greater than and less than work in mminfo in relation to savetimes, though I suspect a reasonable percentage of people new to mminfo found that interesting. My take on it is that it proves there’s not really enough documentation about mminfo, and that mminfo needs some expansion. My personal preference? Having a full SQL-like query engine for mminfo would greatly expand the options available to NetWorker administrators.

Number 3 on the list is Basics – Changing saveset browse/retention times. As regularly as possible I try to check the search strings that have brought people to my blog (as recorded by WordPress), and I can practically guarantee that every day there are multiple combinations to do with savesets, browse and retention times. Sometimes those combinations reference nsrmm, sometimes they don’t. Clearly, extending saveset browse/retention times in NetWorker needs to be more manageable from within the GUI as a bare minimum. I’ll get to the command line in a moment.

Moving on to number 2, we have something that I get search results for every day without fail. That’s Basics – Fixing “NSR Peer information” errors. It’s actually a reasonably simple error to fix, but sometimes finding the information about it is a bit like the old needle-in-a-haystack. I’m hoping that the posting on it has helped quite a few sites to clear out the warnings/errors in their logs and reduce the amount of clutter being reported.

Finally, for number 1, a topic I’m completely unsurprised to see at the top, we have Basics – Parallelism in NetWorker. Not because it’s difficult, but because there are no absolute rules, parallelism is a topic in NetWorker that many administrators, regardless of length of time with the product, find challenging at times. Set too low, and backups may overrun. Set too high, and device contention, client slow-downs, recovery performance issues, etc., may come into play. Tuning parallelism in NetWorker has to take a lot into account.

The content of this list suggests a few things to me:

  • None of this information is out of reach in the product manuals, but since the product manuals are (necessarily) lengthy, it is logistically out of reach for a lot of users who don’t have time to read them.
  • EMC product management could take a few tips from the top 5 articles on my blog – I think they represent areas where the usability of the product could be improved. While parallelism is not something that can be “solved” by changes within the GUI (it is, by necessity, complex), other options, such as improving mminfo search, making saveset contents more accessible within the GUI, etc., are readily fixable.
  • It seems there might be scope for a “Getting Started with NetWorker” style manual. I think a traditional book would (a) be too expensive and (b) be unsuitable. This is the sort of information that people want readily to hand on their desktops.

On the last point, I’m interested in writing such a manual. I obviously have some experience with writing – but more so than just the book, over the years I’ve written literally thousands of pages of NetWorker instructions as part of professional services documentation, training courses, etc.

So here’s a question – would people be interested in, say, an eBook along the lines of “Getting Started with NetWorker” that gives basic operational and usage instructions, so that rather than having to wade through the (1000 or so) pages of the official documentation they had something shorter, and geared towards day-to-day operation?

Let me know what you think.

 

Most NetWorker administrators with even a passing familiarity with mminfo will be aware of the “savetime” field, which reports when a saveset was created (i.e., when the backup was taken).

There are however some other fields that also provide additional date/time details about savesets, and knowing about them can be a real boon. Here’s a quick summary of the important date/time fields that provide information about savesets:

  • savetime – The time/date of the backup, on the client.
  • sscreate – The time/date of the backup, on the server.
  • ssinsert – The time/date on the server of the last time the saveset was inserted into the media database.
  • sscomp – The time/date that the backup completed*.
  • ssaccess – The date/time that the backup was last accessed for backup or recovery purposes**.

Now, remembering that we can append, in the report specifications, a field length to any field, we can get some very useful information out of the media database for savesets. For instance, to see when the backups started and stopped for a volume, you might run:

[root@nox ~]# mminfo -q "volume=ISO_Archive.001" -r "name,savetime(23),sscomp(23)"
 name                               date     time          ss completed
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 12:34:18 PM

So, not only do we have the date, but also the time of both the start and the finish of the backup.
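If you’d rather have the elapsed time than eyeball two timestamps, the subtraction is easy to script. Here’s a minimal sketch in Python. It assumes the timestamps are in the MM/DD/YYYY, 12-hour format shown above (your locale may print them differently) and that both values come from the same clock (see the footnote on sscomp). The helper name backup_duration is made up for illustration.

```python
from datetime import datetime

# Timestamp layout as printed in the mminfo output above
# (assumed here to be MM/DD/YYYY with a 12-hour clock).
FMT = "%m/%d/%Y %I:%M:%S %p"

def backup_duration(savetime: str, sscomp: str):
    """Return sscomp minus savetime as a timedelta."""
    return datetime.strptime(sscomp, FMT) - datetime.strptime(savetime, FMT)

# Values copied from the savetime and "ss completed" columns above.
elapsed = backup_duration("05/06/2009 08:38:40 AM", "05/06/2009 12:34:18 PM")
print(elapsed)  # 3:55:38
```

So the backup in the example ran for a little under four hours.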

To compare the client savetime with the server savetime, we’d use the sscreate field:

[root@nox ~]# mminfo -q "volume=ISO_Archive.001" -r "name,savetime(23),sscreate(23)"
 name                               date     time           ss created
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:38:42 AM

Note that in this second example there was a 2-second skew between the backup server and the client at the time the backup was run.

I’ll leave ssinsert as an exercise to the reader – if you’ve got any recently scanned in savesets, give it a try and compare it against the output from sscreate and savetime.

However, moving on to the last field I mentioned, ssaccess, we get some very interesting results. Let’s see the output from:

[root@nox ~]# mminfo -q "volume=ISO_Archive.001,name=/d/03/share-a/ISO" -r "name,savetime(23),ssaccess(23)"
 name                               date     time            ss access
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:42:25 AM
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:43:31 AM
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:46:00 AM
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:48:12 AM
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:49:04 AM
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:49:55 AM
/d/03/share-a/ISO               05/06/2009 08:38:40 AM 05/06/2009 08:54:10 AM
(snip)

Now, if you’ve been following the thread, the above doesn’t immediately appear to make sense. On that volume there’s only one saveset, so why are we suddenly getting entries for what appears to be multiple savesets? Well, they’re not multiple savesets – let’s try it again with SSID, rather than name:

[root@nox ~]# mminfo -q "volume=ISO_Archive.001,name=/d/03/share-a/ISO" -r "ssid,savetime(23),ssaccess(23)"
 ssid           date     time            ss access
67158002    05/06/2009 08:38:40 AM 05/06/2009 08:42:25 AM
67158002    05/06/2009 08:38:40 AM 05/06/2009 08:43:31 AM
67158002    05/06/2009 08:38:40 AM 05/06/2009 08:46:00 AM
67158002    05/06/2009 08:38:40 AM 05/06/2009 08:48:12 AM
67158002    05/06/2009 08:38:40 AM 05/06/2009 08:49:04 AM
67158002    05/06/2009 08:38:40 AM 05/06/2009 08:49:55 AM
67158002    05/06/2009 08:38:40 AM 05/06/2009 08:54:10 AM
(snip)

An astute reader may think I’ve got some problem with my media database at this point – only one instance of a saveset can ever appear on the same volume, so the above looks like it simply shouldn’t happen.

Here’s where it gets really interesting though. NetWorker writes savesets in fragments, and each fragment of the saveset is generated and may be accessed separately – therefore, mminfo is reporting the access time for each fragment of the saveset. We can fully see this by expanding what we’re asking mminfo to report – including fragsize, mediafile and mediarec.

[root@nox 02]# mminfo -q "volume=ISO_Archive.001" -r "savetime(23),ssaccess(23),fragsize,mediafile,mediarec"
 date     time            ss access          size file  rec
 05/06/2009 08:38:40 AM 05/06/2009 08:42:25 AM 1040 MB   2    0
 05/06/2009 08:38:40 AM 05/06/2009 08:43:31 AM 1040 MB   3    0
 05/06/2009 08:38:40 AM 05/06/2009 08:46:00 AM 1040 MB   4    0
 05/06/2009 08:38:40 AM 05/06/2009 08:48:12 AM 1040 MB   5    0
 05/06/2009 08:38:40 AM 05/06/2009 08:49:04 AM 1040 MB   6    0
 05/06/2009 08:38:40 AM 05/06/2009 08:49:55 AM 1040 MB   7    0

Now, the man page for mminfo says that the ssaccess time is updated for both backup and recovery operations, but despite various recovery tests I can’t yet get it to update. Even so, the field is still useful – it allows us to tell how long each fragment took to back up, which lets us interrogate, at a later point, whether there were any pauses or significant delays in the data stream.
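As a rough sketch of that idea, the Python below takes the ssaccess values from the fragment listing above and prints the gap between each one. It assumes (the man page doesn’t confirm this) that a fragment’s ssaccess value marks roughly when that fragment finished writing. It also measures the first fragment from the savetime, which is client time and so may carry a couple of seconds of skew. The name fragment_gaps is purely illustrative.

```python
from datetime import datetime

# Timestamp layout as printed by mminfo above (assumed MM/DD/YYYY, 12-hour clock).
FMT = "%m/%d/%Y %I:%M:%S %p"

# ssaccess values copied from the fragment listing, in mediafile order (2 to 7).
ssaccess = [
    "05/06/2009 08:42:25 AM",
    "05/06/2009 08:43:31 AM",
    "05/06/2009 08:46:00 AM",
    "05/06/2009 08:48:12 AM",
    "05/06/2009 08:49:04 AM",
    "05/06/2009 08:49:55 AM",
]

def fragment_gaps(savetime, stamps):
    """Time between consecutive stamps, with the savetime as the starting point."""
    times = [datetime.strptime(s, FMT) for s in [savetime] + stamps]
    return [later - earlier for earlier, later in zip(times, times[1:])]

gaps = fragment_gaps("05/06/2009 08:38:40 AM", ssaccess)
for mediafile, gap in enumerate(gaps, start=2):
    print(f"mediafile {mediafile}: {gap}")

# The largest gap is the first place to look for a stall in the data stream.
slowest = max(gaps)
print("longest gap:", slowest)
```

With the sample data, the fragments (all 1040 MB) take between 51 seconds and 3 minutes 45 seconds each, which is exactly the kind of variation you’d want to investigate.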

Regardless of the little discrepancy with ssaccess, you can see that there’s a great set of options available to retrieve additional date/time related details about savesets using mminfo.

(I’ve currently got a case open with EMC to determine whether ssaccess should be updated on recovery attempts, or whether the documentation has an error. I’ll update this posting once I find out.)


* The man page for mminfo does not document whether this is server time or client time. I assume, given that savetime is client time, that sscomp is also client time.

** The man page for mminfo does not document whether this is server time or client time. I assume that it’s in server time.

 

Take a basic mminfo query, add someone not familiar with how NetWorker stores and works with dates/times, and you have instant chaos*. In this post I want to help people who are just starting out with mminfo understand how it works with dates.

So let’s look at a basic query that tends to cause a lot of confusion:

# mminfo -q "client=archon,savetime<=2 weeks ago"

As a long-term NetWorker user, implementation consultant and support consultant, not to mention a long-term member of the NetWorker mailing list, I see this question come up fairly frequently. The output appears “broken” – rather than the backups from the last two weeks, which is what “savetime less than or equal to two weeks ago” suggests to a newcomer, we instead get all backups for the client archon that are two weeks old or older.

‘Huh?’ I hear you ask.

Indeed, this is oft-used as an example of how “broken” NetWorker is. In fact, the real state is far more prosaic.

NetWorker stores and works with times as seconds since the(/an) epoch. When you supply dates to NetWorker – either in the fuzzy format above, or as a literal date string – it converts that date into a timestamp of seconds since the(/an) epoch. (If you want to see a savetime in seconds rather than as an interpreted date, you can do so at any time in mminfo by using ‘nsavetime’ in the report specification.)

So if you then think of the query:

# mminfo -q "client=archon,savetime<=2 weeks ago"

It has a different meaning. You’re actually asking NetWorker:

  • Convert ‘2 weeks ago’ into a timestamp in seconds, counted back from ‘now’. Let’s call that Z.
  • Give me all the backups for the client ‘archon’ where the savetime is less than or equal to Z.
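The two steps above can be sketched in a few lines of Python. This is only a model of the comparison, not NetWorker’s own parser: it assumes the Unix epoch, and the “now” value and the savetimes are hypothetical, as is the helper name cutoff_epoch.

```python
from datetime import datetime, timedelta, timezone

def cutoff_epoch(now: datetime, weeks: int) -> int:
    """Seconds since the Unix epoch for the moment `weeks` weeks before `now`."""
    return int((now - timedelta(weeks=weeks)).timestamp())

# Hypothetical "now" and savetimes, purely for illustration.
now = datetime(2009, 5, 20, 9, 0, 0, tzinfo=timezone.utc)
savetimes = {
    "three weeks old": datetime(2009, 4, 29, 9, 0, 0, tzinfo=timezone.utc),
    "two days old": datetime(2009, 5, 18, 9, 0, 0, tzinfo=timezone.utc),
}

z = cutoff_epoch(now, 2)  # step 1: '2 weeks ago' becomes a number, Z

# Step 2: "savetime<=Z" keeps the smaller numbers, i.e. the OLDER backups.
matches_le = [name for name, t in savetimes.items() if int(t.timestamp()) <= z]
# ">" would keep the larger numbers, i.e. the NEWER backups.
matches_gt = [name for name, t in savetimes.items() if int(t.timestamp()) > z]

print(matches_le)  # ['three weeks old']
print(matches_gt)  # ['two days old']
```

An older backup has a smaller timestamp, so “less than or equal to” selects the older backups, not the recent ones.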

If you don’t like to think of it as all referring to seconds since an epoch, there’s another, perhaps simpler way of thinking about it – that being:

  • Treat “<” as meaning before.
  • Treat “>” as meaning after.

Thus, in this scenario, the query:

# mminfo -q "client=archon,savetime<=2 weeks ago"

Can be interpreted to mean, “give me all backups of the client archon taken two weeks ago or earlier”.

You’re obviously welcome to use whichever interpretation you feel makes more sense – seconds/math or before/after – it doesn’t really matter which. Once you get the hang of this, though, mminfo will make a lot more sense.

* I’ve unfortunately seen someone who got < and > wrong (and didn’t check their results) relabel all tapes in a tape library that had backups younger than 3 months, rather than older than 3 months. Hence, ‘chaos’ is an appropriate term.

© 2012 The NetWorker Blog