Looking at the stats both for this new site and the previous site, I’ve compiled a list of the top 10 most-read articles on The NetWorker Blog for 2009. The top 3, of course, match the three articles that routinely turn out to be the most popular in any given month, which says something about their relevance to the average NetWorker administrator.

(Note: I’ve excluded non-article pages from the top 10.)

Number 10 – Instantiating Savesets

The very first article on the blog, Instantiating Savesets detailed the importance of distinguishing between all instances of a saveset and a specific instance of a saveset.

This distinction between using just the saveset ID, and using a saveset ID/clone ID combination becomes particularly important when staging from disk backup units. If clones exist and you stage using just the saveset ID, when NetWorker cleans up at the end of the staging operation it will remove reference to the clones as well as deleting the original from the disk backup unit. (Something you really don’t want to have happen.)

Recommendation to EMC: Perhaps it would be worthwhile requiring a “-y” argument to nsrstage if staging savesets from disk backup units and specifying only the saveset ID.

Recommendation to NetWorker administrators: Always be careful when staging that you specify both the saveset and the clone ID.
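
As a rough illustration of that habit, here is a minimal Python sketch of a pre-flight check that could sit in front of a staging script. The helper names are hypothetical and not part of NetWorker, and the clone ID in the usage example is made up. It only tests that each target is written in the saveset ID/clone ID form rather than as a bare saveset ID.

```python
import re

# Hypothetical helper, not part of NetWorker: a staging target is treated as
# "specific" only when it is written as <saveset ID>/<clone ID>.
_INSTANCE = re.compile(r"^\d+/\d+$")


def is_specific_instance(target: str) -> bool:
    """True when the target names one instance (ssid/cloneid), not all of them."""
    return bool(_INSTANCE.match(target.strip()))


def bare_saveset_targets(targets):
    """Return the targets that give only a saveset ID, i.e. every instance."""
    return [t for t in targets if not is_specific_instance(t)]


# The clone ID below is an invented value, used purely for illustration.
risky = bare_saveset_targets(["1890993582/1230000000", "1890993582"])
print(risky)  # ['1890993582']
```

Anything this check returns would be worth a second look before it is handed to a staging operation.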

Number 9 – Basics – Important mminfo fields

In May I wrote about a few key mminfo fields – notably:

  • savetime
  • sscreate
  • ssinsert
  • sscomp
  • ssaccess

Sadly, I didn’t get the result I wanted with EMC on ssaccess. Documented as being updated whenever a saveset fragment is accessed for backup and recovery, the most I could get was an acknowledgement that it was currently broken and to lodge an RFE to get it fixed. (The alternative was to have the documentation changed to take out reference to read operations – something I didn’t want to have happen!)

Recommendation to EMC: ssaccess would be a particularly useful mminfo field, particularly when analysing recovery statistics for NetWorker. Please fix it.

Number 8 – Basics – Listing files in a backup

Want to know what files were backed up as part of the creation of a saveset? If you do, you’re not unique – this has remained a very popular article since it was written in January.

Recommendation to EMC: This information can be retrieved via a combination of mminfo/nsrinfo, but it would be handy if NMC supported drilling down into a saveset to provide a file listing.

Number 7 – Using yum to install NetWorker on Linux

NetWorker’s need for dependency resolution on Linux for installation of the client packages in particular drew a lot of people to this article.

Number 6 – Basics – mminfo, savetime, and greater than/less than

This article explained why NetWorker uses the greater than and less than signs in mminfo in a way that newcomers to the product might find backwards. If you’re not aware of why mminfo works the way it does for specifying savetimes, you should be.

Number 5 – 7.5(.1) changed behaviour – deleting savesets from adv_file devices

This was a particularly unpleasant bug introduced into NetWorker 7.5, thankfully resolved now in the cumulative service releases and NetWorker 7.6.

The gist of it is that in NetWorker 7.5/7.5.1 (aka 7.5 SP1), if you deleted a saveset on a disk backup unit, NetWorker would suffer a serious failure where it would from that point have issues cleaning regular expired savesets from the disk backup unit and insist that the disk backup unit had major issues. The primary error would manifest as:

nsrd adv_file warning: Failed to fetch the saveset(ss_t) structure for ssid 1890993582

This was fixed in 7.5.1.2, thankfully.

Recommendation to EMC: Never let this bug see the light of day again, please. (So far you’re doing an excellent job, by the way.)

Number 4 – NetWorker 7.5.1 Released

I’ve recently noticed a disturbing trend among many vendors, EMC included, where once a new release is made of a product, sales and account staff become overly enthusiastic about recommending new releases. This comes on top of not really having any technical expertise. (Please be patient, I’m trying to put this as diplomatically as possible.)

One of the worst instances I’ve seen of this in the last few years was the near-hysterical pumping of 7.5 thanks to some useful features to do with virtualisation in particular. I’ll admit that my articles on the integration between Oracle Module 5 and NetWorker 7.5, as well as Probe Based Backups may have added to this. However, there was somewhat of a stampede to 7.5 when it came out, and consequently, when it had some issues, there was strong enthusiasm for the release of 7.5.1.

This is why, by the way, IDATA maintains for its support customers a recommended versions list that is not automatically updated when new versions of products come out.

Recommendation to EMC: Remind your sales staff that existing users already have the product, and not to just go blindly convincing them to upgrade. Otherwise you’ll eventually start sounding like this.

Number 3 – Carry a jukebox with you (if you’re using Linux)

During 2009, Mark Harvey’s LinuxVTL project first got the open source LinuxVTL working with NetWorker in a single drive configuration, then eventually, in multi-drive configurations. (Mark assures me, by the way, that patches are coming real soon to allow multiple robots on the same storage node/server.)

Lesson for me: With the LinuxVTL configured on multiple lab servers in my environment, I’ve really taken to VTLs this year, and considerably changed my attitude on using them. (I’ll say again: I still resent that they’re needed, but I now respect them a lot more than I previously did.)

Lesson for others: Even Mark himself says that the open source VTL shouldn’t be used for production backups. Don’t be cheap with your backup system: this is an excellent tool for lab setups, training, diagnostics, etc., but it is not a replacement for a production-ready VTL system. If you want a VTL, buy a VTL.

Number 2 – Basics – Parallelism in NetWorker

Some would say that the high popularity of an article about parallelism in NetWorker indicates that it’s not sufficiently documented.

I’m not entirely convinced that’s the case. But it does go to show that it’s an important topic when it comes to performance tuning, and summary articles about how the various types of parallelism interact are obviously popular.

Lesson for everyone: Now that the performance tuning guide has been updated and made more relevant in NetWorker 7.6, I’d recommend that people wanting an official overview of some of the parallelism options check that out in addition to the article above.

Number 1 – Basics – Fixing “NSR peer information” errors

Goodness, this was a popular article in 2009 – detailing how to fix the “NSR peer information” errors that can come up from time to time in the NetWorker logs. If you’re not familiar with this error yet, it’s likely that as a NetWorker administrator you will eventually see an error such as:

39078 02/02/2009 09:45:13 PM  0 0 2 1152952640 5095 0 nox nsrexecd SYSTEM error: There is already a machine using the name: “faero”. Either choose a different name for your machine, or delete the “NSR peer information” entry for “faero” on host: “nox”

Recommendation for EMC: Users shouldn’t really need to be Googling for a solution to this problem. Let’s see an update to NetWorker Management Console where these errors/warnings are reported in the monitoring log, with the administrator being able to right click on them and choose to clear the peer information after confirming that they’re confident no nefarious activity is happening.

Wrapping Up

I have to say, it was a fantastically satisfying year writing the blog, and I’m looking forward to seeing what 2010 brings in terms of most useful articles.

 

So this morning I was looking through the stats for this blog, and I generated the list of most popular posts thus far. I can’t say any of the results surprised me. Every single one of the top 5 comes from the “Basics” series.

Number 5 on that list was Basics – Listing Files in a backup. There are a lot of people out there who want to know how to use nsrinfo in general, and specifically want to know about pulling file lists for savesets. Net result? I think it would be greatly beneficial if in NMC users could double-click on browsable savesets and get a complete listing of files therein.

Number 4 was Basics – mminfo, savetime and greater than/less than. Now, I’m not going to pretend that every person who visited that article was looking for details about how greater than and less than works in mminfo in relation to savetimes, though I suspect a reasonable percentage of people new to mminfo found that interesting. My take on it is that it proves there’s not really enough documentation about mminfo, and that mminfo needs some expansion. My personal preference? Having a full SQL-like query engine for mminfo would greatly expand the options available to NetWorker administrators.

Number 3 on the list is Basics – Changing saveset browse/retention times. As regularly as possible I try to check the search strings that have brought people to my blog (as recorded by wordpress), and I can practically guarantee that every day there are multiple combinations to do with savesets, browse and retention times. Sometimes those combinations reference nsrmm, sometimes they don’t. Clearly, extending saveset browse/retention times in NetWorker needs to be more manageable from within the GUI as a bare minimum. I’ll get to the command line in a moment.

Moving on to number 2, we have something that I get search results for every day without fail. That’s Basics – Fixing “NSR Peer information” errors. It’s actually a reasonably simple error to fix, but sometimes finding the information about it is a bit like the old needle-in-a-haystack. I’m hoping that the posting on it has helped quite a few sites to clear out the warnings/errors in their logs and reduce the amount of clutter being reported.

Finally, for number 1, a topic I’m completely unsurprised to see at the top, we have Basics – Parallelism in NetWorker. Not because it’s difficult, but because there are no absolute rules, parallelism is a topic in NetWorker that many administrators, regardless of length of time with the product, find challenging at times. Set too low, and backups may overrun. Set too high, and device contention, client slow-downs, recovery performance issues, etc., may come into play. Tuning parallelism in NetWorker has to take a lot into account.

The content of this list suggests a few things to me:

  • None of this information is out of reach in the product manuals, but, since the product manuals are (necessarily) lengthy, it is logistically out of reach for a lot of users who don’t have time to read lengthy manuals.
  • EMC product management could take a few tips from the top 5 articles on my blog – I think they represent areas where the usability of the product could be improved. While parallelism is not something that can be “solved” by changes within the GUI (it is, by necessity, complex), other options, such as improving mminfo search, making saveset contents more accessible within the GUI, etc., are readily fixable.
  • It seems there might be scope for a “Getting Started with NetWorker” style manual. I think a traditional book would (a) be too expensive and (b) be unsuitable. This is the sort of information that people want readily to hand on their desktops.

On the last point, I’m interested in writing such a manual. I obviously have some experience with writing – but more so than just the book, over the years I’ve written literally thousands of pages of NetWorker instructions as part of professional services documentation, training courses, etc.

So here’s a question – would people be interested in, say, an eBook along the lines of “Getting Started with NetWorker” that gives basic operational and usage instructions, so that rather than having to wade through the (close to 1000+) pages of the official documentation they had something shorter, geared towards day to day operation?

Let me know what you think.

 

If you’re backing up Oracle with the NetWorker module/RMAN, there are an extremely large number of options you can choose from. RMAN, after all, is a complete backup/recovery system in and of itself, and so when you combine RMAN and NetWorker you, well, find yourself swimming in options.

One such option is the allocate channel command within RMAN. If you’ve not seen a basic RMAN script before, I should put one here for your reference:

connect target rman/supersecretpassword@DB10;

run {
 allocate channel t1 type 'SBT_TAPE';
 send 'NSR_ENV=(NSR_SAVESET_EXPIRATION=14 days,
       NSR_SERVER=nox,NSR_DATA_VOLUME_POOL=Daily)';

 backup format '/%d_%p_%t.%s/'
 (database);

 backup format '/%d_%p_%t_al.%s/'
 (archivelog from time 'SYSDATE-2');

 release channel t1;
}

You’ll note that one of the first commands used in the script is the allocate channel command. This effectively tells RMAN to open up a line of communication with NetWorker. Now, you can consider an RMAN channel to be a unit of parallelism in NetWorker parlance. Thus, if you want to back up (larger) databases with higher levels of parallelism, you need to allocate more channels.

In many NetWorker/Oracle scenarios, the NetWorker administrator has very little, if any, control over the construction and the configuration of the RMAN script. (The introduction of v5 of the module may change this.)

As a consequence, there’s often a reduced level of communication between the NetWorker administrator and the Oracle DBA which can result in reduced performance or scheduling conflicts. One particular issue that can occur though is that the Oracle DBA, eager to have the database backed up as quickly as possible, will throw a lot of allocate channel commands in. That little script above may become something such as say:

connect target rman/supersecretpassword@DB10;

run {

 allocate channel t1 type 'SBT_TAPE';
 allocate channel t2 type 'SBT_TAPE';
 allocate channel t3 type 'SBT_TAPE';
 allocate channel t4 type 'SBT_TAPE';
 allocate channel t5 type 'SBT_TAPE';
 allocate channel t6 type 'SBT_TAPE';
 allocate channel t7 type 'SBT_TAPE';
 allocate channel t8 type 'SBT_TAPE';

 send 'NSR_ENV=(NSR_SAVESET_EXPIRATION=14 days,
       NSR_SERVER=nox,NSR_DATA_VOLUME_POOL=Daily)';

 backup filesperset 4
 format '/%d_%p_%t.%s/'
 (database);

 backup format '/%d_%p_%t_al.%s/'
 (archivelog from time 'SYSDATE-2');

 release channel t1;
 release channel t2;
 release channel t3;
 release channel t4;
 release channel t5;
 release channel t6;
 release channel t7;
 release channel t8;
}

However, there’s a catch to lots of channels being allocated – channel allocation has no bearing on, nor is it in any way impacted by, NetWorker client parallelism. You see, the NetWorker client instance has a single saveset – the RMAN script name (or equivalent thereof, when using the Wizard in v5). To NetWorker, then, any Oracle client instance only has one saveset. Client parallelism therefore does not affect the number of channels that can be allocated, but instead the number of simultaneous instances of the client that can be initiated.

The net result? Consider a client with parallelism of 4, that has 6 databases to be backed up. This would have 6 client instances, one per database. Assuming they’re all in the same group*, then at any one time NetWorker will only allow the backup for 4 of those instances to be running. However, each instance, or each Oracle RMAN script, can start as many channels as it wants. If each RMAN script has been “tweaked” to allocate say, 8 channels like the above script example, this would mean that backing up 4 instances simultaneously would potentially see the client trying to send 32 savesets simultaneously to NetWorker.

Thus, if using multiple Oracle channels in RMAN backups with NetWorker, and particularly if backing up multiple Oracle databases simultaneously, it’s very important for the NetWorker administrator and the DBA responsible for the RMAN scripts to communicate effectively and plan overall levels of parallelism/number of channels to avoid swamping the NetWorker server, swamping the network, or swamping the Oracle server.
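
To make the arithmetic concrete, here is a small Python sketch of the worst case described above. The function name is my own invention, and the model is deliberately simple: it assumes client parallelism caps only the number of RMAN scripts running at once, and that the scripts with the most channels happen to run together.

```python
def peak_sessions(client_parallelism, channels_per_script):
    """Worst-case simultaneous savesets sent by one Oracle client.

    channels_per_script holds one channel count per database instance.
    Client parallelism limits how many scripts run at once, not how many
    channels each script allocates.
    """
    # Assume the scripts with the most channels are the ones running together.
    running = sorted(channels_per_script, reverse=True)[:client_parallelism]
    return sum(running)


# Client parallelism of 4, six databases, 8 channels allocated per script.
print(peak_sessions(4, [8] * 6))  # 32
```

Feeding in the real channel counts from each RMAN script gives the DBA and the NetWorker administrator a shared number to plan around.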


* There are other considerations for starting multiple Oracle backups on the same machine and at the same time. In other words I’m not necessarily calling this best practice, just using an example.

 

While doing a few tests for this blog on a lab server, I noticed what looked like odd behaviour – I had started a manual save running on the NetWorker server for local data. That backup was writing to tape, and while it was going I kicked off a group for an altogether different client.

The backup for the client ran, but then seemed to hang on completion. As the backup-to-tape was merely to test filling tape, and therefore could be restarted at any time, I cancelled out on a hunch, and the savegroup completed almost immediately.

It was “hung” waiting for a free unit of parallelism for the NetWorker server in order to write the client indices. It turned out that I’d forgotten a change I’d made on Friday to test some other settings – that change being to reduce the parallelism of the client instance of the NetWorker server to 1.

With this in place, the backup server couldn’t complete the savegroup because it couldn’t write its indices, and it couldn’t write its indices because it was only allowed a client parallelism of 1, and that unit of parallelism was occupied writing to tape.

So it led me to think – how easy would it be, given this, for companies to experience delays in their backups due to too low a setting for client parallelism for the NetWorker server? The answer – quite easy. After all, the first, most golden rule of client performance tuning on NetWorker is to eliminate client parallelism, to reduce it to 1, and work your way up based on client hardware and data configuration.

This means that it’s actually fairly critical that the NetWorker server have sufficient parallelism to ensure that index backups do not become an impediment to groups finishing. Based on this I’d recommend aiming for client parallelism for the NetWorker server to:

  • Never be set to 1.
  • For small environments (under 30 servers) be set to at least 4.
  • For medium environments (say, 31-100) be set to at least 8.
  • For larger environments (100+), be set to at least 8, but preferably one of:
    • The same as the actual server parallelism, or
    • The same as the highest group parallelism, if group parallelism is used.
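
The rules of thumb above can be sketched as a small Python function. This is only an illustration of my own recommendations, not anything NetWorker provides. The function name is made up, I’ve assumed an environment of exactly 30 servers counts as small, and for large environments I’ve assumed that group parallelism takes precedence over server parallelism whenever it is in use.

```python
def server_client_parallelism(server_count,
                              server_parallelism=None,
                              highest_group_parallelism=None):
    """Suggested minimum client parallelism for the NetWorker server itself."""
    if server_count <= 30:       # small environment (30 itself is an assumption)
        return 4
    if server_count <= 100:      # medium environment
        return 8
    # Large environment: at least 8, preferably matched to group or server
    # parallelism (assumed precedence: group parallelism first, if used).
    if highest_group_parallelism is not None:
        return max(8, highest_group_parallelism)
    if server_parallelism is not None:
        return max(8, server_parallelism)
    return 8


print(server_client_parallelism(20))                          # 4
print(server_client_parallelism(250, server_parallelism=64))  # 64
```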

Note that the above entirely assumes that the backup server is a dedicated backup server. If the backup server is also say, a file server*, then obviously different settings will need to be considered to avoid swamping the system.

In essence, while the main goal for regular clients is to achieve as low a client parallelism as possible (i.e., to optimise the balance between number of savesets and throughput), for the backup server the goal should be to have as high a client parallelism as necessary to ensure that index backups are not delayed, so that groups finish when they are ready to finish.


* For what it’s worth, my recommendation is that 100% of the time, a backup server should be dedicated. That is, the primary and sole function of the server is to act as a backup server.

 

Parallelism in NetWorker is effectively multiplexing by another name. There are three areas where you have traditionally been able to set this:

  • Client parallelism – how many savesets a client can simultaneously send in a backup
  • Server parallelism – how many savesets a backup server will simultaneously allow to be active for the purposes of backup
  • Target sessions – the optimal number of savesets you want running to a backup device

As of NetWorker 7, we saw the introduction of:

  • Savegrp parallelism – the maximum number of backup savesets that can be running for a particular group.

As of NetWorker 7.3, we saw the introduction of:

  • Max sessions – the maximum number of savesets you’ll permit running to a backup device

Somewhere in the 7.x tree – I don’t recall when – there was another parallelism setting introduced, this time for the pool:

  • Max parallelism – The maximum number of savesets that can be simultaneously written to media belonging to a particular pool.

Also, we’ve seen the introduction of:

  • Max active devices – a setting maintained in the device resource but shared by all devices common to a single storage node; it refers to the maximum number of devices that can be active on the storage node at any one time.

All of these settings serve one key purpose – they let you tune the performance of your NetWorker datazone.

Note: It’s worth pointing out something fairly critical here – all of these settings affect backup savesets, they don’t affect recovery savesets. NetWorker will always allow new recovery savesets to be initiated, even if it can’t immediately facilitate the recovery.

Client parallelism is actually one of the most difficult parallelism settings to tune, and I’ve been somewhat disappointed by the new “default” setting of 12 (up from 4) in NetWorker 7.4.x onwards. I strongly believe it should be set to 1 for all new clients so as to ensure people think about the performance implications before they increase it.

I won’t go further into client parallelism here – I covered it in considerable detail in my book, so if you want details of evaluating client parallelism settings you should check it out*.

Server parallelism is a lot easier to understand – how powerful is your server, and how many devices do you have? In an optimal environment, your backup environment should be able to handle the processing of enough streams to keep every single backup device in your datazone streaming at full speed**. We’ll get to this in a moment, but optimally you want to keep that to as few savesets as possible – i.e., in a perfect world, we’d like to be able to keep every backup device running at full speed from individual savesets. This doesn’t always happen though, so you need to be able to plan for the appropriate number of savesets. 

(Even when the backup server is not actually backing anything up (e.g., all client backups are conducted by storage nodes, with the backup server just acting in a director role), every active saveset does consume resources on the backup server – this includes general coordination resources as well as index resources, etc.)

Device target sessions is an interesting one. It’s not actually a hard limit. In the first pass, it refers to how many savesets should be running on a device before new savesets are started on the next device. So, if every device in the environment has target sessions of 4, then one by one NetWorker will want to get 4 savesets running to each device. But what happens when every device is running 4 savesets, and NetWorker needs to start a new saveset? In that instance, NetWorker just ‘cycles’ through all the devices, tacking on another saveset to each device until they’re say, all running 5 savesets. Then if another comes along it starts building each device up to 6, and so on. In effect, it’s a primitive form of load balancing. 

The newly introduced setting of max sessions for devices does act as a hard limit – a device will never exceed the number of active savesets as defined by the max sessions parameter; this by default is set to 512, effectively not placing a limit on the number of sessions running to the device***.
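
The interplay between target sessions and max sessions can be sketched in a few lines of Python. This is a toy model of the behaviour as I’ve described it, not NetWorker’s actual scheduling code. The function name is hypothetical, and I’ve assumed every device shares the same target sessions and max sessions values.

```python
def assign_saveset(sessions, target_sessions, max_sessions=512):
    """Pick a device for a new saveset; return its index, or None if all are full.

    sessions is a list of current saveset counts, one entry per device, and is
    updated in place.
    """
    # First pass: fill each device up to target sessions before using the next.
    for i, count in enumerate(sessions):
        if count < target_sessions:
            sessions[i] += 1
            return i
    # Every device is at or above target: tack the saveset on to the least
    # loaded device that is still below the hard max sessions limit.
    candidates = [i for i, count in enumerate(sessions) if count < max_sessions]
    if not candidates:
        return None  # hard limit reached everywhere, so the saveset must wait
    i = min(candidates, key=lambda j: sessions[j])
    sessions[i] += 1
    return i


devices = [0, 0, 0]
order = [assign_saveset(devices, target_sessions=4) for _ in range(15)]
print(order[:4])  # [0, 0, 0, 0]
print(devices)    # [5, 5, 5]
```

The first twelve savesets fill each device to 4 in turn, and the next three are spread one per device, which is the primitive load balancing in action.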

So what about the other settings? Where would you use them?

The savegrp parallelism setting is a great option to use if you have multiple groups running in such a way that they overlap, and one or more of the groups has large numbers of clients. You see, traditionally, the code for a group assumes that when it starts, it can query the server’s parallelism setting and start up to that many savesets. However, if you’ve got multiple groups running, then you could exceed the number of permitted savesets. This can result in timeouts or failures. If however you’ve say, got server parallelism of 64, and one group with 100 clients, and two other groups with say, 4 clients each, you might set the large group to have parallelism of 60, and the other two groups to each have parallelism of 2. This would enable all three groups to simultaneously run.
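
A quick way to sanity-check a plan like that is to add up the group parallelism values and compare the total with server parallelism. Here is a minimal Python sketch, with made-up group names and a hypothetical helper function:

```python
def fits_server_parallelism(server_parallelism, group_parallelism):
    """Return (fits, total) for groups that may all run at the same time.

    group_parallelism maps group name to its savegrp parallelism setting.
    """
    total = sum(group_parallelism.values())
    return total <= server_parallelism, total


# One large group and two small ones, as in the example above.
groups = {"large": 60, "small_a": 2, "small_b": 2}
print(fits_server_parallelism(64, groups))  # (True, 64)
```

If the check fails, either the overlapping groups need lower parallelism or they need to be scheduled so they don’t run together.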

Max parallelism for pools is not something I’ve really played around with. However, I can immediately imagine it would be useful if you had specific pools for disk backup units that are all connected via the same FC or SCSI bus – you could set a maximum parallelism setting for the pool so you don’t swamp the interface. That’s just one example after only a couple of seconds of thinking about it, so I know there’ll be other options there.

Max active devices for storage nodes is again something I’ve not played around with, but, I can see that I’d particularly make use of it in a situation where the actual storage node machine itself is not capable of driving all the backup devices attached to it at full speed; in this instance, limiting the number of active devices would allow you to say, have 3 of 6 devices running at full speed, rather than 6 of 6 devices running at a very sub-optimal speed.

So, there’s a good starting point on parallelism.

 


* Not necessarily to be construed as a sales pitch. I went to a lot of effort to explain all the factors of client parallelism in my book, and it’s far too long to repeat in a blog entry.

** By full speed, when referring to drives that do hardware compression, I refer to the streaming compression speed.

*** If you need devices that can handle more than 512 active sessions, I really want to sell you the arrays you’ll need to achieve it!

© 2012 The NetWorker Blog