Basics – Parallelism in NetWorker

Parallelism in NetWorker is effectively multiplexing by another name. There are three areas where you have traditionally been able to set this:

  • Client parallelism – how many savesets a client can send simultaneously during a backup
  • Server parallelism – how many savesets the backup server will allow to be simultaneously active for backup purposes
  • Target sessions – the optimal number of savesets you want running to a backup device

As of NetWorker 7, we saw the introduction of:

  • Savegrp parallelism – the maximum number of backup savesets that can be running for a particular group.

As of NetWorker 7.3, we saw the introduction of:

  • Max sessions – the maximum number of savesets you’ll permit running to a backup device

Somewhere in the 7.x tree – I don’t recall when – there was another parallelism setting introduced, this time for the pool:

  • Max parallelism – The maximum number of savesets that can be simultaneously written to media belonging to a particular pool.

Also, we’ve seen the introduction of:

  • Max active devices – a setting maintained in the device resource but shared by all devices attached to a single storage node; it refers to the maximum number of devices that can be active on that storage node at any one time.

All of these settings serve one key purpose – they let you tune the performance of your NetWorker datazone.
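If you're wondering where these settings live, they're all just resource attributes, so nsradmin is the quickest way to review them. The following is a minimal sketch only: the server name and device path are placeholders, and you should confirm the attribute names against your own version's resource database.

    # connect to the backup server's resource database ("backupserver" is a placeholder)
    nsradmin -s backupserver
    nsradmin> . type: NSR
    nsradmin> show parallelism
    nsradmin> print
    nsradmin> . type: NSR device; name: /dev/rmt/0cbn
    nsradmin> show target sessions; max sessions
    nsradmin> print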

Note: It’s worth pointing out something fairly critical here – all of these settings affect backup savesets; they don’t affect recovery savesets. NetWorker will always allow new recovery savesets to be initiated, even if it can’t immediately facilitate the recovery.

Client parallelism is actually one of the most difficult parallelism settings to tune, and I’ve been somewhat disappointed by the new “default” setting of 12 (up from 4) in NetWorker 7.4.x onwards. I strongly believe it should be set to 1 for all new clients so as to ensure people think about the performance implications before they increase it.
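If you agree with that philosophy, dialling a new client back to 1 is trivial via nsradmin. A sketch only, with a hypothetical client name; nsradmin will show you the change and prompt for confirmation before applying the update.

    # "mars" is a placeholder client name
    nsradmin> . type: NSR client; name: mars
    nsradmin> update parallelism: 1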

I won’t go further into client parallelism here – I covered it in considerable detail in my book, so if you want details on evaluating client parallelism settings you should check it out*.

Server parallelism is a lot easier to understand – how powerful is your server, and how many devices do you have? In an optimal environment, your backup environment should be able to handle the processing of enough streams to keep every single backup device in your datazone streaming at full speed**. We’ll get to this in a moment, but optimally you want to keep that to as few savesets as possible – i.e., in a perfect world, we’d like to keep every backup device running at full speed from individual savesets. This doesn’t always happen, though, so you need to plan for the appropriate number of savesets.
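As a back-of-the-envelope figure (my rule of thumb, not an official formula): multiply your device count by the target sessions you want per device. For example, 8 devices at 4 target sessions each suggests a server parallelism of 32, which you'd set on the server resource:

    # rule-of-thumb sizing: 8 devices x 4 target sessions = 32
    nsradmin> . type: NSR
    nsradmin> update parallelism: 32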

(Even when the backup server is not actually backing anything up (e.g., all client backups are conducted by storage nodes, with the backup server just acting in a director role), every active saveset does consume resources on the backup server – this includes general coordination resources as well as index resources, etc.)

Device target sessions is an interesting one. It’s not actually a hard limit. At first pass, it refers to how many savesets should be running on a device before new savesets are started on the next device. So, if every device in the environment has target sessions of 4, then one by one NetWorker will want to get 4 savesets running to each device. But what happens when every device is running 4 savesets, and NetWorker needs to start a new saveset? In that instance, NetWorker just ‘cycles’ through all the devices, tacking another saveset onto each device until they’re, say, all running 5 savesets. Then, if another comes along, it starts building each device up to 6, and so on. In effect, it’s a primitive form of load balancing.

The newly introduced setting of max sessions for devices does act as a hard limit – a device will never exceed the number of active savesets as defined by the max sessions parameter; this by default is set to 512, effectively not placing a limit on the number of sessions running to the device***.
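Both limits sit side by side in the device resource, so in nsradmin the combination looks something like the following sketch (the remote device path is a placeholder). Target sessions is the point at which NetWorker prefers to start filling the next device; max sessions is the ceiling it will never exceed.

    # placeholder remote device path on a storage node
    nsradmin> . type: NSR device; name: rd=mynode:/dev/nst0
    nsradmin> update target sessions: 4; max sessions: 8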

So what about the other settings? Where would you use them?

The savegrp parallelism setting is a great option if you have multiple groups running in such a way that they overlap, and one or more of the groups has large numbers of clients. You see, traditionally, the code for a group assumes that when it starts, it can query the server’s parallelism setting and start up to that many savesets. However, if you’ve got multiple groups running, then you could exceed the number of permitted savesets. This can result in timeouts or failures. If, however, you’ve got, say, server parallelism of 64, one group with 100 clients, and two other groups with 4 clients each, you might set the large group to have parallelism of 60, and the other two groups to each have parallelism of 2. This would enable all three groups to run simultaneously.
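In nsradmin, that worked example would look something like this (the group names are hypothetical; note that 60 + 2 + 2 = 64, matching the server parallelism):

    # hypothetical group names for the 100-client and 4-client groups
    nsradmin> . type: NSR group; name: Big Group
    nsradmin> update parallelism: 60
    nsradmin> . type: NSR group; name: Small Group 1
    nsradmin> update parallelism: 2
    nsradmin> . type: NSR group; name: Small Group 2
    nsradmin> update parallelism: 2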

Max parallelism for pools is not something I’ve really played around with. However, I can immediately imagine it would be useful if you had specific pools for disk backup units that are all connected via the same FC or SCSI bus – you could set a maximum parallelism setting for the pool so you don’t swamp the interface. That’s just one example after only a couple of seconds of thought, so I know there’ll be other uses.
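If you did want to experiment with it, it’s a single attribute on the pool resource. A sketch only, with a hypothetical pool name:

    # "Disk Backup" is a placeholder pool name
    nsradmin> . type: NSR pool; name: Disk Backup
    nsradmin> update max parallelism: 8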

Max active devices for storage nodes is again something I’ve not played around with, but I can see that I’d particularly make use of it in a situation where the storage node machine itself is not capable of driving all the backup devices attached to it at full speed; in this instance, limiting the number of active devices would allow you to, say, have 3 of 6 devices running at full speed, rather than 6 of 6 devices running at a very sub-optimal speed.
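Again a sketch only, and I’m assuming here that the attribute appears as “max active devices” in the device resource, per the description above, so verify against your own version before relying on it:

    # assumes "max active devices" lives in the device resource; device path is a placeholder
    nsradmin> . type: NSR device; name: rd=mynode:/dev/nst0
    nsradmin> update max active devices: 3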

So, there’s a good starting point for parallelism.


* Not necessarily to be construed as a sales pitch. I went to a lot of effort to explain all the factors of client parallelism in my book, and it’s far too long to repeat in a blog entry.

** By full speed, when referring to drives that do hardware compression, I refer to the streaming compression speed.

*** If you need devices that can handle more than 512 active sessions, I really want to sell you the arrays you’ll need to achieve it!

27 thoughts on “Basics – Parallelism in NetWorker”

  1. Hi

    Thanks for your input, really helpful.

    Now I am in a different situation: I need to upgrade my NetWorker 7.3.2 server, running on RHEL, to version 7.5 or 7.4.4, as 7.3.2 is now end of support. 7.3.2 was very stable for me. Please guide me on which version I should use for easier migration and licensing transfers, and which will be as stable as 7.3.2.

    Thanks a lot

    cheers
    VJ

    1. Without knowing your site I can’t be 100% certain, but as a general rule, I’m not recommending 7.5 at the moment. I want to see a full service pack come out for it. On the other hand, 7.4.4 has proved to be a very stable version, and I’ve got a lot of customers either on it, and quite happy with it, or about to go to it.

  2. Thanks Preston. Your suggestions are valuable. My site has one RHEL 4 32-bit server with a Quantum library, managing 350 clients (Solaris 8 and 10, RHEL and Windows servers).

    Again, how difficult is it to migrate from 7.3.2 to 7.4.4 while preserving all the configuration related to the huge /nsr/mm and /nsr/index directories?

    Thanks a million.

  3. It’s actually quite simple. The one recommendation I always make with Linux versions of NetWorker is to use rpm --erase to actually erase the NetWorker packages and then install the new versions. There is a long-running bug in the installation scripts of NetWorker’s RPM packages that causes the /etc/init.d/networker script to be removed when you do an RPM upgrade. This is inconvenient, so it’s best to do an erase followed by an install, rather than the upgrade.

    As with all upgrades, you should read the upgrade notes in case there’s anything specific to your site, but generally speaking for 7.3.x -> 7.4.x at most you should just have to do a full backup of your backup server, uninstall the old packages and install the new packages. (The media database and indices are compatible across the two versions.)
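    To sketch that out (the package names below are the standard Linux server set – lgtoclnt, lgtonode and lgtoserv – and the wildcarded filenames are placeholders, so check rpm -qa for what’s actually installed at your site):

        # confirm which NetWorker packages are installed (names may vary by version)
        rpm -qa | grep lgto

        # erase rather than upgrade, avoiding the /etc/init.d/networker removal bug
        rpm --erase lgtoserv lgtonode lgtoclnt

        # install the new versions (client first, then node, then server); filenames are placeholders
        rpm -ivh lgtoclnt-*.rpm lgtonode-*.rpm lgtoserv-*.rpm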

  4. Thanks.
    I have planned the following steps:
    1. backup of the entire /nsr filesystem, including the /nsr/mm and /nsr/index directories, along with the /etc/init.d/networker start script
    2. rpm erase of the 7.3.2 version
    3. rpm installation of 7.4.4
    4. copy back of /nsr/mm and /nsr/index so that I can bring back the entire configuration of clients/schedules/groups

    Are these steps OK?

    My only worry is that /nsr/index is 350 GB, and there’s no space left to copy this directory for backup.

  5. Erasing NetWorker’s packages doesn’t erase the configuration files (/nsr/res), the media database (/nsr/mm) or the indices (/nsr/index). So step (4) isn’t required.

    EMC strongly recommend against just “copying” these files (even though I’ve done it when I’ve found it necessary) in case something goes wrong. You should start by doing a full bootstrap backup (savegrp -O groupName), where “groupName” is the name of a group that has every client in it – this just backs up the media database, indices and res database. That way you’ve got a full backup to recover from _if_ (and only if) you have problems. Alternatively, just doing a full backup of the backup server before the upgrade will back up the media database, res database and the index for the backup server. Indices for other clients can be recovered as required if necessary, but again, these aren’t deleted by the uninstall process.
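    In other words, the pre-upgrade safety net might look like the following, with “groupName” as above being a group containing every client:

        # bootstrap-only backup: media database, indices and res database
        savegrp -O groupName

        # confirm the bootstrap saveset details, and note them for recovery purposes
        mminfo -B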

  6. Thanks, man, for this input. For the past 2 weeks I had a 50% delay in backups, and freaking Legato support couldn’t tell me I needed to increase server parallelism. Now that I’ve fixed it, it works.

    1. You’re welcome! I think, like any company, EMC’s customers can have both good and not-so-good support experiences. In the case of parallelism, it’s (unfortunately) sometimes one of those things that gets “assumed” to be OK before moving on. However, like name resolution, if you don’t look for it first, you may spend some time chasing more complex problems that aren’t actually there.

  7. Hi, I am running NetWorker 7.4 with one machine as a media server and another as a dedicated storage node.
    All the drives are visible on the storage node.
    On setting the parallelism on the server and the storage node (one of the clients) to 24, and target sessions to 6, it’s not taking 24 savesets, whereas the same has been tested with other clients with the same configuration and they are taking 24 savesets. This client is only taking 10 savesets.
    Please let me know what the problem could be.

    1. How many unique savesets are there on the client where you’re not getting that number of simultaneous savesets? I.e., is the client actually physically configured in such a way that allows it to generate that many savesets? (I’m afraid I’m not quite understanding whether your problem is that an individual client isn’t generating 24 savesets, or that a particular tape drive is not writing 24 savesets.)

      1. Hi,

        Actually it’s only a single server, where clients have been created per weekday: for the same client database, clients have been created as “DB-pool monday” for the Monday backup, “DB-pool tuesday” for the Tuesday backup, and so on. On Friday (i.e., “DB-pool friday”) it has successfully taken 24 savesets. The client server is the same for the Monday through Sunday pools.

        If with the same settings it has taken 24 savesets, how is it getting restricted to a fixed parallelism (10) for the Monday client on the same server, which is a dedicated storage node?

        1. Unfortunately I’m still not able to understand your configuration from your description. You say at the start it’s a “single server”, but then you refer to it being a dedicated storage node as well. (I’m also concerned by my take on your pool description. Having different pools for each day of the week sounds arbitrarily complex.)

          Are these filesystem backups or database backups? There’s a variety of reasons why a machine may backup on different days with different levels of parallelism. For instance, database servers, where different streams are allocated based on the amount of changes to be backed up (e.g., when doing incremental backups), may generate more or fewer savesets.

          For filesystem backups, if the backups are incrementals and there’s very little change happening you could also find that some backups complete fast enough that they don’t appear for very long in monitoring and therefore aren’t readily observable.

          (Hint: When describing NetWorker problems, it’s best to stick to NetWorker nomenclature. That is – there’s only one server, the backup server, and all other machines are either clients, storage nodes, or dedicated storage nodes. Mixing the terms makes the description ambiguous.)

  8. Just read your stuff on //-ism. Something’s not right on our site. A group starts with 10 clients, all with client //-ism of 2. Device target sessions=20. Pool max //-ism of 20. Savegrp //-ism=0.
    NetWorker asks for all 10 drives in the library to be loaded. That’s no good, because a different storage node wants some tape drives later. But anyway, I can’t work out why 10 drives are required. I think it should want only 2. We get round it by selecting only 4 drives (LTO4) for the pool. Puzzlement.

    1. That certainly seems like odd behaviour at the outset – though FWIW, having 20 target sessions to each drive will result in a “massive” level of multiplexing being done to tape, which will have a significant detrimental effect on complete filesystem recovery performance, etc.

      Everything (bar one) in your list of parallelism settings seems relatively normal. The one that strikes me as odd is the pool max parallelism – maybe that’s interacting with the device target sessions in an odd way. That’s used to set an upper limit on the number of parallel sessions writing to media in the pool, and perhaps the way it interacts with the other settings (given the high target sessions per drive) is to try to enforce that setting first, thus spreading the sessions out across multiple drives. I’d suggest eliminating that setting first – turning it down to zero – to see what happens. To be honest, I’ve not yet found a day-to-day situation that requires use of the pool max parallelism setting, so it may be that you’re overcomplicating things, given I’ve accomplished similar configurations (albeit with slightly smaller device target sessions) without needing to use it.

      Also check the server parallelism setting, which you haven’t mentioned. It should be a multiple of desired device target sessions (at minimum). So if you want to get 20 streams going to 2 drives you’d want it at 40, etc. I’m presuming based on your description though that you’ve got it set to 20?

      1. Thanks for the suggestion about pool max //-ism. Reduced it to zero, but unfortunately it made no difference. And you’re right about its usefulness.
        BTW, I assume you meant server //-ism of 120, which is what it is set to.

        1. Actually I did assume server parallelism of 20 … i.e., I thought your intent was to ensure that one drive would run at 20 target sessions, not that all 10 would.

          In theory, everything you’ve described then means that we should see the ramp-up of target sessions you want. Out of curiosity, have you considered stopping NetWorker and clearing out the /nsr/tmp directory to see whether that makes a difference? (Obviously when you do this, you’ll lose any state for groups that have been aborted or failed that you want to restart, so be sure to do it at a suitable time.) It may be that there’s something lingering in that state directory that’s causing problems for you. Also, what version of NetWorker are you using?
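          For reference, that clean-out would be along the lines of the following sketch; the init script location varies by platform, so adjust for your own system:

            # stop NetWorker, clear the state directory, restart
            /etc/init.d/networker stop
            rm -rf /nsr/tmp/*
            /etc/init.d/networker start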

  9. NetWorker server version 7.4.4.7 on Solaris. We saw media db corruption on 7.4.4 that 7.4.4.6 didn’t fix; it was fixed in 7.4.4.7, but that’s another story.
    /nsr/tmp is removed on every NetWorker restart.

    1. In theory everything you’re describing should work, and I can’t recall/find any bugs that would be causing this off-hand. I wonder whether it’s a level-of-parallelism issue – I’d suggest dropping to 16 or below where you’re currently using 20, and adjusting the server parallelism appropriately based on the current metric you’re using, to see whether that makes a difference. If not, it might be worthwhile logging a case with EMC.

  10. I set the target sessions to 1 for 23 devices, all of which are adv_file type, but I started getting this error message in the alerts: “Waiting for 1 writable volumes to backup pool Default disk”. Yet reviewing the device section of the monitor, I can see each device, with some showing 2 sessions. I can’t see why I am getting the error. If I go back and change the target sessions setting to 2, the error clears. Any ideas?

  11. I just started using NetWorker after 10 years of using NetBackup. I also just started in an environment that was already set up, so I’m trying to work through some of the issues.

    Parallelism for the server is set to 64, savegrp is 0, and the clients are set at 4 (some are 2). We are using 7.4.4 Build 634.

    When I made the setting and the jobs got to the point where I had 23 devices running, I got the messages, even though some of the devices showed 2 sessions.

    1. Truth be told I can’t say why you’re seeing what you’re seeing. Off-hand, the first thing that sprang to mind was a bug I thought I remembered in 7.4.x regarding target sessions, but having searched through the 7.5.x and 7.6 release notes, I’ve not seen anything that confirms what I’d been thinking. The only other thing I’d be curious about is whether you’ve had any other devices (tape or disk) unmounted when you’ve had target sessions set to 1. NetWorker will sometimes refuse to start new sessions on currently mounted devices if there are unmounted devices that could (potentially) be used for the backup.

  12. I reviewed the server, and one of the adv_file devices is temporarily in read-only mode so that some of the savesets can expire, getting the disk under 100% full. Maybe that’s the cause of the messages? Once the filesystem gets below 80% full I will turn it back on and see if the messages still appear. Right now I have the target sessions set to 1 for all of them. Thanks for your insight.

  13. Hi Preston,
    I am at a new company and have been asked to work with NetWorker in the backup team. I need a clear explanation of how to perform backup and recovery. I want to do recoveries using commands for HP-UX and Solaris servers.

    Thanks in advance

    1. Your best option I believe is to start working through the NetWorker manuals and consider a training course if there’s one available. The NetWorker admin guide should give you a good understanding of how NetWorker operates.
