There are a few very important rules to follow when it comes to NetWorker groups:
- Maintain an air-gap: Ensure there’s at least 5 minutes, preferably 10, between the start times of any two NetWorker groups. This even includes groups that won’t run due to schedules. (E.g., a Daily group and a Monthly group for the same data should not start simultaneously, even if only one of the groups will ever actually be backing up at any one time.) Multiple groups can overlap while running, of course, but they must not start at the same time.
- Client/group ratio is important: While the NetWorker performance tuning guide suggests a maximum of 50 clients per group, it’s a little more complex than this. The more clients you have in a group, the more likely you’ll be to need large server resources (e.g., much more RAM in the server) or to use group parallelism to limit the number of clients that can start simultaneously. Equally, avoid having lots of groups with few clients in them where possible. (E.g., if you have 100 clients, it’s better to have 4 groups of 25 clients than say, 10 groups of 10 clients each.)
- Group parallelism should be used more regularly: A typical rule of thumb is that if numClients x sum(client parallelism) for a group exceeds the parallelism of the NetWorker server, you must use group parallelism to limit the number of savesets, and if it exceeds around 150 units of parallelism, you should use group parallelism, regardless of what the server parallelism is.
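The first and third rules above are easy to check mechanically. What follows is only a rough sketch in Python, not anything NetWorker ships: the function names, the HH:MM start-time format and the group names are all made up for illustration, and I'm assuming that the "demand" of a group is simply the sum of the client parallelism values of its clients, which is one reading of the rule of thumb. It queries nothing in NetWorker; you would feed it values from your own configuration.

```python
from datetime import datetime
from itertools import combinations

MIN_GAP_MINUTES = 5   # the minimum gap from the rule above; 10 is preferred
HARD_CEILING = 150    # "around 150 units of parallelism"


def minutes_apart(a, b):
    """Gap in minutes between two HH:MM start times, wrapping past midnight."""
    ta = datetime.strptime(a, "%H:%M")
    tb = datetime.strptime(b, "%H:%M")
    diff = abs(int((ta - tb).total_seconds())) // 60
    return min(diff, 24 * 60 - diff)


def air_gap_violations(start_times, min_gap=MIN_GAP_MINUTES):
    """Return (group_a, group_b, gap) for every pair starting too close together.

    start_times maps a (hypothetical) group name to its HH:MM start time.
    Every group counts, including ones whose schedule says they will skip.
    """
    violations = []
    for a, b in combinations(sorted(start_times), 2):
        gap = minutes_apart(start_times[a], start_times[b])
        if gap < min_gap:
            violations.append((a, b, gap))
    return violations


def group_parallelism_advice(client_parallelism, server_parallelism):
    """Apply the rule of thumb to one group.

    client_parallelism is a list with one value per client in the group.
    Assumption: demand is the plain sum of those values.
    """
    demand = sum(client_parallelism)
    if demand > server_parallelism:
        return "must"      # exceeds what the server can run at once
    if demand > HARD_CEILING:
        return "should"    # over the ~150 mark regardless of the server
    return "optional"


if __name__ == "__main__":
    groups = {
        "Daily Servers": "22:45",
        "Monthly Servers": "22:45",   # same start as the Daily group: a violation
        "Daily Databases": "23:00",
    }
    for a, b, gap in air_gap_violations(groups):
        print(f"{a} and {b} start only {gap} minute(s) apart")
    print(group_parallelism_advice([4] * 25, server_parallelism=64))
```

The midnight wrap in minutes_apart matters: a group at 23:58 and another at 00:01 are three minutes apart, not almost a day.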
While not “official” EMC recommendations, after 16+ years of using NetWorker, I’d also suggest the following should be considered rules, too:
- Never name a group based on its start time: You call the group “2245 Daily Servers”. Sounds like a great idea, until you need for some reason to alter the start time to 22:15 instead.
- Don’t mix filesystem and non-filesystem backups: Wherever possible, keep non-filesystem backups in their own groups. E.g., have a “Daily Filesystem” and a “Daily Databases” group, and so on. Definitely avoid having a module backup and filesystem backup for the same client run in the same group.
- Don’t mix retention times: If you do data/pool allocation by group, set all clients in a group to have the same retention time.
- Don’t set inactivity timeout to 0: An inactivity timeout of zero means the group will never timeout, and therefore may never complete if ‘hang’ conditions happen. If you insist on having a zero-timeout on a group, only do so if you’ve got another process watching and alerting on the group running for too long.
- Avoid high client retries: The normal client retries value for a group is 1, meaning two attempts will be made on any saveset. Be careful about increasing this beyond 2 (3 attempts); savesets that sensitive may need to be monitored externally instead.
- Use comments, not elaborate names: Don’t try to put all the details about the group in the name, which you can’t change later. Instead, keep the group name as simple and generic as possible, and if additional information needs to be annotated for the group, put it in the comment field, which is displayed in NMC.
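Several of these suggestions can be turned into a simple sanity check over your group definitions. The sketch below is hypothetical from top to bottom: GroupSettings and its field names are invented for the example and are not NetWorker resource attribute names, and the "time in the name" test is a crude pattern match that will also flag names containing something like a four-digit year. Treat it as a starting point rather than a finished tool.

```python
import re
from dataclasses import dataclass, field

# Matches tokens that look like a 24-hour time: 2245, 22:45, 22.45
TIME_IN_NAME = re.compile(r"\b(?:[01]\d|2[0-3])[:.]?[0-5]\d\b")


@dataclass
class GroupSettings:
    """Hypothetical snapshot of one group's settings (field names are invented)."""
    name: str
    inactivity_timeout: int          # minutes; 0 means "never time out"
    client_retries: int
    retentions: list = field(default_factory=list)  # one entry per client


def lint_group(group):
    """Return a list of warnings for settings that break the suggestions above."""
    warnings = []
    if TIME_IN_NAME.search(group.name):
        warnings.append("name looks like it embeds a start time")
    if group.inactivity_timeout == 0:
        warnings.append("inactivity timeout is 0, so a hung group never times out")
    if group.client_retries > 2:
        warnings.append("client retries above 2; consider external monitoring")
    if len(set(group.retentions)) > 1:
        warnings.append("clients in the group have mixed retention times")
    return warnings


if __name__ == "__main__":
    bad = GroupSettings("2245 Daily Servers", 0, 3, ["1 month", "1 year"])
    for warning in lint_group(bad):
        print(f"{bad.name}: {warning}")
```

A group named simply "Daily Filesystem" with a non-zero timeout, one retry and a single retention time comes back with no warnings.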
Thanks for this. Quite helpful.
I’ve heard the “time gap” thing a few times already and always wondered why they *must* never start at the same time?
It comes down to the background scheduler. When it runs for a group it determines what resources are available and starts the group based on those resources. If you have two groups started at exactly the same time it can make incorrect assumptions. 9 times out of 10 this isn’t a problem in smaller environments, but if you have growth, or if you’re a larger environment, it can create odd problems. It’s best to just stick to the rule and avoid the possibility.
Hi Preston,
I’ve been following your blog for the past year now, and it has helped me quite a bit, being new to NetWorker (and the backup world in general), so I’d like to thank you very much for taking the time to do this blog! We use a batch scheduling program to run savegrp commands that start the groups. We do NOT have an air-gap at all, and we do have several minor issues (especially ones that EMC attributes to us being on a “buggy” version of NetWorker; we will be upgrading soon, I believe). Would you be able to go into depth and give specific examples of common problems that may occur? It could help me make my case (being such a junior) as to why we should start adding gaps between the automatically scheduled start times in the batch scheduler (and while at it… creating more work for myself!). Thanks in advance.
It’s best described as random instability. There’s not necessarily any one thing that you can attribute as being “oh, I’m seeing this, it’s a group air-gap problem”, but rather sporadic odd problems – savesets that fail at the start of a group for no apparent reason, or savesets that don’t start at a time you might expect, or more savesets timing out than you’d expect, etc.
Always find good and interesting info on your site. Can you explain the reason for the comment “avoid having lots of groups with few clients in them”?
I’m looking to do just that to avoid using the NetWorker scheduler and use Autosys instead. I’m hoping to set up a system that’s completely automated based on various system and application dependencies. We already do that with our systems running on NetBackup in other regions in our firm.
I have seen this air-gap issue in our environment and I agree with this. We have seen jobsdb skipping a few groups with ‘exited with return code 1’ or ‘failed to connect to nsrjobd’, so we staggered the start times. Things seem to have improved a bit…