One of the settings that can be configured within a group is the ‘inactivity timeout’. This refers to client inactivity; it’s often mistaken for a group timeout setting, but it isn’t one.
Now, to start with, the architecture of having a client inactivity timeout setting is, I believe, flawed, and should be addressed by adding heartbeat functionality between the NetWorker server and the client backup process*.
There are a plethora of hang situations that the client inactivity timeout doesn’t cover. These include:
- A blocking IO call failing to return (can happen to just about any product)
- A saveset initiation request that is sent but never answered. (This is a tricky one to pin down – it seems to be the point where the failure happens, but it’s almost impossible to diagnose.)
- Backup server’s bootstrap/index:server saveset waiting for media on the backup server.
There have been various attempts to fix these situations over the years – for instance, most recently there were patches introduced into the 7.5 service pack stream to try to prevent a situation where a group would hang on startup probe. As is always the case with hanging situations, it’s difficult to say for sure whether those potential issues were well and truly dealt with.
What remains clear though, and it’s really important to remember, is that setting a “client inactivity” timeout within a group doesn’t guarantee that the group will time out after that period of inactivity. In other words, it doesn’t excuse you from confirming on a daily basis whether your groups have finished or not.
Monitoring can be achieved in a few different ways:
- Literally checking each group that is still running in NMC at a set point in the day and determining whether it should still be running or has hung.
- Paying special attention to savegroup completion reports that flag a group as “aborted, already running” (though relying on that means missing a hung group for around 24 hours).
- Scripting a check and alert for still-running groups – like the NMC option, but automated (a rough sketch follows this list).
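As a rough illustration of that last option, something like the following could be scheduled to run at a time of day when all groups should have completed. It’s a minimal sketch, not a production tool: it assumes nsradmin is on the PATH and that your NetWorker version exposes a ‘status’ attribute on NSR group resources – verify both, and swap the final print for your real alerting.

```python
#!/usr/bin/env python
"""Alert on NetWorker groups still running at a time of day when
everything should have finished.  A minimal sketch: the nsradmin
query syntax and the 'status' attribute on NSR group resources are
assumptions to verify against your NetWorker version."""

import subprocess
import sys

# Commands piped to nsradmin: select NSR group resources, restrict
# the display to name and status, then print the matches.
NSRADMIN_COMMANDS = ". type: NSR group\nshow name; status\nprint\n"

def running_groups(server):
    """Return the names of groups nsradmin reports as running."""
    out = subprocess.run(
        ["nsradmin", "-s", server],
        input=NSRADMIN_COMMANDS,
        capture_output=True, text=True, check=True,
    ).stdout
    groups, name = [], None
    for line in out.splitlines():
        line = line.strip().rstrip(";")
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("status:") and "running" in line.lower():
            groups.append(name)
    return groups

if __name__ == "__main__":
    server = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    hung = running_groups(server)
    if hung:
        # Replace this print with mail/pager/monitoring integration.
        print("Groups still running on %s: %s" % (server, ", ".join(hung)))
        sys.exit(1)
```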
It would be great to say that there should never be a case where a group hangs and doesn’t complete, but I recognise this is one of those things that’s difficult to program, and in actual fact is almost impossible to guarantee. Could it be handled better? Undoubtedly; it’s just I’m enough of a pragmatist to know that it’s never going to be perfect.
The catch-cry of the backup administrator should be “constant vigilance!” As I’ve discussed previously in posts about enacting zero error policies, it’s not about trying to configure a “set and forget” system where there’ll never be an issue, it’s about always having your finger on the pulse and never, ever accepting that there will be regular alerts for “events-that-look-like-errors-but-you-know-they’re-not”.
So while the client inactivity timeout in a group will save you from some mundane aspects of group administration, it won’t let you ignore monitoring your groups for unexpected states.
__________
* By flawed, I mean:
Currently the backup process works as follows:
- Server instructs client to start backing up
- Client starts sending data to appropriate storage node/nsrmmd process
- If the client fails to send any data for ‘inactivity timeout’ minutes, the backup is considered to have failed, and a restart is run if necessary (sketched below).
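In pseudocode terms – a sketch of the behaviour as described above, not NetWorker’s actual implementation – the current logic amounts to:

```python
import time

INACTIVITY_TIMEOUT = 30 * 60  # the group's 'inactivity timeout', in seconds

def monitor_save(receive_data):
    """Sketch of the current behaviour.  receive_data() stands in for
    the storage node's read from the client: a non-empty chunk when
    data arrives, b"" when the save completes, None when idle."""
    last_data = time.time()
    while True:
        chunk = receive_data()
        if chunk:
            last_data = time.time()          # any data resets the clock
        elif chunk == b"":
            return                           # save completed normally
        elif time.time() - last_data > INACTIVITY_TIMEOUT:
            # A client deep in a dense filesystem walk looks exactly
            # like a dead client to this check.
            raise RuntimeError("save failed: client inactive")
```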
This doesn’t suit situations where there’s a dense filesystem walk taking place, and in fact it really, really should work as follows:
- Server instructs client to start backing up
- Client starts sending data to appropriate storage node/nsrmmd process
- Every X (e.g., 90) seconds or so when no data has been sent, the storage node/nsrmmd process asks the client if the save is still running.
- If the client responds within X seconds, keep waiting; if it doesn’t, declare the save failed.
That’s the sort of heartbeat mechanism that should be used…
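Continuing the earlier sketch, the heartbeat version would look roughly like this, where probe_client is a hypothetical “are you still running?” call between the storage node and the client’s save process:

```python
import time

HEARTBEAT_INTERVAL = 90  # seconds of silence before probing the client
PROBE_TIMEOUT = 90       # seconds to wait for the client's reply

def monitor_save_with_heartbeat(receive_data, probe_client):
    """Failure is declared only when the client stops answering
    probes, not merely when it stops sending data."""
    last_activity = time.time()
    while True:
        chunk = receive_data()               # as in the previous sketch
        if chunk:
            last_activity = time.time()
        elif chunk == b"":
            return                           # save completed normally
        elif time.time() - last_activity > HEARTBEAT_INTERVAL:
            # No data lately: ask the client whether the save process
            # is still alive (e.g. still walking the filesystem).
            if probe_client(timeout=PROBE_TIMEOUT):
                last_activity = time.time()  # alive, so keep waiting
            else:
                raise RuntimeError("save failed: client unresponsive")
```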
I have an open support call on a similar issue.
When a Windows client reboots during a backup, the NetWorker server doesn’t recognize the situation and keeps the savegroup running and running… You may stop the savegroup, but the jobsdb still holds stale entries for the rebooted clients, so the group will hang again on the next run. I had to stop the NetWorker server, clear /nsr/res/jobsdb/*, and start the server again to clean up the situation (sketched after this comment).
This sometimes happens in our firm when a backup overlaps with a scheduled “reboot Windows” task (run to apply updates/patches from WSUS).
It can happen whenever people from different IT groups schedule tasks (backup and reboot) on the same machine.
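For what it’s worth, the cleanup Otmar describes could be scripted roughly as follows. This is a destructive, last-resort sketch: it assumes a UNIX NetWorker server with the standard nsr_shutdown command and ‘networker’ init script, and that the jobsdb entries are plain files – verify all of that for your platform before using anything like it.

```python
#!/usr/bin/env python
"""Last-resort cleanup of a stale jobsdb, per the workaround above.
Destructive: it wipes all job history.  Command names, the init
script path, and the assumption that jobsdb entries are plain files
all need checking on your platform."""

import glob
import os
import subprocess

# 1. Stop the NetWorker daemons (-q: quiet, no confirmation prompt).
subprocess.run(["nsr_shutdown", "-q"], check=True)

# 2. Clear the jobs database.
for path in glob.glob("/nsr/res/jobsdb/*"):
    os.remove(path)

# 3. Start the daemons again.
subprocess.run(["/etc/init.d/networker", "start"], check=True)
```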
Otmar, this can be fairly easily solved by setting keepalives to something other than the default 2 hours. I think it’s actually the storage node that matters. It might be enough to set it through the NSR_KEEPALIVE_WAIT variable, but we only did our testing with the OS defaults.
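For reference, the OS-level tuning the comment alludes to is standard TCP keepalive configuration. Purely as an illustration of those knobs – NetWorker itself relies on the environment variable or the kernel defaults (e.g. net.ipv4.tcp_keepalive_time), not on anything you’d code yourself – here is how the relevant Linux-specific socket options look from Python:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable keepalive probes on the connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific knobs: first probe after 5 idle minutes (instead of
# the 2-hour default), then one probe every 60 seconds, giving up
# after 5 unanswered probes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
```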
As far as client inactivity is concerned, this setting has quite a lot of impact when you use custom save scripts. First, NetWorker’s implementation of custom save commands is simply flawed; I hope this will be somewhat addressed in 7.5.2.2. However, if you consider the sequence of pre-save, then save, then post-save, you cannot heartbeat the pre-save and post-save parts – and quite time-consuming tasks can run there with no way to implement any heartbeat. That’s why I think we won’t be able to do away with the client inactivity timeout 🙂
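To make that gap concrete, here is a schematic pre/save/post wrapper (hypothetical arguments, not NetWorker’s actual savepnpc implementation). Only the middle phase streams backup data, so a data-driven inactivity clock sees nothing during potentially long pre and post phases:

```python
import subprocess

def run_backup(precmd, savecmd, postcmd):
    """Schematic pre/save/post wrapper (hypothetical arguments, not
    NetWorker's actual savepnpc implementation).  Only savecmd streams
    data to the storage node, so only that phase can reset a
    data-driven inactivity clock."""
    subprocess.run(precmd, check=True)   # e.g. database quiesce: no save data flows
    subprocess.run(savecmd, check=True)  # backup data flows: inactivity clock resets
    subprocess.run(postcmd, check=True)  # e.g. cleanup/restart: no save data flows
```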
My problem has been solved. I got a hotfix for nsrjobd from EMC.
The hotfix mentioned above is now included in NetWorker 7.5.2.3:
NW116691: Savegrp did not end despite inactivity timeout reached
ftp://ftp.legato.com/pub/NetWorker/Cumulative_Hotfixes/7.5/cumulative_7.5_readme.txt