{"id":2195,"date":"2010-04-28T10:00:00","date_gmt":"2010-04-28T00:00:00","guid":{"rendered":"http:\/\/nsrd.info\/blog\/?p=2195"},"modified":"2018-12-11T18:42:31","modified_gmt":"2018-12-11T08:42:31","slug":"client-inactivity-timeout-is-not-group-timeout","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2010\/04\/28\/client-inactivity-timeout-is-not-group-timeout\/","title":{"rendered":"Client inactivity timeout is not group timeout"},"content":{"rendered":"<p>One of the settings that can be made within a group is the &#8216;inactivity timeout&#8217; setting. This refers to <em>client inactivity<\/em>. This is often erroneously considered to be a <em>group<\/em> timeout setting, but it&#8217;s not.<\/p>\n<p>Now, to start with, the architecture of having a client inactivity timeout setting is, I believe, flawed, and should be addressed by adding heartbeat functionality between the NetWorker server and the client backup process*.<\/p>\n<p>There are a plethora of situations that don&#8217;t fall into <em>client inactivity<\/em>. These include:<\/p>\n<ul>\n<li>Blocking IO call failing (can happen to just about any product)<\/li>\n<li>Saveset initiation request sent but not responded to. (This is a tricky one to define &#8211; that <em>seems<\/em> to be the point where the failure happens, but it&#8217;s almost impossible to diagnose.)<\/li>\n<li>Backup server&#8217;s bootstrap\/index:server saveset waiting for media on the backup server.<\/li>\n<\/ul>\n<p>There have been various attempts to fix these situations over the years \u2013 for instance, most recently there were patches introduced into the 7.5 service pack stream to try to prevent a situation where a group would hang on startup probe. 
As is always the case with hanging situations, it&#8217;s difficult to say for sure whether those potential issues were well and truly dealt with.<\/p>\n<p>What remains clear, though, and it&#8217;s really important to remember, is that setting a &#8220;client inactivity&#8221; timeout within a group doesn&#8217;t guarantee that the group will time out after a certain period of inactivity. I.e., it doesn&#8217;t excuse you from confirming on a daily basis whether your groups have finished or not.<\/p>\n<p>Monitoring can be achieved in a few different ways:<\/p>\n<ul>\n<li>Literally checking each group that is still running in NMC at a certain point in the day and determining whether it should be running or if it is hung.<\/li>\n<li>Paying special attention to savegroup completion reports that say the group is &#8220;aborted, already running&#8221; (though that means missing a hung group for around 24 hours).<\/li>\n<li>Scripting a check and alert for still-running groups \u2013 like the NMC option, but automated.<\/li>\n<\/ul>\n<p>It would be great to say that there should never be a case where a group hangs and doesn&#8217;t complete, but I recognise this is one of those things that&#8217;s difficult to program, and in actual fact is almost impossible to guarantee. Could it be handled better? 
Undoubtedly; it&#8217;s just I&#8217;m enough of a pragmatist to know that it&#8217;s never going to be perfect.<\/p>\n<p>The catch-cry of the backup administrator should be &#8220;constant vigilance!&#8221; As I&#8217;ve discussed previously in posts about enacting zero error policies, it&#8217;s not about trying to configure a &#8220;set and forget&#8221; system where there&#8217;ll never be an issue, it&#8217;s about always having your finger on the pulse and <strong>never, ever<\/strong> accepting that there will be regular alerts for &#8220;events-that-look-like-errors-but-you-know-they&#8217;re-not&#8221;.<\/p>\n<p>So while the client inactivity timeout in a group will save you from some mundane aspects of group administration, it won&#8217;t let you ignore monitoring your groups for unexpected states.<\/p>\n<p>__________<br \/>\n* By flawed, I mean:<\/p>\n<p>Currently the backup process works as follows:<\/p>\n<ol>\n<li>Server instructs client to start backing up<\/li>\n<li>Client starts sending data to appropriate storage node\/nsrmmd process<\/li>\n<li>If client fails to send any data for &#8216;inactivity timeout&#8217; minutes, backup is considered to have failed, and restart is run if necessary.<\/li>\n<\/ol>\n<p>This doesn&#8217;t suit situations where there&#8217;s a dense filesystem walk taking place, and in fact it really, <em>really<\/em> should work as follows:<\/p>\n<ol>\n<li>Server instructs client to start backing up<\/li>\n<li>Client starts sending data to appropriate storage node\/nsrmmd process<\/li>\n<li>Every X (e.g., 90) seconds or so when no data has been sent, the storage node\/nsrmmd process asks the client if the save is still running.<\/li>\n<li>If the client responds within X seconds, keep waiting.<\/li>\n<\/ol>\n<p>That&#8217;s the sort of heartbeat mechanism that should be used&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the settings that can be made within a group is the &#8216;inactivity timeout&#8217; 
setting. This refers to client&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[16],"tags":[220,413,457,1004],"class_list":["post-2195","post","type-post","status-publish","format-standard","hentry","category-networker","tag-client","tag-group","tag-inactivity","tag-timeout"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-zp","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/2195","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=2195"}],"version-history":[{"count":1,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/2195\/revisions"}],"predecessor-version":[{"id":7557,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/2195\/revisions\/7557"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media?parent=2195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-js
on\/wp\/v2\/categories?post=2195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=2195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}