{"id":10168,"date":"2021-03-22T10:36:16","date_gmt":"2021-03-22T00:36:16","guid":{"rendered":"https:\/\/nsrd.info\/blog\/?p=10168"},"modified":"2021-03-22T10:36:18","modified_gmt":"2021-03-22T00:36:18","slug":"of-cascading-failures","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2021\/03\/22\/of-cascading-failures\/","title":{"rendered":"Of Cascading Failures"},"content":{"rendered":"\n<p>I got a real-world reminder in cascading failures this weekend, and that&#8217;s as good a time as any for me to pull up a chair and chat about why we back up.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Cascading Failures Averted<\/h2>\n\n\n\n<p>I have a few RAID systems in my environment at home. The latest is an OWC 8-bay Thunderbolt-3 enclosure attached to my Mac Mini. When I retired my 2013 Mac Pro last year, I kept it around to run vSphere within VMware Fusion, and attached to it is a 4-drive Thunderbolt-2 Promise system.<\/p>\n\n\n\n<p>In addition to those direct-attach units, I&#8217;ve got three Synology systems as well: a DS1513+ (5-drive) with an extension pack, a DS414 (4-drive), and a DS414j (4-drive). The DS1513+ is our &#8220;home storage&#8221; platform, and that&#8217;s where I likely <em>averted<\/em> a cascading failure late last year.<\/p>\n\n\n\n<p>That system had been set up with 5 x 3TB drives initially, and around November last year, I realised the drive stats were showing that those drives had managed to log over 61,000 hours of operation \u2013\u00a0pretty much 7 years of run time. That made me a wee bit nervous, so I made use of the Synology Hybrid RAID function to not only replace the drives but also expand the home-share capacity using 6TB drives.<\/p>\n\n\n\n<p>What I didn&#8217;t think to do was to check the drives in the DS414. 
It turned out, they&#8217;d been running for almost as long.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Cascading Failures in Play<\/h2>\n\n\n\n<p>So on Friday, I got an email from my DS414 warning that the number of bad sectors being remapped on one of its drives had increased quite a bit. I logged on and triggered a deep S.M.A.R.T. test, which a few hours later prompted an alert that the drive was failing and should be replaced ASAP.<\/p>\n\n\n\n<p>Of course, all 4 drives in the Synology had been purchased at the same time, were the same age, and as you&#8217;d expect, had similar run hours. In fact, like the DS1513+ when I replaced its drives, these were all now coming up to the 61,000-hour mark.<\/p>\n\n\n\n<p>So with some trepidation, I pulled the failing drive, added a spare 3TB drive, and let it start rebuilding the RAID unit.<\/p>\n\n\n\n<p>Then Saturday morning, I woke to news of multiple IO failures during the rebuild. Thankfully, they weren&#8217;t total drive failures, but two different drives had suffered IO timeouts that eventually left the data filesystem corrupted. (Fortunately, the OS partitions had been spared \u2013&nbsp;so I was at least able to get an email about the news.) I was offered the opportunity to reboot and run fsck \u2013 and 7 hours later, I was told the filesystem was unrecoverable.<\/p>\n\n\n\n<p>So it was time to pull out my wallet and fork out some $$$ for four new drives to replace what was rapidly turning into a house of horrors within the DS414.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Lucky Failure<\/h2>\n\n\n\n<p>All things considered, I&#8217;m counting myself lucky with the failure. The DS414 was a backup storage volume. While I lost some backups, they weren&#8217;t long-term retention backups, so the data loss could be dealt with (in the scheme of home storage failures) by running new backups. 
(My long-term retention\/critical backups get cloud as well as local protection.)<\/p>\n\n\n\n<p>I can summarise a few quick notes from the overall process:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Synology has an excellent online compatibility guide for drive models these days. <strong>However<\/strong>, you <em>also<\/em> have to check the CPU compatibility guide. I bought 4 x 6TB drives that would present 16.36TB of storage after RAID-5, and the system obstinately refused to create a volume\/filesystem greater than 16TB <em>because the CPU in that model is only 32-bit<\/em>. So, some fiddling later, I managed to get two volumes created \u2013 a 16TB filesystem and another &lt;1TB.<\/li><li>I often joke that there is nothing in IT slower than installing an Apple Watch OS update. This was my periodic reminder that there is something slower: Linux <strong>md<\/strong> RAID rebuilds.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/RAID-Rebuilds.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"673\" src=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/RAID-Rebuilds-1024x673.png\" alt=\"\" class=\"wp-image-10171\" srcset=\"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/RAID-Rebuilds-1024x673.png 1024w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/RAID-Rebuilds-300x197.png 300w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/RAID-Rebuilds-768x505.png 768w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/RAID-Rebuilds-1536x1009.png 1536w, https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/RAID-Rebuilds.png 1583w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>24 hours in, ~38 hours to go <em>&lt;Jeremy_Clarkson_Screaming_Go_Faster.jpg&gt;<\/em><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Data Protection is Born from Cascading 
Failures<\/h2>\n\n\n\n<p>A good data protection architecture is built on four pillars:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>F<\/strong>ault tolerance \u2013&nbsp;Hardware-level protection against component failure<\/li><li><strong>A<\/strong>vailability \u2013&nbsp;Multipathed access to data<\/li><li><strong>R<\/strong>edundancy \u2013&nbsp;Higher level &#8216;failover&#8217; protection for services<\/li><li><strong>R<\/strong>ecoverability&nbsp;\u2013 Being able to get the data back if all else fails<\/li><\/ul>\n\n\n\n<p>I know I&#8217;ve been banging on a bit about the FARR model of late, but the reason I&#8217;ve been doing so is that it&#8217;s <em>so damn important<\/em>. We build up protection using all four of those pillars \u2013&nbsp;rather than a single one \u2013&nbsp;because <em>cascading failures happen all the time<\/em>. Our lives in data protection would be simpler if we could just limit ourselves to one of the four pillars. But that&#8217;s just not how the world \u2013&nbsp;or technology \u2013&nbsp;works.<\/p>\n\n\n\n<p>In the <strong><em><a href=\"https:\/\/www.routledge.com\/Data-Protection-Ensuring-Data-Availability\/Guise\/p\/book\/9780367256777\" target=\"_blank\" rel=\"noreferrer noopener\">second edition of Data Protection<\/a><\/em><\/strong>, I spent a fair chunk of a chapter elaborating on the FARR model, and I&#8217;m glad I went to that effort. I genuinely believe it represents the fundamental pillars you have to establish as part of a data protection model, and perhaps most importantly, <strong>you can use those pillars to get the budget owners in the business to understand<\/strong>, too. 
And this weekend reminded me personally why the FARR model is so important.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I got a real-world reminder in cascading failures this weekend, and that&#8217;s as good a time as any for me&hellip;<\/p>\n","protected":false},"author":1,"featured_media":10169,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3,1133],"tags":[200,781],"class_list":["post-10168","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-architecture","category-best-practice","tag-cascading-failures","tag-raid"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/nsrd.info\/blog\/wp-content\/uploads\/2021\/03\/bigStock-Dominos.jpg","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-2E0","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/10168","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=10168"}],"version-history":[{"count":5,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/101
68\/revisions"}],"predecessor-version":[{"id":10178,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/10168\/revisions\/10178"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media\/10169"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media?parent=10168"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/categories?post=10168"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=10168"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}