Needing a few interesting things to read at the end of the week?

Here are a few things I’ve found fascinating this week:

  • Why do IT operations suck? An insightful article by Steve O’Donnell. Steve asks why the staff who have primary involvement with systems 24×7 (operators) are often the least skilled, least trained and least paid. (As a consultant, I’ve frequently encountered companies that consider it a waste of time to properly train operators, and their systems usually suffer for it.)
  • Over at Daring Fireball, John Gruber has an article called The Original Tablet. (It’s a great historical perspective on why Microsoft can’t exclusively claim ownership of the tablet idea.)
  • Like many others, I found Google’s slap in the face to China’s net censorship and cyber-warfare activities well timed and highly appropriate. On the other hand, others such as John Obeto over at Absolutely Windows found it not much more than petty PR. Somewhere in the middle is probably the whole story…
  • Over at IT Depends, I found Terri McClure’s views on Microsoft’s requirements for accessing their Azure SLAs to be the same as mine – staggeringly stupid. (According to Microsoft Fanboy site The Register, Microsoft are reviewing their decision on that one.)
  • Storagebod got me thinking again about Availability and Uptime with his article about how availability is measured.
  • Not technically reading, but I’ve finally jumped on board the growing number of listeners to Infosmack. This podcast is run by Greg Knieriemen and Marc Farley, and frequently has guests from many of the storage vendors and other storage bloggers. I’m really regretting that I haven’t been listening to it for longer. It’s definitely going to be a regular podcast for me from now on.
  • Over at Storage Monkeys, Sunshine Mugrabi’s article on EMC’s heavy involvement in social networking is definitely worth reviewing. (For what it’s worth, if you haven’t ever read it, you need to read The Cluetrain Manifesto if you think that all this social networking stuff is rubbish or just a passing fad. It isn’t. Written years before its time, The Cluetrain Manifesto is a clear and articulate series of essays about exactly how important social networking is.)
  • Finally, there have been some interesting discussions on VMware and application level VSS backups through VCB/vSphere. Check my posting here for a summary of the important links to follow.

Finishing up, a little about what you’ve been reading: the NetWorker Power Users Guide to nsradmin. The number of downloads has been staggering – far more than I hoped for – and I hope that, like the main blog, the guide proves useful to many a NetWorker administrator.

 

Over at Storagebod, Martin Glassborow currently has a short and insightful post, How do you measure availability?

Martin’s point is thus:

If a vendor says that their array is 99.999% available, what does that really mean to you? Probably not a lot in practical terms. Does it mean that individual components are 99.999% available? Or does it mean that the array itself in some shape or form is available?

This cuts to the heart of insufficiently quantifiable availability/uptime measurements.
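To put the quoted figure in perspective, here is a small sketch of the arithmetic behind an availability percentage. The function name is my own, and it assumes a 365-day year; it says nothing about which components the vendor is actually counting, which is exactly Martin’s point.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes per year something may be down and still meet the quoted figure."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for quoted in (99.9, 99.99, 99.999):
    print(f"{quoted}% allows {downtime_budget_minutes(quoted):.2f} minutes of downtime per year")
```

Five nines works out to a little over five minutes a year, but the number alone does not tell you what was being timed.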

Availability isn’t a sufficient measuring stick. Access is. To put it more accurately, availability by itself isn’t a sufficient measurement – what is important is availability of user services. The difference? An array may be completely available in that it is servicing IO requests and all drives are functional. However, it may be simultaneously unavailable, as far as users are concerned, because some esoteric bug is causing it to service those IO requests at, say, one tenth the normal speed. It’s up but, from an end user perspective, not available.

True availability is a series of distinct measurements against locally defined requirements, not something that you get just by buying an array (or any other piece of hardware) that a vendor quotes an availability percentage for. It can’t be bought, it can only be architected and implemented.
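As a rough sketch of the difference, assume we probe an array once an hour and record whether it answered and how long a test IO took. The `Probe` type, the 20 ms limit and the sample figures are all invented for illustration; the locally defined requirement would come from your own users.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    responded: bool    # did the array answer at all?
    latency_ms: float  # how long the test IO took


def uptime_fraction(probes):
    """Share of probes where the array was simply 'up'."""
    return sum(p.responded for p in probes) / len(probes)


def service_availability(probes, max_latency_ms):
    """Share of probes where the array was up AND met the agreed latency."""
    ok = sum(p.responded and p.latency_ms <= max_latency_ms for p in probes)
    return ok / len(probes)


# Hypothetical day: the array always answers, but for 6 of the 24 hourly
# probes a bug makes IO ten times slower than normal.
probes = [Probe(True, 5.0)] * 18 + [Probe(True, 50.0)] * 6

print(f"uptime: {uptime_fraction(probes):.0%}")
print(f"service availability: {service_availability(probes, max_latency_ms=20.0):.0%}")
```

The first figure is what a vendor percentage describes; the second is what the users experienced.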

For a complete outline of my argument on this, check out an article I wrote some time ago: Uptime is an inappropriate metric.

 

I’m going to start with a statement that may make some people angry. Some might suggest that I’m just goading the blogosphere for traffic, but that’s not my point.

System administrators and managers who focus on keeping uptime as high as possible for no other reason than having good numbers are usually showing an arrogant disrespect to the users of their systems.

There – I said it. (I believe I am now required to walk the Midrange plank and dive into the Mainframe sea of Mediocrity). As a “long term” Unix admin and user, I found it galling when I initially realised that the midrange environment has always had the wrong attitude towards uptime. Uptime for the sake of uptime, that is. These days, I use a term you might more expect to hear in the mainframe world: negotiated uptime.

You see, there’s uptime, and there’s system usefulness. That’s the significant difference between uptime and negotiated uptime. Confusing the two only achieves one thing: unhappy users.

Here are just a few examples of rigid adherence to ‘uptime’ gone wrong:

  • Systems performing badly for days at a time while system administrators hunt for the cause, when they know that a suitable workaround would be to reboot the system at night – or even during the time that most users take a lunch break.
  • Systems that don’t get patched for months at a time because the patching would require a reboot and that would affect uptime. (“If it’s not broken, don’t fix it” can be used to justify skipping regular patches, but very, very rarely security patches.)
  • Applications that don’t get upgraded, despite obvious (or even required!) fixes in new releases, because the application administrators don’t like restarting the application.

I’ll go so far as to say that uptime, measured at the individual server level, is irrelevant and inappropriate. Uptime should never be about the servers, or even the applications – it’s about the services for the business.

Clusters typically represent a far healthier approach to uptime: a recognition that one or more nodes can fail so long as the service continues to be delivered. There are clusters (particularly OpenVMS clusters) that are known to have been presenting services for a decade or more, all the while continuing to get OS upgrades and hardware replacements and undoubtedly having single node failures/changes.

The healthiest approach to uptime however is to recognise that individual system or application uptime is irrelevant. The net effect experienced by users for availability of services is what should be measured. All the uptime stats, SNMP monitoring stats, etc., in the world are irrelevant when compared with how useful an actual IT service is to a business function.

The challenge of course is that availability is significantly harder to measure than uptime. Uptime is, after all, dead simple – on any Unix platform there’s a single command, ‘uptime’, to get you that measurement. Presumably on Windows there are easy ways to get that information too. (E.g., without any other experience myself in trying to measure this, I know you can (usually) at least get the last boot time out of the event logs.) Half of why it’s simple is the ease with which the statistic can be gathered.

What makes availability harder to measure though is that it’s not all boolean measurements. The other half of why uptime is an easy measurement is that it’s a boolean statistic. A host is either up or down (and when transitioning between the two, it’s usually considered to be ‘down’ unless it’s fully ‘up’).

Services however can be up but not available. That is, they can be technically, yet not practically, available.

Here’s an old example I like to dredge out regarding availability. An engineering company invested staggering amounts of money in putting together a highly customised implementation of SAP. Included in this implementation was a fairly in-depth timesheet module that did all sorts of calculations on the fly for timesheet entry.

Over time, as I administered this system, complaints grew practically on a week-by-week basis that come Friday (the deadline for timesheet entry, which caused a load rush at the end of each week), the SAP server was getting slower and slower. Memory was upgraded, hard drive layout was tweaked, and so on, but in the end the system just kept getting slower.

Eventually it was determined that the problem wasn’t in the OS or the hardware, but in the SQL coding in the timesheet system. You see, every time a user went to add a new timesheet entry, a logical error in the SQL code would first retrieve all timesheet entries made by that employee since the system was commissioned or the employee started, whichever was later. As you can imagine, as the months and years went by, this amounted to a lot of heavy selects going on each week.
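I no longer have the original SQL, so the following is only a sketch of the pattern using an in-memory SQLite database. The table, column and function names are invented, and the real SAP code certainly looked different; the point is simply how an unbounded select grows with every week of history while a bounded one does not.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timesheet_entries (employee_id INTEGER, work_date TEXT, hours REAL)"
)

# Hypothetical history: one entry per week for three years for employee 1.
first_entry = date(2007, 1, 5)
history = [
    (1, (first_entry + timedelta(weeks=w)).isoformat(), 8.0) for w in range(156)
]
conn.executemany("INSERT INTO timesheet_entries VALUES (?, ?, ?)", history)


def entries_unbounded(employee_id):
    """The faulty pattern: fetch everything the employee has ever entered."""
    return conn.execute(
        "SELECT work_date, hours FROM timesheet_entries WHERE employee_id = ?",
        (employee_id,),
    ).fetchall()


def entries_for_week(employee_id, week_start, week_end):
    """The bounded pattern: fetch only the week being edited."""
    return conn.execute(
        "SELECT work_date, hours FROM timesheet_entries "
        "WHERE employee_id = ? AND work_date BETWEEN ? AND ?",
        (employee_id, week_start, week_end),
    ).fetchall()


last_entry = first_entry + timedelta(weeks=155)
week_start = last_entry - timedelta(days=6)

print(len(entries_unbounded(1)))
print(len(entries_for_week(1, week_start.isoformat(), last_entry.isoformat())))
```

Multiply the first query by every employee hitting the system on a Friday afternoon and the weekly slowdown is not hard to picture.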

With that corrected, users reacted with awe – they thought the system had been massively upgraded, but instead it had just been a (relatively minor) SQL tweak.

What does this have to do with availability, you may be wondering? Well, everything.

You see, the SAP server was up for lengthy periods of time. The application also was up for lengthy periods of time. Yet the service – timesheets, and more generally the entire SAP database – was increasingly unavailable. For many users, a week’s timesheet entry took 2+ elapsed hours: initiating a new entry, waiting infuriatingly long minutes for the system to respond, and then often inputting the entry later, after having switched away to something else in the meantime. By no stretch of the imagination could that service be said to be available.

So how do you measure availability? Well, the act of measuring is perhaps more challenging, and going to be handled on a service-type by service-type basis. (E.g., measuring web services will be different from measuring desktop services which will be different from measuring local server services, etc.)

The key step is defining useful, quantifiable metrics. That is, a metric such as “users should get snappy response” is vague, useless and (regrettably) all too easy to define. The real metrics are timing/accuracy based metrics. (Accuracy metrics are mainly useful for systems with analogue styled inputs.) Sticking to timing based metrics for simplicity, measuring availability comes down to having specific timings associated with events. The following are closer to being valid metrics:

  • All searches should start presenting data to the user within 3 seconds, and finish within 8 seconds.
  • Confirmation of successful data input should take place within 0.3 seconds.
  • A 20 page text-only document should complete printing within 11 seconds.
  • Scrubbing through raw digital media should occur with no more than a 0.2 second lag between mouse position and displayed frame.

(Using weight scales as an example, an analogue metric might be that the scales will be accurate to within 10 grams.)
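A minimal sketch of how timing metrics like these might be checked follows. The limits are taken from the examples above; the metric names, the `timed` helper and the sample measurements are all hypothetical, and a real setup would gather timings from the service itself rather than from a hard-coded list.

```python
import time

# Limits in seconds, taken from the example metrics above.
LIMITS = {
    "search_first_result": 3.0,
    "search_complete": 8.0,
    "input_confirmation": 0.3,
}


def timed(operation):
    """Run operation() and return the elapsed wall-clock time in seconds."""
    started = time.perf_counter()
    operation()
    return time.perf_counter() - started


def compliance(samples, limit):
    """Fraction of timing samples at or under the agreed limit."""
    return sum(s <= limit for s in samples) / len(samples)


# Hypothetical measurements, in seconds, for each metric.
measured = {
    "search_first_result": [1.2, 2.8, 3.5, 2.1],
    "search_complete": [4.0, 7.9, 9.2, 6.5],
    "input_confirmation": [0.1, 0.2, 0.25, 0.28],
}

for name, limit in LIMITS.items():
    share = compliance(measured[name], limit)
    print(f"{name}: {share:.0%} of samples within {limit}s")
```

Whether 75% compliance counts as “available” is then a matter for negotiation with the business, not something the server can answer for you.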

While metrics are more challenging to quantify than boolean statistics, they allow the usability and availability of a system to be properly measured. Without accurate metrics, uptime is like digging for fool’s gold.

 

I’ve recently discovered a site with a prosaic name of “DailyWTF” … obviously aimed at technical people, it frequently covers some of the more nonsensical happenings in IT. I thoroughly recommend periodically visiting it.

I was amused to read this story about SLAs regarding uptime today – it reminded me of a company I once was involved with that promised 1 hour restoration time on backups, yet sent media to an offsite location 1.5 hours away as soon as backups completed without keeping clones on site.

This raises the obvious point so frequently missed – ensure that SLAs are achievable.

© 2012 The NetWorker Blog