Uptime is an inappropriate metric

I’m going to start with a statement that may make some people angry. Some might suggest that I’m just goading the blogosphere for traffic, but that’s not my point.

System administrators and managers who focus on keeping uptime as high as possible for no other reason than having good numbers are usually showing an arrogant disrespect to the users of their systems.

There – I said it. (I believe I am now required to walk the Midrange plank and dive into the Mainframe sea of Mediocrity). As a “long-term” Unix admin and user, I found it galling when I initially realised that the midrange environment has always had the wrong attitude towards uptime. Uptime for the sake of uptime, that is. These days, I use a term you might expect to hear more in the mainframe world: negotiated uptime.

You see, there’s uptime, and there’s system usefulness. That’s the significant difference between raw uptime and negotiated uptime. Confusing the two achieves only one thing: unhappy users.

Here are just a few examples of rigid adherence to ‘uptime’ gone wrong:

  • Systems performing badly for days at a time while system administrators hunt for the cause, when they know that a suitable workaround would be to reboot the system overnight – or even during the lunch hour, when most users are away from their desks.
  • Systems that don’t get patched for months at a time because patching would require a reboot, and that would affect uptime. (“If it’s not broken, don’t fix it” can excuse deferring routine patches, but very, very rarely security patches.)
  • Applications that don’t get upgraded, despite obvious (or even required!) fixes in new releases, because the application administrators don’t like restarting the application.

I’ll go so far as to say that uptime, measured at the individual server level, is irrelevant and inappropriate. Uptime should never be about the servers, or even the applications – it’s about the services for the business.

Clusters typically represent a far healthier approach to uptime: a recognition that one or more nodes can fail so long as the service continues to be delivered. There are clusters (particularly OpenVMS clusters) known to have been presenting services for a decade or more, all the while receiving OS upgrades and hardware replacements, and undoubtedly weathering single-node failures and changes.

The healthiest approach to uptime, however, is to recognise that individual system or application uptime is irrelevant. What should be measured is the availability of services as experienced by users. All the uptime stats, SNMP monitoring stats, etc., in the world are irrelevant compared with how useful an actual IT service is to a business function.

The challenge, of course, is that availability is significantly harder to measure than uptime. Uptime is, after all, dead simple – on any Unix platform a single command, ‘uptime’, gets you that measurement. Presumably on Windows there are easy ways to get that information too. (Without any real experience in measuring this myself, I know you can usually at least get the last boot time out of the event logs.) Half of why it’s simple is the ease with which the statistic can be gathered.
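
To illustrate just how cheaply the statistic can be gathered, here’s a minimal sketch. It assumes a Linux host exposing /proc/uptime; other platforms would need their own branch (e.g., pulling the last boot time out of the Windows event logs), which is left out here:

```python
# A minimal sketch of how cheaply raw uptime can be gathered.
# Assumes a Linux host exposing /proc/uptime.
def get_uptime_seconds() -> float:
    # /proc/uptime holds seconds-since-boot as its first field
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

if __name__ == "__main__":
    print(f"up for {get_uptime_seconds() / 86400:.1f} days")
```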

The other half of why uptime is an easy measurement is that it’s a boolean statistic. A host is either up or down (and while transitioning between the two, it’s usually considered ‘down’ until it’s fully ‘up’). What makes availability harder to measure is that it isn’t all boolean measurements.

Services, however, can be up but not available. That is, they can be technically available, yet not practically available.
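
To make that distinction concrete, here’s a minimal sketch of the difference between a boolean probe and an availability-aware one – the host, port, and three-second threshold are all hypothetical:

```python
# Sketch: a host can be "up" (connection accepted) yet the service
# "unavailable" (responding far too slowly to be practically useful).
import socket
import time

HOST, PORT = "timesheets.example.com", 443  # hypothetical service
THRESHOLD = 3.0                             # seconds users will tolerate

def probe() -> str:
    start = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=30):
            elapsed = time.monotonic() - start
    except OSError:
        return "down"                       # the boolean uptime view
    if elapsed > THRESHOLD:
        return "up but not available"       # technically up, practically not
    return "available"

print(probe())
```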

Here’s an old example I like to dredge out regarding availability. An engineering company invested staggering amounts of money in putting together a highly customised implementation of SAP. Included in this implementation was a fairly in-depth timesheet module that did all sorts of calculations on the fly for timesheet entry.

Over the time I spent administering this system, complaints grew practically week by week that come Friday (the deadline by which timesheets had to be entered, which caused a load rush at the end of each week), the SAP server was getting slower and slower. Memory was upgraded, the hard drive layout was tweaked, and so on, but in the end the system just got slower and slower and slower.

Eventually it was determined that the problem wasn’t in the OS or the hardware, but in the SQL coding of the timesheet system. You see, every time a user went to add a new timesheet entry, a logical error in the SQL code would first retrieve every timesheet entry that employee had made since the system was commissioned or since the employee started, whichever was later. As you can imagine, as the months and years went by, this amounted to a lot of heavy selects going on each week.
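
The actual code is long gone, so the following is a purely hypothetical reconstruction of the shape of the error – dragging an employee’s entire history across the wire on every insert, when only the current week matters:

```python
# Hypothetical reconstruction of the shape of the bug: every new entry
# first pulled the employee's entire timesheet history.

# Before: cost grows with every week the employee has ever worked.
SLOW_QUERY = """
    SELECT * FROM timesheet_entries
    WHERE employee_id = :emp
"""  # all rows since commissioning, fetched on every new entry

# After: only what the entry screen actually needs.
FAST_QUERY = """
    SELECT * FROM timesheet_entries
    WHERE employee_id = :emp
      AND week_ending = :current_week
"""
```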

With that corrected, users reacted with awe – they thought the system had been massively upgraded, but instead it had just been a (relatively minor) SQL tweak.

What does this have to do with availability, you may be wondering? Well, everything.

You see, the SAP server was up for lengthy periods of time. The application was also up for lengthy periods of time. Yet the service – timesheets, and more generally the entire SAP database – was increasingly unavailable. For many users, entering a week’s timesheets took two or more elapsed hours: initiating a new entry, waiting an infuriating number of minutes for the system to respond, and often switching away to something else in the meantime, only to finish the entry later. By no stretch of the imagination could that service be said to be available.

So how do you measure availability? Well, the act of measuring is perhaps the more challenging part, and it is going to have to be handled on a service-type by service-type basis. (E.g., measuring web services will be different from measuring desktop services, which will be different from measuring local server services, etc.)

The key step is defining useful, quantifiable metrics. A metric such as “users should get snappy response” is vague, useless, and (regrettably) all too easy to define. The real metrics are timing- and accuracy-based. (Accuracy metrics are mainly useful for systems with analogue-style inputs.) Sticking to timing-based metrics for simplicity, measuring availability comes down to having specific timings associated with events. The following are closer to being valid metrics:

  • All searches should start presenting data to the user within 3 seconds, and finish within 8 seconds.
  • Confirmation of successful data input should take place within 0.3 seconds.
  • A 20-page text-only document should complete printing within 11 seconds.
  • Scrubbing through raw digital media should occur with no more than a 0.2 second lag between mouse position and displayed frame.

(Using weight scales as an example, an analogue metric might be that the scales will be accurate to within 10 grams.)
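
As a sketch of how such metrics become measurements – the event names and thresholds below are the hypothetical ones from the list above – availability becomes the fraction of observed events that met their target, rather than a single boolean:

```python
# Sketch: scoring observed timings against defined availability metrics.
# Event names and thresholds mirror the hypothetical list above.
METRICS = {
    "search_first_result": 3.0,   # seconds
    "search_complete": 8.0,
    "input_confirmation": 0.3,
    "print_20_page_doc": 11.0,
}

def availability(samples: dict[str, list[float]]) -> float:
    """Fraction of observed events that met their metric."""
    met = total = 0
    for event, timings in samples.items():
        limit = METRICS[event]
        met += sum(1 for t in timings if t <= limit)
        total += len(timings)
    return met / total if total else 1.0

# Example: both searches finished in time; one confirmation did not.
print(availability({
    "search_complete": [5.2, 7.9],
    "input_confirmation": [0.25, 1.4],
}))  # -> 0.75
```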

While such metrics are more challenging to quantify than boolean statistics, they allow the usability and availability of a system to be properly measured. Without accurate metrics, chasing uptime is like digging for fool’s gold.

4 thoughts on “Uptime is an inappropriate metric”

  1. I agree, uptime for uptime’s sake is not worth it. I’ve got a monthly maintenance downtime (usually on the last Sunday of the month) where my servers get updated with their patches and rebooted. If nothing else, if there’s a problem, I find it during a five hour window when everyone already knows that the server won’t be available rather than during the middle of the workday when everyone needs to have that server up and working.

  2. >suitable workaround would be to reboot the system

    I have never seen this fix ANYTHING on a midrange platform.

It’s not Windows.

    1. Whether these days Windows belongs in its own category or is arguably part of midrange is not part of my point.

      The fantastic thing about midrange systems though is their wide variety – you’ve obviously been exposed to different issues and challenges than I and others I know have. I’ve certainly seen situations in midrange systems that are resolvable via reboots. Bear in mind that my point is this: if a reboot can solve an issue quickly, as opposed to having the system operate in a degraded or less-than-useful way for a lengthy period that impacts users, the reboot should be recognised as a valid option.

      The sorts of scenarios I’m discussing include such things as, say:
      – significant issues with runaway processes, or
      – daemons wedged in uninterruptible IO waiting for something on the other end of a SCSI chain to time out, or
      – crashed processes that will not restart due to some combination of shared memory and who-knows-what-else blocking new processes starting

      etc.
