Your cloud based data may be hanging by a thread and you wouldn’t even know.

Clouds: Is your data hanging by a thread?

Introduction

The recent Sidekick debacle proved one thing: it’s insufficient to “just trust” companies that are currently offering cloud based services. Instead, industry standards and regulations must be developed to permit use of the term.

I’ll be blunt: as per previous articles here, I don’t believe in “The Cloud” as a fundamental paradigm shift. I see it as a way of charging more for delivering the same thing for private clouds, and (as exemplified by Sidekick), something which may be fundamentally unreliable as a sole repository of data in the public instance.

Regardless of that however, it’s clear that the “cloud” moniker will be around for a while, and businesses will continue to trade on providing “cloud” services (and thus being buzzword compliant). So, like it or lump it, we need to come up with some rules.

Recently SNIA has started an initiative to try to set up some standards for Cloud based activities. However, as is SNIA’s right, and their focus, this primarily looks at data management, which is less than half of the equation for public cloud services. The lion’s share of the equation for public cloud services, as proven by the Sidekick debacle, is trust.

Currently the cloud computing industry is like the wild west. Lots of people are running around promising fabulous new things that can solve any number of problems. But when those fabulous new things fail or fall over even temporarily, a lot of people can be negatively affected.

How can people trust that their cloud data is safe? Regulation is a good starting point.

If you are one of those people who at the first hint of the word regulation throws up your hands and says “that’s too much government intervention”, then I’d invite you to stop and think for a few minutes about the global financial crisis. If you’re one of those people who insists “industries should be self regulating”, I’d invite you to look at a certain Microsoft subsidiary called Danger that was offering a service called Sidekick. In short, self regulation doesn’t work without rigid transparency.

So, what needs to be done?

Well, there are three key factors that need to be addressed in order to achieve true and transparent trust within cloud based businesses. These are:

  • Foundation of ethical principles of operation
  • Periodic certified (mandatory) audit process
  • Reporting

Let’s look at each of these individually.

Ethical Principles of Operation

Whenever I start thinking about ethics in IT, I think of two different yet equally applicable sayings:

  • Common sense is not that common (usually incorrectly attributed to Voltaire)
  • When you assume you make an ass out of u and me. (Unknown source.)

Extending beyond the notion of “cloud”, we can say that companies should strive to understand the ethical requirements of data hosting, so as to ensure that whenever they hold data for and on behalf of another company or individual they:

  1. At all times aim to keep the data available within the stated availability times/percentages.
  2. At all times ensure the data is recoverable.
  3. At all times be prepared to hand over said data on request/on termination of services.

These should be self evident in that if the situation were reversed we would expect the same thing. Companies that offer cloud services should work such ethical goals into their mission requirements and into the individual goals of every employee. (If the company offers cloud application services as well as just data services, the same applies.)

Mandatory, Periodic, Independently Certified Auditing of Compliance

In a perfect world, ethics alone would be sufficient to garner trust. However, as we all know, we need more than ethics in order to generate trust. Trust will primarily come from mandatory periodic independently certified auditing of compliance to ethical principles of cloud data storage.

What does this mean?

So let’s look at each word in that statement to understand what a company* should have to do in order to offer “cloud” data/services:

  • mandatory – it must, in order to keep referring to itself as “cloud”
  • periodic – every 6-12 months (more likely every 12 months – 6 would be preferable in the fast moving world of the internet however)
  • independently – to be done by companies or consultants who do not have any affiliation that would cause a conflict of interest
  • certified auditing – said companies or consultants doing the auditing must have certification from SNIA for following appropriate practices
  • compliance – if found to be non-compliant, SNIA (or some other designated agency) must post a warning on their web-site within 1 month of the audit, and the company be given 3 months to rectify the issue. If after 3 months they have not, then SNIA should flag them as non-compliant. This should also result in the company taking down any reference to “cloud”.

Obviously unless legally enforced, a company could choose to sidestep the entire compliancy check and just declare themselves to be cloud services regardless. Therefore there must be a “Known Compliant” list kept up to date, country-by-country, that would be advertised not only by SNIA but by actual cloud-compliant companies which partake in the process, so that end-users and businesses could reference this to determine who have exhibited certified levels of trust.

In order to achieve that certification, companies would need to be able to demonstrate to the auditor that they have:

  • Designed their systems for sufficient redundancy
  • Designed adequate backup and per-customer data recoverability options (see note below)
  • Have disaster recovery/contingency planning in place
  • Have appropriate change controls to manage updates to infrastructure or services

Note/Aside regarding adequate backup and per-customer data recoverability options. Currently this is an entirely laughable and inappropriate state. If companies wish to offer cloud based data services, and encourage users to store their data within their environment, they must also offer backup/recovery services for that data. They may choose to make this a “local-sync” style option – keeping a replica of the cloud-data in a designated local machine for the user, or, if not done this way, they must offer a minimum level of data recoverability service to their users. For example, something even as basic as “Any file stored in our service for more than 24 hours will be recoverable for 6 weeks from time of storage.” I.e., it doesn’t necessarily have to be the same level of data recovery we expect from private enterprise networks, but it must be something.
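
To make that example rule concrete, here is a minimal sketch of how such a minimum recoverability promise could be checked. It assumes one particular reading of the sentence above (the file must have survived more than 24 hours before it was lost, and the recovery must be requested within 6 weeks of when it was first stored); the function and constant names are hypothetical, not any vendor’s API.

```python
from datetime import datetime, timedelta

# Hypothetical policy values, taken from the example sentence above.
MIN_AGE = timedelta(hours=24)         # file must have been stored longer than this
RECOVERY_WINDOW = timedelta(weeks=6)  # recoverable for this long from time of storage


def is_recoverable(stored_at: datetime, lost_at: datetime, now: datetime) -> bool:
    """Return True if the example policy promises recovery of a file.

    stored_at: when the file was written to the service
    lost_at:   when the file was deleted or damaged
    now:       when the recovery is requested
    """
    # The file had to exist for more than 24 hours before it was lost.
    stored_long_enough = (lost_at - stored_at) > MIN_AGE
    # The request must fall within 6 weeks of the original storage time.
    within_window = (now - stored_at) <= RECOVERY_WINDOW
    return stored_long_enough and within_window
```

Even a rule this small gives an auditor something testable, which is the point: the promise is stated and can be checked.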

It would be easy and entirely inappropriate to say instead of all this auditing that companies must simply publish all the above information. However, that represents a potential data security issue, and it also potentially gives away business-sensitive information, so I’m firmly against that idea. The only workable alternative to that however is the certified auditing process.

Reporting

Currently there is far too cavalier an approach to reporting by cloud vendors about the state of their systems. Reporting must be publicly available, fulfilling the following categories:

  1. Compliancy – companies should ensure that any statement of compliancy is up to date.
  2. Availability – companies should keep their availability percentage (e.g., “99.9% available”) publicly available in the way that many primary industries for instance publish their “days without an injury” statistics.
  3. Failures – companies must publish failure status reports/incident updates at minimum every half an hour, starting from the time of the incident and finishing after the incident is resolved. It’s important for cloud vendors to start to realise that their products may be used by anyone else in the world, so it’s not sufficient to just wake IT staff on an incident; management or other staff must be available to ensure that updates continue to be generated without requiring IT staff to stop working on resolution. I.e., round-the-clock services require round-the-clock reporting.
  4. Incident reports – all incidents that result in unavailability should have a report generated, which will be reviewed by the auditor on the next compliancy check.
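
As a rough illustration of items 2 and 3, the sketch below computes a published availability percentage from recorded outage durations, and lists the times at which incident updates would fall due under the half-hourly rule. The function names and the idea of feeding in a plain list of outage durations are my assumptions for the example; a real reporting system would draw on proper incident records.

```python
from datetime import datetime, timedelta
from typing import List


def availability_percent(period: timedelta, outages: List[timedelta]) -> float:
    """Percentage of the reporting period during which the service was available."""
    downtime = sum(outages, timedelta())
    return 100.0 * (1 - downtime / period)


def update_times(start: datetime, resolved: datetime,
                 interval: timedelta = timedelta(minutes=30)) -> List[datetime]:
    """Times at which status updates fall due for one incident.

    One update at the start, one every `interval` while the incident is
    open, and a final update at the time of resolution.
    """
    times = []
    t = start
    while t < resolved:
        times.append(t)
        t += interval
    times.append(resolved)
    return times
```

For scale: over a 30 day period, “99.9% available” leaves room for a little over 43 minutes of downtime in total.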

In conclusion

Does this sound like a lot of work? Well, yes.

It’s all too easy for those of us in IT to take a cavalier attitude towards user data – they should know how to back up, they should understand the risks, they should … well, you get the picture. Yes, there’s a certain level of education we would like to see in end users, but think of the flip-side. They’re not IT people. They don’t necessarily think like IT people. For the most part, they’ve been trained not to think about backup and data protection because it’s not something that’s been pushed home within the operating systems they’re using. (A trend that seems to be readily reversing in Mac OS X thanks to Time Machine.)

Ultimately, cloud failures can’t be palmed off with trite statements that users should have kept local copies of their data. Cloud services are being marketed and promoted as “data available anywhere” style systems, which creates an expectation of protection and availability.

So in short, while this is potentially a lot of work to set up, it’s necessary. It should be considered to be a moral imperative. In order to actually garner trust, the current wild-west approach to Clouds must be reined in and be given certified processes that enable users (or at least trusted IT advisers of users) to confidently point at a service and say: “that’s been independently checked: it’s trustworthy”.

Anything short of this would be a scandalous statement about deniability, legal weaseling out of responsibility and a “screw you” attitude towards end-user data.


* Obviously some individuals, moving forward, may in various ways choose to offer cloud access. Due to hosting and bandwidth, it’s likely in most instances that such access would be as a virtual private cloud – a cloud that’s “out there” in internet land, but is available only to select users. As such, it would fall into the realm of private clouds, which will undoubtedly have a do whatever the hell you feel like doing approach. However, in the event of individuals rather than corporates specifically offering full public-cloud style access to data, there should be a moniker for “uncertified” individual cloud offerings – available only to individuals; never to corporates.

 

For a while now I’ve been working with EMC support on an issue that’s only likely to strike sites that have intermittent connectivity between the server and storage nodes and that stage from ADV_FILE on the storage node to ADV_FILE on the server.

The crux of the problem is that if you’re staging from storage node to server and comms between the sites are lost for long enough that NetWorker:

  • Detects the storage node nsrmmd processes have failed, and
  • Attempts to restart the storage node nsrmmd processes, and
  • Fails to restart the storage node nsrmmd processes

Then you can end up in a situation where the staging aborts in an ‘interesting’ way. The first hint of the problem is that you’ll see a message such as the following in your daemon.raw:

68975 10/15/2009 09:59:05 AM  2 0 0 526402000 4495 0 tara.pmdg.lab nsrmmd filesys_nuke_ssid: unable to unlink /backup/84/05/notes/c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342 on device `/backup’: No such file or directory

(The above was rendered for your convenience.)

However, if you look for the cited file, you’ll find that it doesn’t exist. That’s not quite the end of the matter though. Unfortunately, while the saveset file that was being staged didn’t stay on disk, its media database details did. So in order to restart staging, it becomes necessary to first locate the saveset in question and delete the media database entry for the (failed) server disk backup unit copy. Interestingly, this is only ever to be found on the RW device, not the RO device:

[root@tara ~]# mminfo -q "ssid=c452f569-00000006-fed6525c-4ad6525c-00051c00-dfb3d342"
 volume        client       date      size   level  name
Tara.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001       fawn      10/15/2009 1287 MB manual  /usr/share
Fawn.001.RO    fawn      10/15/2009 1287 MB manual  /usr/share
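
If you need to hunt for these in a long log, a small script can do the tedious part. The following is a minimal sketch that pulls the saveset ID out of a rendered log line like the one above and builds the same mminfo query shown here. It assumes the final path component in the “unable to unlink” message is the long-format ssid, as it is in this example; the function names are hypothetical, and the script only builds the query string, it doesn’t run or delete anything.

```python
import re
from typing import Optional

# Matches the "unable to unlink <path> on device" part of the rendered message.
UNLINK_RE = re.compile(r"filesys_nuke_ssid: unable to unlink (\S+) on device")


def ssid_from_log_line(line: str) -> Optional[str]:
    """Return the last path component of the unlinked file, or None if no match."""
    match = UNLINK_RE.search(line)
    if match is None:
        return None
    # Assumption: the file name on the disk backup unit is the long-format ssid.
    return match.group(1).rsplit("/", 1)[-1]


def mminfo_query(ssid: str) -> str:
    """Build the mminfo query used above to list every copy of the saveset."""
    return f'mminfo -q "ssid={ssid}"'
```

From there you can review the mminfo output yourself before deciding which media database entry to remove.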

We had hoped that it was fixed in 7.5.1.5, but my tests aren’t showing that to be the case. Regardless, it’s certainly around in 7.4.x as well and (given the nature of it) has quite possibly been around for a while longer than that.

As I said at the outset, this isn’t likely to affect many sites, but it is something to be aware of.

 

Many of us with NetWorker have been in the situation where a backup has started (particularly when it’s for a newly configured group), and instead of going to the pool we want it to go to, it goes to the Default pool. For sites using multiple pools, it’s usually the case that no media will be in the Default pool, and hence the backup won’t go anywhere.

In those situations, determining why NetWorker is suddenly requesting media in the Default pool is quite easy. Sometimes however, the answer is not so easy. A media request may come out of the blue, with no server-initiated activities behind it, and nothing may be logged to indicate what is causing the request. It could be that an end-user is attempting to run a backup, or that a backup process that was server initiated has gone awry, restarted, and for some reason targeted the Default pool.

This leads me to what I’d call “Default pool debugging 101” … or “how to save yourself a lot of hair tearing”. I had a customer once who called me and expressed a level of exasperation over having already spent several days off and on chasing down what might be causing the persistent request for “1 writable volume in the Default pool”.

My solution in such situations is simple: if you can’t spot what is going wrong – why NetWorker is asking for the media in the wrong pool, then label a volume into that pool and see what writes to it. In such cases one of three things will typically happen:

  1. The volume will be loaded but then not used because a process requested it, was aborted, and for some reason NetWorker didn’t detect the abort.
  2. The volume will be loaded and written to by a manual backup process, in which case the metadata for the backup can be used to identify who (or what) has sent the data to the wrong pool.
  3. The volume will be loaded and written to by an errant scheduled backup process that experienced some failure “a while ago”, in which case it can be staged, upon completion, to the correct pool.

I’m the first person to jump to the defense of elegant and well considered solutions. Doing the mundane thing of just labeling media into the “incorrect” pool that NetWorker is requesting media for smacks of inelegance or even a pseudo “brute force” approach. However, sometimes the easiest solution is also the best – instead of wasting considerable amounts of time chasing phantoms, why not just cut to the chase in such media situations where the solution isn’t obvious, and let NetWorker tell you where the request is coming from?

 

Never trust anything that can think for itself if you can’t see where it keeps its brain.
J.K. Rowling, “Harry Potter and the Chamber of Secrets”

Regular readers of this blog will know that I’m a strong disbeliever in The Cloud – for some very key reasons. The reasons are distinctly different depending on whether a vendor is talking about a private cloud or an “out there in the internet” public cloud.

For private clouds, I think it’s nothing more than the emperor’s new clothes … it’s nothing more than an attempt to stick a buzzword compliant label on something already done in datacentres and charge more for it.

For public clouds, my primary concern is that it’s a variant of trusting trust. Businesses who put their data, apps and services in the hands of cloud vendors have to trust that the data will be well managed and highly available.

(Aside: Yes, I acknowledge I use Mozy. I use it for limited and personal backups only. I use it for immediate offsite backups of a few key chunks of data that I also backup via other mechanisms. I.e., if Mozy disappears tomorrow, all I’ve lost is a bit of convenience – not my data.)

In addition to the plethora of traditional Internet based companies that are ramming cloud down our throats every spare moment, lots of “traditional” IT companies are banging on about cloud computing in the most obnoxiously hyped up ways these days. EMC falls heavily into that camp. So does IBM. So does Microsoft. Indeed, it seems impossible to find a company these days that isn’t willing to jump up and down shouting “us too, us too, look at us, we do cloud! Our clouds are ever so pretty and oh so reliable!”

Thin provision this. OpEx vs CapEx that. Data replication that. Anywhere access it all. It brings a little lump of bile to the back of my throat every time another vendor jumps up and down about cloud. It’s all a load of hype.

You want thin provisioning? That’s called virtualisation – or at a pinch, blade servers – and paravirtualisation. You want OpEx vs CapEx? Charge-out for processor cycles used has been around in the mainframe world since practically the year dot (IT wise). You want replication? That’s been around for ages too. You want internet available data? Um, yeah, that’s been around for a while as well.

You want to pay an extra 50% to 100% and have a buzzword compliant “Cloud” sticker on it? Excellent! I have a bridge I want to sell you with your leftover budget.

If that all came across as me jumping up and down on top of a soap box, you’d probably be right. Sometimes it seems that the only person of senior ranks in the IT industry with the chutzpah to tell the truth about cloud is Larry Ellison. And even Larry admits that cloud has reached such a level of hype that Oracle will be forced to stick some buzzword compliant stickers on their marketing material as a result.

So what does this have to do with Sidekick? Well, everything.

Despite what some pundits would tell you as they desperately scramble to protect the “good name” of cloud from yet another tarry lining, Sidekick is cloud. Sidekick was in fact cloud at its strongest level of hubris. Data in the cloud with no ready provisioning for seamless local backup and restore. Cloud goes, data goes. It’s that simple. You couldn’t get a more buzzword compliant appearance of cloud than that.

Now I know that people will leap to the defense of cloud and say “well, it’s not the cloud’s fault, but the implementation’s fault – they didn’t understand ILP properly”, for instance. There’s a level of truth in that, but truth and trust don’t go hand in hand. You see, the end user doesn’t know that some vendors when they talk about cloud mean replicating, self repairing data services that are highly available. They just, thanks to all the buzz and hype generated by the industry, hear “cloud” and think “wow, that’s secure!”

This isn’t a matter of truth, it’s a matter of trust. It’s a matter of a monumental breach of trust.

You see, the biggest, most misleading claim about cloud computing is that public clouds – clouds hosted by big corporates, are hosted properly and will provide high availability. We’re only barely across the starting line of companies offering cloud based services – companies that have supposedly been doing high availability themselves for ages – and yet we’re already seeing situations, time and time again, where cloud “vendors” are letting their users down. Sidekick is the latest and perhaps worst example. However, Google Mail has had systemic failures, Apple’s MobileMe has suffered issues as well – cloud failures are all around us, just waiting to be looked at.

The cloud system is hopelessly unbalanced in favour of the supplier. Massive companies with massive budgets with lots of very very small customers. So what if the cloud goes down for a few minutes – what’s a single person going to do about it?

Well, judging by the number of search hits I’ve had in the last couple of days due to a previous article I wrote about Sidekick, I have to imagine that the term class action lawsuit is springing to mind for a lot of those small and otherwise disenfranchised users.

Anyone who trusts the notion of a public cloud that doesn’t offer to seamlessly and automatically keep data locally available after the Sidekick debacle is a fool.

With a bit of luck, one good thing may come out of the Sidekick debacle – the silver bullet/magic solution hype that has surrounded cloud for far too long may finally be pierced with some cold hard facts.

It’s time for people to wake up and smell the trust.

[Edit]

Current reports would seem to indicate that some, if not all of the Sidekick data may have been restored.

Is this cause for celebration? For the end users, yes. Does it mean that Sidekick is trustworthy? Hell no – a significant data loss event taking such a lengthy period of time to recover is not, under any circumstances, a sign of trust.

 

There’s a Top 10 Reasons marketing document from EMC now explaining why you should be using NetWorker.

While obviously this is a marketing document, it has one serious flaw above and beyond any issues that people may have over standard marketing documents. It puts recovery at the wrong numbered item.

The first reason, according to the document, is about driving costs out of the environment. Sure, that’s a reason for deployment of an integrated solution, but it’s not the number 1 reason. The second reason is apparently due to closed backup windows. That is a good second reason, but it doesn’t have priority over the real number one reason.

The third reason they cite on the document is:

If your backups don’t recover, neither will your business.

NetWorker is the leader in recovery performance – up to 100 percent faster than the alternatives. It delivers fast, secure, and reliable data and server recovery to ensure you meet your required service levels.

Who in their right mind would put this at #3 on a list of reasons? Backup is recovery – or rather, it’s all about recovery. Whoever put this marketing document together was wrong on one very key point:

Recovery is and has always been Number One.

(If you think that’s wrong, just go talk to all the Sidekick users…)

 

The net has been rife with reports of an extreme data loss event occurring at Microsoft/Danger/T-Mobile for the Sidekick service over the weekend.

As a backup professional, this doesn’t disappoint me, it doesn’t gall me – it makes me furious on behalf of the affected users that companies would continue to take such a cavalier attitude towards enterprise data protection.

This doesn’t represent just a failure to have a backup in place (which in and of itself is more than sufficient for significant condemnation), but a lack of professionalism in the processes. I.e., there should be some serious head kicking going on regarding this, most notably regarding the following sorts of questions:

  • Why wasn’t there a backup?
  • Where was the change control that should have prevented the work from being done while no backup was available?
  • Why wasn’t the system able to handle the failure of a single array?
  • When will the class action lawsuits start to roll in?

I don’t buy into any nonsense that maybe the backup couldn’t be done because of the amount of data and the time required to do it. That’s just a fanciful workgroup take on what should be a straightforward enterprise level of data backup. Not only that, the system was obviously not designed for redundancy at all … I’ve got (relatively, compared to MS, T-Mobile, etc) small customers using array replication so that if a SAN fails they can at least fall back to a broken off replica. Furthermore, this raises the question: For such a service, why aren’t they running a properly isolated DR site? Restoring access to data should have been as simple as altering the paths to a snapped off replica on an alternate, non-upgraded array.

This points to an utterly untrustworthy system – at the absolute best it smacks of a system where bean counters have prohibited the use of appropriate data protection and redundancy technologies for the scope of the services being provided. At worst, it smacks of an ineptly designed system, an ineptly designed set of maintenance procedures, an inept appreciation of enterprise data protection strategies, and perhaps even a level of contempt for the data of users.

(For any vendor that would wish to crow, based on the reports, that it was a Hitachi SAN that was being upgraded by Hitachi staff and therefore it’s a Hitachi problem: pull your heads in – SANs can fail, particularly during upgrade processes where human errors can creep in, and since every vendor continues to employ humans, they’re all susceptible to such catastrophic failures.)

 

In many environments with storage nodes, a common requirement is to share backup devices between the server and/or storage nodes (regardless of whether the storage nodes are dedicated or full). The primary goal is to reduce the number of devices, or the number of tape libraries required in order to minimise cost while still maximising flexibility of the environment.

There are two mechanisms available for device sharing. These are:

  • Library sharing – free from any licensing, this is the cheapest but least flexible
  • Dynamic drive sharing – requiring additional licenses, this is more flexible but comes at a higher cost in terms of maintenance, debugging and complexity.

It’s easiest to gain an understanding of how these two options work with some diagrams.

First, let’s consider library sharing:

Conceptual diagram of library sharing

(Not shown: connection to tape robot head – i.e., control port connection.)

In this configuration, more than one host connects to specific devices in the tape library. These are hard or permanent connections; that is, once a device has been allocated to one server/storage node, it stays allocated to that host until the library is reconfigured.

This is a static allocation of resources that has the backup administrator allocate a specific number of devices per server/storage node based on the expected requirements of the environment. For instance, in the example above, the server has permanent mappings to 3 of the 6 tape drives in the library; the full storage node has permanent mappings to 2 of the drives, and the dedicated storage node has a permanent mapping to the one remaining drive in the tape library.
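
As a conceptual sketch only (this is not how NetWorker stores its configuration), the static mapping in the example can be thought of as a fixed table of drives per host. The drive names and the “server” host name below are made up for illustration; the 3/2/1 split is the one from the diagram.

```python
# Fixed drive-to-host mapping from the example: 6 drives, never reassigned
# unless the library is reconfigured. Drive names are hypothetical.
STATIC_MAP = {
    "server": ["drive1", "drive2", "drive3"],   # backup server: 3 drives
    "stnode": ["drive4", "drive5"],             # full storage node: 2 drives
    "dedstnode": ["drive6"],                    # dedicated storage node: 1 drive
}


def usable_drives(host, failed=frozenset()):
    """Drives a host can still use, given a set of failed drive names."""
    return [drive for drive in STATIC_MAP[host] if drive not in failed]
```

With this model, `usable_drives("dedstnode", {"drive6"})` returns an empty list: a host with a single statically mapped drive has nowhere else to go when that drive fails, no matter how many drives sit idle elsewhere in the library.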

The key advantages of this allocation method are:

  • Zero licensing cost,
  • Guaranteed device availability,
  • Per-host/device isolation, preventing faults on one system from cascading to another.

The disadvantages of this allocation method are:

  • No dynamic reallocation of resources in the event of requirement spikes that were not anticipated
  • Can’t be reconfigured “on the fly”
  • If a backup device fails and a host only has access to one device, it won’t be able to backup or recover without configuration changes.

Where you would typically use this allocation method:

  • In VTLs – since NetWorker licenses VTLs by capacity*, you can allocate as many virtual drives as you want, providing each host with more than a certain amount of data in the datazone with one or more virtual drives, significantly reducing LAN impact of backup.
  • In PTLs where backup/recovery load is shared reasonably equally by two or more storage nodes (counting the server as a storage node in this context) and having only one library is desirable.

Moving on to dynamic drive sharing, this model resembles the following:

Conceptual overview of dynamic drive sharing

In this model, licenses are purchased, on a per-drive basis, for dynamic sharing. (So if you have 6 tape drives, such as in the above example, you would need up to 6 dynamic drive sharing licenses – you don’t have to share every device in the library – some could remain statically mapped if desired.)

When a library with dynamically shared drives is set up within NetWorker, the correct path to the device, on a per-host basis, will be established for that device within the configuration. This might mean that the device “/dev/nst0” on the backup server might be known in the configuration as being all three of the following:

  • /dev/nst0
  • rd=stnode:/dev/rmt/0cbn
  • rd=dedstnode:\\.\Tape0

When a host with dynamic access to drives needs a device (for either backup or recovery), the NetWorker server (or whichever host has control over the actual tape robot) will load a tape into a free, mappable drive, then notify the storage node which device it should use to access media. The storage node will then use the media until it no longer needs to, with the host that controls the robot handling any post-use unmounts or media changes.
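
The flow just described can be modelled in a few lines. This is a conceptual sketch under my own assumptions, not NetWorker’s implementation: the class names are hypothetical, the “server” host name is made up, and the device paths are simply the three example paths listed above.

```python
from typing import Dict, List, Optional


class SharedDrive:
    """One physical drive, known by a different device path on each host."""

    def __init__(self, paths: Dict[str, str]):
        self.paths = paths                       # host name -> path on that host
        self.in_use_by: Optional[str] = None     # host currently using the drive


class RobotController:
    """Stands in for whichever host controls the tape robot."""

    def __init__(self, drives: List[SharedDrive]):
        self.drives = drives

    def allocate(self, host: str) -> Optional[str]:
        """Pick a free drive the host can see and return the host's path to it."""
        for drive in self.drives:
            if drive.in_use_by is None and host in drive.paths:
                drive.in_use_by = host
                return drive.paths[host]
        return None  # no free, mappable drive right now

    def release(self, host: str, path: str) -> None:
        """Mark the drive free again once the host no longer needs it."""
        for drive in self.drives:
            if drive.in_use_by == host and drive.paths.get(host) == path:
                drive.in_use_by = None


drive = SharedDrive({
    "server": "/dev/nst0",
    "stnode": "rd=stnode:/dev/rmt/0cbn",
    "dedstnode": r"rd=dedstnode:\\.\Tape0",
})
controller = RobotController([drive])
```

The key point the model captures is that one drive has several names, and only the controlling host decides who gets to use it at any given moment.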

The advantages of dynamic drive sharing are:

  • With maximal device sharing enabled, resource spikes can be handled by dynamically allocating a useful number of drives to host(s) that need them at any given time,
  • Fewer drives are typically required than would be in a library sharing/statically mapped allocation method.

The disadvantages of dynamic drive sharing are:

  • With multiple hosts able to see the same devices, isolating devices from SCSI resets and other SAN events from non-accessing hosts requires constant vigilance. HBAs and SAN settings must be configured appropriately, and these settings must be migrated/checked every time drivers change, systems are updated, etc.
  • It is relatively easy to misconfigure dynamic drive sharing by planning to use fewer physical tape drives than are really necessary. (I.e., it’s not that cheap, from a hardware perspective either.)
  • Each drive that is dynamically shared requires a license.
  • Unlike products such as say, NetBackup, due to the way nsrmmd’s work and don’t share nicely with each other, volumes must be unmounted before devices are transferred from one host with dynamic drive sharing to another, even if both hosts will be using the same volume. (This falls into the “lame” feature category.)

Where you would typically use dynamic drive sharing includes areas such as:

  • A small, select number of hosts with significant volumes of data require LAN-free backups,
  • With small/isolated storage nodes that are still SAN connected (e.g., DMZ storage nodes).

A lot of the architectural reasons as to why dynamic drive sharing was originally developed have in some senses gone away with greater penetration of VTLs into the backup arena. Given that it’s a straightforward proposition to configure a large number of virtual tape drives, instead of messing about with dynamic drive sharing one can instead choose to just use library sharing in VTL environments to achieve the best of both worlds.


* Currently non-EMC VTLs, while still licensed by capacity, typically co-receive unlimited autochanger licenses. Even so, such licenses are not limited by the number of virtual tape drives.

 

I’m going to start with a statement that may make some people angry. Some might suggest that I’m just goading the blogosphere for traffic, but that’s not my point.

System administrators and managers who focus on keeping uptime as high as possible for no other reason than having good numbers are usually showing an arrogant disrespect to the users of their systems.

There – I said it. (I believe I am now required to walk the Midrange plank and dive into the Mainframe sea of Mediocrity). As a “long term” Unix admin and user, I found it galling when I initially realised that the midrange environment has always had the wrong attitude towards uptime. Uptime for the sake of uptime, that is. These days, I use a term you might more expect to hear in the mainframe world: negotiated uptime.

You see, there’s uptime, and there’s system usefulness. That’s the significant difference between uptime and agreed uptime. Confusing these items only achieves one thing: unhappy users.

Here are just a few examples of rigid adherence to ‘uptime’ gone wrong:

  • Systems performing badly for days at a time while system administrators hunt for the cause when they know that a suitable workaround would be to reboot the system of a night time – or even during the time that most users take a lunch break.
  • Systems that don’t get patched for months at a time because the patching would require a reboot and that would affect uptime. (If it’s not broken, don’t fix it, can be used for regular patch avoidance, but very, very rarely for security patching.)
  • Applications that don’t get upgraded, despite obvious (or even required!) fixes in new releases, because the application administrators don’t like restarting the application.

I’ll go so far as to say that uptime, measured at the individual server level, is irrelevant and inappropriate. Uptime should never be about the servers, or even the applications – it’s about the services for the business.

Clusters typically represent a far healthier approach to uptime: a recognition that one or more nodes can fail so long as the service continues to be delivered. There are clusters (particularly OpenVMS clusters) that are known to have been presenting services for a decade or more, all the while continuing to get OS upgrades and hardware replacements and undoubtedly having single node failures/changes.

The healthiest approach to uptime however is to recognise that individual system or application uptime is irrelevant. The net effect experienced by users for availability of services is what should be measured. All the uptime stats, SNMP monitoring stats, etc., in the world are irrelevant when compared with how useful an actual IT service is to a business function.

The challenge of course is that availability is significantly harder to measure than uptime. Uptime is, after all, dead simple: on any Unix platform there’s a single command, ‘uptime’, to get you that measurement. Presumably on Windows there are easy ways to get that information too. (E.g., without any other experience myself in trying to measure this, I know you can (usually) at least get last boot time out of the event logs.) Half of why it’s simple is the ease with which the statistic can be gathered.

What makes availability harder to measure though is that it’s not all boolean measurements. The other half of why uptime is an easy measurement is that it’s a boolean statistic. A host is either up or down (and when transitioning between the two, it’s usually considered to be ‘down’ unless it’s fully ‘up’).

Services however can be up but not available. That is, they can be technically, yet not practically, available.
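To make that distinction concrete, here is a minimal Python sketch. The function name `classify`, the three state labels and the 3 second limit are all illustrative assumptions rather than anything from a real monitoring product. The point is only that a probe of a service has more than two possible outcomes.

```python
from typing import Optional


def classify(response_seconds: Optional[float], limit_seconds: float = 3.0) -> str:
    """Classify a single probe of a service.

    response_seconds is None when the probe received no answer at all.
    limit_seconds is a hypothetical, agreed response-time limit.
    """
    if response_seconds is None:
        return "down"                # the only case a boolean uptime check sees
    if response_seconds > limit_seconds:
        return "up, not available"   # technically answering, practically useless
    return "available"


print(classify(None))    # no answer at all
print(classify(240.0))   # answered, but only after four minutes
print(classify(0.8))     # answered within the agreed limit
```

A plain uptime check collapses the second and third cases into one, which is exactly where unhappy users come from.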

Here’s an old example I like to dredge out regarding availability. An engineering company invested staggering amounts of money in putting together a highly customised implementation of SAP. Included in this implementation was a fairly in-depth timesheet module that did all sorts of calculations on the fly for timesheet entry.

Over time, for those administering this system, complaints grew practically week by week that come Friday (the deadline for timesheet entry, which caused a load rush at the end of each week), the SAP server was getting slower and slower. Memory was upgraded, the hard drive layout was tweaked, and so on, but in the end the system just got slower and slower and slower.

Eventually it was determined that the problem wasn’t in the OS or the hardware, but in the SQL coding in the timesheet system. You see, every time a user went to add a new timesheet entry, a logical error in the SQL code would first retrieve all timesheet entries made by that employee since the system was commissioned or the employee started, whichever was later. As you can imagine, as the months and years went by, this amounted to a lot of heavy selects going on each week.

With that corrected, users reacted with awe – they thought the system had been massively upgraded, but instead it had just been a (relatively minor) SQL tweak.
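The actual SAP code isn’t available to show here, so what follows is only a hypothetical sketch of the same kind of mistake, using Python’s built-in sqlite3 module. The table, column and function names are all invented for illustration. The flawed version reads the employee’s entire history before every insert, so each new entry costs more than the last.

```python
import sqlite3

# In-memory database with a hypothetical timesheet table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timesheet (employee_id INTEGER, work_date TEXT, hours REAL)"
)


def add_entry_slow(employee_id: int, work_date: str, hours: float) -> int:
    """Flawed pattern: fetch every entry the employee has ever made, then insert."""
    history = conn.execute(
        "SELECT work_date, hours FROM timesheet WHERE employee_id = ?",
        (employee_id,),
    ).fetchall()
    conn.execute(
        "INSERT INTO timesheet VALUES (?, ?, ?)", (employee_id, work_date, hours)
    )
    return len(history)  # rows read just to add one new row


def add_entry_fast(employee_id: int, work_date: str, hours: float) -> int:
    """Corrected pattern: insert only, with no read of the history."""
    conn.execute(
        "INSERT INTO timesheet VALUES (?, ?, ?)", (employee_id, work_date, hours)
    )
    return 0


# Rows read per insert grow with every entry the employee has made.
rows_read = [add_entry_slow(1, f"2009-01-{day:02d}", 8.0) for day in range(1, 6)]
print(rows_read)
```

With five entries the extra reads are trivial. With years of entries for every employee, all arriving on a Friday, they are not.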

What does this have to do with availability, you may be wondering? Well, everything.

You see, the SAP server was up for lengthy periods of time. The application also was up for lengthy periods of time. Yet the service (timesheets, and more generally the entire SAP database) was increasingly unavailable. Timesheet entry for a week took many users 2+ elapsed hours: initiating a new entry, waiting infuriatingly long numbers of minutes for the system to respond, and then often inputting the entry later, after having switched away to something else while waiting. By no stretch of the imagination could that service be said to be available.

So how do you measure availability? Well, the act of measuring is perhaps more challenging, and has to be handled on a service-type by service-type basis. (E.g., measuring web services will be different from measuring desktop services, which will be different from measuring local server services, etc.)

The key step is defining useful, quantifiable metrics. That is, a metric such as “users should get snappy response” is vague, useless and (regrettably) all too easy to define. The real metrics are timing/accuracy based metrics. (Accuracy metrics are mainly useful for systems with analogue styled inputs.) Sticking to timing based metrics for simplicity, measuring availability comes down to having specific timings associated with events. The following are closer to being valid metrics:

  • All searches should start presenting data to the user within 3 seconds, and finish within 8 seconds.
  • Confirmation of successful data input should take place within 0.3 seconds.
  • A 20 page text-only document should complete printing within 11 seconds.
  • Scrubbing through raw digital media should occur with no more than a 0.2 second lag between mouse position and displayed frame.

(Using weight scales as an example, an analogue metric might be that the scales will be accurate to within 10 grams.)
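As a rough sketch of how timing metrics like these might be checked, the Python below compares recorded timings against agreed limits and reports the fraction of samples that met each one. The metric names, the `LIMITS` table and the sample timings are all hypothetical. Only the limits themselves (3 seconds, 8 seconds, 0.3 seconds) come from the examples above.

```python
from typing import Dict, List

# Agreed limits in seconds; the metric names are invented for illustration.
LIMITS: Dict[str, float] = {
    "search_first_result": 3.0,
    "search_complete": 8.0,
    "input_confirmation": 0.3,
}


def compliance(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """Return, per metric, the fraction of timings that met the agreed limit."""
    result: Dict[str, float] = {}
    for name, timings in samples.items():
        if not timings:
            continue  # nothing measured, so nothing to report
        limit = LIMITS[name]
        within = sum(1 for t in timings if t <= limit)
        result[name] = within / len(timings)
    return result


# Made-up timings, in seconds, as a monitoring probe might record them.
samples = {
    "search_first_result": [1.2, 2.9, 4.5, 2.0],
    "input_confirmation": [0.1, 0.2, 0.25, 0.9],
}
print(compliance(samples))
```

A figure such as “75% of searches started returning data within 3 seconds” says far more about the service than any number of days since the last reboot.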

While metrics are more challenging to quantify than boolean statistics, they allow the usability and availability of a system to be properly measured. Without accurate metrics, uptime is like digging for fool’s gold.

 

Sometimes I feel like a NetWorker old-timer. (When I don’t feel like a NetWorker old-timer, it doesn’t change the fact that I am.) These days, given the huge architectural gulf between them, I’d suggest that anyone who has been using NetWorker since v4.x or v5.x days is a NetWorker “old timer”. Since I’ve been using it with a trailing edge of v3.x days and heavily from v4.x, that puts me well into that territory.

One of the things NetWorker old timers will remember is that for a lot of its history, it was impossible to change anything to do with pools whenever a backup was running. If for instance, you had a backup going to a Monthly pool, and you wanted to configure a new group that would go to the Daily pool, you had to wait for all backups to complete, even though there was no overlap in pool interests, before you could add that group to the Daily pool.

When the restriction was first relaxed, the NetWorker GUIs would prompt with a warning when pool changes were being made during backup activities to indicate that it wasn’t recommended, but giving a proceed/OK button to force the change. These days, NetWorker is more permissive, but not always more forgiving.

A customer experienced a problem recently where he’d configured a new group, and started the group, only to realise that he’d not configured it to go to the correct pool. Rather than stopping the group and making the pool change, he hoped that he could change the pool settings and see the group start requesting media from the correct pool instead of the Default pool. Once the change was made though, the group kept on asking for media in the Default pool, so he stopped the group, waited a few minutes, and restarted.

NetWorker kept asking for media in the Default pool. The NMC pool configuration pane clearly showed that the group was now configured for the correct pool, but plain as day, the group wanted to still write to the Default pool.

Stopping and starting NetWorker didn’t seem to help either.

When he logged the case and explained about the pool-change-during-backup, I immediately thought back to how NetWorker previously wouldn’t have allowed such a change to happen, and how for an interim period it allowed the change only after issuing a warning. But what if, I thought, there’s still some locking that can happen which would cause a screw-up if the pool were changed for a group while the group was already requesting media?

So I suggested two courses of action to the customer:

  • EMC engineering’s hated solution: Stop NetWorker, clean out /nsr/tmp, restart, and see if that fixes it.
  • Stop the backup, take the group back out of the pool, restart and allow it to write to Default, then put the group back in the correct pool and run the backup again.

In this case, the customer chose the first option – cleaning out /nsr/tmp. While it wasn’t tested, I equally suspect that the second option would have worked too.

There is a lesson in this: avoid making changes to pools while NetWorker is actively trying to write data to media for them. Even though it’s technically supported, operationally it can still cause issues.

 

With NetWorker having many components that can link together in a variety of ways, it’s not always easy (particularly for newcomers) to have a mental map of how all those components interact. Having made repeated stabs over the years at coming up with a coherent diagram showing those relationships, I have a frustrated understanding of the difficulty of drawing them.

Lately I decided to take a slightly different approach – to reduce the level of the diagram to the bare basic components so as to try to give a big overview rather than every possible detail. It’s highly likely I’ve left stuff off, and my diagramming skills aren’t the best – but hopefully if you’re not sure of how everything fits together in NetWorker it may help to improve your mental map of it.

NetWorker Resource Relationships


For the most part, I’ve tried to stick to components that are defined resource types within NetWorker. A couple of notable exceptions are “Volume” and “Level” … neither of these are defined resources as per the NetWorker resource database, but knowing where they appear in usage helps to fill in a few gaps that would otherwise be confusing.

© 2012 The NetWorker Blog