Your cloud based data may be hanging by a thread and you wouldn’t even know.

Clouds: Is your data hanging by a thread?

Introduction

The recent Sidekick debacle proved one thing: it’s insufficient to “just trust” companies that are currently offering cloud based services. Instead, industry standards and regulations must be developed to permit use of the term.

I’ll be blunt: as per previous articles here, I don’t believe in “The Cloud” as a fundamental paradigm shift. I see it as a way of charging more for delivering the same thing for private clouds, and (as exemplified by Sidekick), something which may be fundamentally unreliable as a sole repository of data in the public instance.

Regardless of that, however, it’s clear that the “cloud” moniker will be around for a while, and businesses will continue to trade on providing “cloud” services (and thus being buzzword compliant). So, like it or lump it, we need to come up with some rules.

Recently SNIA has started an initiative to try to set up some standards for Cloud based activities. However, as is SNIA’s right, and their focus, this primarily looks at data management, which is less than half of the equation for public cloud services. The lion’s share of the equation for public cloud services, as proven by the Sidekick debacle, is trust.

Currently the cloud computing industry is like the wild west. Lots of people are running around promising fabulous new things that can solve any number of problems. But when those fabulous new things fail or fall over even temporarily, a lot of people can be negatively affected.

How can people trust that their cloud data is safe? Regulation is a good starting point.

If you are one of those people who, at the first hint of the word regulation, throws up their hands and says “that’s too much government intervention”, then I’d invite you to stop and think for a few minutes about the global financial crisis. If you’re one of those people who insists “industries should be self regulating”, I’d invite you to look at a certain Microsoft subsidiary called Danger that was offering a service called Sidekick. In short, self regulation doesn’t work without rigid transparency.

So, what needs to be done?

Well, there are three key factors that need to be addressed in order to achieve true and transparent trust within cloud based businesses. These are:

  • Foundation of ethical principles of operation
  • Periodic certified (mandatory) audit process
  • Reporting

Let’s look at each of these individually.

Ethical Principles of Operation

Whenever I start thinking about ethics in IT, I think of two different yet equally applicable sayings:

  • Common sense is not that common (usually incorrectly attributed to Voltaire)
  • When you assume you make an ass out of u and me. (Unknown source.)

Extending beyond the notion of “cloud”, we can say that companies should strive to understand the ethical requirements of data hosting, so as to ensure that whenever they hold data for and on behalf of another company or individual they:

  1. At all times aim to keep the data available within the stated availability times/percentages.
  2. At all times ensure the data is recoverable.
  3. At all times be prepared to hand over said data on request/on termination of services.

These should be self evident in that if the situation were reversed we would expect the same thing. Companies that offer cloud services should work such ethical goals into their mission requirements and into the goals of every individual employee. (If the company offers cloud application services as well as just data services, the same applies.)

Mandatory, Periodic, Independently Certified Auditing of Compliance

In a perfect world, ethics alone would be sufficient to garner trust. However, as we all know, we need more than ethics in order to generate trust. Trust will primarily come from mandatory periodic independently certified auditing of compliance to ethical principles of cloud data storage.

What does this mean?

So let’s look at each word in that statement to understand what a company* should have to do in order to offer “cloud” data/services:

  • mandatory – it must, in order to keep referring to itself as “cloud”
  • periodic – every 6-12 months (more likely every 12 months – 6 would be preferable in the fast moving world of the internet however)
  • independently – to be done by companies or consultants who do not have any affiliation that would cause a conflict of interest
  • certified auditing – said companies or consultants doing the auditing must have certification from SNIA for following appropriate practices
  • compliance – if found to be non-compliant, SNIA (or some other designated agency) must post a warning on their web-site within 1 month of the audit, and the company be given 3 months to rectify the issue. If after 3 months they have not, then SNIA should flag them as non-compliant. This should also result in the company taking down any reference to “cloud”.
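
The compliance timeline in that last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `compliance_status` is hypothetical, “3 months” is approximated as 90 days, and the rectification window is assumed to run from the audit date (the text above doesn’t pin down whether it starts at the audit or at the posted warning).

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "3 months" is approximated as 90 days, counted from the audit date.
RECTIFY_WINDOW = timedelta(days=90)


def compliance_status(audit_date: date, passed: bool, today: date,
                      rectified_on: Optional[date] = None) -> str:
    """Return 'compliant', 'warning' or 'non-compliant' for a provider."""
    if passed:
        return "compliant"
    deadline = audit_date + RECTIFY_WINDOW
    # A failed audit that was fixed inside the window counts as compliant.
    if rectified_on is not None and rectified_on <= deadline:
        return "compliant"
    # Still inside the window: the public warning stands.
    if today <= deadline:
        return "warning"
    # Window expired without a fix: flag the provider.
    return "non-compliant"


if __name__ == "__main__":
    audit = date(2009, 10, 1)
    print(compliance_status(audit, False, date(2009, 11, 1)))
    print(compliance_status(audit, False, date(2010, 2, 1)))
```

A provider in the “non-compliant” state is the one that would have to take down any reference to “cloud”.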

Obviously unless legally enforced, a company could choose to sidestep the entire compliancy check and just declare themselves to be cloud services regardless. Therefore there must be a “Known Compliant” list kept up to date, country-by-country, that would be advertised not only by SNIA but by actual cloud-compliant companies which partake in the process, so that end-users and businesses could reference this to determine who have exhibited certified levels of trust.

In order to achieve that certification, companies would need to be able to demonstrate to the auditor that they have:

  • Designed their systems for sufficient redundancy
  • Designed adequate backup and per-customer data recoverability options (see note below)
  • Have disaster recovery/contingency planning in place
  • Have appropriate change controls to manage updates to infrastructure or services

Note/Aside regarding adequate backup and per-customer data recoverability options. Currently this is an entirely laughable and inappropriate state. If companies wish to offer cloud based data services, and encourage users to store their data within their environment, they must also offer backup/recovery services for that data. They may choose to make this a “local-sync” style option – keeping a replica of the cloud-data in a designated local machine for the user, or, if not done this way, they must offer a minimum level of data recoverability service to their users. For example, something even as basic as “Any file stored in our service for more than 24 hours will be recoverable for 6 weeks from time of storage.” I.e., it doesn’t necessarily have to be the same level of data recovery we expect from private enterprise networks, but it must be something.
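
To show how little it takes to state such a guarantee precisely, here is a minimal sketch of the example rule above. The function name `guaranteed_recoverable` and its parameters are hypothetical, and it reads the rule as: the file must have lived in the service for more than 24 hours before it was lost, and the recovery request must arrive within 6 weeks of the original time of storage.

```python
from datetime import datetime, timedelta

MIN_RESIDENCY = timedelta(hours=24)  # file must have been stored longer than this
RETENTION = timedelta(weeks=6)       # recoverable until this long after storage


def guaranteed_recoverable(stored_at: datetime, lost_at: datetime,
                           now: datetime) -> bool:
    """True if the example guarantee covers a file lost at `lost_at`."""
    lived_long_enough = (lost_at - stored_at) > MIN_RESIDENCY
    within_window = now <= stored_at + RETENTION
    return lived_long_enough and within_window


if __name__ == "__main__":
    stored = datetime(2009, 10, 1, 9, 0)
    lost = stored + timedelta(hours=30)
    print(guaranteed_recoverable(stored, lost, stored + timedelta(days=10)))
```

A rule this simple can be published, audited and tested, which is the point: the guarantee doesn’t have to be generous, but it has to be explicit.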

It would be easy and entirely inappropriate to say instead of all this auditing that companies must simply publish all the above information. However, that represents a potential data security issue, and it also potentially gives away business-sensitive information, so I’m firmly against that idea. The only workable alternative to that however is the certified auditing process.

Reporting

Currently there is far too cavalier an approach to reporting by cloud vendors about the state of their systems. Reporting must be publicly available, fulfilling the following categories:

  1. Compliancy – companies should ensure that any statement of compliancy is up to date.
  2. Availability – companies should keep their availability percentage (e.g., “99.9% available”) publicly available in the way that many primary industries, for instance, publish their “days without an injury” statistics.
  3. Failures – companies must publish failure status reports/incident updates at minimum every half an hour, starting from the time of the incident and finishing after the incident is resolved. It’s important for cloud vendors to start to realise that their products may be used by anyone else in the world, so it’s not sufficient to just wake IT staff on an incident, management or other staff must be available to ensure that updates continue to be generated without requiring IT staff to stop working on resolution. I.e., round-the-clock services require round-the-clock reporting.
  4. Incident reports – all incidents that result in unavailability should have a report generated, which will be reviewed by the auditor on the next compliancy check.
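
Two of the items above are simple arithmetic, so a rough sketch may help make them concrete. The helper names `allowed_downtime` and `update_times` are hypothetical; the first converts a published availability percentage into the downtime it permits over a period, and the second lists when status updates would fall due if one is published at the start of an incident, every half an hour after that, and once more at resolution.

```python
from datetime import datetime, timedelta
from typing import List


def allowed_downtime(availability_pct: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Downtime permitted over `period` at the given availability percentage."""
    return period * (1 - availability_pct / 100)


def update_times(incident_start: datetime, resolved_at: datetime,
                 interval: timedelta = timedelta(minutes=30)) -> List[datetime]:
    """Times at which a status update is due, ending with one at resolution."""
    times = []
    t = incident_start
    while t < resolved_at:
        times.append(t)
        t += interval
    times.append(resolved_at)  # final update once the incident is resolved
    return times


if __name__ == "__main__":
    # 99.9% over a 365-day year leaves a little under nine hours of downtime.
    print(allowed_downtime(99.9))
    start = datetime(2009, 10, 10, 2, 0)
    for due in update_times(start, start + timedelta(minutes=70)):
        print(due)
```

Put that way, “99.9% available” still leaves room for most of a working day of outage per year, which is exactly why the figure needs to be published and kept honest.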

In conclusion

Does this sound like a lot of work? Well, yes.

It’s all too easy for those of us in IT to take a cavalier attitude towards user data – they should know how to back up, they should understand the risks, they should … well, you get the picture. Yes, there’s a certain level of education we would like to see in end users, but think of the flip-side. They’re not IT people. They don’t necessarily think like IT people. For the most part, they’ve been trained not to think about backup and data protection because it’s not something that’s been pushed home within the operating systems they’re using. (A trend that seems to be readily reversing in Mac OS X thanks to Time Machine.)

Ultimately, cloud failures can’t be palmed off with trite statements that users should have kept local copies of their data. Cloud services are being marketed and promoted as “data available anywhere” style systems, which creates an expectation of protection and availability.

So in short, while this is potentially a lot of work to set up, it’s necessary. It should be considered to be a moral imperative. In order to actually garner trust, the current wild-west approach to Clouds must be reined in and be given certified processes that enable users (or at least trusted IT advisers of users) to confidently point at a service and say: “that’s been independently checked: it’s trustworthy”.

Anything short of this would be a scandalous statement about deniability, legal weaseling out of responsibility and a “screw you” attitude towards end-user data.


* Obviously some individuals, moving forward, may in various ways choose to offer cloud access. Due to hosting and bandwidth, it’s likely in most instances that such access would be as a virtual private cloud – a cloud that’s “out there” in internet land, but is available only to select users. As such, it would fall into the realm of private clouds, which will undoubtedly have a do whatever the hell you feel like doing approach. However, in the event of individuals rather than corporates specifically offering full public-cloud style access to data, there should be a moniker for “uncertified” individual cloud offerings – available only to individuals; never to corporates.

 

Never trust anything that can think for itself if you can’t see where it keeps its brain.
J.K. Rowling, “Harry Potter and the Chamber of Secrets”

Regular readers of this blog will know that I’m a strong disbeliever in The Cloud – for some very key reasons. The reasons are distinctly different depending on whether a vendor is talking about a private cloud or an “out there in the internet” public cloud.

For private clouds, I think it’s nothing more than the emperor’s new clothes … it’s nothing more than an attempt to stick a buzzword compliant label on something already done in datacentres and charge more for it.

For public clouds, my primary concern is that it’s a variant of trusting trust. Businesses who put their data, apps and services in the hands of cloud vendors have to trust that the data will be well managed and highly available.

(Aside: Yes, I acknowledge I use Mozy. I use it for limited and personal backups only. I use it for immediate offsite backups of a few key chunks of data that I also back up via other mechanisms. I.e., if Mozy disappears tomorrow, all I’ve lost is a bit of convenience – not my data.)

In addition to the plethora of traditional Internet based companies that are ramming cloud down our throats every spare moment, lots of “traditional” IT companies are banging on about cloud computing in the most obnoxiously hyped up ways these days. EMC falls heavily into that camp. So does IBM. So does Microsoft. Indeed, it seems impossible to find a company these days that isn’t willing to jump up and down shouting “us too, us too, look at us, we do cloud! Our clouds are ever so pretty and oh so reliable!”

Thin provision this. OpEx vs CapEx that. Data replication that. Anywhere access it all. It brings a little lump of bile to the back of my throat every time another vendor jumps up and down about cloud. It’s all a load of hype.

You want thin provisioning? That’s called virtualisation – or at a pinch, blade servers – and paravirtualisation. You want OpEx vs CapEx? Charge-out for processor cycles used has been around in the mainframe world since practically the year dot (IT wise). You want replication? That’s been around for ages too. You want internet available data? Um, yeah, that’s been around for a while as well.

You want to pay an extra 50% to 100% and have a buzzword compliant “Cloud” sticker on it? Excellent! I have a bridge I want to sell you with your leftover budget.

If that all came across as me jumping up and down on top of a soap box, you’d probably be right. Sometimes it seems that the only person of senior ranks in the IT industry with the chutzpah to tell the truth about cloud is Larry Ellison. And even Larry admits that cloud has reached such a level of hype that Oracle will be forced to stick some buzzword compliant stickers on their marketing material as a result.

So what does this have to do with Sidekick? Well, everything.

Despite what some pundits would tell you as they desperately scramble to protect the “good name” of cloud from yet another tarry lining, Sidekick is cloud. Sidekick was in fact cloud at its strongest level of hubris. Data in the cloud with no ready provisioning for seamless local backup and restore. Cloud goes, data goes. It’s that simple. You couldn’t get a more buzzword compliant appearance of cloud than that.

Now I know that people will leap to the defense of cloud and say “well, it’s not the cloud’s fault, but the implementation’s fault – they didn’t understand ILP properly”, for instance. There’s a level of truth in that, but truth and trust don’t go hand in hand. You see, the end user doesn’t know that some vendors, when they talk about cloud, mean replicating, self repairing data services that are highly available. They just, thanks to all the buzz and hype generated by the industry, hear “cloud” and think “wow, that’s secure!”

This isn’t a matter of truth, it’s a matter of trust. It’s a matter of a monumental breach of trust.

You see, the biggest, most misleading claim about cloud computing is that public clouds – clouds hosted by big corporates – are hosted properly and will provide high availability. We’re only barely across the starting line of companies offering cloud based services – companies that have supposedly been doing high availability themselves for ages – and yet we’re already seeing situations, time and time again, where cloud “vendors” are letting their users down. Sidekick is the latest and perhaps worst example. However, Google Mail has had systemic failures, Apple’s MobileMe has suffered issues as well – cloud failures are all around us, just waiting to be looked at.

The cloud system is hopelessly unbalanced in favour of the supplier. Massive companies with massive budgets with lots of very very small customers. So what if the cloud goes down for a few minutes – what’s a single person going to do about it?

Well, judging by the number of search hits I’ve had in the last couple of days due to a previous article I wrote about Sidekick, I have to imagine that the term class action lawsuit is springing to mind for a lot of those small and otherwise disenfranchised users.

After the Sidekick debacle, anyone who trusts the notion of a public cloud that doesn’t offer to seamlessly and automatically keep data locally available is a fool.

With a bit of luck, one good thing may come out of the Sidekick debacle – the silver bullet/magic solution hype that has surrounded cloud for far too long may finally be pierced with some cold hard facts.

It’s time for people to wake up and smell the trust.

[Edit]

Current reports would seem to indicate that some, if not all of the Sidekick data may have been restored.

Is this cause for celebration? For the end users, yes. Does it mean that Sidekick is trustworthy? Hell no – a significant data loss event taking such a lengthy period of time to recover is not, under any circumstances, a sign of trust.

 

The net has been rife with reports of an extreme data loss event occurring at Microsoft/Danger/T-Mobile for the Sidekick service over the weekend.

As a backup professional, this doesn’t disappoint me, it doesn’t gall me – it makes me furious on behalf of the affected users that companies would continue to take such a cavalier attitude towards enterprise data protection.

This doesn’t represent just a failure to have a backup in place (which in and of itself is more than sufficient for significant condemnation), but a lack of professionalism in the processes. I.e., there should be some serious head kicking going on regarding this, most notably regarding the following sorts of questions:

  • Why wasn’t there a backup?
  • Where was the change control that should have prevented the work being done while no backup was available?
  • Why wasn’t the system able to handle the failure of a single array?
  • When will the class action lawsuits start to roll in?

I don’t buy into any nonsense that maybe the backup couldn’t be done because of the amount of data and the time required to do it. That’s just a fanciful workgroup take on what should be a straightforward enterprise level of data backup. Not only that, the system was obviously not designed for redundancy at all … I’ve got (relatively, compared to MS, T-Mobile, etc) small customers using array replication so that if a SAN fails they can at least fall back to a broken off replica. Furthermore, this raises the question: For such a service, why aren’t they running a properly isolated DR site? Restoring access to data should have been as simple as altering the paths to a snapped off replica on an alternate, non-upgraded array.
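
The kind of change-control gate being asked for here is not complicated. The following is a rough sketch only: `ArrayState`, `change_blockers` and the 24-hour backup age limit are all hypothetical, chosen to illustrate the idea that an upgrade should be refused unless a recent verified backup exists and a split-off replica is available to fall back to. It says nothing about how the actual Sidekick environment was built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ArrayState:
    """Hypothetical pre-change snapshot of a storage array's protection."""
    last_verified_backup: Optional[datetime]  # None if no backup was ever verified
    replica_available: bool                   # split-off replica on another array


def change_blockers(state: ArrayState, now: datetime,
                    max_backup_age: timedelta = timedelta(hours=24)) -> List[str]:
    """Return the reasons a change must not proceed; an empty list means go."""
    blockers = []
    if state.last_verified_backup is None:
        blockers.append("no verified backup exists")
    elif now - state.last_verified_backup > max_backup_age:
        blockers.append("last verified backup is too old")
    if not state.replica_available:
        blockers.append("no split-off replica to fall back to")
    return blockers


if __name__ == "__main__":
    now = datetime(2009, 10, 3, 1, 0)
    risky = ArrayState(last_verified_backup=None, replica_available=False)
    for reason in change_blockers(risky, now):
        print("BLOCKED:", reason)
```

If a check like this returns anything at all, the upgrade waits. That is the whole of the discipline.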

This points to an utterly untrustworthy system – at the absolute best it smacks of a system where bean counters have prohibited the use of appropriate data protection and redundancy technologies for the scope of the services being provided. At worst, it smacks of an ineptly designed system, an ineptly designed set of maintenance procedures, an inept appreciation of enterprise data protection strategies, and perhaps even a level of contempt for the data of users.

(For any vendor that would wish to crow, based on the reports, that it was a Hitachi SAN that was being upgraded by Hitachi staff and therefore it’s a Hitachi problem: pull your heads in – SANs can fail, particularly during upgrade processes where human errors can creep in, and since every vendor continues to employ humans, they’re all susceptible to such catastrophic failures.)

© 2012 The NetWorker Blog