The net has been rife with reports of an extreme data loss event occurring at Microsoft/Danger/T-Mobile for the Sidekick service over the weekend.
As a backup professional, I'm not merely disappointed or galled by this – I'm furious on behalf of the affected users that companies continue to take such a cavalier attitude towards enterprise data protection.
This doesn't represent just a failure to have a backup in place (which in and of itself is more than sufficient for condemnation), but a broader lack of professionalism in the processes. That is, there should be some serious head kicking going on, most notably around the following sorts of questions:
- Why wasn’t there a backup?
- Where was the change control that should have prevented the work from proceeding without a backup in place? (See the sketch after this list.)
- Why wasn’t the system able to handle the failure of a single array?
- When will the class action law suits start to roll in?
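On the change control point, the gate doesn't need to be elaborate. Below is a minimal sketch of the kind of pre-maintenance check that should have blocked the upgrade – the system names, the catalogue and the 24-hour freshness window are purely illustrative assumptions, not anything known about the actual Danger environment.

```python
#!/usr/bin/env python3
"""Hypothetical pre-change gate: refuse to start maintenance unless a
recent backup of every affected system exists. All names and the 24-hour
window are illustrative assumptions only."""

from datetime import datetime, timedelta

# Illustrative catalogue of last-known-good backups, keyed by system name.
# In a real shop this would be queried from the backup server's catalogue.
BACKUP_CATALOGUE = {
    "sidekick-user-db": datetime(2009, 10, 1, 2, 15),
    "sidekick-msg-store": datetime(2009, 9, 20, 2, 30),
}

MAX_BACKUP_AGE = timedelta(hours=24)


def change_approved(systems, now=None):
    """Return True only if every affected system has a backup newer than
    MAX_BACKUP_AGE; otherwise name the systems blocking the change."""
    now = now or datetime.now()
    stale = [s for s in systems
             if now - BACKUP_CATALOGUE.get(s, datetime.min) > MAX_BACKUP_AGE]
    if stale:
        print("CHANGE BLOCKED - no recent backup for:", ", ".join(stale))
        return False
    print("Change approved: all affected systems have current backups.")
    return True


if __name__ == "__main__":
    change_approved(["sidekick-user-db", "sidekick-msg-store"])
```

That's it – a few lines of process (automated or on paper) standing between "the upgrade is scheduled" and "the upgrade proceeds without a safety net".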
I don't buy into any nonsense that maybe the backup couldn't be done because of the amount of data and the time required to do it. That's just a fanciful workgroup take on what should be a straightforward enterprise level of data backup. Not only that, the system was obviously not designed for redundancy at all … I've got (relatively, compared to MS, T-Mobile, etc.) small customers using array replication so that if a SAN fails they can at least fall back to a broken-off replica. Furthermore, this raises the question: for such a service, why aren't they running a properly isolated DR site? Restoring access to data should have been as simple as altering the paths to a snapped-off replica on an alternate, non-upgraded array.
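To make the "alter the paths" point concrete, here's a minimal sketch of what host-side failover to a broken-off replica can look like, assuming a hypothetical service whose logical data paths are mapped to array devices via a config file. The device names, config path and service name are invented for illustration; the real mechanics would depend on the array, multipathing and volume manager in use.

```python
#!/usr/bin/env python3
"""Minimal sketch of failing a service over to snapped-off replicas by
rewriting its storage path mapping. Device names, the config file and the
service name are hypothetical illustrations only."""

import json
import subprocess

CONFIG = "/etc/sidekick/storage_map.json"   # hypothetical path map

# Hypothetical mapping of each primary device to its replica on the
# alternate (non-upgraded) array.
REPLICA_OF = {
    "/dev/mapper/primary_userdb": "/dev/mapper/replica_userdb",
    "/dev/mapper/primary_msgstore": "/dev/mapper/replica_msgstore",
}


def fail_over_to_replicas():
    """Point every logical path at its replica device and restart the service."""
    with open(CONFIG) as fh:
        # e.g. {"userdb": "/dev/mapper/primary_userdb", ...}
        storage_map = json.load(fh)

    for logical_name, device in storage_map.items():
        if device in REPLICA_OF:
            storage_map[logical_name] = REPLICA_OF[device]

    with open(CONFIG, "w") as fh:
        json.dump(storage_map, fh, indent=2)

    # Restart the (hypothetical) service so it re-reads the new mapping.
    subprocess.run(["service", "sidekick-datastore", "restart"], check=True)


if __name__ == "__main__":
    fail_over_to_replicas()
```

The point isn't the specific mechanism – LUN masking, multipathing or a config file – it's that with a replica sitting on a second array, recovery becomes a path change rather than a data reconstruction exercise.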
This points to an utterly untrustworthy system – at the absolute best it smacks of a system where bean counters have prohibited the use of appropriate data protection and redundancy technologies for the scope of the services being provided. At worst, it smacks of an ineptly designed system, an ineptly designed set of maintenance procedures, an inept appreciation of enterprise data protection strategies, and perhaps even a level of contempt for users' data.
(For any vendor that would wish to crow, based on the reports, that it was a Hitachi SAN being upgraded by Hitachi staff and therefore it's a Hitachi problem: pull your heads in – SANs can fail, particularly during upgrade processes where human errors can creep in, and since every vendor continues to employ humans, they're all susceptible to such catastrophic failures.)
There are actually people who argue that data shouldn’t be backed up because there’s too much of it? How can someone even say that with a straight face?
And yeah, the Hitachi SAN failed. Big deal. It could have been EMC, or Sun or any other storage solution. As you asked, why wasn’t there another?
The real tragedy here is that this completely non-trivial service was allowed to run on rubber bands and duct tape.
Some sites do have enough data that planning, configuring and budgeting for backups is far more challenging – as an example, given the volumes of data the Large Hadron Collider is meant to be able to produce when it's running, I'd want to know I had a bucket load of budget before configuring a backup environment for it.
However, in this case there's no excuse: the amount of data that needed to be backed up would have been average, if not on the small side, compared to a typical large corporate. So it stays firmly in the "no excuse" category.
Having seen two major failures in such a short period of time (Sidekick, and the IBM debacle for Air New Zealand), I've decided to start a Hall of Shame page that will cover stunning ineptitude at data protection from companies that should know better…