In “Distribute.IT reveals shared server data loss – News – iTnews Mobile Edition” (June 21, 2011), we’re told:

Distribute.IT has revealed that production data and backups for four of its shared servers were erased in a debilitating hack on its systems over a week ago.

“In assessing the situation, our greatest fears have been confirmed that not only was the production data erased during the attack, but also key backups, snapshots and other information that would allow us to reconstruct these Servers from the remaining data,” the company reported.

You may think that I’m saying the hack is wrong – and anyone conducting such a malicious attack is certainly being particularly unpleasant. But the simple truth is that such an attack should not be capable of rendering a company unable to recover its data.

It suggests multiple design failures on behalf of Distribute.IT:

  • Backups were not physically isolated; regardless of whether you can erase the current backup, or all the backups on nearline storage, there should be backup copies that are sent off-site and kept out of reach of such an attack;
  • Alternatively, if there were off-site backups that were physically isolated, they were not sufficiently secured;
  • Retention policies seem inappropriately short; why could they not recover from, say, a week ago, or two weeks ago? The loss of some data, even under a sustained hack, should be at least partly reversible if longer-term backups are available to recover from. Instead, we’re told: “we have been advised by the recovery teams that the chances for recovery beyond the data and files so far retrieved are slim”.
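The three points above can be turned into a simple audit over a backup inventory. What follows is only a minimal sketch under stated assumptions: the `BackupCopy` record, the `audit` function and the 14-day threshold are all hypothetical names and values chosen for illustration, not anything taken from Distribute.IT or from a particular backup product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BackupCopy:
    # Hypothetical inventory record; the field names are illustrative only.
    taken: date    # day the backup was taken
    offsite: bool  # True if the copy is held away from the production site
    online: bool   # True if the copy can be reached (and erased) over the network


def audit(copies, today, min_retention_days=14):
    """Return a list of design problems found in the inventory."""
    problems = []
    # Copies that an attacker with full access to production could not erase.
    isolated = [c for c in copies if c.offsite and not c.online]
    if not isolated:
        problems.append("no isolated off-site copy")
    else:
        oldest = min(c.taken for c in isolated)
        # The oldest isolated copy bounds how far back a recovery can reach.
        if today - oldest < timedelta(days=min_retention_days):
            problems.append("isolated copies do not reach back far enough")
    return problems
```

An inventory holding only online copies, or only very recent isolated ones, is flagged; an inventory with an isolated copy older than the threshold passes with an empty list. The threshold itself is a policy decision, and 14 days is just a placeholder.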

It’s also worth noting that this goes to demonstrate a worst case scenario about snapshots – they’re typically reliant on some preservation of original data (either running disks, or ensuring that the amount of data deleted/corrupted doesn’t exceed snapshot capacity).

I’m not crowing about data loss – I completely sympathise with Distribute.IT on this incident. However, it is undoubtedly the case that with an appropriately designed backup system, this level of data destruction should not have happened to them.

 

The folks over at 37 Signals published a little piece of what I would have to describe as crazy fiction, about how the combination of cloud and more technically savvy users means that we’re now seeing the end of the IT department.

I thought long and hard about writing a rebuttal here, but quite frankly, their lack of logic made me too mad to publish the article on my main blog, where I try to be a little more polite.

So, if you don’t mind a few strong words and want to read a rebuttal to 37 Signals, check out my response here.

 

I don’t like having to do this, particularly since I’m on holidays and only logged into my work email to send a message rather than to read any, but I noticed an email come in on a support case that I’ve been keenly dealing with, and wanted to check what the latest update from EMC on it was.

But on this case, I’ve been passed a response from EMC NetWorker engineering which is so boneheaded and stupid that I can’t help but have a short rant about it.

(I’ll qualify one thing here: I’m talking EMC NetWorker engineering – the back-end people, not the support people.)

In short, as of 7.6, there’s a new media database field called ‘validcopies’, which, according to the man page is:

The number of successful copies (instances or clones) of the save set, all with the same save time and save set identifier.

Now, digging a little bit further, we’ve got the release notes for 7.6, which state:

mminfo changed to allow query for valid save set copies in order to prevent data loss

There was no convenient method to query for save sets with valid clone copies on other volumes using mminfo. This made certain tasks more difficult to perform, such as determining if space could be cleared on the EDLs.

(Italicised emphasis mine, bold from the release notes.)

Now, in addition to validcopies initially being entirely FUBAR as a reporting mechanism (I’m happy with the patch I’ve been testing, and I’m hoping it will get into the first service pack for 7.6), I noted in the support case that I didn’t think it was appropriate for NetWorker to return 2 ‘validcopies’ for savesets on ADV_FILE devices. (I.e., one for the read-only volume, one for the read-write volume.) Sure, in the classic use of the ‘copies’ flag, we’re used to this, but ‘validcopies’, being something new, and being about preventing data loss, should have only reported 1 valid copy per entire disk backup unit, not 2.

Instead, EMC NetWorker engineering have adamantly said that it will report two valid copies per disk backup unit: one for the read-only device and one for the read-write device.

This is boneheaded. If the validcopies flag is all about preventing data loss, then it must be accurate as to the number of distinct, usable copies.

If engineering is so confident that a backup to ADV_FILE represents two distinct valid copies for the purposes of preventing data loss if a copy is lost, let’s see them delete a whole bunch of uncloned savesets from the read-write ADV_FILE devices on EMC’s production backups and then recover. What? You can’t do that? But you said you had two valid copies, and you only deleted one of them? Boo-hoo to you too.
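To make the counting rule I’m arguing for concrete, here is a minimal sketch. It assumes you already have the list of volume names holding an instance of a save set, however you obtained it, and it assumes that the read-only side of an ADV_FILE device is named after the read-write volume with a `.RO` suffix. The function name `distinct_copies` is mine, not NetWorker’s, and this is not how NetWorker itself computes ‘validcopies’.

```python
def distinct_copies(volumes, ro_suffix=".RO"):
    """Count copies of one save set, treating a disk backup unit as one copy.

    `volumes` is a list of volume names holding an instance of the save set.
    Assumption: the read-only volume of an ADV_FILE device is named as the
    read-write volume plus `ro_suffix`.
    """
    units = set()
    for name in volumes:
        # Fold the read-only volume back onto its read-write counterpart.
        if ro_suffix and name.endswith(ro_suffix):
            name = name[: -len(ro_suffix)]
        units.add(name)
    return len(units)
```

With this rule, a save set that lives only on `disk01` and `disk01.RO` counts as one copy, and it only reaches two once a clone exists on some other volume.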

I’ll end my grumpy rant with the following advice: don’t say or do something stupid that might allow a customer to do something stupid that might result in data loss. Haven’t you read this, after all?

 

The net has been rife with reports of an extreme data loss event occurring at Microsoft/Danger/T-Mobile for the Sidekick service over the weekend.

As a backup professional, this doesn’t disappoint me, it doesn’t gall me – it makes me furious on behalf of the affected users that companies would continue to take such a cavalier attitude towards enterprise data protection.

This doesn’t represent just a failure to have a backup in place (which in and of itself is more than sufficient for significant condemnation), but a lack of professionalism in the processes. I.e., there should be some serious head kicking going on regarding this, most notably regarding the following sorts of questions:

  • Why wasn’t there a backup?
  • Where was the change control that should have prevented the work from being done while no backup was available?
  • Why wasn’t the system able to handle the failure of a single array?
  • When will the class action lawsuits start to roll in?

I don’t buy into any nonsense that maybe the backup couldn’t be done because of the amount of data and the time required to do it. That’s just a fanciful workgroup take on what should be a straightforward enterprise level of data backup. Not only that, the system was obviously not designed for redundancy at all … I’ve got (relatively, compared to MS, T-Mobile, etc.) small customers using array replication so that if a SAN fails they can at least fall back to a broken-off replica. Furthermore, this raises the question: for such a service, why aren’t they running a properly isolated DR site? Restoring access to data should have been as simple as altering the paths to a snapped-off replica on an alternate, non-upgraded array.

This points to an utterly untrustworthy system – at the absolute best it smacks of a system where bean counters have prohibited the use of data protection and redundancy technologies appropriate to the scope of the services being provided. At worst, it smacks of an ineptly designed system, an ineptly designed set of maintenance procedures, an inept appreciation of enterprise data protection strategies, and perhaps even a level of contempt for the data of users.

(For any vendor that would wish to crow, based on the reports, that it was a Hitachi SAN that was being upgraded by Hitachi staff and therefore it’s a Hitachi problem: pull your heads in – SANs can fail, particularly during upgrade processes where human errors can creep in, and since every vendor continues to employ humans, they’re all susceptible to such catastrophic failures.)

© 2012 The NetWorker Blog