One of the stories I sometimes hear from companies is that some technology X doesn’t work in their environment because X sucks, or X is broken, or X … well, you get the picture.
Years ago, when I first got into backup, the main reasons I had to do recoveries were system or hardware failures. Hard drive reliability was IMHO much lower, operating systems were frequently less stable, etc. Reliability was about getting to 99% availability, let alone 99.9% or anything grandiose like that.
These days, hardware/OS/app failure is, I’d suggest, one of the least likely reasons for a recovery being conducted in most organisations. Instead, it’s mainly related to soft issues – user error, audits, compliance checking, etc.
There’s a point here, and I’m almost ready to make it.
Back when I first started with backup, I’d have agreed that technology could be firmly blamed for a lot of errors. These days? Rarely – even when I blame it.
I periodically go on a rant about just how painful Linux is sometimes, but at the core I also admit that it’s a lack of training and time on my part – I’ve not made learning the ins and outs of Linux firewalls a field of study in the past, so now that I’m having to construct them by hand for a personal project it’s about as fun as tasering myself in the genitals. Technology is partly the problem – as is always the case with Linux, it’s designed for programmers and developers to manipulate, not for end users, or people like me who have concentrated on other things and just want the damn thing to work.
Ahem, where was I?
The simple fact is that we often blame technology because it’s easy. It’s like kids picking on the “easy target” at school; we bully technology and blame it for all our woes and issues because, well, it doesn’t really fight back. (Hopefully we’ll get out of this habit before the singularity…)
As techos though, let’s be honest. The technology is rarely the issue. Or to be more accurate, if there’s an issue, technology is the tip of the iceberg – the visible tip. And using the iceberg analogy, you know I mean that technology is rarely going to be the majority of the issue.
The ‘issue’ iceberg in IT looks like this:
It’s probably best here that I stop and differentiate between issues and problems. A problem, to me, is an isolated or atomic failure – like a faulty tape drive, or a failed hard drive. They’re clearly technology related, but they’re not really issues. An issue is a deeper, systemic and compound failure. E.g., something like “on any one day, 30% of my backups fail”, or “Performance across all systems is generally 50% worse at end of month”, etc.
When technology gets blamed in those instances, I’m reminded of someone who, say, never has their car serviced, then complains that the car was a lemon when it eventually breaks down. Was it that the car failed the person, or more accurately that the person failed the car?
As I said, it’s easy to blame the thing that can’t defend itself.
In environments with ongoing, long-term issues, there reaches a point where you have to sit back and ponder – is the technology causing the issue, or is the environment causing the technology to have an issue?
The inevitable and hard truth is that in some cases, it’s the latter, not the former.
Let’s consider a basic scenario – the “on any given day 30% of our backups fail” scenario. So, does that mean that on any given day 30% of servers crash and reboot during the backup? Or does the backup software agent crash on 30% of servers when a backup is attempted? Maybe, in the most exceptional of circumstances, this may be the case.
In reality though? In reality we have to start looking at the rest of that iceberg:
High systemic failure rates, if attributed to the deployed technology, should result in a lawsuit. How often do we see that happening?
>cue the cicadas<
That’s right.
When there are systemic failure rates, a business must, eventually, turn to face the truth that they have to review their:
- Policies – Are there any governing rules to the company which are contributing to the problem? For instance, does the company require the technology to be adapted in such a way that it wasn’t designed for? This can be hard and real policies, or they can be implicitly allowed policies – such as empire building.
- Processes – Are there operating methods which are triggering the issue? Imagine a business for instance where change control has become such a consuming process that backup failures are repeatedly allowed to occur because a change window isn’t available. Is that the fault of the backup technology?
- People and Education – I’m not suggesting that staff at sites are incompetent. Far from it. Incompetent is such a harsh, unpleasant word that in the 15+ years I’ve been consulting, I’ve very rarely used it. Education though is a factor. No, I’m not picking on people without tertiary skills, but training is a factor. For example, managers who have no day to day technical experience may decide that some technology, based on a half hour vendor pitch, is easy enough that staff won’t need training in it. If said staff then go on to, say, accidentally delete a LUN from a production server because they weren’t trained, how is that the fault of the SAN?
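If you want to put rough numbers on that review, one approach is to tag each failed backup with the layer it actually traces back to, then count. Here’s a minimal Python sketch of that idea. To be clear, the record format, the layer names and the sample data are all made up for illustration; none of it comes from a real backup product or a real site.

```python
from collections import Counter

# Hypothetical root-cause layers: technology is the visible tip,
# the policies, processes and people sit underneath.
LAYERS = ("technology", "policy", "process", "people")


def failure_share(failures):
    """Return the fraction of failures attributed to each layer.

    `failures` is an iterable of (client_name, layer) tuples. That
    format is an assumption made for this sketch.
    """
    counts = Counter(layer for _, layer in failures)
    unknown = set(counts) - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layer(s): {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {layer: 0.0 for layer in LAYERS}
    return {layer: counts[layer] / total for layer in LAYERS}


# Invented sample: one night's failed backups after a manual review.
sample = [
    ("db01", "process"),       # no change window to fix a known fault
    ("db02", "process"),
    ("web01", "people"),       # agent misconfigured by untrained staff
    ("file01", "policy"),      # product used in a way it wasn't designed for
    ("tape01", "technology"),  # genuinely faulty drive
]

shares = failure_share(sample)
for layer in LAYERS:
    print(f"{layer:<10} {shares[layer]:.0%}")
```

The hard part is obviously the tagging, not the counting. But even a rough tally like this makes it much harder to keep insisting that the product is the whole story.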
Navel gazing, introspection, call it what you will, it’s not always a pleasant task. It’s about objectively looking at how we’re doing things, and asking, “are we partly to blame?”
Yet, if you aren’t prepared to do this, you’re doomed (yes, doomed) to keep making the same mistake again, and again, and again. The pile of failed technology builds up, the quest for the silver bullet becomes more frenetic, and the chances of a major failure happening increase. In the worst scenarios, it can become decidedly toxic.
But it doesn’t need to be. Evaluating your processes, your policies and your people (particularly the training of your people) can be – well, cathartic. And the benefits to the business, in terms of literal cost savings and efficiencies, ensure that the introspection is well worth it.
As a consultant, you might assume that it’s my job to ensure that customers buy the best and the most expensive technology out there that I can sell them. That’s a cynical attitude that comes from a few shoddy operators. As a consultant, my job is to partner with you and your company and help you achieve your best. (If you think I’m just blowing smoke up your proverbial, check my “13 traits of a great consultant” article.)
Sometimes that means highlighting that there are issues, not problems, and those issues require a deeper fix than plugging in a new piece of technology.
Excellent article!
I too consult, and I too have worked with technical teams who fervently believe that the technology they don’t like is fundamentally broken. When I was younger, I engaged in the same behaviour. And yet, elsewhere in the world, many people pay huge sums of money to the vendor in question and use this ‘broken’ technology quite effectively.
I’d expand on your point and put people at the bottom of the pyramid, because it is people who set the policies, and develop the processes. Get that wrong, and no technology can help you.