Introduction

When I entered the work force, for the first few months I trained as a MIMS consultant, but was then seconded to a system administration team on the other side of the country for 3 months (which became 6 months). Shortly after I returned from that, I joined the BHP IT Unix System Administration team in Newcastle and spent 4 years there. I built a lot of technical knowledge in that group, but what I got most out of that group was an understanding about what makes an excellent system administrator.

I was by no means an excellent system administrator when I joined that team – I was wet behind the ears, somewhat naïve, and probably too opinionated. The team that I joined taught me what makes an excellent system administrator, but in doing so also gave me an excellent foundation in some of the core requirements of being a good consultant too, and I thought it was time I shared these.

So here are what I’d call the 7 rules for system administrators, distilled from the experience of working with the best system administration team I’ve ever worked with. (While I’m at it – hello to Dave, Scott, Andrew, John, Jason and Russell.)

Knowledge Centric Approach

It wasn’t until after I stopped working with the Newcastle Unix team that I realised (to my horror) there were other ways system administration groups could run. There are two distinct approaches:

  • Knowledge-centric approach – everyone knows a little about what everyone else is doing, and while any one person will be an expert on certain things, everyone is capable of getting involved with anything.
  • Person-centric approach – each system, application or function is administered by one or two people in the group at most, and the ability of the group to maintain those systems without the individuals being around is negligible at best.

My absolute belief is that any system administration team built around a person-centric approach has it wrong. They do their users and the business a disservice.

Paranoia

Some of the people I’ve worked with have taken paranoia and security to extremes I find overboard, but paranoia should still be considered a healthy mental attitude for system administrators. Paranoia in this case means not being overly trusting – having an idea of what processes should be running, requiring empirical evidence that the system is functional, and not making dangerous assumptions.
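As a rough illustration of "having an idea of what processes should be running", here is a minimal Python sketch that compares a list of expected process names against what was actually observed. The function names and the sample process names are hypothetical, made up for this example; in practice the observed list would come from the process table of the host being checked.

```python
def missing_processes(expected, running):
    """Return the expected process names that were not observed, sorted."""
    return sorted(set(expected) - set(running))


def unexpected_processes(expected, running):
    """Return the observed process names that nobody expected, sorted."""
    return sorted(set(running) - set(expected))


# Hypothetical sample data, for illustration only.
expected = ["sshd", "cron", "backup-agent"]
running = ["sshd", "cron", "rogue-miner"]

print("missing:", missing_processes(expected, running))
print("unexpected:", unexpected_processes(expected, running))
```

The point is not the code itself but the habit: the list of what should be there is written down, and the check produces evidence rather than relying on an assumption that everything is fine.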

Testing

If you want to avoid testing, assume it doesn’t work. This is the mentality of a good system administrator. Since assuming everything doesn’t work means you have to assume that everything needs to be fixed, the alternative (having a testing regime and ensuring that changes don’t go into production without appropriate testing) seems much easier.

Documentation

Documentation is vital to good systems operation and system administration. That covers the full gamut – system build documentation, procedural documentation, change control, etc. Why? Quite simply, if your systems and processes aren’t documented, then it means that you’re slipping into a person-centric approach to a system administration team.

Being Lazy

A good sysadmin is a lazy sysadmin. Lazy system administration is about automation. If a task that you perform requires you to run three commands, taking the output of each prior command and using it as input to the next command, you should be automating it. Every time you do repetitive, mundane tasks that can be scripted, you’re wasting your own time and company time. In my experience, system administrators who religiously avoid scripting repetitive tasks lose up to an hour a day to mundane work, time that could be better spent elsewhere – self training, research, etc. (Of course, every bit of that automation needs good documentation!)
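The "three commands, each feeding the next" case can be sketched in a few lines of Python. This is only an illustration under assumptions: the three steps, the mount points and the 90% threshold are all hypothetical, standing in for whatever commands you currently chain together by hand.

```python
def over_threshold(usage, limit_pct=90):
    """Step 1: keep only the filesystems at or above the limit."""
    return {mount: pct for mount, pct in usage.items() if pct >= limit_pct}


def worst_first(usage):
    """Step 2: order the survivors from fullest to least full."""
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)


def as_report(rows):
    """Step 3: turn the ordered rows into text someone can read."""
    return "\n".join(f"{mount}: {pct}% full" for mount, pct in rows)


def run_steps(value, steps):
    """Feed the output of each step into the next, as you would by hand."""
    for step in steps:
        value = step(value)
    return value


# Hypothetical input: percentage used per mount point.
usage = {"/": 42, "/var": 95, "/backup": 91}
report = run_steps(usage, [over_threshold, worst_first, as_report])
print(report)
```

Once the chain is written down like this, it is also halfway to being documented: each step has a name and a one-line description of what it does.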

Only make a mistake once

We all make mistakes. Demanding people make no mistakes doesn’t account for how people learn. The trick of course is ensuring that we learn from our mistakes. That means that you should acknowledge that you’ll periodically make mistakes, but be ever determined to not make the same mistake twice.

Ask questions, listen to the answer

I used to say that the only stupid question is the one you don’t ask. This remains partially true, but it could equally be said that a stupid question is one that you ask without listening to the answer.

All system administrators should be prepared to ask one another questions (again, coming back to the knowledge-centric approach to system administration) – no one person in the team will have the answers to every single situation. But asking the question is only the first part of it. In fact, it’s probably only the first 30% at most.

The larger part, 70% of the effort, is taking the time to listen to the answer and making sure you understand it. In many cases that probably means asking some follow-up questions: question TLAs, question terms you don’t understand in the answer, and if the answer itself still doesn’t make sense, ask for more clarification. Sometimes you’ll have it explained to you, and sometimes you may be told that you need to do some research yourself. But don’t pretend to understand the answer when inside you’re just as confused.

In conclusion

While I’ve couched this from the perspective of rules for system administration, the techniques equally apply to just about any IT endeavor – backup administration, application administration, database administration, etc. All of these disciplines and more can follow the above principles and achieve an approach which is more satisfying – to the business, as well as from both a personal and professional perspective to the individual.

 

A recent Twitter posting by Matt over at Standalone Sysadmin reminded me of the law of least astonishment.

If you’re not familiar with this law/principle, and you work in IT (not to mention backup!), you should be. Over at Wikipedia, it’s defined thusly:

[W]hen two elements of an interface conflict, or are ambiguous, the behaviour should be that which will least surprise the human user or programmer at the time the conflict arises.

I can’t stress just how important it is that this rule is applied, both to general IT architecture, and to backups as a specific instance.

This is why, for instance, I recently covered the idea that if you can’t diagram your backup environment on the back of a napkin, it’s too complex.

The more arbitrarily complex a system is, the more chance there is of misunderstanding what it does. In data protection in particular, misunderstandings can lead to data loss. Thus, arbitrarily introducing complexity at the cost of comprehension is a very, very bad idea.

Say, for instance, you’ve got a script that arbitrarily removes all indices for backups older than 3 months. No, I don’t know why you’d have such a script, but I want to use it as an example regardless. You don’t normally run this, but in an emergency, if a fileserver does an absolutely huge backup with millions upon millions of files day after day, you may periodically find yourself needing to scrub old index data to reclaim space. (Obviously, there should be more space allocated to indices. I’m using this as an example, remember…)

You might think that for such a simple script, there’s no “law of least astonishment” to follow, but trust me, there is, and in this case, it’s all in the name.

Consider a few potential names for such a script:

  • index-maintenance
  • scrub-indices
  • clean-indices
  • purge-indices-3months-and-older

I would argue that all bar the last of the proposed script names violate the law of least astonishment. Why? The first 3 names could easily be misinterpreted by someone as doing something else. Who’s that someone? Maybe it’s a contractor who comes in when you’re unexpectedly sick for a month. Or maybe it’s a colleague who takes over when you’re away on holidays but you didn’t get a chance to train him or her before you left. Maybe it’s a new person you’re training.

Of course, backup and system administrators should review scripts before they run them, but let’s be honest: it doesn’t always happen. Some people will also automatically run a script with a “-h” option to see what it does (i.e., to get usage information), and if you haven’t programmed that in and your script just starts blowing away old indices, it’s not a good result.
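Here is one way the “-h” problem could be handled, sketched in Python. Treat it as an illustration rather than a real tool: the “--yes-really-purge” flag and the purge callable are hypothetical names I have made up, and the actual index removal is deliberately left out. Python's argparse module adds “-h”/“--help” by default, printing usage and exiting without running anything else.

```python
import argparse


def build_parser():
    """Build a parser whose name and help text say exactly what the script does."""
    parser = argparse.ArgumentParser(
        prog="purge-indices-3months-and-older",
        description="Permanently remove backup index data older than 3 months.",
    )
    # Hypothetical flag: destructive work only happens when it is asked for.
    parser.add_argument(
        "--yes-really-purge",
        action="store_true",
        help="actually delete; without this flag the script only reports",
    )
    return parser


def main(argv, purge=None):
    """Parse argv (without the program name) and describe what was done."""
    args = build_parser().parse_args(argv)  # "-h" prints usage and exits here
    if not args.yes_really_purge:
        return "dry run: nothing removed"
    if purge is not None:
        purge()  # hypothetical callable that would do the real removal
    return "purged indices older than 3 months"


print(main([]))
```

With this shape, running the script with no arguments, or with “-h” out of curiosity, never deletes anything; you would call main(sys.argv[1:]) from the script's entry point, and the destructive path has to be requested by name.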

There is little – practically no – cost to using more meaningful script names. Sure, it means that you may have to type a little more, and maybe a few more bytes here and there are used in directory storage within filesystems, but this is so trivial it’s not worth talking about.

The benefits to using better naming structures though are significantly more pronounced – scripts are named by their function, which means a significant reduction in the chances that someone new to your system will accidentally run them when they shouldn’t, or misinterpret what they do.

In backup and in NetWorker, I’d argue that the law of least astonishment should be applied at every level of the system. This means that groups, policies, pools, schedules, etc. – all the configuration resources – should be named appropriately. Another way of considering it is that if you need a comment for every single resource, your system is too complex. Some resources should be completely obvious. Of course, comments are important at times, but that doesn’t mean that every single aspect of the system should be commented.

It also means when you’re documenting the system, or talking about the system, you should use the local nomenclature. I really dislike the complexity of the terms “cumulative incremental” and “differential incremental” in NetBackup, but when I’m talking NetBackup with people, I recognise that referring to them as “differentials” and “incrementals” respectively will just muddy the discussion. So I adjust to suit their nomenclature. Failing to follow the local nomenclature for a system just introduces more confusion and makes mistakes more likely. In terms of documentation, it means clearly following the local terms. If you can’t always follow those terms, you have to establish the exceptions from the outset, and periodically remind readers of them, so that chances of confusion are minimised. Preferably, deviating from the local terms should be avoided, but when it can’t be, it must be accounted for.

Within backup and system administration, one could argue that the primary purpose of the law of least astonishment is to eliminate, or at least substantially reduce, the risk of human errors. When people are confronted with one choice that’s clearly elucidated, they’re unlikely to choose the wrong thing. When they’ve got multiple choices, and they’re all clear as mud, the chances of them making the wrong choices or doing something that leads to error just keep on ramping up with each fork.

© 2012 The NetWorker Blog