My employer, IDATA Australia, is expanding, and we’re looking for subject matter experts around EMC products, with a strong focus on backup and archiving skills. Regardless of whether you’re located in Australia or the other side of the world, we may have a position for you. The roles will cover both support and services, so you need to have great communication skills and a strong customer focus.

Specifically, we’re looking for people who are comfortable working in a combined support and services team – i.e., the role would involve deployments and training on customer sites as well as remote customer support. If you’ve got experience in one or the other rather than both, that’s not a problem so long as you’re eager to learn.

The Australasian market is a region of high growth, so working for IDATA will keep you challenged to stay in top technical form.

If you’re interested in working for IDATA Australia, contact me.

(Please note, IDATA does not accept solicitations from recruitment agencies.)

 

A while ago, I ran a post titled Ethical Obligations of Backup Administrators. Following up on that, I now want to talk about the procedural obligations implicit in the role of backup administrator.

Now, to start with, if you think that the primary procedural obligation of a backup administrator is to ensure that the backups work or run, then you need to think more about the end obligation than the start obligation. (This is a primary topic of consideration in my book.)

Before I set out the procedural obligations, I need to define recoverable. You may think this is a self-evident definition – however, if it were, a lot of problems that regularly occur in backup systems wouldn’t happen at all. Thus, by recoverable I mean the following:

  1. The item that was backed up can be retrieved from the backup media.
  2. The item that is retrieved from the backup media is usable as a replacement to the data that was backed up.
  3. The item can be retrieved within the required window.

A backup should not be deemed to be recoverable unless it meets all three of the above requirements. No ifs, no buts, no maybes. (Indeed, it’s worth noting that many “soft” recovery failures are caused by a failure to meet the third requirement – in mission-critical systems, getting the data back in time is as important as getting the data back at all.)
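As a rough illustration, the three requirements can be expressed as a single check against the result of a test recovery. This is only a sketch: the `RecoveryTest` record, its field names and the sample durations are all hypothetical, and how "usable" is actually verified will depend on the data being protected.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTest:
    """Outcome of one test recovery (all field names are illustrative)."""
    retrieved: bool             # 1. item could be read back from the backup media
    usable: bool                # 2. restored item works as a replacement
    elapsed: timedelta          # how long the recovery actually took
    required_window: timedelta  # 3. agreed maximum recovery time


def is_recoverable(test: RecoveryTest) -> bool:
    """A backup counts as recoverable only if all three requirements hold."""
    within_window = test.elapsed <= test.required_window
    return test.retrieved and test.usable and within_window


# A "soft" failure: the data came back intact, but too late.
late = RecoveryTest(True, True, timedelta(hours=9), timedelta(hours=4))
on_time = RecoveryTest(True, True, timedelta(hours=3), timedelta(hours=4))
print(is_recoverable(late))     # False
print(is_recoverable(on_time))  # True
```

The point of folding the window into the same check is that a late recovery is reported as a failure, rather than as a success with a footnote.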

Since most people work well with lists, I’ll define these procedural obligations as a list, ordered in priority starting at the highest:

  1. To ensure that all required data is recoverable. By “data” I’m not just referring to raw data, but all items, files, information, databases, systems, etc., designated as requiring recovery.
  2. To maintain a zero error policy. There is no such thing as 100% certainty, but the closest you can get to it is by maintaining a zero error policy. In essence, by maintaining a zero error policy, you become immediately aware of any issues that may compromise the above rule.
  3. To maintain documentation for the environment. No system is complete without documentation. In particular, if someone with adequate skills cannot interact with it after reading the documentation, then the system is not documented and is not a system.
  4. To maintain an issues register. This is somewhat implicit in the maintenance of a zero error policy, but it is worth remembering that not all issues in a backup system are to do with errors. An issue may be that department heads approve of, or insist on, non-standard backups, or that a system went into production without adequate testing, etc.
  5. To be across ongoing capacity management and forecasting requirements. A backup system can’t work reliably if it could halt due to capacity constraints at any random moment or after minor data growth. Thus, the backup administrator must have a finger on the pulse of the capacity of the system.
  6. To maintain reports. A backup system does not work in isolation, and thus a backup administrator must ensure that reports (both daily/operational and long term/management) are accurate and timely.
  7. To document all data that is not required for recovery. There should be no “unknowns” in a backup system. Thus, any systems or data that are designated to not require recovery (e.g., QA systems) must be documented as such, and periodically rechecked to confirm this remains the case.
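To make the capacity forecasting obligation a little more concrete, here is a minimal sketch of a straight-line projection of how many days remain before backup storage fills. The function name, the daily sampling interval and the sample figures are assumptions for illustration only; a real forecast would also need to account for retention expiry, deduplication and seasonal growth.

```python
from typing import Optional, Sequence


def days_until_full(used_tb: Sequence[float], capacity_tb: float) -> Optional[float]:
    """Estimate days until capacity is reached from daily usage samples.

    Uses the average daily growth between the first and last sample.
    Returns None when usage is flat or shrinking (no projected fill date).
    """
    if len(used_tb) < 2:
        raise ValueError("need at least two daily samples")
    growth_per_day = (used_tb[-1] - used_tb[0]) / (len(used_tb) - 1)
    if growth_per_day <= 0:
        return None
    remaining = capacity_tb - used_tb[-1]
    return max(remaining, 0.0) / growth_per_day


samples = [40.0, 41.0, 42.0, 43.0, 44.0]  # hypothetical: 1 TB/day growth
print(days_until_full(samples, capacity_tb=50.0))  # 6.0
```

Even a crude projection like this, run regularly and fed into the reports, gives early warning before capacity becomes an emergency.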

As I said from the outset, many of these obligations are implicit in the role of backup administrator. However, for organisations wanting to formalise their processes and their role descriptions, and thus achieve higher guarantees of reliability within their backup system, clearly documenting these obligations is vital.

 

It’s easy to get confused about ‘supported’. That is, when EMC (or any other vendor) publishes a guide on, say, which operating systems are supported, many will ask whether an operating system X that does not appear in the list will still work.

The terms ‘work’ and ‘supported’ are not synonymous, and should not be confused.

I’ll be the first to point out that I routinely use CentOS in my lab – a Linux distribution that is most definitely not on the supported operating systems list. It’s a repackaged Red Hat Enterprise Linux, and I can install it as many times as I want at zero cost. On the other hand, if I needed to actually buy a Red Hat Enterprise Linux license for every Linux test VM, I’d be very, very poor.

So clearly, CentOS works with NetWorker, even though it’s not supported.

Would I recommend it being used at a customer site in a full production environment? Not without rigorous caveats.

You see, backup is one of those fundamentally low-level scenarios where taking risks is just plain wrong. It’s like the difference between leading edge and bleeding edge. There’s nothing wrong with being leading edge in the backup environment; many companies depend on being leading edge so they can meet their backup and recovery windows. Bleeding edge though – going out and using untested or uncertified configurations – just asks for trouble. Indeed, the term says it all: bleeding.

There are typically two key reasons why something may ‘work’ but be ‘unsupported’. These are:

  • The vendor has not had a chance to test that particular configuration. I.e., it’s unqualified. For example, a Widgets Inc. Tape Library with LTO-5 drives and four robot heads may just not have made it to the vendor labs for qualification; so, while it may technically work, it’s never been tested.
  • The vendor is not comfortable with the supplier support for the product.

Now, in the case of a solution or a configuration option being unqualified, there’s a solution. EMC for instance will work with customers and partners to determine whether a particular configuration can be qualified – indeed, most vendors have a similar process. While everyone would undoubtedly prefer that they get all the qualification done in their labs, we must also accept that it’s practically impossible to achieve, so some level of on-site qualification must be accepted as required from time to time.

In the second instance though, things are a little more difficult. If a vendor isn’t comfortable that the supplier of a product will be able to suitably support that product at an enterprise level, then getting it qualified is unlikely at best.

In these instances, if you want to deploy unsupported components in your system, ask yourself these questions:

  1. Is there a supported option available?
  2. What are the pros and cons of the supported option vs the unsupported option?
  3. What is the risk to the business if the unsupported option has issues and the vendor refuses to support it?
  4. If the unsupported option is chosen, can a test lab be set up using the supported option so as to prove, at any point, that the use of the unsupported product does not contribute to an issue?

The last point may seem a little odd – after all, if you can afford the supported option for a lab, why wouldn’t you deploy it in production? I’ve actually seen this scenario with CentOS – a company couldn’t afford Red Hat Enterprise Linux licenses for all their production machines, so they deployed CentOS, but they did buy a Red Hat Enterprise Linux license for a lab machine. Whenever an issue occurred that required escalation to the vendor, they’d first reproduce it on the Red Hat machine. That way, when it went to the vendor, they could (rightly) claim an issue on a supported operating system.

Even so, this isn’t necessarily ideal. What was obviously not accounted for here was the potential for a high severity issue occurring. E.g., if a severity-1 fault occurred on a system, where data recovery was imperative, but recreating the configuration would take a long period of time, the risk remained that either (a) an escalation based on an unsupported operating system would be rejected or (b) the SLAs might be blown out of the water recreating the issue on a supported platform in order to get a successful escalation.
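The arithmetic behind risk (b) is simple enough to sketch. Everything here is hypothetical: the function, the hour figures and the 8-hour SLA are made up for illustration, and the sketch assumes the vendor accepts the escalation once the fault has been reproduced on the supported platform.

```python
def projected_resolution_hours(elapsed: float, vendor_fix: float,
                               reproduce: float, supported: bool) -> float:
    """Hours from fault to resolution.

    On an unsupported platform, the time taken to reproduce the fault on a
    supported platform is added before the vendor starts working on it.
    """
    extra = 0.0 if supported else reproduce
    return elapsed + extra + vendor_fix


SLA_HOURS = 8.0  # hypothetical severity-1 recovery SLA

on_supported = projected_resolution_hours(1.0, 5.0, 6.0, supported=True)
on_unsupported = projected_resolution_hours(1.0, 5.0, 6.0, supported=False)
print(on_supported <= SLA_HOURS)    # True  (6 hours)
print(on_unsupported <= SLA_HOURS)  # False (12 hours)
```

With identical faults and identical vendor effort, the reproduction step alone is enough to turn a met SLA into a missed one.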

In short – the decision to use unsupported software/hardware is not the decision of IT staff. It must be the decision of senior management. It must be signed off, and stakeholders of affected systems and processes must be aware of the potential consequences.

While unsupported does not necessarily imply doesn’t work, it’s important to remember that unsupported can most definitely mean unsupported when it stops working.

© 2012 The NetWorker Blog