I’d hazard a guess that somewhere between 50% and 75% of people who work in IT have formal job and role descriptions. For the most part, that’s going to comprise people who work for either larger companies, or companies that have put some effort into structuring the work environment in such a way as to give staff some basic level of direction and purpose outside of day-to-day activities.
For those who are at the coalface of backups – the backup administrators and system administrators – it’s very likely that some of those job/role descriptions will encompass backups, and will usually do so in such a way as to suggest that it’s a requirement to ensure backups are working, or that systems are recoverable, etc.
Those are what I’d call functional or operational requirements of a job, and are actually largely irrelevant to the topic at hand. By irrelevant, I mean of less importance, inasmuch as I’ll argue that as a backup administrator, or a system administrator responsible for backups, you have overriding ethical obligations that supersede any contractually stated obligations towards backups.
For the purposes of this article, I’ll refer from now on to the role as “backup administrator” as a means of encompassing both those who are employed in a formal role of “backup administrator”, and those who have responsibility for backups as part of their role.
As a backup administrator, regardless of your functional obligations, you have the following three ethical obligations:
- To ensure that recovered data is usable.
- To ensure that data can be recovered.
- To ensure that backups are successful.
Those obligations are in priority order – that is, your overriding concern should be that data which is recovered should be usable, then that data can be recovered, then that backups are successful*. They also complement other ethical considerations of IT staff**.
It’s worth noting that there’s a lot of meaning associated with the first obligation, that being ensuring that recovered data is “usable”. At a simple level, it means ensuring that the data which is recovered is not corrupt. However, “usable” means more than this – if it takes you 16 hours to recover data which is required in less than 4, it’s not usable; if you can recover the data, but it comes back without the requisite metadata for security, it’s not truly in a usable form***, etc.
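To make those three senses of “usable” concrete, here is a minimal Python sketch of how a test recovery might be checked against them. Everything in it is an assumption for illustration: the `RecoveryResult` fields, the use of a SHA-256 checksum recorded at backup time, and the use of file permission bits as a stand-in for security metadata are hypothetical, and are not taken from any particular backup product.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    """Outcome of one test recovery (all fields are hypothetical)."""
    data: bytes             # the recovered content
    expected_sha256: str    # checksum recorded at backup time
    elapsed_hours: float    # how long the recovery took
    required_hours: float   # how quickly the business needs the data back
    mode: int               # permission bits after recovery
    expected_mode: int      # permission bits recorded at backup time


def usability_problems(result: RecoveryResult) -> list[str]:
    """Return the reasons a recovery is not usable; an empty list means usable."""
    problems = []
    if hashlib.sha256(result.data).hexdigest() != result.expected_sha256:
        problems.append("corrupt")
    if result.elapsed_hours > result.required_hours:
        problems.append("too slow")
    if result.mode != result.expected_mode:
        problems.append("security metadata differs")
    return problems


# Intact data, but recovered in 16 hours against a 4 hour requirement,
# and with looser permissions than were recorded at backup time.
payload = b"quarterly figures"
result = RecoveryResult(
    data=payload,
    expected_sha256=hashlib.sha256(payload).hexdigest(),
    elapsed_hours=16.0,
    required_hours=4.0,
    mode=0o644,
    expected_mode=0o640,
)
print(usability_problems(result))  # ['too slow', 'security metadata differs']
```

The point of returning a list of reasons rather than a single yes/no is that a recovery can pass the corruption check and still fail to be usable in the other two senses.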
I’ll qualify the term “to ensure” as well; I don’t mean “must, at all costs”, or anything so harsh. Rather, “to ensure” in this usage refers to a combination of the following four things:
- To make sure, wherever possible, that the criterion is achieved.
- To be aware, wherever possible, of as many potential failure conditions that might make the criterion unachievable as possible.
- To try to dissuade the company from introducing failure conditions or single points of failure.
- To document and make management aware of designed or introduced failure conditions.
So, let’s consider the first ethical requirement of a backup administrator, “to ensure that recovered data is usable”. In the context of “to ensure”, we mean:
- To make sure, wherever possible, that the recovered data will be usable.
- For each system, application or database backed up, know as many potential failure conditions as possible. (This might be simple – tape failure, or it might be more complex, such as scripted dumps that don’t run until after the backup completes, etc.)
- Present rational and cogent arguments both (a) for eliminating designed failure scenarios and (b) against introducing designed failure scenarios.
- To maintain a register or otherwise alert management of designed/introduced failure scenarios. (E.g., “By scripting the database backup to occur outside the control of the backup program, the dump may be backed up before it is complete, rendering it unusable.”)
Obviously, risk vs cost will come into any design, and a risk vs cost decision may very well introduce designed failure scenarios. This is the nature of backup, and data protection – no matter how much money you spend, there are always other potential failure scenarios. Thus, it isn’t the responsibility of a backup administrator to argue against designed failure scenarios to the point of losing their job, or bankrupting the company they work for; rather, the responsibility is to point out the costs of those levels of protection that can be afforded, and to document where that protection ends and what isn’t protected against.
Scenario: in smaller companies (e.g., with fewer than 50 employees), backup administration is definitely not a single, overriding role. The role will instead usually fall to one or two individuals who either (a) exhibit a particular interest in it or (b) have the best IT skills in the company (particularly when the focus of the company is not IT). In such small companies, it’s very typical to find that backups are significantly less rigorous and complete than would be found in an enterprise environment. Examples of this might include:
- Backups may be run less frequently (e.g., weekly, instead of daily);
- Only select “key” parts of systems may be backed up (e.g., just critical data, with operating systems and application areas left for “rebuild only”);
- Limited number of operational procedures for handover;
- Limited amount of testing.
While these would not be considered acceptable in an enterprise environment, they may be considered acceptable in a smaller environment but with the following two caveats:
- The principal stakeholders (i.e., owners of the business) are aware of the limitations of the existing backup regime;
- The backup administrator still makes best endeavours with what is available.
(It should be noted that a common scenario when the ‘backup administrator’ for a small company goes on leave is that no-one can be bothered to change media because “it’s not their job”. My response is that such behaviour is at best lazy, and at worst unethical.)
The next ethical concern presented was “to ensure that data can be recovered”; so rather than just talking about recovered data being usable, we’re instead referring to the obligation to ensure that data can be recovered. In the context of what we’ve discussed previously, this means:
- To make sure, wherever possible, that data can be recovered – e.g., know where media is, know that media has been verified, etc.
- To be aware of potential failure conditions for data recovery – e.g., media or device failure during recovery, media lost, media unavailable within the required timeframe, etc.
- Arguing against situations that introduce the backup environment as a single point of failure – e.g., not duplicating/cloning backups (or failing this in smaller products, running multiple backup sets), having media stored in such a way that makes it susceptible to primary site failure, storing media unsafely (e.g., in the boot of a car), etc.
- Ensuring management are aware of potential faults – e.g., “without backup duplication any single recovery can fail due to a single piece of media failing”.
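The argument for duplication can be put in rough numbers when presenting it to management. The sketch below is only a back-of-the-envelope model under stated assumptions: every piece of media fails independently, each with the same probability, and a recovery fails if any required volume has no readable copy. The function name and the figures used (10 pieces of media, a 1% failure rate) are purely hypothetical and are not measurements of any real media.

```python
def recovery_failure_probability(media_count: int,
                                 media_failure_rate: float,
                                 copies: int = 1) -> float:
    """Chance that a recovery needing `media_count` volumes fails.

    Assumes independent failures: a volume is lost only if every one of
    its `copies` fails, and the recovery fails if any volume is lost.
    """
    per_volume_loss = media_failure_rate ** copies
    return 1 - (1 - per_volume_loss) ** media_count


# Illustrative figures only: 10 pieces of media, each with a 1% chance of failing.
single = recovery_failure_probability(10, 0.01)
duplicated = recovery_failure_probability(10, 0.01, copies=2)
print(f"no duplication: {single:.2%}, duplicated: {duplicated:.2%}")
```

With these made-up inputs the unduplicated recovery fails roughly 9.6% of the time, against roughly 0.1% with a second copy. Real media failures are rarely fully independent (copies stored together can be lost together), so this should be treated as a way of framing the conversation rather than as a prediction.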
Our final ethical concern is “to ensure that backups are successful”; this covers:
- To confirm that each backup is successful, and where backups are not successful, to have an appropriate strategy for either re-running the backup or, via a risk-vs-cost decision that has been signed off by management, deciding not to re-run it.
- To be aware of potential backup failures. Again, it’s not possible to be aware of every potential failure or have a contingency for it (e.g., “meteor crashes into the primary site and shrapnel bounces ten kilometres to take out the backup site” is likely to be a bit over the top). Instead, the goal is to at least be aware that backup failures can occur, and thus that the success of backups should not be taken for granted; as much as anything, this awareness reinforces the need to confirm that each backup is successful.
- Arguing against situations that introduce backup failures – e.g., scheduling system reboots at a time when backups “should” have been completed, allowing untrained staff to interact with the system for monitoring when the backup administrator is not available, insufficient involvement between the backup administrator and change control, etc.
- Maintain a register of single points of failure within the backup environment; if the environment has been through a design process this may be as simple as keeping details of “what was requested” vs “what was provided”; in actuality though this will be a living document that should continue to outline issues; it should also feed into or link to a test register, as the two documents will be very closely related.
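The first point in that list, confirming each backup and either re-running a failure or relying on a signed-off decision not to, can be sketched as a simple triage step. This is a minimal illustration only: the `BackupJob` type, the `triage` function, the client names and the idea of holding management sign-offs as a set of client names are all hypothetical, and how job results are actually obtained will depend on the backup product in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupJob:
    """Result of one backup job (hypothetical structure)."""
    client: str
    succeeded: bool


def triage(jobs, signed_off_skips):
    """Split failed jobs into those to re-run and those management accepted skipping."""
    rerun, accepted = [], []
    for job in jobs:
        if job.succeeded:
            continue  # nothing to do for a confirmed success
        if job.client in signed_off_skips:
            accepted.append(job.client)  # risk-vs-cost decision already signed off
        else:
            rerun.append(job.client)     # no sign-off, so the failure must be re-run
    return rerun, accepted


jobs = [
    BackupJob("db01", succeeded=False),
    BackupJob("web01", succeeded=True),
    BackupJob("test03", succeeded=False),
]
rerun, accepted = triage(jobs, signed_off_skips={"test03"})
print(rerun, accepted)  # ['db01'] ['test03']
```

The design choice worth noting is that a failure is re-run by default; not re-running is only possible where a sign-off already exists, which keeps the risk decision with management rather than with whoever happens to be checking the backups that morning.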
It can be successfully argued that everything discussed above describes operational or functional requirements of a person fulfilling the role of backup administrator. This is not in dispute; indeed, I’d agree this is the case – someone fulfilling that role needs to be doing the above, and more. However, what is not often considered is that such activities should be considered ethical obligations of the person fulfilling the role. That is, they should not be done “because it’s my job”, but “because it’s right”.
With the exception of “simple” or “easy” concepts, such as hacking and virus generation, IT as an industry is frequently reluctant to engage with ethical considerations; it’s deemed “left brain” and “logical” and thus is not the purview of such profoundly “right brain” activities as ethics and philosophy. In actual fact, nothing could be further from the truth. In the same way that medicine has become routinely concerned with the ethics of whether something that can be done should be done, IT too must actively consider these scenarios.
Backup administrators are in a position to wield considerable power – or cause considerable damage – through actions or inactions. One of the most common causes for failures, when they occur at the level of the backup administrator, is time, or a lack thereof. By understanding what we are ethically obligated to do, rather than just functionally required to do, we are in a better position to understand our primary obligations to the company we work for, to its customers, and to the broader community.
If you’d like to read more about human involvement and requirements within enterprise backup systems, you should check out my book, Enterprise Systems Backup and Recovery: A corporate insurance policy.
* It could be argued that each obligation is dependent on its subsequent obligation, making each obligation equally important, or even the last obligation the most important. Logically this may be correct, but I’d like as much as possible to keep the focus on recovery for the simple fact that it is the end result.
** For example, it can be argued that system and application/database administrators have ethical obligations to not peek at data which may be functionally accessible but would be deemed inaccessible by role, privileges or seniority. (With obvious exceptions for situations where it is both functionally required and operationally permitted.)
*** Yes, the data itself may be accessible and may even be usable for direct or transient requirements. However, if it comes back in such a way that previous security settings, such as who could access the file, are too loose or too strict, then as a total entity encompassing both metadata and data, we can say it is not usable.