A while ago, I ran a post titled Ethical Obligations of Backup Administrators. Following up on that, I now want to talk about the procedural obligations implicit in the role of backup administrator.
Now, to start with, if you think that the primary procedural obligation of a backup administrator is to ensure that the backups work or run, then you need to think more about the end obligation (recovery) than the start obligation (backup). (This is a primary topic of consideration in my book.)
Before I set out the procedural obligations, I need to define recoverable. You may think this is a self-evident definition – however, if it were, a lot of problems that regularly occur in backup systems wouldn’t happen at all. Thus, by recoverable I mean the following:
- The item that was backed up can be retrieved from the backup media.
- The item that is retrieved from the backup media is usable as a replacement for the data that was backed up.
- The item can be retrieved within the required window.
A backup should not be deemed to be recoverable unless it meets all three of the above requirements. No ifs, no buts, no maybes. (Indeed, it’s worth noting that many “soft” recovery failures are caused by a failure to meet the third requirement – in mission critical systems, getting the data back in time is as important as getting the data back at all.)
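The three requirements above can be expressed as a simple all-or-nothing check. The following is only a minimal sketch: the `RecoveryTestResult` type, its field names, and the `is_recoverable` function are hypothetical names invented for illustration, not part of any backup product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTestResult:
    """Outcome of one recovery test (hypothetical structure, for illustration)."""
    retrieved: bool             # requirement 1: item came back from the backup media
    usable: bool                # requirement 2: item works as a replacement for the original
    elapsed: timedelta          # how long the recovery actually took
    required_window: timedelta  # requirement 3: the agreed recovery window


def is_recoverable(result: RecoveryTestResult) -> bool:
    """Return True only when all three requirements are met."""
    within_window = result.elapsed <= result.required_window
    return result.retrieved and result.usable and within_window


# A recovery that succeeds but overruns its window still fails the check.
on_time = RecoveryTestResult(True, True, timedelta(hours=2), timedelta(hours=4))
too_slow = RecoveryTestResult(True, True, timedelta(hours=6), timedelta(hours=4))
```

Treating the window as a hard condition, rather than a nice-to-have, is what surfaces the “soft” failures mentioned above.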
Since most people work well with lists, I’ll define these procedural obligations as a list, ordered in priority starting at the highest:
- To ensure that all required data is recoverable. By “data” I’m not just referring to raw data, but all items, files, information, databases, systems, etc., designated as requiring recovery.
- To maintain a zero error policy. There is no such thing as 100% certainty, but maintaining a zero error policy is the closest you can get to it. In essence, by maintaining a zero error policy, you become immediately aware of any issues that may compromise the first obligation.
- To maintain documentation for the environment. No system is complete without documentation. In particular, if someone with adequate skills cannot interact with it after reading the documentation, then the system is not documented and is not a system.
- To maintain an issues register. This is somewhat implicit in the maintenance of a zero error policy, but it is worth remembering that not all issues in a backup system are to do with errors. Issues may be that department heads approve of, or insist on, non-standard backups, or that a system went into production without adequate testing, etc.
- To be across ongoing capacity management and forecasting requirements. A backup system can’t reliably work if it could halt due to capacity constraints at any random moment or after minor data growth. Thus, the backup administrator must have a finger on the pulse of the capacity of the system.
- To maintain reports. A backup system does not work in isolation, and thus a backup administrator must ensure that reports (both daily/operational and long term/management) are accurate and timely.
- To document all data that is not required for recovery. There should be no “unknowns” in a backup system. Thus, any systems or data that are designated to not require recovery (e.g., QA systems) must be documented as such, and periodically rechecked to confirm this remains the case.
As I said from the outset, many of these obligations are implicit in the role of backup administrator. However, for organisations wanting to formalise their processes and their role descriptions, thus achieving higher guarantees of reliability within their backup system, clearly documenting these obligations is vital.
I’m sorry, but I don’t understand what “To be across ongoing capacity management and forecasting requirements” means. I understand (and agree with) the explanation that follows, but I can’t parse the original sentence. What am I missing?
What I’m referring to is the need to have an understanding of both the current capacity forecasts and what sort of growth factors feed into them. This allows a better reaction to changing circumstances. E.g., if you ask the average backup admin, they’ll probably say that yearly data growth is X% – let’s say for the example, 30%. But where does that 30% come from? Is 70% of that from filesystem data growth and 20% from database growth, with the remaining 10% covering miscellaneous growth? Having an idea of the breakdown of the growth areas allows a better understanding of the impact of growth spikes – e.g., when other departments are added to the backup environment, or new projects start up that require additional storage.
It also ties in with the budget and purchase cycles, both for CapEx and OpEx for the company. I.e., you need to be across the capacity management such that you can slot in CapEx requests and make OpEx expenditure in a planned, logical way that fits into the fiscal operations of the company. Of course, there’s always the chance of exceptions, but the goal should be to keep the process as smooth and straightforward as possible.
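To make the growth breakdown concrete, here is a rough sketch of the arithmetic using the example figures above (30% yearly growth, split 70/20/10). The 100 TB starting capacity and the `growth_breakdown` function name are assumptions made up purely for illustration.

```python
def growth_breakdown(current_tb, yearly_growth_rate, shares):
    """Split the expected yearly growth (in TB) across its sources.

    `shares` maps a source name to its fraction of total growth;
    the fractions are expected to sum to 1.0.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("growth shares must sum to 1.0")
    total_growth_tb = current_tb * yearly_growth_rate
    return {source: total_growth_tb * share for source, share in shares.items()}


# Assumed starting point: 100 TB protected today, growing 30% per year.
breakdown = growth_breakdown(
    current_tb=100.0,
    yearly_growth_rate=0.30,
    shares={"filesystem": 0.70, "database": 0.20, "miscellaneous": 0.10},
)

for source, tb in breakdown.items():
    print(f"{source}: about {tb:.1f} TB of growth per year")
```

With a breakdown like this to hand, a new department bringing mostly filesystem data can be weighed against the filesystem figure rather than against the single 30% number.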
Hi,
I’ve had an intense discussion with my management today about the responsibilities of my job (being responsible for the backup service).
Do you consider a restore test to be a task performed/initiated by a backup administrator, or one initiated by an application owner?
My point of view was that restoring data alone is meaningless. That is not the goal of a restore, and as a backup admin, I can’t do more. Only the application owner is able to restore all elements of their application and ensure that the system is restorable from the backup.
What do you think?
I ask you the same question for cases where specific modules have been deployed to ensure consistency, like NMM for MOSS. Who is responsible for performing the restore/consistency check? The backup admin or the MOSS admin?
I agree with your take, and would suggest that it’s irresponsible of management or other administrators (application, database, system) not to get involved in recovery testing.
The job of a backup administrator should be to facilitate the backup and recovery process, of course, but that role should be one which ensures that the resources and configuration are correct so as to enable the sys/app/db administrators to perform their role.
As I suggest in my book, everyone in a company has a role to play in backup and recovery. While there is a certain amount of familiarity that should be expected of backup administrators with all the backup/recovery operations, the core aspects should be tested and (where possible) handled by the people with the primary responsibility in these areas.