What’s backup got to do with it?

Perhaps one of the most common mistakes that companies can make is to focus on their backup window. You might say this is akin to putting the cart before the horse. While the backup window is important, in a well designed backup system, it’s actually only of tertiary importance.

Here’s the actual order of importance in a backup environment:

  1. Recovery performance.
  2. Cloning (duplication) performance.
  3. Backup performance.

That is, the system must be designed to:

  1. First ensure that all data can be recovered within the required timeframes,
  2. Second ensure that all data that needs to be cloned is cloned within a suitable timeframe to allow off-siting,
  3. Third ensure that all data is backed up within the required backup window.

Obviously for environments with well considered backup windows (i.e., good reasons for the backup window requirements), the backup window should be met – there’s no questioning about that. However, meeting the backup window should not be done at the expense of impacting either the cloning window or the recovery window.

Here’s a case in point: block level backups of dense filesystems often allow for much smaller backup windows – however, due to the way that individual files are reconstructed (read from media, reconstruct in cache, copy back to filesystem), they do this at the expense of required recovery times. (This also goes to the heart of what I keep telling people about backup: test, test, test.)

The focus on the recovery performance in particular is the best possible way (logically, procedurally, best practices – however you want to consider it) to drive the entire backup system architecture. It shouldn’t be a case of how many TB per hour you want to backup, but rather, how many TB per hour you need to recover. Design the system to meet recovery performance requirements and backup will naturally follow*.

If your focus has up until now been the backup window, I suggest you zoom out so you can see the bigger picture.


* I’ll add that for the most part, your recovery performance requirements shouldn’t be “x TB per hour” or anything so arbitrary. Instead, they should be decided by your system maps and your SLAs, and instead should focus on business requirements – e.g., a much more valid recovery metric is “the eCommerce system must be recovered within 2 hours” (that would then refer to all dependencies that provide service to and access for the eCommerce system).

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.