Every now and then the topic arises over whether snapshots are backups.
This is going through a resurgence at the moment, as NetApp has dropped development of their VTL systems, with some indications being that they’re going to revert to recommending people use snapshots and replication for backup.
So this raises the question again – is a snapshot a backup? I’ll start by quoting from my book here:
A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications, or operating systems that can be used for the purposes of recovery.
On the face of this definition, a snapshot is indeed a backup, and I’d agree that on a per-instance basis snapshots can act as backups. However, I’d equally argue that building your entire backup and recovery system on the basis of snapshots and replication is like building a house of cards on shifting sand in the face of an oncoming storm. In short, I don’t believe that snapshots and replication alone provide:
- Sufficient long-term protection.
- Sufficient long-term management.
- Sufficient long-term performance.
I’ll be the first to argue that in a system with high SLAs, having snapshots and/or replication is going to be almost a 100% requirement. You can’t meet a 1-hour data loss deadline if you only back up once every 24 hours – and backing up every hour using conventional backup systems is rarely appropriate (and rarely even works). So I’m not dismissing snapshots at all.
It’s easy to discuss the theoretical merits of using snapshots in lieu of backup/recovery software as a total backup system, but I think the practical considerations quickly overcome any theoretical discussion. So let’s consider a situation where you want to keep your backups for 6 months. (These days that’s a fairly short period.) Do you really want to keep 6 months of snapshots around? Let’s assume we keep hourly snapshots for 2 weeks, then one snapshot per day for the rest of the time. That’s 504 snapshots per system – in fact, normally per NAS filesystem. Say you’ve got 4 NAS units and 30 filesystems on each one – that’s around 60,000 snapshots over the course of 6 months.
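If you want to sanity-check that arithmetic, here’s a quick sketch (assuming 6 months is roughly 182 days, with the retention policy described above):

```python
# Quick check of the snapshot arithmetic above.
# Assumptions: 6 months ~= 182 days; hourly snapshots kept for 2 weeks,
# then one daily snapshot for the remainder of the retention period.
HOURLY_RETENTION_DAYS = 14
TOTAL_RETENTION_DAYS = 182

hourly_snaps = HOURLY_RETENTION_DAYS * 24                  # 336 hourly snapshots
daily_snaps = TOTAL_RETENTION_DAYS - HOURLY_RETENTION_DAYS  # 168 daily snapshots
per_filesystem = hourly_snaps + daily_snaps                 # 504 per filesystem

nas_units = 4
filesystems_per_unit = 30
total_snapshots = per_filesystem * nas_units * filesystems_per_unit

print(per_filesystem)    # 504
print(total_snapshots)   # 60480
```

That 60,480 figure is what gets rounded to “around 60,000” in the text – and it grows linearly with every extra filesystem you add.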
What’s 60,000+ snapshots going to do to:
- Primary production storage performance?
- Storage and backup administrator management?
- Storage costs?
- Indexing costs?
The argument that snapshots and replication alone can replace a healthy enterprise backup system (or act in lieu of it) just doesn’t wash as far as I’m concerned. It looks good on paper to some, but on closer inspection, it’s a paper tiger. By all means within environments with heavy SLAs they’re likely to form part of the data protection solution, but they shouldn’t be the only solution.
You say: A backup is a copy of any data that….
IMHO, it should be: A backup is an INDEPENDENT copy of any data that….
If the copy is not independent of the source, and the source is hosed (or replication causes the hosing to be sent to the “backup”), then it’s all for naught. While a snapshot can act as a backup for a simple file deletion, if you get a lightning strike or flooding, then having a snapshot on the same media isn’t really a backup.
David – you raise a very good point, and an important distinction.
Any snapshot process is relatively limited in the forms of failure that it can protect against unless it’s also at least replicated. For what it’s worth, I don’t disagree with you – in fact your point even more clearly demonstrates the limited potential for snapshots as a replacement to backup and recovery – thanks for raising it!
“Relatively limited?” Snapshots on their own can protect against virtually every situation in which you might want to restore a file short of one in which the volume to which the snapshot is tied is destroyed. Replication of a volume with snapshots can be used to cover those situations.
Together they can cover accidental deletion, corruption, or a desire to revert to an older version of a changed file. They can cover complete loss of a volume due to accidental deletion or multiple RAID failures. Assuming you’re replicating all volumes, they can even cover the loss of all data on a single storage system.
If you have enough snapshot space for your rate of data change and a large enough number of snapshots available, you can store a high granularity of snapshots for a long retention period.
The only situation I can imagine that isn’t covered by snapshots plus replication is one in which some underlying bug caused WAFL corruption on the original which was then propagated to the replica, leaving you with two bad copies. Of course, if you have such a problem, odds are good that its effects were also replicated to tape in the form of backups of that bad file system. The advantage of tape backups then would be that you could presumably go back far enough to find a backup from prior to the point when the corruption was introduced. Such a circumstance would be horribly unlikely to ever occur, however.
That said, I don’t see snapshots plus replication as the sole components of a complete backup strategy as being sound. That is because there are two problems that normally need to be solved that are mutually exclusive. One is the need to have backups off site to guard against a site-based disaster (fire, flood, etc.), and the other is the need to perform rapid recovery. Achieving either on its own is simple enough, but both at once is somewhat impractical.
For instance, you can place two storage controllers in the same data center and have high speed connectivity for restoration of storage. In fact, you can simply switch the backup copy, promoting it to ‘live’. Then you can rebuild and restore the data in the background on the original storage system. But then you won’t have geographical separation of your live data and your backup. You’d require yet another copy of the data for that (such as a third storage system at a remote site, or a set of tapes sent offsite).
You could replicate solely to a remote site, but then in the event of a data loss, you’d need to restore from the remote copy to the local over some form of WAN link. For any substantial quantity of data to be restored, the restore time would be prohibitive due to the bandwidth limitations. You could of course resolve such a difficulty with a completely redundant infrastructure so that your applications could also fail to the remote site. The point is that for backup and restore alone, it is insufficient.
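To see why the WAN restore time is prohibitive, a back-of-envelope estimate helps (the 10 TB data size and 1 Gbps link speed here are hypothetical figures, not from the discussion above):

```python
# Back-of-envelope restore-time estimate over a WAN link.
# Hypothetical figures: 10 TB to restore, over a dedicated 1 Gbps link
# running at 100% efficiency (real links will do considerably worse).
data_bytes = 10 * 10**12        # 10 TB
link_bits_per_sec = 1 * 10**9   # 1 Gbps

seconds = data_bytes * 8 / link_bits_per_sec
hours = seconds / 3600

print(round(hours, 1))  # ~22.2 hours, best case
```

Nearly a full day of outage, in the best case, before protocol overhead or link contention – which is exactly why remote-only replication fails the rapid-recovery test.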
Now, if you’re willing to implement that redundant infrastructure, use aggressive network caching such as FlexCache, WAN optimization hardware, nice fat pipes between sites, and so on, then using snapshots and replication can be a fabulous solution and can finally truly be considered the be-all and end-all of your backup solution…
But even then there is one more problem. What do you do if you accidentally break the replication and need to resynchronize? Well, you can perform an initial sync and wait for days during which you are exposed should a failure occur, or you can use SnapMirror to tape to establish a baseline. It seems that getting away from tape is pretty hard to do.
There is one thing that snapshots buy you that you can’t get from a conventional backup though. Snapshots are point in time for the entire filesystem. You get that with NDMP backups as well of course, but only because an NDMP backup triggers a snapshot for the purpose of the NDMP backup.
In my opinion, the best backup strategies include the use of both snapshots and more conventional forms of backup (be they to physical tape, or virtual). Replication should then be considered where it is appropriate for uptime considerations or geographic redundancy.
Hi Bob,
I’d like to clarify the part of my initial reply you took exception to. The sentence, in its entirety, was:
“Any snapshot process is relatively limited in the forms of failure that it can protect against unless it’s also at least replicated.”
I.e., I agree there are a lot of potential failures that snapshots can protect against, but their protection guarantee plummets when there’s no replication, because they’re on the same site as the system they’re protecting, sharing the same array as the system they’re protecting and, indeed, depending on the snapshot technology, sharing the same storage as the data they’re protecting. So my point is that they may provide protection against multiple kinds of failures, but it’s a tenuous protection that’s susceptible to shared hardware failures.
I will fully agree, as I stated in my original posting and in the follow-up posting, that snapshots, particularly in larger organisations or organisations with higher SLAs, will form an important part of achieving data protection, but should not by themselves (or as the snapshots + replication combination) be the only data protection method used.
Thanks for your detailed response!
Please note the poster’s comment on adequate space to hold snapshots .. snapshots are great for expanding the NAS footprint in perpetuity ..
“If you have enough snapshot space for your rate of data change and a large enough number of snapshots available, you can store a high granularity of snapshots for a long retention period.”
search-ability, performance impact, management of recovery ..
Words to live by
YOU BACKUP IN LEISURE, YOU RECOVER IN HASTE .. Would you want to be on the receiving end of 54,000 snapshots to sort through to find the file that the CFO is looking for and needs .. like .. ahhhh .. 4 days ago ..