There’s been a lot of discussion on various storage blogs, both previously and again now, about whether a copy (e.g., a tarball or a snapshot) is a backup. There have been arguments on both sides of the fence, and I’m going to contribute equally to both sides now.

You see, a copy is a backup, and it’s not a backup.

It’s almost like Schrödinger’s Cat – it may be a backup, or it may not be a backup, and you won’t know for sure until you look more closely at it.

Early in my book I set out to define a backup, and I defined it as follows:

A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications or operating systems that can be used for the purposes of recovery.

So it would seem then that I come down fairly heavily in favour of the notion that a copy is a backup. Well, yes – and no.

In the broadest sense of the term, a random copy of data such as a tarball, an rsync, a zip file or a read-only snapshot is indeed a “backup”, as it can be used, in a single instance, for the purposes of recovery. However, so too could a binary print-out/dump of the exact state of every bit on a LUN. Few would argue, though, that such an arduous and manual re-entry process would really count as recoverable, even though in theory it is.

The reason it’s not really recoverable is that, as we’re all aware, recoveries must be completed in a timeframe that is useful to the business (or the end user) who needs the data back. Without that, we don’t really have a backup at all – just a random copy of the data.

If we look past the broad term “backup” though, and actually evaluate the term backup system, then I would suggest that a single “backup”, unless it’s an instantiation of protection from the backup system, is not a backup at all, but instead is just a random (or pseudo-random) copy.

To me this boils down to the need to work with the notion of Information Lifecycle Protection. As you may recall, in a previous blog entry I suggested that there’s a need to break off data protection activities from ILM and define a new process that revolves around keeping data available in order to be managed by ILM. It may seem a small distinction, but it’s one which helps in these sorts of discussions. At the time I suggested that conceptually, ILP may be represented as follows:

[Figure: Components of ILP]

Under this definition, we can cease to worry about whether a copy is a backup, because clearly, a copy will be part of an overall ILP strategy. It’s still data protection, but it doesn’t have to be backup in order to be data protection.

My personal opinion is that a single, isolated copy is technically a backup, but is logically not a backup. “Technically is” because it can be used to restore data. “Logically not” because it’s not in itself a guarantee of a correctly designed backup system. I.e., unless we can say that the copy came from the backup system, we can’t be guaranteed it’s a backup.

One last quote from my book – this time from the back page:

A well-designed backup system comes about only when several key factors coalesce: business involvement, IT acceptance, best practice designs, enterprise software and reliable hardware.

So the answer, I guess, to “is a copy a backup?” is another question: “did the copy come from a backup system?” If the answer to that question is yes, then the answer to the original question is the same. If the answer is no, we can’t reliably answer “yes” to the original question.

 

While it turned out to be unrelated, a recent customer question made me think back to the impact of client side compression on the reported saveset size, and for the life of me I couldn’t remember how client side compression affected saveset size reporting.

Of course, it’s relatively simple to test. So I created a 1GB file on my backup server using:

# dd if=/dev/zero bs=1024k count=1024 of=/root/test.dat

Next, to test, I configured a client entry with a saveset of just ‘/root/test.dat’, and set the backup running without any client side compression. The savegroup completion email showed the sort of size you’d expect:

--- Successful Save Sets ---

* tara.pmdg.lab:Probe savefs tara.pmdg.lab: succeeded.
 tara.pmdg.lab: /root/test.dat     level=full,   1048 MB 00:00:13      3 files
 tara.pmdg.lab: index:tara.pmdg.lab level=full,     3 KB 00:00:00      4 files
 tara.pmdg.lab: bootstrap          level=full,     91 KB 00:00:01    177 files

The next step was to enable client side compression. Being lazy and not wanting to launch NMC, I created /root/.nsr with the following content:

<< . >>
compressasm: test.dat
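
For anyone scripting the same test, the directive file can be written without opening an editor. This is a minimal sketch only: `write_nsr_directive` is a hypothetical helper name (not a NetWorker command), and it writes into a scratch directory rather than /root so it can be tried safely.

```shell
#!/bin/sh
# Hypothetical helper, not part of NetWorker: writes a .nsr directive file
# into the given directory, applying compressasm to the given file pattern.
write_nsr_directive() {
    dir=$1
    pattern=$2
    # Same two lines as the directive shown above.
    printf '<< . >>\ncompressasm: %s\n' "$pattern" > "$dir/.nsr"
}

# Try it in a scratch directory (an assumption for safety; the post used /root).
scratch=$(mktemp -d)
write_nsr_directive "$scratch" test.dat
cat "$scratch/.nsr"
```

Pointing the helper at the directory that holds the file to be backed up would reproduce the setup used here.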

With the backup re-run, I got conclusive evidence that the saveset size reported is the amount of data written to media (or transferred from the client), not the size of the data itself:

--- Successful Save Sets ---

* tara.pmdg.lab:Probe savefs tara.pmdg.lab: succeeded.
* tara.pmdg.lab:/root/test.dat 66135:save: NSR directive file (/root/.nsr) parsed
* tara.pmdg.lab:/root/test.dat 66135:save: NSR directive file (/root/.nsr) parsed
 tara.pmdg.lab: /root/test.dat     level=full,    124 MB 00:00:07      3 files
 tara.pmdg.lab: index:tara.pmdg.lab level=full,     5 KB 00:00:00      5 files
 tara.pmdg.lab: bootstrap          level=full,    102 KB 00:00:01    186 files
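
The gap between the data a client reads and the data it sends on can be seen outside NetWorker too. The sketch below is an illustration under assumptions: gzip stands in for compressasm (which uses its own algorithm, so the numbers will not match the 124 MB above), and a 16 MB sample replaces the 1 GB file to keep the run short.

```shell
#!/bin/sh
# Illustration only: gzip is a stand-in for NetWorker's compressasm.
# A 16 MB run of zeros is used instead of the 1 GB test file.

# Bytes the "client" reads from the source.
read_bytes=$(dd if=/dev/zero bs=1024k count=16 2>/dev/null | wc -c | tr -d ' ')

# Bytes left after compression, i.e. what would be sent on and written.
sent_bytes=$(dd if=/dev/zero bs=1024k count=16 2>/dev/null | gzip -c | wc -c | tr -d ' ')

echo "read by client:    $read_bytes bytes"
echo "after compression: $sent_bytes bytes"
```

Zero-filled data is about as compressible as data gets, so gzip will shrink it far more than the result above; the point is only that the two numbers differ, and a size reported after compression says little about the size of the source.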

So the next question is – is this a good thing?

The answer is a little fluid. The correct answer, I think, is that both sizes should be recorded. Clearly, for the purposes of backwards compatibility, current sizing values need to continue to report the data written to media. However, logically, there is significant merit in adding another field to the database (e.g., “clsize”) that would report the amount of data the client reads for the backup. This would save a lot of hassle. (The “totalsize” field is not used for this, by the way.)

In the meantime, we just have to keep in mind that the size reported by mminfo, the savegroup completion, etc., is the size written to media or, if you will, the size transferred from the client to the storage node.
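
To put a number on the difference in the test above, the two reported sizes can be compared directly. This is a small sketch; `size_summary` is a hypothetical helper name, and the figures are copied from the two savegroup completion reports.

```shell
#!/bin/sh
# Sizes copied from the two savegroup completion reports, in MB.
uncompressed_mb=1048
compressed_mb=124

# Hypothetical helper (not a NetWorker command): prints the compressed size
# as a percentage of the uncompressed size, plus the compression ratio.
size_summary() {
    awk -v u="$1" -v c="$2" 'BEGIN {
        printf "reported: %.1f%% of original, ratio %.1f:1\n", 100 * c / u, u / c
    }'
}

size_summary "$uncompressed_mb" "$compressed_mb"
```

With compression enabled, the reported saveset size for this file is roughly an eighth of what the client actually read, which is the kind of gap to allow for when sizing from mminfo output.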

© 2012 The NetWorker Blog