Sub-saveset checkpointing would be good

Generally speaking I don’t have a lot of time for NetBackup, primarily due to the lack of dependency checking. That’s right, a backup product that doesn’t ensure that fulls are kept for as long as necessary to guarantee recoverability of dependent incrementals isn’t something I enjoy using.

That being said, there are some nifty ideas within NetBackup that I’d like to see eventually make their way into NetWorker.

One of those nifty ideas is the notion of image checkpointing. To use the NetWorker vernacular, this would be sub-saveset checkpointing. The notion of checkpointing is to allow a saveset to be restarted from a point as close to the failure as possible rather than from the start. E.g., your backup may be 20GB into a 30GB filesystem and a failure occurs. With image checkpointing turned on in NetBackup, the backup won’t need to re-run the entire 20GB previously done, but will pick up from the last point in the backup that a checkpoint was taken.
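The 20GB-into-30GB example above can be made concrete with a small sketch showing how much data must be re-sent after a failure, with and without checkpoints. The numbers come from the post; the checkpoint interval is purely an assumption for illustration, and none of this is NetBackup or NetWorker code:

```python
# How much data must be backed up again after a failure,
# with and without checkpointing. Checkpoint interval is a
# hypothetical illustration, not a real product setting.

def data_to_resend(total_gb, failed_at_gb, checkpoint_every_gb=None):
    """GB that must be re-sent after a failure at failed_at_gb."""
    if checkpoint_every_gb is None:
        return total_gb                        # no checkpoints: start from scratch
    # Resume from the last checkpoint taken before the failure.
    last_checkpoint = (failed_at_gb // checkpoint_every_gb) * checkpoint_every_gb
    return total_gb - last_checkpoint

print(data_to_resend(30, 20))                         # → 30 (whole saveset again)
print(data_to_resend(30, 20, checkpoint_every_gb=5))  # → 10 (resume at 20GB mark)
```

With no checkpointing, the 20GB already written is wasted; with a checkpoint near the failure point, only the remainder needs to run.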

I’m not saying this would be easy to implement in NetWorker. Indeed, if I were to be throwing a bunch of ideas into a group of “Trivial”, “Easy”, “Hmmm”, “Hard” and “Insanely Difficult” baskets, I’d hazard a guess that the modifications required for sub-saveset checkpointing would fall at least into the “Hard” basket.

To paraphrase a great politician though, sometimes you need to choose to do things not because they’re easy, but because they’re hard.

So, first – why is sub-saveset checkpointing important? Well, as data sizes increase and filesystems continue to grow, having to restart an entire saveset because of a failure “somewhere” within the stream is increasingly inefficient. For the most part we work through these issues, but ever larger and more complex filesystems make it harder and harder to hit backup windows when failures occur.

Secondly – how might sub-saveset checkpointing be done? Well, NetWorker is already capable of doing this – sort of – via chunking, or fragmented savesets. Long-term NetWorker users will be well aware of this: savesets once had a maximum size of 2GB, so if you were backing up a 7GB filesystem called “/usr”, you’d get:

/usr
<1>/usr
<2>/usr
<3>/usr

In the above, “/usr” was considered the “parent” of “<1>/usr”, “<1>/usr” was the parent of “<2>/usr”, and so on. (Parent? man mminfo – read about pssid.)
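That parent/child linkage is easy to model. Here's a hypothetical Python sketch of the chain – the ssid/pssid fields mirror the attributes mminfo reports, but the numbers and the walking logic are mine, not NetWorker's:

```python
# Hypothetical model of NetWorker's chunked-saveset chain. Each chunk
# records its parent saveset id (pssid); a pssid of 0 marks the head
# of the chain. The ssid values below are made up for illustration.

from dataclasses import dataclass

@dataclass
class Chunk:
    ssid: int       # saveset id of this chunk
    pssid: int      # saveset id of the parent chunk (0 = head)
    name: str       # e.g. "/usr" or "<1>/usr"
    size_gb: int

def chain_in_order(chunks):
    """Walk from the head chunk (pssid == 0) down to the last child."""
    by_parent = {c.pssid: c for c in chunks}
    ordered, pssid = [], 0
    while pssid in by_parent:
        c = by_parent[pssid]
        ordered.append(c)
        pssid = c.ssid          # next chunk's parent is this chunk
    return ordered

chunks = [
    Chunk(1003, 1002, "<2>/usr", 2),
    Chunk(1001, 0,    "/usr",    2),
    Chunk(1004, 1003, "<3>/usr", 1),
    Chunk(1002, 1001, "<1>/usr", 2),
]

ordered = chain_in_order(chunks)
print([c.name for c in ordered])        # → ['/usr', '<1>/usr', '<2>/usr', '<3>/usr']
print(sum(c.size_gb for c in ordered))  # → 7 (total filesystem size)
```

Calculating the total saveset size means summing across the whole chain – which is exactly the parsing pain the old model imposed.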

Now, I’m not suggesting a wholehearted return to this model – it’s a pain in the proverbial to parse and calculate saveset sizes, and I’m sure it has other inconveniences. However, it does offer an entry point into the model we’re looking for – if a backup needed to restart from a checkpoint, it could continue via a chunked/fragmented saveset.

The difficulty lies in differentiating between the “broken” tail of the parent saveset chunk and the “correct” start of the child saveset chunk, which would likely require extending at least the media database. However, I think it’s achievable: given that the media database already contains details about segments within savesets (i.e., file/record markers, etc.), it should in theory be possible to include a “bad” flag, so that a run of data at the end of a saveset chunk can be declared bad, telling NetWorker to move on to the next child chunk.
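To make the proposal concrete, here's a hypothetical sketch of how a recovery might use such a “bad” flag. It assumes a media database that tracks segments per chunk (which NetWorker's does); the flag, the record layout, and the recovery walk are all my invention, not anything NetWorker implements:

```python
# Hypothetical sketch of the proposed "bad tail" flag. A failed chunk's
# final segment (data written after the last checkpoint) is flagged bad;
# recovery reads every good segment, then moves to the child chunk that
# restarted from the checkpoint. Not real NetWorker code.

from dataclasses import dataclass, field

@dataclass
class Segment:
    offset_gb: int
    length_gb: int
    bad: bool = False     # proposed flag: data past the last checkpoint

@dataclass
class ChunkRecord:
    name: str
    segments: list = field(default_factory=list)

def recovery_plan(chunks):
    """Yield (chunk name, offset, length) for every segment safe to read."""
    for chunk in chunks:
        for seg in chunk.segments:
            if seg.bad:
                break     # skip the broken tail; continue in the next chunk
            yield chunk.name, seg.offset_gb, seg.length_gb

# A 30GB backup that failed at 20GB, with the last checkpoint at 18GB.
parent = ChunkRecord("/data", [Segment(0, 9), Segment(9, 9),
                               Segment(18, 2, bad=True)])  # 18-20GB suspect
child = ChunkRecord("<1>/data", [Segment(18, 12)])         # restarted at 18GB

plan = list(recovery_plan([parent, child]))
print(sum(length for _, _, length in plan))   # → 30 (fully recoverable)
```

The key point is that recovery never has to guess where the parent chunk went wrong: the media database tells it exactly which trailing segment to ignore and where the child chunk picks up.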

It’s fair to say that most people would be happy to go through a media database upgrade (i.e., a structural change as part of moving to a new version of NetWorker) in order to get sub-saveset checkpointing.
