A very traditional approach to configuring automated backups in NetWorker is to make use of the schedule override feature in NetWorker groups. That is, by defining either a schedule or a level at the group level, the backup level for all clients in the group will be in lock-step. Pictorially, this configuration resembles the following:

[Figure: Client levels/schedules in lock-step]

We frequently encourage this sort of setup because it takes two items which NetWorker can run disparately – start time and level – and effectively merges the two, something a lot of other backup products just do as the one configuration item. Perhaps even more importantly, in small to mid-size businesses with modest data levels, this makes more sense anyway – it allows you to readily construct “classic” backup scenarios, such as “full on Friday, incrementals the rest of the week”. So from the perspective of level and amount of data backed up, your backup week would look similar to the following:

[Figure: Schedule for full backups once a week, incrementals rest of time, lock-step]

Now, as I said, this works well for businesses with modest data sizes. However, as the image graphically demonstrates, this creates scenarios where there is a significant disparity between the amount of data backed up on regular days and the amount of data backed up for the fulls. Remembering that it’s the full backups that frequently end up straining backup architectures, companies will often end up revisiting their architecture when the amount of data backed up on the “full day” becomes unmanageable.

For some companies, the full day is chosen for sound business reasons – finance companies for instance may have to do weekly full backups starting close of business Friday, and full monthly backups on the last Friday every month. In these scenarios, where there are important business reasons for keeping full backups on a single day of the week/month, the backup architecture must remain constantly configured to handle the massive spike that full backups create.

However, in other companies where there are no strong business reasons for running all the fulls on the same day, it’s worth remembering that there is an alternate configuration – ironically enough, it’s very much the “default” NetWorker configuration, it’s just one most sites tend not to use. This configuration sees the group control only the start time/collection of clients, and does not have a schedule/level override assigned. Instead, the schedule of each client defines what level backup will be done for that client. This sort of configuration resembles the following:

[Figure: Groups with schedules defined at the client level]

As you can imagine, this does require a slight change of administrative policies in relation to setting the correct schedule at the client level, and potentially needing additional client instances to handle the daily and monthly backups. The advantage is that you can then start having groups where both incremental and non-incremental backups are done concurrently, spreading out the load of the full backups to create a significantly lower spike in resource requirements. So from the perspective of level and amount of data backed up, your backup week would instead look like the following:

[Figure: Spreading full backups out over a week]

This style of schedule isn’t for everyone – as I said, if you have a strong business need to restrict all full backups to a particular day, it’s very unlikely to work. I’d suggest as well that it may not be a good strategy if you happen to have a high staff turnover, as it does realistically add a little more complexity into the environment. (While your environment should be as simple as possible, that doesn’t always mean “as simple as conceivable”.)

In larger environments though with significantly higher amounts of data requiring backup, this style of configuration can be a real boon. Compare weekly fulls of say, 10TB (effectively tiny) with weekly fulls of say, 500TB, and you can instantly see the attraction of this programme. Instead of having to design a system capable of handling 500TB in 24 hours, you might instead be able to limit your design to a system that at most has to handle 100TB over a 24 hour period (factoring in incrementals + fulls on any given night). That’s not an insignificant difference.
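To make that comparison concrete, here is a rough sketch of the nightly load under each style of schedule. The 7% nightly change rate is purely an assumption for illustration, and the function names are hypothetical; plug in your own figures. With these numbers the staggered peak lands close to the 100TB figure above.

```python
# Illustrative only: total size and change rate are assumptions, not measurements.

def lockstep_loads(total_tb, change_rate, nights=7):
    """Nightly load when every client runs its full on the same night."""
    incremental = total_tb * change_rate
    return [total_tb] + [incremental] * (nights - 1)


def staggered_loads(total_tb, change_rate, nights=7):
    """Nightly load when fulls are spread evenly across all nights."""
    full_share = total_tb / nights
    # Clients not running a full that night still run an incremental.
    incremental = (total_tb - full_share) * change_rate
    return [full_share + incremental] * nights


if __name__ == "__main__":
    for name, loads in (("lock-step", lockstep_loads(500, 0.07)),
                        ("staggered", staggered_loads(500, 0.07))):
        print(f"{name:>9}: peak {max(loads):6.1f} TB, "
              f"quietest {min(loads):6.1f} TB")
```

The even split is the idealised case; real clients are rarely the same size, so the actual peak will sit somewhere between the two.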

[Edit, 2010-05-11]

What’s this got to do with large groups? It occurred to me overnight that while the title of the post was originally “Large group backups”, I diverged somewhat between the original intent of the post and the actual resulting post.

So, the other area where this can be useful is in situations where you have groups with large numbers of clients. For example, in environments with 500+ clients, where a single group may have hundreds of clients in it, switching to mixed levels in the one group has the same effect as for an entire large environment, but within a single, localised group.
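One way to decide which client gets its full on which day is a simple greedy balance: take the largest client first and place it on whichever day currently has the least full data assigned. The sketch below is not a NetWorker feature; the client names, sizes and function names are hypothetical, and the resulting plan would still need to be turned into a schedule on each client by hand.

```python
import heapq

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def assign_full_days(client_sizes_gb, days=DAYS):
    """Greedy balance: largest client first, onto the currently lightest day."""
    heap = [(0, i) for i in range(len(days))]  # (GB assigned so far, day index)
    heapq.heapify(heap)
    plan = {}
    ordered = sorted(client_sizes_gb.items(), key=lambda kv: (-kv[1], kv[0]))
    for client, size in ordered:
        load, idx = heapq.heappop(heap)
        plan[client] = days[idx]
        heapq.heappush(heap, (load + size, idx))
    return plan


def daily_totals(plan, client_sizes_gb, days=DAYS):
    """Sum the full-backup data that lands on each day under a given plan."""
    totals = {day: 0 for day in days}
    for client, day in plan.items():
        totals[day] += client_sizes_gb[client]
    return totals


if __name__ == "__main__":
    # Hypothetical clients and sizes, purely for illustration.
    sizes = {"mail01": 900, "db01": 1400, "file01": 2200, "web01": 150,
             "web02": 160, "app01": 600, "app02": 640, "dev01": 300}
    plan = assign_full_days(sizes)
    for day, total in daily_totals(plan, sizes).items():
        print(day, total, "GB")
```

A greedy split will not always be perfectly even, but for a group with hundreds of clients it gets close enough to flatten the spike.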

 

Having recently encountered a situation where a NetWorker client on a customer site repeatedly failed its full backup, I wanted to take a few moments to stress the absolute importance – no, extreme criticality – of always being on top of your full backups.

Specifically:

  • You should always know whether your full backups have succeeded or not for each and every client of your backup system.
  • Unless there are specific management directives to the contrary, you should always re-run full backups in the event of failure as soon as possible.
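As a rough sketch of what “knowing” looks like in practice, the following takes a list of backup results and reports every client whose most recent full attempt did not succeed, or which has no full on record at all. The `BackupRun` record and its fields are hypothetical; how you extract those results from your backup server is up to you.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class BackupRun:
    client: str
    started: datetime
    level: str        # "full" or "incr"
    succeeded: bool


def clients_needing_full_rerun(runs):
    """Clients whose latest full attempt failed, or that have no full at all."""
    latest_full = {}
    clients = set()
    for run in runs:
        clients.add(run.client)
        if run.level != "full":
            continue
        seen = latest_full.get(run.client)
        if seen is None or run.started >= seen.started:
            latest_full[run.client] = run
    return sorted(
        c for c in clients
        if c not in latest_full or not latest_full[c].succeeded
    )
```

Note that a later successful incremental does not clear the flag; only a successful full does.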

To put it another way – a set of backups without a full, when it comes to performing a complete filesystem or system recovery, is about as useful as a chocolate teapot. Perhaps even less so.

I’ve described previously the importance of having a zero error policy, and always knowing if failures occur. So this topic could be summarised as being a subset of the zero error policy. However, if I were to be asked what backup I could “afford to lose” in terms of complete system recoverability, I’d pick an incremental any day over a full. (It’s actually a fine line, but it’s still an important differentiation.)

Without a full backup, at best you can pull back bits and pieces of a filesystem. Sure, they might be the most recently modified bits, which in themselves are important, but they’re not the entire filesystem. For most organisations, they barely touch the surface of the filesystem. Incrementals (and for that matter, differentials) are like the proverbial tip of the iceberg – perhaps without the penguins though*. The real monstrosity in a backup environment – the rest of the iceberg – are the fulls.

Let’s consider it this way – in most environments (discounting say, backups of database dump regions) you’ll find that an incremental backup covers somewhere between 5% and 10% of the filesystem. Not only that, the delta change on a day to day basis will also be quite small. That is, in many situations the files that are backed up each day in incremental backup regimes are the same files, modified day after day for working purposes. So while you may have incrementals of even up to 10% per day of your fulls, in turn 90% or more of those files may be the same files each day that are getting backed up in incrementals.

If we look at a 200GB filesystem though, even 10% of that filesystem is just 20GB. So if your full is somehow lost, that’s 180GB that you can’t readily recover. Additionally, the 20GB or so that you can recover is going to be a pig’s breakfast as far as getting it back in any consistent state.
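The same arithmetic can be stretched over several nights of incrementals. This sketch assumes, as a simplification, that each night after the first contributes only the non-repeating share of files as genuinely new data; the function name and that model are mine, not anything NetWorker reports.

```python
def recoverable_without_full(fs_gb, incr_fraction, repeat_fraction, nights):
    """Estimate unique data (GB) held across `nights` incrementals, no full."""
    first_night = fs_gb * incr_fraction
    # Only files that were not already captured add anything new.
    new_per_night = first_night * (1 - repeat_fraction)
    unique = first_night + new_per_night * (nights - 1)
    return min(unique, fs_gb)


if __name__ == "__main__":
    one_night = recoverable_without_full(200, 0.10, 0.90, nights=1)
    six_nights = recoverable_without_full(200, 0.10, 0.90, nights=6)
    print(f"1 night:  {one_night:.0f} GB recoverable of 200 GB")
    print(f"6 nights: {six_nights:.0f} GB recoverable of 200 GB")
```

Even with a week of incrementals to draw on, the bulk of the filesystem is simply not there.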

NetWorker, through its use of saveset dependency chains, will do its utmost to protect you from regular saveset failures. If a full filesystem backup fails, subsequent incrementals will be chained onto the previous dependency set, retaining the previous full backup for a longer period of time.
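A much simplified model of that behaviour is sketched below. It is not NetWorker’s actual implementation, and the `SaveSet` class and function names are hypothetical; it only shows why a failed full leaves later incrementals leaning on the older full.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SaveSet:
    name: str
    level: str                      # "full" or "incr"
    succeeded: bool = True
    depends_on: Optional["SaveSet"] = None


def build_chain(backups):
    """Link each successful incremental to the last successful backup before it."""
    previous = None
    chain = []
    for name, level, ok in backups:
        if not ok:
            chain.append(SaveSet(name, level, succeeded=False))
            continue  # a failed backup never becomes a dependency
        parent = None if level == "full" else previous
        saveset = SaveSet(name, level, depends_on=parent)
        chain.append(saveset)
        previous = saveset
    return chain


def required_for_recovery(saveset):
    """Walk back to the full; everything on this path must still be readable."""
    needed = []
    while saveset is not None:
        needed.append(saveset.name)
        saveset = saveset.depends_on
    return list(reversed(needed))


if __name__ == "__main__":
    chain = build_chain([
        ("full-week1", "full", True),
        ("incr-1", "incr", True),
        ("full-week2", "full", False),   # the failed full
        ("incr-2", "incr", True),
    ])
    print(required_for_recovery(chain[-1]))
```

In this model the week 1 full stays in play until a later full succeeds, which is exactly why the chain keeps growing.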

It’s important we don’t let those dependency chains just keep building and building. They need to be broken and restarted so that we don’t get into messy situations or use up too much media. That’s why you should have a policy to rerun a full backup as soon as possible if it fails, rather than just waiting for the next one. (Further, I’ve far too often seen that sites with a “just wait until the next full backup runs” policy continually miss full backup failures, often for months at a time, because that sort of attitude also seems to be accompanied by informal record keeping.)

The next thing to consider is that we mustn’t just arbitrarily break dependency chains ourselves. By this, I’m referring to manually recycling media without regard to what may depend on that media, just because we need to free up volumes or have policies that media should be recycled after a certain length of time.

More than anything else, I see this as the reason companies find themselves in situations where NetWorker returns an “Unknown” volume being required for recovery. In this situation, NetWorker knows there should be a full backup, but it doesn’t have access to it, and therefore it can’t do anything to get the complete filesystem (or other type of data) recovered. Or, if there’s going to be a significant recovery error

Your full backups are like gold. No, gold isn’t special enough. Platinum, maybe. Or some combination of gold, platinum and saffron. They’re not to be cavalierly deleted, they’re not to be ignored, and they’re not to be left unchecked. (They’re not to be uncloned, either.)

In actual fact, it really doesn’t matter what your backup product is. What always matters is that your full backups are done, they’re done as soon as possible around the scheduled time, they’re successful, they’re known to be successful, and they’re successfully cloned. If any of those factors aren’t in play, you’ve got to get it fixed straight away.


* Unless they’re incrementals from a Linux system, of course.

 

Bigger, faster, better. That’s one of the common catch-phrases of backup architecture. Backup window too small? Buy faster, and more tape drives! Or buy a VTL! Or buy backup to disk!

I’m the first to admit that sometimes the only way to solve backup window issues is to invest in additional infrastructure.

However, one thing that is frequently not considered is this: do you really need to run all your full backups on a weekend?

Most companies religiously run full backups on a weekend – regardless of the frequency (e.g., weekly or monthly), full backups for all machines will run somewhere between Friday evening and Sunday, so as to (a) maximise the size of the window for full backups and (b) minimise the impact on 9-5 Monday-Friday users.

There are of course some instances where it’s absolutely, without a shadow of a doubt, completely necessary to run full backups on the weekend. However, if you’re struggling with getting the full backups for every server in your environment complete over the course of a weekend, you should ask yourself – can any machine get a full backup at another time?

I’ve seen a lot of sites save significant money on infrastructure by asking this question and realising that not all servers had to get a full backup on a weekend. For these sites, while some servers did have to continue to receive full backups on a weekend, many servers could have their full backups shifted to other days of the week. In the most extreme of circumstances, many sites have been able to spread their full backups out over every night of the week. The advantage of such a scenario is that the backup environment no longer has to be designed for extreme peaks in throughput requirements, with the nightly data distribution being far more even.

For instance, if servers in your environment have an even distribution of data, then rather than say, backing up 500 TB of data over a single weekend in full backups, why not run say, ~71TB of full backups every night? (Obviously the infrastructure requirements for such differences in full backup sizes are going to be considerably different.)

The next time you’re worried that your data is growing to the point that you can’t meet your full backup windows, stop and have a think about whether any of that load can be shifted to another day of the week. If it can, you may save your site some money.

© 2012 The NetWorker Blog