How much time do your staff take to monitor backups?

The answer should be: very little.

Not because they don’t care, or you’re not tasking someone with the responsibility, but because your system should be designed such that your staff can see a “big picture” overview of all backups in a very short period of time. Assume you do all your full backups on the weekend, and your staff arrive at 08.55 and spend the first 10 minutes grabbing a coffee, chatting, logging on, firing up email, browsers, etc. If they can’t tell you by 09.15 what the percentage success rate for the weekend backups was, you’re monitoring backups wrong.
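As a rough illustration of what a “big picture” overview boils down to, here is a minimal Python sketch. It assumes the backup product can export a list of (client, status) pairs; the status strings and client names are hypothetical and not taken from any particular product.

```python
from collections import Counter


def weekend_summary(results):
    """Summarise backup outcomes as a total, a success rate and a to-do list.

    `results` is a list of (client, status) pairs. The status strings
    "succeeded" and "failed" are hypothetical labels for this sketch.
    """
    counts = Counter(status for _client, status in results)
    total = sum(counts.values())
    rate = 100.0 * counts["succeeded"] / total if total else 0.0
    # Anything that did not succeed goes on the troubleshooting list.
    needs_attention = sorted(c for c, status in results if status != "succeeded")
    return {
        "total": total,
        "success_rate": round(rate, 1),
        "needs_attention": needs_attention,
    }


results = [
    ("mail01", "succeeded"),
    ("db01", "failed"),
    ("fs01", "succeeded"),
    ("fs02", "succeeded"),
]
print(weekend_summary(results))
```

The summary is the monitoring step; the `needs_attention` list is merely the starting point for troubleshooting, which is a separate activity.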

Don’t get this confused with troubleshooting. If backups encountered problems, troubleshooting may take considerably longer.

What unfortunately happens all too regularly is that monitoring and troubleshooting are seen as the same activity, or worse, they occupy the same amount of time. Nothing should be further from the truth.

 

Stop

For the last 6 weeks my life has seemingly been about constant interruptions. The house we’re renting has just been sold, and while I appreciate as a landlord myself the constraints of home ownership, I’ve also been made acutely aware of the challenges of trying to live a normal life while you’re constantly being asked to facilitate inspections, access, etc. The simple fact is that for 6 weeks, I’ve not been able to do anything much at all on weekends. Sure, the interruptions may only take an hour or two each day they occur, but since they happen in the middle of the day, there’s a whole bunch of things that you just can’t get to. A couple of weeks ago, for example, a festival over a long weekend was entirely unattainable.

Which brings me to the topic of this post – how much does your backup system interrupt you from your work?

If you’re a backup administrator, you probably question the logic of my question – after all, having to spend time on the backup system is just a case of doing your job.

However, this isn’t really the full story. Even if you’re a dedicated backup administrator, your job shouldn’t really be interruption based. An interruption based job, in that respect, implies a firefighting role – and a firefighting role is going to occur because of any combination of the following:

  • Architectural issues;
  • Procedural issues;
  • Hardware/software issues.

None of these should be all-encompassing enough that they become a dominating factor. Timesheets often demonstrate this in terms of how we start notating our used time. For more years than I can count I’ve worked in jobs where time has to be accounted for, and usually in 15 minute increments. But timesheets never account for spin-down and spin-up time. That is, if you’re working on something already, and a new task comes up that you have to switch across to, that switch-time is not instantaneous. (For further details, check here.)

So if your backup system is regularly acting as an interrupt system, are you working productively, or do you have an annoy-a-tron in your environment?

If you’re suffering high levels of interrupts in your backup environment, it’s time to look at changing the environment, even if that change means a temporary spike in work load or a requirement to bring some temporary staff on. With the possible exception of recoveries, no backup environment should be interrupt driven.

With the exception of recoveries, all other activities within a backup environment should be handled either as:

  • Change requests – a formal system tracking and monitoring successful implementation of non-major updates and alterations to the environment. This would cover new clients, new backup modules, etc.
  • Projects – a formal process for delivering substantial changes to the backup environment. (E.g., replacing an existing tape library with a combined backup to disk + long-term tape solution.)

Now I said “with the exception of recoveries” because, quite frankly, recoveries are the most important activity that can be done in a backup environment. As such, I want to note their processes explicitly. Recoveries should fall into one of three different categories:

  • User serviced – Recoveries that end-users or people other than backup administrators/operators can initiate, monitor and complete without intervention. This may be file recoveries from NAS units that integrate with snapshot/rollback functionality, it may be access to a NetWorker recovery GUI, or it may be the ability to initiate recovery from within an application module. These should be practically invisible to the backup administrators/operators.
  • Scheduled – Non-urgent recoveries that are requested via a formal process and submitted to the appropriate recovery facilitator to complete. These would be slotted into the facilitator’s work schedule on a priority basis.
  • Emergency – Critical recoveries (you could call these priority 1 recoveries – regardless of whether the official recovery request has been submitted or not)

In any environment, no matter how well architected, there will always be the risk of emergency situations requiring immediate action – critical faults don’t tend to be something you can just schedule into your work day, for instance.

However, in a well architected backup environment with functioning equipment, it should be the case that fire-fighting is a minimum job aspect, rather than an all-encompassing part of the backup administrator’s role.

 

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days where computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash in hand jobs, the chances of a business not relying on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet there are still improvements that can be made to data protection attitudes within many organisations, and I wanted to outline things that are still done incorrectly in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that doesn’t get claimed on is.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet, in practically no organisation will administrators of those specific systems get such free rein over the data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I’ve seen made in backup systems planning is assuming that what is bought today for backup will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
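To make the forecasting point concrete, here is a small Python sketch of a compound growth projection. The 30% annual growth rate, the 100 TB starting size and the 250 TB capacity are illustrative assumptions only. Since growth is rarely this regular, the projection should be re-run with measured figures at each reassessment.

```python
def project_backup_size(current_tb, annual_growth, years):
    """Projected protected data size (TB) at the end of each year, compounding."""
    return [current_tb * (1 + annual_growth) ** year for year in range(years + 1)]


def first_year_over(projection, capacity_tb):
    """Index of the first year in which the projection exceeds capacity, or None."""
    for year, size_tb in enumerate(projection):
        if size_tb > capacity_tb:
            return year
    return None


# Illustrative figures only: 100 TB today, 30% growth per year, 250 TB purchased.
projection = project_backup_size(current_tb=100, annual_growth=0.30, years=5)
print([round(size_tb, 1) for size_tb in projection])
print(first_year_over(projection, capacity_tb=250))
```

With these assumed numbers the purchased capacity is outgrown in year 4, inside the 3-5 year window that buyers often assume is covered.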

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies. Their processes without a doubt play into disaster recovery planning, and play a fairly important part, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – where staff would not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse; if anything, it makes formal policies even more of a requirement.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also continuance and alignment. If this isn’t the case yet, the first steps towards that shift can be achieved by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.

 

Pretty much everyone understands full and incremental backups in NetWorker. A full backup is a backup of everything, and an incremental is a backup of everything that has changed since the last backup, regardless of what level that is.

On the other hand, differential backups can sometimes throw people out for a while longer; particularly since NetWorker is more flexible and offers 9 differential levels.

I like to visualise backup levels as follows:

Full, Incr, Incr, 5

Figure 1: Full, incr, incr, level 5

This represents a full backup (which is drawn as a “line in the sand”), two subsequent incrementals, and a level 5 backup following the second incremental.

The first incremental backs up everything that has changed since the previous backup, which in this case is a full, thus the end-point for its arrow is the full backup line. The second incremental again backs up everything that has changed since the previous backup – the first incremental. As such, its arrow end-point is the first incremental backup.

Finally, the level 5 backup bypasses the incrementals and backs up everything that has changed since the full, so its arrow end-point goes all the way back to the full.

If you’re wondering about what makes the differential level 5 significant there – nothing. It doesn’t matter whether you use a 5, 3, 2, 8, etc. The differential numbers only have significance when you use more than one of them in a single schedule. For instance, the above backup could have been equally achieved with:

Full, Incr, Incr, 1

Figure 2: Full, incr, incr, level 1

This allows us to establish our first set of criteria for differential levels:

Rule 1: If only using one differential level in a schedule, the number is irrelevant.

Guideline 1: If only using one differential level, start in the middle. That way, if you have to slot in another differential level later, you’ve got room on ‘either side’.

Differentials in NetWorker become more complex when you have more than one level involved. Let’s consider a slight change to the above level diagram:

Full, Incr, Incr, 5, 1

Figure 3: Full, incr, incr, level 5, level 1

In this, we’ve got a fairly standard appearing set of levels until we hit the differential level 1; this follows a 5, and backs up everything that has changed since the full. If we turn to the nsr_schedule man page for a clarification here, we see:

“The levels 1 through 9 cause all files to be saved which have been modified since any lower level was performed. As an example, if you did a full save on Monday, followed by a level 3 save on Tuesday, a subsequent level 3 save on Wednesday would contain all files modified or added since the Monday full save.”

So this leads to our second set of criteria when dealing with differential levels:

Rule 2: Any differential level will back up all files which have been modified since any lower level was performed.

Guideline 2: Think of a full backup as ‘level 0’, and the differential rules make more sense.

Thus, if we swap around the two differential levels in the above diagram, our backup behaviour becomes markedly different:

Full, Incr, Incr, 1, 5

Figure 4: Full, incr, incr, level 1, level 5

The behaviour up to and including the level 1 backup mirrors what we saw in figures 1 and 2 – the incrementals each going back to the previous backup, and the first differential performed going back to the full. However, the level 5 backup will only back up those changes which have occurred since the level 1 backup – 1 is lower than 5, so the man page’s “all files to be saved which have been modified since any lower level save was performed” behaviour applies from the level 1 when the 5 is run.

Our last example for consideration is what happens when we throw another differential level in; let’s go for a level 3:

Full, Incr, Incr, 1, 5, 3

Figure 5: Full, incr, incr, level 1, level 5, level 3

Focusing on the “since any lower level save was performed”, if we do a level 3 backup after the level 5, it too will backup all files that have changed since the level 1 backup, incorporating not only the changes in the level 5 backup, but any changes since that point.
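The rules walked through in figures 1 to 5 can be sketched in a few lines of Python. This is only a model of the behaviour described in this article, not NetWorker code: a full is treated as level 0, a differential refers back to the most recent backup at a strictly lower level, and an incremental refers back to whatever backup ran immediately before it. The function name and the string labels are invented for the sketch.

```python
def reference_points(schedule):
    """For each backup, return the index of the earlier backup it measures
    changes against (None for a full, or when nothing earlier qualifies).

    `schedule` is a list of strings: "full", "incr", or "1" through "9".
    """
    def rank(level):
        if level == "full":
            return 0       # Guideline 2: think of a full as level 0
        if level == "incr":
            return 10      # modelled as higher than any differential level
        return int(level)

    refs = []
    for i, level in enumerate(schedule):
        this_rank = rank(level)
        ref = None
        if this_rank != 0:
            # Walk backwards to the most recent qualifying backup.
            for j in range(i - 1, -1, -1):
                if level == "incr" or rank(schedule[j]) < this_rank:
                    ref = j
                    break
        refs.append(ref)
    return refs


# Figure 5: full, incr, incr, level 1, level 5, level 3
print(reference_points(["full", "incr", "incr", "1", "5", "3"]))
# → [None, 0, 1, 0, 3, 3]
```

Index 3 is the level 1 backup, so both the level 5 and the later level 3 refer back to it, matching figures 4 and 5.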

To the uninitiated, NetWorker’s differential backups may seem a little challenging, but once you get the hang of them via the two sets of rules and guidelines above, you’ll find they’re very straightforward.

 

A while ago I became aware of some bugs to do with NetWorker probe based backups. While they worked without issue in 7.6 SP1, it turned out that in 7.6 SP2 and 7.6 SP3, an issue was introduced which prevented them from working as desired.

A probe backup typically runs with the following logic:

  • Execute command on client
  • Does the command exit indicating a backup is required?
    • If yes: Run backup
    • If no: Don’t run backup

All fairly straight forward. Unfortunately, for all of 7.6 SP2 and 7.6 SP3 (base), the probe logic on the server would mistakenly trigger a new backup every time the probe was run. I.e., it wasn’t properly detecting whether the client returns the “no backup required” signal.
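For contrast with the faulty behaviour, the intended decision logic can be sketched in Python as below. The exit status convention used here (0 for “backup required”, 1 for “no backup required”, anything else an error) is an assumption made for the sketch and should be checked against the NetWorker documentation for your release. `fake_probe` is a hypothetical stand-in for a real probe script.

```python
import subprocess
import sys

BACKUP_REQUIRED = 0      # assumed convention for this sketch
NO_BACKUP_REQUIRED = 1   # assumed convention for this sketch


def probe_says_backup(command):
    """Run the probe command and report whether a backup should start."""
    code = subprocess.run(command).returncode
    if code == BACKUP_REQUIRED:
        return True
    if code == NO_BACKUP_REQUIRED:
        return False
    raise RuntimeError(f"probe exited with unexpected status {code}")


def fake_probe(exit_code):
    """Build a stand-in probe command that simply exits with exit_code."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


for status in (BACKUP_REQUIRED, NO_BACKUP_REQUIRED):
    decision = "run backup" if probe_says_backup(fake_probe(status)) else "skip backup"
    print(status, decision)
```

The bug described above amounts to the server behaving as though the first branch were always taken, whatever the probe returned.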

This would likely have gone unnoticed in a lot of environments using probe backups, since common uses include:

  • Backing up and cleaning up database log files;
  • Initiating a once-daily backup in response to a particular condition being met, with short-term probe windows.

However, if your probe was designed to run multiple times during the day and you only wanted one backup, you got a different result.

I’d noticed this behaviour for a while, but never got around to investigating it. So, ironically, when I finally logged a case with EMC about it, I was told that 7.6.3.1 would be out shortly and would resolve the issue. Which, it did.

So, if you’re using probe based backups and you’re currently on either 7.6 SP2 or 7.6 SP3, and the correct behaviour of those probes is important, you need to jump across to 7.6 SP3 CR1 (otherwise known as 7.6.3.1).

Note: The incorrect probe behaviour is actually dictated here by the NetWorker server, not the client.

 

Push/Pull

A common theme of question asked by people new to NetWorker is whether it supports a push or pull recovery model.

The answer, as you’d expect for an enterprise backup product, is both. However, the recovery processes aren’t named push and pull.

If you’re not aware of push and pull recovery models, they work thusly:

  • A push recovery model is where all recovery requests are handled by the backup administrator, or at least, on the backup server, and the data retrieved is transferred out to the client.
  • A pull recovery model has the client that wishes to receive the data initiate the recovery and retrieve the data from the backup server.

NetWorker supports both, and in fact more, but it uses the term directed recoveries.

Technically, all recoveries in NetWorker are directed. They involve three clients, which are:

  • Source – the host from where the data was originally backed up;
  • Target – the host that the data is to be recovered to;
  • Control – the host that initiates the recovery.

Now, because the backup server has the NetWorker client software on it, it can be any one of those clients. A workgroup style “push” recovery would typically work with clients aligned as follows:

  • Source and Target – Host where the data came from originally
  • Control – The backup server

On the other hand, a workgroup style “pull” recovery would typically work with clients aligned:

  • Source, Target and Control – The host where the data came from originally.

NetWorker’s directed recovery model is more powerful and flexible than the above two examples, though. For example, you can run a recovery where all three hosts are different machines – e.g.:

  • Source – Production database server
  • Target – Development database server
  • Control – Backup server

In this situation the directed recovery would be used to act as a means of getting data from production into the development area.
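One way to keep the three roles straight is to model them explicitly. The Python sketch below is a conceptual model only, not a NetWorker API; the class, method and host names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectedRecovery:
    source: str   # host the data was originally backed up from
    target: str   # host the data is recovered to
    control: str  # host that initiates the recovery

    def style(self, backup_server):
        """Classify the recovery using the workgroup push/pull terms."""
        if self.source == self.target == self.control:
            return "pull"
        if self.source == self.target and self.control == backup_server:
            return "push"
        return "directed"


server = "backup01"
print(DirectedRecovery("fs01", "fs01", "fs01").style(server))            # pull
print(DirectedRecovery("fs01", "fs01", server).style(server))            # push
print(DirectedRecovery("proddb01", "devdb01", server).style(server))     # directed
```

Push and pull fall out as two special cases of the same three-role model, which is why NetWorker only needs the one term.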

So the answer to that original question is: yes, NetWorker supports a push recovery model. And yes, NetWorker supports a pull recovery model. But it also supports more.

 

Upgrading NetWorker

So a new version of NetWorker has come out, or is coming out, and it’s been decided that you’re going to upgrade, but you want a few tips for making that upgrade as painless as possible. Here are my 5 rules for upgrading NetWorker:

  1. Read the release notes. If you’re not going to read the release notes, you are better off staying on your current version, no matter what issues you’re having. I can’t stress enough the importance of reading the release notes and having a thorough grasp of:
    • What has changed?
    • What are the known issues with the new release?
    • What issues were resolved between the release you’re currently running and the new release?
  2. Do a bootstrap and index backup if upgrading between major or minor releases. If going between service packs on the same release, you can skip the index backup so long as your backups have been successful lately, but ensure you still do a bootstrap backup.
  3. Unload all tapes (physical or virtual) in jukeboxes before the upgrade. You’ll see why shortly.
  4. Upgrade in this order:
    • Storage node(s) on the day of the upgrade, before the NetWorker server
    • Server on the day of the upgrade, after the storage node(s)
    • Client(s) later, at suitable times
  5. After the upgrade but before the NetWorker services are restarted on the storage node(s) and server, delete the nsr/tmp directory on those hosts.
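If the environment is tracked in an inventory, the ordering in rule 4 is easy to apply mechanically. The following Python sketch assumes a hypothetical inventory of (hostname, role) pairs; the role labels and hostnames are invented for illustration.

```python
# Lower numbers are upgraded first, per rule 4.
UPGRADE_STAGE = {"storage node": 0, "server": 1, "client": 2}


def upgrade_order(hosts):
    """Sort (hostname, role) pairs into upgrade order.

    sorted() is stable, so hosts with the same role keep their listed order.
    """
    return sorted(hosts, key=lambda host: UPGRADE_STAGE[host[1]])


inventory = [
    ("client-a", "client"),
    ("backup01", "server"),
    ("sn01", "storage node"),
    ("sn02", "storage node"),
]
for name, role in upgrade_order(inventory):
    print(f"{role:12} {name}")
```

The clients appear last in the list, but as rule 4 notes, they can be upgraded later at suitable times rather than on the day of the server upgrade.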

Obviously, standard caveats apply, such as following any additional instructions in the release notes or upgrade notes, but sticking to the above rules as well can save a lot of hassle over time. I’ve noticed over the years that odd, random problems following upgrades can be solved by clearing the nsr/tmp directory on the server and storage nodes. If there are no tapes in the jukeboxes when the services first start after the upgrade, there’s less futzing for NetWorker to take care of before it’s fully up and running, too.

 

It’s that time of the year where I sit back for a moment and look at what articles have attracted the most readers over the year, and it’s a fairly eclectic bunch. Interestingly, for the first time since forever, the article about fixing NSR Peer Information issues didn’t come first – we have some new winners.

10 – New Micromanual – LinuxVTL and NetWorker

The second micromanual was a step-by-step guide for configuring the open source LinuxVTL system with NetWorker. I had hoped when I started writing micromanuals that I’d get them more frequently delivered, but various factors get in the way of this. Maybe in 2012 I’ll be able to get a couple more out and available.

9 – Killing scheduled cloning operations

When NetWorker’s scheduled clone option was introduced, there were a few bugs relating to stopping a scheduled clone operation from the GUI. Sometimes you could, and sometimes you couldn’t. However, you could always kill a scheduled clone job from the command line, which is what this post explained.

8 – NetWorker Firewall Configuration on Windows

Very early in the year I was doing a lot of work with NetWorker on Windows 2008 R2, and I was noticing a few gaps in the installation process when it came to the process of automated configuration of the Windows Firewall to work with NetWorker daemons. This post explained the lessons I learnt.

7 – Carry a jukebox with you (if you’re using Linux)

This article was my first post about configuring the open source LinuxVTL system with NetWorker. Since then LinuxVTL has evolved quite a lot, and I’ll likely even need to update that micromanual early in the new year as a consequence.

6 – Why I’d choose NetWorker over NetBackup Every Time

Despite the fact that the article was titled “Why I’d choose…”, I had a rather indignant response to this post insisting I was being a jerk by writing it. I stand by every word in that post. I would not, personally, elect to choose NetBackup over NetWorker on the basis that NetBackup only has true image recovery as an option, and that NetBackup doesn’t support dependency chains for backup images. I see both of these factors as critical to a true enterprise backup product, and NetBackup only half supports one of them. That doesn’t make me a jerk, it makes me someone who gives a damn about your data.

5 – Using NetWorker Client with Opensolaris

A guest article written by Ronny Egner, this post covered off getting the NetWorker client working with the OpenSolaris version of Solaris.

4 – Basics – Fixing “NSR peer information” errors

A persistent challenge in NetWorker is when the NSR peer information gets out of whack; usually this can happen when a significant change happens on a client, and the server must have this information reset. I’d still love to see this article become irrelevant by seeing an option appear in NMC to handle it, but until then, this will remain a fairly popular article.

3 – This is wrong

Earlier this year, an Australian hosting service lost thousands of hosted domains and websites due to a “hack attack”. Supposedly the clever hackers destroyed not only the production data, but also all the backups.

What really went wrong was that the company in question had designed a very poor and inadequate backup solution. Rumours were abounding at the time that backups were just simply replicated snapshots. Snapshots may be able to act as backups, but not indefinitely, and not if they’re the only thing configured. (Backups and snapshots are effectively ‘sister’ activities in ILP.)

2 – micromanual: NetWorker Power User Guide to nsradmin

The original micromanual – “NetWorker power user guide to nsradmin” was and remains extremely popular. There’s been thousands of downloads of it since its release, including quite a number from EMC themselves, so it’s clearly a handy resource. If you’ve not downloaded it yourself but you want to boost your NetWorker productivity, it’s a must read.

1 – NetWorker 7.6 SP1

When NetWorker 7.6 SP1 came out, it was a huge release. In my opinion, it should have been numbered NetWorker 7.7 at least; it wasn’t a minor set of changes or a round of bug fixes, it included significant functionality updates (including one of my favourites – support for Boost). As the number one read article of the year, it’s been a big resource for people looking at the functionality of newer releases of NetWorker.

And that, they say, is that

This year has personally been a huge year for me. My partner and I moved state/city in June, going from a regional area just outside of Sydney to the inner west of Melbourne. We also celebrated our 15th anniversary together, surrounded by many of our new friends (who are like family to us) and a few of our old friends. We were even invited to get on the radio to talk about that, not only because of the longevity of the relationship, but also because we ran the anniversary party up against the monthly Melbourne Den night. (There’s a podcast coming…) It was also the year when I sorted a lot of stuff out, and to boil all this down: it was the year that I spent a lot of time focusing on my personal life and not so much on the blog.

There may still be one or two posts left for 2011, but I’m also starting to get my head around changes and new material for 2012, and I believe 2012 will be a big year for NetWorker users.

 

For some time I’ve been debating whether to generate podcasts for the NetWorker blog.

Rather than continue to vacillate, I’ve decided to do a sample podcast, make it available here for downloading, and decide what to do based on feedback received.

While raw technical posts don’t translate well to podcasts (how do you quote screen output, for instance?), there are a lot of backup theory related posts I make which can readily be converted.

So, please follow the link below to the first podcast, in which I go over a topic near and dear to my heart: What is a zero error policy?

If you’re interested in me producing more podcasts, please let me know. Without feedback, I’ll likely leave it at just this trial. If people are interested though, I’ll set up a proper podcast stream within iTunes and get to work.

Podcast 001: What is a zero error policy?

Cheers!

 

Over the last 15 years, I’ve administered, configured and supported a huge number of NetWorker installs across a very broad range of business types including mining, finance, insurance, media, telecommunications, agriculture, education, government and health, just to name a small few.

As you may imagine, in my time I’ve picked up a few ways of working with NetWorker, and I thought I’d share some of these as my “golden” configuration rules. These are the things that I stick to, regardless of individual design considerations. I.e., don’t consider them to be design rules; they’re different again.

1. Don’t use the default resources

I make it a policy not to use the default resources that NetWorker provides. This isn’t to say that they’re not appropriate at times, but simply that you should own your own configuration. You should establish a naming standard for each of the core configuration items (groups, pools, policies, schedules, etc.) and use that to keep a consistent, uniform configuration, rather than mixing in bits of your own configuration and the default configuration.

Further, there are some default resources that you can’t modify – pools are a classic example. And in those cases, you want to be able to enable certain settings, such as auto media verify. Since you can’t modify bootstrap pools, you may as well start from scratch there.

Exception: Notifications. While there’s a couple of notifications whose alerts you can’t change, for the most part, start by modifying the existing ones for things like savegroup completion, cleaning alerts, etc.

2. Don’t name groups after their start time

I see this time and time again – groups get named after their start time. If you have an entirely static and unchanging configuration, you may sometimes get away with this. However, for the most part, you’re going to need to be flexible on shuffling around group start times from time to time. E.g., you may need to pull a group forward five minutes, or push its start time back ten minutes, etc.

If the group is named after the start time, you’ve got two options:

  1. Give it a different start time to the name, making the configuration violate the law of least astonishment; or
  2. Create a new group, move the clients across to it, delete the old group, adjust pool configurations, etc.

Either way it’s messy and unpleasant. The best approach is to just not insert the group start time into the name of the group – after all, if you go into NMC and look at a listing of all groups, you’ll see the start time immediately anyway!

If for some reason you really need to include some form of time in the group name, keep it as fuzzy as possible; e.g., “pre midnight” or “post midnight” might be one way of doing it, or even just “Early” and “Late”.
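If you have inherited a configuration and want to find groups that break this rule, a simple heuristic is to look for clock-like tokens in the names. The Python sketch below is only that, a heuristic: the patterns and the function name are assumptions for illustration, and a name such as “Exchange2010” would be flagged as a false positive.

```python
import re

# Heuristic patterns for an embedded clock time (assumed for this example):
#   24-hour style: "2205", "22:05", "22.05"
#   12-hour style: "10pm", "9 am"
TIME_24H = re.compile(r"(?<!\d)([01]?\d|2[0-3])[:.]?[0-5]\d(?!\d)")
TIME_12H = re.compile(r"(?<!\d)(1[0-2]|0?[1-9])\s?(am|pm)\b", re.IGNORECASE)


def groups_named_after_time(group_names):
    """Return the group names that appear to embed a start time."""
    return [
        name
        for name in group_names
        if TIME_24H.search(name) or TIME_12H.search(name)
    ]


if __name__ == "__main__":
    names = ["Daily_2205", "Servers 22:05", "Early", "Post Midnight", "DB_10pm"]
    for name in groups_named_after_time(names):
        print("Consider renaming:", name)
```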

3. Use as few pools as possible

The more pools you have, the more media you’ll need to use (for the most part). Extra pools also introduce drive contention and make performance tuning and tweaking of an environment more challenging. Therefore, keep pools to a minimum, focusing on using them for any/all of the following:

  • Segregating backups based on retention periods/frequency of backup (e.g., a “Daily” pool and a “Monthly” pool);
  • Segregating backups based on locality (e.g., “Daily Offsite” and “Daily Onsite”).

If you keep the number of pools you use in your environment to a minimum (while still having the number you need), you’ll have an environment that’s much easier to maintain.

4. Avoid adjusting pools while the server is active

In the dim dark days of NetWorker history, you couldn’t edit pools while the server was backing up. Over time that restriction has been lifted. However, there are still all sorts of situations that can trigger NetWorker to log the dreaded message about pools being edited while the server was busy. And if this gets logged, your pool changes won’t take effect until you can stop and restart NetWorker.

The solution? Avoid it. Plan the changes that you need to make to pools, and slot them into change windows where backups will be minimised or not happen. Equally, design your solution around the knowledge that pool modifications while the server is active can be a bit painful – e.g., having clients and/or savesets explicitly specified in a pool’s selection criteria should be an exception, not a rule.

5. Always enable Monitor RAP

NetWorker has a facility to track changes to the configuration which is called “Monitor RAP” – it’s a server resource setting, and it’s disabled by default. Once you enable it though, a RAP log is generated in the server’s log directory which maintains details of everything that gets changed (either by an administrator, or NetWorker itself) in the configuration. This not only helps in any audit situation, it also lets you back-trace configuration changes and stay apprised of changes to the environment when you have more than one person with administrative privileges in the datazone.

6. Don’t use wildcards for the admin usergroup

No, don’t.

It’s that simple.

7. Use schedule overrides to establish better monthly schedules

When creating schedules where you, say, need to have monthly backups that skip all days of the month except the last Friday of the month, switch out of calendar view and use fuzzy time definitions for overrides – e.g., an override of “full last friday every month”. It’ll save you a lot of hassle!

If you want to know how to do this, check out the examples in the Power User’s Guide to nsradmin micromanual.
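To see which dates a “last Friday of every month” rule actually lands on (handy when sanity-checking a schedule against a calendar), here is a small Python sketch. It does not talk to NetWorker at all; it just does the date arithmetic, and the function name is made up for this example.

```python
import calendar
from datetime import date


def last_weekday_of_month(year, month, weekday=calendar.FRIDAY):
    """Return the date of the last given weekday (default Friday) in a month."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    end = date(year, month, last_day)
    # Days to step back from month end to reach the requested weekday.
    offset = (end.weekday() - weekday) % 7
    return end.replace(day=last_day - offset)


if __name__ == "__main__":
    for month in range(1, 13):
        print(last_weekday_of_month(2012, month).isoformat())
```

Comparing that list with what the schedule shows for the same months is a quick way to confirm the override is doing what you meant.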

8. Give jukeboxes sensible names

When you run a configuration wizard, NetWorker will name the jukebox by the SCSI port that it finds the jukebox on. This is all well and good, but that port isn’t necessarily static – it can be moved around due to various operating system changes, etc. What’s more, it’s usually not all that human-friendly in terms of remembering, etc.

However, you can temporarily disable the jukebox and rename it.

I tend to rename the jukebox to the model type – e.g., “i500” or something simple along those lines. This is also the case when there’s only one jukebox attached to each storage node – the “rd=hostname” component of the jukebox name will give you some separation between what’s configured on the storage node and what’s configured on the server. If you have multiple jukeboxes in the same location – particularly if they’re the same model, you might append numbers to the end of the names (e.g., DD_VTL1, DD_VTL2, etc.).

If you’ve got multiple similar jukeboxes in disparate locations (say, fibre-channel connected to a single host), you might include an abbreviation of the location in the jukebox name – e.g., “DD_VTL_DC1” and “DD_VTL_DC2” for “Data Centre 1” and “Data Centre 2” – you get the drift…

9. Use the comment field

I’ll sound like an old person here, but let me put it to you this way:

Us old time NetWorker users campaigned long and hard to get a comment field – use it, or you’re being ungrateful!

Seriously though, the comment field is a great way of recording easy-to-use annotations to help make your configuration even easier to understand. Don’t go crazy with it and try to encapsulate the entire configuration for each resource in its comment field, but use it like a good programmer would use code comments.

10. Don’t mix special and filesystem savesets in the same group

Sometimes I’ll see sites where you have the same client in a group twice; the first time the client has savesets of say:

  • /
  • /opt
  • /usr
  • /usr/local
  • /home

And the second instance of the client is for a module – e.g.,

  • RMAN:OracleDB_Details

You see, NetWorker doesn’t let you have two instances of a client in the same group when one of the clients has an “All” saveset; so, in the example above, by rights the first client instance should have an “All” saveset rather than an explicit listing of savesets. But when both clients are shoe-horned into the backups, you move to an inclusive rather than exclusive backup policy, and that’s just dangerous for data protection, and introduces a much higher risk of human error.

Not only that, EMC recommends against doing it anyway – for the reason above, and for the purposes of stability and performance.

11. And a bonus: Don’t have groups that start at the same time

Keep at least five minutes between start times for groups. This is 100% “best practice” and having multiple groups that start at the same time should be absolutely avoided. If you’re in a situation where you don’t have room to configure any more groups with a 5-minute gap between groups, then, well, you’ve got too many groups and you should look at consolidating them.
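If you want to audit an existing configuration for this, the check is simple enough to script. The Python sketch below assumes you have already pulled the group names and their start times (as “HH:MM” strings) into a dictionary by whatever means you prefer; the function names and the sample data are made up for illustration.

```python
from itertools import combinations


def minutes_since_midnight(hhmm):
    """Convert an "HH:MM" (or "H:MM") string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def groups_too_close(start_times, min_gap=5):
    """Return (group_a, group_b, gap_minutes) for pairs closer than min_gap.

    start_times maps group name -> "HH:MM" start time.
    """
    clashes = []
    for (a, time_a), (b, time_b) in combinations(sorted(start_times.items()), 2):
        gap = abs(minutes_since_midnight(time_a) - minutes_since_midnight(time_b))
        gap = min(gap, 24 * 60 - gap)  # treat 23:58 and 00:01 as 3 minutes apart
        if gap < min_gap:
            clashes.append((a, b, gap))
    return clashes


if __name__ == "__main__":
    sample = {"Daily_Servers": "22:00", "Daily_DB": "22:00", "Daily_Desktops": "22:05"}
    for a, b, gap in groups_too_close(sample):
        print(f"{a} and {b} start only {gap} minute(s) apart")
```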

© 2012 The NetWorker Blog