Recently Australia’s largest grocery chain followed some of the other chains and started offering unit pricing on its products. For example, packaged food now shows not only its actual RRP but also the price per 100 g. That way, you can look at, say, two blocks of cheese and work out which one is technically the better price, even if one is larger than the other.

This has reminded me of how miserly some companies can be with backup. While it’s something I cover in my book, it’s also something that’s worth explaining in a bit of detail.

To set this up, I want to use LTO-4 media as the backup destination, and look at one of those areas of systems that are frequently skipped from backups by miserly companies looking to save a buck here and there. That, of course, is the operating system. Too often it’s common to see backup configurations that back up data areas on servers, but leave the operating system unprotected because “that can be rebuilt”. That sort of argument is often a penny-wise/pound-foolish approach that fails to take into account the purpose of backup – recovery.

Sure, useless backups are a waste of money. For example, if you back up an Oracle database using the NetWorker module, but also let filesystem backups pick up the datafiles from the running database, then you’re not only backing up the database twice, but the non-module backup in that scenario is also useless because it can’t be recovered from.

However, are operating system backups a waste of money or time? My argument is that except in circumstances where they are architecturally illogical/unnecessary, they’re neither a waste of money nor a waste of time. Let’s look at the why…

At the time of writing, a casual search of “LTO-4 site:.au best price” in Google yields within the first 10 results LTO-4 media as low as $80 for RRP. That’s RRP, which often has little correlation with bulk purchases, but miserly companies don’t make bulk media purchases, so we’ll work off that pricing.

Now, LTO-4 media has a native capacity of 800 GB. Rather than go fuzzy with any numbers, we’ll assume native capacity only for this example. So, at $80 for 800 GB, we’re talking about $0.10 per GB – 10c per GB.

So, our $80/800GB cartridge has a “unit cost” of 10c/GB, which sounds pretty cheap. However, that’s probably not entirely accurate. Let’s say that we’ve got a busy site and in order to facilitate backups of operating systems as well as all the other data, we need another LTO-4 tape drive. Again, looking around at list prices for standalone drives (“LTO-4 drive best price site:.au”) I see prices starting around the $4,500 to $5,000 mark. We should expect to see the average drive (with warranty) last for at least 3 years, so that’s $5,000 for 1,095 days, or about $4.57 per day of usage. Let’s round that to $5 per day to account for electricity usage.

So, we’re talking about 10c per GB plus $5 per day. Let’s even round that up to $6 per day to account for staff time in dealing with any additional load caused by operational management of operating system backups.

I’ll go on the basis that the average operating system install is about 1.5GB, which means we’re talking about 15c to back that up as a base rate, plus our daily charge ($6). If you had say, 100 servers, that’s 150GB for full backups, or $15.00 for the full backups plus another $6 on that day. Operating system incremental backups tend to be quite small – let’s say a delta of 20% to be really generous. Over the course of a week then we have:

  • Full: 150GB at $15 + $6 for daily usage.
  • Incremental: 6 x (30GB at $3 + $6 for daily usage).

In total, I make that out to be $75 a week ($21 for the full backup day plus 6 x $9 for the incremental days), or $3,900 a year for operating system backups to be folded into your current data backups. Does that seem a lot of money? Think of this: if you’re not backing up operating system data, this usually means that you’re working on the basis that “if it breaks, we’ll rebuild the server”.
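As a rough sketch, the weekly and yearly figures can be reproduced from the numbers quoted above. The variable names are purely illustrative, and the model is the same simplified one used in the text: native capacity only, no media reuse, and a flat $6 overhead on every backup day.

```python
# Figures taken from the text above; all prices in Australian dollars.
TAPE_PRICE = 80.0          # RRP of one LTO-4 cartridge
TAPE_CAPACITY_GB = 800     # native capacity, no compression assumed
DAILY_OVERHEAD = 6.0       # drive amortisation, electricity and staff time (rounded up)
SERVERS = 100
OS_SIZE_GB = 1.5           # assumed average operating system install
INCREMENTAL_DELTA = 0.20   # generous daily change rate

cost_per_gb = TAPE_PRICE / TAPE_CAPACITY_GB

full_gb = SERVERS * OS_SIZE_GB          # size of one full backup across all servers
incr_gb = full_gb * INCREMENTAL_DELTA   # size of one incremental across all servers

full_day = full_gb * cost_per_gb + DAILY_OVERHEAD
incr_day = incr_gb * cost_per_gb + DAILY_OVERHEAD

# One full backup day plus six incremental days per week.
weekly = full_day + 6 * incr_day
yearly = weekly * 52

print(f"Media cost per GB:   ${cost_per_gb:.2f}")
print(f"Full backup day:     ${full_day:.2f}")
print(f"Incremental day:     ${incr_day:.2f}")
print(f"Weekly cost:         ${weekly:.2f}")
print(f"Yearly cost:         ${yearly:.2f}")
```

With the figures as stated (one full plus six incrementals, each day carrying the $6 overhead), this comes to $75 a week, or $3,900 a year. Changing any of the constants shows how sensitive the total is to, say, a larger operating system footprint.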

I’d suggest to you that in most instances your staff will spend at least 4 hours trying to fix the average problem before the business decision is made to rebuild a server. Even with say, fast provisioning, we’re probably looking at 1 hour for full server reinstall/reprovision, current revision patching, etc. So that equals 5 hours of labour. Assuming a fairly low pay rate for Australian system administrators, we’ll assume you’re paying your sysadmins $25 per hour. So a 5 hour attempted fix + rebuild will cost you $125 in labour. Or will it? Servers are servers because they typically provide access or services for more than one person. Let’s assume 50 staff are also unable to work effectively while this is going on, and their average salary is even as low as $20 per hour. That’s $5,000 for their labour, or being more fair, we’ll assume they’re only 50% affected, so that’s $2,500 for their wasted labour.

How many server rebuilds does it take a year for operating system backups to suddenly be not only cost-effective but also a logically sound business decision? Even when we factor in say, an hour of effort for problem diagnosis plus recovery when actually backing up the operating system regions, there’s still a significant difference in price.
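To put a number on that question, here is a minimal sketch of the break-even calculation. It uses the labour figures above and one assumption of my own that the text does not spell out: during the one-hour recovery, the same 50 staff are affected at the same 50% rate. The function name is illustrative only.

```python
import math

def outage_cost(hours, sysadmin_rate, staff, staff_rate, staff_impact):
    """Labour cost of one outage: sysadmin time plus partially idle staff."""
    return hours * sysadmin_rate + hours * staff * staff_rate * staff_impact

# One full day ($15 media + $6 overhead) plus six incremental days ($3 + $6 each),
# repeated for 52 weeks.
annual_backup_cost = 52 * ((15 + 6) + 6 * (3 + 6))

# Without operating system backups: 4 hours of attempted fixes + 1 hour of rebuild.
rebuild = outage_cost(5, 25, 50, 20, 0.5)

# With operating system backups: about 1 hour of diagnosis plus recovery
# (assumption: staff are affected for that hour as well).
recover = outage_cost(1, 25, 50, 20, 0.5)

saving_per_incident = rebuild - recover
break_even = math.ceil(annual_backup_cost / saving_per_incident)

print(f"Cost of a rebuild:        ${rebuild:,.2f}")
print(f"Cost of a recovery:       ${recover:,.2f}")
print(f"Saving per incident:      ${saving_per_incident:,.2f}")
print(f"Incidents to break even:  {break_even} per year")
```

Under these assumptions, operating system backups pay for themselves after two server failures a year across the whole site, and that is before counting any revenue lost while the server is down.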

Now, I’m not saying that any company that chooses not to back up operating system data is being miserly, but I will confidently assert that most companies who choose not to back up their operating system data are being miserly. To be more accurate, I’d suggest that if the sole rationale for not doing such a backup is “to save money” (rather than “from an architectural standpoint it is unnecessary”) then it is likely that a company is wasting money, not saving it.

 

If you’re a NetWorker administrator, but not a Mac OS X administrator, you may be unfamiliar with the process to stop and restart NetWorker on that platform. It’s actually easy, but for someone who hasn’t come from a Mac OS X command-line background, it’s not something you’d immediately expect.

To stop and restart NetWorker, all you need to use is the SystemStarter command, which you run from Terminal (or another suitable command line login).

A typical stop/restart sequence will look like the following:

preston@archon ~$ sudo SystemStarter stop NetWorker
Password:
Stopping NetWorker Client.
preston@archon ~$ sudo SystemStarter start NetWorker
Starting NetWorker Client.

Or, you could simply run:

preston@archon ~$ sudo SystemStarter restart NetWorker

If you’re wondering where the NetWorker startup/shutdown script is installed, you’ll find it, along with other such scripts, installed in /Library/StartupItems.

 

Something I mention in my book, but which is worth elaborating further upon, is the need to keep backups of your backup server for as long as your longest backups – if not longer. One of the primary reasons for this of course is the indices; recovering older indices is traditionally easier than the laborious alternative of scanning in potentially a multitude of media.

There is however another, equally important reason why your backup server should have a browse/retention time at least as long as the longest in your site – the logs.

Being able to recover your backup logs (i.e., nsr/logs/daemon*, nsr/logs/messages, etc.) is like having your own personal time machine for the backup system. This becomes important when you hit recovery situations that you just can’t explain. That is, an error you are getting now, when you try to do a recovery of files backed up 2 years ago, may not make any sense at all. However, if you’re able to recover the backup server logs from that period in time, they may very well fill in the missing information for you. The most common thing I find this helps with is identifying whether what you’re trying to recover was ever actually backed up in the first place. That is, the scenario runs something along the lines of:

  • User asks for file from arbitrary date – e.g., 29 May 2006.
  • Can’t browse to 29 May 2006, but can browse to 28 May 2006 and 30 May 2006.
  • Recover backup server logs from the 30 May 2006 backup to see that the client could not be contacted for backup on 29 May 2006.

Now, some would argue that not being able to recover is the real problem – this isn’t always the case. Sometimes, due to circumstances beyond your control, you literally can’t recover – such as a situation like the above, where there was a failure to back up in the first place. In situations such as this, being unable to explain why the recovery can’t be facilitated is just as bad as not being able to recover.

 

I’m heading to Auckland for a few days for the annual IDATA kick off. While I’m gone I’ll have less Internet access than normal, so it may take a little longer to approve or respond to comments. Meanwhile I have some articles queued up for publication while I’m gone. Back to normal on Monday.

 

A recent discussion on the NetWorker Mailing List about configuring cleaning cartridges reminded me that it would be worthwhile to quickly cover the oft-asked questions:

  • How frequently should I clean my drives?
  • Should I have NetWorker, or the tape library, clean my drives?

There are two schools of thought when it comes to the frequency of the cleaning. The first is to only have the drive(s) cleaned when they request cleaning. The second is to clean religiously, every X weeks, regardless of whether they request cleaning or not.

There are pros and cons to both techniques.

When it comes to only cleaning when necessary, one of the primary reasons for this technique is that cleaning is essentially an abrasive action – by running a cleaning cartridge through a drive, the drive heads are being rubbed clean by the cleaning tape. This obviously introduces some physical wear, however trivial, which may over time affect the longevity of the drive. Therefore, one can extract maximum life out of one’s tape drives by only cleaning when requested.

The second technique, that being to clean every X weeks, regardless of whether the drives request cleaning or not, is premised on the notion that it reduces any build-up of dust and particulates on the drives, thus reducing the chances of the drive compromising the longevity of the tape.

So, which should you choose? Well, that probably depends on how clean your environment is. If your drives are in a well protected, isolated environment that has excellent dust filtering, you may very well find that using drive-initiated cleaning is the way to go. However, if your environment isn’t so clean, then forcing periodic cleaning may be more appropriate.

These days, given the increase in technology surrounding tape drives, and the replacement timeframes, I suspect the “abrasive action” argument holds perhaps less force than it used to. Ultimately as well, if your primary goal is to ensure healthy backups, then running a cleaning cartridge periodically through drives to reduce the chance of either a backup or recovery failing/requiring a restart due to cleaning being required may be a smart thing to do.

Next, we must move on to whether NetWorker should clean the drives, or whether the tape library should do so.

In versions of NetWorker 7.3.x and lower, I always advocated that NetWorker manage the cleaning. NetWorker in such versions had a tendency to not react well to any situation where it went to use a drive only to find it was already occupied, even if that was with a cleaning cartridge.

However, with 7.4.x and higher, I have noticed NetWorker is significantly more capable of detecting that a drive is being cleaned and not treating it as an error; instead it simply chooses to retry the operation.

Thus, these days I’d suggest that the decision as to whether NetWorker or the library controls drive cleaning is entirely a personal one, in the same way that choosing to wear black or blue socks is a personal one. My personal preference is that if the library/drives I’m using support TapeAlert, and I’m using NetWorker 7.4.x or higher, I’ll now enable library controlled cleaning. With older libraries/drives or NetWorker 7.3.x/lower, I’ll have NetWorker manage the cleaning.

 

The utility mmpool is one of those lesser used utilities in NetWorker that you may not necessarily know about, nor would you necessarily always need to use it, but it’s handy to know about it.

Similar to the more well known mmlocate utility, mmpool is designed to list pools on a NetWorker server, and for any pool, list the volumes in that pool. It also has some more “dangerous” options in that it can delete all volumes in a pool, but thankfully it will prompt you about each volume, so you can’t just go blindly destroying media database entries.

The two options I find most useful with mmpool are:

  • -L – List all pools
  • -l pool – List all volumes in the nominated pool.

For example:

# mmpool -L
Staging
Staged Clone
PC Archive Clone
Indexed Archive
Indexed Archive Clone
Default
Full
NonFull
Offsite
Default Clone
Archive Clone
PC Archive
Archive

To review all the volumes in the pool ‘Staged Clone’, you would run:

# mmpool -l "Staged Clone"
volume     pool
Clone.001  Staged Clone
Clone.002  Staged Clone
Clone.003  Staged Clone
Clone.004  Staged Clone
Clone.005  Staged Clone

Again, mmpool is not a utility you’ll find yourself running every day, but it is useful to have available.
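If you want to feed that listing into a script, a small parser is enough. The following is a minimal sketch, assuming the two-column “volume pool” layout shown above and volume names that contain no whitespace (pool names may contain spaces). The function name is hypothetical and is not part of NetWorker.

```python
def parse_mmpool_listing(text):
    """Parse the output of 'mmpool -l <pool>' into (volume, pool) tuples.

    Assumes each data line is '<volume> <pool>' where the volume name has
    no whitespace; everything after the first gap is treated as the pool.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace only
        if len(parts) != 2:
            continue                 # skip blank or malformed lines
        if parts == ["volume", "pool"]:
            continue                 # skip the header line
        rows.append((parts[0], parts[1].strip()))
    return rows


# Sample text copied from the listing above.
sample = """\
volume     pool
Clone.001  Staged Clone
Clone.002  Staged Clone
Clone.003  Staged Clone
Clone.004  Staged Clone
Clone.005  Staged Clone
"""

volumes = parse_mmpool_listing(sample)
for volume, pool in volumes:
    print(f"{volume} belongs to pool '{pool}'")
```

In practice you would capture the output of mmpool itself rather than use a pasted sample; that step is left out here because it depends on where and how the command is run.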

 

One of the policy changes made in NetWorker 7.4.4 (and which applies to 7.5.x as well) is that of client parallelism when it comes to new clients.

I have to say, and I’ll be blunt here, I find the policy change reasonably inappropriate.

In a post 7.4.4 world, NetWorker defaults to giving new clients that you create a parallelism of 12. I’d always thought that 4 was a terrible default setting, being too high, in a modern environment; you can imagine then what I thought when I found the new default setting was 12.

There’s a good reason why I find this inappropriate. In fact, it’s implicitly covered in my book by the sheer number of pages I devote to discussing how to plan client parallelism settings. In short, client parallelism settings are typically not something that you should set blindly. Unless you already have very clear ideas of filesystem/LUN layout, processing capabilities, bandwidth, etc., on a client, in my opinion you must start with a parallelism of 1 and work your way up as a result of clear and considered performance testing.

Given the amount of effort that’s been put into the latest NetWorker releases for VMware integration – i.e., the Virtual Client Connection license, etc. – it seems a less than logical choice to increase parallelism settings rather than decrease them (as a default) when you know that over time the number of virtualised hosts being backed up is going to increase.

This is obviously just a small inconvenience, but if you’ve not picked up on this yet, you should be aware of it when you start working with these newer versions of NetWorker.

What the real solution is

For what it’s worth, I actually don’t think the solution is to change the default client parallelism setting to 1, but to start maintaining a “defaults” component within the NetWorker server resource where local administrators can configure default settings for a plethora of new resources to be created (most typically clients, groups and pools).

For example, you might have options where you can specify the following defaults for any new client:

  • Parallelism
  • Priority*
  • Group
  • Schedule
  • Browse Policy
  • Retention Policy
  • Remote Access
  • etc.

These all have their own defaults, but it’s time to move past the point where NetWorker suggests standard defaults, and have all these default settings modifiable by the administrator. I realise that when the server bootstraps itself, it still needs to fall back on standard defaults, and that’s fine. However, once the server is up and running, being able to modify these defaults would be a Time Saving Feature.

This would reduce the amount of work administrators have to do when creating new resources – let’s face it, most of us spend most of the time in new resource creation changing the “default” settings. It would also reduce the number of human errors introduced when adding to the configuration in a hurry. This sort of “defaults” component would preferably be run as a wizard in NMC on first install, and administrators would be asked if they want to re-run it upon updates.


* Adding priority to this might suggest a need to have the priority field work better than it has of late…

 

The fantastic thing about NetWorker is that being a three-tier architecture, a datazone may encompass far more than just a single site or datacentre. That is, you can design a system where the NetWorker server is in Sydney, and you have storage nodes in Melbourne, Adelaide, Perth, Brisbane, Darwin and Hobart. The server would be responsible for coordinating all backups and storing/retrieving data from the Sydney datacentre, and each storage node would be responsible for the storage/retrieval of backups local to its own site.

(Or, to use a non-Australian example, you could have a datazone where your backup server is in London, and you have storage nodes in Paris, New York and Cape Town.)

When a NetWorker datazone encompasses only a single datacentre, there’s usually very little tweaking that needs to be done to the server <-> storage node communications, once they’re established. However, when we start talking about datazones that encompass WANs, we do have to take into account the level of latency between the storage nodes and the backup server.

Luckily, there are settings within NetWorker to account for this. Specifically, there are three key settings, all maintained within the NetWorker server resource itself. These are:

  • nsrmmd polling interval
  • nsrmmd restart interval
  • nsrmmd control timeout

To view these settings in the NetWorker management console, you first have to turn on diagnostic mode. Then, right click the server (absolute topmost entry in the configuration tree) and choose “Properties”. These settings are maintained in the “Media” pane:

Controlling nsrmmd settings in NMC

So, what does each of these settings do?

  • nsrmmd polling interval – This is the number of minutes that elapses between times that the NetWorker master process (nsrd) probes the nsrmmd to determine that it is still running. You could think of it as the heartbeat parameter. By default, this is 3 minutes.
  • nsrmmd restart interval – This is how long, in minutes, NetWorker will wait between restart attempts of an nsrmmd process. By default, this is 2 minutes.
  • nsrmmd control timeout – This is the number of minutes NetWorker waits for storage node requests to be completed. By default, this is 5 minutes.

Note that NetWorker is intelligent about this – the man pages for instance explicitly refer to “remote nsrmmd” in each of the first two options, meaning that we should expect local nsrmmd processes on the backup server itself to be dealt with faster, even if these settings are increased.

All these settings work well for regular-sized LAN-contained datazones. However, they may not be optimal in either of the following two scenarios:

  • Very busy datazones that have a large number of devices, even if they’re in the same LAN;
  • WAN-connected datazones.

In either of these scenarios, if you’re seeing periodic phases where NetWorker goes through restarting nsrmmd processes, particularly if this is happening during backups, then it’s a good idea to try to bump up these values to something more compatible with your environment.

My first recommendation, which works for most sites without any further tweaking, is to double each of the first two settings and modestly raise the third – i.e., increase nsrmmd polling interval to 6 minutes, increase nsrmmd restart interval to 4 minutes, and increase nsrmmd control timeout from 5 to 7 minutes. (I don’t think it’s usually necessary to double nsrmmd control timeout, because the delay in such timeouts is usually caused by devices, not the bandwidth of the connection, and therefore you don’t need to drastically increase the value.)
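For reference, the starting values above can be written down as a tiny sketch. The setting names and defaults are the ones described earlier; the helper function is hypothetical, and treating the control timeout change as “add two minutes” is my own generalisation of the 5 to 7 minute suggestion.

```python
# Default values (in minutes) as described above.
DEFAULTS = {
    "nsrmmd polling interval": 3,
    "nsrmmd restart interval": 2,
    "nsrmmd control timeout": 5,
}

def suggested_settings(defaults):
    """Return first-pass values for busy or WAN-connected datazones.

    The polling and restart intervals are doubled; the control timeout is
    only raised by 2 minutes (an assumption generalising '5 to 7').
    """
    suggested = {}
    for name, minutes in defaults.items():
        if name == "nsrmmd control timeout":
            suggested[name] = minutes + 2
        else:
            suggested[name] = minutes * 2
    return suggested

suggested = suggested_settings(DEFAULTS)
for name, minutes in suggested.items():
    print(f"{name}: {DEFAULTS[name]} -> {minutes} minutes")
```

These are only starting points; if nsrmmd restarts continue during backups, the values would need further adjustment for the environment in question.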

 

I periodically spend Sunday mornings and other quiet times delving through the articles pointed to by undrln.

One that caught my attention a while ago, and that greatly impressed me, was “Stop acting like a sissy and market your company”. Having previously worked for a company that almost religiously insisted on word-of-mouth advertising only, and which subsequently went under*, I can appreciate this blunt, matter-of-fact advice even more.

It’s a fascinating read.


* I’m not claiming that was the only reason it went under, or indeed the primary reason it went under, but it was certainly a contributing factor.

 

Occasionally, depending on the issue you are having, EMC support or EMC engineering may request that you provide your NetWorker binary build details. This isn’t necessarily the same as the version information, since patches will obviously have different build details.

Usually they just say something along the lines of “can you run what filename and return the output?” Well, what isn’t always a useful command, depending on the Unix environment you’re on, and I’m even seeing some sites where it’s not installed (e.g., Solaris platforms where the /usr/ccs area doesn’t exist).

So, it’s handy to know how to retrieve this information without the benefit of what. It’s actually easy. For Unix, all you need to do is:

# strings /path/to/file | grep '@('

For example, if I wanted to know the build details for /usr/sbin/save on my laptop, I’d run:

[Sun May 10 07:12:30]
preston@archon ~
$ strings /usr/sbin/save | grep '@('
@(#) Product:      NetWorker
@(#) Release:      7.5.1.Build.269
@(#) Build number: 269
@(#) Build date:   Fri Mar 20 23:05:02 PDT 2009
@(#) Build arch.:  darwin
@(#) Build info:   DBG=0,OPT=-O2 -fno-strict-aliasing

This is all the information that support/engineering are going to be after when they’re wanting the build number of a binary, so knowing how to use strings and grep to retrieve it gives you a solution that should work on just about any Unix platform.
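If neither what nor strings is available on a host but Python is, the same information can be pulled out with a few lines of script. This is a minimal sketch rather than a replacement for either utility: the function name is my own, it reads the whole file into memory, and it returns each run of printable ASCII that starts at an “@(#)” marker (unlike strings, it drops anything that precedes the marker on the same run).

```python
import re

# A run of printable ASCII (plus tabs) beginning at the "@(#)" marker.
_MARKER = re.compile(rb"@\(#\)[\t\x20-\x7e]*")

def build_strings(path):
    """Return the '@(#)' identification strings embedded in a binary file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return [match.group(0).decode("ascii") for match in _MARKER.finditer(data)]

# Example usage (the path is the one from the text above):
#
#     for line in build_strings("/usr/sbin/save"):
#         print(line)
```

Run against a NetWorker binary, this would be expected to list the same Product, Release and Build lines as the strings and grep pipeline.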

On Windows, you can readily find the build information by right-clicking the binary, choosing Properties, and then going to the “Version” tab. You’ll get something like the following:

NetWorker build details on Windows

You can see in the above screenshot that the first three information sections are “Build Date”, “Build Info” and “Build Number” – clicking on each of those will give you the information you need to provide.

© 2012 The NetWorker Blog