So now the folks at the interoperability labs at Microsoft want to open up the Outlook PST format so that other people can interoperate with it.

Hmmm, forgive me, but that sounds a bit too much like Harrison Bergeron. Equality for all doesn’t mean the experience sucks any less.

The PST format (and the more recent updates for Outlook/Entourage) has been the bane of the average system or backup administrator’s existence for way, way too long.

Oh I recognise the arguments in opening it up: it will allow developers to come up with custom access programs/software, and that may include utilities to allow easier block level backup of local PST files. (See here.)

You know what I think it’s more likely to do though? Encourage more people to use a format that shouldn’t have been created in the first place, to hold data that shouldn’t be there, and thus create more backup and storage problems.

If the Microsoft interoperability labs were serious about interoperability, they’d have published a timeline for moving either towards a more open mail format (not by creating published data structures, but by using honest to goodness simple formats: plain text, attachments, etc.), or at least, if nothing else, towards a stripped down SQL database with a local utility/API that can be used to access it in plain relational database format. They do, after all, design their own SQL database server, and they provide ODBC client access at the OS level (something I’m constantly reminded of by Excel users!). What’s more, Apple managed to integrate this style of system into their OS, so it can hardly be said that it’s not possible.

 

If you think you can’t go a day without hearing something about dedupe, you’re probably right. Whether it’s every vendor arguing the case that their dedupe offerings are the best, or tech journalism reporting on it, or pundits explaining why you need it and why your infrastructure will just die without it, it seems that it’s equally the topic of the year along with The Cloud.

There is (from some at least) an argument that backup systems should be “out there” in terms of innovation; I question that in as much as I believe that the term bleeding edge is there for a reason – it’s much sharper, it’s prone to accidents, and if you have an accident at the bleeding edge level, well, you’ll bleed.

So, I always argue that there’s nothing wrong with leading edge in backup systems (so long as it is warranted), but bleeding edge is a far riskier proposition – not just in terms of potentially wasted investment, but due to the side effects of that wasted investment. If a product is outright bleeding edge then having it involved in data protection is a particularly dangerous proposition. (Only when technology is a mix of bleeding edge and leading edge can you start to make the argument that it should at least be considered in the data protection sphere.)

Personally I like the definitions of Bleeding Edge and Leading Edge in the article at Wikipedia on Technology Lifecycle. To quote:

Bleeding edge – any technology that shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus. Early adopters may win big, or may be stuck with a white elephant.

Leading edge – a technology that has proven itself in the marketplace but is still new enough that it may be difficult to find knowledgeable personnel to implement or support it.

So the question is – is deduplication leading edge, or is it still bleeding edge?

To understand the answer, we first have to consider that there are actually five classified stages in the technology lifecycle. These are:

  1. Bleeding edge.
  2. Leading edge.
  3. State of the art.
  4. Dated.
  5. Obsolete.

What we have to consider is – what happens when a technology exhibits attributes of more than one classification or stage of technology? To me, working in the conservative field of data protection, I think there’s only one answer: it should be classified by the “least mature” or “most dangerous” stage that it exhibits attributes for.

Thus, deduplication is still bleeding edge.

Why dedupe is still bleeding edge

Clearly there are attributes of deduplication which are leading edge. It has, in field deployments, proven itself to be valuable in particular instances.

However, there are attributes of deduplication which are definitely still bleeding edge. In particular, the distinction for bleeding edge (to again quote from the Wikipedia article on Technology Lifecycle) is that it:

…shows high potential but hasn’t demonstrated its value or settled down into any kind of consensus.

(My emphasis added.)

Clearly in at least some areas, deduplication has demonstrated its value – my rationale for it still being bleeding edge though is the second (and equally important) attribute: I’m not convinced that deduplication has sufficiently settled down into any kind of consensus.

Within deduplication, you can:

  • Dedupe primary data (less frequent, but talk is growing about this)
  • Dedupe virtualised systems
  • Dedupe archive/HSM systems (whether literally, or via single instance storage, or a combination thereof)
  • Dedupe NAS
  • For backup:
    • Do source based dedupe:
      • At the file level
      • At a fixed block level
      • At a variable block level
    • Do target based dedupe:
      • Post-backup, maintaining two pools of storage, one deduplicated, one normal. Most frequently accessed data is typically “hydrated”, whereas the deduped storage is longer term/less frequently accessed data.
      • Inline (at ingest), maintaining only one deduplicated pool of storage
    • For long term storage of deduplicated backups:
      • Replicate, maintaining two deduplicated systems
      • Transfer out to tape, usually via rehydration (the slightly better term for “undeduplicating”)
      • Transfer deduped data out to tape “as is”

Does this look like any real consensus to you?

One comfort in particular that we can take from all these disparate dedupe options is that clearly there’s a lot of innovation going on. The fundamental basics behind dedupe as well are tried and trusted – we use them every time we compress a file or bunch of files. It’s just scanning for common blocks and reducing the data to the smallest possible amount.
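To make “scanning for common blocks” concrete, here’s a rough sketch of fixed block dedupe estimation. It is purely illustrative and not any vendor’s algorithm: it assumes GNU coreutils (split with --filter, and sha256sum), and the function name dedupe_estimate is my own invention.

```shell
#!/bin/bash
# Illustrative only: count how many fixed-size blocks in a file are unique.
# Assumes GNU coreutils (split --filter, sha256sum). Not a vendor algorithm.

dedupe_estimate() {
    local file=$1 blocksize=${2:-4096}
    local hashes total unique

    # Hash each fixed-size block; split runs the filter once per block,
    # so we end up with one checksum per line.
    hashes=$(split -b "$blocksize" --filter='sha256sum' "$file" | cut -d' ' -f1)

    # Total blocks versus distinct blocks: the gap is what dedupe could save.
    total=$(( $(printf '%s\n' "$hashes" | wc -l) ))
    unique=$(( $(printf '%s\n' "$hashes" | sort -u | wc -l) ))

    echo "blocks=$total unique=$unique"
}

# Usage (any file will do):
#   dedupe_estimate /path/to/file 4096
```

Real products add a lot on top of this (variable block boundaries, indexes that scale, integrity checking), which is where the lack of consensus comes in.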

It’s also an intelligent and logical method of moving forward in storage – i.e., we’ve reached a point in storage where both companies that purchase storage, and the vendors that provide it, are moving towards using storage more efficiently rather than just continuing to buy it. This trend started with the development of SAN and NAS, so dedupe is just the logical continuation of those storage centralisation/virtualisation paths. More so, the trend towards more intelligent use of technology is not new – consider even recent changes in products from the CPU manufacturers. Targeting Intel as a prime example, for years their primary development strategy was “fast, faster, fastest.” However, that strategy ended up hitting a brick wall – it doesn’t matter how fast an individual processor is if you actually need to do multiple things at once. Hence multi-core really hit the mainstream. Previously reserved in multi-CPU environments for high end workstations and servers, it’s now common for any new computer to come with multiple cores. (Heck, I have 2 x Quad Core processors in the machine I’m writing this article on. The CPU speeds are technically slower than my lab ESX server, but with multi-core, multi-threading, it smacks the ESX server out of the lab every time on performance. It’s more intelligent use of the resources.)

So dedupe is about shifting away from big, bigger, biggest storage to smart, smarter and smartest storage.

We’re certainly not at smartest yet.

We’re probably not even at smarter yet.

As an overall implementation strategy, deduplication is practically infantile in terms of actual industry-state vs potential industry-state. You can do it on your primary production data, or your virtualised systems or your archived data or your secondary NAS data or your backups, but so far there have been few tangible, usable advances towards being able to use it throughout your entire data lifecycle in a way which is compatible and transparent regardless of vendor or product in use.

For dedupe to be able to make that leap fully out of bleeding edge territory, it needs to make some inroads into complete data lifecycle deduplication – starting at the primary data level and finishing at backups and archives.

(And even when we can use it through the entire data lifecycle, we’ll still be stuck with working out what to do with it once it’s been generated, for longer term storage. Do we replicate between sites? Do we rehydrate to tape or do we send out the deduped data to tape? Obviously based on recent articles I don’t (yet) have much faith in the notion of writing deduped data to tape.)

If you think that there isn’t a choice for long term storage – that it has to be replication, and dedupe is a “tape killer”, think again. Consider smaller sites with constrained budget, consider sites that can’t afford dedicated disaster recovery systems, and consider sites that want to actually limit their energy impact. (I.e., sites that understand the difference in energy savings between offsite tapes and MAID for long term data storage.)

So should data protection environments implement dedupe?

You might think, based on previous comments, that my response to this is going to be a clear-cut no. That’s not quite correct however. You see, because dedupe falls into both leading edge and bleeding edge, it is something that can be implemented into specific environments, in specific circumstances.

That is, the suitability of dedupe for an environment can be evaluated on a case by case basis, so long as sites are aware that when implementing dedupe they’re not getting the full promise of the technology, but just specific windows on the technology. It may be that companies:

  • Need to reduce their backup windows, in which case source-based dedupe could be one option (among many).
  • Need to reduce their overall primary production data, in which case single instance archive is a likely way to go.
  • Need to keep more data available for recovery in VTLs (or for that matter on disk backup units), in which case target based dedupe is the likely way to go.
  • Want to implement more than one of the above, in which case they will be buying disparate technologies that don’t share common architectures or operational management systems.

I’d be mad if I were to say that dedupe is still too immature for any site to consider – yet I’d charge that anyone who says that every site should go down a dedupe path, and that every site will get fantastic savings from implementing dedupe, is equally mad.

 

NetWorker has an irritating quirk where it doesn’t allow you to clone or stage incomplete savesets. I can understand the rationale behind it – it’s not completely usable data, but that rationale is wrong.

If you don’t think this is the case, all you have to do to test is start a backup, cancel it mid-way through a saveset, then attempt to clone that saveset. Here’s an example:

[root@tara ~]# save -b Big -q -LL /usr
Oct 25 13:07:15 tara logger: NetWorker media: (waiting) Waiting for 1
writable volume(s) to backup pool 'Big' disk(s) or tape(s) on tara.pmdg.lab
<backup running, CTRL-C pressed>
(interrupted), exiting
[root@tara ~]# mminfo -q "volume=BIG995S3"
 volume        client       date      size   level  name
BIG995S3       tara.pmdg.lab 10/25/2009 175 MB manual /usr
[root@tara ~]# mminfo -q "volume=BIG995S3" -avot
 volume        client           date     time         size ssid      fl   lvl name
BIG995S3       tara.pmdg.lab 10/25/2009 01:07:15 PM 175 MB 14922466  ca manual /usr
[root@tara ~]# nsrclone -b Default -S 14922466
5876:nsrclone: skipping aborted save set 14922466
5813:nsrclone: no complete save sets to clone

Now, you may be wondering why I’m hung up on not being able to clone or stage this sort of data. The answer is simple: sometimes the only backup you have is a broken backup. You shouldn’t be punished for this!
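If you want to find out which savesets would be skipped before you attempt a clone, a small filter like the following might help. This is a sketch under assumptions: I’m reading the “a” in the “ca” flags shown above as the aborted marker (which matches nsrclone’s “skipping aborted save set” message), the function name list_aborted is my own, and the report attributes in the usage comment should be checked against the mminfo man page for your version.

```shell
#!/bin/bash
# list_aborted: read "ssid flags" pairs on stdin, print the ssids whose
# flags field contains 'a'.
# Assumption: 'a' in the flags marks an aborted saveset, as the "ca" flag
# and the nsrclone message in the example above suggest.
list_aborted() {
    awk '$2 ~ /a/ { print $1 }'
}

# Hypothetical usage on the backup server (attribute names are an assumption):
#   mminfo -q "volume=BIG995S3" -r "ssid,sumflags" | list_aborted
```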

Overall, NetWorker has a fairly glowing pedigree in terms of enforced data viability:

  • It doesn’t recycle savesets until all dependent savesets are also recyclable;
  • It’s damn aggressive at making sure you have current backups of the backup server’s bootstrap information;
  • If there’s any index issue it’ll end up forcing a full backup for savesets even if it’s backed them up before;
  • It won’t overwrite data on recovery unless you explicitly tell it to;
  • It lets you recover from incomplete savesets via scanner/uasm!

and so on.

So, logically, it makes little sense to refuse to clone/stage incomplete savesets.

There may be programmatic reasons why NetWorker doesn’t permit cloning/staging incomplete savesets, but these aren’t sufficient reasons. NetWorker’s pedigree of extreme focus on recoverability remains tarnished by this inability.

 

Over at Backup Central, Curtis Preston says he’s convinced that dedupe to tape according to the CommVault model is a good idea, in a “crazy good” way rather than a “crazy bad” way. To summarise Curtis’ argument (and thereby establish my understanding of it), the process is:

  1. Day to day recovery of deduped tape backup would be crazy (I agree with this)
  2. Design the system so that you still facilitate most recoveries from dedupe on disk (I have no issue with this)
  3. Periodically effectively stage out the dedupe data to tape (first objection)
  4. Long-term recoveries are done from tape written in dedupe format (holy cow that’s insane!)

So, let’s look at why I think this is “crazy bad” by examining each point.

Point one – day to day recovery of deduped tape backup would be crazy

Fully agreed. I’d liken recovery from deduped data on tape to recovery of highly fragmented files from a block level backup. Block level backup products (e.g., EMC’s SnapImage) allow you to bypass the inefficiencies of the filesystem on dense structures to do a block by block backup. This can deliver fantastic time savings. For. Backup.

For recovery, file level reconstruction from block level backups can suck in a terribly horrendous way. File level reconstruction from block level backups requires recovery of the required blocks into a cache, and then the files are put back together. If your files are heavily fragmented (which is often the case on dense filesystems), the number of reads from tape required – and the amount of seeking required – is very high. Real world example: 400 GB dense filesystem (about 40,000,000 files) had full backups reduced from 15 hours to 4 hours using block level backup. Recovery of the entire filesystem took less than 4 hours – recovery of a 40 GB directory took 12 hours. Having a very large cache is one way to get around this, but that starts to get costly (and in my experience is frequently poached).
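Putting those numbers into effective throughput makes the gap obvious. The helper below is just arithmetic on the figures quoted above (the name gb_per_hour is mine), treating the “less than 4 hours” full recovery as 4 hours, i.e. the worst case.

```shell
#!/bin/bash
# gb_per_hour SIZE_GB HOURS: effective throughput, to one decimal place.
gb_per_hour() {
    awk -v size="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", size / hours }'
}

gb_per_hour 400 15   # file level full backup
gb_per_hour 400 4    # block level full backup (and, at worst, full recovery)
gb_per_hour 40 12    # file level recovery of the 40 GB directory
```

That works out to roughly 27 GB/hour for the old full backup, 100 GB/hour for the block level backup and full recovery, and a little over 3 GB/hour for the directory recovery. Per gigabyte, the partial recovery ran about 30 times slower than recovering the whole filesystem.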

Recovery from deduped data on tape will very likely suck just as badly.

Point two – design the system so that you facilitate most recoveries from dedupe on disk

Again, fully agreed. So far I’m in complete agreement with Curtis and CommVault. This point can be said of any backup design – design your system so that the most frequently performed recoveries are done from the fastest backup medium.

Point three – Periodically effectively stage out all dedupe data to tape

This is the crazy part, and not crazy good, but out and out crazy bad. To quote Curtis on this:

If you’re going to dedupe to tape, you first have to dedupe to disk.  You create what they call a silo on disk, which is a full backup and a set of deduped incrementals based on (and deduped against) that full backup. The retention on that silo should be long enough to satisfy most of your operational restore requests.  (Typically that’s 30 days, but it could be longer in your environment.)

What’s so crazy-bad about this?

Now, I’ll profess that I don’t know for sure which way this is being done, but it reads that new full backups are generated periodically in the dedupe environment, allowing the previous dependency chains of fulls + incrementals to be transferred out to tape. (Based on my reading of the CommVault marketing documentation, which refers to “reducing” the number of fulls required for retention cycles, this appears to be an accurate assessment.)

So this means that every X days (whatever your period-between-fulls is going to be) you have to do new fulls. Now while this isn’t so much of an issue in regular backups, in dedupe backups it’s a known fact that the initial full backups are hideously slow. This can be worn by most organisations when it’s a once-off. Every month? Even every 3 months or 6 months? Far less likely.

Point four – Long-term recoveries are done from tape written in dedupe format

Obviously some of my objections to this have already been expressed in my comments for point one, but to continue with my objections, let’s look at what Curtis has to say on this point as well:

But I also agree that if I typically do all my restores from within the last 30 days, and someone asks me for a 31 day-old file, it’s generally going to be the type of restore where the fact that it might take several minutes to complete is not going to be a huge deal.  (In the case that you did need to do a large restore from a deduped tape set, you could actually bring it back in to disk in its entirety before you initiate the restore.)

Now, I agree that recovery of longer term backups can be done from slower media in most instances.

There’s a difference between “slower media” and “a snail just overtook our data recovery”.

In the first case, I don’t believe that recovery from deduped data on tape will be in the order of “several minutes” … I think this would turn out to be a highly optimistic rather than terribly realistic time-frame. I would need to see a large number of real world instances of short recovery times to really believe this will be in an order of “several minutes”. Yes, I’m going on a gut feeling, but I feel it’s somewhat justified.

In the second case … “you could actually bring it back in to disk in its entirety” … how much storage do you want to be using here? If we’re talking bringing back the entire “silo”, that’s a lot of storage to bring back  – I’d suggest it’s going to be comparable to but orders of magnitude worse than say, recovering a 1TB virtual machine fileserver to a separate location in order to pull out a 100KB Excel spreadsheet. Let’s be accurate about this: recovering the entire silo would mean recovering all deduped backups – most notably a full of your entire environment.

If we’re talking about recovering just portions of the data on tape, then again, it’s going to be like the file-level recovery from block-level backup issue previously described, and we’ll be back to square one.

In Summary

I’ve got to be entirely blunt here – CommVault’s approach reminds me of the old (crude) expression (made as “G Rated” as possible):

“You can’t polish a poo, but you can roll it in gold dust”.

If the supporting architecture is crazy, it doesn’t matter that it can do something “nifty” – particularly if that something “nifty” will result in significantly slower recoveries (even in limited circumstances).

Yes, it’s undoubtedly the case that the CommVault approach will reduce the amount of data stored on tape, which will result in some cost savings. However, penny pinching in backup environments has a tendency to result in recovery impacts – often significant recovery impacts. For example, NetBackup gives “media savings” by not enforcing dependencies. Yes, this can result in saving money here and there on media, but it can also result in being unable to do complete filesystem recoveries approaching the end of a total retention period, which is plain dumb.

The CommVault approach while saving some money on tape will significantly expand recovery times (or require large cache areas and still take a lot of recovery time). Saving money is good. Wasting a little time during longer-term recoveries is likely to be perceived as being OK – until there’s a pressing need. Wasting a lot of time during longer-term recoveries is rarely going to be perceived as being OK.

The other saying that springs to mind is: The road to hell is paved with good intentions.

If I’m correct in my understanding of how the CommVault dedupe-to-tape strategy works based on a review of the CommVault marketing material (typically for any vendor, slim information) and Curtis’ summary, I can only say that their approach is not crazy good as Curtis concludes, but crazy bad.

 

Within NetWorker, data (savesets) can go through several stages in its lifecycle. Here’s a simple overview of those stages:

Basic data lifecycle

The first stage, obviously, is when data is initially being written – the “in progress” stage.

After the backup completes, data enters two stages – a browsable period and a retention period. These periods may have 100% overlap, or they may be distinctly different. For instance, the “standard” browse/retention policies chosen by NetWorker when you create a new client are:

  • Browse period – 1 month
  • Retention period – 1 year

A common mistake people make with NetWorker is to assume that the retention period starts when the browse period finishes; in actual fact, the retention and browse period start at the same time, but the browse period can finish before the retention period. So using that standard setting as an example, the saveset is browsable for the first 1 month of the 12 months that it is retained – it is not the case that the saveset is browsable for 1 month, then retained for another 12.
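A quick way to convince yourself of this is to compute both end dates from the same savetime. This sketch assumes GNU date (for the -d option); the savetime is an arbitrary example and policy_end is my own helper name, not a NetWorker command.

```shell
#!/bin/bash
# Assumes GNU date (the -d option). The savetime is an arbitrary example.
# policy_end SAVETIME PERIOD: date on which a period starting at SAVETIME ends.
policy_end() {
    date -u -d "$1 +$2" +%Y-%m-%d
}

savetime="2009-10-25"
# Both periods are measured from the savetime, not one after the other.
echo "browsable until: $(policy_end "$savetime" "1 month")"
echo "retained until:  $(policy_end "$savetime" "1 year")"
```

For a backup taken on 2009-10-25, that gives a browse end of 2009-11-25 and a retention end of 2010-10-25: twelve months in total, not thirteen.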

Once data is no longer within the retention period, and there are no backups that depend on it still within the retention period, data is considered to be recyclable.

When data is recyclable:

  • If it is on tape:
    • The data will remain available until the media is recycled. This will only happen once all the backups on the media are also recyclable, and either the administrator manually recycles the media or NetWorker re-uses it.
  • If it is on a disk backup unit (ADV_FILE) device:
    • The data will be erased from the disk backup unit the next time a volume clean operation is run, or nsrim is run (either as an overnight standard event by NetWorker, or manually via nsrim -X).

This isn’t the “whole picture” for data lifecycle within NetWorker, but it is a good brief overview to give you an idea of how data is managed within the environment.

 

Over at TED, there’s a short yet highly interesting look at data visualisation technology called the AlloSphere.

I’m a big fan of data visualisation – I think there’s currently too much emphasis on data search. As the amount of storage we continue to consume grows though, data search will only get you so far. To use an analogy, how helpful is a search so refined that it allows you to find a needle in a haystack if you have 500,000 haystacks in front of you?

We are already at the cusp of search being insufficient. Personal storage is a perfect example of this. On my desktop machine I currently have approximately 10TB of formatted storage presented, with about 70% utilisation. Yes, I’m not the average user, but I may be slightly representative of where the average user will be in about 3 years (I’m not unique – I think this would be the case for a lot of people in either IT or multimedia industries).

As much as Apple’s Spotlight is fantastic, sometimes it’s not enough. Sometimes it’s not about knowing what you’re searching for, but drilling down into what you’re searching for. For example, I use a fantastic little app on the Mac called Grand Perspective to be able to visualise storage layout/use either on entire drives or individual trees of folders. I don’t use this all the time – and not always for search, but sometimes it’s more useful than any search tool I can think of – e.g., when I need to quickly see what is using space in a folder structure, or when I’m looking for a particularly large file I was working with recently but can’t recall what it was. Here’s a screen shot of Grand Perspective run against my home directory:

Grand Perspective

As you can see, it’s big chunks of colour all over the place. What information does it give me? It lets me quickly see where space is occupied. Spotlight doesn’t do this; Finder doesn’t do this – nor do the counterpart features of Windows. I can just move the mouse cursor over any block and instantly see what file it maps to – instantly giving me comparative size details.

Data visualisation: get used to the term – I predict you’ll be hearing it with increasing regularity in the coming years.

(Bringing this back to NetWorker (or even EMC, more generically), I’d love to think there’s serious R&D being done on working out how to integrate data visualisation techniques in their products. In NetWorker we’ve sort of started to get the most primitive versions of this in the dynamic reports in NMC, but that should be seen as only the most absolute starting point. The next phase is to have the developers talk to experts in the field so they get an appreciation of how they might represent data in a way that allows faster “deep diving”.)

 

When I first mentioned probe based backups a while ago, I suggested that they’re going to be a bit of a sleeper function – that is, I think they’re being largely ignored at the moment because people aren’t quite sure how to make use of them. My take however is that over time we’re going to see a lot of sites shifting particular backups over to probe groups.

Why?

Currently a lot of sites shoe-horn ill-fitting backup requirements into rigid schedules. This results in frequent violations of the best practice of Zero Error Policies for backup. Here’s a prime example: at sites that need to do laptop and/or desktop backups using NetWorker, the administrators are basically resigned to failure rates of 50% or more in such groups, depending on how many machines are not connected to the network at the time.

This doesn’t need to be the case – well, not any more thanks to probe based backups. So, if you’ve been scratching your head looking for a practical use for these backups, here’s something that may whet your appetite.

Scenario

Let’s consider a site where there is a group of laptops and desktops that are integrated into the NetWorker backup environment. However, there’s never a guarantee of which machines may be connected to the network at any given time. Therefore administrators typically configure laptop/desktop backup groups to start at, say, 10am, on the premise that the most systems are likely to be available at that time.

Theory of Resolution

Traditional time-of-day start backups aren’t really appropriate to this scenario. What we want is a situation where the NetWorker server waits for those infrequently connected clients to be connected, then runs a backup at the next opportunity.

Rather than having a single group for all clients and accepting that the group will suffer significant failure rates, split each irregularly connected client into its own group, and configure a backup probe.

The backup system will loop probes of the configured clients during nominated periods in the day/night at regular intervals. When the client is connected to the network and the probe successfully returns that (a) the client is running and (b) a backup should be done, the backup is started on the spot.
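To picture what happens on each pass of that loop, here’s a minimal sketch of the decision made for a single probe result. It is not NetWorker’s implementation; probe_decision is a name I’ve made up, and the exit code convention (0 means a backup is required, 1 means it isn’t) is the one used by the probe script later in this article.

```shell
#!/bin/bash
# Illustrative only: interpret the result of one probe run.
# Convention assumed here (matching the probe script in this article):
#   exit 0 = backup required, exit 1 = no backup required,
#   anything else is treated as a probe failure.
probe_decision() {
    "$@"    # run the probe command passed as arguments
    case $? in
        0) echo "start backup" ;;
        1) echo "no backup needed" ;;
        *) echo "probe failed" ;;
    esac
}

# Usage, e.g. against the probe script on a client:
#   probe_decision /usr/sbin/nsrcheckbackup.sh
```

The server simply repeats this at the configured interval within the configured window; a client that is offline just doesn’t get probed successfully until it reappears.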

Requirements

In order to get this working, we’ll need the following:

  • NetWorker 7.5 or higher (clients and server)
  • A probe script – one per operating system type
  • A probe resource – one per operating system type
  • A 1:1 mapping between clients of this type and groups.

Practical Application

Probe Script

This is a command which is installed on the client(s), in the same directory as the “save” or “save.exe” binary (depending on OS type), and whose name starts with either nsr or save. I’ll be calling my script:

nsrcheckbackup.sh

I don’t write Windows batch scripts. Therefore, I’ll give an example as a Linux/Unix shell script, with an overview of the program flow. Anyone who wants to write a batch script version is welcome to do so and submit it.

The “proof of concept” algorithm for the probe script works as follows:

  • Establish a “state” directory in the client nsr directory called bckchk. I.e., if the directory doesn’t exist, create it.
  • Establish a “README” file in that directory for reference purposes, if it doesn’t already exist.
  • Determine the current date.
  • Check for a previous date file. If there was a previous date file:
    • If the current date equals the previous date found:
      • Write a status file indicating that no backup is required.
      • Exit, signaling that no backup is required.
    • If the current date does not equal the previous date found:
      • Write the current date to the “previous” date file.
      • Write a status file indicating that the current date doesn’t match the “previous” date, so a new backup is required.
      • Exit, signaling that a backup is required.
  • If there wasn’t a previous date file:
    • Write the current date to the “previous” date file.
    • Write a status file indicating that no previous date was found so a backup will be signaled.
    • Exit, signaling backup should be done.

Obviously, this is a fairly simplistic approach, but it is suitable for a proof of concept demonstration. If you wanted to make the logic more robust for production deployment, my first suggestion would be to build in mminfo checks to determine, even if the dates match, whether there has been a backup “today”. If there hasn’t, that would override and force a backup to start. Additionally, if users can connect via VPN and the backup server can communicate with connected clients, you may want to introduce some logic into the script to deny probe success over the VPN.
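As a rough idea of what that mminfo check might look like, here’s a sketch. Treat it as a starting point only: it assumes mminfo is on the client’s PATH and can query the server, that “today” is accepted as a savetime value by your NetWorker version, and that the server name lives in a variable I’ve called NSR_SERVER. The function name backed_up_today and the MMINFO override are my own additions.

```shell
#!/bin/bash
# Sketch of the "has there been a backup today?" check.
# Assumptions to verify for your NetWorker version: mminfo is on PATH, can
# reach the server named in $NSR_SERVER (a hypothetical variable), and
# accepts "today" as a savetime value.
# MMINFO can be overridden, which also makes the logic easy to test.
MMINFO=${MMINFO:-mminfo}

backed_up_today() {
    local client=$1 ssids
    # mminfo prints nothing on stdout when no savesets match the query.
    ssids=$("$MMINFO" -s "$NSR_SERVER" -q "client=$client,savetime>=today" -r ssid 2>/dev/null)
    [ -n "$ssids" ]
}

# Inside the probe script, when the dates match:
#   backed_up_today "$(hostname)" || exit 0   # nothing found today: force a backup
```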

If you wanted an OS-independent script for this, you may wish to code it in Perl, but I’ve held off doing that in this case simply because a lot of sites have reservations about installing Perl on Windows systems. (Sigh.)

Without any further guff, here’s the sample script:

preston@aralathan ~
$ cat /usr/sbin/nsrcheckbackup.sh
#!/bin/bash
# Probe script for laptop/desktop backups.
# Exit 0 signals that a backup is required; exit 1 signals that it is not.

PATH=$PATH:/bin:/sbin:/usr/sbin:/usr/bin
CHKDIR=/nsr/bckchk

# Text written to the README in the state directory.
README=`cat <<EOF
==== Purpose of this directory ====

This directory holds state file(s) associated with the probe based
laptop/desktop backup system. These state file(s) should not be
deleted without consulting the backup administrator.
EOF
`

# Create the state directory on first run.
if [ ! -d "$CHKDIR" ]
then
   mkdir -p "$CHKDIR"
fi

# Write the README once; quoting preserves the line breaks in the text.
if [ ! -f "$CHKDIR/README" ]
then
   echo "$README" > "$CHKDIR/README"
fi

DATE=`date +%Y%m%d`
COMPDATE=`date "+%Y%m%d %H%M%S"`
LASTDATE="none"
STATUS="$CHKDIR/status.txt"
CHECK="$CHKDIR/datecheck.lck"

# No previous date file: record today's date and signal a backup.
if [ -f "$CHECK" ]
then
   LASTDATE=`cat "$CHECK"`
else
   echo "$DATE" > "$CHECK"
   echo "$COMPDATE Check file did not exist. Backup required" > "$STATUS"
   exit 0
fi

# Previous date file was empty: treat it the same as a missing file.
if [ -z "$LASTDATE" ]
then
   echo "$COMPDATE Previous check was null. Backup required" > "$STATUS"
   echo "$DATE" > "$CHECK"
   exit 0
fi

# Same date as the last check means a backup was already signaled today.
if [ "$DATE" = "$LASTDATE" ]
then
   echo "$COMPDATE Last backup was today. No action required" > "$STATUS"
   exit 1
else
   echo "$COMPDATE Last backup was not today. Backup required" > "$STATUS"
   echo "$DATE" > "$CHECK"
   exit 0
fi

As you can see, there’s really not a lot to this in the simplest form.

Once the script has been created, it should be made executable and (for Linux/Unix/Mac OS X systems) be placed in /usr/sbin.

Probe Resource

The next step is, within NetWorker, to create a probe resource. This will be shared by all the probe clients of the same operating system type.

A completed probe resource might resemble the following:

Configuring the probe resource

Note that there’s no path in the above probe command – that’s because NetWorker requires the probe command to be in the same location as the save command.
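
If you want to double-check that placement on a client, something along these lines may help. It is only a sketch: it assumes the save binary is on the PATH of the user running it, and same_dir_as_save and SAVE_CMD are names I have made up for illustration.

```shell
#!/bin/bash
# Sketch: confirm a probe script sits in the same directory as the save binary.
# same_dir_as_save is a hypothetical helper; SAVE_CMD is overridable for testing.

same_dir_as_save() {
   local probe="$1" save_path
   # command -v prints the full path of the save binary found on PATH.
   save_path=$(command -v "${SAVE_CMD:-save}") || return 2   # save not found
   [ -x "$(dirname "$save_path")/$probe" ]
}

# Example:
#   same_dir_as_save nsrcheckbackup.sh || echo "probe is not alongside save"
```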

Once this has been done, you can either configure the client or the probe group next. Since the client has to be reconfigured after the probe group is created, we’ll create the probe group first.

Creating the Probe Groups

The first step in creating the probe groups is to come up with a naming standard so that they can be easily distinguished from all the other standard groups within the overall configuration. There are two approaches you can take towards this:

  • Preface each group name with a keyword (e.g., “probe”) followed by the host name the group is for.
  • Name each group after the client that will be in the group, but set a comment along the lines of say, “Probe Backup for <hostname>”.

Personally, I prefer the second option. That way you can sort by comment to easily locate all probe based groups but the group name clearly states up front which client it is for.

When creating a new probe based group, there are two tabs you’ll need to configure – Setup and Advanced – within the group configuration. Let’s look at each of these:

Probe group configuration – Setup Tab

You’ll see from the above that I’m using the convention where the group name matches the client name, and the comment field is configured appropriately for easy differentiation of probe based backups.

You’ll need to set the group’s autostart value to Enabled. Also, the Start Time field does have relevance exactly once for probe based backups – it still seems to define the first start time of the probe. After that, the probe backups will follow the interval and start/finish times defined on the second tab.

Here’s the second tab:

Probe Group - Advanced Tab

The key thing on this tab is obviously the configuration of the probe section. Let’s look at each option:

  • Probe based group – Checked
  • Probe interval – Set in minutes. My recommendation is to give each group a different interval (or at least to reduce the number of groups that have exactly the same probe interval). That way, over time as probes run, there’s less likelihood of multiple groups starting at the same time. For instance, in my test setup, I have 5 clients, set to intervals of 90 minutes, 78 minutes, 104 minutes, 82 minutes and 95 minutes*.
  • Probe start time – Time of day that probing starts. I’ve left this on the defaults, which may be suitable for desktops, but for laptops where there’s a very high chance of machines being disconnected at night, you may wish to start probing closer to the start of business hours.
  • Probe end time – Time of day that NetWorker stops probing the client. Same caveats as per the probe start time above.
  • Probe success criteria – Since there’s only one client per group, you can leave this at “All”.
  • Time since successful backup – How many days NetWorker should allow probing to run unsuccessfully before it forcibly sets a backup running. If set to zero, it will never force a backup to run. Since taking the screenshot I’ve changed that value to 3 on my configured clients; set yours to a site-optimal value. Note that since the aim is to run only one backup every 24 hours, setting this to “1” is probably not all that logical an idea.

(The last field, “Time of the last successful backup”, is just a status field; there’s nothing to configure there.)
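
To illustrate the idea of staggering probe intervals, here is a small sketch that hands out a distinct interval to each client. The base and step values in the example are arbitrary, stagger_intervals is a made-up helper, and the output is only a list to key into each group by hand; it does not configure NetWorker.

```shell
#!/bin/bash
# Sketch: print "client interval" pairs so no two probe groups share an interval.
# stagger_intervals is a hypothetical helper; base and step are in minutes.

stagger_intervals() {
   local base="$1" step="$2" i=0 client
   shift 2
   for client in "$@"; do
      echo "$client $((base + i * step))"
      i=$((i + 1))
   done
}

# Example:
#   stagger_intervals 78 7 laptop1 laptop2 laptop3
```

Distinct intervals won’t stop two probes from ever coinciding, but they should make it a less frequent event than having every group on the same interval.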

If you have schedules enforced out of groups, you’ll want to set the schedule up here as well.

With this done, we’re ready to move onto the client configuration!

Configuring the Client for Probe Backups

There are two changes required here. In the General tab of the client properties, move the client into the appropriate group:

Adding the client to the correct group

In the “Apps & Modules” tab, identify the probe resource to be used for that client:

Configuring the client probe resource

Once this has been done, you’ve got everything configured, and it’s just a case of sitting back and watching the probes run and trigger backups of clients as they become available. You’ll note, in the example above, that you can still use savepnpc (pre/post commands) with clients that are configured for probe backups. The pre/post commands will only be run if the backup probe confirms that a backup should take place.

Wrapping Up

I’ll accept that this configuration can result in a lot of groups if you happen to have a lot of clients that require this style of backup. However, that isn’t the end of the world. Reducing the number of errors reported in savegroup completion notifications does make the life of backup administrators easier, even if there’s a little administrative overhead.

Is this suitable for all types of clients? E.g., should you use this to shift away from standard group based backups for the servers within an environment? The answer to that is a big “unlikely”. I do really see this as something that is more suitable for companies that are using NetWorker to back up laptops and/or desktops (or a subset thereof).

If you think no-one does this, I can think of at least five of my customers alone who have requirements to do exactly this, and I’m sure they’re not unique.

Even if you don’t particularly need to enact this style of configuration for your site, what I’m hoping is that by demonstrating a valid use for probe based backup functionality, I may get you thinking about where it could be used at your site for making life easier.

Here are a few examples I can immediately think of:

  • Triggering a backup+purge of Oracle archived redo logs that kicks in once the used capacity of the filesystem the logs are stored on exceeds a certain percentage (e.g., 85%).
  • Triggering a backup when the number of snapshots of a fileserver exceeds a particular threshold.
  • Triggering a backup when the number of logged in users falls below a certain threshold. (For example, on development servers.)
  • Triggering a backup of a database server whenever a new database is added.
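
As a sketch of the first of those ideas, a capacity-triggered probe could look something like the following. It follows the same exit code convention as the sample script earlier (0 for “backup required”, 1 for “no action”). The FS and THRESHOLD values and both helper names are assumptions for illustration, and the purge side of things is not covered at all.

```shell
#!/bin/bash
# Sketch of a capacity-triggered probe. FS and THRESHOLD are illustrative
# values, not taken from any real configuration.

FS="${FS:-/}"                    # in practice, the archived redo log filesystem
THRESHOLD="${THRESHOLD:-85}"     # percent used above which a backup is wanted

# Print the used percentage of the filesystem holding "$1" as a bare number.
fs_used_pct() {
   # df -P gives one POSIX-format line per filesystem; column 5 is "Capacity".
   df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when used capacity exceeds the threshold.
over_threshold() {
   [ "$1" -gt "$2" ]
}

# As the body of a probe script:
#   if over_threshold "$(fs_used_pct "$FS")" "$THRESHOLD"; then
#      exit 0    # backup required
#   else
#      exit 1    # no action required
#   fi
```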

Trust me: probe based backups are going to make your life easier.


* There currently appears to be a “feature” with probe based backups where changes to the probe interval only take place after the next “probe start time”. I need to do some more review on this and see whether it’s (a) true and (b) warrants logging a case.

 

Over at Grumpy Storage, there’s currently a fantastic piece about the sorry state of how Requests for Enhancement are handled by most vendors. In the post, we see a proposal of how vendors might improve how RFEs are accepted and worked on seriously.

Obviously my blog has an EMC bent, but I work across a great many products, and the one thing I’ll say about most vendors, whether they’re OS vendors, hardware vendors or software vendors, is that they share one common attribute:

A practically callous disregard for user input.

The most polite response I can think of to vendors who don’t treat RFEs as serious input is “bah humbug”.

Ignoring RFEs (or not working with them) is like the tail wagging the dog. It’s the company basically saying to the end users, “You don’t have a clue what you’re doing. You can’t possibly understand our product or our direction enough to provide valuable input.”

This isn’t to say that all RFEs are sensible. However, lumping all RFEs into the “sounds like s–t” basket simply because a few happen to be illogical (or are for features that already exist) is unfair to the average user who genuinely wishes to recommend enhancements to a product.

At IDATA, I’m the primary developer for IDATA Tools. My take on RFEs for these tools is that they are invaluable. They frequently point to usage scenarios that we hadn’t considered, and they demonstrate how customers need to extend their datazone administration for easier use. Wanting to ignore that would be … well, insane.

RFEs should be treasured. Good on Grumpy Storage for making the case so eloquently.

 

There was a recent discussion on the NetWorker mailing list as to whether some additional logging information that appeared in 7.4.x was worthwhile or whether it was worthless to the point of getting in the way of an administrator.

So that everyone is across what I’m talking about, the messages that started in 7.4.x are along the lines of:

nsrim: Only one browsable Full exists for saveset X. Its browse period is equal to retention period.

So here’s my take on the discussion: log files aren’t to be resented.

I recognise there’s a point where log files become either useless or waste people’s time. However, there’s really only one time for this – when the exact same information is needlessly repeated. In the case of these log messages though, it’s not the exact same information needlessly repeated. It’s different information – it’s going to be about a different saveset each time.

What is the message about, you may be wondering? Well, I actually don’t 100% know for sure. My suspicion is that it’s a message introduced to deal with processing saveset retention following changes introduced for pool based retention policies. But it doesn’t matter.

One thing that will drive me nuts with just about any product is encountering an issue where there’s insufficient logging to actually work out what is going on. Obviously, there’s a fine line to walk – log too much and you waste space and potentially reveal too much about the IP of the package. However, log too little and it becomes extremely challenging for the people doing support, the people who write the patches, or the people who wrote the software to resolve an issue. I don’t believe that having accurate logs guarantees quickly resolving an issue, but they certainly help – and not having them certainly hinders.

So my point is – don’t resent your log files. The amount of space they generally take up in NetWorker is quite minimal (compared to say, the index region), and so you shouldn’t be concerned about space. Nor, I’ll insist, should you be concerned about how to go about stripping out messages you don’t need to review when scanning log files. Backup administrators of enterprise products in particular should be quite conversant with log analysis and text extraction.
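
By way of example, stripping out the nsrim message quoted earlier takes one line of grep. This is just a sketch: the function name is mine, it reads log text on standard input, and the log path in the usage comment is an assumption that will vary by version and platform.

```shell
#!/bin/bash
# Sketch: drop a known, already-understood message from log text on stdin.
# The pattern is the fixed part of the nsrim message quoted earlier.

strip_browsable_full_msgs() {
   # -v inverts the match; -F treats the pattern as a fixed string, not a regex.
   grep -vF 'Only one browsable Full exists for saveset'
}

# Example (the log path is an assumption):
#   strip_browsable_full_msgs < /nsr/logs/daemon.log | less
```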

If those extra logged entries allow me to quickly find something in a Knowledge Base, or similarly allow support to find something quickly in an engineering database, or allow a patch developer to isolate the section of code that causes the problem, or allow the core developer to target the section of code to write an enhancement, it’s fantastic, and well worth the extra few bytes here and there that they occupy on my filesystems.

 

A while ago, I made a posting about a long-running annoyance I have with directive management in NetWorker.

This time I want to expand slightly upon it, thanks mainly to some recent discussions with customers that pointed out an obvious and annoying additional lack of flexibility in directive management.

That’s to do with the complete inability to apply directives (particularly the skip or the null directive) against “special” savesets. By “special” I’m referring to savesets that aren’t part of standard filesystem backups yet are still effectively just a bunch of files.

Such as say:

  • SYSTEM FILES:
  • SYSTEM STATE:
  • ASR:

(And so on.)

In short, NetWorker provides no way of skipping these savesets while still using the “All” special saveset for Windows clients. You can’t do any of the following:

  • Hand craft a server-side directive
  • Hand craft a client-side directive
  • Use the directive management option in the client GUI (winworkr) to create a directive to skip these styles of savesets.

OK, the last point is just slightly inaccurate. Yes, you can create the directive using this method – but:

  • The created directive is not honoured, either when left as is, or by transferring to a more standard directive;
  • The created directive is “lost” when you next load winworkr’s directive management option. Clearly it lets you create directives that aren’t valid and it subsequently won’t deal with.

Why does this suck? For a very important reason – in some situations you don’t want to have to back these up, or you can’t back them up. For instance, on clusters at certain OS levels and bitness, you will get an error if you try to back up the ASR: saveset.

This creates a requirement to either:

  1. Accept that you’ll get an error every day in your backup report (completely unacceptable)
  2. Switch from exclusionary backups to inclusionary backups (highly unpalatable and risky)

Clearly, then, the only real option is the second, not the first. This, though, effectively removes an error by introducing poor backup systems management into the environment.

It would be nice if this problem “went away”.

© 2012 The NetWorker Blog