One of the core concepts I try to drive home in my book is that you don’t get a backup system by installing enterprise backup software.

Here’s a diagram to help explain what really goes into making a backup system:

Backup system

In short, you can have as much technology as you want, but without the rest of those pieces all you’ve got is a budget sink-hole.

If you want to understand how all these concepts fit together, you really should take the time to invest in my book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy“.

 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abaddon was at a University that instituted filesystem quotas, but due to their interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory and email it to himself, then delete the contents and start afresh. Violá! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abaddon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that is deleted but known to be required later builds to the point where its not physically possible to have all of it residing on primary storage any longer should it be required. All we do is create a new headache while implementing at best a crude workaround.

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart idea to avoiding archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive :-)

 

This morning we went to the funeral of our best friends’ father. It was, as funerals go, a lovely service and after the funeral and the burial we headed off to the wake, only to have someone’s hilux slam into the driver’s side of our car on a tight bend. They’d skidded and come onto the wrong side of the road by just enough, given the tight corner, to make the impact. Thankfully speed, alcohol or drugs weren’t in play, just the wet, and even more importantly, no-one was injured. Dignity will be lost any time it’s driven without a passenger though – the driver’s door can’t be opened from the inside:

Alas, poor car, I hardly knew ye

The case is with the insurers, and we’re waiting for an assessment next Tuesday to find out whether the car will be repaired or written off. It would be a shame if it’s written off; it’s a Toyota Avalon, circa 2001, and while those cars were frumpy they were damn good cars. With only around 120,000km on the clock it’s not really all that old. About 3 or 4 years ago it was almost completely totalled in a massive hail storm on the central coast; as I recall the repair was in the order of about $12,000, and it only scraped through for repair on an insurance value of around $14,000. Now, with insurance of $7,500 and the repair estimate saying that it’ll top $5,000, age is against the car and it doesn’t look good.

But, this blog isn’t about my hassles, or my car.

It is however about insurance, and insurance is something I’ll be dealing with quite a bit over the coming days. Or I will be, once we hit next Tuesday and the car gets checked out by the assessors.

When we think of “backup as insurance”, there’s some fairly close analogies:

  • Backup is insurance because it’s about having a solution when something goes wrong;
  • Making a claim is performing a recovery;
  • Your excess is how easy (or hard) it is to make a recovery.

Given what’s happened today, it made me wonder what the analogy to “written off” is. That’s a little bit more unpleasant to deal with, but it’s still something that has to be considered.

In this case I’d suggest that the analogy for the insured item being “written off” is one of the following:

  • Having clonesseems simple, but if one recovery fails due to media, having clones that you can recover from instead are the cheapest, logical solution.
  • Having an alternate recovery strategy – so for items with really high availability requirements or minimal data loss requirements, this would refer to having some other replica system in place.
  • Having insurance that can get you through the worst of events – sometimes no matter what you do to protect yourself, you can have a disaster that exceeds all your preparation. So in the absolute worst case scenario, you need something that will help you pay your bills, or ameliorate your building debt while you get yourself back on-board.

Of course, it remains preferable to not have to rely on any of these options, but the case remains that it’s always important to have an idea what your “worst case scenario” recovery situation will be. If you haven’t prepared for one, I’ll suggest what it’s likely to be: going out of business. Yes, it’s that critical that you have an idea what you’ll do in a worst-case scenario. It’s not called “business continuity” for the heck of it – when that critical situation occurs, not having plans usually results in the worst kind of failure.

Me? I’ll be visiting a few car-yards on the weekend to scope up what options I have in the event the car gets written off on Tuesday.

 

While I touched on this in the second blog posting I made (Instantiating Savesets), it’s worthwhile revisiting this topic more directly.

Using ADV_FILE devices can play havoc with conventional tape rotation strategies; if you aren’t aware of these implications, it could cause operational challenges when it comes time to do recovery from tape. Let’s look at the lifecycle of a saveset in a disk backup environment where a conventional setup is used. It typically runs like this:

  1. Backup to disk
  2. Clone to tape
  3. (Later) Stage to tape
  4. (At rest) 2 copies on tape

    Looking at each stage of this, we have:

    Saveset on ADV_FILE deviceThe saveset, once written to an ADV_FILE volume, has two instances. The instance recorded as being on the read-read only part of the volume will have an SSID/CloneID of X/Y. The instance recorded as being on the read-write part of the volume will have an SSID/CloneID of X/Y+1. This higher CloneID is what causes NetWorker, upon a recovery request, to seek the “instance” on the read-only volume. Of course, there’s only one actual instance (hence why I object so strongly to the ‘validcopies’ field introduced in 7.6 reporting 2) – the two instances reported are “smoke and mirrors” to allow simultaneous backup to and recovery from an ADV_FILE volume.

    The next stage sees the saveset cloned:

    ADV_FILE + Tape CloneThis leaves us with 3 ‘instances’ – 2 physical, one virtual. Our SSID/CloneIDs are:

    • ADV_FILE read-only: X/Y
    • ADV_FILE read-write: X/Y+1
    • Tape: X/Y+n, where n > 1.

    At this point, any recovery request will still call for the instance on the read-only part of the ADV_FILE volume, so as to help ensure the fastest recovery initiation.

    At some future point, as disk capacity starts to run out on the ADV_FILE device, the saveset will typically be staged out:

    ADV_FILE staging to tapeAt the conclusion of the staging operation, the physical + virtual instances of the saveset on the ADV_FILE device are removed, leaving us with:

    Savesets on tape only

    So, at this point, we end up with:

    • A saveset instance on a clone volume with SSID/CloneID of: X/Y+n.
    • A saveset instance on (typically) a non-clone volume with SSID/CloneID of: X/Y+n+m, where m > 0.

    So, where does this leave us? (Or if you’re not sure where I’ve been heading yet, you may be wondering what point I’m actually trying to make.)

    Note what I’ve been saying each time – NetWorker, when it needs to read from a saveset for recovery purposes, will want to pick the saveset instance with the lowest CloneID. At the point where we’ve got a clone copy and a staged copy, both on tape, the clone copy will have the lowest CloneID.

    The net result is that NetWorker will, in these circumstances, when both tapes aren’t online, request the clone volume for recovery – even though in an extreme number of cases, this will be the volume that’s offsite.

    For NetWorker versions 7.3.1 and lower, there was only one solution to this – you had to hunt down the actual clone saveset instances NetWorker was asking for, mark them as suspect, and reattempt the recovery. If you managed to mark them all as suspect, then you’d be able to ‘force’ NetWorker into facilitating the recovery from the volume(s) that had been staged to. However, after the recovery you had to make sure you backed out of those changes, so that both the clones and the staged copies would be considered not-suspect.

    Some companies, in this situation, would instigate a tape rotation policy such that clone volumes would be brought back from off-site before savesets were likely to be staged out, with subsequently staged media sent offsite. This has a dangerous side-effect of temporarily leaving all copies of backups on-site, jeapordising disaster recovery situations, and hence it’s something that I couldn’t in any way recommend.

    The solution introduced around 7.3.2 however is far simpler – a mminfo flag called offsite. This isn’t to be confused with the convention of setting a volume location field to ‘offsite’ when the media is removed from site. Annoyingly, this remains unqueryable; you can set it, and NetWorker will use it, but you can’t say, search for volumes with the ‘offsite’ flag set.

    The offsite flag has to be manually set, using the command:

    # nsrmm -o offsite volumeName

    (where volumeName typically equals the barcode).

    Once this is set, then NetWorker’s standard saveset (and therefore volume) selection criteria is subtly adjusted. Normally if there are no online instances of a saveset, NetWorker will request the saveset with the lowest CloneID. However, saveset instances that are on volumes with the offsite flag set will be deemed ineligible and NetWorker will look for a saveset instance that isn’t flagged as being offsite.

    The net result is that when following a traditional backup model with ADV_FILE disk backup (backup to disk, clone to tape, stage to tape), it’s very important that tape offsiting procedures be adjusted to set the offsite flag on clone volumes as they’re removed from the system.

    The good news is that you don’t normally have to do anything when it’s time to pull the tape back onsite. The flag is automatically cleared* for a volume as soon as it’s put back into an autochanger and detected by NetWorker. So when the media is recycled, the flag will be cleared.

    If you come from a long-term NetWorker site and the convention is still to mark savesets as suspect in this sort of recovery scenario, I’d suggest that you update your tape rotation policies to instead use the offsite flag. If on the other hand, you’re about to implement an ADV_FILE based backup to disk policy, I’d strongly recommend you plan in advance to configure a tape rotation policy that uses the offsite flag as cloned media is sent away from the primary site.


    * If you did need to explicitly clear the flag, you can run:

    # nsrmm -o notoffsite volumeName

    Which would turn the flag back off for the given volumeName.

     

    My boss, on his blog, has raised a pertinent question – if it’s so important, according to some vendors, that backup and archive are all achieved through the same product interface, then how many companies out there assign the role of archive administrator to the backup administrator? (Or vice versa).

    I like this question; it’s kind of like the old conundrum of whether the dog wags the tail, or whether the tail wags the dog. That is, are companies that heavily push an integrated backup and archive interface:

    • Responding to the needs of IT to meet current desired business functionality, or,
    • Are they trying to drive IT in a way that perhaps doesn’t meet desired business functionality?

    (Or indeed, something else entirely).

    [Edit, further thoughts, 2010-03-03] I’ve been thinking more about this, and I have to say I can’t think of a single customer environment off-hand where the backup administrator is also responsible for archiving. Archiving seems to remain primarily the purdue of the storage administration teams in sites that I’m aware of, so it does beg the question – how beneficial is an integrated backup and archive administration process?

    [Original wrap-up] So if you’ve got any thoughts on the integration of backup and archive administration, either at the software or the human resources layer, I’d encourage you to jump across to Mike’s blog and make your voice heard.

    (As a first, I’ve disabled comments on this blog posting, so as to encourage discussion to remain in one location – the source article.)

     

    So you’re a busy backup administrator and you’re getting ready to go on leave. It’s 4pm on your final day before the holiday, you’ve finally got everything off your plate, and you think to yourself, “Now I’ve finally got the time, I’ll just quickly upgrade NetWorker before I leave.”

    This unfortunately is an alternative of that Friday change rule violation known as POETS.

    There’s three distinctly wrong things with this scenario:

    • Infrastructure upgrade done without change control.
    • Infrastructure upgrade done at the last minute.
    • Infrastructure upgrade done without follow-up monitoring.

    Any one of those scenarios is enough to cause a nightmare situation – either for yourself, getting call-outs when you’re meant to be on holidays, or for your colleagues, left in the lurch after you switch your phone off for two weeks and go on a holiday to the East Islands.

    All three though? That’s just asking for trouble.

    (This lesson doesn’t actually just apply to NetWorker – it applies across the board for system, application and storage administration. Don’t modify the system just before going away for a while.)

    Just before this holiday season, I had a customer upgrade* their NetWorker server from 7.3.x to 7.5 before going on leave. Not 7.5.1, not 7.5.1.8, 7.5. This didn’t go so well, and a few days later when the fill-in administrators noticed the issue**, there was a bit of work to rectify the various issues and some backups during that time didn’t work.

    This however is by no means unique. Following Twitter I noticed one on-call person suffer a hideous xmas day and following day working on a call-out from what appeared to be an untested change done by someone else before that other person went on holiday.

    And non-betting man that I am, I’d bet a considerable wad of money (and win) that this fellow’s experience wasn’t unique for IT workers over xmas 2009.

    In short: choosing to do an untested/uncontrolled upgrade just before going on holidays can be either self-destructive or selfish (or even both) – it may lose your your holiday, depending on the level of the fail and the backup (or lack thereof) within your company, or it may cause a colleague to have an insufferably unpleasant time. (Alternatively, if you can be reached, it may result in you having a bad time on your holiday in order to help out a colleague having a bad time as well.)

    The problem with rushing through upgrades at the last minute is that they tend to be poorly done, even if they seem simple enough. Even if change control is being followed, if that change control has been rushed through (as it can sometimes be done as a “last minute” activity), then it provides no guarantee that the change will work smoothly. And don’t forget: Murphy’s Law works in the datacentre as well. Something that looks easy, that you should be able to do with your eyes closed, when done as a rush job at the last minute can come unstuck quite easily.

    So please – for your sake, for your colleagues sake, for NetWorkers’ sake and for the sake of your company: please don’t upgrade just before you go on holidays.


    * upgrade = “update” in NetWorker speak

    ** Which should serve as a reminder that you should never only have one backup administrator.

     

    Over the years I’ve dealt with a lot of different environments, and a lot of different usage requirements for backup products. Most of these fall into the “appropriate business use” categories. Some fall into the “hmmm, why would you do that?” category. Others fall into the “please excuse my brain it’s just scuttled off into the corner to hide – tell me again” category.

    This is not about the people, or the companies, but the crazy ideas that sometimes get hold within companies that should be watched for. While I could have expanded this list to cover a raft of other things outside of backups, I’ve forced myself to just keep it to the backup process.

    In no particular order then, these are the crazy things I never want to hear again:

    1. After the backups, I delete all the indices, because I maintain a spreadsheet showing where files are, and that’s much more efficient than proprietary databases.
    2. We just backup /etc/passwd on that machine.
    3. But what about /etc/shadow? (My stupid response to the above statement, blurted after by brain stalled in response to statement #2)
    4. Oh, hadn’t thought about that (In response to #3).
    5. Can you fax me some cleaning cartridge barcodes?
    6. To save money on barcodes at the end of every week we take them off the tapes in the autochanger and put them on the new ones about to go in.
    7. We only put one tape in the autochanger each night. We don’t want <product> to pick the wrong tape.
    8. We need to upgrade our tape drives. All our backups don’t fit on a single tape any more. (By same company that said #7.)
    9. What do you mean if we don’t change the tape <product> won’t automatically overwrite it? (By same company that said #7 and #8.)
    10. Why would I want to match barcode labels to tape labels? That’s crazy!
    11. That’s being backed up. I emailed Jim a week ago and asked him to add it to the configuration. (Shouted out from across the room: “Jim left last month, remember?”)
    12. We put disk quotas on our academics, but due to government law we can’t do that to their mail. So when they fill up their home directories, they zip them up and email it to themselves then delete it all.
    13. If a user is dumb enough to delete their file, I don’t care about getting it back.
    14. Every now and then on a Friday afternoon my last boss used to delete a filesystem and tell us to have it back by Monday as a test of the backup system.
    15. What are you going to do to fix the problem? (Final question asked by an operations manager after explaining (a) robot was randomly dropping tapes when picking them from slots; (b) tapes were covered in a thin film of oily grime; (c) oh that was probably because their data centre was under the area of the flight path where planes are advised to dump excess fuel before landing; (d) fuel is not being scrubbed by air conditioning system fully and being sucked into data centre; (e) me reminding them we just supported the backup software.)

    I will say that numbers #1 and #15 are my personal favourites for crazy statements.

     

    Something I’ve periodically mentioned to various people over the years is that when it comes to data protection, simplicity is King. This can be best summed up with the following rule to follow when designing a backup system:

    If you can’t summarise your backup solution on the back of a napkin, it’s too complicated.

    Now, the first reaction a lot of people have to that is “but if I do X and Y and Z and then A and B on top, then it’s not going to fit, but we don’t have a complex environment”.

    Well, there’s two answers to that:

    1. We’re not talking a detailed technical summary of the environment, we’re talking a high level overview.
    2. If you still can’t give a high level overview on the back of a napkin, it is too complicated.

    Another way to approach the complexity issue, if you happen to have a phobia about using the back of a napkin is – if you can’t give a 30 second elevator summary of your solution, it’s too complicated.

    If you’re struggling to think of why it’s important you can summarise your solution in such a short period of time, or such limited space, I’ll give you a few examples:

    1. You need to summarise it in a meeting with senior management.
    2. You need to summarise it in a meeting with your management and a vendor.
    3. You’ve got 5 minutes or less to pitch getting an upgrade budget.
    4. You’ve got a new assistant starting and you’re about to go into a meeting.
    5. You’ve got a new assistant starting and you’re about to go on holiday.
    6. You’ve got consultant(s) (or contractors) coming in to do some work and you’re going to have to leave them on their own.
    7. The CIO asks “so what is it?” as a follow-up question when (s)he accosts you in the hallway and asks, “Do we have a backup policy?”

    I can think of a variety of other reasons, but the point remains – a backup system should not be so complex that it can’t be easily described. That’s not to ever say that it can’t either (a) do complex tasks or (b) have complex components, but if the backup administrator can’t readily describe the functioning whole, then the chances are that there is no functioning whole, just a whole lot of mess.

     

    You might think, given that I wrote an article awhile ago about the Procedural Obligations of Backup Administrators that it wouldn’t be necessary to explicitly spell out any recovery rules – but this isn’t quite the case. It’s handy to have a “must follow” list of rules for recovery as well.

    In their simplest form, these rules are:

    1. How
    2. Why
    3. Where
    4. When
    5. Who

    Let’s look at each one in more detail:

    1. How – Know how to do a recovery, before you need to do it. The worst forms of data loss typically occur when a backup routine is put in place that is untried on the assumption that it will work. If a new type of backup is added to an environment, it must be tested before it is relied on. In testing, it must be documented by those doing the recovery. In being documented, it must be referenced by operational procedures*.
    2. Why – Know why you are doing a recovery. This directly affects the required resources. Are you recovering a production system, or a test system? Is it for the purposes of legal discovery, or because a database collapsed?
    3. Where – Know where you are recovering from and to. If you don’t know this, don’t do the recovery. You do not make assumptions about data locality in recovery situations. Trust me, I know from personal experience.
    4. When – Know when the recovery needs to be completed by. This isn’t always answered by the why factor – you actually need to know both in order to fully schedule and prioritise recoveries.
    5. Who – Know who requested the recovery is authorised to do so. (In order to know this, there should be operational recovery procedures – forms and company policies – that indicate authorisation.)

    If you know the how, why, where, when and who, you’re following the golden rules of recovery.


    * Or to put it another way – documentation is useless if you don’t know it exists, or you can’t find it!

     

    When I first mentioned probe based backups a while ago, I suggested that they’re going to be a bit of a sleeper function – that is, I think they’re being largely ignored at the moment because people aren’t quite sure how to make use of them. My take however is that over time we’re going to see a lot of sites shifting particular backups over to probe groups.

    Why?

    Currently a lot of sites shoe-horn ill-fitting backup requirements into rigid schedules. This results in frequent violations of the best practices approach to backup of Zero Error Policies. Here’s a prime example: for those sites that need to do laptop and/or desktop backups using NetWorker, the administrators are basically resigned on those sites to having failure rates in such groups of 50% or more depending on how many machines are currently not connected to the network.

    This doesn’t need to be the case – well, not any more thanks to probe based backups. So, if you’ve been scratching your head looking for a practical use for these backups, here’s something that may whet your appetite.

    Scenario

    Let’s consider a site where there are group of laptops and desktops that are integrated into the NetWorker backup environment. However, there’s never a guarantee of which machines may be connected to the network at any given time. Therefore administrators typically configure laptop/desktop backup groups to start at say, 10am, on the premise that the most systems are likely to be available at that time.

    Theory of Resolution

    Traditional time-of-day start backups aren’t really appropriate to this scenario. What we want is a situation where the NetWorker server waits for those infrequently connected clients to be connected, then runs a backup at the next opportunity.

    Rather than having a single group for all clients and accepting that the group will suffer significant failure rates, split each irregularly connected client into its own group, and configure a backup probe.

    The backup system will loop probes of the configured clients during nominated periods in the day/night at regular intervals. When the client is connected to the network and the probe successfully returns that (a) the client is running and (b) a backup should be done, the backup is started on the spot.

    Requirements

    In order to get this working, we’ll need the following:

    • NetWorker 7.5 or higher (clients and server)
    • A probe script – one per operating system type
    • A probe resource – one per operating system type
    • A 1:1 mapping between clients of this type and groups.

    Practical Application

    Probe Script

    This is a command which is installed on the client(s), in the same directory as the “save” or “save.exe” binary (depending on OS type), and starts with either nsr or save. I’ll be calling my script:

    nsrcheckbackup.sh

    I don’t write Windows batch scripts. Therefore, I’ll give an example as a Linux/Unix shell script, with an overview of the program flow. Anyone who wants to write a batch script version is welcome to do so and submit it.

    The “proof of concept” algorithm for the probe script works as follows:

    • Establish a “state” directory in the client nsr directory called bckchk. I.e., if the directory doesn’t exist, create it.
    • Establish a “README” file in that directory for reference purposes, if it doesn’t already exist.
    • Determine the current date.
    • Check for a previous date file. If there was a previous date file:
      • If the current date equals the previous date found:
        • Write a status file indicating that no backup is required.
        • Exit, signaling that no backup is required.
      • If the current date does not equal the previous date found:
        • Write the current date to the “previous” date file.
        • Write a status file indicating that the current date doesn’t match the “previous” date, so a new backup is required.
        • Exit, signaling that a backup is required.
    • If there wasn’t a previous date file:
      • Write the current date to the “previous” date file.
      • Write a status file indicating that no previous date was found so a backup will be signaled.
      • Exit, signaling backup should be done.

    Obviously, this is a fairly simplistic approach, but is suitable for a proof of concept demonstration. If you were wishing to make the logic more robust for production deployment, my first suggestion would be to build in mminfo checks to determine (even if the dates match), whether there has been a backup “today”. If there hasn’t, that would override and force a backup to start. Additionally, if users can connect via VPN and the backup server can communicate with connected clients, you may want to introduce some logic into the script to deny probe success over the VPN.

    If you were wanting a OS independent script for this, you may wish to code in Perl, but I’ve hung off doing that in this case simply because a lot of sites have reservations about installing Perl on Windows systems. (Sigh.)

    Without any further guff, here’s the sample script:

    preston@aralathan ~
    $ cat /usr/sbin/nsrcheckbackup.sh
    #!/bin/bash
    
    PATH=$PATH:/bin:/sbin:/usr/sbin:/usr/bin
    CHKDIR=/nsr/bckchk
    
    README=`cat <<EOF
    ==== Purpose of this directory ====
    
    This directory holds state file(s) associated with the probe based
    laptop/desktop backup system. These state file(s) should not be
    deleted without consulting the backup administrator.
    EOF
    `
    
    if [ ! -d "$CHKDIR" ]
    then
       mkdir -p "$CHKDIR"
    fi
    
    if [ ! -f "$CHKDIR/README" ]
    then
       echo $README > "$CHKDIR/README"
    fi
    
    DATE=`date +%Y%m%d`
    COMPDATE=`date "+%Y%m%d %H%M%S"`
    LASTDATE="none"
    STATUS="$CHKDIR/status.txt"
    CHECK="$CHKDIR/datecheck.lck"
    
    if [ -f "$CHECK" ]
    then
       LASTDATE=`cat $CHKDIR/datecheck.lck`
    else
       echo $DATE > "$CHECK"
       echo "$COMPDATE Check file did not exist. Backup required" > "$STATUS"
       exit 0
    fi
    
    if [ -z "$LASTDATE" ]
    then
       echo "$COMPDATE Previous check was null. Backup required" > "$STATUS"
       echo $DATE > "$CHECK"
       exit 0
    fi
    
    if [ "$DATE" = "$LASTDATE" ]
    then
       echo "$COMPDATE Last backup was today. No action required" > "$STATUS"
       exit 1
    else
       echo "$COMPDATE Last backup was not today. Backup required" > "$STATUS"
       echo $DATE > "$CHECK"
       exit 0
    fi

    As you can see, there’s really not a lot to this in the simplest form.

    Once the script has been created, it should be made executable and (for Linux/Unix/Mac OS X systems), be placed in /usr/sbin.

    Probe Resource

    The next step is, within the NetWorker, to create a probe resource. This will be shared by all the probe clients of the same operating system type.

    A completed probe resource might resemble the following:

    Configuring the probe resource

    Configuring the probe resource

    Note that there’s no path in the above probe command – that’s because NetWorker requires the probe command to be in the same location as the save command.

    Once this has been done, you can either configure the client or the probe group next. Since the client has to be reconfigured after the probe group is created, we’ll create the probe group first.

    Creating the Probe Groups

    First step in creating the probe groups is to come up with a standard so that they can be easily identified in relation to all other standard groups within the overall configuration. There are two approaches you can take towards this:

    • Preface each group name with a keyword (e.g., “probe”) followed by the host name the group is for.
    • Name each group after the client that will be in the group, but set a comment along the lines of say, “Probe Backup for <hostname>”.

    Personally, I prefer the second option. That way you can sort by comment to easily locate all probe based groups but the group name clearly states up front which client it is for.

    When creating a new probe based group, there are two tabs you’ll need to configure – Setup and Advanced – within the group configuration. Let’s look at each of these:

    Probe group configuration – Setup Tab

    Probe group configuration – Setup Tab

    You’ll see from the above that I’m using the convention where the group name matches the client name, and the comment field is configured appropriately for easy differentiation of probe based backups.

    You’ll need to set the group to having an autostart value of Enabled. Also, the Start Time field does have relevance exactly once for probe based backups – it still seems to define the first start time of the probe. After that, the probe backups will follow the interval and start/finish times defined on the second tab.

    Here’s the second tab:

    Probe Group - Advanced Tab

    Probe Group - Advanced Tab

    The key thing on this obviously is the configuration of the probe section. Let’s look at each option:

    • Probe based group – Checked
    • Probe interval – Set in minutes. My recommendation is to have each group a different number of minutes. (Or at least reduce the number of groups that have exactly the same probe interval.) That way over time as probes run, there’s less likelihood of multiple groups starting at the same time. For instance, in my test setup, I have 5 clients, set to intervals of 90 minutes, 78 minutes, 104 minutes, 82 minutes and 95 minutes*.
    • Probe start time – Time of day that probing starts. I’ve left this on the defaults, which may be suitable for desktops, but for laptops where there’s a very high chance of machines being disconnected of a night time, you may wish to start probing closer to the start of business hours.
    • Probe end time – Time of day that NetWorker stops probing the client. Same caveats as per the probe start time above.
    • Probe success criteria – Since there’s only one client per group, you can leave this at all.
    • Time since successful backup – How many days NetWorker should allow probing to run unsuccessfully before it forcibly sets a backup running. If set to zero it will never force a backup running. I’ve actually changed, since I took the screen-shot, that value, and set it to 3 on my configured clients. Set yours to a site-optimal value. Note that since the aim is to run only one backup every 24 hours, setting this to “1″ is probably not all that logical an idea.

    (The last field, “Time of the last successful backup” is just a status field, there’s nothing to configure there.)

    If you have schedules enforced out of groups, you’ll want to set the schedule up here as well.

    With this done, we’re ready to move onto the client configuration!

    Configuring the Client for Probe Backups

    There’s two changes required here. In the General tab of the client properties, move the client into the appropriate group:

    Adding the client to the correct group

    Adding the client to the correct group

    In the “Apps & Modules” tab, identify the probe resource to be used for that client:

    Configuring the client probe resource

    Configuring the client probe resource

    Once this has been done, you’ve got everything configured, and it’s just a case of sitting back and watching the probes run and trigger backups of clients as they become available. You’ll note, in the example above, that you can still use savepnpc (pre/post commands) with clients that are configured for probe backups. The pre/post commands will only be run if the backup probe confirms that a backup should take place.

    Wrapping Up

    I’ll accept that this configuration can result in a lot of groups if you happen to have a lot of clients that require this style of backup. However, that isn’t the end of the world. Reducing the number of errors reported in savegroup completion notifications does make the life of backup administrators easier, even if there’s a little administrative overhead.

    Is this suitable for all types of clients? E.g., should you use this to shift away from standard group based backups for the servers within an environment? The answer to that is a big unlikely. I do really see this as something that is more suitable for companies that are using NetWorker to backup laptops and/or desktops (or a subset thereof).

    If you think no-one does this, I can think of at least five of my customers alone who have requirements to do exactly this, and I’m sure they’re not unique.

    Even if you don’t particularly need to enact this style of configuration for your site, what I’m hoping is that by demonstrating a valid use for probe based backup functionality, I may get you thinking about where it could be used at your site for making life easier.

    Here’s a few examples I can immediately think of:

    • Triggering a backup+purge of Oracle archived redo logs that kick in once the used capacity of the filesystem the logs are stored on exceed a certain percentage (e.g., 85%).
    • Triggering a backup when the number of snapshots of a fileserver exceed a particular threshold.
    • Triggering a backup when the number of logged in users falls below a certain threshold. (For example, on development servers.)
    • Triggering a backup of a database server whenever a new database is added.

    Trust me: probe based backups are going to make your life easier.


    * There currently appears to be a “feature” with probe based backups where changes to the probe interval only take place after the next “probe start time”. I need to do some more review on this and see whether it’s (a) true and (b) warrants logging a case.

    © 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha