Introduction

I was lucky enough to get to work on the beta testing programme for NetWorker 7.6 SP1. While there are a bunch of new features in NW 7.6 SP1 (with the one most discussed by EMC being the Data Domain integration), I want to talk about three new features that I think are quite important, long term, in core functionality within the product for the average Joe.

These are:

  • Scheduled Cloning
  • AFTD enhancements
  • Checkpoint restarts

Each of these on their own represents significant benefit to the average NetWorker site, and I’d like to spend some time discussing the functionality they bring.

[Edit/Aside: You can find the documentation for NetWorker 7.6 SP1 in the updated Documentation area on nsrd.info]

Scheduled Cloning

In some senses, cloning has been the bane of the NetWorker administrator’s life. Up until NW 7.6 SP1, NetWorker has had two forms of cloning:

  • Per-group cloning, immediately following completion of backups;
  • Scripted cloning.

A lot of sites use scripted cloning, simply due to the device/media contention frequently caused by per-group cloning. I know this well; since I started working with NetWorker in 1996, I’ve written countless NetWorker cloning scripts, and I am currently the primary programmer for IDATA Tools, which includes what I can only describe as a cloning utility on steroids (‘sslocate’).

Will those scripts and utilities go away with scheduled cloning? Well, I don’t think they’re always going to go away – but I do think that they’ll be treated more as utilities rather than core code for the average site, since scheduled cloning will be able to achieve much of the cloning requirements for companies.

I had heard that scheduled cloning was on the way long before the 7.6 SP1 beta, thanks mainly to one day getting a cryptic email along the lines of “if we were to do scheduled cloning, what would you like to see in it…” – so it was pleasing, when it arrived, to see that much of my wish list had made it in there. As a first-round implementation of the process, it’s fabulous.

So, let’s look at how we configure scheduled clones. First off, in NMC, you’ll notice a new menu item in the configuration section:

[Screenshot: Scheduled Cloning Resource, in menu]

This will undoubtedly bring joy to the heart of many a NetWorker administrator. If we then choose to create a new scheduled clone resource, we can create a highly refined schedule:

[Screenshot: Scheduled Clone Resource, 1 of 2]

Let’s look at those options first before moving onto the second tab:

  • Name and comment are pretty self-explanatory – nothing to see there.
  • Browse and retention – define, for the clone schedule, both the browse and retention time of the savesets that will be cloned.
  • Start time – Specify exactly what time the cloning is to start.
  • Schedule period – Weekly allows you to specify which days of the week the cloning is to run. Monthly allows you to specify which dates of the month the cloning will run.
  • Storage node – Allows you to specify which storage node the clone will write to. Great for situations where you have, say, an off-site storage node and you want the data streamed directly across to it.
  • Clone pool – Which pool you want to write the clones to – fairly expected.
  • Continue on save set error – This is a big help. Standard scripting of cloning will fail if one of the savesets selected to clone has an error (regardless of whether that’s a read error, or it disappears (e.g., is staged off) before it gets cloned, etc.) and you haven’t used the ‘-F’ option. Click this check box and the cloning will at least continue and finish all savesets it can hit in one session.
  • Limit number of save set clones – By default this is 1, meaning NetWorker won’t create more than one copy of the saveset in the target pool. This can be increased to a higher number if you want multiple clones, or it can be set to zero (for unlimited), which I wouldn’t see many sites having a need for.

Once you’ve got the basics of how often and when the scheduled clone runs, etc., you can move on to selecting what you want cloned:

[Screenshot: Scheduled Clone Resource, 2 of 2]

I’ve only just refreshed my lab server, so you can see that a bit of imagination is required in the above screen shot to flesh out what this may look like in a normal site. But you can choose savesets to clone based on:

  • Group
  • Client
  • Source Pool
  • Level
  • Name

or

  • Specific saveset ID/clone IDs

When specifying savesets based on group/client/level/etc., you can also specify how far back NetWorker is to look for savesets to clone. This avoids a situation whereby you might, say, enable scheduled cloning and suddenly have media from 3 years ago requested.
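To make the selection criteria a little more concrete, here is a minimal Python sketch of the filtering logic as I've described it. It is not NetWorker code: the `SaveSet` record, the `select_for_clone` function and all field names are hypothetical, and it only models matching on group/client/pool/level/name within a look-back window, plus the limit on the number of copies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SaveSet:
    """Hypothetical stand-in for a saveset record in the media database."""
    ssid: str
    group: str
    client: str
    pool: str
    level: str
    name: str
    saved_at: datetime
    copies_in_target_pool: int = 0


def select_for_clone(savesets, now, lookback_days, max_copies=1,
                     group=None, client=None, pool=None, level=None, name=None):
    """Return savesets matching every supplied criterion within the look-back window.

    max_copies models the "limit number of save set clones" option:
    0 means unlimited, otherwise savesets that already have that many
    copies in the target pool are skipped.
    """
    cutoff = now - timedelta(days=lookback_days)
    wanted = {"group": group, "client": client, "pool": pool,
              "level": level, "name": name}
    selected = []
    for ss in savesets:
        if ss.saved_at < cutoff:
            continue  # too old: avoids recalling media from years ago
        if any(v is not None and getattr(ss, k) != v for k, v in wanted.items()):
            continue  # fails at least one of the supplied criteria
        if max_copies and ss.copies_in_target_pool >= max_copies:
            continue  # already cloned as many times as allowed
        selected.append(ss)
    return selected
```

Criteria left as `None` are simply ignored, which is roughly how I think of the blank fields in the selection tab: the more you fill in, the narrower the set of savesets that gets cloned.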

You might wonder about the practicality of being able to schedule a clone for specific SSID/CloneID combos. I can imagine this would be particularly useful if you need to do ad-hoc cloning of a particularly large saveset. E.g., if you’ve got a saveset that’s say, 10TB, you might want to configure a schedule that would start specifically cloning this at 1am Saturday morning, with your intent being to then delete the scheduled clone after it’s finished. In other words, it’s to replace having to do a scheduled cron or at job just for a single clone.

Once configured, and enabled, scheduled cloning runs like a dream. In fact, it was one of the first things I tackled while doing beta testing, and almost every subsequent day I found myself thinking at 3pm “why is my lab server suddenly cloning? – ah yes, that’s why…”

AFTD Enhancements

There’s not a huge amount to cover in terms of AFTD enhancements – they’re effectively exactly the same enhancements that were rolled into NetWorker 7.5 SP3, which I’ve previously covered here. So, that means there are better volume selection criteria for AFTD backups, but we don’t yet have backups being able to continue from one AFTD device to another. (That’s still in the pipeline and being actively worked on, so it will come.)

Even this one key change – the way in which volumes are picked in AFTDs for new backups – will be a big boon for a lot of NetWorker sites. It will allow administrators to not focus so much on the “AFTD data shuffle”, as I like to consider it, and instead focus on higher level administration of the backup environment.

(These changes are effectively “under the hood”, so there’s not much I can show in the way of screen-shots.)

Checkpoint Restarting

When I first learned NetBackup, I immediately saw the usefulness of checkpoint restarting, and have been eager to see it appear in NetWorker since that point. I’m glad to say it’s appeared in (what I consider to be) a much more usable form. So what is checkpoint restarting? If you’re not familiar with the term, it’s where the backup product has regular points at which it can restart from, rather than having to restart an entire backup. Previously NetWorker has only done this at the saveset level, but that’s not really what the average administrator would think of when ‘checkpoints’ are discussed. NetBackup, last I looked at it, does this at periodic intervals – e.g., every 15 minutes or so.

Like almost everything in NetWorker, we get more than one way to run a checkpoint:

[Screenshot: Checkpoint restart options]

Within any single client instance you can choose to enable checkpoint restarting, with the restart options being:

  • Directory – If a backup failure occurs, restart from the last directory that NetWorker had started processing.
  • File – If a backup failure occurs, restart from the last file NetWorker had started processing.
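The difference between the two options can be sketched with a toy model. This is only an illustration of the granularity described above, not how NetWorker implements it: the `restart_index` function, the idea of a flat ordered list of paths, and the sample paths are all assumptions made for the example.

```python
import posixpath


def restart_index(paths, failed_at, granularity):
    """Return the index in `paths` from which a restarted backup would resume.

    paths       -- files in the order the backup walks them (assumed order)
    failed_at   -- index of the file being processed when the backup failed
    granularity -- "file" or "directory"
    """
    if granularity == "file":
        return failed_at  # redo only the file that was in flight
    if granularity == "directory":
        current_dir = posixpath.dirname(paths[failed_at])
        # Walk back to the first file of the directory that was in progress.
        i = failed_at
        while i > 0 and posixpath.dirname(paths[i - 1]) == current_dir:
            i -= 1
        return i
    raise ValueError(f"unknown granularity: {granularity!r}")


paths = [
    "/home/alice/a.txt",
    "/home/alice/b.txt",
    "/home/bob/c.txt",
    "/home/bob/d.txt",
    "/home/bob/e.txt",
]

# Suppose the backup fails while reading /home/bob/e.txt (index 4).
print(restart_index(paths, 4, "file"))       # 4
print(restart_index(paths, 4, "directory"))  # 2
```

In this toy model, directory-level restarts repeat up to a directory's worth of files, which is cheap when directories hold lots of small files but expensive when a single file is hundreds of gigabytes.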

Now, the documentation warns that with checkpoint enabled, you’ll get a slight performance hit on the backup process. However, that performance hit is nothing compared to the performance (and potentially media) hit you’d take if you’re 9.8TB through a 10TB saveset and the client is accidentally rebooted!

Furthermore, in my testing (which admittedly focused on savesets smaller than 10GB), I invariably found that with either file or directory level checkpointing enabled, the backup actually ran faster than the normal backup. So maybe it’s also based on the hardware you’re running on, or maybe that performance hit doesn’t come in until you’re backing up millions of files, but either way, I’m not prepared to say it’s going to be a huge performance hit for anyone yet.

Note what I said earlier though – this can be configured on a single client instance. This lets you configure checkpoint restarting even on the individual client level to suit the data structure. For example, let’s consider a fileserver that offers both departmental/home directories, and research areas:

  • The departmental/home directories will have thousands and thousands of directories – have a client instance for this area, set for directory level checkpointing.
  • The research area might feature files that are hundreds of gigabytes each – use file level checkpointing here.

When I’d previously done a blog entry wishing for checkpoint restarts (“Sub saveset checkpointing would be good“), I’d envisaged the checkpointing being done via the continuation savesets – e.g., “C:\”, “<1>C:\”, “<2>C:\”, etc. It hasn’t been implemented this way; instead, each time the saveset is restarted, a new saveset is generated of the same level, catering to whatever got backed up during that saveset. On reflection, I’m not the slightest bit hung up over how it’s been implemented, I’m just overjoyed to see that it has been implemented.

Now you’re probably wondering – does the checkpointing work? Does it create any headaches for recoveries? Yes, and no. As in, yes it works, and no, throughout all my testing, I wasn’t able to create any headaches in the recovery process. I would feel very safe about having checkpoint restarts enabled on production filesystems right now.

Bonus: Mac OS X 10.6 (“Snow Leopard”) Support

For some time, NetWorker has had some issues supporting Mac OS X 10.6, and it’s certainly caused some problems for various customers as this operating system continues to get market share in the enterprise. I was particularly pleased during a beta refresh to see an updated Mac OS X client for NetWorker. This works excellently for backup, recovery, installation, uninstallation, etc. I’d suggest on the basis of the testing I did that any site with OS X should immediately upgrade those clients at least to 7.6 SP1.

In Summary

The only major glaring question for me, looking at NetWorker 7.6 SP1 is the obvious: this has so many updates, and so many new features, way more than we’d see in a normal service pack – why the heck isn’t it NetWorker v8?

 

When something is going wrong in a NetWorker environment, the first thing you need to do is be able to run up some basic tests. If the issue has anything to do with NetWorker clients, you’ll want to be able to initiate a series of network, probe and index based tests. If you’ve got nothing scripted, ‘check-clients’ from IDATA Tools may very well be what you’re looking for.

As a command line tool, ‘check-clients’ can power through a suite of different tests and data gathering activities against your clients, all with very minimal effort on your part. Let’s look at the tests that are currently available:

[root@nox bin]# check-clients -l
Test Name           Test Description
------------------- ------------------------------------------------------
client_ids          Returns client ID for each configured client
empty               Report clients with empty indices
index               Perform nsrck -L3 on each client
index_rebuild       Perform nsrck -L6 on each client
info                Retrieve client information
list_active         List all configured clients in active groups
list_all            List all clients currently configured
performance         Check backup performance via bigasm
ping                Ping each client
probe               Savgroup probe for each client
resolution          Test/confirm name resolution
rpcinfo             Test rpcinfo/portmapper access
used_space          Calculates used space for backups

Now technically, not all of the above are actually tests as such – for instance, the used_space option was one recently requested by a customer to report on all backups currently held by a backup server for a client. Running it on one of my lab machines, the output looks like the following:

[root@nox bin]# check-clients -g all_active -t used_space
============================================================
Running test: used_space (Calculates used space for backups)
============================================================
        Client                         Used Space (GB)
        ----------------------------   --------------------
        archon                                    362.60783
        faero                                       0.00000
        luyten                                      0.00000
        nox                                       544.40887
        ----------------------------   --------------------
                 Total for 4 clients              907.01669
        ----------------------------   --------------------

To me, that’s a combo test/information gathering option; specifically the customer was after this particular test so that they could spot any newly added clients that hadn’t been backing up (i.e., by having a “Used Space” of 0 GB).
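If you wanted to automate that check, a small post-processing script would do it. The following is a minimal sketch, assuming the report layout is exactly as shown above; the `clients_with_no_backups` function is my own illustration and is not part of IDATA Tools.

```python
import re

# Report body copied from the used_space output shown above.
SAMPLE = """\
        Client                         Used Space (GB)
        ----------------------------   --------------------
        archon                                    362.60783
        faero                                       0.00000
        luyten                                      0.00000
        nox                                       544.40887
        ----------------------------   --------------------
                 Total for 4 clients              907.01669
        ----------------------------   --------------------
"""

# A data row is a single client name followed by a decimal number.
ROW = re.compile(r"^\s*(\S+)\s+(\d+\.\d+)\s*$")


def clients_with_no_backups(report):
    """Return client names whose used space is reported as zero."""
    empty = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match and float(match.group(2)) == 0.0:
            empty.append(match.group(1))
    return empty


print(clients_with_no_backups(SAMPLE))  # ['faero', 'luyten']
```

The header, separator and total lines don't match the row pattern, so only per-client rows are considered.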

Equally, there’s use in periodically running the “client_ids” test – running and keeping the output of this test will help in any sticky situation where you suddenly need access to a previous client’s client ID:

[root@nox bin]# check-clients -a -t client_ids
=======================================================================
Running test: client_ids (Returns client ID for each configured client)
=======================================================================
        aralathan = 65100d33-00000004-464fcacc-464fcacb-00050000-c0a86404
        archon = 3f33ca7b-00000004-43a4837c-43a484d7-00030000-c0a80006
        asgard = 00b151ed-00000004-43a4837b-43a4837a-00010000-c0a80006
        djwmp = 5560bbf6-00000004-4910cd4b-4910cd4a-01961a00-3d2a4f4b
        faero = 76c06b0a-00000004-453e8e44-453e8e43-00310000-c0a86406
        loki = d3f277da-00000004-4857452f-4857452e-00020000-c0a86404
        luyten = 93166424-00000004-4a2f8cde-4a2f8cdd-01041a00-3d2a4f4b
        nimrod = d6454919-00000004-496aaadc-496aaadb-006f1a00-3d2a4f4b
        nox = 85acae6f-00000004-464fbdd1-464fbdd0-00010000-c0a86404
        valhalla = 61d3ca1e-00000004-495525db-4955299a-00051500-98e71c17

Moving on into actual test territory, multiple tests can be teamed up to do a chunk of information gathering in one command. For instance, combining a ping test and a name resolution test against all active clients is as simple as:

[root@nox bin]# check-clients -g all_active -t ping,resolution
=====================================
Running test: ping (Ping each client)
=====================================
	archon  (0 responses, expected 4)
	faero  (0 responses, expected 4)
	luyten  (4 responses)
	nox.pmdg.lab  (4 responses)

=======================================================
Running test: resolution (Test/confirm name resolution)
=======================================================

	archon
		Name: archon (archon.pmdg.lab) (192.168.100.1) 
		Name: archon.pmdg.lab (archon.pmdg.lab) (192.168.100.1) 
		Addr: 192.168.100.1 (archon.pmdg.lab) 

	faero
		Name: faero (faero.pmdg.lab) (192.168.100.10) 
		Name: faero.pmdg.lab (faero.pmdg.lab) (192.168.100.10) 
		Addr: 192.168.100.10 (faero.pmdg.lab) 

	luyten
		Name: luyten (luyten.pmdg.lab) (192.168.100.18) 
		Name: luyten.pmdg.lab (luyten.pmdg.lab) (192.168.100.18) 
		Addr: 192.168.100.18 (luyten.pmdg.lab) 

	nox.pmdg.lab
		Name: nox.pmdg.lab (nox.pmdg.lab) (192.168.100.4) 
		Name: nox (nox.pmdg.lab) (192.168.100.4) 
		Addr: 192.168.100.4 (balrog.pmdg.lab (unknown))

None of this is re-inventing the wheel of course, but being able to just run a single command that cycles through and tests every active client (or even all clients) is particularly useful.

Even performance testing is catered for with check-clients; reaching out to the clients, the utility can run bigasm tests automatically – a great way for easily testing where performance hits are happening on the network. For example, a quick/basic demo of this option is below:

[root@nox bin]# check-clients -c luyten,nox.anywebdb.com -b Staging -S 50 -t performance
===============================================================
Running test: performance (Check backup performance via bigasm)
===============================================================
        luyten (Solaris/UNIX style test)
                Backup 50 MB to Staging
                50 MB took 12 seconds (4.17 MB/s)
        nox.pmdg.lab (Linux/UNIX style test)
                Backup 50 MB to Staging
                50 MB took 3 seconds (16.67 MB/s)

If you are looking around for a test kit option for NetWorker – and want access to a heap of other goodies at the same time – then ‘check-clients’ out of the IDATA Tools suite may very well be what you need.

 

Backup to disk has well and truly become entrenched as a core backup strategy in most companies. By “backup to disk” I’m referring to either of ADV_FILE devices or VTLs – i.e., the general notion of backing up first to disk. For the rest of the article, since I’m feeling a little lazy today, I’ll follow industry norm and call backup to disk by the generic “B2D”.

Now, in most companies, there’ll still be physical tape involved. Holding long-term backups on sufficiently replicated storage – even with deduplication – is going to remain costly for some time to come; but once B2D appears within an organisation, one of two architecture decisions will typically occur:

  1. B2D region designed to hold a “significant” nearline capacity, where “significant” refers to a business-appropriate amount of recent backups.
  2. B2D region designed as a “staging” region to have just enough capacity, where “just enough” means that if data isn’t staged daily (or near-daily), staging areas will become full and backups will stop.
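The “just enough” condition in the second design can be sketched as a simple capacity calculation. This is a rough model only: the `days_until_full` function and every figure in the example calls are hypothetical, and it assumes constant daily backup and staging volumes.

```python
def days_until_full(capacity_tb, used_tb, daily_backup_tb, daily_staged_tb):
    """Estimate how many whole days of backups still fit in the staging region.

    Returns None if staging keeps pace with backups (the region never fills).
    """
    net_growth = daily_backup_tb - daily_staged_tb
    if net_growth <= 0:
        return None
    free = capacity_tb - used_tb
    return max(0, int(free // net_growth))


# Hypothetical 10TB staging region, 6TB in use, 2TB of backups per day.
print(days_until_full(10, 6, 2, 2))  # None: staging keeps up
print(days_until_full(10, 6, 2, 0))  # 2: staging stalls, full in two days
```

The point of the sketch is how little margin a staging-only region has: as soon as staging falls behind, the countdown to blocked backups is measured in days.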

Having observed B2D regions designed as staging-only on several occasions now, I’m even more firmly convinced that B2D as staging is a false economy that fails to take into consideration a few key metrics. Sure, buying say, 5TB or 10TB of disk is cheaper than buying 40TB with deduplication, but the cost of storage doesn’t end with the purchase. In fact, since the actual dollar cost of storage is typically amortised out over its expected deployment time, that cost often ends up being pretty minimal.

There are three distinct costs that I see as evident when using B2D purely as a staging region. These are:

  • Staff time.
  • Physical wear and tear.
  • Increased risk of recovery failure.

Before I go further, I want to cover a term I used in the title of this post: “busy state staging”. It refers to environments where a significant portion of each day is spent with the B2D region being used to stage out from disk to physical tape, so as to free up room. There are probably four key activities a backup system can be doing at any one time. These are:

  • Backup
  • Recovery
  • Duplication/Cloning
  • Maintenance

Backup, recovery and cloning are all givens; maintenance functions encompass media import/export/labelling and configuration activities, and most definitely include staging. That’s right – staging is not any of backup, recovery or cloning; it falls into the category of moving data around in order to keep the system running. It’s effectively an overhead function for the environment, and as we know, the aim in any environment is to keep overheads to a minimum.

Over the expected deployment period of the B2D region in a backup system, I’d argue that those three costs previously cited add up to enough to demonstrate that the vast majority of businesses should not deploy B2D in a staging-only configuration. Let’s consider each of them individually.

Staff Time

This is the easiest to factor in. Let’s say your backup administrator has to spend roughly an hour a day between monitoring and maintaining free capacity on a staging-only B2D region. Now add up those hours per day, per week, per year across the lifetime of a deployment, and see how much it represents based on the hourly rate of the backup administrator. Assume $40 per hour, 4 weeks annual leave a year. So that leaves 48 weeks, 5 hours per week at $40 an hour. That’s $9,600 per year of staff costs through managing a poorly provisioned B2D region.

Usually that’s not the final cost in staff time though – my personal experience is that environments that use B2D for staging have a higher tendency to engage temporary contractors, etc., to fill in on projects that systems administration staff don’t have time for. So let’s assume that as a result of the backup administrator having to focus on B2D staging an hour a day, the organisation has to engage a contractor one week a year to make up the shortfall. Assuming a contracting rate of $80 per hour, that’s $3,200 per year.

Now, assuming B2D storage has been provisioned over a 3 year period, we’re adding $38,400 to the maintenance impact of a staging-only region.

My gut feel, by the way, is that in an appropriately provisioned B2D architecture, the backup administrator will spend at most one fifth of the time in B2D storage administration; and there won’t be a need to engage contractors for that reason. So that $38,400 cost would shrink to, say, $5,760 of time. In anyone’s books, a saving of around 85% is a good one.
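For anyone wanting to plug in their own rates, the arithmetic above works out as follows. The figures are the ones used in this section; the 40-hour contractor week is implied by the $3,200 figure rather than stated, so treat it as an assumption.

```python
ADMIN_RATE = 40            # $ per hour
CONTRACTOR_RATE = 80       # $ per hour
WORKING_WEEKS = 48         # 52 weeks less 4 weeks of annual leave
ADMIN_HOURS_PER_WEEK = 5   # one hour per working day
CONTRACTOR_HOURS = 40      # one week per year, assuming a 40-hour week
YEARS = 3                  # provisioning period for the B2D storage

admin_per_year = WORKING_WEEKS * ADMIN_HOURS_PER_WEEK * ADMIN_RATE   # 9600
contractor_per_year = CONTRACTOR_HOURS * CONTRACTOR_RATE             # 3200
staging_only_total = (admin_per_year + contractor_per_year) * YEARS  # 38400

# Appropriately provisioned: one fifth of the admin time, no contractor.
provisioned_total = admin_per_year // 5 * YEARS                      # 5760

saving = staging_only_total - provisioned_total                      # 32640
saving_pct = 100 * saving / staging_only_total                       # 85.0

print(f"Staging-only: ${staging_only_total:,}")
print(f"Provisioned:  ${provisioned_total:,}")
print(f"Saving:       ${saving:,} ({saving_pct:.0f}%)")
```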

Physical Wear and Tear

We’d count ourselves lucky if the only impact of using B2D in a staging configuration were staff costs. There’s more though. The wear and tear on both physical media and physical tape drives will be significantly increased, as these units will be running more frequently. Not only that, rather than having a reduced priority, the service time on physical tape is almost as critical as in a tape-only environment. The net consequence is that rather than being able to, say, work with a next-day service contract for the physical tape libraries, organisations are forced to stick with a 4-hour same-day response contract. As we know, there’s usually a pretty significant price difference between these types of contracts!

Increased Risk of Recovery Failure

We’d equally count ourselves lucky if the only impacts of using B2D in a staging-only configuration were just staff time and increased maintenance costs. The real insidious cost though is the risk of a recovery failure. In this, I’m not referring to any limitations that may exist around simultaneously recovering from B2D media while staging/cloning from it. What I’m referring to is the risk that a backup may not actually run in the first place because a staging region becomes full, blocking new sessions from starting. When considered from a backup perspective, that may not sound like a lot. Turning it around to the purpose of a backup though: imagine the consequence of that data that was never backed up being needed for a recovery. It may be logical to say “if it can’t be backed up, then we can’t factor it into recovery requirements”, but disasters, emergencies and auditors do not come when it’s convenient for us.

With this in mind, any backup that fails to run because a staging area is full should be considered in terms of the full impact of a recovery SLA being breached for that data. That may sound harsh, but I’d actually suggest it’s a more business-focused rather than IT-focused approach to backup.

How’s that busy-state staging sounding now?

Enterprise data protection is one of those areas where businesses are most tempted to do cost cutting. We see it with Icarus support contracts, with inappropriate coupling of services, and we see it with B2D staging areas. We can intuit with almost no effort that busy state staging isn’t the best backup model. If your system is busy 20 hours a day between backup, cloning and maintenance functions, then it’s obvious that there’s at least an increased risk of parts failure; but the cost of the architecture is also magnified by wasted staff time, increased maintenance contract costs, and the potential failure to facilitate business-required recoveries.

When we take all those things into consideration, architecting B2D for significant or at least appropriate nearline recovery purposes rather than just staging becomes the cheaper option.

 

Periodically there’ll be a post about storage that counsels the more obvious fact that “Backup is not Archive”. Less frequently discussed, but perhaps more important, is the fact that archive is not backup. To focus on why, and how this is the case, I want to look at email archive.

If we look at a standard email archive model – say, something like SourceOne, then it can, if you squint a bit, look a little like an email backup product – but it’s not really. SourceOne can not only discover and handle archive storage for existing email when it’s installed, but it has the option of automatically ingesting email into the archive as soon as it’s received. Users can then, if they want to, retrieve email directly from the archive rather than asking for a “brick level” recovery.

But is the email archive a backup?

While the short answer is “no”, the long answer is a little more complex than you might think.

Consider the definition of a backup:

A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications, or operating systems that can be used for the purposes of recovery.

(From “Enterprise Systems Backup and Recovery: A corporate insurance policy“)

Now, if we consider an email system from the perspective of end user requests for item level recovery, then in that narrow instance, we would be forced to declare the archive to indeed be a backup. However, if the email archive system is unable to restore the entire system state of the email server – from the OS right through to the email database – then from a broader, disaster recovery and system recovery perspective, archive is not backup.

As archive systems grow in complexity and offer richer feature sets, there’s a blurry line where some people struggle to understand why they’d back up and archive the same system(s). So we provide the litmus test:

Regardless of what the archive system allows recovery of, if it does not allow recovery of the entire system, it’s not a backup.

So in that sense, an email archive system that allows brick level recovery, but can’t facilitate reconstructing the entire email server functionality is not a backup.

 

There’s a lot of talk about tracking data growth by watching SAN and NAS usage, counting allocated storage by the gigabyte, etc., but I’ve always thought that backup and recovery systems offered an elegant way of closely monitoring data growth within an environment.

Recently I was asked to contribute some articles about how backup and recovery can help to improve IT processes and performance within an organisation, and the first thing that occurred to me was to write about this very topic.

If you’re worried about tracking and trending data growth within your environment, and want to see some simple examples of how to account for peaks and troughs in backup capacity while still predicting data growth, please head over to “Using Backup and Recovery to Track and Forecast Data Growth” at IT Performance Improvement.

© 2012 The NetWorker Blog