A couple of weeks ago now I decided to take a break from writing on the blog due to personal reasons. Being a bit of a geek, it’s fair to say that I have a lot of IT-related personal projects on the boil at any given time. Or rather, I used to.

However, I found that it had reached the point where I was focusing so much on the blog that I wasn’t giving myself any time to work on other projects that also happen to give me a great deal of personal satisfaction. So I wanted to pull back from the blog for a little while to evaluate where I was going, and what I was doing.

In its own way, this has been both a challenge and also very rewarding. Rewarding because I actually managed to achieve several things that I’d been putting off for ages, and challenging because it made me realise how much I’ve been using this blog as a measure of personal satisfaction.

On top of all that, last Friday marked my partner’s 40th birthday, and I would have been remiss if I’d not spent serious time on preparations for that*.

So, where does this leave nsrd.info?

I’m back, the hiatus is over, but having had some time for reflection, there’ll be a few changes so that I can ensure I don’t let the blog take over my personal life again. Here’s what those changes are going to be:


  • Wherever possible, I’m going to limit myself to two articles a week. Some weeks there may only be one.
  • I’d be very pleased to have additional contributors. A while ago Ronny Egner contributed a great article about installing the NetWorker client on OpenSolaris, and I’d be pleased to hear from others who have information they’d like to contribute.
  • Before the hiatus, I cut over to using Mail Chimp for notices about new articles. I’m going to continue to use Mail Chimp, but only for those one or two articles a week. What does that mean? See the next point…
  • While I’ll only be aiming to publish one or two articles a week, I’ll be starting a new “Tidbit” category, and if minor items come to mind during the week (e.g., I spot an article I’d recommend you read), I’ll post them under that category. These won’t be long articles, more like extended tweets. Since Mail Chimp notices will only cover the main articles, you may very well wish to subscribe to the site’s RSS feed.
  • My previous plan for micromanuals will probably slow down quite a bit. At this point I’ll be aiming to provide a micromanual for configuring LinuxVTL with NetWorker by the end of June, and after that I’ll evaluate when the next one will come out, and what it will be about.

In short, I’m hoping that by taking a more structured approach to the blog I’ll get a more balanced personal life :-)

In keeping with the strategy of actually finishing some personal projects, this week I’m going to be focusing on completing the Nomenclature page rather than completing new articles.

Thanks for your patience!


* On that front, here’s a tip for a 40th birthday present: an iPad. Well, that was part of it. The other tip is – if you’re going to have a party – organise in advance a photo book of everyone attending, and get them to sign and leave a message in the book on the night. It seems you can’t go wrong with that, as I was pleased to discover.

 

I will be taking a hiatus from this blog, which will last at least two weeks, or perhaps longer.

 

A very traditional approach to configuring automated backups in NetWorker is to make use of the schedule override feature in NetWorker groups. That is, by defining either a schedule or a level at the group level, the backup level for all clients in the group will be in lock-step. Pictorially, this configuration resembles the following:

[Figure: Client levels/schedules in lock-step]

We frequently encourage this sort of setup because it takes two items which NetWorker can handle separately (start time and level) and effectively merges them, something a lot of other backup products just do as a single configuration item. Perhaps even more importantly, in small to mid-size businesses with modest data levels, this makes more sense anyway – it allows you to readily construct “classic” backup scenarios, such as “full on Friday, incrementals the rest of the week”. So from the perspective of level and amount of data backed up, your backup week would look similar to the following:

[Figure: Schedule for full backups once a week, incrementals rest of time, lock-step]

Now, as I said, this works well for businesses with modest data sizes. However, as the image graphically demonstrates, this creates scenarios where there is a significant disparity between the amount of data backed up on regular days and the amount of data backed up for the fulls. Remembering that it’s the full backups that frequently end up straining backup architectures, companies will often end up revisiting their architecture when the amount of data backed up on the “full day” becomes unmanageable.

For some companies, the full day is chosen for sound business reasons – finance companies for instance may have to do weekly full backups starting close of business Friday, and full monthly backups on the last Friday every month. In these scenarios, where there are important business reasons for keeping full backups on a single day of the week/month, the backup architecture must remain constantly configured to handle the massive spike that full backups create.

However, in other companies where there are no strong business reasons for running all the fulls on the same day, it’s worth remembering that there is an alternate configuration – ironically enough, it’s very much the “default” NetWorker configuration, it’s just one most sites tend not to use. This configuration sees the group control only the start time/collection of clients, and does not have a schedule/level override assigned. Instead, the schedule of each client defines what level backup will be done for that client. This sort of configuration resembles the following:
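
As a rough illustration of the difference between the two configurations, here’s a small Python model of how the backup level gets decided. It isn’t NetWorker code and doesn’t use any NetWorker interface; the schedule lists, client names and the `effective_level` function are all hypothetical, and the only behaviour it assumes is the one described above: a schedule set on the group wins, otherwise the client’s own schedule applies.

```python
from typing import List, Optional

# Hypothetical weekly schedules: one backup level per weekday, Monday first.
FULL_FRIDAY = ["incr", "incr", "incr", "incr", "full", "incr", "incr"]
FULL_TUESDAY = ["incr", "full", "incr", "incr", "incr", "incr", "incr"]


def effective_level(group_override: Optional[List[str]],
                    client_schedule: List[str],
                    weekday: int) -> str:
    """Return the level a client runs at on the given weekday (0 = Monday).

    A schedule set on the group overrides the client's schedule;
    with no group override, the client's own schedule decides.
    """
    schedule = group_override if group_override is not None else client_schedule
    return schedule[weekday]


# Two made-up clients in the same group, each with its own schedule.
clients = {"alpha": FULL_FRIDAY, "beta": FULL_TUESDAY}
TUESDAY = 1

# Lock-step: the group schedule forces every client to the same level.
lock_step = {name: effective_level(FULL_FRIDAY, sched, TUESDAY)
             for name, sched in clients.items()}

# Client-level: no group override, so Tuesday's levels differ per client.
per_client = {name: effective_level(None, sched, TUESDAY)
              for name, sched in clients.items()}

print(lock_step)
print(per_client)
```

With the group override in place both clients run an incremental on Tuesday; without it, “beta” runs its full while “alpha” runs an incremental in the same group run.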

[Figure: Groups with schedules defined at the client level]

As you can imagine, this does require a slight change of administrative policies in relation to setting the correct schedule at the client level, and potentially additional client instances to handle the daily and monthly backups. The advantage is that you can then have groups where both incremental and non-incremental backups are done concurrently, spreading out the load of the full backups to create a significantly lower spike in resource requirements. So from the perspective of level and amount of data backed up, your backup week would instead look like the following:

[Figure: Spreading full backups out over a week]

This style of schedule isn’t for everyone – as I said, if you have a strong business need to restrict all full backups to a particular day, it’s very unlikely to work. I’d suggest as well that it may not be a good strategy if you happen to have a high staff turnover, as it does realistically add a little more complexity into the environment. (While your environment should be as simple as possible, that doesn’t always mean “as simple as conceivable”.)

In larger environments, though, with significantly higher amounts of data requiring backup, this style of configuration can be a real boon. Compare weekly fulls of, say, 10TB (effectively tiny) with weekly fulls of, say, 500TB, and you can instantly see the attraction of this approach. Instead of having to design a system capable of handling 500TB in 24 hours, you might instead be able to limit your design to a system that at most has to handle 100TB over a 24 hour period (factoring in incrementals + fulls on any given night). That’s not an insignificant difference.
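
To put some rough numbers on that, here’s a back-of-the-envelope sketch in Python. The 500TB total comes from the example above; the 5% daily change rate and the even split of fulls across seven nights are purely assumptions for illustration, so substitute your own figures.

```python
TOTAL_TB = 500.0      # total protected data (figure from the example above)
CHANGE_RATE = 0.05    # ASSUMED daily change rate, for illustration only
DAYS = 7


def lock_step_week(total_tb, change_rate, full_day=4):
    """All fulls on one day (default Friday, index 4); incrementals otherwise."""
    return [total_tb if day == full_day else total_tb * change_rate
            for day in range(DAYS)]


def staggered_week(total_tb, change_rate):
    """One seventh of the data gets its full each night (an assumed even
    split); everything else runs an incremental that night."""
    full_share = total_tb / DAYS
    nightly = full_share + (total_tb - full_share) * change_rate
    return [nightly] * DAYS


lock = lock_step_week(TOTAL_TB, CHANGE_RATE)
spread = staggered_week(TOTAL_TB, CHANGE_RATE)

print("lock-step peak: %.1f TB" % max(lock))     # 500.0 TB
print("staggered peak: %.1f TB" % max(spread))   # 92.9 TB
print("weekly totals:  %.1f TB vs %.1f TB" % (sum(lock), sum(spread)))
```

Under those assumptions the same amount of data is written over the week either way; what changes is the worst single night the architecture has to be sized for.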

[Edit, 2010-05-11]

What’s this got to do with large groups? It occurred to me overnight that while the title of the post was originally “Large group backups”, I diverged somewhat between the original intent of the post and the actual resulting post.

So, the other area where this can be useful is in situations where you have groups with large numbers of clients. For example, in environments with 500+ clients, where a single group may have hundreds of clients in it, switching to mixed levels in the one group has the same load-spreading effect as it does across an entire large environment, but within a single, localised group.

 

There’s long been discussion – particularly between support partners and EMC – about the rather cloak-and-dagger way that cumulative patch clusters have been made available for download. Or not made available, as the case may be. Recently though, that’s changed.

Cumulative patch clusters, if you’re not aware, are collections of patches to individual releases that effectively form a sub release. So considering NetWorker 7.5 SP1, otherwise known as NetWorker 7.5.1, we have cumulative patch clusters that effectively form NetWorker 7.5.1.1, NetWorker 7.5.1.2, etc.

There are typically two types of cumulative patch clusters:

  • Just the patches – i.e., the individual binaries that have been updated;
  • Entire new installers.

Personally, I prefer the second option, even though it means a little more downloading – but others may prefer the individual binaries.

On the NetWorker Support page at EMC, you’ll now find Cumulative Patch (aka Fix) Downloads available:

[Figure: Cumulative patch builds]

Now, I wouldn’t recommend that this should be considered open slather to just go and install every new cumulative patch cluster as it comes out – instead, I’d strongly advocate using the public availability of these builds to closely review the fix notes in each release and see if any of those fixes happen to match issues you’ve been experiencing but either (a) haven’t got around to logging a case about or (b) haven’t been able to resolve.

If they do, it’s probably worth talking to your support provider about installing the cumulative patch build in question.

As always, information makes backup administration easier, and knowing that these cumulative patch clusters are available and having ready access to the fix notes will become a very useful addition to the debugging and maintenance toolkit for NetWorker administrators.

 

I’m stepping out of my normal NetWorker zone here to briefly discuss what I think is a fundamental flaw with the current state of thin provisioning.

The notion of thin provisioning has been around for ages (it effectively dates from the mainframe era), but we started to see it come back into focus a while ago with the notion of “expanding disks” for virtualisation products. Ironically these started initially in the workstation products (VMware Workstation, Parallels Desktop, etc.) before starting to gain popularity at the enterprise virtualisation layer.

Yet thin provisioning doesn’t stop there – it’s also available at the array level, particularly in NAS devices. So what happens when you mix guest thin provisioning in a hypervisor with thin provisioning at the array/NAS level providing storage to the hypervisor?

Chaos.

Multiple layers of thin provisioning are potentially a major management headache in storage allocation. Why? They make it practically impossible to determine, from any one layer, what storage you actually have available and allocated. vSphere for instance may see 2TB of unallocated free space, and your NAS may be telling it there’s 2TB of free space, but the NAS may actually only have 500GB free. Compounding the issue, the individual operating systems leveraging that storage as guests will also each have their own ideas about how much storage is available for use. One system suffering unexpected data growth (e.g., a patch provided by a vendor without warning that it’ll generate thousands of log messages a minute) might cause the entire thin provisioning sand castle to collapse around you.
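
The mismatch is easy to model. The following Python sketch is not tied to any vendor’s API; the `Layer` class, the `real_free_gb` function and the capacity figures are all made up to mirror the 2TB-versus-500GB example above. The point is simply that the space you can really write is bounded by the tightest layer in the stack, including the physical pool that no upper layer can see.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One thin-provisioned layer (hypothetical model, not a vendor API)."""
    name: str
    advertised_gb: int   # capacity this layer presents to the layer above
    consumed_gb: int     # space written through this layer so far

    @property
    def apparent_free_gb(self) -> int:
        return self.advertised_gb - self.consumed_gb


def real_free_gb(layers, physical_gb, physical_used_gb):
    """Space that can actually be written: limited by the physical pool
    and by every thin layer stacked on top of it."""
    limits = [physical_gb - physical_used_gb]
    limits += [layer.apparent_free_gb for layer in layers]
    return min(limits)


# Illustrative figures only: both thin layers believe 2TB is free...
nas_volume = Layer("NAS thin volume", advertised_gb=4096, consumed_gb=2048)
datastore = Layer("hypervisor datastore", advertised_gb=4096, consumed_gb=2048)

# ...but the physical pool, shared with other volumes, has 500GB left.
free = real_free_gb([nas_volume, datastore], physical_gb=3000, physical_used_gb=2500)

for layer in (nas_volume, datastore):
    print("%s reports %d GB free" % (layer.name, layer.apparent_free_gb))
print("actually writable: %d GB" % free)
```

Each layer here is telling the truth about what it was promised; none of them can see the one number that matters.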

This leads me to my concern about what’s missing in thin provisioning: a consolidated dashboard. A cross platform, cross vendor dashboard where every product that advertises “thin provisioning” can share information in the storage realm so that you, the storage administrator, can instantly see an exact display of allocated vs available real capacity.

This isn’t something that’s going to appear tomorrow, but I’d suggest that if all the vendors currently running around shouting about “thin provisioning” are really serious about it, they’d come up with a common, published API that can be used by any product to query through the entire storage-access vertical. I regret to say the C-word, but it’s clear there needs to be an inter-vendor Committee to discuss this requirement. That’s right, NetApp and EMC, HDS and HP, VMware and Microsoft (just to name a few) all need to sit at the same table and agree on a common framework that can be leveraged.

Without this, we’ll just keep going down the current rather chaotic and hazardous thin provisioning pathway. It’s like an uncleared minefield – you may manage to stagger through it without being blown up, but the odds are against you.

Surely even the vendors can see the logical imperative to reduce those odds.

Disclaimer: I’m prepared to admit that I’m completely wrong, and that vendors have already tackled this and I missed the announcement. Someone, please prove me wrong.

 

There is a bug with the way NetWorker 7.5.2 handles ADV_FILE devices in relation to disk evacuation. That is, in a situation where you use NetWorker 7.5.2 to completely stage all savesets from an ADV_FILE device, the subsequent behaviour of NetWorker is contrary to normal operations.

If, following the disk evacuation, either the standard overnight volume/saveset recycling checks run or nsrim -X is explicitly called before any new savesets are written to the ADV_FILE device, NetWorker will flag the depopulated volume as recyclable. The net result of this is that it will not permit new savesets to be written to the volume until such time as it is relabelled, or flagged as not recyclable.

When a colleague asked me to investigate this for a customer, I honestly thought it had to be some mistake, but I ran up the tests and dutifully confirmed that NetWorker under v7.5.2 was indeed doing it. However, it just didn’t seem right in comparison to previous known NetWorker behaviour, so I stepped my lab server back to 7.4.5, and NetWorker didn’t mangle the volume after it was evacuated. I then stepped up to 7.5.1, and again, NetWorker didn’t mangle the volume after it was evacuated.

This led me to review the cumulative patch cluster notes for 7.5.2.1 – while there’s been a more recent version released, I didn’t have it handy at the time. Nothing was mentioned in the notes that seemed to relate to this issue, but since I’d got the test process down to a <15 minute activity, I replaced the default 7.5.2 install with 7.5.2.1, and re-ran the tests.

Under 7.5.2.1, NetWorker behaved exactly as expected; no matter how many times “nsrim -X” was run after evacuating a disk backup unit volume, NetWorker did not mark the volume in question as recyclable.

My only surmise therefore is that one of the actual documented fixes in the 7.5.2.1 cumulative build, while not explicitly referring to the issue at hand, happened (as a side-effect) to resolve the issue.

To cut a long story short though, if you’re backing up to ADV_FILE devices using NetWorker 7.5.2, I would advise that you strongly consider moving to 7.5.2 cumulative patch cluster 1 – i.e., 7.5.2.1.
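
If you track server versions across several sites, a trivial check along these lines can flag the ones worth a closer look. It’s a hedged sketch in Python: the function names are my own, it only parses a version string you supply (it doesn’t query NetWorker), and it only encodes what I actually tested, namely that base 7.5.2 misbehaved while 7.4.5, 7.5.1 and 7.5.2.1 did not. Anything else is untested and simply reported as not known to be affected.

```python
def parse_version(text):
    """Split a dotted version such as '7.5.2.1' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def affected_by_evacuation_bug(version_text):
    """True only for base 7.5.2, the one release observed to misbehave.

    Versions outside those tested (7.4.5, 7.5.1, 7.5.2, 7.5.2.1) return
    False here, which means 'not known to be affected', not 'verified safe'.
    """
    version = parse_version(version_text)
    # Pad to four fields so '7.5.2' and '7.5.2.0' compare equal.
    padded = version + (0,) * (4 - len(version))
    return padded == (7, 5, 2, 0)


for candidate in ("7.4.5", "7.5.1", "7.5.2", "7.5.2.1"):
    status = "consider 7.5.2.1" if affected_by_evacuation_bug(candidate) else "not observed"
    print("%-8s %s" % (candidate, status))
```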

 

Less than a month ago, Apple released service pack 3 to Snow Leopard – i.e., 10.6.3. A few days after that they released 10.6.3.1, which was apparently only needed in a few instances, but which I downloaded and applied anyway due to some irregularities I’d noticed with my OS after installing the vanilla 10.6.3.

It’s recently occurred to me that NetWorker (7.6) has been a heck of a lot more reliable since going to 10.6.3 / 10.6.3.1. As always, it’s a bit of a grey zone, since it’s not officially supported (and there’s definitely some patching required), hence the wait for 7.6 SP1. Overall, though, I’m now noticing that the client process remains contactable by the server across multiple sleep/wake and/or location transitions, something that it wouldn’t do before. There are still some other behavioural oddities, but I realised that I’ve not reinstalled the client on my laptop now for over 2 weeks, which is a bit of a record since I installed Snow Leopard. If you’re in a situation where you absolutely have to be running 10.6 and backing up with NetWorker, knowing that it’s not currently supported, I’d suggest you make sure you’re on 10.6.3.1.

© 2012 The NetWorker Blog