I’m curious as to the differences between using a commercial, supported version of Linux in the enterprise and a non-supported one. Now, I know all the regular arguments – they’re implicitly stated in my article about Icarus Support Contracts.

But here’s the beef: I’m not convinced that commercial Linux companies really offer a safety net. Or to put it another way – they may offer the net, but I’m yet to see much evidence that it’s actually secured to anything. It almost seems a bit like the emperor’s new clothes, and I believe we’re seeing a real surge in popularity of distributions such as CentOS for precisely this reason.

Here are the sorts of things I’ve commonly seen from customers with commercial enterprise Linux distributions who, say, log support cases with the Linux distributor:

  • Being advised to just simply apply the latest patches – OK, sometimes this is valid, but we all treat such recommendations with caution;
  • Being advised to search Google forums, etc.;
  • Being mired in finger pointing hell – it seems that most features or components a company will want to log a case over aren’t covered by the expensive support contracts that come with enterprise/commercial Linux;
  • Getting average and/or highly complicated responses that don’t inspire confidence.

In short, I worry that commercial enterprise Linux distributions provide few tangible benefits over repackaged or alternate distributions.

As proof that I’m serious about this subject, I’ll say something that years ago may have made me apoplectic: Even given how little I like Microsoft’s products, my honest observation is that companies with Microsoft support contracts get substantially more benefit at substantially lower cost than those who have similar support contracts with the enterprise commercial Linux vendors.

So, I’m asking people to convince me I’m wrong – or at least provide counter-arguments! If you’re using a commercial, enterprise Linux, please help me understand what value you get out of their support programmes – examples of problems they’ve solved, and how they’ve proved themselves equal to (or better than) support offerings from either Microsoft or other Unix providers. Any examples/stories that touch on data backup/recovery or storage would be of particular interest.

So feel free to add a comment and let me know what you think!

 

I’m not a storage geek – storage to me is a means to an end, almost irrelevant to the final goal.

I’m passionate about backup though, because backup is about making people happy.

Backup is about recovery, you see.

Recovery is about making sure people can go home on time rather than re-entering lost data all night.

Recovery is about knowing someone can turn up for a flight they booked six weeks earlier and know the airline still knows they booked the ticket.

Recovery is about knowing someone’s pay deposit isn’t lost after a brief systems hiccup.

Recovery is about a student saving a 50,000 word thesis on a server and knowing it will still be there next morning.

Recovery is about being able to look at digital photos of a loved one ten years after they’re gone.

I have the best job in the world.

If you work in backup and recovery, so do you.

 

This is an appeal for information.

I’ve heard conflicting stories and I can’t get rock solid clarification from any party. Despite Oracle initially announcing that Sun would continue to OEM NetWorker from EMC, I’ve subsequently been told by several Sun OEM customers that this has been recently abandoned. Since I’ve heard of Sun (under Oracle) dropping other contracts, it’s left me quite curious as to what the heck is going on.

If someone can give me a definitive answer, I’d appreciate it.

I want to make it plain – I’m not rumour mongering, just trying to get to the bottom of rumours.

 

On Twitter and via blogs, I subscribe to feeds from a bunch of vendors: HP, EMC, NetApp, Xiotech, Compellent, etc. (Please, someone point me at some good feeds for HDS, IBM, etc. Admittedly I’ve not spent a lot of time looking for them, but I’m still keen on finding them…)

There are days when I’m sorely tempted to find every vendor that I subscribe to on Twitter and unfollow. Why? Well, it’s not because I’ve lost interest in the industry, it’s just because I’m tired of the message that keeps on being sent. You see, Twitter has become a virtual boxing ring, except the fights aren’t by Queensberry rules, they’re all-in rumbles that pay little attention to manners or to what people are really interested in reading – facts and figures, real life analyses and implementation experiences, etc.

It’s all a bit petri dish at times – I sometimes imagine that it’s like watching an ethnic conflict starting from the very beginning. If you look at those conflicts and try to trace back reasons, it becomes a morass of “he did / no he did / no she said / no she said” style arguments that just can’t be unpicked; people don’t choose winners in such arguments, they get fed up and walk away from them, or become exasperated with both sides and choose a third option. I know I reach the point where I don’t care. I don’t care that X said Y about Z this week and it was a lie, because last week Z said a lie about X so it’s all just a fetid stink of payback.

I’m fed up. I suspect many others are too. So I’m asking all vendors out there to focus on customers and information, rather than doing your best to kick the crap out of each other. Who is with me on this one?

 

I have, on a few occasions, been puzzled as to how to downgrade NetWorker on Mac OS X. There are a couple of distinct issues that I’ve come up against, and I thought I’d outline them here now that I’ve fully resolved how to do it.

The first is that when NetWorker installs, it’s meant to install uninstall utilities into /Library/Receipts/NetWorker.pkg. However, on Snow Leopard, NetWorker doesn’t write this uninstall information, meaning that technically it’s not possible to uninstall the product. There is, thankfully, a way around this.

First, open up your NetWorker.dmg file, but then drop into the command line and change directory into the NetWorker.pkg/Contents directory within the dmg:

[Screenshot: Important Directory Listings - NetWorker Mac OS X Package]

In the above screen shot, I’ve shown the two directories you need to be aware of; NetWorker.pkg/Contents, and NetWorker.pkg/Contents/Resources.

You’ll note in the Resources directory that there’s a NetWorkerUninstall script, which needs to be run as root. However, the script depends on there being some content in /Library/Receipts/NetWorker.pkg, so you’ll need to do the following:

$ sudo bash
# cd /Volumes/NetWorker<<version>>/NetWorker.pkg/Contents
# mkdir -p /Library/Receipts/NetWorker.pkg/Contents
# cp Archive.bom /Library/Receipts/NetWorker.pkg/Contents
# cp Resources/NetWorkerUninstall /Library/Receipts/NetWorker.pkg
# /Library/Receipts/NetWorker.pkg/NetWorkerUninstall

Once you run the NetWorkerUninstall script, there’ll be a brief pause before you see a flash of lines with entries such as:

Removing: /usr/share/man/man8/tur.8

and so on.

At the end of this, you theoretically should be able to run the NetWorker installer for the version you want to install. However, you’re likely to still end up with the following output from the installer:

[Screenshot: NetWorker Installer - Newer Version Exists]

It was this step that had been frustrating me. Thankfully though, I finally started to think like a combined Mac + Unix user, and realised there was probably a plist style file hanging around somewhere that wasn’t being cleaned up by the uninstaller, and that if it followed Apple’s naming conventions, it would be com.emc.*.plist. So I did:

# find / -xdev -name "com.emc.*" -print

Lo and behold, I found the following:

/private/var/db/receipts/com.emc.networker.bom
/private/var/db/receipts/com.emc.networker.plist

Removing them was the final piece of the puzzle – without them hanging around, the NetWorker installer utility didn’t pick up there was a newer version of the software installed, and I was finally able to downgrade NetWorker for testing purposes.
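
The search for leftover receipts can be wrapped in a small shell function. This is only a sketch under assumptions, not an EMC-supplied procedure: the function name list_emc_receipts and its optional directory argument are hypothetical, and it deliberately only lists candidates so that you can review them before removing anything by hand.

```shell
# Hypothetical helper: list leftover EMC receipt files in a receipts
# directory (default: the location where the files above were found).
list_emc_receipts() {
    receipts_dir="${1:-/private/var/db/receipts}"
    # -maxdepth 1: only look at files directly inside the directory.
    find "$receipts_dir" -maxdepth 1 -type f -name 'com.emc.*' -print
}

# Review the list first, then remove the files manually (as root) once
# you are sure they belong to the version you have just uninstalled:
# list_emc_receipts
```

Keeping the listing separate from the removal means nothing is deleted until you have looked at exactly what the pattern matched.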

 

NetWorker 7.5 SP3 has been released today, and with it comes a selection of important changes, including:

  • ADV_FILE devices now use enhanced load balancing selection criteria; in short, NetWorker will start new backups on the device that has the least NetWorker data written to it;
  • ADV_FILE devices that are newly created get better target/max sessions (1/32 respectively). You can still adjust these the way you want, but it’s a better starting point;
  • ADV_FILE devices can be configured to stop writing at a user-defined %full level;
  • Support for LTO-5;
  • NMC supported on Windows 2008 R2;
  • Autostart 5.3 SP4 support added.

There’s also the usual round of fixes and inclusion of various cumulative patch releases of the previous version. I’m yet to actually download and start testing 7.5 SP3, but it’s good to see the start of the ADV_FILE enhancements discussed for 7.6 SP1 get pushed down into the 7.5 tree as well.

I’m currently going through and updating the primary website, so I’ll aim at putting the PDF of the release notes in the standard location over the weekend. For the time being, you can download the notes either through PowerLink or this local link: NetWorker 7.5 SP3 Release Notes.

 

Every backup you do has a half-life, which isn’t the retention period of the backup. Now, if you’re new to NetWorker, don’t go looking for a half life setting for clients or savesets or groups; I’m referring to a concept here rather than a literal configuration option.

In most environments (in environments where the backup system is not being used for archive or HSM), a backup is most likely to be used within a short period of it being generated. That highest-probability period of usage is what I would suggest should be considered the half-life of the backup. Like regular notions of half-life, it’s not just a one-off measurement, but one that can continue to be applied throughout the lifespan of the backup.

I.e., through each successive half-life iteration, the likelihood of the backup being recalled for recovery halves again. Unlike regular half-life considerations though, the potency – or the importance – of the backup remains the same regardless of its half-life state. That is, a backup you don’t recover from until nearly the end of its life is still likely to be just as important as a backup you recover from 30 minutes after it was completed.
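
As a rough worked example of that halving, the sketch below prints the relative likelihood of a recall after a number of completed half-lives, taking the first period as 100. The function name recall_likelihood_pct is made up for illustration, and the figures are relative rather than measured probabilities.

```shell
# Relative likelihood (in percent) that a backup is still recalled after
# a given number of completed half-lives, taking the first period as 100.
recall_likelihood_pct() {
    elapsed="$1"          # completed half-lives: 0, 1, 2, ...
    pct=100
    i=0
    while [ "$i" -lt "$elapsed" ]; do
        pct=$((pct / 2))  # integer division: halves on each iteration
        i=$((i + 1))
    done
    echo "$pct"
}

recall_likelihood_pct 0   # 100
recall_likelihood_pct 1   # 50
recall_likelihood_pct 2   # 25
recall_likelihood_pct 3   # 12 (integer arithmetic truncates 12.5)
```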

In normal circumstances though, what the half-life of a backup affects is the urgency of a recovery request for that backup. This, in turn, reflects the way in which your backup environment needs to facilitate recoveries. As successive half-lives pass, you can typically take longer to perform the recovery; at the other end of the spectrum, when the backup is still fresh, a recovery request will expect a similarly rapid response.

You effectively design the backup system to suit the half-life of your backups. If your backups are most likely to be used for recovery within the first two weeks of their generation, then you need to ensure that those backups are your fastest to recover from. From an architecture point of view, this would typically mean storage decisions such as ensuring that at least 2 weeks’ worth of backups are on disk – either as VTL backups or ADV_FILE type backups. Over time you can move backups out to slower media – making room for new, incoming backups, and keeping old backups recoverable at an appropriate level of cost effectiveness for the likely urgency of a recovery request.

For the most part, we’d normally only need to consider 4 levels of half-life for backups before we hit a level of such diminishing urgency that it becomes a bit like the high availability problem (i.e., the jump from 99.99% availability to 99.999% availability is a far more expensive proposition than the jump from 99.9% availability to 99.99% availability, etc).

These levels would be:

  • Online – For backups that have the highest recovery priority, you’ll likely use a combination of backup and snapshot software. Your “online” backups are snapshots that can be instantly retrieved from.
  • Nearline – For backups that have been recently done, you’ll want to keep them almost-immediately accessible; in a disk backup realm this means within a VTL or on ADV_FILE – in a tape only realm you’d be ensuring these are still within your tape library.
  • Offline – For backups that were done “a while ago”, you’ll want to keep them locally available for recovery purposes but not necessarily hogging more expensive backup space. In a backup to disk/VTL environment, this would either mean staging to physical tape and keeping within a tape library, or keeping on-site in a media vault. For a tape-only environment, it refers to keeping the media on-site in the media vault.
  • Offsite – For backups that have been done “some time ago”, they can typically be kept off-site with a records retention company, or in disaster recovery storage, etc.
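
To make the four levels concrete, here is a minimal sketch that maps a backup’s age to a level. The function name backup_level and the thresholds in the usage lines (1, 14 and 90 days) are assumptions for illustration only; the 14 echoes the two-week example above, and your own boundaries will depend on your site.

```shell
# Hypothetical mapping from backup age to one of the four levels.
# Arguments: age in days, then the maximum age (in days) for the
# Online, Nearline and Offline levels. Anything older is Offsite.
backup_level() {
    age_days="$1"
    online_max="$2"
    nearline_max="$3"
    offline_max="$4"
    if [ "$age_days" -le "$online_max" ]; then
        echo "Online"
    elif [ "$age_days" -le "$nearline_max" ]; then
        echo "Nearline"
    elif [ "$age_days" -le "$offline_max" ]; then
        echo "Offline"
    else
        echo "Offsite"
    fi
}

backup_level 0 1 14 90     # Online
backup_level 10 1 14 90    # Nearline
backup_level 45 1 14 90    # Offline
backup_level 400 1 14 90   # Offsite
```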

(Note that in all of this I’m not talking about clones – copies of your backups – you need them regardless of the half-life of your backup, so I’m taking them as a given at each stage of the process. For obvious reasons, clones and originals should never be in the same location except when they’re being purged.)

There’s another way we talk about half-lives in backups – RTO (recovery time objective) and RPO (recovery point objective). However, RTO and RPO frequently intimidate the business. If you’re struggling to get the business to focus on RTOs and RPOs, start with the more readily understandable term of backup half-life and see how you go.

 

Whenever I conduct training, I tell a story about a disaster recovery test that I ran for a customer in the early 2000s. This had to be run within the customer datacentre, and the setup looked vaguely like this:

[Figure: System setup]

As anyone who has worked within an actual datacentre computer room can attest to, these rooms get pretty noisy, and pretty cold. For instance, even at that relatively close distance I actually couldn’t hear the tape library in operation. If I needed to be physically aware of what it was doing, I had to walk to the front of it and cup my hands between my face and the glass panel and peer at the robot and drives. (It was voyeuristic, in a geeky sort of way.)

Now this customer had STK 9840 tape drives, which you might know are very fast when it comes to load/unload operations. When you combine a fast drive with a fast robot mechanism such as was in the STK 9740, it means that the following NetWorker process can actually happen quite quickly:

Load Tape -> Read Label -> Eject Tape -> Unload Tape

In fact, I wasn’t quite aware when I first started the disaster recovery exercise just how quickly these operations would run, as the implementation had been done by other staff before I joined the company – I was just in to do the recovery testing. The testing was made all the more interesting given this was the first system I used with SmartMedia, Legato’s library virtualisation software at the time.

The challenge I had was a simple one: I kept on issuing volume load operations that NetWorker appeared to be accepting as valid, but not processing. I.e., I’d issue a volume load operation, get up from my desk, wander around to the front of the library, peer inside and … see no activity. No tapes in drives, no moving robot, nothing.

After a few iterations of this – you know, the mindless “maybe if I keep doing it, it’ll start working” sort of approach that we all sometimes suffer – it occurred to me to check the logs, and sure enough, NetWorker was reporting that it was loading the volume, detecting a tape label it didn’t know about (of course! – doofus!) and spitting the tape back out again.

The only thing was that it was happening just fast enough that (because I’d felt no need to rush) it would be done by the time I’d get up, grab my phone, and head around to the front of the library. Because I couldn’t hear the damn thing, I had no idea what was going on.

Once I realised how the system worked – in terms of speed of operations – the rest of the process worked smoothly.

A year or two later, I was helping a customer transition from ArcServe to NetWorker, and they had an interesting tape library. I can’t remember the brand now, but it had 66 slots, with 22 slots only ever facing the robot at any one time. The slots themselves were on a carousel, and that carousel could spin into 1 of 3 positions to allow the robot access to the slots. I thought that was a pretty weird design, but then I was confronted with just how long it would take the library to become ready to import a tape after it was dropped in its CAP. In fact, confronted would be better said as confounded – with the computer room several floors beneath the floor I was working on, it was possible to go and put a tape in the CAP and still have the library offline before I got back upstairs. It left me certain that NetWorker was misbehaving.

In fact, NetWorker was reasonably fine – but again, I was working with a library I wasn’t familiar with. It turned out that the bar code reader on the robot head couldn’t actually read the media barcode from the CAP – instead, the robot had to (slowly, laboriously) take the tape out of the CAP, put it in a special “reading the barcode slot” at the bottom of the library, read the barcode, then (slowly, laboriously) take the tape out of the special “reading the barcode slot” and return it to the CAP.

Both the STK 9730/9840 experience and the carousel robot experience taught me some very important lessons – lessons which I think every backup administrator should make sure to experience:

  1. You can’t accurately diagnose your environment unless you know how it normally works.
  2. You can’t know how your environment normally works unless you are aware of the physical timings of activities.
  3. You can’t know the physical timings of activities unless you physically watch them.
  4. Therefore you can’t accurately diagnose your environment unless you physically watch your environment.

Now, I’m not suggesting you have to watch your environment all the time – for most backup administrators that would suggest having a desk in a very chilly and noisy computer room. (I once spent every work day for a month in a frigid computer room with overhead cooling. While listening to Justin Bieber for 30 minutes would have been more torturous, it would have only been by a very small amount.)

What you do need to do though is ensure you know how long the basic operations take – loading tape, unloading tape, withdrawing and depositing tapes, re-initialising the library after a reset, powering on, etc. This means sitting with the physical components of your backup system and running the various commands and becoming familiar with how long they take to complete. You can read as many tech specs and manuals as you want, but until you’ve sat down with your tape library (or a comparable one), and experienced the timings yourself, you’re going to be working in the dark when it comes to debugging the system if issues occur.
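
One low-tech way to record those timings is to wrap each operation in a small timing function and note the results. The sketch below is an illustration under assumptions: the name time_op is made up, and sleep stands in for the real library operation (for NetWorker, typically an nsrjb invocation), which you would substitute yourself.

```shell
# Run a command and report how many whole seconds it took.
# Usage: time_op LABEL COMMAND [ARGS...]
time_op() {
    label="$1"
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start))s (exit status $status)"
    return "$status"
}

# Stand-in for a real library operation; replace `sleep 1` with the
# load, unload, deposit or reset command you use at your site.
time_op "load tape" sleep 1
```

Running each operation a few times, while standing in front of the library, gives you a baseline to compare against when something later appears to hang.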

It’s actually a natural extension to standard system administration practices. A system administrator who is familiar with his or her system will have a reasonably good idea of what processes should be running under normal operations (or rather – what key processes), and what average/peak loading conditions should do to the host. Taking it to the physical layer as a backup administrator is perfectly normal.

 

While I was on my blog hiatus, IDATA Tools v4.2 was released, and I’ve been meaning to outline what has been delivered in this version.

This release focused on making key enhancements to existing tools, and covers:

Tool           Enhancements
backup-report  Introduced new comprehensive mode. When run on the GST server, this utility can now report on backup failures as well as successful backups.
dbufree        Now supports staging from AFTD to pools that are not of "Backup" type.
sslocate       Corrected reporting of data to be cloned when savesets to be cloned span multiple volumes.
check-clients  Improved the performance test.
media-free     Improved support for running on Windows servers.
review-res     Now supports emailing the configuration report once it has been generated.

For more details about what IDATA Tools can do, check out IDATA’s reseller site, Krisanya, and click the “IDATATools for NetWorker” link in the main menu. IDATA Tools is available for purchase, and of course remains free on subscription to IDATA support customers.

The next, and more substantial update of IDATA Tools is currently in the works – but if you’re using IDATA Tools now, I’d highly recommend you upgrade to v4.2.

 

Clients

The Question

It’s usually the case that the biggest part of a NetWorker environment – in terms of resources configured and software deployed – is the clients themselves. When sites look at upgrading their NetWorker environments though, the normal procedure is to upgrade the server and any storage nodes as the first step, then plan to upgrade clients on an “as needed” or “when we get around to it” basis.

This prompted a customer to recently ask me to write a blog article about this topic (thanks, Robert!). Specifically, Robert’s question was – why should I upgrade my clients?

Having worked with several of my clients now for close to a decade, I’m familiar with the scenario: the servers and storage nodes will be at appropriately supported versions of the NetWorker software, but clients are trailing behind, and before you know it your versions may stretch out like a long tail behind your backup server and storage nodes:

[Figure: Client versions]

So it raises the question – when NetWorker is so good at supporting older client versions, what’s the rush in upgrading old clients? This is a question where an answer of “…because…?” isn’t sufficient, so perhaps first it’s worthwhile considering some common arguments for not upgrading the clients:

  • If it’s not broken, don’t fix it.
  • We had some problems with version X, it’s stable on X+n, so keep it that way. (A variant of the above.)
  • It’s working, so it’s a low priority task.
  • Admins are too busy fire fighting to do unnecessary upgrades.
  • Change control is too tedious.
  • This is the last supported version for this <old> operating system.

The Answer

The generic answer

Each of the above reasons, in its own right, can be a perfectly valid reason. Temporarily stepping away from backup software and looking at, say, operating systems, here are some example reasons why we eventually choose to upgrade operating systems:

  • We explicitly need the new features.
  • New applications require the new features.
  • Poor support on old OS for new hardware (and vice versa).
  • More efficient.
  • Faster.
  • More secure.

We can evaluate a whole host of reasons, but we can actually boil any upgrade rationale down to one of the following three generic reasons:

  1. Risk – The risk in not upgrading overrides the cost of upgrading. Two common risks are security or reliability.
  2. Features – The currently installed version lacks features that are both required and available in a newer version.
  3. Support – The currently installed version is either out of support, or is scheduled to no longer be supported as of a known, unacceptably close date.

Note – regarding features: To be a valid upgrade reason, it should be both available and required, not one or the other – and yes, sometimes upgrades are done based on features being required without first checking if they’re available!

When we boil down upgrade reasons to just three generic terms, risk, features and support, it becomes easier to justify either:

  • Having an active programme in place to keep clients up to date or
  • Periodically updating clients.

So going back to NetWorker clients, we can evaluate what sort of reasons in each of the generic categories might prompt an upgrade; I’m going to go backwards through the previous list.

The NetWorker answer

Support

To me, unsupported = broken. So, “if it’s not broken, don’t fix it” stops being a valid reason at the point where the client software installed is no longer supported. So if your site has v7.3.x and lower clients lying around – or, come October 1 2010, v7.4.x and lower clients – you should either:

  • Upgrade to a supported version or
  • Upgrade to the last supported version that is compatible with the client (for very old clients/applications).

If a client is on an unsupported version of the software and it can be upgraded to a supported version, leaving it on that unsupported version can introduce unnecessary risk in the environment. While a current version of NetWorker will more than likely keep communicating with an older version of NetWorker, that doesn’t mean that issues can’t happen, and if they do, you want to be able to resolve the issue as quickly as possible. By having a supported version of the client installed, you can considerably streamline the resolution process.

Features

We have a tendency to focus on the backup server (and, to a lesser degree, the storage node) when looking for feature support. For instance, we may want disk backups to be able to do X, or NDMP backups to be able to do Y, and so on. However, feature support isn’t enhanced only at the server layer. In actual fact, a lot of feature support comes from the client software. For instance:

  • If you’re working with Solaris 10 clients that are deployed in non-global zones, having up-to-date client software ensures that you maximise your support of that configuration;
  • If you’re looking at upgrading a host from Windows 2003 to Windows 2008 R2, you’re likely going to need to upgrade the NetWorker client – you need a newer client instance that has more up to date support for the newer operating systems;
  • If you’re wanting to eliminate no-longer-needed licenses within your backup environment, and are looking at getting rid of those ClientPak licenses, you’ll need to make sure that the clients themselves support the removal of the licenses;
  • If you want to be able to do VSS filesystem backups but not have to buy VSS licenses, you’ll need to have a version of the NetWorker client that supports this option;
  • If you want to replace your Oracle 9 database with Oracle 11, you may find yourself needing to upgrade the database module. This in turn may necessitate an upgrade of the client software to support the newer module, too.

Suffice it to say, feature support can be just as important at the client level as it is at the backup server level. In this regard, the release notes will always be an excellent reference – if you’re not sure whether you need to upgrade, check to see what new functionality comes into play on the latest versions of the software.

Risk

The final reason to upgrade is risk – risk that there is a bug or a security issue in the currently installed version of the software that may be resolved in a newer version. Like “Features”, above, your best bet for determining the risk of not upgrading is by referring to the release notes for newer versions of the software. Read the “fixed issues” notes very carefully; it could be that intermittent issues you haven’t yet found time to investigate – or that you have been actively trying to resolve – are actually resolved in a newer version of the software. While we often look at fixed issues in NetWorker release notes for the server and storage node, they can be equally applicable at the client level, too.

When should clients be upgraded?

Once we’ve determined that we can decide to upgrade clients on the basis of either support, features or risk, we must next ask ourselves the question – when should the clients be upgraded? There’s a sister question to this too – how frequently should clients be upgraded?

I’m not going to suggest that your backup server and all its clients should be kept in absolute version lock-step the entire time. If you have the processes, personnel and time to do this, then by all means go ahead – but it isn’t something that you should obsessively worry about. Instead, I’ll offer some generic suggestions; to do this though I’ll refer to major and significant version numbers. Consider say, NetWorker 7.5 SP2; I’d consider the major version number to be 7, the significant version number to be 5, and the service pack to be 2.

  • Aim to keep all clients that support it on at least the same major version number as the backup server;
  • Where time permits try to get clients on the same (or higher*) major+significant version number as the backup server – but as a general rule, ensure that the clients are at least on a supported major+significant version number.
  • Consider getting clients onto the same major+significant+service pack version as the backup server where there are support, risk or feature reasons, i.e.:
    • Where there are new features in the service pack you need, or,
    • Where there are risks in remaining at the current version, or,
    • Where there are support reasons for updating. (E.g., patch available for new SP that would need to be back-ported to your existing version).
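
The major / significant / service pack breakdown above lends itself to a quick check of how far a client has drifted from the server. The sketch below is illustrative only: the function name version_gap is made up, and it assumes versions are written as three dot-separated numbers (7.5.2 for 7.5 SP2), which is a notation chosen for this example. It reports the first level at which the two versions differ, without judging direction, since a client may legitimately be newer than the server.

```shell
# Report the first level at which a client version differs from the
# server version. Both must be given as MAJOR.SIGNIFICANT.SP, e.g. 7.5.2
# for NetWorker 7.5 SP2 (an assumed notation for this sketch).
version_gap() {
    client="$1"
    server="$2"
    c_major=${client%%.*}; rest=${client#*.}
    c_sig=${rest%%.*};     c_sp=${rest#*.}
    s_major=${server%%.*}; rest=${server#*.}
    s_sig=${rest%%.*};     s_sp=${rest#*.}

    if [ "$c_major" -ne "$s_major" ]; then
        echo "major"
    elif [ "$c_sig" -ne "$s_sig" ]; then
        echo "significant"
    elif [ "$c_sp" -ne "$s_sp" ]; then
        echo "service-pack"
    else
        echo "same"
    fi
}

version_gap 6.1.4 7.5.2   # major
version_gap 7.3.4 7.5.2   # significant
version_gap 7.5.1 7.5.2   # service-pack
version_gap 7.5.2 7.5.2   # same
```

A "major" or "significant" result is the cue to check whether the client is still on a supported version; a "service-pack" result only matters where there is a feature, risk or support reason to close the gap.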

You may think that all these answers are a bit vague – and by necessity, they are, since the issues, needs and processes at each site will govern exactly how and why upgrades are done.


* Yes, or higher. Such as for instance, sites that have been running a NetWorker 7.4.x server, but need to run a 7.5 SP2 client for Windows 2008 R2 systems, etc.

© 2012 The NetWorker Blog