While this is pertinent to all versions of NetWorker, it seems particularly worth mentioning now, since as of 7.5.2 we’re seeing revised messaging from NetWorker when a tape becomes prematurely full. The new messages state:

nsrd media notice: LTO Ultrium-4 tape 800814L4 used 2039 MB of 800 GB capacity
nsrd media notice: NetWorker media: (Warning) 800814L4 marked full prematurely.
  Verify possible error on the device /dev/nst4, advertised capacity is 800 GB
  marked full at 2039 MB

Now, it’s worth noting that a tape filling up this soon normally means there is an issue, and this version of the message, while only subtly different, is certainly more informative, which is a good thing. When we consider VTLs, however, it’s a different story. In a virtual tape library, we normally want to use much smaller media sizes than the drive type we’re configured for. That way you’re writing virtual volumes that are 50GB or 100GB rather than 800GB. In my case, referring to the above, my lab VTL uses virtual media sizes of 1GB (with compression).

So, how do you go about this? Well, it’s easiest to accomplish when you first set up the environment. You need to change the “Volume Default Capacity” of each virtual device to suit the allocated media sizes. To do this, in NMC turn on View->Diagnostic Mode, then when viewing device properties, enter the appropriate size in gigabytes (followed by “G” or “GB”) in the “Volume default capacity” field of the Configuration tab, shown below:

[Figure: Changing default volume size]

Now, if you can do that on your VTL devices before you start labelling volumes, you’re done and dusted. However, if you’ve previously labelled your media, you either have to relabel the currently blank virtual media or wait until NetWorker gets around to recycling the currently used media.

You can query mminfo to see what the default capacity is registered at – e.g.,

[root@tara ~]# mminfo -m
state volume                  written  (%)  expires     read mounts capacity
800801L4                2254 MB full 02/26/2011   0 KB     5    800 GB
800802L4                   0 KB   0%     undef    0 KB     5   1000 MB
800804L4                   0 KB   0%     undef    0 KB     5   1000 MB
800805L4                   0 KB   0%     undef    0 KB     3    800 GB
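
If you have many virtual volumes, checking the capacity column by eye gets tedious. The following is a minimal sketch, assuming the output looks exactly like the sample above (volume name first, capacity as the last two fields); the function names and the 1000 MB threshold are illustrative only and are not part of NetWorker.

```python
# Sample rows copied from the mminfo -m output above.
SAMPLE = """\
800801L4                2254 MB full 02/26/2011   0 KB     5    800 GB
800802L4                   0 KB   0%     undef    0 KB     5   1000 MB
800804L4                   0 KB   0%     undef    0 KB     5   1000 MB
800805L4                   0 KB   0%     undef    0 KB     3    800 GB
"""

# Conversion factors to megabytes (binary multiples assumed).
UNIT_MB = {"KB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}


def parse_capacities(text):
    """Map volume name -> registered capacity in MB.

    Assumes the volume is the first field and the capacity is the
    last two fields; header and prompt lines are skipped.
    """
    volumes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        size, unit = fields[-2], fields[-1]
        if unit not in UNIT_MB or not size.isdigit():
            continue
        volumes[fields[0]] = int(size) * UNIT_MB[unit]
    return volumes


def oversized_volumes(volumes, expected_mb):
    """Return volumes whose registered capacity exceeds expected_mb."""
    return sorted(name for name, cap in volumes.items() if cap > expected_mb)


capacities = parse_capacities(SAMPLE)
print(oversized_volumes(capacities, 1000))
```

On the sample, this picks out the two volumes still registered at 800 GB, which are the ones that would need relabelling (or recycling) to pick up the new default.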

Now, what effect does this have on how much you can write to the volumes? The short answer is none. All you’re doing is adjusting the default capacity assigned to new volumes that are labelled in these (virtual) tape drives – and we can see what happens when NetWorker breaches the default volume capacity all the time in relation to physical tape – it just keeps writing until it hits end of physical tape. Nothing more, nothing less. So this means when you fill up your virtual media, NetWorker doesn’t complain at all:

nsrd media notice: LTO Ultrium-4 tape 800802L4 on /dev/nst3 is full
nsrd media notice: LTO Ultrium-4 tape 800802L4 used 2793 MB of 1000 MB capacity
nsrd media info: WORM capable for device /dev/nst3 has been set

Is this something you must do? Well, no, not technically. However, remembering that I advocate a zero error policy, the above is something I’d definitely strongly recommend for virtual devices. Doing so will eliminate what would otherwise be false errors on the virtual tapes within the NetWorker daemon logs. That means if you have to search for media issues, or refer your daemon logs to your support provider for analysis, they won’t be seeing bunches of “tape filled prematurely” issues.
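
To get a feel for how much noise these warnings add, you could count them per volume. This is a rough sketch that assumes the warning text matches the messages quoted earlier in this post; the pattern and function name are illustrative, and reading the lines from your actual daemon log is left out because log locations and formats vary.

```python
import re
from collections import Counter

# Pattern based only on the warning text quoted in this post.
PREMATURE_RE = re.compile(
    r"\(Warning\)\s+(?P<volume>\S+)\s+marked full prematurely"
)


def count_premature_full(log_lines):
    """Count 'marked full prematurely' warnings per volume."""
    counts = Counter()
    for line in log_lines:
        match = PREMATURE_RE.search(line)
        if match:
            counts[match.group("volume")] += 1
    return counts


# Lines copied from the messages shown in this post.
sample_lines = [
    "nsrd media notice: LTO Ultrium-4 tape 800814L4 used 2039 MB of 800 GB capacity",
    "nsrd media notice: NetWorker media: (Warning) 800814L4 marked full prematurely.",
    "nsrd media notice: LTO Ultrium-4 tape 800802L4 on /dev/nst3 is full",
]

premature = count_premature_full(sample_lines)
print(dict(premature))
```

Only the volume labelled against the 800 GB default is flagged; the volume that filled against a 1000 MB default produces an ordinary "is full" notice and is ignored.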

 

As I mentioned in a post yesterday, NetWorker 7.5.2 (or NetWorker 7.5 SP2) has been released, and with it comes a bunch of feature enhancements as well as a slew of bug fixes.

One of the criticisms of EMC’s development process for NetWorker for a while was that a new service pack would come out with few, if any, of the bug fixes added as hot-fixes and cumulative patch clusters for the previous service pack. This is something that EMC have clearly been improving on, because reading through the release notes with the almost 150 bug fixes cited, I see many “familiar” issues that were addressed in various cumulative patch clusters for 7.5.1. On this alone, I’ll give 7.5.2 high praise.

I’ll be running up 7.5.2 in my lab today and looking at a few test cases, etc., but so far of the improvements that have been made in this new service pack, I’m pretty stoked about the following:

  1. Auto-addition of the update enabler. Starting with this version, if you go up to a version of NetWorker that requires an update enabler, NetWorker will create it automatically for you. You still have to get it authorised of course, but this saves smaller sites from the hassle of upgrading without checking for the enabler and then hitting problems.
  2. Support for Windows 7 clients.
  3. Support for Windows 2008 R2 as a client, storage node, and server. The company I work for, IDATA, got involved in the beta testing for this and were pleased with the results.
  4. DFS-R Granular Recovery. A few months ago, I had an issue where a customer’s SYSTEM STATE: saveset was 18GB, due to DFS-R replication. The release notes indicate this shouldn’t be the case any more – this data should now be broken out of the SYSTEM STATE: saveset into regular file backup/recovery operations.
  5. VCB support for ESX v4. I know I should mention this, but I remain overall unexcited about VCB because of the lack of granular Linux support. vSphere API backups, when they come, will grab far more of my attention, I hope.
  6. Client parallelism on new clients is reduced from the previous (increase to) 12, back to the original 4. Client parallelism for the NetWorker server’s client instance on initial bootstrap/creation remains 12, and I’m fine with this. A reversion to client parallelism of 4 however will make performance tuning in new environments at least a little more sane.

Overall I have high hopes for 7.5 SP2. If you’re currently needing to back up Windows 2008 R2 or Windows 7 hosts, this is probably going to be a no-brainer: you’ll likely want to upgrade at least those clients to it straight away.

Before I recommend 7.5.2 more generally of course, I want to run it through its paces. I will reiterate though – even on first glance, it seems very promising. As is always the case, you should make sure that you read the release notes before you contemplate upgrading – and have a clear downgrade path if you need to. This means that if you’ve been supplied with any hot fixes, or cumulative patch clusters, you need to make sure you still have these available as you’re planning the upgrade.

 

Typical that it happens on a day when I’m travelling, but NetWorker 7.5 SP2 (aka NetWorker 7.5.2) has been released today.

There’s a bunch of updates and bug fixes in this release – I actually have really high hopes for it based on discussions I’ve had with various folks at EMC. I’m about to start an hour long train trip where I’ll be reading the release notes – if you have PowerLink you can access them from here.

I’ll aim to have a summary posting of new features and bug fixes in the next 24 hours. In the interim, let me just say that I’m over the moon to see that one of the first new features is a reversion to default client parallelism of 4 rather than the previous change to 12. This is very good to see changed. (Oh, not to mention support for Windows 2008 R2.)

[Edit, 2010-02-25]

An overview of new features, etc., can be found here.

 

The scenario:

  • A clone or stage operation has aborted (or otherwise failed)
  • It has been restarted
  • It hangs waiting for a new volume even though there’s a partially written volume available.

This is a relatively easy problem to explain. Let’s first look at the log messages that appear. To generate this error, I started cloning some data to the “Default Clone” pool, with only one volume in the pool, then aborted. Shortly thereafter I tried to run the clone again, and when NetWorker wouldn’t write to the volume I unmounted and remounted it – a common thing that newer administrators will try in this scenario. This is where you’ll hit the following error in the logs:

media notice: Volume `800829L4' ineligible for this operation; Need a different volume
from pool `Default Clone'
media info: Suggest manually labeling a new writable volume for pool 'Default Clone'

So, what’s the cause of this problem? It’s actually relatively easy to explain.

A core component in NetWorker’s media database design is that a saveset can only ever have one instance on a piece of media. This applies equally to failed and complete saveset instances.

The net result is that this error/situation will occur because it’s meant to – NetWorker doesn’t permit more than one instance of a saveset to appear on the same piece of physical media.
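
The rule can be pictured with a toy model. This is only an illustrative sketch of the one-instance-per-volume constraint described above, not NetWorker’s actual media database logic; the saveset ID and the second volume name are hypothetical.

```python
def eligible_volumes(pool_volumes, saveset_id):
    """Return volumes in the pool that can accept an instance of saveset_id.

    pool_volumes maps volume name -> set of saveset IDs that already
    have an instance on that volume (complete OR aborted).
    """
    return sorted(
        volume
        for volume, savesets in pool_volumes.items()
        if saveset_id not in savesets
    )


# One volume in the pool, holding the aborted clone instance.
# "ss-1001" is a hypothetical saveset ID.
pool = {"800829L4": {"ss-1001"}}
print(eligible_volumes(pool, "ss-1001"))  # no eligible volume: the clone waits

# Adding a fresh (hypothetical) volume gives the clone somewhere to go.
pool["800830L4"] = set()
print(eligible_volumes(pool, "ss-1001"))
```

In this model, unmounting and remounting the original volume changes nothing, because the aborted instance is still recorded against it.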

So what do you do when this error comes up?

  • If you’re backing up to disk, an aborted saveset should normally be cleared up automatically by NetWorker after the operation is aborted. However, in certain instances this may not be the case. For NetWorker 7.5 vanilla and 7.5.1.1/7.5.1.2, this should be done by expiring the saveset instance – using nsrmm to flag the instance as having an expiry date within a few minutes or seconds. For all other versions of NetWorker, you should just be able to delete the saveset instance.
  • When working with tape (virtual or physical), the most recommended approach would be to move on to another tape, or if the instance is the only instance on that tape, relabel the tape. (Some would argue that you can use nsrmm to delete the saveset instance from the tape and then re-attempt the operation, but since NetWorker is so heavily designed to prevent multiple instances of a saveset on a piece of media, I’d strongly recommend against this.)

Overall it’s a fairly simple issue, but knowing how to recognise it lets you resolve it quickly and painlessly.

 

Is your backup system a unicorn, or is it a horse?

If you’re thinking I’m off my rocker asking such an odd question, don’t worry, I haven’t temporarily taken leave of my senses. To assure you that I haven’t taken an odd turn, this is a continuation of a previous post – How Complex is your Backup Environment?

The question of horses and unicorns covers the introduction of artificial design requirements or operational procedures that see something that should be as simple as a horse turned into something as unique, fragile and irreplaceable as a unicorn.

You see, there’s lots of horse handlers out there. Even people who aren’t necessarily trained to look after horses have a good idea of what to do for them. I once had a horse as a kid, and she was relatively easy to look after, particularly since I had a big field to let her roam about in. She practically looked after herself. So if someone turned up tomorrow and asked me to look after their horse for a week, I’d have a fairly good idea of how to look after it, even if I’d never dealt with that sort of horse before.

Give me a unicorn and I’d be in another boat. Literature tells us they’re fragile things. Heck, they’ll only let certain people go anywhere near them, which seriously reduces the scope of the adult population who can care for their needs.

Not only that, it’s easy to find practical advice about handling horses on the internet … of unicorns though it’s a different story: it’s all theoretical stuff. It’s all supposition and old wives’ tales mixed in with a liberal dosage of imagination. That doesn’t beget practical tips on handling!

If you’re an IT manager, you have to accept that staff will periodically move on. Some companies employ more contractors than permanent staff with the expectation that staff will regularly move on. You want systems to be straightforward so that:

  • New staff become productive as soon as possible;
  • New staff don’t make serious (or catastrophic) mistakes because someone before them broke the law of least astonishment.

This doesn’t just apply to IT management. As an IT worker, you similarly:

  • Don’t want to go into a position where you spend your first three months decoding someone else’s spaghetti system.
  • Don’t want to go into a position where you don’t notice someone has violated the law of least astonishment and make a … boo boo.

So you see, it’s all about unicorns and horses. If you go into a job saying that you’ve worked with horses, or at least have passing familiarity with them, then there’s a good chance that regardless of whether you’re presented with a Shetland Pony, an Appaloosa or a monster sized Draught Horse, you’ll be able to muddle through the process. If you’re presented with a fragile unicorn, your chances of muddling through aren’t so good.

Do yourself a favour: make sure your backup system is a horse, not a unicorn.

 

Covered in several places last week, including The Standalone Sysadmin, was the story about Dell updating their RAID firmware/systems on the latest PowerEdge servers to block the use of non-Dell supplied disks.

The offending support letter from Dell (quoting as per Standalone Sysadmin) reads:

Howard_Shoobe at Dell.com Howard_Shoobe at Dell.com
Tue Feb 9 16:17:54 CST 2010

Thank you very much for your comments and feedback regarding exclusive use of Dell drives. It is common practice in enterprise storage solutions to limit drive support to only those drives which have been qualified by the vendor. In the case of Dell’s PERC RAID controllers, we began informing customers when a non-Dell drive was detected with the introduction of PERC5 RAID controllers in early 2006. With the introduction of the PERC H700/H800 controllers, we began enabling only the use of Dell qualified drives.

There are a number of benefits for using Dell qualified drives in particular ensuring a positive experience and protecting our data.

Now, there’s been a bit of disquiet on that last sentence above – “our data”, in particular. I’m willing to ignore this, as I can readily believe this would have just been a typo or slip on behalf of the technician.

But I’ll cover the other aspect – the more pertinent aspect – denying access in servers for non-Dell drives.

This is nothing more than a PDTD – Profit Driven Technical Decision. And one based on a false economy.

Now, I can understand why enterprise storage vendors take this strategy. That’s regardless of who the enterprise vendor is. EMC, NetApp, HP, etc. – when it comes to enterprise SANs and NAS units, I’d consider this fairly appropriate.

We’re not talking enterprise SANs and NAS units though. We’re talking about DAS. You know, the cheap storage people opt for when their requirements aren’t high enough to warrant a SAN or NAS, or when they have a business too small to warrant enterprise class storage.

DAS is not about extreme cost – or at least, it shouldn’t be. It’s not about paying an arm and a leg for 2TB of storage. (For that matter, comparatively, neither are enterprise SAN or NAS – they’re about building high quality systems from the ground up.)

Dell might very well argue that they have to do a little more work to support non-Dell drives (which may possibly mean non-Dell firmware) within their RAID system. This is the heart of a PDTD – there’s a small element of technical truth to the argument, but the real heart of the argument is not a technical one, it’s about profit. Every server – indeed every desktop and laptop – manufacturer charges a premium for the hard drives they sell in comparison to buying those drives outright. If you want absolute simplicity and are prepared to pay for it, you buy the system you want with the storage you want from the supplier you want at the price they want. Particularly if you’re a smaller IT shop, what you want is to be able to buy a “basic” shell that has good warranty and then tweak it and add to it as required to suit your budget.

The effects of this decision on Dell will be subtle, given its current state. It’s made a reputation for being cheap and cheerful, building its business model on delivering systems faster and cheaper than its competitors. It has bigger problems, now that its competitors have caught up (and for some, overtaken it) on both these fronts, so differentiating business loss as a result of this decision vs business loss because their model has been under a sustained attack and they’ve been unable to adequately respond is not going to be easy.

But it will, at some level, hurt them. I once sat in a meeting where a particularly … stubborn … IT manager said that he’d never authorise the purchase of Dell equipment again after it took them 3 months to send out a missing bezel for a server he’d purchased in his last job. He was quite vitriolic.

Blocking extra market drives in a DAS environment is significantly more annoying than failing to send out a bezel. There’s going to be a lot of IT staff out there who have, say, recommended Dell servers with the intention to install third party drives for DAS storage who are going to be suddenly looking bad in front of their managers. This does not create good customer experiences, and such experiences carry from job to job. The cumulative effect of this decision in future sales shouldn’t be ignored. If I were a Dell shareholder at the moment, I wouldn’t be happy with their decision, I’d be … aggrieved.

 

There should be more software installed on your NetWorker server than just the operating system and NetWorker. In order to get the most out of it, you should have a toolkit of utilities and applications that are there, at your beck and call, to help you get the most out of your backup system.

It doesn’t matter whether you’re on Windows, Linux or Unix. Like Batman’s utility belt, having some tools will help you go beyond a standard NetWorker install.

What I’ll do is outline what my NetWorker utility belt would look like, and then let others comment on what they’d declare as the essentials for themselves. Here’s what I advocate as “must haves” when installing NetWorker:

  • An advanced scripting language – in my case, Perl.
  • SMTP mail (outgoing) from the backup server.
  • SSH (outgoing) from the backup server. (On Windows, this implies use of a bare cygwin install, etc.)
  • IDATA Tools – I kid you not, I’m not saying it just “for sales”; I’ve been working on these tools for years and they’re such second nature for certain operations that unless I’m running up a lab server for only a single test, they even get installed on all my test systems too.
  • The “tail” command; whether it’s installed by default on Unix, or added as a single command on Windows or added as part of a cygwin install on Windows, I can’t go without tail.
  • A web browser – I know that sounds like a given, but on headless enterprise Unix systems, that means ensuring that at least elinks is installed on the NetWorker server itself.
  • A tool for viewing potentially large log files. My tool of choice is usually vi, but I’m a grouchy old Unix user.

So, they’re my “absolutes” – or to be more correct, they’re the tools I’ll either (a) want to automatically install or (b) automatically miss if they’re not installed when I step up to a NetWorker server.
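
Where neither tail nor cygwin is available, a scripting language can stand in. The following is a minimal sketch in Python rather than Perl, using only the standard library; the function name is illustrative and the demonstration runs against a throwaway temporary file rather than any real NetWorker log.

```python
import os
import tempfile
from collections import deque


def tail(path, count=10):
    """Return the last `count` lines of a text file as a list of strings.

    The deque keeps only `count` lines in memory, so this copes with
    large log files, although it still reads the file from the start.
    """
    with open(path, "r", errors="replace") as handle:
        return list(deque(handle, maxlen=count))


# Demonstration on a temporary 100-line file.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as demo:
    demo.writelines(f"line {n}\n" for n in range(1, 101))

last_three = tail(demo.name, 3)
os.remove(demo.name)
print(last_three)
```

This does not follow a growing file the way tail -f does; it only covers the "show me the end of the log" case.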

Does this somehow detract from NetWorker? Of course not. Most of those, as you’ll see, are about useful situations around the backup product rather than direct modification of it. I.e., they’re about system process tools. Those that are to do with scripting should be welcomed – I’d take any backup framework product over any monolithic backup product any day!

So, what’s in your utility belt? Or what do you wish was in your utility belt for NetWorker?

 

Over at Xiotech’s blog, there’s an interesting piece about the evolution of 2.5″ drives in enterprise storage titled The Great Shrinking Disk Drive.

I’m not 100% convinced of Xiotech’s argument, but over the years I’ve seen increasing use of 2.5″ drives in enterprise computing – particularly to decrease the footprint and power requirements for DAS in rack-mount servers, etc.

 

We’re used to seeing EMC and NetApp punch it out in the blogosphere and tweetverse, with the occasional pause from both of them to have a minor dust-up with the likes of HP, IBM or HDS, but component manufacturers don’t seem to normally go for such … abrasive^H^H^H^H^H^H^H^Hdedicated … web behaviour.

A couple of weeks ago though, Emulex decided to take it to QLogic over the running heat of their cards. QLogic, they claimed, runs some cards so hot that you could literally fry an egg on them (if, as it turns out, you had a really small frying pan and a small piece of egg).

Well now, QLogic is responding, but this time not with Youtube videos or catty ripostes, but through the courts. According to The Register and a variety of other sources, QLogic has filed a lawsuit against Emulex in the USA alleging deceptive advertising and claims that are harming their business.

Whether there’s any merit in Emulex’s claims will now likely be tested in a courtroom setting. Regardless of whether Emulex were actually correct with their claims, QLogic has certainly guaranteed those claims will get a lot more attention now that they’re suing.

 

I frequently work in support – I help a plethora of companies that have NetWorker issues, and I enjoy doing that work because it’s about fixing their issues and either getting them up and running again (if it was a serious issue), or helping them with something they’d not done before.

In short, I like helping people.

One thing I’ve occasionally heard over the years goes along the lines of:

“I don’t care whose problem it is, I want you to fix it.”

This is normally directed by an exasperated IT manager at a bunch of one or more vendors/support providers during a long running issue where different groups believe that the problem originates from different locations outside of their contracted support realm. Thankfully any time I’ve been involved in this it’s been as an integrated support provider who (like the customer) has been trying to get the disparate vendors to stop finger pointing. So I’ve got no doubt that there are times when people say this that it’s fully justified.

I’d like to suggest though that sometimes it’s not fully justified; sometimes it’s not my problem – sometimes it’s not someone else’s problem. Sometimes it’s your problem.

This is a bitter pill to swallow. Let me sum up where it ceases to become someone else’s problem with a mangled quote:

The joy of a cheap price will have long faded when the realisation of a poor choice sets in.

I am sorry; I’ve searched high and wide for the original form of this quote, but I’ve not been able to find the original writer, or the original exact words, so I’m hoping I haven’t stretched it too far beyond its original meaning.

So where does the above quote come into play when someone has just pulled out the “I don’t care whose problem it is, I want you to fix it” card?

It comes into play in situations where:

  1. Critical components of your production environment aren’t under a support contract. (E.g., operating systems, databases.)
  2. Staff are not sent on or otherwise given access to critically important training.
  3. Staff are assigned tasks outside of their skillset without mentoring to help them reach that point.
  4. Against all advice, a bleeding edge solution was purchased.
  5. Without checking compatibility guides, disparate software/hardware/components were purchased.

I’d argue that in each of those situations, there is a good chance that some leeway should be given when various partners and vendors start finger pointing. Let’s go through each of those items:

Critical components aren’t under a support contract

It doesn’t matter if you’ve got storage support contracts, and hardware support contracts and individual application support contracts if core components, such as operating systems don’t have support contracts. Support isn’t a “shade of grey”; it’s binary. You either have it or you don’t. Choosing not to have part of it implicitly reduces the effectiveness of other parts of it. If an application or hardware support provider says to you “we think it would be wise to escalate this to <your OS vendor> as well for their feedback”, it’s not necessarily their fault if your response is “we don’t have support for <OS>”. Even more so, if they know that there’s a known issue with the unsupported component, it’s usually unrealistic to expect them to provide a workaround/solution beyond that.

Untrained staff

This is something I make a big point on in my book, and I want to be clear that I’m not talking about magical certifications but honest to goodness training. Needless training is wasteful, but consider this: if someone is escalating issues that any person with adequate training would already know the answer to, then not sending them on training is a false economy. I.e., they spend time not knowing what to do, then they spend time escalating the issue, then they spend time working with the vendor to fix the issue. It doesn’t take many of these incidents to actually eclipse the time it would take to send them on training.

Unskilled staff

There’s an old UI and system design principle:

The system should be as simple as possible, and no simpler.

This means that the system should be designed for the target audience or users. It doesn’t mean that a nuclear power plant’s control systems should be so simple that a janitor or lunch-room worker can fully operate it. (In actual fact, when you break this rule and start designing systems to be simpler than they should be, you start making the system more complex and harder for experienced user interaction, and more susceptible to “black box” failure.)

The net result of this is that staff who are assigned particular roles either should have the skills for those roles, or have someone available to mentor them to help them get their skill levels up to the required level.

My core case in point in this is that in situations where backup administration is done by system administrators, it’s very common to see the “newbie” or the most junior person get that task. I know, I’ve been there – it’s how I started in backup.

It’s also entirely, ahem, “ass-backwards”. A junior person is least likely to understand the potentially complex interrelationships between operating systems, applications, storage systems, performance tuning and networking requirements of the average backup system. This is a natural fit for the most senior staff rather than the most junior staff.

To put it bluntly: if you put the wrong person in the job without suitable mentoring provisions in place and they make a serious mistake, it’s not their fault, nor is it the fault of your support vendors, it’s your fault.

Bleeding Edge Solutions

In any competitive bidding process, it’s highly likely that at least one solution proposed will be bleeding edge. Sometimes it will be because the only potential way of achieving everything you want to do is by going bleeding edge. Equally as often it will be because it’s a common sales strategy: sell the thing with the most shiny bits.

Bleeding edge is thusly named for a good reason: if you slip up, it’ll cut you.

Now, if you’re demanding that everyone involved in the sale of a bleeding edge solution drop the finger pointing and start resolving the issue, that’s likely to be perfectly valid. But spare a thought for vendors on the periphery who weren’t involved in the sale but somehow have to continue to support the bleeding edge solution. And spare a thought for the people who explicitly told you that it was a risky solution.

Incompatible Systems

There’s nothing wrong with having policies to, or simply deciding to, purchase different components for a solution from a variety of suppliers and vendors.

However, as I mention in my book, when you do this, it pushes the onus of responsibility onto you to do one of the following:

  • Explicitly confirm compatibility of all disparate components.
  • Explicitly tell all vendors the overall solution and components to be deployed, and explicitly state that what they sell must be known to be compatible.

The enterprise IT realm in particular is not plug and play. Just because X works with Y doesn’t mean that X works with Y2, and it doesn’t mean that just because X works with Y and Y works with Z that X will work with Z as well.
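
The point about compatibility not being transitive can be made concrete. Below is a small sketch, assuming you have collected a list of explicitly certified pairings from the relevant compatibility guides; the component names X, Y and Z are the placeholders used above, and the function name is illustrative.

```python
from itertools import combinations


def unconfirmed_pairs(components, certified):
    """Return every pair of components with no explicit compatibility statement.

    Pairs are treated as unordered, and nothing is inferred by chaining:
    X-Y plus Y-Z does NOT imply X-Z.
    """
    certified_set = {frozenset(pair) for pair in certified}
    return [
        pair
        for pair in combinations(components, 2)
        if frozenset(pair) not in certified_set
    ]


components = ["X", "Y", "Z"]
certified = [("X", "Y"), ("Y", "Z")]
gaps = unconfirmed_pairs(components, certified)
print(gaps)
```

Here the only gap is the X and Z pairing, which is exactly the combination nobody has confirmed and therefore the one you need to chase up before purchasing.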

Why do I care? Why should you care?

Why do I care about this, and why should you care about this? Business is evolving. It’s no longer about traditional vendor/vendee or supplier/customer relationships. It’s about building business partnerships based on trust and a mutual desire for common success. As we know from our personal lives, partnerships that are entirely one sided don’t work.

The old business model confidently maintained that “the customer is always right”. This however loses relevancy in a true partnership. In a business partnership as well as a personal one, we know that true strength comes from each side acknowledging the needs and goals of the other side and working out how to mutually satisfy those goals without detriment to either.

© 2012 The NetWorker Blog