As a system administrator, I loathed being audited. Not because I feared that it would expose holes in the security or policies of my systems, but rather because, for the most part, auditing was conducted by incompetent staff at big name auditing/taxation companies. Now, I have no doubt that when it comes to their original auditing domains, namely taxation and accounting, such companies do usually offer excellent services.

For the most part though, I’ve found that for anything outside of absolute basic system administration reviews, such companies offer poor feedback that’s often erroneous to the point of being farcical. (For example, having a password field of ‘x’ in /etc/passwd pointed out as being “insecure”, the auditor having failed to note the use of shadow password files…)

So, having undoubtedly just annoyed quite a few people, I’ll go on to explain why auditing shouldn’t be a terrible experience if you’re in the storage and data protection domain. More importantly, I’ll explain how auditing can be changed from an unpleasant experience where it’s necessary to explain to management they wasted their money, to one where you, and your company, get value out of it.

The best auditing is conducted by experts in the field. Not the field of auditing, but the field of what you want audited. So, in order to get a decent and useful audit of your storage and data protection systems, you need to follow these rules:

  1. It should be done from someone outside your company.
  2. It should be done by someone who won’t be assigned any work as a result of the audit.
  3. It should be done by someone with credentials (e.g., registered partners of companies, or like-companies for the products you’re using).

This isn’t to say that whoever does the audit should never get any further work from your company, but rather, if they make recommendations that you have to buy X, Y and Z to resolve the issues they’ve highlighted, they’re doing it out of honesty because they won’t get to sell them to you.

Moving on, there are a few more rules you should also follow in order to get a successful audit:

  1. You must assign a champion within your company who has sufficient authority to ensure that the staff conducting the audit get access and feedback they require.
  2. You must provide direction to the auditing company – that is, outline what you need investigated and the structure of the results you want. However, this can be dangerous if mishandled, so most importantly follow the next rule…
  3. You must provide freedom for the auditing company to expand beyond your direction to encompass and point out other issues that you may not have anticipated in your directional statement.

Finally, the audit process should start with a brainstorming/whiteboarding session, and the results should be presented in a similar session.

There’s more to auditing than the above, but if you step away from the ‘regular’ auditing companies that can offer little assistance in storage and data protection, you will actually get a quality result.

 

I’ve been using nsradmin for the last 12 years. So when I read about a new utility, ‘jobquery’, in NetWorker 7.5 that’s designed to work in a similar way to nsradmin but query the jobs database instead of the resource database, I was looking forward to giving it a go. (This was due in no small part to lingering disappointment over how nsrjobsd has been practically a black box since it was introduced.)

So I was rather … disappointed when I ran jobquery for the first time and it appeared to hang.

Running, say:

# jobquery

Appeared to hang.

Running, say:

# jobquery -s server

Also appeared to hang.

Running, say:

# jobquery -s server print

Didn’t return a thing.

So I thought maybe this is a tool that was let out of the barn a little too soon, and even went to the point of logging a question case with EMC about it. After all, it appeared to not really work at all.

It turns out I’d not anticipated that there might have been a simpler problem with jobquery, that being … less than desirable interface design. Let’s be blunt: if you write an interactive “shell” style query interface, it should tell the user when it’s waiting for input.

The problem with the initial invocation attempts was a simple one – it wasn’t hanging, but instead, it was waiting for input without telling me it was waiting for input. Consequently, I’m currently asking EMC to file a bug about this. I know the difference between RFEs and bugs – an RFE is a request for enhancement, or to change something that’s there by design, but a bug is a problem with the actual implementation. Now, someone might argue that maybe this should be filed as an RFE if it was originally designed to not show any prompt, but my take on it is that any interface that doesn’t differentiate between “waiting for input” and “processing/stuck” is, in actuality, a buggy design.
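To make the design point concrete, here is a minimal sketch of a query loop that announces when it is waiting for input. It is not jobquery and does not use jobquery's syntax; the function name query_loop, the 'query> ' prompt text and the 'received:' output are all made up for illustration.

```shell
# Hypothetical query loop: NOT jobquery itself, only an illustration
# of signalling "waiting for input" to an interactive user.
query_loop() {
    while :; do
        # Show a prompt only when stdin is a terminal, so piped or
        # scripted input does not get prompt text mixed into its output.
        if [ -t 0 ]; then
            printf 'query> ' >&2
        fi
        # Stop cleanly at end of input.
        IFS= read -r line || break
        case "$line" in
            quit) break ;;
            '')   continue ;;
            *)    printf 'received: %s\n' "$line" ;;
        esac
    done
}
```

Run from a terminal, the loop shows 'query> ' before each read, so "waiting for input" can't be mistaken for "stuck". Fed from a pipe, it stays quiet and simply processes the lines it is given.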

Oh, and jobquery just doesn’t like being told what to do in relation to queries on the command line, even though the man page says it will accept it.

If you’ve been trying to use jobquery and not getting much satisfaction, try it again without waiting for a prompt. Once I got past the lack of prompt, I was quite excited by the promise of jobquery – in fact, I’m hoping that a future release will actually implement the ability to even stop jobs – e.g., kill off a single saveset, or even say, pause a clone/stage operation.

No doubt jobquery needs some improvements, but it wasn’t quite the aborted attempt I’d been initially worried about, and you should give it a go – you’ll be pleasantly surprised.

 

[This is something I'll revisit from time to time and update.]

There are some commands you should always be aware of with NetWorker, regardless of whether you like the command line or not. Here are some of them:

# nsrim -X

This does a check of the save sets against volumes – while the EMC documentation suggests you should only need to run this after a crash, here’s an alternate policy: you need to run this at least fortnightly, or preferably weekly. (If you are running low on media, or low on space on disk backup units, running nsrim -X can also in a pinch be used to force any recycling that may be ready but ordinarily wouldn’t be processed until around midnight.)

# nsrck -m

This performs another style of consistency check on the media database, and corrects errors when they occur. It actually does cumulative repairs, so when you need to run it, you should be prepared to run it up to 3 or 5 times before logging a support call. When do you run it? After a NetWorker crash – if your server does crash, you really, really need to run this command. It should give you no output – if it gives you output, it means there have been issues, and you want to run it again and see whether all issues have been corrected.
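The "run it again until it's quiet" routine can be sketched as a small shell function. This is only an illustration of the loop described above: the name run_until_clean and its messages are made up, and it assumes that an empty output really does mean a clean pass.

```shell
# Run a command repeatedly until it prints nothing, up to a maximum
# number of passes. Returns 0 once a pass is silent, 1 otherwise.
# run_until_clean is an illustrative name, not a NetWorker tool.
run_until_clean() {
    max_passes=$1
    shift
    pass=1
    while [ "$pass" -le "$max_passes" ]; do
        # Capture both stdout and stderr from this pass.
        output=$("$@" 2>&1)
        if [ -z "$output" ]; then
            echo "pass $pass: no output, check is clean"
            return 0
        fi
        echo "pass $pass: check reported issues"
        pass=$((pass + 1))
    done
    echo "still reporting issues after $max_passes passes; log a support call"
    return 1
}

# Intended use on an idle NetWorker server:
#   run_until_clean 5 nsrck -m
```

The function throws away the captured output to keep the sketch short; in practice you would want to keep each pass's output so it can be attached to a support call.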

# nsrck -L3

This is a good mid-level check to run against the indices that can clear up redundant entries and act as a good sanity check against indices. It’s also a good starting point if you’re having intermittent backup issues that aren’t network/hardware related, and like nsrim -X, it’s a command I recommend running at least fortnightly, or even weekly.

(All of these commands should be run when there’s no backup activity.)

 

Parallelism in NetWorker is effectively multiplexing by another name. There are three areas where you have traditionally been able to set this:

  • Client parallelism – how many savesets a client can simultaneously send in a backup
  • Server parallelism – how many savesets a backup server will simultaneously allow to be active for the purposes of backup
  • Target sessions – the optimal number of savesets you want running to a backup device

As of NetWorker 7, we saw the introduction of:

  • Savegrp parallelism – the maximum number of backup savesets that can be running for a particular group.

As of NetWorker 7.3, we saw the introduction of:

  • Max sessions – the maximum number of savesets you’ll permit running to a backup device

Somewhere in the 7.x tree – I don’t recall when – there was another parallelism setting introduced, this time for the pool:

  • Max parallelism – The maximum number of savesets that can be simultaneously written to media belonging to a particular pool.

Also, we’ve seen the introduction of:

  • Max active devices – a setting maintained in the device resource, but shared by all devices common to a single storage node; it refers to the maximum number of devices that can be active on the storage node at any one time.

All of these settings serve one key purpose – they let you tune the performance of your NetWorker datazone.

Note: It’s worth pointing out something fairly critical here – all of these settings affect backup savesets, they don’t affect recovery savesets. NetWorker will always allow new recovery savesets to be initiated, even if it can’t immediately facilitate the recovery.

Client parallelism is actually one of the most difficult parallelism settings to tune, and I’ve been somewhat disappointed by the new “default” setting of 12 (up from 4) in NetWorker 7.4.x onwards. I strongly believe it should be set to 1 for all new clients so as to ensure people think about the performance implications before they increase it.

I won’t go further into client parallelism here – I covered it in considerable detail in my book, so if you want details of evaluating client parallelism settings you should check it out*.

Server parallelism is a lot easier to understand – how powerful is your server, and how many devices do you have? In an optimal environment, your backup environment should be able to handle the processing of enough streams to keep every single backup device in your datazone streaming at full speed**. We’ll get to this in a moment, but optimally you want to keep that to as few savesets as possible – i.e., in a perfect world, we’d like to be able to keep every backup device running at full speed from individual savesets. This doesn’t always happen though, so you need to be able to plan for the appropriate number of savesets. 

(Even when the backup server is not actually backing anything up (e.g., all client backups are conducted by storage nodes, with the backup server just acting in a director role), every active saveset does consume resources on the backup server – this includes general coordination resources as well as index resources, etc.)

Device target sessions is an interesting one. It’s not actually a hard limit. In the first pass, it refers to how many savesets should be running on a device before new savesets are started on the next device. So, if every device in the environment has target sessions of 4, then one by one NetWorker will want to get 4 savesets running to each device. But what happens when every device is running 4 savesets, and NetWorker needs to start a new saveset? In that instance, NetWorker just ‘cycles’ through all the devices, tacking on another saveset to each device until they’re say, all running 5 savesets. Then if another comes along it starts building each device up to 6, and so on. In effect, it’s a primitive form of load balancing. 

The newly introduced setting of max sessions for devices does act as a hard limit – a device will never exceed the number of active savesets as defined by the max sessions parameter; this by default is set to 512, effectively not placing a limit on the number of sessions running to the device***.
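The interplay between target sessions (a soft fill level) and max sessions (a hard limit) can be modelled in a few lines of shell. This is a simplified model of the behaviour described above, not NetWorker's actual scheduler; pick_device is a hypothetical helper, and choosing the lowest-numbered device on a tie is an assumption.

```shell
# Simplified model only. Usage: pick_device TARGET MAX COUNT1 COUNT2 ...
# Prints the 1-based index of the device that takes the next saveset,
# or 0 when every device is already at max sessions.
pick_device() {
    target=$1
    max=$2
    shift 2

    # First pass: fill devices in order until each reaches target sessions.
    i=1
    for count in "$@"; do
        if [ "$count" -lt "$target" ] && [ "$count" -lt "$max" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done

    # Second pass: every device is at or above target, so add to the
    # least-loaded device that is still below the hard max sessions limit.
    best=0
    best_count=0
    i=1
    for count in "$@"; do
        if [ "$count" -lt "$max" ]; then
            if [ "$best" -eq 0 ] || [ "$count" -lt "$best_count" ]; then
                best=$i
                best_count=$count
            fi
        fi
        i=$((i + 1))
    done
    echo "$best"
}
```

For example, with target sessions of 4 and the default max sessions of 512, "pick_device 4 512 4 2 0" picks device 2 (still filling up to target), while "pick_device 4 512 5 4 4" also picks device 2 (cycling every device up to 5). Only when all devices sit at max sessions does the model refuse to place the saveset.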

So what about the other settings? Where would you use them?

The savegrp parallelism setting is a great option to use if you have multiple groups running in such a way that they overlap, and one or more of the groups has large numbers of clients. You see, traditionally, the code for a group assumes that when it starts, it can query the server’s parallelism setting and start up to that many savesets. However, if you’ve got multiple groups running, then you could exceed the number of permitted savesets. This can result in timeouts or failures. If however you’ve say, got server parallelism of 64, and one group with 100 clients, and two other groups with say, 4 clients each, you might set the large group to have parallelism of 60, and the other two groups to each have parallelism of 2. This would enable all three groups to simultaneously run.
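The arithmetic behind that example is simply that the group parallelism values for overlapping groups should add up to no more than the server parallelism. A minimal sketch of that check, using a made-up helper name (groups_fit) and the numbers from the example:

```shell
# Hypothetical helper. Usage: groups_fit SERVER_PARALLELISM GROUP_PARALLELISM...
# Succeeds when the overlapping groups fit within server parallelism.
groups_fit() {
    server=$1
    shift
    total=0
    for p in "$@"; do
        total=$((total + p))
    done
    if [ "$total" -le "$server" ]; then
        echo "ok: groups total $total of $server server sessions"
    else
        echo "over: groups total $total exceeds $server server sessions"
        return 1
    fi
}

# Server parallelism 64; one large group at 60, two small groups at 2 each.
groups_fit 64 60 2 2
```

Leaving the large group uncapped would effectively mean 64 + 2 + 2 against a server parallelism of 64, which is the oversubscribed case the setting is there to avoid.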

Max parallelism for pools is not something I’ve really played around with. However, I can immediately imagine it would be useful if you had specific pools for disk backup units that are all connected via the same FC or SCSI bus – you could set a maximum parallelism setting for the pool so you don’t swamp the interface. That’s just one example after only a couple of seconds of thinking about it, so I know there’ll be other options there.

Max active devices for storage nodes is again something I’ve not played around with, but, I can see that I’d particularly make use of it in a situation where the actual storage node machine itself is not capable of driving all the backup devices attached to it at full speed; in this instance, limiting the number of active devices would allow you to say, have 3 of 6 devices running at full speed, rather than 6 of 6 devices running at a very sub-optimal speed.

So, there’s a good starting point on parallelism.

 


* Not necessarily to be construed as a sales pitch. I went to a lot of effort to explain all the factors of client parallelism in my book, and it’s far too long to repeat in a blog entry.

** By full speed, when referring to drives that do hardware compression, I refer to the streaming compression speed.

*** If you need devices that can handle more than 512 active sessions, I really want to sell you the arrays you’ll need to achieve it!

 

Ever need to adjust the browse/retention time for a saveset, but you’ve not been sure how to do so? Here’s how.

To change the browse or retention time, you’ll need to find out the saveset ID (SSID) of the given saveset. This can be done with mminfo.

For instance, say you had a backup done last night of a machine called ‘archon’ that has now been rebuilt, but you want to keep the old backup for much longer than normal – e.g., ten years instead of the normal 3.

First, to find out what you need to change, get a list of the SSIDs:

# mminfo -q "client=archon,savetime>=24 hours ago" -r name,ssid
 name                          ssid
/                              4036558666
/Volumes/TARDIS/Yojimbo        4019781450
/Volumes/Yu                    4003004234

(If you’re confused about that savetime command, see my other post here.)

Now, for each of those SSIDs that are returned, we’ll run a nsrmm command to adjust the browse and retention time*.

The basic nsrmm command for adjusting the browse and retention time is:

# nsrmm -S ssid -w browse -e retent

or, for a single instance of a saveset:

# nsrmm -S ssid/cloneid -w browse -e retent

Where the ‘browse’ and ‘retent’ values can be either one of the two following:

  • A literal date in US date format ** – e.g., “12/31/2019” for 31 December 2019.
  • A ‘fuzzy’ English worded date – e.g., “+10 years” for 10 years from today.

Note that (rather obviously), your browse time cannot exceed your retention time, and generally it’s recommended that you set browse time to retention time.

So in this case, you’d run for each SSID or SSID/CloneID you want to affect:

# nsrmm -S ssid -w "+10 years" -e "+10 years"

Which will look like the following, based on my mminfo output:

# nsrmm -S 4036558666 -w "+10 years" -e "+10 years"
# nsrmm -S 4019781450 -w "+10 years" -e "+10 years"
# nsrmm -S 4003004234 -w "+10 years" -e "+10 years"

It’s that simple.
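If there are more than a handful of savesets, typing one nsrmm line per SSID gets tedious. Here is a minimal sketch of a helper that reads mminfo output in the "name,ssid" layout shown earlier and prints the matching nsrmm commands. The function name print_nsrmm_commands is made up, and it assumes the first line of input is the header and that the SSID is the last field on each row.

```shell
# Hypothetical helper: turn "mminfo ... -r name,ssid" output on stdin
# into nsrmm command lines. It only prints the commands so they can be
# reviewed before anything is changed.
# Usage: mminfo ... | print_nsrmm_commands BROWSE RETENT
print_nsrmm_commands() {
    browse=$1
    retent=$2
    # Skip the header line; the SSID is the last field on each row,
    # which keeps working when a saveset name contains spaces.
    awk 'NR > 1 && NF > 0 { print $NF }' |
    while read -r ssid; do
        printf 'nsrmm -S %s -w "%s" -e "%s"\n' "$ssid" "$browse" "$retent"
    done
}

# Intended use:
#   mminfo -q "client=archon,savetime>=24 hours ago" -r name,ssid \
#       | print_nsrmm_commands "+10 years" "+10 years"
```

Printing rather than executing is deliberate: changing retention is not something to do blind, so check the generated lines against the mminfo listing before running them.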


* You can also do this against an instance of a saveset by using the SSID/Clone ID; to do that variant, request “-r name,ssid,cloneid”, then use the two numbers in the nsrmm command separated by a forward slash – ssid/cloneid.

** The restriction on US date format may have eased in 7.5. I’m going to do some additional playing around with locales sometime soonish.

 

Introduction

When it comes to servers, I love virtualisation. No, not to the point where I’d want to marry virtualisation, but it is something I’m particularly keen about. I even use it at home – I’ve gone from 3 servers, one for databases, one as a fileserver, and one as an internet gateway down to one, thanks to VMware Server.

Done rightly, I think the average datacentre should be able to achieve somewhere in the order of 75% to 90% virtualisation. I’m not talking high performance computing environments – just your standard server farms. Indeed, having recently seen a demo for VMware’s Site Recovery Manager (SRM), and having participated in many site failover tests, I’ve become a bigger fan of the time and efficiency savings available through virtualisation.

That being said, I think backup servers fall into that special category of “servers that shouldn’t be virtualised”. In fact, I’d go so far as to say that even if every other machine in your server environment is virtual, your backup server still shouldn’t be a virtual machine.

There are two key reasons why I think having a virtualised backup server is a Really Bad Idea, and I’ll outline them below:

Dependency

In the event of a site disaster, your backup server should be at least equally the first server that is rebuilt. That is, you may start the process of getting equipment ready for restoration of data, but the backup server needs to be up and running in order to achieve data recovery.

If the backup server is configured as a guest within a virtual machine server, it’s hardly going to be the first machine to be configured, is it? The virtual machine server will need to be built and configured first, then the backup server after this.

In this scenario, there is a dependency that results in the build of the backup server becoming a bottleneck to recovery.

I realise that we try to avoid scenarios where the entire datacentre needs to be rebuilt, but this still has to remain a factor in mind – what do you want to be spending time on when you need to recover everything?

Performance

Most enterprise class virtualisation systems offer the ability to set performance criteria on a per machine basis – that is, in addition to the basics you’d expect such as “this machine gets 1 CPU and 2GB of RAM”, you can also configure options such as limiting the number of MHz/GHz available to each presented CPU, or guaranteeing performance criteria.

Regardless though, when you’re a guest in a virtual environment, you’re still sharing resources. That might be memory, CPU, backplane performance, SAN paths, etc., but it’s still sharing.

That means at some point, you’re sharing performance. The backup server, which is trying to write data out to the backup medium (be that tape or disk), is potentially either competing for, or at least sharing, backplane throughput with the machines that it is backing up.

This may not always make a tangible impact. However, debugging such an impact when it does occur becomes much more challenging. (For instance, in my book, I cover off some of the performance implications of having a lot of machines access storage from a single SAN, and how the performance of any one machine during backup is no longer affected just by that machine. The same non-trivial performance implications come into play when the backup server is virtual.)

In Summary

One way or the other, there’s a good reason why you shouldn’t virtualise your backup environment. It may be that for a small environment, the performance impact isn’t an issue and it seems logical to virtualise. However, if you are in a small environment, your failover to another site is likely to be a very manual process, in which case you’ll be far more likely to hit the dependency issue when it comes time for the full site recovery.

Equally, if you’re a large company that has a full failover site, then while the dependency issue may not be as much of a problem (due to say, replication, snapshots, etc.), there’s a very high chance that backup and recovery operations are very time critical, in which case the performance implications of having a backup server share resources with other machines will likely make a virtual backup server an unpalatable solution.

A final request

As someone who has done a lot of support, I’d make one special request if you do decide to virtualise your backup server*.

Please, please make sure that any time you log a support call with your service provider you let them know you’re running a virtual backup server. Please.


* Much as I’d like everyone to do as I suggest, I (a) recognise this would be a tad boring and (b) am unlikely at any point soon or in the future to become a world dictator, and thus wouldn’t be able to issue such an edict anyway, not to mention (c) can occasionally be fallible.

 

In a previous post, I bemoaned the lack of nsrwatch on Windows. I thought it would be worthwhile pointing out an example of where nsrwatch comes in handy, for the non-believers.

You’re on the road, you don’t have the option of pulling your laptop out, and you need to check on the state of your NetWorker server. Via a mobile phone ssh session, nsrwatch really is your friend here:

 

[Image: nsrwatch, via issh]

 

I know this is a minor quibble – and for the most part, minor quibbles sum up the issues that I have with NetWorker. However, it annoys the hell out of me.

Somewhere, someone in the development team at EMC got the bright idea for NetWorker 7.3 that for “usability” and “consistency” reasons, it would become necessary when depositing media to either:

  • Answer an inane question (i.e., sequence goes: put media in CAP, run nsrjb -d, get asked to answer ‘yes’ to whether you want to import media or not)
  • Remember to add a -Y option to the nsrjb -d command to automatically answer ‘yes’ to the inane question.

Now, I do have a problem with this. In the first case, core behaviour was changed, and I don’t like that. Nor do I see a valid consistency reason – in nsrjb you’ve always been required to answer ‘yes’ or to add a -Y to the command if you’re going to do something destructive, but depositing media shouldn’t be deemed a destructive action.

In the second case, administrators are going to get into the habit of automatically throwing a -Y option onto each nsrjb command they run. I’m the first to admit that where I need to, I do use -Y, but don’t advocate that people get into the habit of using it except for when it is really necessary.
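One way to keep the -Y habit contained, until the question goes away, is to wrap only the deposit operation. This is just a sketch of that idea: deposit_media is a made-up name, and the only nsrjb options it relies on are the -d and -Y already mentioned above.

```shell
# Hypothetical wrapper: confines the automatic "yes" to deposits only,
# so -Y never becomes a reflex on other nsrjb operations.
# Any extra arguments are passed through to nsrjb unchanged.
deposit_media() {
    nsrjb -d -Y "$@"
}
```

With this in a shell profile, a deposit is simply "deposit_media", and every other nsrjb command still prompts as it should.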

Thus, my preference would be that some release of NetWorker would drop the inane question when a deposit is done, and just let the administrator or person running the deposit get on with the activity of moving tapes.

 

I always like to know how to work with the command line, even on Windows. (As I say in my book, while a picture may be worth a thousand words, a GUI is not necessarily worth a thousand command line options.)

As such, when I want to stop and restart NetWorker on Windows, I find it faster on the command line than going to the services panel and scrolling through to the various processes.

Note: All of these options are documented in the NetWorker and License Manager administration manuals, but as always, there’s a difference between something that’s documented and something you can find, so I’m presenting these for those who are pressed for time.

There are lengthy names for each of the NetWorker client service, NetWorker server service, and NetWorker Management Console server service which you may think would make it inconvenient to stop and restart from the command line, but thankfully there are shortcut names as well. These are:

  • NetWorker Remote Exec Service: nsrexecd
  • NetWorker Backup and Recover Server: nsrd
  • EMC GST Service: gstd
  • NetWorker License Manager: lgtolmd

Now, if you’re wanting to quickly stop everything, you can rely on process dependencies and run:

C:\> net stop nsrexecd /y

That will shut down everything, since everything relies on the nsrexecd (client) process in order to run.

If you want to start NetWorker on a client, all you need to run is:

C:\> net start nsrexecd

If you’re wanting to start NetWorker on a server though, you should run:

C:\> net start nsrd

There’s a simple reason for this – nsrd relies on nsrexecd, so Windows dependency checking will start nsrexecd first for you.

Similarly, if you want to start the management console on a Windows machine, you’d run:

C:\> net start gstd

And if you’re running Legato License Manager, you’d run:

C:\> net start lgtolmd
 

There’s a new version of IDATA Tools available. If you’re looking for a way of turbocharging your NetWorker administration experience, you should check it out. Refer to the Blogroll links for requesting a trial version or purchasing.

© 2012 The NetWorker Blog