Sometimes it’s helpful to run NetWorker in debug mode – but sometimes, you just want to throw the nsrmmd processes into debug mode, and depending on your site, there may be a lot of them.

So, I finally got around to writing a “script” to throw all nsrmmd processes into debug mode. It hardly warrants being a script, but it may be helpful to others. Of course, this is Unix only – I’ll leave it as an exercise to the reader to generate the equivalent Windows script.

The entire script is as follows:

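#!/bin/sh
# Set the debug level on every running nsrmmd process.

# Refuse to run without a debug level.
if [ -z "$1" ]
then
	echo "Usage: $0 level" >&2
	exit 1
fi
DBG=$1

PLATFORM=`uname`

# Collect the PIDs of all nsrmmd processes.
if [ "$PLATFORM" = "Linux" ]
then
	PROCLIST=`ps -C nsrmmd -o pid=`
elif [ "$PLATFORM" = "SunOS" ]
then
	PROCLIST=`ps -ea -o pid,comm | grep 'nsrmmd$' | awk '{print $1}'`
else
	echo "Unsupported platform: $PLATFORM" >&2
	exit 1
fi

# Echo each command before running it so there is a record of what was changed.
for pid in $PROCLIST
do
	echo dbgcommand -p $pid Debug=$DBG
	dbgcommand -p $pid Debug=$DBG
done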

The above is applicable only to Solaris and Linux so far – I’ve not customised it for, say, HPUX or AIX, simply because I don’t have either of those platforms hanging around in my lab. To invoke, you’d simply run:

# dbgnsrmmd.sh level

Where level is a number between 0 (for off) and 99 (for … “are you insane???”). Running it on one of my lab servers, it works as follows:

[root@nox bin]# dbgnsrmmd.sh 9
dbgcommand -p 4972 Debug=9
dbgcommand -p 4977 Debug=9
dbgcommand -p 4979 Debug=9
dbgcommand -p 4982 Debug=9
dbgcommand -p 4991 Debug=9
dbgcommand -p 4999 Debug=9

Note that when you invoke dbgcommand against a sub-daemon such as nsrmmd (as opposed to nsrd itself), you won’t get an alert in the daemon.{raw|log} file to indicate the debug level has changed.
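Since the level is meant to be a number between 0 and 99, a guard along these lines could be added before the loop. This is only a sketch: valid_level is a hypothetical helper name, not part of NetWorker or of the script as originally written.

```shell
# Hypothetical guard: succeed only if the argument is a whole number from 0 to 99.
valid_level() {
    case "$1" in
        # Reject the empty string and anything containing a non-digit.
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le 99 ]
}

# Possible use inside the script, before the for loop:
#   valid_level "$DBG" || { echo "level must be 0-99" >&2; exit 1; }
```

Rejecting non-numeric input up front avoids sending a malformed Debug= value to every nsrmmd process on the host.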
 

Having observed Oracle’s strategy for a while now since the acquisition of Sun, and after many discussions with quite a few customers, it’s clear that Oracle is gunning for the “entire vertical” model. Quite simply, their primary focus now appears to be selling an entire solution to a company, starting with the low level storage, and extending all the way to the application tier.

There’s nothing wrong with that kind of strategy – so long as it doesn’t alienate companies who want to buy piecemeal.

Oracle however are most definitely alienating the educational institutions. Several institutions I deal with now have policies that require end-of-life Sun kit to be replaced with comparable Linux or Windows solutions, unless an absolute rock-solid business case can be built.

With educational support rapidly eroding, Solaris as a tent-pole Unix platform is well and truly dead.

[Original post below, 22-04-2010]

I was told yesterday that one of the changes Oracle has wrought at Sun is the killing of all educational discount programmes. Apparently while they’re still listed on the Sun websites, they’re unavailable. Another fascinating change is collapsing support programmes from multiple levels of varying cost to one single level.

From a Unix perspective, I grew up on Solaris, and I’ve always seen the Unix world split into two camps. On one side you’ve got HPUX and AIX, dominated by ‘smit’, and the other side was led by Solaris, with Tru64 close behind.

The HPUX and AIX approach to Unix has always been an interesting one. It’s about rigid controls, and it’s appealed to formal environments and procedure-oriented enterprises. I still remember a comment made by a senior BHP IT manager when I still worked in Newcastle. I’m modifying it slightly so that this article doesn’t get mired down in faeces:

In BHP IT Melbourne if you want to go to the toilet, you hold a meeting about it. In BHP IT Wollongong if you want to go to the toilet, you consult a huge procedure about doing it. In BHP IT Newcastle, you do it in the corridor while you keep going with your work.

I know, it’s a wee crass, but as much as anything I saw it as a statement about the platforms in use at the time – particularly Newcastle vs Wollongong. You see, the Newcastle Unix team was dominated by Solaris, with a few Tru64 boxes and a couple of HPUX boxes (hell, even a couple of AT&T boxes). The Wollongong team was dominated by AIX.

What it comes down to is that the administrator mentality behind Solaris is all about free thinking. Not in a hippy sort of way, but in a “hey, here’s a Unix. Do whatever the hell you like with it” sort of way. It’s the result of having year after year of students at Universities using Solaris because that was the cheapest and most flexible Unix for the Universities to deploy. In this case by “free thinking” I’m not referring to any OSS ideals, but to the notion that it’s a full Unix that isn’t constrained by what the vendor feels you should do with it.

I’m taking a roundabout way of getting there, but I think the greatest damage Oracle is doing to Solaris is making the entire Sun platform less attractive to educational markets. People tend to stick to the platforms they learn at University – at least for a while – and so the overall Sun educational discount programme has always been a very clever one: hook them while they’re still learning, teach them that they can use the platform for whatever they want, and they’ll keep coming back to it once they’re out in the work force. This becomes a very powerful drag-sales method. Graduates come out of University looking for jobs on the platforms they have experience with. As they become team leaders or middle managers, they continue to advocate those platforms unless there’s a very strong reason to void the emotional attachment they have to a platform. Net result? The discounts to the educational market are recouped through the full prices in the commercial market. (I’d suggest that at a desktop/laptop level, Apple has been working at this now for some time, and it’s starting to build momentum that Microsoft will have trouble halting.)

Oracle clearly don’t understand that drag-sales model as it has applied with Sun. By killing off educational discount programmes for the entire Sun platform and making the Solaris operating system more costly to install and support, they’re eroding the “use it for anything” base market and mind share that has always been so critical to the continuing popularity of Solaris. I’m sure HP and IBM are both very pleased with this new direction that Oracle is taking. Oracle is making the jobs of HP and IBM sales people that much easier.

If Oracle lock in their current strategy and force this change, Solaris as a tent-pole Unix platform is dead. Somehow, I doubt Oracle would even care.

 

The challenge

Recently a customer asked me if it is possible to install and use Networker on Opensolaris. Opensolaris itself is an open-source operating system based on the well-known Solaris. Opensolaris has some unique features such as ZFS (which offers features such as on-the-fly compression and on-the-fly deduplication) and COMSTAR (which enables the operating system to export its storage via FC-SAN and iSCSI).

Although Networker is not yet certified for Opensolaris (there is an open RFE to do that), it is certified for Solaris. So I tried to install 7.5.2, the most recent version at that time, with pkgadd on Opensolaris build 134; the installation ran as expected.

On first start nothing happened. It turned out nsrexecd requires two SSL libraries that are missing on Opensolaris:

admin@opensolaris:/# ldd /usr/sbin/nsrexecd
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       NOT FOUND
libcrypto.so.0.9.7 =>    NOT FOUND
libmp.so.2 =>    /lib/64/libmp.so.2

Checking the files, it turned out the libraries themselves are there but the version number does not match: nsrexecd requires 0.9.7, while Opensolaris ships with 0.9.8 (newer). So I linked the files accordingly. Checking again yielded:

admin@opensolaris:/# ldd /usr/sbin/nsrexecd
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       /lib/64/libssl.so.0.9.7
libcrypto.so.0.9.7 =>    /lib/64/libcrypto.so.0.9.7
libmp.so.2 =>    /lib/64/libmp.so.2

So from the library dependency point of view everything looked good and nsrexecd was able to start as well.
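The post does not show the exact commands used for the linking step, so the following is only a sketch of what it might have looked like. It assumes the 0.9.8 libraries live in the same directory where ldd later resolved the 0.9.7 names (/lib/64); link_ssl_compat is a hypothetical helper name. Note that a library linked under an older version name may not be binary compatible with what the program expects.

```shell
# Hypothetical helper: in the given directory, create 0.9.7-named symlinks
# that point at the 0.9.8 libraries already present there.
link_ssl_compat() {
    libdir=$1
    for lib in libssl libcrypto
    do
        # Only link when the newer library exists and the old name is free.
        if [ -f "$libdir/$lib.so.0.9.8" ] && [ ! -e "$libdir/$lib.so.0.9.7" ]
        then
            ln -s "$lib.so.0.9.8" "$libdir/$lib.so.0.9.7"
        fi
    done
}

# Assumed invocation on the system described above (run as root):
#   link_ssl_compat /lib/64
```

Using relative symlinks inside the same directory keeps the links valid even if the directory is reached through a different path.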

The next step involved an attempt to start a local save job:

admin@opensolaris:/#save /etc
61261:save: Failed initialize ports from nsrexecd on "opensolaris"
39078:save: RAP error: Service not available.
4196:save: Failed to get port range from local nsrexecd: Service not available.
3817:save: Using networker-server as server

/etc
/etc/hosts
[...]

A few error messages, but that was expected for the first save.

In a second step I tried to start a job from the Networker server itself. This job failed entirely. Looking at the logs, it seemed nsrexecd was not started on the client. So I (re)started nsrexecd on the client and initiated the save job from the server a second time. Nothing changed. The server complained about being unable to connect to the client.

On the client, nsrexecd was no longer running. That was even stranger because I had just restarted the process prior to starting the backup.

On subsequent tests I noticed that nsrexecd dies every time I invoke a save job – even a local save job.

So I ran some tests with debugging turned on:

admin@opensolaris:/# nsrexecd -D9
lg_stat(): Calling stat64().
[....]
[....]
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 68 Attempting to register 390113 (vers 1) service with portmapper (111)
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 60 Successfully registered service 390113 with portmapper (111)
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 23 mondaemon_check count 1
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 16 checking file ..
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 17 checking file ...
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 18 checking file sec.
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 26 checking file nsrladb.lck.
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 30 checking file product.res.lck.
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 28 @(#) Product:      NetWorker
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 34 @(#) Release:      7.5.2.Build.452
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 22 @(#) Build number: 452
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 47 @(#) Build date:   Thu Feb  4 22:35:03 PST 2010
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 29 @(#) Build arch.:  sol10amd64
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 53 @(#) Build info:   DBG=0,OPT=-O2 -fno-strict-aliasing
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 35 clu_is_cluster_host_lc(): ENTRY ...
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 30 lg_lstat(): Calling lstat64().
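The raw debug lines above are hard to scan. Judging purely from this output, each line appears to carry 15 space-separated metadata fields before the message: the second looks like a Unix timestamp, and the fifteenth matches the length of the message text. Under that assumption (it is an observation from these lines, not a documented format), a small filter can reduce each line to timestamp and message. The function name nsr_debug_messages is made up for this sketch.

```shell
# Hypothetical filter: print "<timestamp> <message>" for each debug line on stdin.
# Assumes 15 metadata fields precede the message, as in the output shown above.
nsr_debug_messages() {
    awk 'NF > 15 {
        msg = $0
        # Strip the 15 leading fields one at a time, keeping message spacing intact.
        for (i = 1; i <= 15; i++) sub(/^[^ ]+ +/, "", msg)
        print $2, msg
    }'
}

# Possible use:
#   nsrexecd -D9 2>&1 | nsr_debug_messages
```

Lines with fewer fields, such as the lg_stat() line at the top of the output, are dropped by this filter.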

When starting a save job (either locally or remotely) nsrexecd dies:

0 1270119985 2 0 0 2 654 0 opensolaris nsrexecd 2 %s 1 0 33 Found 390113 program on port 7937
0 1270119985 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 27 mondaemon_kill_check: entry
0 1270119985 2 0 0 2 654 0 opensolaris nsrexecd 2 %s 1 0 33 Found 390436 program on port 9327
0 1270119985 2 0 0 3 654 0 opensolaris nsrexecd 2 %s 1 0 84 RPC Authentication: RPCSEC_GSS negotiated GSS Legato as the authentication mechanism
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 53 auth_thread_inc_count(): 1 child threads are running.
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 21 clu_is_virthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 input hostname=opensolaris
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 43 clu_is_virthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 input hostname=opensolaris
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 24 input hostname=127.0.0.1
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 109 Failed to get user rights: Could not find authentication information for daemon number: 0, daemon instance: 0
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 28 input hostname=192.168.180.2
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 74 Adding ssnchnl:     session id = 2  ssn (pointer) = f62570  ops = 57e1a0    fd = 13
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 30 lg_lstat(): Calling lstat64().
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 69 RPC Authentication: admin/opensolaris@ authenticated using GSS Legato
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 70 RPC Authentication: Non-encrypted channel negotiated for ip: 127.0.0.1
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 39 Channel exited with status: (unknown) 0
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 39 Removing ssnchnl:   ssn = f62570    fd  = 13
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 53 auth_thread_dec_count(): 0 child threads are running.
Segmentation Fault (core dumped)

Further tests showed that locally initiated backups run more or less successfully (with some error messages) and are indeed recoverable. Remotely initiated backups do not work because nsrexecd crashes.

Analyzing the core dump left behind by nsrexecd yields:

admin@opensolaris:/nsr/cores/nsrexecd# pstack core
core 'core' of 654:     nsrexecd -D9
-----------------  lwp# 1 / thread# 1  --------------------
fffffd7fff21f783 t_delete () + 33
fffffd7fff21f42e realfree () + 5e
fffffd7fff21fbe2 cleanfree () + 52
fffffd7fff21ee61 _malloc_unlocked () + a1
fffffd7fff21ed86 malloc () + 2e
fffffd7fff2063be calloc () + 46
fffffd7ffe2c5291 netconfig_dup () + 21
fffffd7ffe2c4139 getnetconfigent () + d1
fffffd7ffe2de3e7 __rpc_getconfip () + 28f
fffffd7ffe2b85e1 getipnodebyname () + 29
fffffd7ffe338c88 get_addr () + 138
fffffd7ffe338883 _getaddrinfo () + 493
fffffd7ffe338b24 getaddrinfo () + c
000000000051258b lg_inet_pton () + 6b
0000000000475bb3 is_addr_match () + 33
0000000000475caf ???????? ()
00000000004f7c89 _authenticate_varp () + 1c9
00000000004f533d svc_dispatch_varp () + bd
00000000004f54b1 svc_getreq_poll_varp () + c1
000000000046b339 nsrexec_svc () + 449
000000000046f471 main () + 10a1
000000000045aafc _start () + 6c
-----------------  lwp# 2 / thread# 2  --------------------
fffffd7fff28dbba __pollsys () + a
fffffd7fff22bcca poll () + 62
000000000051d1ce lg_poll () + e
0000000000469a3c ???????? ()
000000000051e5a3 ???????? ()
fffffd7fff284ae4 _thrp_setup () + bc
fffffd7fff284da0 _lwp_start ()

Using mdb:

admin@opensolaris:/nsr/cores/nsrexecd# mdb /usr/sbin/nsrexecd core
Loading modules: [ libc.so.1 ld.so.1 ]
> $C
fffffd7fffde0810 libc.so.1`t_delete+0x33()
fffffd7fffde0840 libc.so.1`realfree+0x5e()
fffffd7fffde0880 libc.so.1`cleanfree+0x52()
fffffd7fffde08b0 libc.so.1`_malloc_unlocked+0xa1()
fffffd7fffde08d0 libc.so.1`malloc+0x2e()
fffffd7fffde08f0 libc.so.1`calloc+0x46()
fffffd7fffde0920 libnsl.so.1`netconfig_dup+0x21()
fffffd7fffde0950 libnsl.so.1`getnetconfigent+0xd1()
fffffd7fffde0990 libnsl.so.1`__rpc_getconfip+0x28f()
fffffd7fffde0a20 libnsl.so.1`getipnodebyname+0x29()
fffffd7fffde0b90 libsocket.so.1`get_addr+0x138()
fffffd7fffde0c40 libsocket.so.1`_getaddrinfo+0x493()
fffffd7fffde0c50 libsocket.so.1`getaddrinfo+0xc()
fffffd7fffde0cc0 lg_inet_pton+0x6b()
fffffd7fffde0e30 is_addr_match+0x33()
fffffd7fffde0e60 0x475caf()
fffffd7fffde0ea0 _authenticate_varp+0x1c9()
fffffd7fffde8f20 svc_dispatch_varp+0xbd()
fffffd7fffdf8fa0 svc_getreq_poll_varp+0xc1()
fffffd7fffdfcad0 nsrexec_svc+0x449()
fffffd7fffdffcd0 main+0x10a1()
fffffd7fffdffce0 _start+0x6c()

So from my first observations the crash has something to do with memory allocation/reallocation and with network functions (based on “netconfig_dup”). Due to my limited knowledge of libc and its internal functions I was unable to dig deeper.

Unsatisfied with the current state (locally initiated backups and recoveries work, remotely initiated ones don’t), I tried several things:
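To see the call chain at a glance, the pstack output can be reduced to bare function names. The sketch below relies only on the layout visible in the pstack listing above (address, function name, then "()"); pstack_frames is a hypothetical name, and the filter does not handle the mdb format.

```shell
# Hypothetical filter: print only the function names from pstack output on stdin.
# Frame lines look like "<address> <function> () + <offset>"; header lines are skipped.
pstack_frames() {
    awk '$3 == "()" { print $2 }'
}

# Possible use:
#   pstack core | pstack_frames
```

Read from top to bottom, the names show the allocator functions (t_delete, realfree, malloc) at the point of the crash, reached through the name-resolution calls made by nsrexecd.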

  • Networker client 7.6
  • Networker client 7.6.1
  • Disabling IPv6
  • Using dependent libraries from Solaris 10 x86
  • and so on

But without success. nsrexecd kept crashing.

By mistake I installed 7.4.5, and to my surprise it worked fine – even remote save jobs run perfectly smoothly.

I have not yet checked whether the newer SSL libraries are causing the problem. Judging from the stack trace I would tend to say so.

Conclusion

Although officially unsupported by EMC, Networker client 7.4.5 works fine on Opensolaris. Even ZFS as the file system is supported (it has been since 7.3.2).

Using version 7.5.x or 7.6.x causes nsrexecd to crash, which makes remotely initiated saves impossible, while locally initiated jobs run fine.

So if you need to back up your Opensolaris-based system, the author recommends using Networker client 7.4.5 rather than 7.5.x or 7.6.x.

About the author

Ronny Egner is working as a freelancer focused on Oracle databases, UNIX operating systems and EMC / Legato Networker. He is based in Germany (Europe) and is available for projects all over the world. His blog can be found at http://blog.ronnyegner-consulting.de.

 

I use Parallels quite a lot within my Mac environment, and recently tried to get Solaris/AMD 64-bit installed. Even on a Mac Pro system Solaris stubbornly refuses to install in 64-bit mode, picking the 32-bit kernel every time.

So after exhausting a lot of search options, I submitted a case to Parallels support – titled:

“Solaris installer does not recognise 64-bit CPU”

Overnight, I got the first email back from Parallels support, with this response:

Escalating this ticket to our next level of Support since the issue is regarding Linux.

I half-typed an email response to correct the engineer, but then I thought better of it. If I need to explain that Solaris isn’t Linux to a support engineer, then on second thoughts, I’d prefer to have my case escalated to an engineer who (hopefully) already knows this.

[2009-07-15 Edit]

The second level support engineer I got was much more savvy about the differences between operating systems and was able to answer my question. Solaris 64-bit Parallels support is being actively worked on, so hopefully I’ll see release notes for an update to the current version “soon” (my words, not theirs) mentioning added support for Solaris 64-bit guests.

[2009-12-30 Edit]

Parallels Desktop v5 does seem much better at supporting 64-bit Solaris. There are a few tricks to getting networking going, but nothing terrible.

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha