Using NetWorker Client with OpenSolaris

The challenge

Recently a customer asked me whether it is possible to install and use NetWorker on OpenSolaris. OpenSolaris is an open-source operating system based on the well-known Solaris. It has some unique features such as ZFS (which offers on-the-fly compression and on-the-fly deduplication) and COMSTAR (which enables the operating system to export its storage via FC-SAN and iSCSI).

Although NetWorker is not yet certified for OpenSolaris (there is an open RFE for that), it is certified for Solaris. So I installed the most recent version at that time, 7.5.2, with pkgadd on OpenSolaris build 134. The installation ran as expected.

On first start nothing happened. It turned out that nsrexecd requires two SSL libraries that are missing on OpenSolaris:

admin@opensolaris:/# ldd /usr/sbin/nsrexecd
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       NOT FOUND
libcrypto.so.0.9.7 =>    NOT FOUND
libmp.so.2 =>    /lib/64/libmp.so.2
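
To pick the unresolved entries out of a longer ldd listing, a small filter can help. The following is a minimal sketch, not part of the original troubleshooting: the function name missing_libraries is made up for illustration, and it assumes the "name => target" layout shown above. The match is case-insensitive so that differently worded "not found" markers are caught as well.

```python
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the names of libraries that ldd could not resolve."""
    missing = []
    for line in ldd_output.splitlines():
        # Dependency lines look like "name => path-or-status".
        name, sep, target = line.partition("=>")
        if not sep:
            continue  # prompt lines and anything else without "=>"
        if "not found" in target.lower():
            missing.append(name.strip())
    return missing


# Sample input copied from the ldd listing in this post.
sample = """\
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       NOT FOUND
libcrypto.so.0.9.7 =>    NOT FOUND
libmp.so.2 =>    /lib/64/libmp.so.2
"""

print(missing_libraries(sample))
```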

A look at the files showed that the libraries themselves are present but the version number does not match: nsrexecd requires 0.9.7, while OpenSolaris ships the newer 0.9.8. So I linked the 0.9.8 files under the 0.9.7 names. Checking again yielded:

admin@opensolaris:/# ldd /usr/sbin/nsrexecd
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       /lib/64/libssl.so.0.9.7
libcrypto.so.0.9.7 =>    /lib/64/libcrypto.so.0.9.7
libmp.so.2 =>    /lib/64/libmp.so.2

So from the library dependency point of view everything looked good and nsrexecd was able to start as well.

The next step involved an attempt to start a local save job:

admin@opensolaris:/# save /etc
61261:save: Failed initialize ports from nsrexecd on "opensolaris"
39078:save: RAP error: Service not available.
4196:save: Failed to get port range from local nsrexecd: Service not available.
3817:save: Using networker-server as server

/etc
/etc/hosts
[...]

A few error messages, but that was expected for the first save.

In a second step I tried to start a job from the NetWorker server itself. This job failed entirely. According to the logs, nsrexecd was not running on the client. So I restarted nsrexecd on the client and initiated the save job from the server a second time. Nothing changed: the server still complained about being unable to connect to the client.

On the client, nsrexecd was no longer running. That was even stranger because I had just restarted the process before starting the backup.

In subsequent tests I noticed that nsrexecd dies every time I invoke a save job, even a local one.
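
One way to watch the daemon disappear is to probe its TCP port before and after a save attempt. The snippet below is a rough sketch and was not used in the original tests. The function name is_port_open is hypothetical, and the port number 7937 is taken from the debug output further down in this post ("Found 390113 program on port 7937"). A successful connect only shows that something is listening; it does not prove that nsrexecd is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Placeholder address: replace 127.0.0.1 with the client's host name.
# Port 7937 is an assumption based on the debug output in this post.
print(is_port_open("127.0.0.1", 7937))
```

Running the check once right after starting nsrexecd and once after a save job would make the crash visible from the server side as well.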

So I ran some tests with debugging turned on:

admin@opensolaris:/# nsrexecd -D9
lg_stat(): Calling stat64().
[....]
[....]
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 68 Attempting to register 390113 (vers 1) service with portmapper (111)
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 60 Successfully registered service 390113 with portmapper (111)
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 23 mondaemon_check count 1
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 16 checking file ..
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 17 checking file ...
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 18 checking file sec.
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 26 checking file nsrladb.lck.
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 30 checking file product.res.lck.
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 28 @(#) Product:      NetWorker
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 34 @(#) Release:      7.5.2.Build.452
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 22 @(#) Build number: 452
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 47 @(#) Build date:   Thu Feb  4 22:35:03 PST 2010
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 29 @(#) Build arch.:  sol10amd64
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 53 @(#) Build info:   DBG=0,OPT=-O2 -fno-strict-aliasing
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 35 clu_is_cluster_host_lc(): ENTRY ...
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 30 lg_lstat(): Calling lstat64().

When starting a save job (either locally or remotely) nsrexecd dies:

0 1270119985 2 0 0 2 654 0 opensolaris nsrexecd 2 %s 1 0 33 Found 390113 program on port 7937
0 1270119985 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 27 mondaemon_kill_check: entry
0 1270119985 2 0 0 2 654 0 opensolaris nsrexecd 2 %s 1 0 33 Found 390436 program on port 9327
0 1270119985 2 0 0 3 654 0 opensolaris nsrexecd 2 %s 1 0 84 RPC Authentication: RPCSEC_GSS negotiated GSS Legato as the authentication mechanism
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 53 auth_thread_inc_count(): 1 child threads are running.
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 21 clu_is_virthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 input hostname=opensolaris
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 43 clu_is_virthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 input hostname=opensolaris
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 24 input hostname=127.0.0.1
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 109 Failed to get user rights: Could not find authentication information for daemon number: 0, daemon instance: 0
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 28 input hostname=192.168.180.2
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 74 Adding ssnchnl:     session id = 2  ssn (pointer) = f62570  ops = 57e1a0    fd = 13
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 30 lg_lstat(): Calling lstat64().
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 69 RPC Authentication: admin/opensolaris@ authenticated using GSS Legato
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 70 RPC Authentication: Non-encrypted channel negotiated for ip: 127.0.0.1
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 39 Channel exited with status: (unknown) 0
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 39 Removing ssnchnl:   ssn = f62570    fd  = 13
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 53 auth_thread_dec_count(): 0 child threads are running.
Segmentation Fault (core dumped)
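
The raw -D9 lines are hard to read. The sketch below splits one line into a few named fields. The field meanings are guesses inferred from the sample output, not taken from NetWorker documentation: the second field looks like a Unix timestamp, the seventh matches the process ID 654 that pstack reports later, and the number right before the text equals the length of the message. The names DebugLine and parse_debug_line are made up for this example.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class DebugLine(NamedTuple):
    when: datetime   # field 2, read as seconds since the Unix epoch (assumption)
    pid: int         # field 7 (assumption; 654 matches the PID in the pstack output)
    host: str        # field 9
    program: str     # field 10
    message: str     # everything after the declared length
    length_ok: bool  # True if the declared length matches the message


def parse_debug_line(line: str) -> Optional[DebugLine]:
    """Parse one nsrexecd -D9 line; return None for lines in another format."""
    parts = line.rstrip("\n").split(" ", 15)
    if len(parts) < 16:
        return None  # e.g. "Segmentation Fault (core dumped)"
    epoch, pid, length = parts[1], parts[6], parts[14]
    if not (epoch.isdigit() and pid.isdigit() and length.isdigit()):
        return None
    message = parts[15]
    return DebugLine(
        when=datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        pid=int(pid),
        host=parts[8],
        program=parts[9],
        message=message,
        length_ok=(len(message) == int(length)),
    )


# Last line before the crash, copied from the output above.
line = ("0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 53 "
        "auth_thread_dec_count(): 0 child threads are running.")
rec = parse_debug_line(line)
print(rec.when.isoformat(), rec.pid, rec.message)
```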

Further tests showed that locally initiated backups run more or less successfully (with some error messages) and are indeed recoverable. Remotely initiated backups do not work because nsrexecd crashes.

Analyzing the core dump that nsrexecd left behind yields:

admin@opensolaris:/nsr/cores/nsrexecd# pstack core
core 'core' of 654:     nsrexecd -D9
-----------------  lwp# 1 / thread# 1  --------------------
fffffd7fff21f783 t_delete () + 33
fffffd7fff21f42e realfree () + 5e
fffffd7fff21fbe2 cleanfree () + 52
fffffd7fff21ee61 _malloc_unlocked () + a1
fffffd7fff21ed86 malloc () + 2e
fffffd7fff2063be calloc () + 46
fffffd7ffe2c5291 netconfig_dup () + 21
fffffd7ffe2c4139 getnetconfigent () + d1
fffffd7ffe2de3e7 __rpc_getconfip () + 28f
fffffd7ffe2b85e1 getipnodebyname () + 29
fffffd7ffe338c88 get_addr () + 138
fffffd7ffe338883 _getaddrinfo () + 493
fffffd7ffe338b24 getaddrinfo () + c
000000000051258b lg_inet_pton () + 6b
0000000000475bb3 is_addr_match () + 33
0000000000475caf ???????? ()
00000000004f7c89 _authenticate_varp () + 1c9
00000000004f533d svc_dispatch_varp () + bd
00000000004f54b1 svc_getreq_poll_varp () + c1
000000000046b339 nsrexec_svc () + 449
000000000046f471 main () + 10a1
000000000045aafc _start () + 6c
-----------------  lwp# 2 / thread# 2  --------------------
fffffd7fff28dbba __pollsys () + a
fffffd7fff22bcca poll () + 62
000000000051d1ce lg_poll () + e
0000000000469a3c ???????? ()
000000000051e5a3 ???????? ()
fffffd7fff284ae4 _thrp_setup () + bc
fffffd7fff284da0 _lwp_start ()

Using mdb:

admin@opensolaris:/nsr/cores/nsrexecd# mdb /usr/sbin/nsrexecd core
Loading modules: [ libc.so.1 ld.so.1 ]
> $C
fffffd7fffde0810 libc.so.1`t_delete+0x33()
fffffd7fffde0840 libc.so.1`realfree+0x5e()
fffffd7fffde0880 libc.so.1`cleanfree+0x52()
fffffd7fffde08b0 libc.so.1`_malloc_unlocked+0xa1()
fffffd7fffde08d0 libc.so.1`malloc+0x2e()
fffffd7fffde08f0 libc.so.1`calloc+0x46()
fffffd7fffde0920 libnsl.so.1`netconfig_dup+0x21()
fffffd7fffde0950 libnsl.so.1`getnetconfigent+0xd1()
fffffd7fffde0990 libnsl.so.1`__rpc_getconfip+0x28f()
fffffd7fffde0a20 libnsl.so.1`getipnodebyname+0x29()
fffffd7fffde0b90 libsocket.so.1`get_addr+0x138()
fffffd7fffde0c40 libsocket.so.1`_getaddrinfo+0x493()
fffffd7fffde0c50 libsocket.so.1`getaddrinfo+0xc()
fffffd7fffde0cc0 lg_inet_pton+0x6b()
fffffd7fffde0e30 is_addr_match+0x33()
fffffd7fffde0e60 0x475caf()
fffffd7fffde0ea0 _authenticate_varp+0x1c9()
fffffd7fffde8f20 svc_dispatch_varp+0xbd()
fffffd7fffdf8fa0 svc_getreq_poll_varp+0xc1()
fffffd7fffdfcad0 nsrexec_svc+0x449()
fffffd7fffdffcd0 main+0x10a1()
fffffd7fffdffce0 _start+0x6c()

So from my first observations the crash has something to do with memory allocation/reallocation and with network functions (based on “netconfig_dup”). Due to my limited knowledge of libc and its internal functions, I was unable to dig deeper.
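
To see at a glance which libraries the crashing call passes through, the mdb frames can be grouped by module. This is only a sketch under assumptions: frames_by_module and module_path are invented names, and frames printed without a "module`" prefix are assumed to belong to the nsrexecd executable itself. As a general observation, not something verified for this crash, a fault inside the allocator's own bookkeeping (t_delete, realfree) is often a symptom of heap corruption that happened earlier, so the innermost frames need not be where the actual bug lives.

```python
import re
from typing import List, Tuple

# Matches mdb "$C" lines such as:  fffffd7fffde0810 libc.so.1`t_delete+0x33()
FRAME = re.compile(
    r"^[0-9a-f]+\s+(?:(?P<module>[^`\s]+)`)?(?P<symbol>[^+()\s]+)"
    r"(?:\+0x[0-9a-f]+)?\(\)$"
)


def frames_by_module(stack: str, executable: str = "nsrexecd") -> List[Tuple[str, str]]:
    """Return (module, symbol) pairs, innermost frame first, as mdb prints them."""
    frames = []
    for line in stack.splitlines():
        m = FRAME.match(line.strip())
        if m:
            # Assumption: frames without a module prefix are in the executable.
            frames.append((m.group("module") or executable, m.group("symbol")))
    return frames


def module_path(frames: List[Tuple[str, str]]) -> List[str]:
    """Collapse the frames into the sequence of modules, outermost first."""
    path: List[str] = []
    for module, _symbol in reversed(frames):
        if not path or path[-1] != module:
            path.append(module)
    return path


# A subset of the mdb output shown above.
sample = """\
fffffd7fffde0810 libc.so.1`t_delete+0x33()
fffffd7fffde08d0 libc.so.1`malloc+0x2e()
fffffd7fffde0920 libnsl.so.1`netconfig_dup+0x21()
fffffd7fffde0c50 libsocket.so.1`getaddrinfo+0xc()
fffffd7fffde0cc0 lg_inet_pton+0x6b()
fffffd7fffde0e60 0x475caf()
fffffd7fffdffcd0 main+0x10a1()
"""

print(" -> ".join(module_path(frames_by_module(sample))))
```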

Unsatisfied with this state (locally initiated backups and recoveries work, remotely initiated ones do not), I tried several things:

  • Networker client 7.6
  • Networker client 7.6.1
  • Disabling IPv6
  • Using dependent libraries from Solaris 10 x86
  • and so on

None of this helped; nsrexecd kept crashing.

By mistake I installed 7.4.5, and to my surprise it worked fine: even remote save jobs run smoothly.

I have not yet checked whether the newer SSL libraries are causing the problem. Judging from the stack trace, I would tend to say so.

Conclusion

Although officially unsupported by EMC, NetWorker client 7.4.5 works fine on OpenSolaris. Even ZFS as the file system is supported (it has been since 7.3.2).

Versions 7.5.x and 7.6.x cause nsrexecd to crash, which makes remotely initiated saves impossible, while locally initiated jobs run fine.

So if you need to back up your OpenSolaris-based system, the author recommends using NetWorker client 7.4.5 rather than 7.5.x or 7.6.x.

About the author

Ronny Egner works as a freelancer focused on Oracle databases, UNIX operating systems and EMC / Legato NetWorker. He is based in Germany (Europe) and is available for projects all over the world. His blog can be found at http://blog.ronnyegner-consulting.de.

7 thoughts on “Using NetWorker Client with OpenSolaris”

  1. Hi Ronny,

    I would like to know how to stop the client initiated backups in networker.

    Thanks in advance.

    Marcos.

    1. Hi Marcos,

      If you’re wanting to stop client initiated backups from the backup server, it’s a bit of a challenge. It’s far easier to stop client initiated backups by logging onto the client and killing the ‘save’ process on the client.

      Cheers,

      Preston.

    1. In the old days, it was relatively simple. There’d be a controlling process running on the server for each client save. Thus, you could (usually) kill the client save process by killing the server’s controlling process.

      These days though the saves are effectively controlled via nsrjobd, and I’m not sure how to get nsrjobd to do the killing – if it can.

  2. I had the same problem on Solaris 11 Express using NetWorker client 7.6.1.
    I solved it by changing the authentication method.
    With the old method, communication works again.
    To fix it, on the client do:

    · nsradmin -p nsrexec

    . type: NSRLA

    show

    print

    update auth methods: "0.0.0.0/0,oldauth"

    show

    print

    · stop and restart NetWorker.

  3. I had the same problem with Solaris 11 Express, and “nsradmin -p nsrexec” did not work for me since it kept core dumping. Anyway, I did it by hand.

    # cd /var/nsr/res/nsrladb
    # grep auth */*
    03/….:auth methods: "0.0.0.0/0,nsrauth/oldauth";
    Edit the file and remove "nsrauth/" so that the line reads:
    auth methods: "0.0.0.0/0,oldauth";
