The challenge

Recently a customer asked me whether it is possible to install and use NetWorker on OpenSolaris. OpenSolaris is an open-source operating system based on the well-known Solaris. It has some distinctive features such as ZFS (which offers on-the-fly compression and on-the-fly deduplication) and COMSTAR (which enables the operating system to export its storage via FC-SAN and iSCSI).

Although NetWorker is not yet certified for OpenSolaris (there is an open RFE for that), it is certified for Solaris. So I installed the most recent version at the time, 7.5.2, with pkgadd on OpenSolaris build 134, and the installation ran as expected.

Nothing happened on the first start. It turned out that nsrexecd requires two SSL libraries that are missing on OpenSolaris:

admin@opensolaris:/# ldd /usr/sbin/nsrexecd
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       NOT FOUND
libcrypto.so.0.9.7 =>    NOT FOUND
libmp.so.2 =>    /lib/64/libmp.so.2
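
When a binary has many dependencies, the unresolved ones are easy to miss in the ldd listing. The following is a minimal Python sketch that filters them out; the helper name missing_libraries is hypothetical, and the sample text is simply the ldd output shown above.

```python
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the names of libraries that ldd reported as not found."""
    missing = []
    for line in ldd_output.splitlines():
        # Each dependency line has the form "<name> => <path or NOT FOUND>".
        name, arrow, target = line.partition("=>")
        if arrow and "not found" in target.lower():
            missing.append(name.strip())
    return missing


# The ldd output from the post, pasted in as sample input.
SAMPLE = """\
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       NOT FOUND
libcrypto.so.0.9.7 =>    NOT FOUND
libmp.so.2 =>    /lib/64/libmp.so.2
"""

print(missing_libraries(SAMPLE))
```

For the sample above this reports libssl.so.0.9.7 and libcrypto.so.0.9.7, the two libraries discussed next.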

Checking the files showed that the libraries themselves are present but the version number does not match: nsrexecd requires 0.9.7, while OpenSolaris ships the newer 0.9.8. So I tried to link the files accordingly. Checking again yielded:

admin@opensolaris:/# ldd /usr/sbin/nsrexecd
libcommonssl.so =>       /usr/lib/nsr/amd64/libcommonssl.so
libc.so.1 =>     /lib/64/libc.so.1
libssl.so.0.9.7 =>       /lib/64/libssl.so.0.9.7
libcrypto.so.0.9.7 =>    /lib/64/libcrypto.so.0.9.7
libmp.so.2 =>    /lib/64/libmp.so.2

So from the library dependency point of view everything looked good and nsrexecd was able to start as well.
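
The exact link commands are not shown in this post, so the following Python sketch is only an assumption of what "linking the files accordingly" could look like: it gives the 0.9.8 libraries an additional 0.9.7 name in the same directory. The file names in LINKS and the helper link_compat_names are hypothetical, and the demo runs against a temporary directory rather than /lib/64.

```python
import os
import tempfile

# Assumption: the 0.9.8 libraries sit next to where the 0.9.7 names are
# expected. The post does not state the actual source file names.
LINKS = {
    "libssl.so.0.9.7": "libssl.so.0.9.8",
    "libcrypto.so.0.9.7": "libcrypto.so.0.9.8",
}


def link_compat_names(lib_dir, links=LINKS):
    """Create alias -> target symlinks inside lib_dir; return the new paths."""
    created = []
    for alias, target in links.items():
        target_path = os.path.join(lib_dir, target)
        alias_path = os.path.join(lib_dir, alias)
        if not os.path.exists(target_path):
            raise FileNotFoundError(target_path)
        if os.path.lexists(alias_path):
            continue  # never overwrite an existing file or link
        os.symlink(target, alias_path)  # relative link within lib_dir
        created.append(alias_path)
    return created


# Safe demonstration in a scratch directory with empty placeholder files.
with tempfile.TemporaryDirectory() as scratch:
    for name in LINKS.values():
        open(os.path.join(scratch, name), "w").close()
    for path in link_compat_names(scratch):
        print(os.path.basename(path), "->", os.readlink(path))
```

On a real system the call would be something like link_compat_names("/lib/64"), run as root. Note that aliasing one library version under another version's name only satisfies the runtime linker; it does not guarantee that the two versions are binary compatible.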

The next step involved an attempt to start a local save job:

admin@opensolaris:/# save /etc
61261:save: Failed initialize ports from nsrexecd on "opensolaris"
39078:save: RAP error: Service not available.
4196:save: Failed to get port range from local nsrexecd: Service not available.
3817:save: Using networker-server as server

/etc
/etc/hosts
[...]

A few error messages, but that was expected for the first save.

In a second step I tried to start a job from the NetWorker server itself. This job failed entirely. Judging by the logs, nsrexecd was not running on the client. So I restarted nsrexecd on the client and initiated the save job from the server a second time. Nothing changed: the server complained about being unable to connect to the client.

On the client, nsrexecd was no longer running. That was even stranger because I had just restarted the process before starting the backup.

In subsequent tests I noticed that nsrexecd dies every time I invoke a save job, even a local one.

So I ran some tests with debugging turned on:

admin@opensolaris:/# nsrexecd -D9
lg_stat(): Calling stat64().
[....]
[....]
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 68 Attempting to register 390113 (vers 1) service with portmapper (111)
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 60 Successfully registered service 390113 with portmapper (111)
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 23 mondaemon_check count 1
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 16 checking file ..
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 17 checking file ...
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 18 checking file sec.
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 26 checking file nsrladb.lck.
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 30 checking file product.res.lck.
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 28 @(#) Product:      NetWorker
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 34 @(#) Release:      7.5.2.Build.452
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 22 @(#) Build number: 452
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 47 @(#) Build date:   Thu Feb  4 22:35:03 PST 2010
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 29 @(#) Build arch.:  sol10amd64
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 53 @(#) Build info:   DBG=0,OPT=-O2 -fno-strict-aliasing
0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 35 clu_is_cluster_host_lc(): ENTRY ...
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119900 2 0 0 5 654 0 opensolaris nsrexecd 2 %s 1 0 30 lg_lstat(): Calling lstat64().

When starting a save job (either locally or remotely) nsrexecd dies:

0 1270119985 2 0 0 2 654 0 opensolaris nsrexecd 2 %s 1 0 33 Found 390113 program on port 7937
0 1270119985 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 27 mondaemon_kill_check: entry
0 1270119985 2 0 0 2 654 0 opensolaris nsrexecd 2 %s 1 0 33 Found 390436 program on port 9327
0 1270119985 2 0 0 3 654 0 opensolaris nsrexecd 2 %s 1 0 84 RPC Authentication: RPCSEC_GSS negotiated GSS Legato as the authentication mechanism
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 53 auth_thread_inc_count(): 1 child threads are running.
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 21 clu_is_virthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 input hostname=opensolaris
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 43 clu_is_virthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 input hostname=opensolaris
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 24 input hostname=127.0.0.1
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 109 Failed to get user rights: Could not find authentication information for daemon number: 0, daemon instance: 0
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 26 clu_is_localvirthost:ENTRY
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 28 input hostname=192.168.180.2
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 48 clu_is_localvirthost():EXIT unknown cluster type
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 28 lg_open(): Calling open64().
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 74 Adding ssnchnl:     session id = 2  ssn (pointer) = f62570  ops = 57e1a0    fd = 13
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 30 lg_lstat(): Calling lstat64().
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 69 RPC Authentication: admin/opensolaris@ authenticated using GSS Legato
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 70 RPC Authentication: Non-encrypted channel negotiated for ip: 127.0.0.1
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 39 Channel exited with status: (unknown) 0
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 39 Removing ssnchnl:   ssn = f62570    fd  = 13
0 1270119985 2 0 0 6 654 0 opensolaris nsrexecd 2 %s 1 0 53 auth_thread_dec_count(): 0 child threads are running.
Segmentation Fault (core dumped)
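
The debug lines above are easier to read once the fixed prefix is split off. The Python sketch below does that under assumptions taken only from the output itself, not from any NetWorker documentation: the second field looks like a Unix timestamp, the seventh matches the nsrexecd process ID (654), and the number directly before the message matches the message length. The names LogRecord and parse_line are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: datetime   # field 2, interpreted as seconds since the epoch
    pid: int              # field 7, matches the PID of nsrexecd in the post
    host: str             # field 9
    program: str          # field 10
    declared_length: int  # field 15, appears to be the message length
    message: str          # everything after the 15 prefix fields


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one -D9 debug line; return None for lines in another format."""
    # Split off only the 15 prefix fields so spaces in the message survive.
    parts = line.rstrip("\n").split(" ", 15)
    if len(parts) < 16:
        return None
    if not (parts[1].isdigit() and parts[6].isdigit() and parts[14].isdigit()):
        return None
    return LogRecord(
        timestamp=datetime.fromtimestamp(int(parts[1]), tz=timezone.utc),
        pid=int(parts[6]),
        host=parts[8],
        program=parts[9],
        declared_length=int(parts[14]),
        message=parts[15],
    )


# Two lines copied from the output above, plus the final crash line.
SAMPLE_LINES = [
    "0 1270119900 2 0 0 1 654 0 opensolaris nsrexecd 2 %s 1 0 68 "
    "Attempting to register 390113 (vers 1) service with portmapper (111)",
    "0 1270119985 2 0 0 2 654 0 opensolaris nsrexecd 2 %s 1 0 33 "
    "Found 390113 program on port 7937",
    "Segmentation Fault (core dumped)",
]

for raw in SAMPLE_LINES:
    record = parse_line(raw)
    if record is None:
        print("unparsed:", raw)
    else:
        print(record.timestamp.isoformat(), record.pid, record.message)
```

Read this way, the startup lines are stamped 2010-04-01 11:05:00 UTC and the lines leading up to the crash 11:06:25 UTC, 85 seconds later.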

Further tests showed that locally initiated backups run more or less successfully (with some error messages) and are indeed recoverable. Remotely initiated backups do not work because nsrexecd crashes.

Analyzing the core dump that nsrexecd left behind yields:

admin@opensolaris:/nsr/cores/nsrexecd# pstack core
core 'core' of 654:     nsrexecd -D9
-----------------  lwp# 1 / thread# 1  --------------------
fffffd7fff21f783 t_delete () + 33
fffffd7fff21f42e realfree () + 5e
fffffd7fff21fbe2 cleanfree () + 52
fffffd7fff21ee61 _malloc_unlocked () + a1
fffffd7fff21ed86 malloc () + 2e
fffffd7fff2063be calloc () + 46
fffffd7ffe2c5291 netconfig_dup () + 21
fffffd7ffe2c4139 getnetconfigent () + d1
fffffd7ffe2de3e7 __rpc_getconfip () + 28f
fffffd7ffe2b85e1 getipnodebyname () + 29
fffffd7ffe338c88 get_addr () + 138
fffffd7ffe338883 _getaddrinfo () + 493
fffffd7ffe338b24 getaddrinfo () + c
000000000051258b lg_inet_pton () + 6b
0000000000475bb3 is_addr_match () + 33
0000000000475caf ???????? ()
00000000004f7c89 _authenticate_varp () + 1c9
00000000004f533d svc_dispatch_varp () + bd
00000000004f54b1 svc_getreq_poll_varp () + c1
000000000046b339 nsrexec_svc () + 449
000000000046f471 main () + 10a1
000000000045aafc _start () + 6c
-----------------  lwp# 2 / thread# 2  --------------------
fffffd7fff28dbba __pollsys () + a
fffffd7fff22bcca poll () + 62
000000000051d1ce lg_poll () + e
0000000000469a3c ???????? ()
000000000051e5a3 ???????? ()
fffffd7fff284ae4 _thrp_setup () + bc
fffffd7fff284da0 _lwp_start ()

Using mdb:

admin@opensolaris:/nsr/cores/nsrexecd# mdb /usr/sbin/nsrexecd core
Loading modules: [ libc.so.1 ld.so.1 ]
> $C
fffffd7fffde0810 libc.so.1`t_delete+0x33()
fffffd7fffde0840 libc.so.1`realfree+0x5e()
fffffd7fffde0880 libc.so.1`cleanfree+0x52()
fffffd7fffde08b0 libc.so.1`_malloc_unlocked+0xa1()
fffffd7fffde08d0 libc.so.1`malloc+0x2e()
fffffd7fffde08f0 libc.so.1`calloc+0x46()
fffffd7fffde0920 libnsl.so.1`netconfig_dup+0x21()
fffffd7fffde0950 libnsl.so.1`getnetconfigent+0xd1()
fffffd7fffde0990 libnsl.so.1`__rpc_getconfip+0x28f()
fffffd7fffde0a20 libnsl.so.1`getipnodebyname+0x29()
fffffd7fffde0b90 libsocket.so.1`get_addr+0x138()
fffffd7fffde0c40 libsocket.so.1`_getaddrinfo+0x493()
fffffd7fffde0c50 libsocket.so.1`getaddrinfo+0xc()
fffffd7fffde0cc0 lg_inet_pton+0x6b()
fffffd7fffde0e30 is_addr_match+0x33()
fffffd7fffde0e60 0x475caf()
fffffd7fffde0ea0 _authenticate_varp+0x1c9()
fffffd7fffde8f20 svc_dispatch_varp+0xbd()
fffffd7fffdf8fa0 svc_getreq_poll_varp+0xc1()
fffffd7fffdfcad0 nsrexec_svc+0x449()
fffffd7fffdffcd0 main+0x10a1()
fffffd7fffdffce0 _start+0x6c()
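
To see at a glance which libraries the crashing call path runs through, the mdb frames can be grouped by module. The Python sketch below is only a reading aid; the helper names frames and module_chain are hypothetical, frames without a module prefix are assumed to belong to the nsrexecd binary itself, and the sample is a subset of the frames listed above.

```python
import re
from itertools import groupby

# Matches "<frame address> [<module>`]<symbol>+<offset>()" as printed by $C.
FRAME = re.compile(r"^[0-9a-f]+\s+(?:(?P<module>[^`\s]+)`)?(?P<symbol>[^+(\s]+)")


def frames(mdb_output, executable="nsrexecd"):
    """Return (module, symbol) pairs, innermost frame first."""
    result = []
    for line in mdb_output.splitlines():
        match = FRAME.match(line.strip())
        if match:
            # Assumption: no module prefix means the symbol is in the binary.
            result.append((match.group("module") or executable,
                           match.group("symbol")))
    return result


def module_chain(frame_list):
    """Collapse consecutive frames from the same module into one entry."""
    return [module for module, _ in groupby(frame_list, key=lambda f: f[0])]


# A subset of the frames from the mdb output above.
SAMPLE = """\
fffffd7fffde0810 libc.so.1`t_delete+0x33()
fffffd7fffde08f0 libc.so.1`calloc+0x46()
fffffd7fffde0920 libnsl.so.1`netconfig_dup+0x21()
fffffd7fffde0a20 libnsl.so.1`getipnodebyname+0x29()
fffffd7fffde0c50 libsocket.so.1`getaddrinfo+0xc()
fffffd7fffde0cc0 lg_inet_pton+0x6b()
fffffd7fffdffcd0 main+0x10a1()
"""

print(" <- ".join(module_chain(frames(SAMPLE))))
```

For these frames the chain reads libc.so.1, libnsl.so.1, libsocket.so.1, nsrexecd: the fault occurs inside the libc allocator, reached through a name lookup that nsrexecd started via getaddrinfo.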

So from my first observations, the crash has something to do with memory allocation/reallocation and with network functions (based on “netconfig_dup”). Due to my limited knowledge of libc and its internal functions, I was unable to dig deeper.

Unsatisfied with the current state (locally initiated backups and recoveries work, remotely initiated ones do not), I tried several things:

  • Networker client 7.6
  • Networker client 7.6.1
  • Disabling IPv6
  • Using dependent libraries from Solaris 10 x86
  • and so on

But without success. nsrexecd kept crashing.

By mistake I installed 7.4.5, and to my surprise it worked fine; even remote save jobs run perfectly smoothly.

I have not yet checked whether the newer SSL libraries are causing the problem. Judging from the stack trace, I would tend to say so.

Conclusion

Although officially unsupported by EMC, NetWorker client 7.4.5 works fine on OpenSolaris. Even ZFS as a file system is supported (it has been since 7.3.2).

Using version 7.5.x or 7.6.x causes nsrexecd to crash, which makes remotely initiated saves impossible, while locally initiated jobs run fine.

So if you need to back up your OpenSolaris-based system, the author recommends using NetWorker client 7.4.5 rather than 7.5.x or 7.6.x.

About the author

Ronny Egner is working as a freelancer focused on Oracle databases, UNIX operating systems and EMC / Legato Networker. He is based in Germany (Europe) and is available for projects all over the world. His blog can be found at http://blog.ronnyegner-consulting.de.

 

In the last few days, cumulative patch clusters have been released for the following versions of NetWorker:

  • 7.6 – Patch cluster 7.6.0.3 released.
  • 7.5.2 – Patch cluster 7.5.2.1 released.
  • 7.4.5 – Patch cluster 7.4.5.6 released.

As per usual, these haven’t been released to PowerLink, but can be requested via your authorised support partner. Remember that cumulative patch clusters don’t contain any new features; they’re just accumulated key bug fixes. If you’re having any issues with your current 7.6.0.x, 7.5.2 or 7.4.5.x install, you may want to talk to your support partner about the fixes included in those cumulative patch clusters.

[Edit - 2010-03-26] Apologies, I meant to say that cumulative patch cluster 7.4.5.6 had been released for the 7.4.5 tree, not 7.4.5.5.

 

Over the last 36 hours or so I’ve been doing a lot of tests of NetWorker 7.4.5, and overall I must say I’m reasonably impressed with its stability.

Let’s be completely up front: this is a bug-fix only release. If you check the release notes, you’ll see that there are no new features in this release at all. This from the outset usually means a fairly stable release, as there’s no “new code” (so to speak) competing with existing code and patches.

So far my testing has been limited to Linux and Solaris, with Windows testing to start tomorrow. (I frequently pick on Linux for heavy NetWorker testing because I’ve found in the past that if a *nix platform is going to have issues, Linux will be the first one to do so.)

According to the release notes, there are 89 resolved issues in NetWorker 7.4.5; while some of them are of course somewhat trivial (e.g., one of the fixed issues is to do with NetWorker vs EBS branding in particular scripts), many of them represent significant fixes to issues in NetWorker 7.4. Several of these were previously rolled into cumulative patch clusters; however, the number of fixes in 7.4.5 exceeds the number of patches cited for the cumulative patch clusters by quite a bit, meaning quite a lot of effort has gone into this “service pack”.

My gut feel at this point is that if you’re still on the 7.4.x tree, 7.4.5 may be quite a worthwhile version to update to. As always, no site should update their version of NetWorker without a careful review of the release notes, and administrators should make themselves completely aware of (at bare minimum) the following:

  • Fixed issues.
  • Known limitations.
  • Where their current installers are should a back-out be required.
  • Where copies of any currently installed patches are should a back-out be required.

In short: an update should always be prepared for, both in the action plan and the back-out plan, and always consider the update in light of the needs and issues of your site.

I’ll post another update in a day or two once I’ve had more time to review this release.

 

I received an automated alert from PowerLink overnight telling me that the release notes for NetWorker 7.4 had been updated. Having downloaded the updated release notes this morning, I can see that there’s information in there about an (as yet unreleased) NetWorker 7.4.5. This is not uncommon: the release notes for NetWorker in PowerLink have been known to be updated with details of new versions up to 1-2 weeks before the downloads hit.

Obviously at this point I don’t have the opportunity to test out 7.4.5 given I’m not yet able to see it in downloads. I have to imagine though that it’s somewhat similar to 7.4.4.7, the last cumulative patch build that I received from EMC. There are around 3.5 pages worth of fixed bug notes in the release notes, which seems to mean that it not only includes all the cumulative patch updates, but a swag of other updates as well.

As per usual, it’s reassuring to see a bunch of bugs for which I’ve had patches delivered for my customers appearing in the fixed bug list – EMC is certainly making a lot of improvements in eliminating issues that repeatedly crop up in successive versions!

I’ll do a new posting once I’ve had a chance to download and test out 7.4.5, but certainly on the basis of the fixed bug lists (and the very minimal “known issues and limitations” list), this may be a promising update to the 7.4 tree.

[Edit - later the same day...]

NetWorker 7.4.5 has now appeared in PowerLink for downloads. I’m going to kick off a few key platform downloads today and start testing over the coming 48 hours.

© 2012 The NetWorker Blog