Because I’m a Mac user, a lot of people expect me to loathe everything Microsoft. Because I come from a Unix background, just as many people expect me to loathe everything Microsoft. I’ll be honest – I have no great fondness for them, but I’m also practical enough to recognise that the chances of them falling over in a heap at any point are microscopically small. I also wouldn’t wish it on them. Instead, what I hope is that they eventually learn cooperative engagement in the marketplace.

To me, Microsoft has spent the last few years suffering the sort of decline that comes not from a larger competitor taking it on, but from what I describe as “death by a thousand mosquito bites”. No one single failing of theirs, and no one single competitor of theirs, is causing them catastrophic harm; however, the combination of their failings and their competitors is actually starting to dig in.

Over at Daring Fireball, John Gruber has one of his typically insightful commentaries on Microsoft’s decline. Yes, it’s told from an Apple perspective, but Gruber is one of the leading Apple bloggers these days, so that’s to be expected. If you’re at all interested in a “non-Microsoft” perspective, Gruber’s commentary is worth spending 10 minutes to read.

 

July 31 is System Administrator Appreciation day. Since most backup administrators are either system administrators or come from system administration roles, and since I started out in system administration, I think this is a great day.

Remember: system administrators often work tirelessly in the background, without getting noticed. In fact many system administrators often don’t get noticed until there’s a problem!

So please, take some time on July 31 to say thanks to your system administrators for all the hard work they’ve put in over the last twelve months.

 

Everyone makes mistakes. That’s part of being human. Indeed, I’d suggest that anyone who expects you to never make mistakes in your job may perhaps be either demented or live at right angles to reality.

I believe the best, the most realistic and useful thing we can aim for is to never make the same mistake twice.

Thus, below I present my “worst recovery ever” as an example of a mistake that I certainly don’t intend to ever have happen to me again. It happened in my last job, and since then I’ve changed the way I work when it comes to recoveries.

It was Friday afternoon – about 3pm in fact. There was a training course running, and for once the training network was behaving itself. As our management had (once) bought into the notion of network computing, our training environment was sufficiently convoluted such that Sun-Rays referred to our production backup server+fileserver+SunRay server, then ran RDP to a VMware server.

One of our engineers who had little to do with NetWorker (ironically, he was hired to be my field replacement for NetWorker, but circumstances changed when he was hired) ran one of those notoriously bad RedHat updates that for a while killed glibc and a bunch of other system files if you didn’t happen to be in North America. So after much diagnosing and discussing, it was necessary to do an OS recovery. The only problem was that the OS on his laptop was so hosed that you couldn’t start any more login sessions, so you were stuck with what was there.

I started the recovery, selected files and viewed volumes. However, some of the files we needed were on media that was outside of the library, and because we were backing up to disk, then cloning, then staging, NetWorker wanted the (offsite) clones, not the onsite originals. His laptop didn’t have administration rights on the backup server, so I ssh’d across to the backup server, set the appropriate tapes to have a ‘suspect’ status, then ran up the recovery again. I selected the root filesystem (/) and kicked off the recovery.

About 2-3 minutes later someone came to ask me about why they couldn’t access the fileserver any more. Checking, I couldn’t log in either. It was odd – the training course was still working, but nothing new would work.

And then it hit me. I never logged out of the Sun system before I kicked off the recovery. Not only that, when I kicked off the recovery, I’d run “recover -c linuxClient” on the Sun server.

That’s right, I was recovering a Linux filesystem on top of a Solaris system – including /dev, including all the base binaries. And because it was a recovery designed to overwrite a clobbered operating system, I’d told it to force-overwrite everything it came across.

I was …unhappy… with myself. I aborted the recovery, obviously, but the damage had already been done. No-one else could do anything, and I couldn’t start the recovery of the backup server because it was also the Sun-Ray server, and there were a bunch of paying students wrapping up their training course. So I had to wait for the training course to complete before I could even start the recovery. Practically everyone else got an early mark, since there wasn’t much they could do.

The recovery turned out to be somewhat problematic. Because the Solaris OS was hosed, no new processes could be started on that – the only solution, after much consideration, was an OS reinstall. At some point though the OS disks for that server had been taken out to a customer site because that customer had lost their disks, or something along those lines, so an earlier revision Solaris installer disk was used. However, it turned out that earlier revision disk wasn’t compatible with that hardware, and these were the days when Solaris took forever to install when you didn’t have tonnes of RAM. So by 10pm that night, I’d given up on being able to install the OS that night. Murphy’s law strikes as much with recoveries as it does with anything else in IT, so of course I’d not slept a wink the night before, and I lived one and a half hours away from the office by train at the best of times – at that time of the night, 3 hours at least.

I fell into bed around 2am, but fretting over the recovery, didn’t sleep much either. I’d recently purchased a Sun workstation myself, so I knew I had an install disk that would work, which was why I’d chosen to come home rather than download Solaris from the office. Later that morning saw me heading back into the office to keep going, and of course, my installer of Solaris was now faulty so the OS install hit bad sectors on the disk during the process and I hit my head repeatedly against the column in my office.

Lady luck then struck. Or at least wandered by and waved. There was a detached mirror left over from some disk swapping from about 2-3 months ago.

Finally I managed to boot the system from the previously attached mirror, get that mirror re-syncing, and installed NetWorker. Once the mirrors had resynced the recovery started in earnest, and the system finally came back.

By that stage though it was about 5pm on Saturday afternoon. I was wrecked from having not slept for a couple of days, not to mention still bloody angry with myself for the entire chain of events.

So I vowed I’d never make the same mistake again.

How, I hear you ask, will I prevent myself from ever making this mistake again? I check, check, check. In real recovery situations, I never run a recovery command any longer without first checking what host I’m logged into. It takes me an extra 10 seconds, maybe even less, but it guarantees that I don’t get that sick feeling in the pit of my stomach that comes with FUBAR’ing one system when trying to recover another.
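For anyone who would rather script that check than rely on memory, here is a minimal sketch in Python. It is only an illustration of the habit, not anything NetWorker provides: the function names are made up for this example, and "sunserver" is a hypothetical stand-in for the Solaris box in the story.

```python
import socket
from typing import Optional


def short_name(hostname: str) -> str:
    """Return the lower-cased host part of a possibly fully qualified name."""
    return hostname.strip().lower().split(".")[0]


def on_expected_host(expected: str, actual: Optional[str] = None) -> bool:
    """True only if the machine we are logged into is the one we mean to recover onto."""
    if actual is None:
        actual = socket.gethostname()
    return short_name(actual) == short_name(expected)


def guard_recovery(expected: str, actual: Optional[str] = None) -> None:
    """Raise before any recovery command is launched on the wrong machine."""
    if actual is None:
        actual = socket.gethostname()
    if not on_expected_host(expected, actual):
        raise RuntimeError(
            f"Refusing to recover: logged into {actual!r}, expected {expected!r}"
        )
```

Calling guard_recovery("linuxClient") as the first step of a recovery wrapper would have stopped the recovery cold on the Sun server, because the local hostname would not have matched.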

If I had done that simple check all those years ago, I would have had a lovely, quiet weekend.

 

Most days my blog stats show at least one search coming into the blog along the lines of “how fast is NetWorker”, etc. It’s understandable. A lot of people selling products other than NetWorker try to push old FUD that it’s not fast enough. Equally, a lot of people who are considering NetWorker are understandably curious as to whether it will be fast enough to suit their needs.

I thought I should write a (brief) piece on this.

To cut to the chase, NetWorker is as fast as your hardware will allow. Yes, there are obviously some software limitations, but that’s true of any backup product.

Looking at the facts though, we can refer back as far as 2003, where NetWorker broke the (let’s call it) “land speed record” for backup by achieving backup performance of 10TB per hour. Most companies now would still be happy with 10TB an hour, but obviously that performance metric was bound by the devices and infrastructure available at the time. These days, it would obviously come out much faster.

I’m currently struggling to find the original Legato piece about this performance record, but my recollection is that it was:

  • Averaging 10TB/h
  • Achieving 2.86GB/s (that’s gigabytes per second, not gigabits per second)
  • Using real customer data
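As a sanity check on those recollected figures, the conversion between TB/h and GB/s is simple arithmetic. The sketch below is just that arithmetic; whether the original announcement used decimal (1000 GB per TB) or binary (1024 GB per TB) units is an assumption either way, so both are shown.

```python
SECONDS_PER_HOUR = 3600


def tb_per_hour_to_gb_per_second(tb_per_hour: float, gb_per_tb: int = 1000) -> float:
    """Convert a throughput in TB/h to GB/s."""
    return tb_per_hour * gb_per_tb / SECONDS_PER_HOUR


def gb_per_second_to_tb_per_hour(gb_per_second: float, gb_per_tb: int = 1000) -> float:
    """Convert a throughput in GB/s to TB/h."""
    return gb_per_second * SECONDS_PER_HOUR / gb_per_tb


# 10TB/h expressed in GB/s, in decimal and then binary units
print(round(tb_per_hour_to_gb_per_second(10), 2))
print(round(tb_per_hour_to_gb_per_second(10, 1024), 2))

# 2.86GB/s expressed in TB/h, in decimal and then binary units
print(round(gb_per_second_to_tb_per_hour(2.86), 2))
print(round(gb_per_second_to_tb_per_hour(2.86, 1024), 2))
```

Either way, 10TB/h works out to roughly 2.8GB/s, and 2.86GB/s corresponds to a little over 10TB/h, so the two recollected numbers are consistent with each other.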

I did find the (very brief) SGI announcement about the speed achieved here. I also found a Sun/Legato presentation here (search for “10TB/h”), and a “press clipping” here.

The net result? Well, I’m not claiming every environment will get that sort of speed, but what I will reasonably confidently assert is that NetWorker will scale to meet your needs, so long as you have budget.

Backup performance isn’t really a p–ssing competition that you want to get into – in reality, if you want to worry about “speeds and feeds”, look at restore performance. NetWorker does admirably there – that 10TB/h filesystem backup restored at 4.5TB/h, and a block level backup run at 7.2TB/h restored at 7.9TB/h.

So the next time someone tries to tell you that “NetWorker isn’t fast enough to be enterprise”, remember one thing: they’re wrong.

 

Over at undrln, there’s currently a link to a comparison of the evolution of the Pepsi and Coca Cola logos. This, to me, is a fascinating insight into one key marketing fact – if it’s not broken, don’t fix it.

(NB: This post is not an endorsement of either product.)

 

Generally speaking I don’t have a lot of time for NetBackup, primarily due to the lack of dependency checking. That’s right, a backup product that doesn’t ensure that fulls are kept for as long as necessary to guarantee recoverability of dependent incrementals isn’t something I enjoy using.
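To make “dependency checking” concrete, here is a minimal sketch of the rule in Python. It is not how NetWorker (or any other product) implements it internally; the function name and the dates are invented purely to illustrate that a full must not expire before the incrementals that depend on it.

```python
from datetime import date
from typing import Iterable


def effective_full_expiry(full_expiry: date, incremental_expiries: Iterable[date]) -> date:
    """A full backup must outlive every incremental that depends on it."""
    return max([full_expiry, *incremental_expiries])


# Hypothetical example: the full was due to expire on 1 July, but two
# dependent incrementals are kept until 3 July and 6 July.
full = date(2009, 7, 1)
incrementals = [date(2009, 7, 3), date(2009, 7, 6)]
print(effective_full_expiry(full, incrementals))
```

With dependency checking the full stays recoverable until the last dependent incremental expires; without it, the full could be removed on its own expiry date, leaving incrementals that can no longer be used for a complete recovery.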

That being said, there are some nifty ideas within NetBackup that I’d like to see eventually make their way into NetWorker.

One of those nifty ideas is the notion of image checkpointing. To use the NetWorker vernacular, this would be sub-saveset checkpointing. The notion of checkpointing is to allow a saveset to be restarted from a point as close to the failure as possible rather than from the start. E.g., your backup may be 20GB into a 30GB filesystem and a failure occurs. With image checkpointing turned on in NetBackup, the backup won’t need to re-run the entire 20GB previously done, but will pick up from the last point in the backup that a checkpoint was taken.
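The saving can be sketched with a few lines of Python. This is only an illustration of the idea, not NetBackup’s actual mechanism: the size-based checkpoint interval (3GB here) is an assumption made for the example, and real products may well checkpoint on time rather than on data volume.

```python
from typing import Optional


def resume_point(gb_written: float, checkpoint_interval_gb: float) -> float:
    """Offset (in GB) of the last checkpoint taken before the failure."""
    if checkpoint_interval_gb <= 0:
        raise ValueError("checkpoint interval must be positive")
    completed_checkpoints = int(gb_written // checkpoint_interval_gb)
    return completed_checkpoints * checkpoint_interval_gb


def gb_to_resend(total_gb: float, gb_written: float,
                 checkpoint_interval_gb: Optional[float] = None) -> float:
    """Data still to be sent after a failure, with or without checkpointing."""
    if checkpoint_interval_gb is None:
        start = 0.0  # no checkpoints: the whole saveset starts again
    else:
        start = resume_point(gb_written, checkpoint_interval_gb)
    return total_gb - start


# The 30GB filesystem from the example, failing 20GB in
print(gb_to_resend(30, 20))                            # no checkpointing
print(gb_to_resend(30, 20, checkpoint_interval_gb=3))  # hypothetical 3GB checkpoints
```

Without checkpointing all 30GB has to be sent again; with the assumed 3GB interval the backup resumes from the 18GB mark and only 12GB remains.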

I’m not saying this would be easy to implement in NetWorker. Indeed, if I were to be throwing a bunch of ideas into a group of “Trivial”, “Easy”, “Hmmm”, “Hard” and “Insanely Difficult” baskets, I’d hazard a guess that the modifications required for sub-saveset checkpointing would fall at least into the “Hard” basket.

To paraphrase a great politician though, sometimes you need to choose to do things not because they’re easy, but because they’re hard.

So, first – why is sub-saveset checkpointing important? Well, as data sizes increase, and filesystems continue to grow, having to restart the entire saveset because of a failure “somewhere” within the stream is increasingly inefficient. For the most part, we work through these issues, but as filesystems continue to grow in size and complexity, this makes it harder to hit backup windows when failures occur.

Secondly – how might sub-saveset checkpointing be done? Well, NetWorker already is capable of doing this – sort of. It’s in chunking or fragments. Long term NetWorker users will be well aware of this: savesets used to have a maximum size of 2GB, so if you were backing up a 7GB filesystem called “/usr”, you’d get:

/usr
<1>/usr
<2>/usr
<3>/usr

In the above, “/usr” was considered the “parent” of “<1>/usr”, “<1>/usr” was the parent of “<2>/usr”, and so on. (Parent? man mminfo – read about pssid.)
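For anyone who hasn’t seen chunked savesets before, the naming and parent relationships can be reproduced with a short Python sketch. This only mimics the listing above; it is not NetWorker code, and the function names are made up for the illustration.

```python
import math
from typing import Dict, List


def chunk_names(saveset: str, size_gb: float, max_chunk_gb: float = 2) -> List[str]:
    """Names of the chunks a saveset of size_gb would be split into."""
    count = max(1, math.ceil(size_gb / max_chunk_gb))
    return [saveset] + [f"<{i}>{saveset}" for i in range(1, count)]


def chunk_parents(names: List[str]) -> Dict[str, str]:
    """Map each continuation chunk to its parent (the chunk written before it)."""
    return dict(zip(names[1:], names[:-1]))


names = chunk_names("/usr", 7)
for name in names:
    print(name)
print(chunk_parents(names))
```

A 7GB filesystem with a 2GB limit needs four chunks (2 + 2 + 2 + 1), which is why the listing stops at “<3>/usr”.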

Now, I’m not suggesting a whole-hearted return to this model – it’s a pain in the proverbial to parse and calculate saveset sizes, etc., and I’m sure there are other inconveniences to it. However, it does offer an entry to the model we’re looking for – if needing to restart from a checkpoint, a backup could continue via a chunked/fragmented saveset.

The difficulty lies in differentiating between the “broken” part of the parent saveset chunk and the “correct” part of the child saveset chunk, which would likely require extension to at least the media database. However, I think it’s achievable: given that the media database already contains details about segments within savesets (i.e., file/record markers, etc.), in theory it should be possible to include a “bad” flag so that a chunk of data at the end of a saveset chunk can be declared as bad, indicating to NetWorker that it needs to move onto the next child chunk.

It’s fair to say that most people would be happy with needing to go through a media database upgrade (i.e., a change to the structure as part of starting a new version of NetWorker) in order to get sub-saveset checkpointing.

 

NetWorker 7.4 saw the introduction of language-neutral logs – raw logs, if you will, in the form of changing daemon.log to daemon.raw. The primary purpose behind this change was to allow customers using one language pack to send a language-neutral log to a support engineer who could then render it locally in a language they could understand.

Many have expressed a dislike for the raw log format – personally it doesn’t bother me, because for the most part I work in Unix environments, so commands to extract only portions of the log either before or after rendering are only a few keystrokes away at any time, and well, I’m a command line person :-)

There is a facility in NetWorker though to turn on what is referred to as a “realtime rendered log”. One catch: don’t submit the realtime rendered log to your support provider and don’t rely exclusively on it. In some instances, it’s known that messages may be dropped from the realtime rendered log that would be reported if running a manual nsr_render_log against the daemon.raw file.

To set up realtime log rendering, you need to run up nsradmin against the client daemon – either on the NetWorker server or another machine. For example:

[root@nimrod ~]# nsradmin -p 390113 -s nimrod
NetWorker administration program.
Use the "help" command for help, "visual" for full-screen mode.
nsradmin> print type: NSR log
                        type: NSR log;
               administrator: "isroot,host=nox",
                              "user=root,host=localhost",
                              "user=root,host=nimrod";
                       owner: NMC Log File;
             maximum size MB: 2;
            maximum versions: 10;
        runtime rendered log: ;
                        name: gstd.raw;
                    log path: /opt/lgtonmc/logs/gstd.raw;

                        type: NSR log;
               administrator: "isroot,host=nox",
                              "user=root,host=localhost",
                              "user=root,host=nimrod";
                       owner: NetWorker;
             maximum size MB: 2;
            maximum versions: 10;
        runtime rendered log: ;
                        name: daemon.raw;
                    log path: /nsr/logs/daemon.raw;

Here, we have two different log entries – one for NMC and one for NetWorker. If we wanted to just change the NetWorker one, we’d restrict our selection thusly:

nsradmin> print type: NSR log; name: daemon.raw
                        type: NSR log;
               administrator: "isroot,host=nox",
                              "user=root,host=localhost",
                              "user=root,host=nimrod";
                       owner: NetWorker;
             maximum size MB: 2;
            maximum versions: 10;
        runtime rendered log: ;
                        name: daemon.raw;
                    log path: /nsr/logs/daemon.raw;

To update the log setting to generate a rendered log in realtime, all one would have to do is issue the following command:

nsradmin> update runtime rendered log: /nsr/logs/daemon.log
        runtime rendered log: /nsr/logs/daemon.log;
Update? y

To have any change you make here take effect, you need to stop and restart NetWorker.

However, that’s not all you can do here – look at some of those other settings. In particular, “maximum size MB” and “maximum versions”. Personally I think maximum versions is already at a good value – that will keep 10 older copies of the daemon.raw file. (I strongly advocate keeping older copies of log files and ensuring their backups are kept for the lifetime of any backups you may wish to recover from. That way if you have any recovery issues you can, if necessary, recover the server logs from the time the backup was generated.)

The most interesting setting for me is the maximum size MB. The process for NetWorker starting a new daemon.raw file works as follows: IF NetWorker is restarted AND the daemon.raw currently exceeds the maximum size MB THEN rename the existing daemon.raw file AND start a new one.
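That rule is easy to misread as “rotate when the file hits the maximum size”, so here it is as a small Python sketch. It simply restates the IF/AND/THEN above; it is not NetWorker’s code, and the function name is invented for the example.

```python
def should_rotate(restarted: bool, current_size_mb: float, maximum_size_mb: float) -> bool:
    """Rotation needs BOTH conditions: a restart and a log over the size limit."""
    return restarted and current_size_mb > maximum_size_mb


# Over the limit, but NetWorker has not been restarted: the log keeps growing
print(should_rotate(restarted=False, current_size_mb=50, maximum_size_mb=2))

# Restarted while over the limit: daemon.raw is renamed and a new one started
print(should_rotate(restarted=True, current_size_mb=2.5, maximum_size_mb=2))

# Restarted while under the limit: the existing daemon.raw carries on
print(should_rotate(restarted=True, current_size_mb=1.0, maximum_size_mb=2))
```

The practical consequence is that “maximum size MB” is not a hard cap: on a server that is rarely restarted, daemon.raw can grow well past the configured value.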

In the dim dark days of system capacity allocation where asking for a 2GB drive for a backup server was about as well received as asking the board of directors for their cars so you could host your own demolition derby, a maximum log file size of 2MB was seemingly appropriate.

These days I think it’s too low. Now there’s a balancing act here, particularly when you take email and file uploads into consideration. I wouldn’t make it too big, but I would make it bigger – e.g., 10 MB. Reason? Say you have a problem overnight that causes you to restart your NetWorker server and log a case with your support provider. The first thing they’ll ask you to do is send through your daemon.raw; however, in a lot of restart scenarios where issues have occurred, your log file will have grown big enough to require being recreated, so you have to remember to send both. Increasing the default log file size doesn’t guarantee avoiding this, but it does help to ameliorate the possibility.

For convenience then, if you’re going to make the above changes with your log rendering options, I’d also suggest considering increasing the maximum size to at least 5 MB or even 10 MB. (Given file compressibility, this shouldn’t pose a problem.)

To do this, you’d enter the command:

nsradmin> update maximum size MB: 10
             maximum size MB: 10;
Update? y

Again, if making changes here you need to restart the NetWorker services for them to take effect.

NOTE: Windows users will need to type paths for the rendered logs with double-backslashes, and in double-quotes. For instance, assuming a default installation location, the path issued might be:

nsradmin> update runtime rendered log: "C:\\Program Files\\Legato\\nsr\\logs\\daemon.log"
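If you’re generating these commands from a script, the escaping is easy to get wrong by hand. A minimal Python sketch of the quoting rule described in the note follows; the helper name is made up, and it only covers the two things the note mentions (doubling backslashes and wrapping the path in double quotes).

```python
def nsradmin_quote_path(path: str) -> str:
    """Double every backslash and wrap the path in double quotes."""
    return '"' + path.replace("\\", "\\\\") + '"'


# Default installation location used in the example above
print(nsradmin_quote_path(r"C:\Program Files\Legato\nsr\logs\daemon.log"))
```

Paths without backslashes, such as /nsr/logs/daemon.log, pass through unchanged apart from gaining the surrounding quotes.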

Of course, since changes to this affect logging, which in turn affects the ability to diagnose or monitor the backup system, I strongly urge you to confirm following the changes that logging is working.

 

Truth be told, I don’t have any real involvement in EBS (Enterprise Backup Software) these days. If you’re unaware of it, EBS is EMC NetWorker, rebadged. When I did have involvement with it, it was back in the days when it was called Solstice Backup.

One of the things that I liked about Solstice Backup was that it basically came with all new copies of Solaris with what was called a “Single Server” edition. That meant that it would support 1 tape drive, no tape library, and only be used to back up the backup server itself. Yes, single server edition would effectively mean a decentralised backup environment, but the purpose of single server edition wasn’t to get everyone to go down the blitheringly idiotic path of decentralised backups. Instead, it had two purposes, viz.:

(a) to provide a basic but very reliable way of backing up servers, and,

(b) to give companies an introduction to enterprise backup software.

You see, you could jump from single server edition to workgroup edition, or network edition, just by replacing the licenses. Your configuration would remain in place, meaning all you had to do was to start extending that configuration to cater for the expanded functionality of the product. Your existing backups were recoverable. Your existing backups would continue to run. You could just do more.

I can’t say for sure whether EBS still supports single server edition – but it’s not really all that relevant to what I’m about to say, so it doesn’t really matter one way or another.

To this day I think it’s a shame that (Legato first, now) EMC hasn’t come up with an OEM model for NetWorker to allow for the inclusion of single server edition in one or more of the major Enterprise Linux distributions – e.g., RedHat or Novell/SuSE. Obviously such a model would require an appropriate support system – when effectively giving it away for free (i.e., as part of a base system), there would need to be adequate training to allow the OEM/OS partner to adequately do first level support of the product as part of regular support work, but as we’ve seen with Sun and Solstice Backup Single Server Edition, that can be done. It’s a great way of getting the foot in the door, and in my personal experience at least, many companies that actually took the time to configure single server edition ended up upgrading to at least Workgroup, if not Network edition of NetWorker. Note what I said there: companies that actually took the time. I.e., there’s no guarantee that every single company will want to go ahead with configuring it – particularly with the size of current NetWorker documentation*. In other words, there were, and still are, some impediments to easy untrained roll-outs of NetWorker.

Those impediments to having NetWorker more approachable for rapid roll-out with easy instructions in a ‘single server’ environment however are readily quantifiable and easily resolvable. To prove that I’m not talking out of my butt, I’ll do my best to quantify the “top 5” items that would be necessary:

  1. Documentation – Rumblings on the NetWorker mailing list aside, NetWorker documentation has significantly improved over the last year. There’s been a big push to get useful documentation – hence the technical upgrade guides, the “continuous improvement” that’s going into PowerLink articles**, etc. However, quick start guides are still needed.
  2. Server merge functionality – If you’re going to do a single server edition, it’s necessary to support merging multiple NetWorker server media databases, configuration files and indices into a single datazone. That’s to allow for companies that might initially start down the path of a few standalone servers before realising they need to consolidate and have grown-up backups.
  3. Backup to disk + tape – In this day and age, single server edition should support say, 1TB of disk backup in a single device + a tape drive. That allows for basic cloning/staging and support for high speed devices, but doesn’t give away so much functionality that it discourages purchase of a full license. (Indeed, I’m inclined to suggest that it’s high time EMC includes in all the base NetWorker licenses, support for 1TB of disk backup space.)
  4. Manual Backup in NMC – This would take effort, but it’s something that would feed into all versions of NetWorker, so it would be worth the effort, giving NetWorker better selling points. I’m not talking about running a group manually – I’m talking about browsing a client (the wizard in 7.5 supports this, after all), and manually selecting files for backup, as is currently available in the Windows user program and used to be available in nwbackup. It should be available in NMC.
  5. Recovery in NMC – As above, and even more important than the above, we should see the complete ditching of (filesystem) client GUIs – nwrecover and winworkr, and see NMC support recovery as a standard option within that GUI.

Will the above points take time? Yes. Are they worth it? Yes. Will they carry through to other versions (Workgroup/Network/Power)? Well, point 3 is irrelevant to those versions, but all the other points are very relevant to all tiers of NetWorker, so implementing them will certainly help continued adoption of NetWorker – not only that, they’re all highly logical.


* Having a Getting Started With NetWorker guide would probably help in that sort of scenario too. (Yes, I’m getting closer to formalising what I’m going to do on that front.)

** Yes, there are some outdated PowerLink articles regarding NetWorker, but that’s true for any product that’s been around for as long as NetWorker. The point is, there are active and ongoing efforts to improve the documentation in PowerLink. Credit where credit is due.

 

OK, so this has been linked to a … bazillion … times by various storage-type bloggers over the last week, but it’s only been in the last 24 hours that I’ve actually scrolled through the entire list. (Truth be told, the colour combinations don’t work for me … perhaps they should have used more chartreuse to keep my attention.)

The folks over at Mozy have added a blog entry: How Much is a Petabyte? As much as anything, it’s a nice little overview of the historical growth of storage. (Having reached the point where at home my partner and I have about 20TB of space between us and my servers, I can attest to storage growth!) If you’ve got 3 minutes to spare (and you haven’t already read it), you may find it interesting.

 

Many years ago, a company switched from ArcServe to NetWorker. They did so around the time they made their end of year backups, the ones that they intended to keep ‘forever’ for legal requirements.

Fast-forward several years, and a request came in to recover Lotus Notes backups from those original end of year archives. That’s when the support call came through. You see, those end of year archives were done on a standalone tape drive, not a tape library, and both tapes had, say, ‘YEAR2002’ written on the label. There was a little “1” noted on the first label, and a little “2” noted on the second label. For convenience, we’ll call them the first and second tapes.

When they put the first tape into the library for recovery, their first issue was getting NetWorker to mount the tape, since it didn’t have a barcode. Some non-GUI commands later, the tape was in the drive, but NetWorker wouldn’t keep the tape mounted – every time they tried to mount the tape, NetWorker threw up an error saying that it was expecting tape YEAR2002 with a particular volume ID, not YEAR2002 with a different volume ID that wasn’t in the media database. The second YEAR2002 tape would mount though, but NetWorker couldn’t perform a recovery because not all of the media was available.

So, here’s what happened:

  • The manual backup was run of a bunch of systems and Lotus Notes.
  • A tape was labelled YEAR2002 within NetWorker, and the backup ran until the tape filled up.
  • A new tape was put into the tape drive, and since they had no exposure to NetWorker, they labelled that tape as YEAR2002 as well and the backup went on its way.

I’ll qualify here – the Lotus Notes backup was done using the module.

Now here’s the thing – while NetWorker works on the volume ID being unique, it also works on the volume label being unique as well. It won’t support two volumes in the media database at the same time with the same label. It gets pretty strident about that if you try to label one tape with another tape’s label, but I guess if you’re new to NetWorker it might just seem like there’s a bunch of confirmation boxes you have to click before you can label your next tape.

So the net result was that the backup was written to two pieces of media that couldn’t co-exist in the media database at the same time. Scanning the first necessitates removing the second from the media database, and because this isn’t a filesystem backup, there are limitations that couldn’t be stepped around in recovering from partial savesets.
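The bind they were in can be shown with a toy model in Python. To be clear, this is not the NetWorker media database or its API; the class, its methods and the volume ID strings are all invented for illustration. It only models the two uniqueness rules described above.

```python
class ToyMediaDatabase:
    """Toy model only: enforces unique volume labels and unique volume IDs."""

    def __init__(self) -> None:
        self._label_to_volid = {}

    def add_volume(self, label: str, volume_id: str) -> None:
        if label in self._label_to_volid:
            raise ValueError(f"label {label!r} is already in the media database")
        if volume_id in self._label_to_volid.values():
            raise ValueError(f"volume ID {volume_id!r} is already in the media database")
        self._label_to_volid[label] = volume_id

    def remove_volume(self, label: str) -> None:
        del self._label_to_volid[label]

    def volume_id(self, label: str) -> str:
        return self._label_to_volid[label]


db = ToyMediaDatabase()
db.add_volume("YEAR2002", "hypothetical-volid-second-tape")

try:
    # Scanning in the first tape fails while the second is still known
    db.add_volume("YEAR2002", "hypothetical-volid-first-tape")
except ValueError as err:
    print(err)

# The only way to get the first tape in is to drop the second one
db.remove_volume("YEAR2002")
db.add_volume("YEAR2002", "hypothetical-volid-first-tape")
print(db.volume_id("YEAR2002"))
```

Whichever tape is in the database, the other one is out, so a saveset that spans both can never have all of its media available at once.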

For a regular filesystem backup, as a last resort, this still would not be impossible to recover from – using scanner and uasm you can still suck the data off the tape(s) without NetWorker needing both in the media database. Tedious, and not as good as just being able to select data in a recovery program, but it’s better than no recovery at all. But you can’t use scanner and uasm for a non-filesystem recovery.

(You also can’t write a new tape label to a fresh tape, then dd the NetWorker data after the label on the other tape onto the newly labelled tape. The volume ID (or some other unique volume identification system) is written into the savestream, and transferring that savestream onto another volume sees NetWorker reject it if you subsequently attempt to scan it.)

Net result? Data that could not be recovered short of sending it off to a specialist forensics data recovery company.

NetWorker’s fault? No. There is after all, only so much that software can do in order to prevent you from shooting yourself in the foot.

© 2012 The NetWorker Blog