NetWorker on Linux – Ditching ext3 for XFS

Recently, when I made an exasperated posting about lengthy ext3 check times and looking forward to btrfs, Siobhán Ellis pointed out that there was already a filesystem available for Linux that met a lot of my needs – particularly in the backup space, where I’m after:

  • Being able to create large filesystems that don’t take exorbitantly long to check
  • Being able to avoid checks on abrupt system resets
  • Speeding up the removal of files when staging completes or large backups abort

That filesystem of course is XFS.

I’ve recently spent some time shuffling data around and presenting XFS filesystems to my Linux lab servers in place of ext3, and I’ll fully admit that I’m horribly embarrassed I hadn’t thought to try this out earlier. If anything, I’m stuck looking for the right superlative to describe the changes.

Case in point – I was (and indeed still am) doing some testing where I need to generate >2.5TB of backup data from a Windows 32-bit client for a single saveset. As you can imagine, this not only takes a while to generate, but also takes a while to clear from disk. I’d got about 400GB into the saveset on my first test run when I realised I’d made a mistake with the setup, so I needed to stop and start again. On an ext3 filesystem, it took more than 10 minutes after cancelling the backup before the saveset had been fully deleted. It may have taken longer – I gave up waiting at that point, went to another terminal to do something else, and lost track of how long it actually took.

It was around that point that I recalled having XFS recommended to me for testing purposes, so I downloaded the extra packages required to use XFS within CentOS and reformatted the ~3TB filesystem as XFS.
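
If you want to try the same conversion, it’s a straightforward job. Here’s a rough sketch assuming a CentOS 5-era system – the package names varied between releases, and the device and mount point names are examples only, not my actual lab layout:

    # Install the XFS kernel module and userspace tools (later CentOS 5
    # kernels include the module, leaving only xfsprogs to install):
    yum install kmod-xfs xfsprogs

    # Reformat the old ext3 device -- this destroys all data on it:
    umount /backup
    mkfs.xfs -f /dev/sdb1
    mount -t xfs /dev/sdb1 /backup

    # Then update /etc/fstab so it mounts as xfs at boot, e.g.:
    # /dev/sdb1   /backup   xfs   defaults   0 0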

The next test I ran aborted 1.8TB into the backup due to a comms error (!!!). Guess how long it took to clear the space? No, seriously, guess – because I couldn’t log onto the test server fast enough to actually see the space clearing. The backup aborted, and the space was suddenly back again. That’s a 1.8TB file deleted in seconds.

That’s the way a filesystem should work.
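
If you want to see the difference for yourself, the test is easy to reproduce. A rough sketch – the path and file size here are illustrative, not my exact lab setup:

    # Generate a large file on the filesystem under test (this will
    # take a while to write out):
    dd if=/dev/zero of=/backup/test.saveset bs=1M count=409600    # ~400GB

    # Then time the unlink. On ext3 this can run for many minutes,
    # since every indirect block and block bitmap has to be updated;
    # on XFS the same command returns in moments:
    time rm /backup/test.saveset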

I’ve since done some nasty mid-operation power-cycle tests (in VMs), and the XFS filesystems come back up practically instantaneously – no extended check sessions that make you want to cry in frustration.
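
For the curious, that speed is because XFS simply replays its journal at mount time – there’s no fsck pass at all. The fsck.xfs shipped with xfsprogs is deliberately a no-op, and if you ever do suspect genuine corruption, the offline tool to reach for is xfs_repair. (The device name below is illustrative.)

    # Boot-time "check" -- does nothing and returns immediately:
    fsck.xfs /dev/sdb1

    # Genuine corruption check and repair (filesystem must be unmounted):
    xfs_repair -n /dev/sdb1    # -n = no-modify, just report problems
    xfs_repair /dev/sdb1       # actual repair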

If you’re backing up to disk on Linux, you’d be mad to use anything other than XFS as your filesystem. Quite frankly, I’m kicking myself that I didn’t do this years ago.

8 thoughts on “NetWorker on Linux – Ditching ext3 for XFS”

  1. I went through a similar discovery a bit earlier (back in 2004), when I was upgrading NetWorker from 6.x to 7.0 and moving from Tru64 Unix to Linux. At that time the main goal was to improve the backup (and restore) performance of a large filesystem on Tru64 with 30+ million files. The solution: multiple concurrent savesets + the new diskbackup feature of NetWorker 7.0 (adv_file), which automatically demultiplexed savesets + fast and reliable disk storage and filesystem on Linux. The starting point was to choose the right filesystem first. Out of four candidates (ext3, JFS, ReiserFS and XFS) the winner was XFS, and it still is today. Out of the two commercially supported Linux distributions (RedHat and SLES) the choice was SLES, and it still is today, because it has XFS integrated. RedHat still doesn’t include XFS. It is possible to have RedHat, CentOS or any other Linux distro with an XFS add-on, but that goes into the area of an officially unsupported system, which some businesses may not like.
    Using NetWorker’s disk backup (adv_file) on top of XFS for many years proved to be rock solid. The only unpleasant surprise I had with the adv_file type of device was that it did not support concurrent recoveries of multiple savesets from the same device. Legato advertised adv_file as being able to do concurrent writes and reads. Writes, yes; reads, yes – but only from the same saveset.

  2. Hi Janusz,

    Thankfully these days concurrent recoveries are better supported under ADV_FILE devices … so long as you’re not concurrently attempting to stage or clone from the device, you can get away with high numbers of parallel restores.

    It goes to prove the point though that a key requirement to getting better performance and compatibility in a backup-to-disk environment is to ensure that the right filesystem is chosen.

    Cheers,

    Preston.

  3. It depends on what you define as parallel restores. If you mean restores initiated in parallel by separate commands, you are correct. However, if you put them all in a single command, you do get parallelism for restores.

    Glad to be of service 🙂

  4. I agree with you on XFS. I have been using it for a long time myself, mainly because XFS offers “xfsdump”, which can be used for disaster recovery purposes (e.g. dump from single user mode directly to tape; restore is quite easy: boot from a recovery CD and run xfsrestore).
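
    For anyone who hasn’t used it, the basic cycle looks something like the following – the tape device, labels and mount point are illustrative only:

        # Level 0 (full) dump of the root filesystem straight to tape:
        xfsdump -l 0 -L rootfs-full -M tape1 -f /dev/st0 /

        # After booting the recovery CD: recreate and mount the target
        # filesystem, then pull the dump back:
        xfsrestore -f /dev/st0 /mnt/root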

    1. I’m actually disappointed that it isn’t better promoted by the likes of RedHat. Here’s a filesystem that allows the OS to really hit a new mark in IO performance, and yet it languishes.

  5. We’ve been using XFS for quite some time and on quite large filesystems. It took me some time to really tune it for best read/write performance, but it’s excellent now. Using a few parameters you can do clones and backups simultaneously without any major performance hit.
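
    The exact parameters aren’t spelled out above, but as an illustration, commonly tuned XFS knobs of the era were mount options along these lines (the values are examples, not a recommendation):

        mount -t xfs -o noatime,logbufs=8,logbsize=256k,inode64 /dev/sdb1 /backup

    Here noatime avoids a metadata write on every read, logbufs/logbsize give the journal more in-memory buffering for metadata-heavy workloads, and inode64 lets inodes be allocated across the whole of a large filesystem rather than being confined to the start of the device.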

  6. Hi Preston,
    do you see any advantages/disadvantages in using Linux/x64 as a platform for the NetWorker server and storage nodes?
    We use Solaris/Sparc and it’s a really reliable platform. Nevertheless the price/performance compared to x64 is….
    Recently we changed our mission critical systems from Solaris/Sparc to AIX/Power. The only remaining Sparc system is our NetWorker server. Next year we will renew the backup environment and I’m considering Linux/x64 as the new platform.
    What experiences do you have?

    1. The only issue I’ve traditionally had with Linux as a NetWorker server/storage node is that I don’t believe the SCSI layer is suitably stable when a robot head shares the same bus as a tape drive.

      This doesn’t seem to happen when the robot head has its own SCSI bus, or is connected via fibre channel.

      While there are some bugs that have historically only occurred on Linux platforms, I haven’t encountered those sorts of issues directly for a while.

      I run the majority of my lab servers on Linux (both x86 and x64), and frequently hammer them with performance testing and bug testing, so overall it holds up fairly well.
