Client side compression gets a squeeze

One of the enhancements to NetWorker v8 has been the introduction of additional client side compression options.

Compression

If you’ve been using client side compression, you’ll know that a directive for it looks somewhat like the following:

<< . >>
+compressasm: * .* *.*

This has always used a very basic NetWorker compression algorithm; it yields some savings, but it’s never been massively efficient. So, for organisations that have needed to squeeze client data down to the absolute minimum for transmission, it’s not been entirely successful – it gets some savings, but not as much as it could with a better algorithm.

Enter NetWorker v8. The ‘compressasm’ directive gets two new options: gzip and bzip2.

However, the documentation is slightly incorrect for these, so below you’ll find the correct usage:

<< . >>
+compressasm -gzip -X: * .* *.*

And:

<< . >>
+compressasm -bzip2 -X: * .* *.*

Here, ‘X’ is a number between 1 and 9 specifying the compression level, where 1 is minimal and 9 is maximum. If the level is not specified, it defaults to 9.

In particular, the difference is that the NetWorker documentation does not include the dash in front of the compression number. If you follow the documentation and supply the number on its own, the compression level will be ignored and the default of maximum compression (9) will be used. You’ll also get a warning of:

save: Ignored incorrect ASM argument '4' in file '/path/to/.nsr'
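
For example, pinning gzip compression to level 5 (the level used in the testing below) would presumably look like this:

<< . >>
+compressasm -gzip -5: * .* *.*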

If you’re not a Unix user, you may not know too much about gzip and bzip2. Suffice it to say that those algorithms have been around for quite some time now, and generally speaking, you should expect to see better compression ratios achieved from the same sample data on bzip2 compression rather than gzip compression – at a cost in CPU cycles. Obviously though, this depends on your data.
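
If you want a rough feel for how your own data responds before changing any directives, it’s easy enough to test outside NetWorker. Below is a minimal Python sketch (the sample path is hypothetical – point it at a file representative of your backup data) using the standard gzip and bz2 modules to compare level-5 compression of the same data:

import bz2
import gzip

# Hypothetical sample file; substitute something representative of your data.
sample = "/var/tmp/sample.dat"

with open(sample, "rb") as f:
    data = f.read()

original = len(data)
results = {
    "gzip -5": len(gzip.compress(data, compresslevel=5)),
    "bzip2 -5": len(bz2.compress(data, compresslevel=5)),
}

for name, size in results.items():
    print(f"{name:9s} {size:>12,d} bytes ({100.0 * size / original:5.1f}% of original)")

On highly compressible data such as plain text you’d typically expect bzip2 to come out ahead; on data that’s already compressed, neither will achieve much.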

I assembled a block of about 30GB to test client side compression with. In that data, there was around:

  • 15 GB in virtual machine files (containing 2 x Linux installs);
  • 15 GB in text files – plain text emails from a monitoring server, saved to disk as individual files.

All up, there were 633,690 files to be backed up.

For both the gzip and the bzip2 compression testing, I went for level 5 on the scale of 1 to 9: not the absolute best compression, but not as CPU intensive as maximum compression, either. Here are the results I got:

Original Data Size (GB)    % Reduction    Data Stored (GB)
5000                       80%            1000
5000                       82%            900
5000                       84%            800
5000                       86%            700
5000                       88%            600
5000                       90%            500
5000                       92%            400
5000                       94%            300
5000                       96%            200
5000                       98%            100
5000                       99%            50

Clearly the new compression algorithms come at a cost in backup time. Yet, they can also make for increased data reduction. Typically during testing I found that with bzip2 compression, CPU utilisation on a single save would hit around 90% or more on a single core, as opposed to the 10-15% utilisation that seemed to peak with standard ‘compressasm’ options.
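
If you want to gauge that CPU cost on your own data before committing, a rough standalone measurement is straightforward. The sketch below (again with a hypothetical sample path) times level-5 gzip against level-5 bzip2 using CPU time rather than wall-clock time:

import bz2
import gzip
import time

# Hypothetical sample file; substitute data representative of your clients.
sample = "/var/tmp/sample.dat"

with open(sample, "rb") as f:
    data = f.read()

tests = (("gzip -5", lambda d: gzip.compress(d, compresslevel=5)),
         ("bzip2 -5", lambda d: bz2.compress(d, compresslevel=5)))

for name, compress in tests:
    start = time.process_time()        # CPU time consumed by this process
    size = len(compress(data))
    cpu = time.process_time() - start
    print(f"{name:9s} {cpu:6.2f}s CPU -> {size:,d} bytes")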

5 thoughts on “Client side compression gets a squeeze”

  1. Are you seeing a lot of call for compression out there these days? I’m curious about it. Most of the stuff I architect tends to involve a Data Domain as a primary target, so compression is sort of thrown out of the window first off.
    Keep up the great work, btw, I have really enjoyed your posts, and found them very useful.

    1. I don’t see a huge demand for client side compression – and deduplication has certainly driven a continued reduction in its use. However, there’s a kernel of companies out there who remain big fans of it – usually in situations where a large number of clients are connected via firewalls, or sections of the enterprise are interconnected by lower speed links.

      Thanks for the feedback 🙂

  2. Client-side compression reduces your disk-to-disk backup licensing costs (which for us are higher than the actual cost of storage). Unless you have very long retention periods or an unusually large proportion of duplicate data across clients, TCO of compression+dumb storage likely beats deduplication.

    Ongoing annoyance: mminfo lies about the size of your backups (reporting the compressed size). This makes it more difficult to audit whether everything’s really getting backed up properly.
