Aside: In praise of pbzip2

A while ago I got rather frustrated with the performance of compression utilities on my Mac Pro. It’s a bit of a beast – the last model before the Nehalem-based systems, with 8 x 3.2GHz cores and a respectable 20GB of RAM – yet compression wasn’t always as fast as I’d have liked.

Coming from a long-term Unix background, I have tended mostly to use bzip2 – brilliant compression ratios, but slow, slow, slow.

Eventually it occurred to me that the problem was simple: bzip2 is single-threaded. I can compress a file and, watching Activity Monitor, see a single core sit at a high utilisation rate – but that’s all.

Apple’s Grand Central Dispatch made me think: how much better would compression utilities work if they could run against multiple cores? Not by running multiple compression jobs at the same time, but by splitting a single job up and hitting as many cores as possible at once.
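That splitting-up approach can be sketched by hand in plain shell, which is essentially what pbzip2 does internally: chop the input into chunks, compress each chunk as its own background job, and concatenate the results. Concatenated .bz2 streams still decompress as a single file. (A rough sketch, assuming `split` and `bzip2` are on the PATH; the filenames and chunk size are illustrative, and the input file here is a small stand-in.)

```shell
#!/bin/sh
# Parallel bzip2 by hand: split the input into chunks, compress each
# chunk in its own background job, then concatenate the streams.
dd if=/dev/urandom of=test.dat bs=1M count=8 2>/dev/null  # small stand-in input
split -b 1M test.dat chunk.
for c in chunk.*; do
  bzip2 "$c" &          # each chunk compresses on its own core
done
wait                    # let all the compressors finish
cat chunk.*.bz2 > test.dat.bz2
rm -f chunk.*.bz2
```

Because each chunk is an independent bzip2 stream, the work parallelises almost perfectly; the trade-off is a small loss of compression ratio at the chunk boundaries.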

As you’d expect, the answer is: much, much better. With a little bit of searching, I found pbzip2 – a parallel processor version of bzip2. Check it out, and be sure to donate to the programmer – he deserves the support.

Without a doubt, it’s a much faster way of compressing files when you have a bunch of cores available to throw at the problem. Here’s a test scenario:

  • Generate an 8GB, highly random file.
  • Time a regular bzip2 on the file.
  • Time a pbzip2 on the file.
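For the first step, a highly random (and therefore essentially incompressible) file can be generated with dd. Scaled down here for illustration – the 8GB original would be `count=8192`:

```shell
#!/bin/sh
# Generate a highly random test file; random data defeats compression,
# so the timing measures raw throughput. Scaled down from 8GB to 8MB.
dd if=/dev/urandom of=test.dat bs=1M count=8 2>/dev/null
du -hs test.dat
```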

The results? See for yourself:

# du -hs test.dat
8.0G    test.dat
# date ; bzip2 < test.dat > test-bzip.dat.bz2; date
Sun 13 Feb 2011 12:33:11 EST
Sun 13 Feb 2011 13:02:16 EST
# date; pbzip2 < test.dat > test-pbzip.dat.bz2; date
Sun 13 Feb 2011 13:06:58 EST 
Sun 13 Feb 2011 13:12:05 EST
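Doing the arithmetic on those timestamps (values copied from the runs above), bzip2 took 29 minutes, 5 seconds against pbzip2’s 5 minutes, 7 seconds – a speedup of roughly 5.7x:

```shell
#!/bin/sh
# Elapsed times from the timestamps above, converted to seconds.
# bzip2:  12:33:11 -> 13:02:16
# pbzip2: 13:06:58 -> 13:12:05
bzip2_s=$(( (13*3600 + 2*60 + 16) - (12*3600 + 33*60 + 11) ))
pbzip2_s=$(( (13*3600 + 12*60 + 5) - (13*3600 + 6*60 + 58) ))
echo "bzip2:  ${bzip2_s}s"    # 1745s, about 29 minutes
echo "pbzip2: ${pbzip2_s}s"   # 307s, just over 5 minutes
```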

In the above case, the compressed files are both still effectively 8GB, since the source file was designed to not be susceptible to compression. For a “real world” example, I’ll pick a virtual machine – in fact, being able to quickly compress copies of virtual machines was the reason I went looking for a better compression utility in the first place. Picking a small virtual machine, I can see:

[Sun Feb 13 13:20:42]
preston@aralathan /Volumes/Data/VMs
$ du -hs test02.pvm/
6.1G	test02.pvm/

Now, compressing with regular bzip2 via tar, we get:

$ date; tar cf - test02.pvm | bzip2 -c > ~/Desktop/test-bzip2.pvm.bz2; date
Sun 13 Feb 2011 13:25:37 EST
Sun 13 Feb 2011 13:40:45 EST
$ du -ms ~/Desktop/test-bzip2.pvm.bz2
2087	/Users/preston/Desktop/test-bzip2.pvm.bz2 

That was a total of 15 minutes, 8 seconds to compress 6.1GB down to 2087 MB using conventional bzip2.

Moving on to tar/pbzip2:

$ date; tar cf - test02.pvm | pbzip2 -c > ~/Desktop/test-pbzip2.pvm.bz2; date
Sun 13 Feb 2011 13:47:23 EST
Sun 13 Feb 2011 13:49:53 EST
$ du -ms ~/Desktop/test-pbzip2.pvm.bz2 
2092	/Users/preston/Desktop/test-pbzip2.pvm.bz2

So while it cost an extra 5MB in storage, pbzip2 compressed the same data in 150 seconds – or two and a half minutes, if you will. (I should note that run was reading from a 2 x 7200 RPM stripe and writing to SSD. At this point the compression seems, if anything, to be IO bound on the read – I had previously compressed the same data in 104 seconds to a FireWire 800 drive when reading from a 3 x 7200 RPM stripe.)
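One other nicety worth knowing: pbzip2’s output is just a series of standard bzip2 streams, so files it produces can be tested and decompressed with plain bzip2 on machines that don’t have pbzip2 installed. You can convince yourself of this by simulating multi-stream output – concatenating two ordinary bzip2 streams by hand (filenames illustrative):

```shell
#!/bin/sh
# pbzip2 output is a concatenation of ordinary bzip2 streams;
# simulate one by hand and verify plain bzip2 handles it.
printf 'hello ' | bzip2 > part1.bz2
printf 'world\n' | bzip2 > part2.bz2
cat part1.bz2 part2.bz2 > multi.bz2
bzip2 -t multi.bz2            # integrity test passes
bunzip2 -c multi.bz2          # prints "hello world"
```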

If you’re needing high speed compression, be sure to check out pbzip2.

NOTE: You may be wondering why I didn’t use the Unix ‘time’ command. Ever since encountering a bug in Tru64 Unix ‘time’, I’ve steered clear of it. (That bug resulted in ‘time’ blowing out the runtime of what it was monitoring by several orders of magnitude.) I know, it would be safe to go back to ‘time’ by now, but for the purposes of what I needed to demonstrate, “date; command; date” was more than sufficient.
