A while ago I got rather frustrated with the performance of compression utilities on my Mac Pro. It’s a bit of a beast: the last model before the Nehalem-based systems, with 8 x 3.2GHz cores and a respectable 20GB of RAM. Even so, compression isn’t always as fast as I’d like.
Coming from a long-term Unix background, I have tended mostly to use bzip2 – brilliant compression ratios, but slow, slow, slow.
Eventually it occurred to me that the problem was simple: bzip2 is single-threaded. I can compress a file and, using Activity Monitor, see a single core sit at a high utilisation rate – but that’s all.
Apple’s Grand Central though made me think: how much better would compression utilities work if they could run against multiple cores? Not running multiple compression activities at the same time – but splitting it up and hitting as many cores as possible at once.
As you’d expect, the answer is: much, much better. With a little bit of searching, I found pbzip2 – a parallel processor version of bzip2. Check it out, and be sure to donate to the programmer – he deserves the support.
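To illustrate the idea of splitting one compression job across several cores, here is a rough sketch that gets a similar effect out of plain bzip2. It is only an illustration of the principle, not pbzip2's actual code: the function name parallel_bzip2, its arguments and the default chunk size are made up for this example, and it assumes a find and xargs that support -print0 and -0/-P (both the macOS and GNU versions do).

```shell
# Sketch: compress one file with several bzip2 processes at once.
# parallel_bzip2 SRC OUT [JOBS] [CHUNK_BYTES]  (hypothetical helper)
parallel_bzip2() {
    src=$1                    # file to compress
    out=$2                    # resulting .bz2 file
    jobs=${3:-4}              # how many bzip2 processes to run at once
    chunk=${4:-104857600}     # chunk size in bytes (default 100 MiB)

    tmp=$(mktemp -d) || return 1

    # 1. Split the input into pieces named chunk.aa, chunk.ab, ...
    #    (split's default two-letter suffix allows at most 676 pieces).
    split -b "$chunk" "$src" "$tmp/chunk." || return 1

    # 2. Compress the pieces in place, up to $jobs at a time.
    find "$tmp" -type f -name 'chunk.*' -print0 |
        xargs -0 -n 1 -P "$jobs" bzip2 || return 1

    # 3. Concatenate the compressed pieces in name order. bzip2 -d
    #    decompresses a file made of several concatenated streams.
    cat "$tmp"/chunk.*.bz2 > "$out"
    rm -rf "$tmp"
}
```

Something like `parallel_bzip2 test.dat test.dat.bz2 8` would then keep eight cores busy, and the result can be unpacked with an ordinary `bzip2 -d`.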
Without a doubt, it’s a much faster way of compressing files when you have a bunch of cores available to throw at the problem. Here’s a test scenario:
- Generate an 8GB, highly random file.
- Time a regular bzip2 on the file.
- Time a pbzip2 on the file.
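The steps above can be sketched as a couple of small shell functions. This is a reconstruction rather than the exact commands used to build the test file: the function names are invented for the example, and reading from /dev/urandom is an assumption about how a "highly random" file might be generated.

```shell
# Sketch of the test scenario; names and sizes are placeholders.

# make_random_file FILE SIZE_MIB
# Fill FILE with SIZE_MIB mebibytes of random (essentially incompressible) data.
make_random_file() {
    dd if=/dev/urandom of="$1" bs=1048576 count="$2" 2>/dev/null
}

# time_compressor COMMAND INPUT OUTPUT
# Run COMMAND as a filter from INPUT to OUTPUT and print the elapsed
# wall-clock time in whole seconds.
time_compressor() {
    start=$(date +%s)
    "$1" < "$2" > "$3" || return 1
    end=$(date +%s)
    echo $((end - start))
}

# Example run for an 8GB (8192 MiB) file:
#   make_random_file test.dat 8192
#   time_compressor bzip2  test.dat test-bzip.dat.bz2
#   time_compressor pbzip2 test.dat test-pbzip.dat.bz2
```

The example run is left commented out because it takes a long time and needs 8GB of free disk space; a much smaller size is enough to check that everything works.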
The results? See for yourself:
# du -hs test.dat
8.0GB test.dat
# date ; bzip2 < test.dat > test-bzip.dat.bz2; date
Sun 13 Feb 2011 12:33:11 EST
Sun 13 Feb 2011 13:02:16 EST
# date; pbzip2 < test.dat > test-pbzip.dat.bz2; date
Sun 13 Feb 2011 13:06:58 EST
Sun 13 Feb 2011 13:12:05 EST
That works out to 29 minutes, 5 seconds for bzip2 against 5 minutes, 7 seconds for pbzip2. In this case, the compressed files are both still effectively 8GB, since the source file was designed not to be susceptible to compression. For a “real world” example, then, I’ll pick a virtual machine. Being able to quickly compress copies of virtual machines was the reason I first looked around for a better compression utility, so it makes sense to test with one. Picking a small virtual machine, I can see:
[Sun Feb 13 13:20:42]
preston@aralathan /Volumes/Data/VMs
$ du -hs test02.pvm/
6.1G test02.pvm/
Now, compressing with regular bzip2 via tar, we get:
$ date; tar cf - test02.pvm | bzip2 -c > ~/Desktop/test-bzip2.pvm.bz2; date
Sun 13 Feb 2011 13:25:37 EST
Sun 13 Feb 2011 13:40:45 EST
$ du -ms ~/Desktop/test-bzip2.pvm.bz2
2087 /Users/preston/Desktop/test-bzip2.pvm.bz2
That was a total of 15 minutes, 8 seconds to compress 6.1GB down to 2087MB using conventional bzip2.
Moving on to tar/pbzip2:
$ date; tar cf - test02.pvm | pbzip2 -c > ~/Desktop/test-pbzip2.pvm.bz2; date
Sun 13 Feb 2011 13:47:23 EST
Sun 13 Feb 2011 13:49:53 EST
$ du -ms ~/Desktop/test-pbzip2.pvm.bz2
2092 /Users/preston/Desktop/test-pbzip2.pvm.bz2
So while it cost an extra 5MB in storage, pbzip2 compressed the same data in 150 seconds – or two and a half minutes, if you will. (I should note that this was reading from a 2 x 7200 RPM stripe and writing to SSD. At this point the compression seems to be I/O-bound on the read, if anything – I had previously compressed the same data in 104 seconds to a FireWire 800 drive when reading from a 3 x 7200 RPM stripe.)
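For anyone who wants to check the arithmetic, the elapsed times can be worked out from the "date" stamps in the transcripts. The snippet below is just a convenience sketch; the helper names to_secs and elapsed are made up for it, and it only handles two timestamps from the same day.

```shell
# to_secs HH:MM:SS -> seconds since midnight
to_secs() {
    h=${1%%:*}          # hours
    rest=${1#*:}
    m=${rest%%:*}       # minutes
    s=${rest#*:}        # seconds
    # Strip one leading zero so values like "08" are not read as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# elapsed START END -> difference in seconds (same day only)
elapsed() {
    echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

bzip2_secs=$(elapsed 13:25:37 13:40:45)    # 908 seconds
pbzip2_secs=$(elapsed 13:47:23 13:49:53)   # 150 seconds
echo "bzip2: ${bzip2_secs}s  pbzip2: ${pbzip2_secs}s"

# sh arithmetic is integer-only, so use awk for the ratio.
awk -v a="$bzip2_secs" -v b="$pbzip2_secs" \
    'BEGIN { printf "speed-up: %.1fx\n", a / b }'
```

On the virtual machine test that comes out at roughly a six-fold speed-up on eight cores, which fits with the read side being the bottleneck.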
If you’re needing high speed compression, be sure to check out pbzip2.
NOTE: You may be wondering why I didn’t use the Unix ‘time’ command. Ever since encountering a bug in Tru64 Unix ‘time’, I’ve steered clear of it. (That bug resulted in ‘time’ blowing out the runtime of what it was monitoring by several orders of magnitude.) I know it would be safe to go back to ‘time’ by now, but for the purposes of what I needed to demonstrate, “date; command; date” was more than sufficient.
You may want to check out the LZMA and/or XZ (LZMA2) compression utilities as well. They’re actually faster than bzip2:
http://tukaani.org/lzma/benchmarks.html
http://blogs.sun.com/timc/entry/tamp_a_lightweight_multi_threaded
You can also stick with plain old gzip and use the parallel compression utility “pigz”, which can use more than one core even though it uses the same old algorithm.
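As a rough illustration, pigz can be dropped into the same kind of tar pipeline used in the post. The function name compress_dir_gz and the fallback to plain gzip when pigz is not installed are assumptions added for this sketch.

```shell
# compress_dir_gz DIR OUTPUT.tar.gz  (hypothetical helper)
# Tar up DIR and compress it with pigz if available, else plain gzip.
compress_dir_gz() {
    if command -v pigz >/dev/null 2>&1; then
        gz=pigz     # multi-core, writes ordinary gzip format
    else
        gz=gzip     # single-core fallback
    fi
    tar cf - "$1" | "$gz" -c > "$2"
}

# Example:
#   compress_dir_gz test02.pvm ~/Desktop/test02.pvm.tar.gz
```

Either way the output is a normal .tar.gz, so it unpacks with the usual `gzip -dc file.tar.gz | tar xf -`.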
If you like pbzip2, also check out lbzip2 and plzip. Benchmark comparison: http://vbtechsupport.com/1614/