{"id":2840,"date":"2011-02-15T19:07:18","date_gmt":"2011-02-15T09:07:18","guid":{"rendered":"http:\/\/nsrd.info\/blog\/?p=2840"},"modified":"2011-02-15T19:07:18","modified_gmt":"2011-02-15T09:07:18","slug":"aside-in-praise-of-pbzip2","status":"publish","type":"post","link":"https:\/\/nsrd.info\/blog\/2011\/02\/15\/aside-in-praise-of-pbzip2\/","title":{"rendered":"Aside: In praise of pbzip2"},"content":{"rendered":"<p>A while ago I got rather frustrated with the performance of compression utilities on my Mac Pro. It&#8217;s a bit of a beast: the last model before the Nehalem-based systems, with 8 x 3.2GHz cores and a respectable 20GB of RAM \u2013 yet compression isn&#8217;t always as fast as I&#8217;d have liked.<\/p>\n<p>Coming from a long-term Unix background, I have tended mostly to use bzip2 \u2013 brilliant compression ratios, but slow, <em>slow<\/em>, <strong><em>slow<\/em><\/strong>.<\/p>\n<p>Eventually it occurred to me that the problem was simple: bzip2 is single-threaded. Compressing a file and watching Activity Monitor, I can see a single core sitting at high utilisation \u2013 but that&#8217;s all.<\/p>\n<p>Apple&#8217;s Grand Central Dispatch made me think, though: how much better would compression utilities work if they could run against multiple cores? Not by running multiple compression activities at the same time \u2013 but by splitting a single job up and hitting as many cores as possible at once.<\/p>\n<p>As you&#8217;d expect, the answer is: much, much better. With a little bit of searching, I found <a title=\"pbzip2\" href=\"http:\/\/compression.ca\/pbzip2\/\" target=\"_blank\">pbzip2<\/a> \u2013 a parallel-processing version of bzip2. Check it out, and be sure to donate to the programmer \u2013 he deserves the support.<\/p>\n<p>Without a doubt, it&#8217;s a much faster way of compressing files when you have a bunch of cores available to throw at the problem. 
Here&#8217;s a test scenario:<\/p>\n<ul>\n<li>Generate an 8GB, highly random file.<\/li>\n<li>Time a regular bzip2 on the file.<\/li>\n<li>Time a pbzip2 on the file.<\/li>\n<\/ul>\n<p>The results? See for yourself:<\/p>\n<pre># <strong>du -hs test.dat<\/strong>\n8.0GB   test.dat<\/pre>\n<pre># <strong>date ; bzip2 &lt; test.dat &gt; test-bzip.dat.bz2; date<\/strong>\nSun 13 Feb 2011 12:33:11 EST\nSun 13 Feb 2011 13:02:16 EST<\/pre>\n<pre># <strong>date; pbzip2 &lt; test.dat &gt; test-pbzip.dat.bz2; date<\/strong>\nSun 13 Feb 2011 13:06:58 EST\nSun 13 Feb 2011 13:12:05 EST<\/pre>\n<p>That&#8217;s 29 minutes, 5 seconds for regular bzip2 against 5 minutes, 7 seconds for pbzip2 \u2013 better than a 5x speedup on the same data.<\/p>\n<p>In the above case, the compressed files are both still effectively 8GB, since the source file was designed not to be susceptible to compression. Looking at a &#8220;real world&#8221; example then, I&#8217;ll pick a virtual machine. In fact, being able to quickly compress copies of virtual machines was the reason I first looked around for a better compression utility, so it makes sense to test with one. Picking a small virtual machine, I can see:<\/p>\n<pre>[Sun Feb 13 13:20:42]\npreston@aralathan \/Volumes\/Data\/VMs\n$ <strong>du -hs test02.pvm\/<\/strong>\n6.1G\ttest02.pvm\/<\/pre>\n<p>Now, compressing with regular bzip2 via tar, we get:<\/p>\n<pre>$ <strong>date; tar cf - test02.pvm | bzip2 -c &gt; ~\/Desktop\/test-bzip2.pvm.bz2; date<\/strong>\nSun 13 Feb 2011 13:25:37 EST\nSun 13 Feb 2011 13:40:45 EST\n$ <strong>du -ms ~\/Desktop\/test-bzip2.pvm.bz2<\/strong>\n2087\t\/Users\/preston\/Desktop\/test-bzip2.pvm.bz2<\/pre>\n<p>That was a total of 15 minutes, 8 seconds to compress 6.1GB down to 2087MB using conventional bzip2.<\/p>\n<p>Moving on to 
tar\/<strong>pbzip2<\/strong>:<\/p>\n<pre>$ <strong>date; tar cf - test02.pvm | pbzip2 -c &gt; ~\/Desktop\/test-pbzip2.pvm.bz2; date<\/strong>\nSun 13 Feb 2011 13:47:23 EST\nSun 13 Feb 2011 13:49:53 EST\n$ <strong>du -ms ~\/Desktop\/test-pbzip2.pvm.bz2<\/strong>\n2092\t\/Users\/preston\/Desktop\/test-pbzip2.pvm.bz2<\/pre>\n<p>So while it cost an extra 5MB in storage, pbzip2 compressed the same data in 150 seconds \u2013 or two and a half minutes, if you will. (I should note that this was reading from a 2 x 7200 RPM stripe and writing to SSD. At this point the compression seems to be IO-bound on the read, if anything \u2013 I had previously compressed the same data in 104 seconds <em>to a firewire-800 drive<\/em> when reading from a 3 x 7200 RPM stripe.)<\/p>\n<p>If you need high-speed compression, be sure to check out pbzip2.<\/p>\n<p>NOTE: You may be wondering why I didn&#8217;t use the Unix &#8216;time&#8217; command. Ever since encountering a bug in Tru64 Unix &#8216;time&#8217;, I&#8217;ve steered clear of it. (That bug resulted in &#8216;time&#8217; blowing out the runtime of what it was monitoring by several orders of magnitude.) 
I know it would be safe to go back to &#8216;time&#8217; by now, but for the purposes of what I needed to demonstrate, &#8220;date; command; date&#8221; was more than sufficient.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A while ago I got rather frustrated with the performance of compression utilities on my Mac Pro. It&#8217;s a bit&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[4,12],"tags":[242,610,611],"class_list":["post-2840","post","type-post","status-publish","format-standard","hentry","category-aside","category-general-technology","tag-compression","tag-multi-core","tag-multi-processor"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pKpIN-JO","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts\/2840","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/comments?post=2840"}],"version-history":[{"count":0,"href":"https:\/\/nsrd.info\/blog\/wp-j
son\/wp\/v2\/posts\/2840\/revisions"}],"wp:attachment":[{"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/media?parent=2840"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/categories?post=2840"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nsrd.info\/blog\/wp-json\/wp\/v2\/tags?post=2840"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}