It’s funny, the little tools you build up over the years as someone heavily involved in backup, particularly when it comes to testing.
I have two tools that help me with filesystem and performance testing – one I call generate-filesystem, and one called genbf (generate big file).
The genbf tool came about when I wanted files that were highly resistant to being compressed – and indeed, to subsequently being deduplicated as well. Sure, bigasm can produce good results, but it isn’t guaranteed to produce highly random data. That’s where genbf comes in. Best of all, it’s fast. For example, a 1GB file on my 12-core lab server gets created in under 10 seconds:
[pmdg@orilla test]$ date; genbf.pl -s 1024 -f test.dat; date
Tue Nov 18 19:08:24 AEDT 2014
Progress:
Pre-generating random data chunk. (This may take a while.)
0% of random data chunk generated.
10% of random data chunk generated.
20% of random data chunk generated.
30% of random data chunk generated.
40% of random data chunk generated.
50% of random data chunk generated.
60% of random data chunk generated.
70% of random data chunk generated.
80% of random data chunk generated.
90% of random data chunk generated.
Creating 1024 MB file test.dat
Wrote data file in 5121 chunks.
Tue Nov 18 19:08:33 AEDT 2014
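If you don't have genbf to hand, the general idea of writing a file of high-entropy data in chunks can be sketched in a few lines of Python. To be clear, this is not how genbf works internally: its log shows it pre-generating a random data chunk first, and the details of that aren't covered here. The sketch below takes the simpler route of drawing every chunk fresh from the operating system's random source. The function name `write_random_file` and its parameters are made up for illustration.

```python
import os

def write_random_file(path, size_mb, chunk_kb=1024):
    """Write size_mb MiB of OS-supplied random bytes to path.

    Returns the number of chunks written. Hypothetical helper for
    illustration only; it does not reproduce genbf's own algorithm.
    """
    chunk_len = chunk_kb * 1024
    remaining = size_mb * 1024 * 1024
    chunks = 0
    with open(path, "wb") as out:
        while remaining > 0:
            # The final chunk may be shorter than chunk_len.
            n = min(chunk_len, remaining)
            # os.urandom returns n bytes from the OS random source,
            # which general-purpose compressors cannot shrink.
            out.write(os.urandom(n))
            remaining -= n
            chunks += 1
    return chunks
```

Calling `write_random_file("test.dat", 1024)` would produce a 1GB file in the same spirit as the run above, although the timing will depend on how fast your system can supply random bytes.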
OK, OK, a 1GB file can be created quickly if you’re just pulling in from /dev/zero, but here’s the difference in file size before and after compression:
[pmdg@orilla test]$ ls -al test.dat
-rw-rw-r-- 1 pmdg pmdg 1073741824 Nov 18 19:08 test.dat
[pmdg@orilla test]$ pbzip2 -r test.dat
[pmdg@orilla test]$ ls -al test.dat.bz2
-rw-rw-r-- 1 pmdg pmdg 1065615793 Nov 18 19:08 test.dat.bz2
(If you haven’t heard of pbzip2, enlighten yourself and support the author. It’s brilliant.)
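To put a number on that listing: 1,073,741,824 bytes shrank to 1,065,615,793 bytes, a saving of roughly 0.76%. If you want to see the same contrast without generating a full 1GB file, here is a rough sketch using Python's standard `bz2` module on in-memory buffers. The 4 MiB buffer size and the helper name `compression_ratio` are arbitrary choices for illustration, and the random buffer comes from `os.urandom`, not from genbf.

```python
import bz2
import os

def compression_ratio(data):
    """Return compressed size divided by original size (1.0 = no saving)."""
    return len(bz2.compress(data)) / len(data)

SIZE = 4 * 1024 * 1024  # 4 MiB sample; an arbitrary size for a quick check

# Random bytes should barely compress; a run of zeros should almost vanish.
random_ratio = compression_ratio(os.urandom(SIZE))
zero_ratio = compression_ratio(bytes(SIZE))

# Saving achieved in the pbzip2 listing, from the two file sizes shown.
listing_saving = 1 - 1065615793 / 1073741824

print(f"random data ratio: {random_ratio:.4f}")
print(f"zero data ratio:   {zero_ratio:.6f}")
print(f"listing saving:    {listing_saving:.2%}")
```

A ratio close to 1.0 for the random buffer, against a tiny ratio for the zero-filled one, is the same effect the pbzip2 run shows at full scale.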
When it comes to subsequently sending the generated data to Data Domain, the deduplication is extremely low: 20 x 1GB files generated using the standard setting above, for instance, consume almost exactly an additional 20GB of occupied space.
If you want to try it out, you can download it from here. (You’ll need Perl on your system.) Standard usage is below: