Files and files and files

A while ago I gave away a utility called genbf, which I find quite handy in lab and testing situations. If you'll recall, it can be used to generate large files that are not susceptible to compression or deduplication. (You can find that utility here.)

At the time I mentioned another utility I use called generate-filesystem. While genbf is designed to produce potentially very large files that don't yield to compression, generate-filesystem (or genfs2, as I'm now calling it) is designed to create a random filesystem for you. It's not the same, of course, as taking, say, a snapshot copy of your production fileserver, but if you're wanting a completely isolated lab and some random content to do performance testing against, it'll do the trick nicely. In fact, I've used it (or predecessors of it) multiple times when I've blogged about block-based backups, filesystem density and parallel save streams.

genfs2

Overall it produces files that don't yield all that much to compression: a 26GB directory structure with 50,000 files created with it compressed down to 25GB in a test I ran a short while ago (a saving of less than 4%). That's where genfs2 comes in handy – you can create really dense test filesystems with almost no effort on your part. (Yes, 50,000 files isn't necessarily dense, but that was just a small run.)
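
If you want to sanity-check the compressibility of your own output, one quick (if rough) way is to tar and gzip the tree and compare sizes. The paths below are just placeholders for wherever you pointed genfs2 and wherever you want the archive:

$ du -sh /path/to/fstest
$ tar -czf /tmp/fstest.tgz -C /path/to/fstest .
$ du -sh /tmp/fstest.tgz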

It is, however, random by default in how many files it creates, and unless you give it an actual file count limit it can easily fill a filesystem if you let it run wild. You see, rather than having fixed limits for files and directories at each directory level, it works with upper and lower bounds (which you can override) and chooses a random number each time. It even randomly chooses how deeply it nests directories, based on upper and lower limits that you can override as well.
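
To make that concrete, here's a rough sketch (in Perl, since that's what genfs2 is written in) of the bounded-random approach. This is not the actual genfs2 code – the bounds are simply the documented defaults, and the file contents here are plain filler rather than random data:

#!/usr/bin/perl
# Illustrative sketch of the bounded-random approach only -- not the genfs2 source.
use strict;
use warnings;
use File::Spec;

# Pick a random integer between $min and $max inclusive.
sub rand_between {
    my ($min, $max) = @_;
    return $min + int(rand($max - $min + 1));
}

# Populate $dir with a random number of files and subdirectories,
# recursing until the randomly chosen depth runs out.
sub populate {
    my ($dir, $depth) = @_;
    return if $depth <= 0;

    for my $f (1 .. rand_between(5, 10)) {          # -f/-F defaults
        my $path = File::Spec->catfile($dir, "file$f.dat");
        open my $fh, '>', $path or die "open $path: $!";
        # Plain filler here; genfs2 writes hard-to-compress data instead
        # (see the -P pre-generated chunk option in the usage below).
        print $fh 'x' x rand_between(1_024, 1_048_576);   # -s/-S defaults
        close $fh;
    }
    for my $d (1 .. rand_between(5, 10)) {          # -d/-D defaults
        my $sub = File::Spec->catdir($dir, "dir$d");
        mkdir $sub or die "mkdir $sub: $!";
        populate($sub, $depth - 1);
    }
}

# The target must already exist, as with genfs2's -t option. With these
# default bounds the tree grows very quickly -- hence the warning above.
my $target = shift @ARGV or die "usage: $0 target-dir\n";
populate($target, rand_between(5, 10));             # -r/-R defaults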

Here’s what the usage information for it looks like:

$ ./genfs2.pl -h
Syntax: genfs2.pl [-d minDir] [-D maxDir] [-f minFile] [-F maxFile] [-r minRecurse] [-R maxRecurse] -t target [-s minSize] [-S maxSize] [-l minLength] [-L maxLength] [-C] [-P dCsize] [-T mfc] [-q] [-I]

Creates a randomly populated directory structure for backup/recovery 
and general performance testing. Files created are typically non-
compressible.

All options other than target are optional. Values in parentheses beside
explanations denote defaults that are used if not supplied.

Where:

    -d minDir      Minimum number of directories per layer. (5)
    -D maxDir      Maximum number of directories per layer. (10)
    -f minFile     Minimum number of files per layer. (5)
    -F maxFile     Maximum number of files per layer. (10)
    -r minRecurse  Minimum recursion depth for base directories. (5)
    -R maxRecurse  Maximum recursion depth for base directories. (10)
    -t target      Target where directories are to start being created.
                   Target must already exist. This option MUST be supplied.
    -s minSize     Minimum file size (in bytes). (1 K)
    -S maxSize     Maximum file size (in bytes). (1 MB)
    -l minLength   Minimum filename/dirname length. (5)
    -L maxLength   Maximum filename/dirname length. (15)
    -P dCsize      Pre-generate a random data chunk of at least dCsize bytes.
                   Will default to 52428800 bytes.
    -C             Try to provide compressible files.
    -I             Use lorem ipsum filenames.
    -T mfc         Specify maximum number of files that will be created.
                   Does not include directories in count.
    -q             Quiet mode. Only print updates to the file-count.

E.g.:

./genfs2.pl -r 2 -R 32 -s 512 -S 65536 -t /d/06/test

Would generate a random filesystem starting in /d/06/test, with a minimum
recursion depth of 2 and a maximum recursion depth of 32, with a minimum
filesize of 512 bytes and a maximum filesize of 64K.
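
That -P option is worth a comment: the progress output below shows the chunk being pre-generated up front, and presumably individual files are then cut from it rather than having fresh random data generated for each one. I haven't lifted this from the genfs2 source, but a minimal sketch of that idea, assuming a Unix-like system with /dev/urandom, might look like this:

#!/usr/bin/perl
# Sketch of the pre-generated random chunk idea -- illustrative only, not from genfs2.
use strict;
use warnings;

my $chunk_size = 52_428_800;    # 50 MB, the documented -P default

# Fill the chunk once from /dev/urandom (assumes a Unix-like system).
open my $rnd, '<', '/dev/urandom' or die "open /dev/urandom: $!";
binmode $rnd;
my $chunk = '';
read $rnd, $chunk, $chunk_size;
close $rnd;

# Write a file of $size bytes by copying from a random offset in the chunk;
# the slice is still random data, so it compresses very poorly.
sub write_file {
    my ($path, $size) = @_;
    my $offset = int(rand($chunk_size - $size));
    open my $fh, '>', $path or die "open $path: $!";
    binmode $fh;
    print $fh substr($chunk, $offset, $size);
    close $fh;
}

write_file('example.dat', 65_536);    # a 64K test file

Generating the pool once rather than per file is what keeps run times reasonable; producing a fresh megabyte of random data for every file would dominate the runtime.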

Areas where this utility can be useful include:

  • Filling a filesystem with something other than /dev/zero
  • Testing anything to do with dense filesystems without needing huge storage space
  • Doing performance comparisons between block-based backup and regular backups
  • Doing performance comparisons between parallel save streams and regular backups

This is one of those utilities I wrote over a decade ago and have just done minor tweaks on here and there since. There's probably a heap of areas where it's not optimal, but it's done the trick, and it's done it quickly enough for me. (In other words: don't judge my programming skills based on the code – I've never been tempted to optimise it.) For instance, on a 13″ MacBook Pro writing to a 2TB LaCie Rugged external drive via Thunderbolt, the following command takes 6 minutes to complete:

$ ./genfs2.pl -T 50000 -t /Volumes/Storage/FSTest -d 5 -D 15 -f 10 -F 30 -q -I
Progress:
        Pre-generating random data chunk. (This may take a while.)
        Generating files. Standby.
         --- 100 files
         --- 200 files
         --- 300 files
         ...
         --- 49700 files
         --- 49800 files
         --- 49900 files
         --- 50000 files

Hit maximum file count (50000).

I don’t mind waiting 6 minutes for 50,000 files occupying 26GB. If you’re wondering what the root directory from this construction looks like, it goes something like this:

$ ls /Volumes/Storage/FSTest/
at-eleifend/
egestas elit nisl.dat
eget.tbz2
facilisis morbi rhoncus.7r
interdum
lacinia-in-rhoncus aliquet varius-nullam-a/
lobortis mi-malesuada aenean/
mi mi netus-habitant-tortor-interdum rhoncus.mov
mi-neque libero risus-euismod ante.gba
non-purus-varius ac.dat
quis-tortor-enim-sed-lorem pellentesque pellentesque/
sapien-in auctor-libero.anr
tincidunt-adipiscing-eleifend.xlm
ut.xls

Looking at the file/directory breakdown in GrandPerspective, you'll see it's reasonably evenly scattered:

[Image: GrandPerspective view of the generated filesystem]

Since genfs2 doesn't do anything with the directory you give it other than add random files to it, you can run it multiple times with different parameters. For instance, you might do an initial run to create 1,000,000 small files, then, if you're wanting a mix of small and large files, execute it a few more times to scatter some much larger random files throughout the directory structure as well.
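
As an illustration (the counts and sizes here are made up, but the flags are the ones documented above), a small-files pass followed by a large-files pass might look like:

$ ./genfs2.pl -t /d/06/test -T 1000000 -s 1024 -S 8192 -q
$ ./genfs2.pl -t /d/06/test -T 200 -s 104857600 -S 1073741824 -q

The first run caps out at a million files of between 1KB and 8KB each; the second scatters up to 200 files of between 100MB and 1GB through the same tree.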

Now here's the caution: do not, definitely do not, run this on one of your production filesystems, or on any filesystem where running out of space might cause data loss or an access failure.

If you’re wanting to give it a spin or make use of it, you can freely download it from here.
