Architectural failings of monolithic personal mail databases

This article might be summarised as:

I wish Microsoft would pull their heads in and design better email client software.

I’ve used a lot of email clients over the years. That includes a wide variety of Unix text and graphical mail clients, Evolution on Linux, the various Netscape-ish mail clients (Thunderbird, Mozilla Mail and Netscape Mail), Outlook, GroupWise, Lotus Notes and Entourage*.

As a backup administrator, I think Outlook and Entourage represent a special kind of hell, due to the monolithic nature of the client storage database.

As an example, here’s a file size breakdown of the current Entourage (2008) mail database on my laptop:

[Thu Jun 04 16:20:29]
preston@archon ~/Documents/Microsoft User Data/Office 2008 Identities/Main Identity
$ du -hs *
5.7G    Database
 16K    Mailing Lists
4.0K    My Day.plist
 28K    Rules
304K    Signatures

Note that they’re all files, not directories. That’s right, my Entourage mail database is currently 5.7GB.

Every single email I’ve received in my current job is stored in a single, monolithic database. Obviously, there are copies on the central Exchange server, with older copies shortcut via EmailXtender. However, I work remotely from the primary Exchange server, so I really do rely on easy access via my local mail store.

Now, I know I could choose not to back up this mail database, given the email is already on the server, but because I’m remote all the time, I don’t really want to either:

(a) Have to resync the database in the event of a crash

or

(b) Pull old email out of EmailXtender just because I had a crash and had to retrieve shortcuts.

So, needing to back up the database, I’m faced with a daily incremental backup of nearly 6GB (and growing), even if all I do is mark a single email as read.

Conversely, my personal email, stored within Apple Mail, is now over 8GB, and daily incrementals for that are typically less than 100KB.

I can think of no compelling architectural reason to keep all the mail in one location other than a desire to keep individual messages off the filesystem, and I no longer consider that a compelling reason. Sure, my 8GB of mail stored as individual messages takes up a lot of inodes on the filesystem, but filesystems do certainly have a lot of inodes, so that’s not really a problem.

Yes, having a lot of small files makes for a dense filesystem, but I’ll take a dense filesystem over a monolithic database with no backup tool any day for data storage. At least with the former you can still back it up incrementally, albeit slowly, as opposed to the latter, where you need to do a full backup every time.
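To make the difference concrete, here is a minimal sketch of how a file-level incremental backup decides what to copy, assuming it selects files purely by modification time. The store layout, file names (`Database`, `42.eml`) and sizes are hypothetical and chosen only for illustration; real backup products use more sophisticated change detection.

```python
import os
import tempfile


def incremental_bytes(root, since):
    """Total size of files under root modified after `since` (epoch seconds)."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            if st.st_mtime > since:
                total += st.st_size
    return total


def build_stores(base, messages=100, size=1024):
    """Create one monolithic store and one file-per-message store (hypothetical layout)."""
    mono = os.path.join(base, "mono")
    per = os.path.join(base, "per_message")
    os.makedirs(mono)
    os.makedirs(per)
    with open(os.path.join(mono, "Database"), "wb") as f:
        f.write(b"x" * (messages * size))
    for i in range(messages):
        with open(os.path.join(per, f"{i}.eml"), "wb") as f:
            f.write(b"x" * size)
    return mono, per


def demo():
    with tempfile.TemporaryDirectory() as base:
        mono, per = build_stores(base)
        last_backup = 1_000_000_000  # arbitrary timestamp of the previous backup
        # Pretend every file was last written before the previous backup.
        for root in (mono, per):
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    path = os.path.join(dirpath, name)
                    os.utime(path, (last_backup - 60, last_backup - 60))
        # "Mark one message as read": change a single byte in each store.
        for path in (os.path.join(mono, "Database"), os.path.join(per, "42.eml")):
            with open(path, "r+b") as f:
                f.write(b"R")
            os.utime(path, (last_backup + 60, last_backup + 60))
        return incremental_bytes(mono, last_backup), incremental_bytes(per, last_backup)


if __name__ == "__main__":
    print(demo())
```

With 100 messages of 1KB each, a one-byte change makes the monolithic store eligible for a 102400-byte incremental, while the per-message store needs only 1024 bytes. The ratio gets worse as the mailbox grows.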

Various stabs have been made in the past, particularly for Outlook, in supporting incremental backups of the PST/local data stores – or to be more accurate, supporting delta backups (i.e., changed blocks only).
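As a rough illustration of what a delta backup does, the sketch below splits data into fixed-size blocks, hashes each block, and reports which blocks differ from the previous run. The 4096-byte block size and the function names are assumptions made for the example, not a description of how any particular Outlook or PST backup agent works.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old_digests, new_digests):
    """Indices of blocks that are new or differ from the previous backup."""
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or old_digests[i] != digest
    ]


def demo():
    store = bytearray(10 * BLOCK_SIZE)       # stand-in for a monolithic mail store
    before = block_digests(bytes(store))
    store[5 * BLOCK_SIZE + 10] = 0x01        # a one-byte change, e.g. a "read" flag
    after = block_digests(bytes(store))
    return changed_blocks(before, after)


if __name__ == "__main__":
    print(demo())
```

Only block 5 needs to be sent in this example, so the amount transferred and stored shrinks. The backup still has to read and hash the entire file to find that block, which is why this approach reduces backup size more than it reduces backup effort.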

I found it somewhat ironic that, when Apple released Mac OS X 10.5 Leopard with its most important feature, Time Machine, many users complained about Time Machine struggling with Entourage backups**. The expectation was that Apple was somehow responsible for the monolithic database structure of a third party application.

Apple isn’t, in the same way that EMC isn’t responsible for the database structure of Oracle, Sybase, etc. In those cases, EMC are able to provide modules that support incremental backups due to cooperation between the various companies in making APIs and procedures available to one another. Further, for server-based application storage, which will frequently exceed client application storage by orders of magnitude, this is entirely appropriate.

Bear in mind I’m not saying that Microsoft, say, has APIs but doesn’t release them, or has designed a product with no APIs at all. I don’t know what the state of API access for the Entourage and Outlook mail database formats is – and frankly, I don’t care. For client-side mail storage, there shouldn’t need to be an API to access the database, or a licensed backup product to do anything more advanced than cold backups. It’s just email. It should be plain text and immediately accessible.
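As a small illustration of what "immediately accessible" means in practice, the sketch below reads a message using nothing but Python’s standard `email` module. It assumes the message is stored in standard Internet message format, one message per file; the addresses and contents are made up for the example, and real clients may wrap messages in their own per-file format.

```python
from email import message_from_bytes
from email.policy import default

# A made-up message, as it might sit in its own file on disk.
RAW = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: Backup window\n"
    b"\n"
    b"The incremental finished in two minutes.\n"
)


def summarise(raw):
    """Return (sender, subject, body) from a plain-text message."""
    msg = message_from_bytes(raw, policy=default)
    return str(msg["From"]), str(msg["Subject"]), msg.get_content().strip()


if __name__ == "__main__":
    print(summarise(RAW))
```

No vendor API, database driver or licensed module is involved: any tool that can read a file can read, index, or back up the message.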

Given the complexity of integration achieved by, say, Apple’s mail/calendaring (particularly when including Apple’s server product) using an individual file structure for mail storage, and given the complexity of integration achieved by both Domino and GroupWise using a series of much smaller databases, there is no excuse for Microsoft.

Is this article a rant? Yes, you could perhaps argue that it is. Maybe I was standing on a soapbox the entire time I was writing it, but it is a rant grounded in some architectural reasoning: using monolithic database storage for client side applications when it is not required and when incremental backups would be highly desirable is at best distastefully inelegant.


* If you’ve not had exposure to it, Entourage is the “Outlook”ish mail client for the Macintosh.

** I don’t, because I exclude Entourage from Time Machine backups.

7 thoughts on “Architectural failings of monolithic personal mail databases”

  1. ZFS is planned for Mac OS 10.6 Server, and I can only hope that it will arrive for non-Server not too long after that. Using snapshots and ‘zfs send / recv’ will certainly make creating copies of these things a lot easier.

  2. True, snapshots will help, but even with snapshots we still need to be able to safely quiesce the database prior to the snapshot being taken. (Admittedly with fast snapshots – e.g., copy-on-write, the time you have to quiesce for is very low.) I’d still see that one would either need to say, quit/relaunch Entourage around the generation of the snapshot, or MS would need to write in ZFS hooks to allow safe suspension of the database first…

  3. Total quiescence wouldn’t strictly be necessary. Entourage simply has to ensure that all changes to the file are transactional (a la an SQLite database file or Berkeley DB).

    If all updates to the ‘PST’ are transactional, and all changes to ZFS are ACID, then the combination would (theoretically) ensure an always-consistent file.

    Who knows if this is actually the case with the file though.

    1. I think though that using a non-monolithic file format, and ensuring that you don’t need cross-file consistency (aka regular email) would be a far more logical design…

        1. Based on the preliminary reports I’m seeing it looks like ZFS has been pulled from Snow Leopard server (perhaps only as a short-term item rather than a permanent decision, we’ll have to wait and see), so I assume it will be even longer before it appears on the non-server version of the OS.
