On Linux, filesystems typically have two settings that control when a complete check is forced at boot. These are:
- Maximum number of mounts before a check
- Interval between checks
The default settings, while reasonably suitable for smaller partitions, are very unsuitable for large partitions such as those found in disk backup units (DBUs). In fact, if you don’t pay particular attention to these settings, you may find after a routine reboot that your backup server (or storage node) takes hours to become available. For instance, it’s not unheard of for even sub-20TB DBU environments (say, 10 x 2TB filesystems) to take several hours to complete mandatory filesystem checks after what should have been a routine reboot.
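To see where a given filesystem currently stands, both settings can be read from its superblock. The following is a minimal Python sketch, assuming ext2/ext3/ext4 filesystems managed with tune2fs. The sample listing, the device name /dev/sdb1 and the function names are illustrative assumptions, not output from a real system.

```python
import subprocess

# Illustrative excerpt in the style of `tune2fs -l` output (values are made up).
SAMPLE_LISTING = """\
Mount count:              12
Maximum mount count:      30
Last checked:             Mon Mar  1 10:00:00 2010
Check interval:           15552000 (6 months)
Next check after:         Sat Aug 28 10:00:00 2010
"""

CHECK_FIELDS = {
    "Mount count",
    "Maximum mount count",
    "Last checked",
    "Check interval",
    "Next check after",
}


def read_listing(device):
    """Run `tune2fs -l` on a device (needs root) and return its text output."""
    result = subprocess.run(
        ["tune2fs", "-l", device],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_check_settings(listing):
    """Pick the check-related fields out of a `tune2fs -l` style listing."""
    fields = {}
    for line in listing.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() in CHECK_FIELDS:
            fields[key.strip()] = value.strip()
    return fields


def mounts_until_check(fields):
    """Mounts left before a forced check, or None if the mount-count check is off."""
    maximum = int(fields["Maximum mount count"])
    if maximum <= 0:
        return None
    return max(0, maximum - int(fields["Mount count"]))
```

On a live server you would pass the result of read_listing("/dev/sdb1") to parse_check_settings() rather than the sample text; run that for every DBU filesystem and you can see at a glance how many of them are close to a forced check.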
There are two approaches that you can take to this:
- If you want to leave the checks enabled, it’s important to ensure that at most one disk backup unit filesystem will be checked at a time after a reboot; this will at least reduce the time taken by any check-on-reboot. Thus, ensure you:
  - Configure each filesystem with a different maximum number of mounts before a check than any other filesystem, and,
  - Configure the interval (days) between checks for each filesystem to be a significantly different number.
- If you don’t want periodic filesystem checks to ever interfere with the reboot process, you need to:
  - Ensure that, following a non-graceful restart of the server, the DBU filesystems are unmounted and checked before any new backup or recovery activities are done, and,
  - Ensure that there are processes (planned maintenance windows, if you will) for manually running the filesystem checks that are being skipped.
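For the first approach, the staggering can be worked out mechanically rather than by hand. This is a rough Python sketch, again assuming ext2/ext3/ext4 filesystems tuned with tune2fs (its -c option sets the maximum mount count and -i the interval between checks). The device names, base values and step sizes are hypothetical and should be chosen to suit your own environment.

```python
def staggered_tune2fs_commands(devices, base_mounts=20, mount_step=3,
                               base_days=90, day_step=14):
    """Build one tune2fs command per device, each with different check settings.

    The base and step values are illustrative assumptions, not recommendations.
    """
    commands = []
    for position, device in enumerate(devices):
        max_mounts = base_mounts + position * mount_step
        interval_days = base_days + position * day_step
        commands.append([
            "tune2fs",
            "-c", str(max_mounts),        # maximum mounts before a check
            "-i", f"{interval_days}d",    # days between checks
            device,
        ])
    return commands


# Hypothetical DBU devices; print the commands for review instead of running them.
for command in staggered_tune2fs_commands(["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]):
    print(" ".join(command))
```

The sketch only prints the commands so they can be reviewed and run as root during a maintenance window. Staggering reduces the chance of overlapping checks but does not eliminate it: filesystems that are always mounted together will eventually reach their limits on the same reboot.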
Neither option is particularly “attractive”. In the first case, if you cherish uptime or don’t need to reboot your backup server often, you can still get into a situation where multiple filesystems need to be checked on the same reboot because they’ve all exceeded their days-between-checks parameter. In the second, you’re having to insert human-driven processes into what should normally be a routine operating system function. With the manual option in particular, there must be a process in place for shutting down NetWorker and checking the filesystems, even in the middle of the night, if an OS crash occurs.
Actually, the above list is a little limited. There are a couple of other options that you can consider as well, though they’re a little more left of field:
- Build into the change control process the timings for complete filesystem checks in case they happen, or
- Build into the change control process or reboot procedure for the backup server/storage nodes the requirement to temporarily disable filesystem checks (using, say, tune2fs) so that you know the planned reboot won’t be costly in terms of time.
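The temporary-disable option might look something like the following Python sketch. It assumes ext2/ext3/ext4 filesystems, where passing 0 to the -c and -i options of tune2fs switches off the mount-count and time-based checks respectively. The device names and saved values are hypothetical; in practice you would record the real settings before disabling anything.

```python
def disable_check_commands(devices):
    """Commands that turn off both the mount-count and the time-based check."""
    return [["tune2fs", "-c", "0", "-i", "0", device] for device in devices]


def restore_check_commands(saved_settings):
    """Commands that put back the settings recorded before the reboot.

    saved_settings maps device -> (max_mounts, interval_days).
    """
    return [
        ["tune2fs", "-c", str(max_mounts), "-i", f"{interval_days}d", device]
        for device, (max_mounts, interval_days) in saved_settings.items()
    ]


# Hypothetical settings recorded before the change window.
saved = {"/dev/sdb1": (20, 90), "/dev/sdc1": (23, 104)}

print("Before the reboot:")
for command in disable_check_commands(saved):
    print(" ", " ".join(command))

print("After the reboot:")
for command in restore_check_commands(saved):
    print(" ", " ".join(command))
```

Keeping the restore step in the same change record as the disable step makes it harder to forget, which matters because a filesystem left with its checks disabled will never be checked automatically again.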
Personally, I’m looking forward to btrfs; a modern filesystem such as that should solve most, if not all, of the problems discussed above.