Introduction
When choosing to deploy backup to disk by using ADV_FILE devices (instead of say, VTLs), there are some design considerations that you should keep in mind. It’s easy to just go in and start creating devices willy-nilly, but the usual consequence is poor performance and insufficient maintenance windows at some later date.
NetWorker doesn’t care what sort of physical devices (either layout, or connectivity properties) you place your ADV_FILE devices on; for instance, on a lab server of mine I have 3 x 1TB USB2 drives connected, each providing approximately 917GB of formatted disk backup capacity. Now, this is something that I’d not recommend or even contemplate deploying for a production environment – but as I said, it’s a lab server, so my goal is to have copious amounts of space cheaply, not high performance.
There are three layers of design factors you need to take into consideration:
- Physical LUN layout/connectivity
- Presented filesystem types and sizes
- Ongoing maintenance
If you deploy disk backup without thinking about these three factors – without planning them – then at some point you’re going to come a cropper. So, let’s go through these factors.
Physical LUN layout/connectivity
Except in lab environments where you can afford, at any point, to lose all content on disk backup units, you’ll need to have some form of redundancy on the disk backup units. It’s easy for businesses to … resent … having to spend money on redundancy, and I’m afraid that no-one will be able to make a coherent argument to me that it’s appropriate to run production backups to unprotected disk.
Assuming therefore that sanity prevails, and redundancy is designed into the system, care has to be taken to lay out LUNs and connectivity in such a way as to maximise throughput.
Probably the single best metric is this: physical layout and connectivity must allow reads from the disk backup units to exceed the write speed of whatever tape is being written to during cloning, for the requisite number of drives. That is, if your intent is to be able to clone from disk backup to at least 2 x LTO-3 drives simultaneously, your design needs to have a read performance of around 320 MB/s (2 x 160 MB/s, the LTO-3 rate assuming 2:1 compression; the native rate is 80 MB/s per drive). Obviously, the design should also allow for simultaneous writes (i.e., backups) while achieving those cloning objectives.
This need for speed affects both the physical connectivity of disk and the layout of the LUNs presented to the host, and by layout I refer to both RAID level and number of spindles.
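As a rough sketch of that arithmetic, the following Python works out the sustained read rate needed to keep a given number of tape drives streaming. The per-drive figures are the published LTO-3 rates (80 MB/s native, 160 MB/s at 2:1 compression); the function names and the 100 MB/s backup write load in the example are purely illustrative assumptions, not anything NetWorker reports.

```python
# Published LTO-3 streaming rates in MB/s.
LTO3_NATIVE_MBPS = 80
LTO3_COMPRESSED_MBPS = 160  # assumes 2:1 compression


def required_read_mbps(drives, per_drive_mbps):
    """Sustained read rate needed to keep `drives` tape drives streaming."""
    return drives * per_drive_mbps


def required_total_mbps(drives, per_drive_mbps, backup_write_mbps):
    """Read rate for cloning plus a (hypothetical) concurrent backup write load."""
    return required_read_mbps(drives, per_drive_mbps) + backup_write_mbps


if __name__ == "__main__":
    # Two LTO-3 drives at the compressed rate: the 320 MB/s figure from the text.
    print(required_read_mbps(2, LTO3_COMPRESSED_MBPS))
    # Same cloning target with an assumed 100 MB/s of backups arriving at once.
    print(required_total_mbps(2, LTO3_COMPRESSED_MBPS, 100))
```

If your data compresses poorly, the native rate is the more realistic per-drive figure, so it's worth running the numbers both ways.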
Presented filesystem types and sizes
Depending on the operating system being used for the backup host, the actual filesystem type selection may be somewhat limited. For example, on Windows NT based systems, there’s a very strong chance you’ll be using NTFS. (Obviously, Veritas Storage Foundation might be another option.) For Unix style operating systems, there will usually be a few more choices.
Within NetWorker, individual savesets are written as monolithic files to ADV_FILE devices. This means that you don’t need to support say, millions of files on the ADV_FILE devices, but you do need to support large amounts of data.
My first concern therefore is to ensure that the filesystem selected is fast when it comes to a lesser considered activity – checking and error correction following a crash or unexpected reboot. To give you a simplistic example, when considering non-extent based filesystems, making a choice between journalled and non-journalled should be a “no-brainer”. So long as data integrity is not an issue*, you should always ensure that you pick the fastest checking/healing filesystem that also meets operational performance requirements.
Moving on to size, I usually follow the metric that any ADV_FILE device should be large enough to support two copies of the largest saveset that could conceivably be written to it. Obviously, there’ll be exceptions to that rule, and due to various design considerations, this may mean that there are some savesets that you’ll have to consider sending direct to tape (either physical or virtual), but it’s a good starting rule.
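That sizing rule is simple enough to sketch in a few lines of Python. This is only an illustration: the saveset names and sizes are made up, and the function names are hypothetical helpers rather than anything from NetWorker.

```python
def min_device_size_gb(largest_saveset_gb, copies=2):
    """Smallest device that holds `copies` copies of the largest saveset."""
    return largest_saveset_gb * copies


def savesets_needing_tape(saveset_sizes_gb, device_size_gb, copies=2):
    """Savesets too large for the rule; candidates for going direct to tape."""
    return {
        name: size
        for name, size in saveset_sizes_gb.items()
        if size * copies > device_size_gb
    }


if __name__ == "__main__":
    # Hypothetical saveset sizes in GB.
    savesets = {"fileserver:/data": 800, "db01:/oracle": 2500, "mail:/var": 300}
    print(min_device_size_gb(max(savesets.values())))
    # With 2TB devices, only the database saveset breaks the rule.
    print(savesets_needing_tape(savesets, device_size_gb=2000))
```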
You also have to keep in mind the selection criteria used by NetWorker for picking the next volume to be written to. For instance, in standard configurations, it’s a good idea to set “target sessions” on disk backups all to 1. That way, new savesets are distributed as close as possible to round-robin.
However, bear in mind that when all devices are idle, and a new round of backups starts, NetWorker always picks the oldest labelled, non-empty volume to write to first, and works backwards from there. This, unfortunately, is (for want of a better description) a stupid selection criterion for backup to disk. (It’s entirely appropriate for backup to tape.) The implication of this is that your disk backup units will typically “fill” in order of oldest labelled through to most recently labelled, and the first labelled disk backup unit often gets a lot more attention than the other disk backup units. Thus, if you’re going to have disk backup units of differing sizes, try to keep the “oldest” ones the largest, and remember that if you relabel a disk backup unit, it’s going to jump to the back of the queue.
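To see why the oldest unit gets the most attention, here is a toy model in Python. It is emphatically not NetWorker's actual selection code: it simply assumes target sessions of 1, starts each round at the oldest labelled volume, and hands each new saveset to the next device in label order. The `Volume` class, `run_round` and `relabel` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    labelled: int          # label order; lower means labelled earlier
    capacity_gb: float
    used_gb: float = 0.0


def run_round(volumes, saveset_sizes_gb):
    """One backup round: start at the oldest labelled volume, one saveset per device."""
    order = sorted(volumes, key=lambda v: v.labelled)
    for i, size in enumerate(saveset_sizes_gb):
        # Try the i-th oldest volume first, then the following ones if it is full.
        for offset in range(len(order)):
            vol = order[(i + offset) % len(order)]
            if vol.used_gb + size <= vol.capacity_gb:
                vol.used_gb += size
                break


def relabel(volumes, vol):
    """Relabelling empties the volume and sends it to the back of the queue."""
    vol.used_gb = 0.0
    vol.labelled = max(v.labelled for v in volumes) + 1


if __name__ == "__main__":
    vols = [Volume("dbu1", 1, 1000), Volume("dbu2", 2, 1000), Volume("dbu3", 3, 1000)]
    # Three rounds of two 100GB savesets: the newest unit is never touched.
    for _ in range(3):
        run_round(vols, [100, 100])
    print([(v.name, v.used_gb) for v in vols])
```

In this model, whenever a round has fewer savesets than there are devices, the most recently labelled units sit idle, which is the behaviour described above.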
Ultimately, it’s a careful balancing act you have to maintain – if you make your disk backup units too small, some savesets may never fit on them at all, or the units may fill too frequently during backups, requiring staging.
On the other hand, if you make the disk backup units too large, you may find yourself in an unpleasant situation where the owner-host of the disk backup devices takes an unacceptably long period of time checking filesystems when it comes up following particular reboots. This is not something to be taken lightly: consider how a comprehensive and uninterruptible check of a 10TB filesystem on reboot may impact an SLA requiring recovery of Tier-1 data to start within 15 minutes of the request being made!
Not only that, given the serial nature of certain disk backup operations (e.g., cloning or staging), you can’t afford a situation where recoveries can’t run for say, 8 hours, because 10TB of data is being staged or cloned**.
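To put rough numbers on that, a minimal sketch: moving 10TB in 8 hours works out to a sustained rate of roughly 347 MB/s (using decimal units, 1TB = 1,000,000 MB). The helper names below are hypothetical and the rates are just examples.

```python
def hours_to_move(data_tb, rate_mbps):
    """Hours needed to stage or clone `data_tb` terabytes at `rate_mbps` MB/s."""
    seconds = (data_tb * 1_000_000) / rate_mbps  # decimal units: 1TB = 1,000,000 MB
    return seconds / 3600


def rate_needed_mbps(data_tb, hours):
    """Sustained MB/s needed to move `data_tb` terabytes within `hours`."""
    return (data_tb * 1_000_000) / (hours * 3600)


if __name__ == "__main__":
    # 10TB within 8 hours needs roughly 347 MB/s sustained.
    print(round(rate_needed_mbps(10, 8)))
    # The same 10TB at a single LTO-3 native rate of 80 MB/s takes far longer.
    print(round(hours_to_move(10, 80), 1))
```

Splitting the same data across several smaller devices doesn't make the total any quicker to move, but it does shrink the window during which any one device is tied up.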
Thus, for a variety of reasons, it’s quite unwise to design a system with a single, large/monolithic ADV_FILE device. Disk backup volumes should be spread across as many ADV_FILE devices as possible within the hardware configuration.
Ongoing maintenance
For backup systems that need 24×7 availability, there should be one rule here to follow: your design must support at least one disk backup unit being offline at any time.
Such a design allows backup, recovery, cloning and staging operations to continue even in the event of maintenance. These maintenance operations would include, but not be limited to, any of the following:
- Evacuation of disk backup units to replace underlying disks and increase capacity (e.g., replacing 5 x 500GB disks with 5 x 1TB disks, etc.)
- Evacuation of disk backup units to reformat the hosting filesystem to compensate for degraded performance from gradual fragmentation***.
- Large-scale ad-hoc backups outside of the regular backup routine that require additional space.
- Connectivity path failure or even (in a SAN), tray failure.
(In short, if you can’t perform maintenance on your disk backup environment, then it’s not designed correctly.)
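One simple way to sanity-check a proposed layout against that rule is to ask whether the remaining units still provide enough capacity when the largest one is taken away. The sketch below assumes you already know how much disk backup capacity you need (`required_gb` is a figure you'd supply yourself); the function name is hypothetical.

```python
def survives_one_offline(device_sizes_gb, required_gb):
    """True if capacity still meets `required_gb` with the largest unit offline."""
    if len(device_sizes_gb) < 2:
        # A single monolithic device can never be taken offline safely.
        return False
    worst_case_gb = sum(device_sizes_gb) - max(device_sizes_gb)
    return worst_case_gb >= required_gb


if __name__ == "__main__":
    # Four 1TB units with a 3TB requirement: losing any one still leaves 3TB.
    print(survives_one_offline([1000, 1000, 1000, 1000], 3000))
    # One 10TB unit: plenty of space, but no room for maintenance.
    print(survives_one_offline([10000], 5000))
```

Capacity is only part of the picture, of course; the same question needs asking of throughput with one unit missing.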
In summary
It’s possible you’ll look at this list of considerations and want to throw your hands up in defeat thinking that ADV_FILE backups are too difficult. That’s certainly not the point. If anything, it’s quite the opposite – ADV_FILE backups are too easy, in that they allow you to start backing up without having considered any of the above details, and it’s that ease of use that ultimately gets people into trouble.
If planned correctly from the outset however, ADV_FILE devices will serve you well.
–
* Let’s face it – there shouldn’t be any filesystem where you have to question data integrity! However, I’ve occasionally seen some crazy “bleeding edge” designs – e.g., backing up to ext3 on Linux before it was (a) officially released as a stable filesystem or (b) supported by EMC/Legato.
** This is one of the arguments for VTLs within NetWorker – by having lots of small virtual tapes, the chances of a clone or stage operation blocking a recovery are substantially reduced. While I agree this is the case, I also feel it’s an artificial need based on implemented architecture rather than theoretical architecture.
*** The frequency with which this is required will of course greatly depend on the type of filesystem the disk backup units are hosted on.












