Design considerations for ADV_FILE devices

Introduction

When choosing to deploy backup to disk using ADV_FILE devices (instead of, say, VTLs), there are some design considerations that you should keep in mind. It’s easy to just go in and start creating devices willy-nilly, and the usual consequence of that is poor performance and insufficient maintenance windows at some later date.

NetWorker doesn’t care what sort of physical storage (in terms of either layout or connectivity) you place your ADV_FILE devices on. For instance, on a lab server of mine I have 3 x 1TB USB2 drives connected, each providing approximately 917GB of formatted disk backup capacity. Now, this is something that I’d not recommend or even contemplate deploying for a production environment – but as I said, it’s a lab server, so my goal is to have copious amounts of space cheaply, not high performance.

There are three layers of design factors you need to take into consideration:

  • Physical LUN layout/connectivity
  • Presented filesystem types and sizes
  • Ongoing maintenance

If you deploy disk backup without thinking about these three factors – without planning them – then at some point you’re going to come a cropper. So, let’s go through each of them.

Physical LUN layout/connectivity

Except in lab environments where you can afford, at any point, to lose all content on disk backup units, you’ll need to have some form of redundancy on the disk backup units. It’s easy for businesses to … resent … having to spend money on redundancy, and I’m afraid that no-one will be able to make a coherent argument to me that it’s appropriate to run production backups to unprotected disk.

Assuming therefore that sanity prevails, and redundancy is designed into the system, care and consideration must be given to laying out LUNs and connectivity in such a way as to maximise throughput.

Probably the single best metric to design to is this: the physical layout and connectivity must allow reads from the disk backup units to exceed the write performance of whatever tape drives they will be cloned to, for the requisite number of drives. That is, if your intent is to be able to clone from disk backup to at least 2 x LTO-3 drives simultaneously, your design needs to deliver read performance of around 320 MB/s. Obviously, the design should also allow for simultaneous writes (i.e., backups) while achieving those cloning objectives.
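As a rough back-of-the-envelope sketch (the drive count and per-drive rate below are assumptions to adjust for your own hardware):

  # Required clone read rate = simultaneous tape drives x per-drive write rate.
  # LTO-3 is roughly 80 MB/s native, ~160 MB/s at 2:1 compression.
  DRIVES=2
  MB_PER_DRIVE=160
  echo "Required disk backup read rate: $((DRIVES * MB_PER_DRIVE)) MB/s"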

This need for speed affects both the physical connectivity of the disk and the layout of the LUNs presented to the host, and by layout I refer to both RAID level and number of spindles.

Presented filesystem types and sizes

Depending on the operating system being used for the backup host, the actual filesystem type selection may be somewhat limited. For example, on Windows NT-based systems, there’s a very strong chance you’ll be using NTFS. (Obviously, Veritas Storage Foundation might be another option.) For Unix-style operating systems, there will usually be a few more choices.

Within NetWorker, individual savesets are written as monolithic files to ADV_FILE devices. This means you don’t necessarily need a filesystem that supports, say, millions of files on the ADV_FILE devices, but you do need one that handles very large individual files and large amounts of data.

My first concern, therefore, is to ensure that the filesystem selected is fast when it comes to a less frequently considered activity – checking and error correction following a crash or unexpected reboot. To give a simplistic example: when considering non-extent-based filesystems, the choice between journalled and non-journalled should be a no-brainer. So long as data integrity is not an issue*, you should always pick the fastest checking/healing filesystem that also meets operational performance requirements.

Moving on to size, I usually follow the rule that any ADV_FILE device should be large enough to support two copies of the largest saveset that could conceivably be written to it. Obviously, there’ll be exceptions to that rule, and due to various design considerations this may mean there are some savesets you’ll have to consider sending direct to tape (either physical or virtual), but it’s a good starting rule.
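A quick, hedged way of getting a feel for the largest savesets is simply to query the media database (the pool name below is a placeholder for whatever your disk backup pool is called):

  # Review recent full savesets written to the disk backup pool; size each
  # ADV_FILE device to hold at least two copies of the largest totalsize shown.
  mminfo -q "pool=Disk Backup,level=full" -r "client,name,totalsize,savetime"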

You also have to keep in mind the selection criteria NetWorker uses for picking the next volume to write to. For instance, in standard configurations it’s a good idea to set “target sessions” to 1 on all disk backup devices. That way, new savesets achieve as close as possible to a round-robin distribution.
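For example, an nsradmin input file along these lines – applied with nsradmin -i <filename> on or against the backup server – would do it; the device names are placeholders, and resource attribute syntax can vary slightly between NetWorker versions:

  . type: NSR device; name: /d/backup1
  update target sessions: 1
  . type: NSR device; name: /d/backup2
  update target sessions: 1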

However, bear in mind that when all devices are idle and a new round of backups starts, NetWorker always picks the oldest labelled, non-empty volume to write to first, and works through towards the most recently labelled from there. This, unfortunately, is (for want of a better description) a stupid selection criterion for backup to disk. (It’s entirely appropriate for backup to tape.) The implication is that your disk backup units will typically “fill” in order of oldest labelled through to most recently labelled, and the first labelled disk backup unit often gets a lot more attention than the others. Thus, if you’re going to have disk backup units of differing sizes, try to keep the “oldest” ones the largest, and remember that if you relabel a disk backup unit, it jumps to the back of the queue.

Ultimately, it’s a careful balancing act you have to maintain – if you make your disk backup units too small, they may never fit some savesets at all, or may fill too frequently during backups, forcing staging.

On the other hand, if you make the disk backup units too large, you may find yourself in an unpleasant situation where the owner-host of the disk backup devices takes an unacceptably long time checking filesystems when it comes up following particular reboots. This is not something to be taken lightly: consider how a comprehensive and uninterruptible check of a 10TB filesystem on reboot may impact an SLA requiring recovery of Tier-1 data to start within 15 minutes of the request being made!
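One small, practical lever here, if the disk backup units happen to be hosted on Linux ext3/ext4 (an assumption purely for illustration; the device path is a placeholder), is the forced periodic check that can trigger at mount time:

  # Inspect the current forced-check settings on the disk backup filesystem
  tune2fs -l /dev/sdb1 | grep -Ei 'mount count|check'
  # Disable mount-count and interval-based forced checks - weigh this against
  # your tolerance for undetected corruption before doing it in production
  tune2fs -c 0 -i 0 /dev/sdb1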

Not only that: given the serial nature of certain disk backup operations (e.g., cloning or staging), you can’t afford a situation where recoveries can’t run for, say, 8 hours because 10TB of data is being staged or cloned**.

Thus, for a variety of reasons, it’s quite unwise to design a system with a single, large/monolithic ADV_FILE device. Disk backup volumes should be spread across as many ADV_FILE devices as possible within the hardware configuration.

Ongoing maintenance

For backup systems that need 24×7 availability, there should be one rule here to follow: your design must support at least one disk backup unit being offline at any time.

Such a design allows backup, recovery, cloning and staging operations to continue even in the event of maintenance. These maintenance operations would include, but not be limited to, any of the following:

  • Evacuation of disk backup units to replace the underlying disks and increase capacity – e.g., replacing 5 x 500GB disks with 5 x 1TB disks (see the staging sketch after this list).
  • Evacuation of disk backup units to reformat the hosting filesystem, compensating for performance degraded by gradual fragmentation***.
  • Large-scale ad-hoc backups outside of the regular backup routine that require additional space.
  • Connectivity path failure, or even (in a SAN) tray failure.
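To make that evacuation scenario concrete, here’s a hedged sketch of what emptying a single disk backup unit might look like – the volume, pool and device names are placeholders, and the ssids shown are illustrative only:

  # See what's sitting on the volume hosted by the device being evacuated
  mminfo -q "volume=BigDBU.001" -r "ssid,name,totalsize"
  # Stage savesets off to another pool by ssid (placeholder values shown)
  nsrstage -b "Staging Pool" -m -S 4123456789 4123456790
  # Once the volume is empty, unmount the device so the disks can be replaced
  nsrmm -u -f /d/backup1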

(In short, if you can’t perform maintenance on your disk backup environment, then it’s not designed correctly.)

In summary

It’s possible you’ll look at this list of considerations and want to throw your hands up in defeat, thinking that ADV_FILE backups are too difficult. That’s certainly not the point. If anything, it’s quite the opposite – ADV_FILE backups are too easy, in that they allow you to start backing up without having considered any of the above details, and it’s that ease of use that ultimately gets people into trouble.

If planned correctly from the outset however, ADV_FILE devices will serve you well.


* Let’s face it – there shouldn’t be any filesystem where you have to question data integrity! However, I’ve occasionally seen some crazy “bleeding edge” designs – e.g., backing up to ext3 on Linux before it was (a) officially released as a stable filesystem or (b) supported by EMC/Legato.

** This is one of the arguments for VTLs within NetWorker – by having lots of small virtual tapes, the chances of a clone or stage operation blocking a recovery is substantially reduced. While I agree this is the case, I also feel it’s an artificial need based on implemented architecture rather than theoretical architecture.

*** The frequency with which this is required will of course greatly depend on the type of filesystem the disk backup units are hosted on.

12 thoughts on “Design considerations for ADV_FILE devices”

    1. That’s going to be entirely site-dependent, with the maximum limiting factor being the amount you can afford to spend on disk backup.

      The other factors that come into play are how many days/weeks/months’ worth of backups you want to keep online on the disk backup units, and how frequently you do full backups, etc. Or more accurately: how much data do you want to keep online, and for how long?

      Say you back up 5TB of data over the course of a week. If you want to keep two weeks’ worth of data online, on disk, you’ll need more than 10TB of disk, but you’ll likely need less than 15TB. The way I tend to calculate it is to first look at the amount of time I want to keep backups on disk and the size of those backups, then look at my largest savesets.

      Typically you don’t want all your disk backup/adv_file space consolidated into a single unit. One general rule of thumb: for any storage node with disk backup, try to have at least one disk backup unit per tape drive that you might want to clone or stage to. It’s generally a good idea NOT to have just one, because of the limits on simultaneous cloning/staging, etc.

      So what I look at is the size of the biggest saveset for any backup that goes to disk; then, for the n full backups of that saveset I may want to keep on disk, I try to make sure that each disk backup unit is at least (n+1) * size_of_saveset.
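      As a quick worked sketch of that rule (the figures below are assumptions only):

        # n retained fulls of the largest saveset that goes to disk
        N=3
        SAVESET_GB=800
        echo "Minimum disk backup unit size: $(( (N + 1) * SAVESET_GB )) GB"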

      This rule doesn’t apply all the time – particularly if you have one or two comparatively massive savesets, so ultimately there’s some tweaking required of such sample rules to suit each environment.

      I hope however that’s helped.

  1. Hi Preston,
    Thanks for the great info on this site. I have a question on selecting the number of adv_file devices to use. You mention having as many as possible to reduce the risk of a clone job tying it up when you need to do a recovery, but wouldn’t the _AF_readonly device that is automatically created allow a recovery to be done from that device at the same time as clones were running? You couldn’t backup, recover and clone all at the same time, but you could do two of the three, couldn’t you?
    Thanks,
    Chris

    1. Hi Chris,

      Thanks for the feedback. To my knowledge, it’s still not possible to simultaneously recover from and clone from an ADV_FILE type device — unless you explicitly, manually either clone from or recover from the read-write device while the other operation runs.

      I just ran a test to confirm – kicked off a backup to a specific ADV_FILE device, waited for that to complete, started a recovery session and browsed to select files for recovery, then started a clone. Once the clone had started reading, I initiated the recovery, and it sat there continually reporting “waiting 30 seconds then retrying”. (Once the clone finished, the recovery proceeded. Even deliberately picking a different saveset to clone than the saveset being recovered resulted in the same issue – not that it should make a difference anyway.)

      Until we can 100% certainly backup+clone+stage+recover using the same ADV_FILE device as source/target, keeping multiple devices (and volumes) available to spread the access load wherever possible remains a strong design recommendation.

      Cheers,

      Preston.

  2. I believe you can perform a simultaneous restore & clone – at least from the command line, anyway. I have not tried that before, but I have tried 2 clones running simultaneously off the same adv_file device.

    As you know, an adv_file device consists of 2 components: the read-only and the read-write component. The savesets on each component are addressable by their ssid/cloneid.

    The trick is to find the cloneid, which basically can only be obtained by running the mminfo command:

    for example: mminfo -r ssid,cloneid,client,name

    will show you the saveset ID, with the clone ID in the second field.

    Knowing the cloneids of the savesets on the adv_file device will enable you to perform both the restore & the clone.
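    For instance (a hypothetical invocation – the ssid/cloneid values and pool name are placeholders):

      # Clone one saveset instance, addressed by ssid/cloneid ...
      nsrclone -b "Clone Pool" -S 4123456789/1259879000
      # ... while separately running a saveset recovery of a different saveset
      recover -S 4123456790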

    I hope it helps

    1. This is true; however, my point is that simultaneous clones, or clones + recoveries, are not accessible to the average user because of the need to isolate individual SSID/CloneID combinations. Further, that doesn’t assist in situations where the saveset you want to recover from is only an incremental – i.e., you are currently cloning, and you want to do a filesystem recovery that will roll in other savesets plus one or more savesets on the disk backup unit you are currently cloning from. Yes, you can do individual saveset recovery by instance, but that’s not how a filesystem-level recovery from multiple days/levels works.

  3. Hey Preston… Thanks for the great info.

    I’ve always steered away from having multiple adv_file type devices on a single LUN, I guess assuming that it would then cause random I/O rather than sequential. It sounds like you are suggesting the opposite. Can you clarify? Have you tested and noticed any throughput differences on, say, a physical device with 1 adv_file versus one with 3?

    Thanks,

    Joel

    1. Hi Joel,

      I don’t believe it was my intent to suggest having multiple ADV_FILE devices on a single LUN; instead, the LUNs should each be presented (preferably one LUN per ADV_FILE device) with sufficient spindles and RAID to not only offer protection against individual disk failure, but also to maximise performance.

      Effectively, the read performance of each LUN/ADV_FILE device should be optimised to drive whatever tape you’re using to write backups out to – e.g., for LTO-3 tapes, you’d want your ADV_FILE device to be able to stream out at, say, 160MB/s (80MB/s native LTO-3 performance, 160MB/s compressed).

      Having multiple ADV_FILE devices on a single LUN/filesystem is something that it’s best to avoid these days.

      I hope this helped to clarify.

      Cheers,

      Preston.

  4. Gotcha…

    I totally agree… unfortunately I allowed space efficiency to overrule performance when I was making my adv_file luns. I created one really big raid6 volume and put 2 luns on it rather than 2 smaller raid 6 volumes. A mistake I won’t make next go around.

    Thanks,

    Joel

  5. Hi Preston,

    I have implemented ADFT for a NetWorker 7.5 SP1 server running on Windows. I have one LUN from the SAN presented to the NetWorker server as a drive, have created multiple folders (one for each group), and am simultaneously backing up different clients to that ADFT.

    That is something that you don’t recommend, isn’t it?

    So should I create separate LUNs in the SAN, present each individual LUN to the NetWorker server, and back up each group to its own LUN?

    Tejas

    1. I typically would not recommend having all backup to disk storage presented as a single large “bucket” and formatted for use that way.

      While it does have some advantages (mainly to do with handling larger savesets, etc.), the primary disadvantage is that you’ve only got one stream from which you can stage or clone at any one time – making administrative and “meta” operations more costly. And if your single disk backup unit fills, you hang all your backups until you reclaim enough space, rather than just the sessions that were writing to one device amongst many.

      As to the breakdown of the presented LUNs/advanced file type devices, that will primarily depend on your site requirements. As an example, if I had 6TB of (formatted/usable) space available to me, and my largest saveset was 1TB, I’d at least break down that 6TB into 3 x 2TB disk backup units. The advantage of that is that I can then clone or stage 3 sessions at once rather than just one, and there’s less interruption (potentially) if a disk backup unit fills.
