Direct to Tape is Dead, Long Live Tape

Any regular reader knows that I don’t for a minute believe that tape is dead. However, it is time to address the changing role of tape within the enterprise datacentre, and what we’re going to see in the coming decade.

To start with, let’s examine the traditional role of tape within enterprise backup and recovery. Long term backup users “grew up” with one of the following two backup strategies:

  1. Each server (or critical server) had a tape drive (or drives) directly attached, and wrote data to the media in locally attached drives, or
  2. A central backup server received network backups and pushed them directly out to tape storage locally attached to the backup server.

Over time, as backup and recovery grew up, we saw the first model continually fail until it became almost universally derided as the antithesis of best practices. The second model though, the centralised backup model, has effectively formed the absolute nexus of enterprise backup and recovery best practices.

The effect of the evolution of the centralised backup model has been a continual tug of war between the rate at which the network and clients can deliver data, and the streaming performance characteristics of tape.

I sincerely doubt that this will be the decade that tape will die. However, this is the decade where direct to tape will die. To be perfectly honest, it’s fair to say we exited the noughties with the direct to tape model on life-support.

What’s wrong, specifically, with the direct to tape model? A primary reason is that tape is getting too fast. For a while in the noughties we were in a period where it was relatively straightforward to performance tune a backup environment to keep data streaming reasonably well to tape. This was around the LTO-1 and LTO-2 mark. However, LTO-3 started to cause the edifice to groan, LTO-4 to creak and crumble, and LTO-5 will just finish the job.
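To put some rough numbers on that (a sketch only; the drive speeds below are the commonly quoted approximate native figures for each LTO generation, and the network figure is an optimistic assumption for a single gigabit link feeding the backup server):

```python
# Rough illustration of the streaming problem: approximate native (uncompressed)
# write speeds for each LTO generation, versus an optimistic sustained feed rate
# from a single gigabit Ethernet link into the backup server.
# All figures are approximate and for illustration only.

LTO_NATIVE_MBS = {
    "LTO-1": 15,
    "LTO-2": 40,
    "LTO-3": 80,
    "LTO-4": 120,
    "LTO-5": 140,
}

GIGABIT_FEED_MBS = 110  # assumed realistic sustained throughput of one GbE link

for generation, drive_mbs in LTO_NATIVE_MBS.items():
    shortfall = drive_mbs - GIGABIT_FEED_MBS
    if shortfall <= 0:
        print(f"{generation}: {drive_mbs} MB/s native - a single GbE feed can keep it streaming")
    else:
        print(f"{generation}: {drive_mbs} MB/s native - starved by ~{shortfall} MB/s on a single GbE feed")
```

From around LTO-3 onwards, a single conventional network feed simply cannot keep the drive streaming on its own; the drive ends up waiting on everything upstream of it.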

The rest of the environment quite simply hasn’t kept up with tape. We need high capacity tape for green, long term storage of backups or archives, but getting the data out to it via a multi-pronged delivery system is becoming increasingly difficult. Consider for instance an environment with just 50 machines, a NAS, and a SAN, where 34 of those machines use storage on the SAN, and two machines use storage from the NAS in addition to the NAS presenting storage directly to end users. Four of the SAN connected machines are actually ESX servers, with the remaining 30 being guests. The number of areas where performance tuning comes into play is significant:

  • How many SAN connected machines will be backed up at once?
  • What are the performance characteristics of the SAN under heavy simultaneous read load across all defined LUNs?
  • What are the performance characteristics of the SAN under heavy simultaneous read load across all defined LUNs while doing a RAID-5 reconstruction or undergoing a RAID-5 failure? (etc, etc.)
  • How many hosts on the SAN use wide striping? How many of these will be simultaneously backed up?
  • How many hot spares are there on the SAN?
  • What are the ongoing operational performance requirements of the SAN while heavy simultaneous read is occurring across all defined LUNs?
  • What are the performance characteristics of the SAN when significant spikes of primary production activity occur during a backup and all LUNs are busy with reads, and then key LUNs also become extremely busy with writes?
  • How many machines that are SAN connected will get copy-on-write snapshot backups, and how many will have non-snapshot backups?
  • What are the performance characteristics of the SAN snapshot pools?
  • What’s the impact of doing an NDMP backup of the NAS server as well as hosts using its storage? (Assuming for instance that those two other hosts have iSCSI access.)
  • How many simultaneous NDMP backups does the NAS server support?
  • What are the performance characteristics of the NAS host doing multiple NDMP backups whilst simultaneously supporting primary production access?
  • How many virtualised machines will be backed up at once? How many are likely to be on any one ESX server at any given time?
  • Will VCBs/etc be used for VMware guest backups? (Only for Windows of course. Let’s mess things up and say that 20 of the virtualised systems are running Linux.)
  • Will the tape library share access to the SAN?
  • What’s the speed of the SAN? 2Gb/s? 4Gb/s? This (obviously) significantly impacts throughput when we start talking about high speed tape.
  • For each client in the backup environment, what are the optimum client parallelism settings for the backup? For SAN connected and virtual clients, do these per-client optimum parallelism settings impact other hosts? (It’s like the prisoner’s dilemma.)
  • Then there are all the actual/traditional backup server (/storage node) questions:
    • What’s the base network speed?
    • How many network ports does the backup server have?
    • What are the backplane characteristics of the backup server?
    • What impact will filesystem density make on individual client performance?
    • etc, etc, etc.
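
As a purely illustrative back of the envelope sketch (every number below is an assumption you would have to replace with measurements from your own environment), even a crude model of aggregate feed rate versus drive streaming speed shows how quickly the answers to those questions determine whether the drives stream or shoe-shine:

```python
# Back-of-the-envelope model: can the clients that are backed up concurrently
# deliver enough aggregate throughput to keep the tape drives streaming?
# All figures below are illustrative assumptions, not measurements.

drive_count = 4
drive_stream_mbs = 120  # assumed native streaming speed per drive

# (number of concurrent clients, assumed sustained read rate per client in MB/s)
client_groups = [
    (8, 20),   # SAN connected physical hosts under heavy simultaneous read load
    (20, 8),   # virtualised guests sharing a handful of ESX servers
    (2, 30),   # NDMP streams from the NAS
]

aggregate_feed_mbs = sum(count * rate for count, rate in client_groups)
required_mbs = drive_count * drive_stream_mbs

print(f"Aggregate feed: {aggregate_feed_mbs} MB/s")
print(f"Required      : {required_mbs} MB/s to keep {drive_count} drives streaming")
if aggregate_feed_mbs < required_mbs:
    print("Result: the drives drop below streaming speed and start shoe-shining.")
else:
    print("Result: the drives can stay streaming - until one of the factors above changes.")
```

And of course, every one of the questions in the list above can change those per-client rates on any given night.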

Even in a relatively small environment now, performance tuning the entire environment to focus on one item (e.g., keeping tape streaming) is just completely impractical. The entire environment has to be evaluated in a more holistic way, with a focus on overall performance for primary production, not tape streaming speed.

Of course, that’s not the only issue facing tape in an enterprise environment. Drives are relatively expensive, yet you need as many as possible so you can balance backup and restore objectives. However, media sizes are becoming so large that your chances of needing to read from a tape that you’re still writing to continue to grow with each generation, placing physical roadblocks to backup and recovery performance. Then you’ve got the meta-access times: load times and seek times are relatively poor compared to using disk, meaning that SLAs requiring minimum times between recovery request and recovery commencement can’t readily be met with tape.
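To illustrate the meta-access point with some purely assumed, order of magnitude figures (library robotics, load and locate times vary considerably between hardware):

```python
# Illustrative, order-of-magnitude comparison of how long it takes before a
# recovery can actually begin reading data. All timings are assumptions.

tape_steps_seconds = {
    "robot pick and mount": 30,   # assumed library robotics time
    "drive load and thread": 15,  # assumed time for the drive to load the cartridge
    "locate to the saveset": 60,  # assumed average seek along the tape
}

disk_steps_seconds = {
    "open the file on nearline disk": 0.05,  # effectively instantaneous by comparison
}

print(f"Tape: roughly {sum(tape_steps_seconds.values()):.0f} seconds before the first byte")
print(f"Disk: roughly {sum(disk_steps_seconds.values()):.2f} seconds before the first byte")
# And that assumes a free drive and the tape already in the library; an offsite
# tape or a busy drive adds minutes to hours on top.
```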

In short, we’ve hit the wall when it comes to the direct-to-tape backup model. I’m not the first backup consultant to say this, and I won’t be the last. This isn’t even the first time I’ve said it – I’ve been advising customers for years that they need <disk> inserted between the backup process and the tape, either as a simple buffer (for the smaller environments), or as a high speed/nearline recovery area for the larger environments.

The performance tuning advantages alone of migrating away from direct-to-tape are immense. Instead of worrying about how every single one of those questions above (and probably three times as many more) will affect tape, and having to practically guess on a day to day basis how streaming will be affected, you can focus tape streaming performance on just a few hosts within the environment: the backup server and any additional storage nodes you have. Get those hosts beefed up so that they can stream large chunks of data out to tape. Rather than having to “muscle up” the entire environment, you just have to get the performance and power out of a few select hosts. This can be a huge cost saving, and provides better, more guaranteed streaming speed to tape, since you move from dealing with all the above issues to just one simple question: how fast can you send very, very large chunks of data from the <disk> connected to the backup server/storage nodes to physical tape?
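Conceptually, the remaining tuning problem reduces to a single disk-to-tape pipe. As a very rough sketch (the paths, device name and block size are illustrative assumptions, and a real backup product performs this cloning/staging internally):

```python
# Conceptual sketch only: once backups land on a disk buffer attached to the
# backup server or storage node, feeding tape is a single sequential,
# large-block copy rather than a whole-of-environment tuning exercise.
# The paths, device name and block size are illustrative assumptions.

import shutil

STAGED_SAVESET = "/staging/backups/saveset-0001"  # hypothetical backup already on disk
TAPE_DEVICE = "/dev/nst0"                         # typical non-rewinding tape device on Linux
BLOCK_SIZE = 256 * 1024                           # large, fixed blocks help keep the drive streaming

with open(STAGED_SAVESET, "rb") as source, open(TAPE_DEVICE, "wb") as tape:
    # One sequential read stream, one sequential write stream: no client
    # parallelism, SAN contention, snapshot or NDMP considerations involved.
    shutil.copyfileobj(source, tape, length=BLOCK_SIZE)
```

Tune the staging disk and the HBA on a handful of hosts, and the drives stream; everything else in the environment can be tuned for primary production instead.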

We still need tape. I do not accept the long term reliability of any solution that intends to keep everything on disk (VTL, ADV_FILE, etc) for the entire lifespan of a backup environment. Certainly not as a “blanket rule”, anyway – i.e., if you’re looking at making a broad statement, the broad statement is “tape is still needed” rather than “tape isn’t necessary”. Nothing equals tape when it comes to:

  • Long term recoverability;
  • Media that is guaranteed “offline”, completely immune to viruses and malware;
  • Green credibility; and
  • Cost per GB.

The movement away from the direct to tape model is not actually about “killing tape”; instead it’s about reorienting business practices to suit business requirements, rather than moulding business requirements to suit backup media characteristics. Larger companies will of course look at designing their architecture to eliminate the need for day to day cloning to tape, focusing instead on, say, cloning only monthly backups to tape, with the rest being replicated between multiple datacentres, etc. But that’s not the way it will be for the majority of the enterprise. Regardless, though, of whether you only clone monthly backups and use replication instead, or whether you still do daily cloning, tape stays part of the overall strategy. It just isn’t the primary focus of backup any longer.
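Sketched as a simple policy split (purely illustrative values, not a configuration for any particular product), that larger-company approach might look like this:

```python
# Illustrative sketch of the policy split described above: daily backups stay
# on disk and are replicated between datacentres, while only monthly backups
# are cloned out to physical tape. Retention periods are assumed values.

protection_policy = {
    "daily": {
        "primary copy": "disk on the backup server / storage node",
        "second copy": "replicated to the second datacentre",
        "retention": "weeks",
    },
    "monthly": {
        "primary copy": "disk on the backup server / storage node",
        "second copy": "cloned to physical tape and sent offsite",
        "retention": "years",
    },
}

for schedule, details in protection_policy.items():
    print(f"{schedule}:")
    for key, value in details.items():
        print(f"  {key}: {value}")
```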

This is the decade where we stop worrying about silly terms such as D2D2T and instead work with the changed playing field. The change is that we back up to <disk>, then get copies out to physical tape.

Direct to tape is dead, long live tape.
