When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash in hand jobs, a business that doesn’t rely on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet there are still improvements that can be made to data protection attitudes within many organisations, and I wanted to outline the things that are still commonly done incorrectly in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that never gets claimed on.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet, in practically no organisation will administrators of those specific systems get such free rein over the data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I’ve seen made in backup systems planning is assuming that what is bought today for backup will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
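As a rough illustration of why a straight line isn’t enough, here’s a minimal Python sketch of a capacity projection that combines a compound annual growth rate with one-off changes from new projects and decommissioning. The function name, the 25% growth rate and the step changes are all assumptions made up for the example, not figures from any real environment.

```python
from typing import Dict, List, Optional


def project_capacity_tb(
    current_tb: float,
    annual_growth_rate: float,
    years: int,
    step_changes_tb: Optional[Dict[int, float]] = None,
) -> List[float]:
    """Project protected capacity year by year.

    step_changes_tb maps a year number (1-based) to a one-off change in TB:
    positive for a new project, negative for decommissioned systems.
    """
    step_changes_tb = step_changes_tb or {}
    projection = []
    capacity = current_tb
    for year in range(1, years + 1):
        # Organic growth first, then any known spike or trough for that year.
        capacity = capacity * (1 + annual_growth_rate) + step_changes_tb.get(year, 0.0)
        projection.append(round(capacity, 1))
    return projection


# Hypothetical numbers: 100TB today, 25% organic growth,
# a 40TB project landing in year 2 and 15TB decommissioned in year 3.
linear_guess = project_capacity_tb(100, 0.25, 3)
with_events = project_capacity_tb(100, 0.25, 3, {2: 40, 3: -15})
print("organic growth only:", linear_guess)
print("with known events:  ", with_events)
```

The point of keeping it this simple is that it’s cheap to re-run with revised numbers every time the forecast is reassessed.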

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.
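To make the “capture, resolve, document” cycle concrete, here’s a minimal Python sketch of a failure register. It’s a hypothetical illustration only: the class names, the host name and the error text are invented, and a real implementation would sit on top of whatever reporting your backup product provides.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class BackupFailure:
    client: str
    occurred: date
    error: str
    resolution: Optional[str] = None  # stays None until the fix is documented


class FailureRegister:
    """Hypothetical register: every failure is recorded and stays open until resolved."""

    def __init__(self) -> None:
        self._failures: List[BackupFailure] = []

    def record(self, client: str, occurred: date, error: str) -> BackupFailure:
        failure = BackupFailure(client, occurred, error)
        self._failures.append(failure)
        return failure

    def resolve(self, failure: BackupFailure, resolution: str) -> None:
        # A resolution with no documentation doesn't count as resolved.
        if not resolution.strip():
            raise ValueError("a resolution must be documented")
        failure.resolution = resolution

    def outstanding(self) -> List[BackupFailure]:
        return [f for f in self._failures if f.resolution is None]


register = FailureRegister()
failed = register.record("mailserver01", date(2012, 1, 9), "drive write error")
print(len(register.outstanding()), "failure(s) still open")
register.resolve(failed, "Drive replaced; backup re-run successfully")
print(len(register.outstanding()), "failure(s) still open")
```

Resolved entries are kept rather than deleted, so the register doubles as the catalogue for future reference.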

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies. Their processes without a doubt play into disaster recovery planning, and play a fairly important part, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – where staff would not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse; in fact, it makes formal policies all the more necessary.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also continuance and alignment. If this isn’t the case yet, initial steps towards that shift can be achieved by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.

 

I want to spend a few minutes discussing something that drives me nuts. It’s something I see quite regularly on technical websites that discuss data protection, and it’s about time I make my opinion clear on it.

The latest instance comes from an article at SearchStorage called “How tiering can improve your backup strategies”. Marc Staimer wrote:

In one example, all data is commonly backed up once a day, put on tape, then shipped offsite. This methodology means that the RPO is 24 hours, and the RTO is a few days or longer. This is not a good idea for an organization’s mission-critical data. First, the process in recovering the data takes much too long, bringing all of the correct tapes back from offsite, and then recovering them in order, (which is subject to common human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion. Second, it assumes all data on all tapes are recoverable. In the end, both introduce unacceptable risks to mission-critical data.

Now, I’m not going to dispute the fact that daily backups to tape can give RPOs of 24 hours or more, and can result in RTOs of more than 24 hours. However, I don’t agree that an RPO of 24 hours is always the case, and I certainly don’t agree that an RTO of 24 hours (or more) is a 100% inevitability. Instead, I want to spend some time picking apart the rest of this junk statement.

Let’s first consider:

[T]he process in recovering the data takes much too long, bringing back all of the correct tapes from offsite, and then recovering them in order, (which is subject to human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion.

This would be true if we were using archaic backup scripts (perhaps in a completely decentralised environment) with no automation. On the other hand, if you’re using decent, enterprise backup software there are absolutely no reasons why this should be the case. Enterprise class backup software will:


  • Identify which media is required for a recovery.
  • Read only from the media required for a recovery.
  • Seek to positions as close as possible to the recovery point so as to avoid reading redundant data.

If we look at NetWorker for instance, we know it’s no slouch when it comes to seeking to the right spot on media for rapid single-file recovery. Between file records and media record markers, NetWorker can very quickly direct a tape drive to seek to the optimum location to commence recovery.

So my first thought is – if that’s the sort of experience that Marc Staimer has with tape based backup and recovery systems, he’s using the wrong ones, and shouldn’t blame that on tape.

Now let’s cover the second point:

[I]t assumes all data on all tapes are recoverable.

This can only be interpreted to mean one thing: the old “tape is unreliable” mantra. If tape were half as unreliable as every second article on tape makes it out to be, there wouldn’t be a single tape vendor left in the market – they’d have all been sued out of business for deceptive trading and terribly unreliable products.

I’m not claiming that tape is fault free – if I did, I’d have a heck of a lot less cause to do the Ballmer Monkey Dance shouting “Cloning! Cloning! Cloning!” than I do. Tapes aren’t infallible, but I’ve not seen a single published paper citing extreme fault rates of enterprise class media*. On a yearly basis, the number of cases I see at customer sites of tape failure could be counted on a butcher’s right hand**. And you know what? Those instances are almost always at the backup point, not the recovery point.

So where does this leave us? At FUD central.

I’m the first to admit that the role of tape is changing within backup environments – I stated my thoughts on this previously in the article “Direct to Tape is Dead, Long Live Tape”, and I stand by this; so any overall discussion about backup media tiering with a model along the lines of disk->disk->tape or disk->vtl->tape will be the sort of thing I’ll usually heartily agree with.

If someone can point out independent studies showing high tape failure rates for enterprise class tapes – I’d like to know. Until then, let’s talk about valid, non-FUD reasons for pulling tape out of the immediate backup path. These include (but are not limited to):


  • Inability of most environments to stream tape.
  • SLAs requiring faster recovery starts, which in turn necessitate recovery from disk.
  • To allow for more streamlined backup cloning operations.
  • To support target deduplication for nearline backup storage.

Tape “unreliability” is not in that list. Maybe it is in limited environments that are currently using non-enterprise tape.

* On the other hand, the easiest way of storing DAT media after generating your backup is to throw it into the bin. I might trust a DAT with a backup a little more than I’d trust a monkey with a pen to take notes in a court case, but not by much.

** I’m talking an old-style butcher. Before they had to start wearing chain mail gloves.

 

A common mistake I see people make when planning VTL implementations is to aim to keep virtual media of a similar size to physical media they intend to stage/clone out to. For example, if planning to first backup to a VTL, then transfer out to LTO-4, a lot of people start planning around having virtual tapes in the order of 500GB to 1TB. This is not the way a VTL should be utilised, and instead of solving backup problems, it’ll just continue them into your new virtualised environment.

Logically, this seems to make sense.

Practically, it makes about as much sense as trying to build a power plant based on hamster wheels.

Let’s think about some of the issues that we have with tape, any tape (be it physical or virtual), that we don’t get with disk backup volumes (i.e., volumes on ADV_FILE type devices):

  • You can’t simultaneously backup to, and recover from the volume.
  • You can’t simultaneously backup to, and clone from the volume.
  • You can’t simultaneously backup to, and stage from the volume.

Now add on top of that a problem you also get with ADV_FILE devices:

  • You can’t simultaneously clone from, and stage from the volume.

There are potentially more disadvantages when comparing physical/virtual tape to ADV_FILE hosted volumes, but I’m being generous and for the most part, they’re all variants on those above themes anyway.

Now, we deploy VTLs for a few very specific reasons:

  1. Unlike physical tapes, virtual tapes don’t suffer shoe-shining.
  2. If a “proper” VTL, the underlying filesystem (that you don’t get to see) should be appropriately designed to better maximise performance for storing a few very large files.
  3. Faster backup/recovery starts through almost-zero second load times.
  4. More flexible drive configuration.
  5. Better interface for dynamic drive sharing.
  6. Faster recovery times, both from load speed and seek times.

None of these specific reasons should be hindered in any way by having very large virtual media sizes. However, when we look at the advantages of ADV_FILE hosted volumes over virtual or physical volumes, we can see that having virtual media the same size as physical media will simply continue those differences. If you are writing to a 500GB virtual tape, and need to use it for recovery, you still need to wait until NetWorker has finished filling the volume as you would on a 500GB physical tape.

But if your virtual tapes are just 50GB, by comparison, your wait time is considerably reduced.

Let’s do the basic maths. We’ll assume we’ve got two virtual tapes, one 500GB, one 50GB, and both of them had previously been used to backup 5GB. We have just started to do a new backup, but after that backup starts, someone needs to recover from that initial, 5GB backup.

If we’re writing at 50MB/s to the virtual tapes, we can do some pretty basic calculations about how long we’ll have to wait before we get a media change, and therefore can get access to the virtual tape for recovery.

  • For a 500GB virtual tape, it means needing to fill 495GB at 50MB/s – that’s around 2.8 hours.
  • For a 50GB virtual tape, it means needing to fill 45GB at 50MB/s – around a quarter of an hour.
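For anyone who wants to plug in their own numbers, the arithmetic behind those two figures can be sketched in a few lines of Python. The function name is just for illustration, and I’m assuming 1GB = 1024MB and a steady write speed, which real backups won’t quite deliver.

```python
def hours_until_full(capacity_gb: float, used_gb: float, write_mb_per_s: float) -> float:
    """Hours of continuous writing before the volume fills and a media change frees it."""
    remaining_mb = (capacity_gb - used_gb) * 1024  # assumes 1GB = 1024MB
    return remaining_mb / write_mb_per_s / 3600


# The scenario from the text: 5GB already on each virtual tape, writing at 50MB/s.
for capacity_gb in (500, 50):
    wait_hours = hours_until_full(capacity_gb, used_gb=5, write_mb_per_s=50)
    print(f"{capacity_gb}GB virtual tape: {wait_hours:.2f} hours ({wait_hours * 60:.0f} minutes)")
```

Run as-is, this gives a wait of a little over 2.8 hours for the 500GB virtual tape and roughly 15 minutes for the 50GB one.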

That is the absolute crux of why you design your VTLs to have small media – so that you can at least somewhat address the issues caused by virtualising the bad aspects of tape as well, i.e., being unable to simultaneously backup to and recover from the virtual media.

There’s a good chance most recoveries (except the most important ones) will be able to remain queued for a quarter of an hour waiting for media. On the flip side, only the least important recoveries can normally be queued for almost 3 hours before commencing.

Those time-to-fill advantages extend into cloning operations as well. If you do the right thing, you’re backing up, then you’re cloning. However, normally you’ll run multiple groups, which means some clones may start while other backups are still running. If, again, you’re using very large pieces of virtual media, the chances are significantly higher that a still-running backup operation from another group will block read access to virtual media from a previously completed group. Again, would you rather your cloning operation be blocked for 3 hours waiting for media, or a quarter of an hour?

I’d actually argue that aside from buying cheap, low performance disks and expecting high performance out of them in a primitive software VTL configuration, the number one worst design mistake you could make with a VTL would be to use virtual media sizes that are too large. If they’re even a quarter the size of current generation physical media, they’re way too large. When planning on cloning out to LTO-4 media, I’d still recommend virtual media sizes of 50GB preferably, or 100GB maximum.

Ultimately, that quarter of an hour may be your best sizing comparison. Work out how much data your VTL can write to a single piece of virtual media within a quarter of an hour, and keep your virtual media size within 10% of that number.

Anything less and you’ll likely strip away most, if not all, of the advantages you would have got from deploying a virtual tape library.
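The quarter-of-an-hour sizing rule above can be turned into a quick calculation. This is a minimal Python sketch under the same assumptions as the earlier maths (steady write speed, 1GB = 1024MB); the function name and the default values are mine, chosen to mirror the 15 minute window and 10% margin from the text.

```python
from typing import Tuple


def suggested_virtual_tape_gb(
    write_mb_per_s: float,
    window_minutes: float = 15.0,
    tolerance: float = 0.10,
) -> Tuple[float, float, float]:
    """Return (low, target, high) virtual media sizes in GB.

    target is how much can be written to one virtual tape within the window;
    low and high are the bounds for staying within the tolerance of it.
    """
    target_gb = write_mb_per_s * window_minutes * 60 / 1024  # assumes 1GB = 1024MB
    return target_gb * (1 - tolerance), target_gb, target_gb * (1 + tolerance)


low, target, high = suggested_virtual_tape_gb(write_mb_per_s=50)
print(f"target about {target:.0f}GB, acceptable range {low:.0f}GB to {high:.0f}GB")
```

At 50MB/s that lands in the mid-40GB range, which is why 50GB virtual media is a comfortable fit for the example above.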

 

Despite recent claims that LTO-5 is at risk of being a dead format due to Imation being the first vendor to sign on for it, over the last week there have been stories everywhere about SpectraLogic announcing a pre-purchase program for their LTO-5 offerings. SpectraLogic’s programme is intended to allow companies to continue to purchase LTO-4 drives and replace them with LTO-5 when they become available.

Given that SpectraLogic is in the library business rather than the tape drive manufacturing business, the most important part of this announcement is that one of the key drive manufacturers is preparing to commence production. (SpectraLogic has apparently had a history of sourcing drives from IBM, so it wouldn’t be surprising if IBM drives were the ones being sourced here.)

I think it’s fair to say that LTO-4 still has some legs left in it – thus, I’m not surprised that the LTO-5 take up is building more slowly than previous generation formats. That shouldn’t be seen as a negative towards the format – just a sign of continuing maturity in the industry.

 

Many years ago, a company switched from ArcServe to NetWorker. They did so around the time they made their end of year backups, the ones that they intended to keep ‘forever’ for legal requirements.

Fast-forward several years, and a request came in to recover Lotus Notes backups from those original end of year archives. That’s when the support call came through. You see, those end of year archives were done on a standalone tape drive, not a tape library, and both tapes had, say, ‘YEAR2002’ written on the label. There was a little “1” noted on the first label, and a little “2” noted on the second label. For convenience, we’ll call them the first and second tapes.

When they put the first tape into the library for recovery, their first issue was getting NetWorker to mount the tape, since it didn’t have a barcode. Some non-GUI commands later, the tape was in the drive, but NetWorker wouldn’t keep the tape mounted – every time they tried to mount the tape, NetWorker threw up an error saying that it was expecting tape YEAR2002 with a particular volume ID, not YEAR2002 with a different volume ID that wasn’t in the media database. The second YEAR2002 tape would mount though, but NetWorker couldn’t perform a recovery because not all of the media was available.

So, here’s what happened:

  • The manual backup was run of a bunch of systems and Lotus Notes.
  • A tape was labelled YEAR2002 within NetWorker, and the backup ran until the tape filled up.
  • A new tape was put into the tape drive, and since they had no exposure to NetWorker, they labelled that tape as YEAR2002 as well and the backup went on its way.

I’ll qualify here – the Lotus Notes backup was done using the module.

Now here’s the thing – while NetWorker works on the volume ID being unique, it also works on the volume label being unique as well. It won’t support two volumes in the media database at the same time with the same label. It gets pretty strident about that if you try to label one tape with another tape’s label, but I guess if you’re new to NetWorker it might just seem like there’s a bunch of confirmation boxes you have to click before you can label your next tape.

So the net result was that the backup was written to two pieces of media that couldn’t co-exist in the media database at the same time. Scanning the first necessitates removing the second from the media database, and because this isn’t a filesystem backup, there are limitations that couldn’t be stepped around in recovering from partial savesets.
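To picture why the two tapes couldn’t co-exist, here’s a toy Python model of a media database that insists on both unique labels and unique volume IDs. To be clear, this is not how NetWorker is implemented: the class, its methods and the volume ID strings are all hypothetical, and it only illustrates the constraint described above.

```python
from typing import Dict, List


class DuplicateLabelError(Exception):
    """Raised when a label is already in use by another volume."""


class ToyMediaDatabase:
    """Hypothetical model: one entry per label, and each volume ID used only once."""

    def __init__(self) -> None:
        self._volume_id_by_label: Dict[str, str] = {}

    def add_volume(self, label: str, volume_id: str) -> None:
        if label in self._volume_id_by_label:
            existing = self._volume_id_by_label[label]
            raise DuplicateLabelError(f"label {label!r} already belongs to volume {existing}")
        if volume_id in self._volume_id_by_label.values():
            raise ValueError(f"volume ID {volume_id!r} is already registered")
        self._volume_id_by_label[label] = volume_id

    def remove_volume(self, label: str) -> None:
        del self._volume_id_by_label[label]

    def volume_id_for(self, label: str) -> str:
        return self._volume_id_by_label[label]

    def labels(self) -> List[str]:
        return sorted(self._volume_id_by_label)


db = ToyMediaDatabase()
db.add_volume("YEAR2002", "volid-tape-2")      # the second tape is what the database knows
try:
    db.add_volume("YEAR2002", "volid-tape-1")  # scanning in the first tape is refused
except DuplicateLabelError as err:
    print(f"rejected: {err}")

# The only way to get the first tape in is to drop the second one first.
db.remove_volume("YEAR2002")
db.add_volume("YEAR2002", "volid-tape-1")
print(db.labels(), db.volume_id_for("YEAR2002"))
```

In this model, as in the story, you can have either tape known to the database, but never both halves of the backup at once.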

For a regular filesystem backup this still would not be impossible to recover from as a last resort – using scanner and uasm you can still suck the data off the tape(s) without NetWorker needing both in the media database. Tedious, and not as good as just being able to select data in a recovery program, but it’s better than no recovery at all. But you can’t use scanner and uasm for a non-filesystem recovery.

(You also can’t write a new tape label to a fresh tape, then dd the NetWorker data after the label on the other tape onto the newly labelled tape. The volume ID (or some other unique volume identification system) is written into the savestream, and transferring that savestream onto another volume sees NetWorker reject it if you subsequently attempt to scan it.)

Net result? Data that could not be recovered short of sending it off to a specialist forensics data recovery company.

NetWorker’s fault? No. There is, after all, only so much that software can do in order to prevent you from shooting yourself in the foot.

 

or, not all LTO media is created equal.

There’s an assumption that because LTO is a standard shared by multiple vendors, then any Ultrium media can be used in any Ultrium drives. (NB: Of course I’m referring here to the same version – i.e., version 4 media in a version 4 drive, or version 3 media in a version 3 drive, etc.*)

While technically this should be true, in practice it usually isn’t. I don’t wish to name vendors here, but suffice to say that I’ve had real-world experience, both in implementation and support scenarios, where tape drives have come from vendor A, but media was purchased from vendor B due to cheaper prices, and there’s been no end of “fun”. (That’s for very small values of “fun”, as a one-time colleague of mine used to say.)

When this has happened it’s usually manifested in one of a few different ways:

  • Excessive numbers of media failures – e.g., hard errors.
  • High numbers of tapes filling before they should – e.g., a 400GB tape filling at 300GB, 250GB, etc.
  • Significant slow-downs accompanied by SCSI warnings.

In such cases, after all other possibilities have been ruled out – hardware, software, firmware, operational handling, etc. – these sorts of problems have been resolved by changing media. I should note that in such situations, I’ve had customers actually send their media back to the supplier they purchased it from, who tested it, and certified it as being 100% OK. OK in different drives, that is.

This is not a posting recommending that you always buy media from whatever vendor your tape drives came from. I would however suggest the following:

  • Media that comes from the same manufacturer as your tape drive vendor will be OK.
  • Media that comes from reputable media vendors that don’t make competing tape drives should also be OK.
  • If one vendor’s media is ridiculously cheap – e.g., half the price charged by all other vendors – then maybe you should exercise caution before committing your backups to it.
  • Any decent media supplier will be able to tell you which media is recommended for use with a particular vendor’s tape drives.
  • Most hardware vendors do actually, if you look closely enough, recommend particular media vendors. This will undoubtedly include their own, but it usually includes 2 or 3 others. You should trust that information.


* I haven’t forgotten about backwards compatibility of media – e.g., any LTO-x drive must be able to read x-2 media and write x-1 media in addition to x media.
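The backwards compatibility rule in that footnote is simple enough to express as two small checks. This is a minimal Python sketch of the rule exactly as stated there (read x-2, write x-1); the function names are mine, and whether any particular drive generation follows the rule should be confirmed against the drive’s own documentation.

```python
def drive_can_read(drive_gen: int, media_gen: int) -> bool:
    """True if an LTO-<drive_gen> drive should read LTO-<media_gen> media, per the stated rule."""
    return media_gen >= 1 and drive_gen - 2 <= media_gen <= drive_gen


def drive_can_write(drive_gen: int, media_gen: int) -> bool:
    """True if an LTO-<drive_gen> drive should write LTO-<media_gen> media, per the stated rule."""
    return media_gen >= 1 and drive_gen - 1 <= media_gen <= drive_gen


# Example: what an LTO-4 drive should be able to do with each media generation.
for media_gen in range(1, 6):
    print(
        f"LTO-{media_gen} media in LTO-4 drive:",
        "read" if drive_can_read(4, media_gen) else "no read",
        "/",
        "write" if drive_can_write(4, media_gen) else "no write",
    )
```

Note that this only covers generation compatibility; it says nothing about the vendor-to-vendor media issues described above.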

© 2012 The NetWorker Blog