10 Things Still Wrong with Data Protection Attitudes

Mar 07, 2012
 

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to back up”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash-in-hand jobs, a business that doesn’t rely on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet there are still improvements to be made to data protection attitudes within many organisations, and I wanted to outline the things that are still commonly done incorrectly in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. Yet that’s no more a waste of money than an insurance policy that never gets claimed on.

However, what all too often is a waste of money is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include the following (a small sketch of capturing such requirements appears after the list):

  • Assigning arbitrary backup start times to systems without discussing them with system owners, application administrators, etc.;
  • Service Level Agreements not being established (including Recovery Time Objectives and Recovery Point Objectives);
  • Retention policies not being set according to business practice and legal/audit requirements.
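As an illustration only – the systems, windows and values below are hypothetical – here’s a minimal sketch of what capturing those per-system requirements as structured data might look like, so they can be reviewed and reported on rather than living in someone’s head:

```python
from dataclasses import dataclass

@dataclass
class ProtectionRequirement:
    """Backup requirements agreed with the system owner (all values hypothetical)."""
    system: str
    backup_window: str    # agreed with the application administrator
    rto_hours: float      # Recovery Time Objective
    rpo_hours: float      # Recovery Point Objective
    retention_days: int   # driven by business practice and legal/audit needs

requirements = [
    ProtectionRequirement("finance-db", "22:00-02:00", rto_hours=4, rpo_hours=1, retention_days=2555),
    ProtectionRequirement("fileserver-01", "20:00-06:00", rto_hours=24, rpo_hours=24, retention_days=90),
]

for req in requirements:
    print(f"{req.system}: window {req.backup_window}, RTO {req.rto_hours}h, "
          f"RPO {req.rpo_hours}h, retain {req.retention_days} days")
```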

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product (see the scheduling sketch after this list);
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
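To address the first point in the list above: one straightforward approach is to drive the database dump and the subsequent backup of the dump area from a single wrapper, so the dump is guaranteed to finish before the filesystem backup of it begins. This is a minimal sketch only – the dump command and the backup-client command are placeholders, not any particular product’s syntax:

```python
import subprocess
import sys

def run(cmd):
    """Run one step and abort on failure - a silent, half-finished backup is worse than none."""
    print("Running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"Step failed: {' '.join(cmd)}")

# 1. Dump the database to a staging area first (hypothetical PostgreSQL example).
run(["pg_dump", "--file=/backup/staging/appdb.dump", "appdb"])

# 2. Only once the dump has completed, trigger the filesystem backup of the staging
#    area (placeholder for your backup product's client-initiated backup command).
run(["backup-client", "--path", "/backup/staging"])
```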

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet in practically no organisation will the administrators of those specific systems get such free rein over data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I’ve seen made in backup system planning is assuming that what is bought for backup today will be equally suitable next year, or three to five years from now.

Growth must not only be forecast for long-term planning within a backup environment, it must be regularly reassessed. It’s not safe, after all, to assume a linear growth pattern will remain accurate indefinitely; there will be spikes and troughs caused by new projects or business initiatives and the decommissioning of systems.
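As a rough illustration of why “size it once and forget it” fails, here’s a minimal sketch that projects front-end data growth under an assumed compound annual growth rate, plus a one-off spike for a known project. Every figure in it is made up; the point is that the projection should be re-run whenever the assumptions change:

```python
def project_capacity(current_tb, annual_growth, years, spikes=None):
    """Project primary data size year by year.

    current_tb    - today's front-end data, in TB
    annual_growth - e.g. 0.25 for 25% year-on-year growth
    spikes        - optional {year: extra_tb} for known projects or decommissions
    """
    spikes = spikes or {}
    size = current_tb
    for year in range(1, years + 1):
        size = size * (1 + annual_growth) + spikes.get(year, 0)
        print(f"Year {year}: ~{size:.1f} TB front-end data")
    return size

# Hypothetical figures: 50 TB today, 25% annual growth, a 10 TB project landing in year 2.
project_capacity(50, 0.25, years=5, spikes={2: 10})
```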

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.
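A zero error policy doesn’t require elaborate tooling to get started. The sketch below is one hypothetical way to keep a failure register in a CSV file: every failure gets an entry, and nothing is considered closed until a resolution has been documented. The fields, file name and sample error text are all illustrative:

```python
import csv
from datetime import date

REGISTER = "backup_failure_register.csv"
FIELDS = ["date", "client", "error", "resolution", "closed"]

def log_failure(client, error):
    """Record a backup failure; it stays open until a resolution is documented."""
    with open(REGISTER, "a", newline="") as f:
        csv.DictWriter(f, FIELDS).writerow(
            {"date": date.today().isoformat(), "client": client,
             "error": error, "resolution": "", "closed": "no"})

def open_failures():
    """Anything still open is a gap in the backup system, not just a bad night."""
    with open(REGISTER, newline="") as f:
        return [row for row in csv.DictReader(f, FIELDS) if row["closed"] != "yes"]

log_failure("fileserver-01", "save set /data aborted")  # hypothetical failure entry
print(f"{len(open_failures())} unresolved failure(s)")
```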

Backups seen as a substitute for Disaster Recovery

Backups are not, in themselves, a disaster recovery strategy; backup processes certainly feed into disaster recovery planning, and play a fairly important part, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend it to full business continuity – doesn’t even account for half of what’s involved.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality the two issues simply have a degree of overlap.

As this problem is typically seen in organisations that don’t have formalised procedures, the easiest way to ensure new systems are included in the backup strategy is to have build forms – on which staff request not only storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about anything else in the datacentre put together – namely, all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so causes headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they cover a timeframe beyond the expected tenure of the backup administrator. Even if that’s the case, it’s not an excuse – if anything, it points to exactly the opposite requirement.

Failing to track media ageing is akin to deciding never to service your car. For a while you’ll get away with it, but as time goes on you’re likely to run into bigger and bigger problems, until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the business as a whole – not only for budgetary reasons, but for continuance and alignment as well. If that isn’t yet the case, the first step towards the shift is to establish an information protection advisory council within the business – a group made up of both IT staff and core business staff.

If you wouldn’t drink it, don’t cook with it…

Sep 28, 2011
 

This blog article has been moved across to the sister site, Enterprise Systems Backup and Recovery. Read it here.

Aug 07, 2011
 

In an earlier article, I suggested some space management techniques that need to be foremost in the minds of any deduplication user. Now, more broadly, I want to mention the top 7 things you need to get right with deduplication:

1 – Watch your multiplexing

Make sure you take note of what sort of multiplexing you can get away with for deduplication. For instance, when using NetWorker with a deduplication VTL, you must keep on-tape multiplexing to a maximum of 1 session; if you don’t, the deduplication system won’t be able to properly process the incoming data. It’ll still get stored, but the deduplication ratio will fall through the floor.

A common problem I’ve encountered is a well-running deduplication VTL that over time ‘suddenly’ stops achieving any decent deduplication ratio at all. Nine times out of ten the cause is a situation (usually weeks before) where, for one reason or another, the VTL had to be dropped and recreated in NetWorker – but the target and max sessions values were not readjusted back to 1 for each of the virtual drives.
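One way to catch that early is to audit the device session settings periodically instead of assuming they’re still right. Below is a minimal sketch that shells out to NetWorker’s nsradmin utility and flags any device whose target or max sessions isn’t 1; the server name is a placeholder, and the output parsing is an assumption that should be checked against your own NetWorker version before relying on it:

```python
import os
import re
import subprocess
import tempfile

# nsradmin commands: show only the name and session settings for every device resource.
COMMANDS = "show name; target sessions; max sessions\nprint type: NSR device\n"

with tempfile.NamedTemporaryFile("w", suffix=".nsradm", delete=False) as f:
    f.write(COMMANDS)
    script = f.name

try:
    # The server name is a placeholder; point this at your own NetWorker server.
    output = subprocess.run(
        ["nsradmin", "-s", "backup.example.com", "-i", script],
        capture_output=True, text=True, check=True).stdout
finally:
    os.unlink(script)

# Parse "attribute: value;" lines; resources are assumed to be separated by blank
# lines. Check this against your actual nsradmin output before trusting the results.
device = {}
for line in output.splitlines() + [""]:
    m = re.match(r"\s*(name|target sessions|max sessions):\s*([^;]+);", line)
    if m:
        device[m.group(1)] = m.group(2).strip()
    elif device:
        if device.get("target sessions") != "1" or device.get("max sessions") != "1":
            print(f"WARNING: sessions not set to 1 on {device.get('name')}: {device}")
        device = {}
```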

2 – Get profiled

Sure you could just sign a purchase order for a very spiffy looking piece of deduplication equipment. Everyone’s raving about deduplication. It must be good, right? It must work everywhere, right?

Well, not exactly. Deduplication can make a big impact on the at-rest data footprint of a lot of backup environments, but it can also be a terrible failure if your data doesn’t lend itself well to deduplication. For instance, if your multimedia content is growing, then your deduplication ratios are likely to shrink along with it.

So before you rush out and buy a deduplication system, make sure you have some preliminary assessment done of your data. The better the analysis of your data, the better the understanding you’ll have of what sort of benefit deduplication will bring your environment.

Or to say it another way – people who go into a situation with starry eyes can sometimes be blinded.

3 – Assume lower dedupe ratios

A fact sheet has been thrust in front of you! A vendor fact sheet! It says that you’ll achieve a deduplication ratio of 30:1! It says that some customers have been known to see deduplication ratios of 200:1! It says …

Well, vendor fact sheets say a lot of things, and there’s always some level of truth in them.

But step back a moment and consider the compression ratios stated for tape. Almost all tape vendors quote a 2:1 compression ratio – some even higher. This is all well and good – but now go and run ‘mminfo -mv’ in your environment, and calculate the sorts of compression ratios you’re really getting.
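If you want to turn that into an actual number, the arithmetic is simple once you know how much data was written to each full volume. This is a minimal sketch only: it assumes you’ve exported one “volume written-in-GB” pair per line to a text file (from mminfo or by hand), and the native capacity constant must be changed to match your media type:

```python
# Real-world compression check: for each *full* volume, compare the amount of data
# written against the native capacity of the media. The input format is an assumption:
# one "volume written_GB" pair per line, exported however suits you.
NATIVE_CAPACITY_GB = 800  # LTO-4 native capacity; change to suit your media type

ratios = []
with open("full_volumes.txt") as report:
    for line in report:
        volume, written_gb = line.split()
        ratio = float(written_gb) / NATIVE_CAPACITY_GB
        ratios.append(ratio)
        print(f"{volume}: {ratio:.2f}:1")

if ratios:
    print(f"Average across {len(ratios)} full volumes: {sum(ratios) / len(ratios):.2f}:1")
```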

Compression ratios don’t equate to deduplication ratios, of course – there’s a chunk more complexity in deduplication ratios. However, anyone who has been in backup for a while will know that you’ll occasionally get backup tapes with insanely high compression ratios – say, 10:1 or more – but the average for many sites is probably closer to the 1.4:1 mark.

My general rule of thumb these days is to assume a 7:1 deduplication ratio for an ‘average’ site where a comprehensive data analysis has not been done. Anything more than that is cream on top.

4 – Don’t be miserly

Deduplication storage is not to be treated as a ‘temporary staging area’ – otherwise you’ll have just bought yourself the most expensive backup-to-disk solution on the market. You don’t start getting any tangible benefit from deduplication until you’ve been backing up to it for several weeks. If you scope and buy a system that can only hold, say, 1-2 weeks worth of data, you may as well just spend the money on regular disk.

I’m starting to come to the conclusion that your deduplication capacity should be able to hold at least 4x your standard full cycle. So if you do full backups once a week and incrementals all other days, you need 4 weeks worth of storage. If you do full backups once a month with incrementals/differentials the rest of the time, you need 4 months worth of storage.
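Translating that rule of thumb into a capacity figure is straightforward arithmetic. The sketch below is a rough sizing model only – the change rate and the deduplication ratio are assumptions (see point 3), and a proper data assessment should replace them:

```python
def dedupe_capacity_estimate(full_tb, daily_change_rate, cycle_days, cycles=4, dedupe_ratio=7.0):
    """Very rough sizing: logical data held for `cycles` full cycles, divided by an
    assumed deduplication ratio. All inputs are guesses until a real assessment is done."""
    incrementals_per_cycle = full_tb * daily_change_rate * (cycle_days - 1)
    logical_per_cycle = full_tb + incrementals_per_cycle
    logical_total = logical_per_cycle * cycles
    physical_tb = logical_total / dedupe_ratio
    print(f"Logical data retained: ~{logical_total:.1f} TB")
    print(f"Estimated physical capacity at {dedupe_ratio}:1: ~{physical_tb:.1f} TB")
    return physical_tb

# Hypothetical site: 40 TB full, 3% daily change, weekly fulls, four cycles retained on dedupe.
dedupe_capacity_estimate(full_tb=40, daily_change_rate=0.03, cycle_days=7)
```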

5 – Have a good cloning strategy

You’ve got deduplication.

You may even have replication between two deduplication units.

But at some point – unless you’re throwing massive amounts of budget at this and have minimal retention times – the chances are that you’re going to have to start writing data out to tape to clear off older content.

Your cloning strategy has to be blazingly fast and damned efficient. A site with 20TB of deduplicated storage should be able to keep at least 4 x LTO-5 drives running at a decent streaming speed in order to push the data out as it’s required. Why? Because the data is rehydrated as it streams back out to tape. I know some backup products offer to write the data out to tape in deduplicated format, but that usually turns out to be bat-shit crazy: sure, it gets the data out to tape quicker, but once the data is on tape you have to start thinking about how long it will take to recover it.
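To put rough numbers on that, the clone-out window is simply the logical (rehydrated) data divided by the aggregate streaming rate you can actually sustain. A minimal sketch with made-up figures:

```python
def clone_out_hours(logical_tb, drives, per_drive_mb_s):
    """Hours needed to rehydrate and stream logical_tb out to tape, assuming every
    drive actually sustains per_drive_mb_s - which is the hard part in practice."""
    total_mb = logical_tb * 1024 * 1024
    seconds = total_mb / (drives * per_drive_mb_s)
    return seconds / 3600

# Hypothetical: 60 TB of logical (rehydrated) data, 4 x LTO-5 drives at 120 MB/s sustained.
print(f"~{clone_out_hours(60, drives=4, per_drive_mb_s=120):.1f} hours to clone out")
```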

6 – Know your trends

Any deduplication system should let you see what sort of deduplication ratios you’re getting. If it’s got a reporting mechanism, all the better; but in the worst case scenario, be prepared to log in every single day after your backup cycles and check:

-a- What your current global deduplication ratio is

-b- What deduplication ratio you achieved over the past 24 hours

Use that information – store it, map it, and learn from it. When do you get your best deduplication ratios? What backups do they correlate to? More importantly, when do you get your worst deduplication ratios, and what backups do they correlate to?

(The recent addition of DD Boost functionality in NetWorker can make this trivially easy, by the way.)

If you’ve got this information at hand, you can use it to trend and map capacity utilisation within your deduplication system. If you don’t, you’re flying blind with one hand tied behind your back.
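Even without a reporting mechanism, a few lines of scripting and a CSV file are enough to start trending. A minimal sketch, assuming you capture the global ratio and the last-24-hour ratio however your platform exposes them (the figures below are invented):

```python
import csv
from datetime import date

LOG = "dedupe_ratios.csv"

def record(global_ratio, daily_ratio, note=""):
    """Append today's deduplication figures; `note` records what ran last night."""
    with open(LOG, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), global_ratio, daily_ratio, note])

def worst_days(n=5):
    """The days with the worst 24-hour ratios tell you which backups to investigate."""
    with open(LOG, newline="") as f:
        rows = [(r[0], float(r[2]), r[3]) for r in csv.reader(f) if r]
    return sorted(rows, key=lambda r: r[1])[:n]

record(12.4, 3.1, note="monthly fulls plus new file server")  # hypothetical figures
for day, ratio, note in worst_days():
    print(f"{day}: {ratio}:1  ({note})")
```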

7 – Know your space reclamation process and speeds

It’s rare for space reclamation to happen immediately in a deduplication system. It may happen daily, or weekly, but it’s unlikely to be instantaneous. (See here for more details.)

Have a strong, clear understanding of the following (a small measurement sketch follows below):

-a- When your space reclamation runs (obviously, this should be tweaked to your environment)

-b- How long space reclamation typically takes to complete

-c- The impact that space reclamation operation has on performance of your deduplication environment

-d- An average understanding of how much capacity you’re likely to reclaim

-e- What factors may block reclamation. (E.g., hung replication, etc.)

If you don’t understand this, you’re flying blind and have the other hand tied behind your back, too.
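Most of those answers come from simply measuring reclamation every time it runs. Here’s a minimal bookkeeping sketch; the capacity query and the reclamation trigger are left as placeholders, because both are entirely platform-specific:

```python
import csv
import time
from datetime import datetime

def used_gb():
    """Placeholder: fetch current used capacity (GB) from your platform's reporting
    interface - SNMP, SSH, REST or a CLI, whatever it actually provides."""
    raise NotImplementedError

def measure_reclamation(start_reclamation):
    """Wrap a reclamation run, recording how long it took and how much space came back."""
    before = used_gb()
    started = time.time()
    start_reclamation()  # placeholder for however your platform kicks off reclamation
    hours = (time.time() - started) / 3600
    reclaimed = before - used_gb()
    with open("reclamation_history.csv", "a", newline="") as f:
        csv.writer(f).writerow([datetime.now().isoformat(), f"{hours:.2f}", f"{reclaimed:.0f}"])
    return hours, reclaimed
```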

Feb 27, 2011
 

When it comes time to consider refreshing the hardware in your environment, do you want to do it quickly, or properly?

Because here’s the thing: if you want to do it quickly – if you feel rushed and just want to get it done ASAP, not seeing the point of actually doing a thorough analysis of your sizing and growth requirements – here’s what you do:

  • Guess at the number of clients you’re going to back up.
  • Guess at the amount of data you’ll be backing up from first implementation.
  • Guess at the growth rate you’ll experience over the X years you want the system to last for.
  • Guess at the number of staff you’ll need to manage it.

Then, once you’ve got those numbers down, multiply each one by at least 4.

Then, ask for twice the budget necessary to achieve those numbers – just to be on the safe side.

If you think I’m joking – I’m not; I’m deadly serious. Deciding to skip an architecture phase where you actually review your needs, your growth patterns, your staffing requirements, etc., because you’re in a hurry is a costly and damning mistake to make. So if you’re going to do it, you may as well try to make sure you can survive the budget period.

And if asking for that much budget scares the heck out of you – well, there is an alternative: conduct a proper system architecture phase. Sure, it may take a little longer to get things running, or cost a little more time/money to get the plan done, but once you’ve got that done, it’ll be gold.

Feb 09, 2011
 

Deduplication can create fantastic space saving opportunities within an environment, but it does also create the need for a much closer eye on space management.

We’re used, in conventional backup or storage situations, to the following two facts:

  • There is a 1:1 mapping between amount of data deleted and amount of space reclaimed.
  • Space reclamation after delete is near instantaneous.

Data deduplication systems throw both of those facts out. In other words, there’s no free lunch: you may be able to store staggeringly large amounts of data on relatively small amounts of storage, but there are always swings and roundabouts.

With deduplication systems, you must monitor storage utilisation carefully and aggressively, since:

  • There is no longer a 1:1 mapping between the amount of data deleted and the amount of space reclaimed: if you’re running out of space, you might selectively delete several TB of data but, due to the nature of deduplication, reclaim only a very small amount of actual physical space as a consequence.
  • Space reclamation is not immediate: whenever data is deleted from a deduplication system, the system must scan the remaining data to see if there are any dependencies. Only if the deleted data was completely unique will space actually be reclaimed in earnest; otherwise all that happens is that pointers to unique data are cleared. (It may be that the only space you get back is the equivalent of what you’d reclaim on a Unix filesystem by deleting a symbolic link.) Not only that, but reclamation is rarely run on a continuous basis on deduplication systems – instead, you either have to wait for the next scheduled process, or manually force it to start.

The net lesson? Eternal vigilance! It’s not enough to monitor and start to intervene when there’s, say, 5% of capacity remaining. Depending on the deduplication system, you may find that 5% remaining space is so critically low that space reclamation becomes a complete nightmare. In reality, you want to have alerts, processes and procedures targeting the following watermarks (a small monitoring sketch follows below):

  • 60% utilisation – be on the look out for unexpected data growth.
  • 70% utilisation – be actively monitoring daily consumption rates.
  • 75% utilisation – you should know by now whether you have to expand the storage, or whether usage will stabilise again.
  • 80% utilisation – start forcing space reclamation to occur more frequently.
  • 85% utilisation – if you have to expand the storage, the purchase process should be complete and you should be ready to install/configure.
  • 90% utilisation – have emergency processes in place and ready to activate for storage redirection.

With these watermarks noted and understood, deduplication will serve your environment well.
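Those watermarks are trivial to script against whatever utilisation figure your platform reports; the sketch below reduces the actions to messages, and the 78% sample value is made up:

```python
# Map utilisation watermarks to the action they should trigger (percentages as above).
WATERMARKS = [
    (90, "EMERGENCY: activate storage redirection procedures"),
    (85, "Expansion hardware should be purchased and ready to install/configure"),
    (80, "Force space reclamation to run more frequently"),
    (75, "Decide now: expand, or confirm usage will stabilise"),
    (70, "Actively monitor daily consumption rates"),
    (60, "Watch for unexpected data growth"),
]

def check_utilisation(percent_used):
    """Return the highest watermark crossed and its recommended action."""
    for threshold, action in WATERMARKS:
        if percent_used >= threshold:
            return threshold, action
    return None, "Utilisation below 60% - routine monitoring"

# percent_used would come from your dedupe platform's reporting; 78 is a made-up value.
print(check_utilisation(78))
```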

Plenty of life left in tape

Jun 30, 2010
 

With LTO-5 now just starting to go mainstream, it’s reassuring to see that the Ultrium roadmap has been expanded by another two generations, taking it out to eight generations in total. Linking to the roadmap image, we see:

LTO Ultrium Roadmap (image copyright the LTO Consortium).

LTO-6 had been roadmapped a while ago, and offers slightly more than double the native capacity of LTO-5, at 3.2TB. Generations 7 and 8 are currently mapped to double the capacity of each preceding generation. Interestingly, there are predictions of larger increases in tape streaming speed. One would hope these are managed carefully; it was a real relief to see LTO-5 not do the conventional doubling of streaming speed, giving backup networks and infrastructure a generation’s worth of time to catch up.

It’s pretty clear that an investment in LTO-5 today is an investment in a well-roadmapped future that has been consistently delivered on thus far. Sure, the use of tape within backup is evolving – we’re going to see it move more into the role of long-term backup storage in larger sites, and clones-only in smaller sites – but with a healthy roadmap ahead of us and LTO-5 just now ramping up into the mainstream, tape continues to show it’ll be around for a while to come.

LTO-5 peeking at us from the end of the tunnel

Mar 30, 2010
 

The long-anticipated wait for LTO-5 is now approaching fulfilment, with stories such as “Mass Production of Sony LTO-5 Media Has Started” further reinforcing that this next-generation enterprise tape format is about to start rolling into datacentres.

One of the biggest advantages of LTO-5 is that while the capacity has effectively doubled from LTO-4, we’ve not seen a comparable doubling in streaming speed. LTO-4 had a native streaming speed of 120 MB/s, which has caused more than a few headaches to backup administrators trying to keep it running at full speed. (Indeed, it’s an example of why I earlier posted “Direct to Tape is Dead, Long Live Tape“).

LTO-5, while moving to a native capacity of 1.5TB, increases the native streaming speed by only 20MB/s – giving us a native streaming speed of 140MB/s. This still isn’t always going to be easy to achieve, but bearing in mind that each previous LTO generation has typically doubled the streaming speed of the one before it, 140MB/s is going to be a lot easier to integrate into the datacentre than 240MB/s would have been!

Looking at the generational specifications, we get:

  • LTO-1: 100 GB native capacity (200 GB at 2:1), 20 MB/s native streaming speed (40 MB/s at 2:1)
  • LTO-2: 200 GB native (400 GB), 40 MB/s native (80 MB/s)
  • LTO-3: 400 GB native (800 GB), 80 MB/s native (160 MB/s)
  • LTO-4: 800 GB native (1.6 TB), 120 MB/s native (240 MB/s)
  • LTO-5: 1.5 TB native (3 TB), 140 MB/s native (280 MB/s)

Note – all compressed capacities and speeds are quoted at the standard vendor estimate of a 2:1 compression ratio. In reality, we all know that 2:1 compression only occurs on a small subset of data, and it’s usually better to estimate a conservative compression ratio of 1.3:1 or, if you want to be optimistic, 1.4:1, unless you’re very certain that your data is highly compressible.

If you want to see these figures graphically, here we go:

Charts: LTO Ultrium streaming speeds and capacities by generation.

I’m not aware of hard numbers, but anecdotally I’ve heard time and time again that a lot of sites have been reluctant to move up from LTO-3 to LTO-4 because they weren’t ready to upgrade their infrastructure to support the streaming speed of LTO-4. Some have argued this is a clear indication that LTO-5 will struggle for adoption. I beg to differ – while LTO-4 was effectively ahead of its time by a considerable margin, LTO-5 will instead enter a more sophisticated datacentre with better approaches to tape usage within the backup environment. In the case of datacentres still using LTO-3, it will also be entering environments that are well and truly ready to upgrade their infrastructure. The article about HP’s strategy for LTO-5 that I was referred to this morning suggests they’re thinking along similar lines on this front.

The end result will be that a lot of sites that have stayed on LTO-3 will see good reason to make the step directly from that format up to LTO-5. The streaming speed will increase by a little under double, but the native capacity at those sites will jump from 400 GB to 1.5TB – and that sort of capacity increase will justify the expenditure required to hit LTO-5’s new speed target.

IBM and the case of the crazy-dense tape

Jan 23, 2010
 

According to a press release, IBM have come up with a tape format which is so dense that it’ll fit about 35TB of uncompressed data on it.

Obviously this is a “just in the lab” technology, and it’s going to be a while away from hitting the market. It remains, however, a remarkable feat – by comparison, LTO-4 manages a “measly” 800 GB of uncompressed data, and the soon-to-be-released LTO-5 will manage 1.5TB of uncompressed data.

The critical question of course will remain – how fast will you have to pump data at this beast in order to get it streaming? I’m guessing it will be a seriously high speed. As we continue to see tape getting faster and faster, I’ll continue to say: this is the decade where direct to tape backup models will die, long live tape.