Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors, and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error happens.

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently successful) your backups are, it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
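If you want to check that arithmetic against your own environment, a minimal sketch in Python follows. The function names and the week of nightly results are made up for illustration; nothing here is a NetWorker interface.

```python
def expected_savesets(hosts, filesystems_per_host):
    """Nightly saveset count for an environment like the example above."""
    filesystem_savesets = hosts * filesystems_per_host
    index_savesets = hosts  # one client index saveset per host
    bootstrap = 1
    nmc_database = 1
    return filesystem_savesets + index_savesets + bootstrap + nmc_database


def target_range(total, low=0.95, high=0.98):
    """Return (low mark, high mark) as whole savesets, rounded."""
    return round(total * low), round(total * high)


def rolling_success_rate(nightly_successes, total):
    """Success rate across several nights of successful-saveset counts."""
    return sum(nightly_successes) / (len(nightly_successes) * total)


total = expected_savesets(hosts=20, filesystems_per_host=4)
low_mark, high_mark = target_range(total)
print(f"{total} savesets, low mark {low_mark}, high mark {high_mark}")

# Hypothetical week: one night dips to the low mark, the average stays high.
week = [100, 102, 97, 101, 100, 102, 99]
print(f"rolling success rate: {rolling_success_rate(week, total):.1%}")
```

The point of tracking the rolling figure rather than a single night is the one made above: an occasional dip to the low mark is tolerable as long as the average sits at or near the high mark.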

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as that may be highly desirable, a backup system by its nature touches on so many parts of an operating IT environment that it’s also one of the systems most vulnerable to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if mid-way through a backup a client crashes, or a tape drive fails. So what I’m actually allowing for with that 2-5% failure rate is the “nature of the beast” style failures: hardware issues, Murphy’s Law and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

 

It seems to be a growing trend – EMC update NetWorker on the same day that I’m sick as a dog and unable to concentrate enough to blog on it.

I thought I’d run through the new features in the release notes and provide a bit of a summary of the changes. The full release notes are available via PowerLink, or here.

Slap another GUI on the barbie, will ya?

As a Mac user, I find this a really nice treat. EMC have finally included a recovery GUI for the Mac OS X client. It was always a bit of an omission that the most GUI-centric platform supported by EMC happened to have no client GUI support whatsoever. For what it’s worth, I’m not overly concerned about a lack of a backup GUI for the platform, since that’s disappeared from other platforms, and most backups should be done via a schedule anyway.

And for what it’s worth, it’s a damn fine GUI:

[Screenshots: Mac GUI 1, Mac GUI 2, Mac GUI 3]

There were, in beta testing, a few minor glitches. I’ve not yet had the time to run the released version through its paces, but as you’ll note from the screen shots above, in “Favorites”, for instance, a few drives are listed twice. Though, now that I think about it, the drives that were listed twice were filesystems that had been rebuilt and/or replaced on my Mac Pro during the beta test period, so perhaps for some reason they were showing up separately.

The other point for the Mac in 7.6 SP3 of course is support for Lion. Don’t however get your hopes up – there’s no mention of 7.6 SP3 supporting backup/recovery of the hidden recovery partition that Lion installs, and given how Apple have done their best to lock down that hidden recovery partition, I’m hardly surprised.

That being said, I’ve had the beta of 7.6 SP3 running on my Macs for some time now, and the support for Lion has been very good thus far.

Default, Default!

There have been some changes to standard system defaults, and for the most part I really like them. This is minor stuff, to be sure, but it’ll certainly help sites where NetWorker is being deployed for the first time, or where the NetWorker administrators are relatively new to the product.

For a start, security – at long last, RAP Logging has been enabled by default, and in fact will be enabled on an upgrade, if I’m reading the release notes correctly. This is a simple yet important change; it’s a recognition that auditing and security control is important, and it turns on something that can make back-tracking changes at a site trivial. Every time I do a backup audit on a client, I look to see whether Monitor RAP is enabled. With a bit of luck as companies move to 7.6 SP3 I’ll have less to rabbit on about there.

The devil is no longer in the details for the start time – I always thought one of the original programmers at Legato must have had a warped sense of humour to have a default group start time of 3.33. Was it because 6.66 is not a valid time? To be sure, in a very small environment a start time of 3.33 was often OK for businesses, but the amount of data to be backed up has grown, NetWorker is well entrenched as an enterprise product, and a default start time so close to the start of business hours always irked me a little. Now it’s been set to 21.00 instead, a much more suitable time.

The nsrmmd polling, control and restart intervals have all been increased; they’re now 15, 10 and 5 minutes respectively. That’s higher than the previous defaults, but actually far more sensible. Those previous intervals were oriented at small LAN networks with a minimal number of backup devices. We’re now increasingly seeing environments with hundreds of backup devices, and the default polling intervals were something that you’d have to change as soon as you went beyond a relatively small number of devices anyway. That’s another bugbear that’s gone.

Max sessions for AFTD/file devices is now set to 32 for all new devices created after the upgrade. If I recall correctly, for most of 7.6 it’s been the case that new AFTD/file devices would get target 1/max 4 by default. I’m not sure I particularly like this change, but I’ll live with it. The challenge with having large numbers of max sessions for disk backup devices is that I’ve seen it encourage sites to not really take time to think about the disk layout. In particular, some sites end up creating a few large disk backup devices, which results in poor striping of the backup writes and reads across the per-filesystem spindles. That is, a single 6TB disk backup unit made up of 5 x 2TB drives in RAID-6 is not going to give as good a performance as, say, 2 x (5 x 1TB drives in RAID-6). I.e., I worry this may leave people thinking they can get away with a higher number of streams to each disk backup unit. Please, make sure you don’t get trapped into that ‘false economy’ thinking. On a similar front, the default max sessions for Data Domain devices has been set to 10. I suspect again for Boost devices that it may be a little optimistic, but we’ll wait and see.
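To make the disk layout comparison concrete, here’s a rough sketch in Python. It relies only on the standard RAID-6 rule that two drives’ worth of capacity goes to parity; the total of 32 sessions is simply the new default reused as an illustrative stream count, and none of this models real throughput.

```python
def raid6_usable_tb(drives, drive_tb):
    """Usable capacity of one RAID-6 set: two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID-6 needs at least four drives")
    return (drives - 2) * drive_tb


# The two layouts compared above.
single_unit = [raid6_usable_tb(5, 2)]      # one large disk backup unit
two_units = [raid6_usable_tb(5, 1)] * 2    # two smaller disk backup units

total_sessions = 32  # illustrative number of streams to spread over the devices
for name, units in (("1 x (5 x 2TB)", single_unit), ("2 x (5 x 1TB)", two_units)):
    print(name,
          "| usable TB:", sum(units),
          "| spindles:", 5 * len(units),
          "| sessions per device:", total_sessions // len(units))
```

Both layouts give the same usable capacity, but the second spreads the same streams over twice as many spindles and two devices, which is the striping benefit a single large unit gives up.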

The Company That Dare Not Speak Its Name

It’s safe to say that there’s a great deal of rivalry between NetApp and EMC. If you follow EMC/NetApp employees on Twitter, sometimes the discussions descend into thinly disguised vitriol. There are bad elements to that of course, but it does speak to the passion that each company has for their own products and their own way of thinking.

That being said, 7.6 SP3 has introduced the ability to do block-level backups of SnapMirror volumes on NetApp, which can make a big difference when you’re backing up highly dense NetApp filesystems.

Ultimately, in talk of “big data” and consolidation, all the vendors have to play together much more nicely, and I’m happy to see additional integration options between NetWorker and NetApp.

NetWorker Health Check

EMC provide a freebie utility called the Health Check Tool, and it’s worth checking out. There’s a new version, 3.1, which supports NetWorker 7.6 SP3. (Of course, you can also get a bunch of diagnostic tools in IDATA Tools, too.)

Still, if you like working within the confines of a GUI, the NetWorker Health Check tool is a handy utility to have in your environment.

I’ll multi-your-plex

This is a nice one – and I’ll be interested to play with it ASAP on a Data Domain host. So long as you’re running DDOS 5.x or higher, NetWorker will work better with deduplication while still multiplexing to a DD VTL. Previously this hasn’t been the case – the way NetWorker multiplexes the data stream is somewhat incompatible with most forms of deduplication. Clearly the integration of NetWorker and Data Domain in the BRS family continues to have benefits – Data Domain OS 5.x must have some level of understanding of a multiplexed NetWorker data stream and be able to perform some deduplication on it.

Now, the release notes state clearly that you still won’t get optimal deduplication rates, and that it will create a performance hit during the backup process. That being said, if your back is up against the wall and you have a choice between stacking on a bunch of storage node enabler codes to support 128 virtual drives on a single host, or using fewer virtual drives with a lower deduplication ratio/speed, this may be beneficial to consider.

vSphere 5 Support

It had to be said, the Oregon State Highway Division not only had a whale of a problem on its hands…

Sorry, I got distracted. It had to be said that it took NetWorker a while to support the VADP backup options under ESX 4.x, etc. While vSphere 5 has been out for a while now, the elapsed time has been much smaller, and NetWorker 7.6 SP3 features vSphere 5 virtual machine/guest support.

Alphawhat?

AlphaStor continues to limp along, and AlphaStor 4.0 support is included with 7.6 SP3.

Personally, I think it’s time for AlphaStor to be dropped as an independent product. Fold library virtualisation and the additional media management options directly into the NetWorker product, and ditch AlphaStor as a separate licensed option, please, EMC.

Additional Platform Support

Solaris 11 – remember that champion Unix platform that people used to either love or hate, but now people seem to universally hate since Uncle Larry purchased it? Well, Solaris 11 has official support with NetWorker 7.6 SP3. Don’t get me wrong, I’m not knocking EMC for including support for Solaris 11, but I stand by an earlier piece, RIP Solaris.

CentOS 6.x support … ahem! … RedHat Enterprise Linux 6.x support has been enhanced to include RHEL 6.1. I’m sure this will see a large number of companies currently using CentOS looking at upgrading their release, since they get a better bang for buck quality of support from using CentOS than they do from using other commercial Linux releases.

Unky Larry’s Other Product

Java 7 is now supported with NetWorker.

Upgrading

As always, before you upgrade NetWorker:

  • Read the release notes carefully. See what has been fixed from previous releases, and see what are the known issues of the new release;
  • Make sure you have a good, reliable backup of your bootstrap and index region before doing the upgrade;
  • Have available, ready to install, the version of NetWorker you’re currently using plus any patches you may have applied just in case you need to back-rev for some reason;
  • Have a change window big enough to allow you to run some preliminary tests after the upgrade;
  • Despite it not being documented anywhere, my honest opinion is that on any NetWorker server upgrade, you should remove the /nsr/tmp directory after shutting NetWorker down but before doing the upgrade. This can prevent a lot of random issues;
  • Always upgrade your storage nodes at the same time as your backup server. I know EMC says you don’t have to. I disagree, however. (As do my colleagues.)
 

Long-term blog readers will know that I advocate a zero error policy within backup environments.

This is elucidated in my posts:

You could say that those posts are precursors to this post, and if you’re not familiar with what I’ve had to say there, you may want to read those first.

One of the critical mistakes I periodically see when companies try to implement a zero error policy is they focus too much on the errors.

The errors, though, are often just the “tip of the iceberg”.

For instance, take the most simple of errors – an open file error. You might run a backup of a Windows filesystem which reports a collection of errors relating to files that were skipped because they were open at the time.

Yet, those open files aren’t really the error. Seeing them as the error is usually a case of mistaking cause and effect. In this scenario, the error is one of:

  • The backup software is misconfigured, or
  • The backup software is missing modules that allow it to back up open files.

In the first case, it may be that the file(s) which are reported as open and couldn’t be backed up actually don’t need to be backed up. They may be temporary files, or cache files, or some other short-lived collection of files that have no importance in terms of data protection. So the error there isn’t the individual files that failed to back up, but the failure to configure the exclusions for the client appropriately.

In the second case, it may be that those files really do need to be backed up, but to do so requires a special module. They may be database files (e.g., Microsoft SQL Server, Microsoft Exchange, etc.), or some other collection of files that must be quiesced before backup. In this case, the error is that the system is being backed up inconsistently.
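As a rough illustration of treating open-file errors as symptoms rather than the problem, the sketch below sorts skipped files into the two cases above, plus a third bucket for anything that still needs explaining. The suffix lists and file paths are purely hypothetical; the real lists have to come out of discussion with the people who own the systems.

```python
from pathlib import PureWindowsPath

# Assumed, site-specific rules. Agree the real lists with the system owners.
DISPOSABLE_SUFFIXES = {".tmp", ".cache", ".lock"}
APPLICATION_SUFFIXES = {".mdf", ".ldf", ".edb"}  # SQL Server / Exchange data files


def classify_open_file(path):
    """Map a skipped open file to the underlying problem it points at."""
    suffix = PureWindowsPath(path).suffix.lower()
    if suffix in DISPOSABLE_SUFFIXES:
        return "configuration: add an exclusion"
    if suffix in APPLICATION_SUFFIXES:
        return "missing module: needs an application-consistent backup"
    return "unexplained: discuss with the system owner"


# Hypothetical files reported as skipped because they were open.
skipped = [
    r"C:\Temp\session42.tmp",
    r"D:\MSSQL\Data\sales.mdf",
    r"E:\Shares\finance\budget.xlsx",
]
for path in skipped:
    print(path, "->", classify_open_file(path))
```

The third bucket matters most: under a zero error policy nothing stays there for long, because no error is allowed to go unexplained.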

Zero error policies aren’t about playing whack-a-mole with errors; they’re about solving problems.

After all, the captain of the Titanic couldn’t have averted the disaster by stopping the ship just short of the iceberg and having someone take a pick axe to the top of it.

The net result of this is that having a zero error policy requires the following two processes/activities:

  • Discussion of errors with system owners/nominated key users;
  • Root cause analysis.

If either of those is missing, you’re more likely making (at best educated) guesses as to the correct resolution to the errors. However, if you have those in place, you can more confidently review any error as it hits and make an informed (and even documented) decision as to how to resolve the underlying issue that it represents.

Without them, a zero error policy may actually make the situation worse.

 

Recently I wrote, “7 common problems with deduplication“. That covered some of the practicalities that you need to be aware of. However, that wasn’t a definitive list, and I wanted to expand on that a little with this post.

The additional considerations are:

  1. Architecture – How will it fit together?
  2. Rehydration – Can your pipe accommodate the data?
  3. Redundancy – Are you putting all your eggs in the one basket?
  4. Replicas – How will your copies be handled and recognised by the server?
  5. Long term storage – What is your strategy for longer-term backups?

Each of these includes factors that you have to consider before you go ahead with data deduplication within an environment, and I’ll go through each one individually.

Architecture

If we look at NetWorker and target based deduplication, we run into an interesting architectural issue. The way NetWorker generates multiplexed savesets can have a direct impact on the compressibility of the datastream. In particular, all VTL based deduplication devices should be configured such that each virtual drive has both target and max sessions set to 1.

In a conventional tape or backup-to-disk environment, it’s common to see configurations where 4 or more sessions are streamed to each device. For physical tape, this may be partly due to the need to keep drives streaming, but it can also be to do with making sure that there’s not a backlog of pending savesets, too – i.e., keeping the backup window as narrow as possible.

If we cut away from that process and move to an architecture that has a 1:1 ratio for streams and virtual drives, the logical solution is to increase the number of virtual drives. Typically I’d suggest that there’s at least a 4:1 ratio of virtual drives to physical drives when a VTL is replacing a PTL. I.e., if you had 4 physical drives, you’ll be configuring a VTL with at least 16 virtual drives.

However, if we look at NetWorker licensing, this has an odd effect. VTLs will either get ‘real’ VTL licenses if they’re of a particular EMC brand, or an alternate VTL license bundle, which grants 3 x Unlimited Autochanger licenses per XTB presented by the VTL.

Neither of those licenses is the issue – the issue is actually with NetWorker’s limitations relating to the number of devices per storage node or server. For NetWorker, Network Edition, you’re entitled to:

  • 16 devices on the server;
  • 16 devices on each storage node.

For NetWorker, Power Edition, you’re entitled to:

  • 32 devices on the server;
  • 32 devices on each storage node.

That’s all well and good for physical tape environments – but once you go virtual, those limitations can get very tight, very quickly. (Hint, EMC: Those limitations should be doubled or quadrupled, please.)

The net effect is that if you have say, a 4-drive PTL and a 16-drive VTL, but just a single server, no storage nodes, you’ll need to do one of the following:

  • Upgrade from Network edition to Power Edition, or
  • Purchase an additional storage node license to ‘stack on’ an extra 16 devices.

Yes – you can purchase add-on storage node licenses to increase the permitted device count within the environment, without adding an actual storage node. This is handy to know in normal situations, but when it comes to deduplicating VTLs in particular, it’s a must.
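The device arithmetic above is simple enough to sketch in Python. The limits are the ones quoted in this post; the function names are mine, and I’m assuming each add-on storage node license contributes the same number of devices as a storage node of that edition.

```python
import math

# Device limits quoted above, per server and per storage node license.
DEVICE_LIMITS = {"Network": 16, "Power": 32}


def virtual_drives_needed(physical_drives, ratio=4):
    """Minimum virtual drives when a VTL replaces a PTL, at one stream per drive."""
    return physical_drives * ratio


def extra_storage_node_licenses(total_devices, edition):
    """Add-on storage node licenses needed to cover devices on a single server."""
    limit = DEVICE_LIMITS[edition]
    shortfall = max(0, total_devices - limit)
    return math.ceil(shortfall / limit)


physical = 4
virtual = virtual_drives_needed(physical)
total = physical + virtual
for edition in DEVICE_LIMITS:
    print(edition, "Edition:", total, "devices,",
          extra_storage_node_licenses(total, edition), "extra license(s)")
```

For the 4-drive PTL plus 16-drive VTL example, that’s 20 devices: one add-on license under Network Edition, none under Power Edition.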

Rehydration

It’s all very well to have a fabulous deduplication ratio. Let’s say you’re achieving 10:1 or something along those lines. However, we don’t just deal in deduplicated data. At some point, that data is going to have to be rehydrated. Typically this’ll be for one of the following:

  • As part of a recovery, or
  • For tape-out functionality.

In either case, you’re no longer concerned about the deduplication ratio you’ve achieved, but the amount of rehydrated data you’ll be streaming out. One immediate consideration is that if you’ve deployed deduplication backups for branch-office scenarios, and you’ve been loving the ‘trickle’ effect of only sending unique data across the WAN, you’re going to be somewhat less enamoured by having to send the entire data stream, rehydrated, back across the WAN.

Unless, of course, you’ve architected for that situation.

If you’re doing tape-out (either cloning or staging), you still need to factor the actual rehydrated size into any sizing calculations for a physical tape library. In particular, a common mistake I’m seeing is that people think that by implementing deduplication they can substantially reduce the number of physical tape drives in the environment. I would suggest that as a general rule of thumb for most sites, a reduction of between one quarter and one third of the physical devices is the most you can hope to achieve. If you pull out more than that, you’re likely going to suffer serious contention during tape-out operations. You’ll also be totally blown out of the water whenever there’s a physical fault.
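To put some numbers on rehydration, here’s a minimal sketch in Python. The branch-office figures (500GB stored at 10:1, a 100 Mbit/s WAN link) are invented for illustration, and the transfer time ignores protocol overhead and contention, so treat the results as a best case.

```python
def rehydrated_tb(stored_tb, dedupe_ratio):
    """Logical (rehydrated) size of data held on deduplicated storage."""
    return stored_tb * dedupe_ratio


def transfer_hours(size_tb, link_mbit_per_s):
    """Hours to move size_tb over a link, ignoring protocol overhead."""
    size_bits = size_tb * 1e12 * 8
    return size_bits / (link_mbit_per_s * 1e6) / 3600


def min_tape_drives_after_dedupe(current_drives):
    """Fewest drives to keep if no more than one third are removed."""
    return (2 * current_drives + 2) // 3  # integer ceiling of two thirds


# Hypothetical branch office: 0.5TB stored at 10:1, recovered over 100 Mbit/s.
logical = rehydrated_tb(0.5, 10)
print(f"{logical:.1f}TB rehydrated, "
      f"{transfer_hours(logical, 100):.0f} hours over the WAN")
print("drives to keep from 12:", min_tape_drives_after_dedupe(12))
```

The ‘trickle’ that got the data to the central site tells you nothing about how long a full recovery back to the branch will take; only the rehydrated size does.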

Redundancy

Deduplication should never be deployed on its own. E.g., you can’t just have a single Avamar RAIN or a single target deduplication unit. It’s putting all your eggs in one basket. You need some form of atomic-unit redundancy, be that a second grid you replicate to, or a second DD you replicate to, or tape-out.

I’ve heard of solutions deployed that have a single Avamar RAIN for instance – and just a few nodes in the grid – with no tape out, and no replication to another site. I personally think that’s data-suicide. Sure, any individual node in a RAIN can fail and the grid will continue, but you’ve still got the fundamental problem – what happens if you lose your grid?

The same applies to target based deduplication. For ease of consideration, any deduplication configuration, be it Avamar, Data Domain, Quantum, FalconStor or anything else should be considered to have one unit per physical location. And if, under those definitions, you’ve only got one unit – well, you’ve got insufficient redundancy.

Replicas

In particular with target based deduplication, if you’re using the replication functionality of the deduplication device (to avoid a NetWorker clone rehydrate+deduplicate again scenario), you introduce a new challenge – how do you get NetWorker to actually know about the replicas? Items for consideration here are:

  1. Can both replicas be online at the same time? I.e., does the deduplication environment support this?
  2. Will NetWorker perceive the replicas as the same physical media? I.e., do the replicas have the same volume ID? If so, NetWorker won’t permit them to be mounted in two different locations at once.
  3. How ‘atomically’ can replicas be brought online? If replicas do have the same volume ID, what is the smallest replica that can be brought online? Typically this will be either a single virtual tape, or a single disk backup unit. For virtual tapes, that’ll be more manageable. For disk backup units, it presents more of a problem.

Newer technology, such as DD Boost, which integrates NetWorker’s cloning facilities with the inherent replication capabilities of the hardware, addresses this issue. If you’re not using DD Boost though, you need to come up with your own solution.

Long Term Storage

Want deduplication? Want enough deduplication to handle 7 years of backups? 10 years? 15 years? ‘Forever’ years? Long term storage can’t be left by the wayside; you have to plan and architect this into your solution.

Some deduplication vendors (EMC included) are starting to tout new archive credentials in their deduplication arrays, but to be perfectly frank, the long-term cost of maintaining large amounts of either spinning or partially spun down disks with deduplicated storage, vs a batch of tapes with rehydrated storage, is still not at a point that can be entertained by many businesses. Tape is, and shall continue to be cheap for longer term storage and archival storage. Anyone who tries to tell you otherwise likely has a vested interest in dropping more storage on your datacentre floor.

When planning for longer-term storage in a deduplication environment, you have to make a few decisions in advance:

  • Do longer term backups go direct to tape (or conventional disk staging areas) instead of ever hitting deduplicated storage?
  • If the longer-term backups do sit on deduplicated storage, what will be the additional size requirements?
  • Are those size requirements worth it? E.g., if you have to buy a unit that has an additional 20TB of deduplication capabilities in order to hold all the long-term backups that you want to keep ‘nearline’, is it actually worth it, given it’ll always be staged out/relocated to longer-term storage, or do you go for a cheaper initial storage option as well?

Summing up

Between this and other articles, one might think that I’m actually against deduplication. I’m not. However, I am dead-set against the mis-use of technology. Wasteful spending, particularly in the backup environment, just leads to bigger issues – such as artificial and inaccurate budgetary restraints at a later point in time.

When it comes to deduplication, I guess there can only be one rule: eyes wide open.

 

With apologies for the delays, I’m pleased to announce that the report following the NetWorker Usage Survey in June/July 2011 is now ready.

With two prior reports, some comparison for trending was possible, and this will continue in future reports. Some interesting results were noticed in the backup cloning section, and new questions were posed to track encryption usage within environments.

The report is available here.

 

In a previous post, I described how one could use jobquery and jobkill to terminate running scheduled clones in situations where NMC doesn’t allow the clone to be stopped from within the GUI. However, jobquery isn’t necessarily the most intuitive of interfaces if you’re not using it all the time.

I was pleasantly surprised when I was preparing some documentation to note that jobkill, as of 7.6 SP2, has become interactive if there are multiple jobs running, which reduces the need to run jobquery if you’re wanting to just stop one scheduled operation.

In 7.6 SP2, if you run jobkill without any arguments, and there are jobs running, you’ll run into an interactive session such as the following.

# jobkill
                      job id: 3104018;
                        name: tara-5;
                        type: savegroup job;
                     command: ;
           NW Client name/id: ;
                  start time: 1312763880;
------------------------------------------------------
                      job id: 3104025;
                        name: /d/01;
                        type: save job;
                     command: \
save -s tara.pmdg.lab -g nox-5 -LL -f - -m tara.pmdg.lab -t 1312026303 \
-l 5 -q -W 78 -N /d/01 /d/01;
           NW Client name/id: tara.pmdg.lab;
                  start time: 1312763880;
------------------------------------------------------
                      job id: 3104026;
                        name: /;
                        type: save job;
                     command: \
save -s tara.pmdg.lab -g nox-5 -LL -f - -m tara.pmdg.lab -t 1312026306 \
-l 5 -q -W 78 -N / /;
           NW Client name/id: tara.pmdg.lab;
                  start time: 1312763880;
------------------------------------------------------
Specify jobid to kill ('q' to quit, 'r' to refresh): 3104018
Terminating job 3104018
Specify jobid to kill ('q' to quit, 'r' to refresh): q

So there you go – jobkill is interactive, helpful and now saves the hassle of running jobquery first.
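One thing the listing doesn’t do is show the start time in a readable form. The values look like seconds since the Unix epoch (that’s my assumption, it isn’t stated in the output), in which case a few lines of Python will translate them:

```python
from datetime import datetime, timezone


def job_start_utc(epoch_seconds):
    """Render a jobkill 'start time' value as a UTC timestamp string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# The start time reported for the jobs in the session above.
print(job_start_utc(1312763880))
```

For the session above that gives 2011-08-08 00:38:00 UTC; adjust for your own timezone as needed.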

 

In an earlier article, I suggested some space management techniques that need to be foremost in the minds of any deduplication user. Now, more broadly, I want to mention the top 7 problems you need to avoid with deduplication:

1 – Watch your multiplexing

Make sure you take note of what sort of multiplexing you can get away with for deduplication. For instance, when using NetWorker with a deduplication VTL, you must use maximum on-tape multiplexing settings of 1; if you don’t, the deduplication system won’t be able to properly process the incoming data. It’ll get stored, but the deduplication ratios will fall through the floor.

A common problem I’ve encountered is a well running deduplication VTL system which over time ‘suddenly’ stops getting any good deduplication ratio at all. Nine times out of ten the cause was a situation (usually weeks before) where for one reason or another the VTL had to be dropped and recreated in NetWorker – but, the target and max sessions values were not readjusted for each of the virtual drives.

2 – Get profiled

Sure you could just sign a purchase order for a very spiffy looking piece of deduplication equipment. Everyone’s raving about deduplication. It must be good, right? It must work everywhere, right?

Well, not exactly. Deduplication can make a big impact in the at-rest data footprint of a lot of backup environments, but it can also be a terrible failure if your data doesn’t lend itself well to deduplication. For instance, if your multimedia content is growing, then your deduplication ratios are likely shrinking as well.

So before you rush out and buy a deduplication system, make sure you have some preliminary assessment done of your data. The better the analysis of your data, the better the understanding you’ll have of what sort of benefit deduplication will bring your environment.

Or to say it another way – people who go into a situation with starry eyes can sometimes be blinded.

3 – Assume lower dedupe ratios

A fact sheet has been thrust in front of you! A vendor fact sheet! It says that you’ll achieve a deduplication ratio of 30:1! It says that some customers have been known to see deduplication ratios of 200:1! It says …

Well, vendor fact sheets say a lot of things, and there’s always some level of truth in them.

But, step back a moment and consider compression ratios stated for tapes. Almost all tape vendors give a 2:1 compression ratio – some actually higher. This is all well and good – but now go and run ‘mminfo -mv’ in your environment, and calculate the sorts of compression ratios you’re really getting.

Compression ratios don’t really equal deduplication ratios of course – there’s a chunk more complexity in deduplication ratios. However, anyone who has been in backup for a while will know that you’ll occasionally get backup tapes with insanely high compression ratios – say, 10:1 or more, but an average for many sites is probably closer to the 1.4:1 mark.

My general rule of thumb these days is to assume a 7:1 deduplication ratio for an ‘average’ site where a comprehensive data analysis has not been done. Anything more than that is cream on top.

4 – Don’t be miserly

Deduplication is not to be treated as a ‘temporary staging area’. Otherwise you’ll have just bought yourself the most expensive backup to disk solution on the market. You don’t start getting any tangible benefit from deduplication until you’ve been backing up for several weeks. If you scope and buy a system that can only hold say, 1-2 weeks worth of data, you may as well just spend the money on regular disk.

I’m starting to come to the conclusion that your deduplication capacity should be able to hold at least 4x your standard full cycle. So if you do full backups once a week and incrementals all other days, you need 4 weeks worth of storage. If you do full backups once a month with incrementals/differentials the rest of the time, you need 4 months worth of storage.
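As a back-of-the-envelope sketch of that sizing rule in Python: the 35TB per cycle is a made-up figure, the 7:1 default is just my rule-of-thumb ratio, and dividing by a flat ratio is a simplification, since the ratio you actually get builds up over several cycles.

```python
def minimum_retention_days(full_cycle_days, cycles=4):
    """Days of backups the deduplication storage should be able to hold."""
    return full_cycle_days * cycles


def estimated_capacity_tb(logical_tb_per_cycle, cycles=4, dedupe_ratio=7.0):
    """Rough physical capacity for the given number of full cycles."""
    return logical_tb_per_cycle * cycles / dedupe_ratio


# Weekly fulls with daily incrementals; 35TB logical per cycle is hypothetical.
print("retention:", minimum_retention_days(7), "days")
print("capacity:", estimated_capacity_tb(35), "TB")

# Monthly fulls, treating a month as 30 days.
print("retention:", minimum_retention_days(30), "days")
```

If the capacity you can afford only covers one or two cycles, that’s the signal that regular disk may be the better buy.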

5 – Have a good cloning strategy

You’ve got deduplication.

You may even have replication between two deduplication units.

But at some point, unless you’re throwing massive amounts of budget at this and have minimal retention times, the chances are that you’re going to have to start writing data out to tape to clear off older content.

Your cloning strategy has to be blazingly fast and damn efficient. A site with 20TB of deduplicated storage should be able to keep at least 4 x LTO-5 drives running at a decent streaming speed in order to push out the data as it’s required. Why? Because it’s rehydrating the data as it streams back out to tape. Oh, I know some backup products offer to write the data out to tape in deduplicated format, but that usually turns out to be bat-shit crazy. Sure, it gets the data out to tape quicker, but once data is on tape you have to start thinking about the amount of time it takes to recover it.
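To get a feel for why the drive count matters, here’s a rough Python sketch of how long a tape-out takes. It assumes the drives are kept streaming at LTO-5’s native 140MB/s, which is the best case without compression; the 20TB of rehydrated data is just an illustrative clone size, not a claim about how much of the unit you’d clone at once.

```python
LTO5_NATIVE_MB_PER_S = 140  # LTO-5 native (uncompressed) transfer rate


def clone_hours(rehydrated_tb, drives, mb_per_s=LTO5_NATIVE_MB_PER_S):
    """Hours to stream rehydrated data to tape with all drives kept busy."""
    total_mb = rehydrated_tb * 1_000_000
    return total_mb / (drives * mb_per_s) / 3600


for drives in (1, 2, 4):
    print(f"{drives} drive(s): {clone_hours(20, drives):.1f} hours for 20TB")
```

Remember that the size going in is the rehydrated size, not what’s sitting on the deduplication unit, and that any drive not kept streaming will stretch these times out further.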

6 – Know your trends

Any deduplication system should support you getting to see what sort of deduplication ratios you’re getting. If it’s got a reporting mechanism, all the better, but in a worst case scenario, be prepared to log in every single day for your backup cycles and see:

-a- What your current global deduplication ratio is

-b- What deduplication ratio you achieved over the past 24 hours

Use that information – store it, map it, and learn from it. When do you get your best deduplication ratios? What backups do they correlate to? More importantly, when do you get your worst deduplication ratios, and what backups do they correlate to?

(The recent addition of DD Boost functionality in NetWorker can make this trivially easy, by the way.)

If you’ve got this information at hand, you can use it to trend and map capacity utilisation within your deduplication system. If you don’t, you’re flying blind with one hand tied behind your back.
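A minimal sketch of the “store it, map it, learn from it” idea follows, assuming you note the two ratios down once a day. The readings are invented sample data and the function name is hypothetical; how you actually pull the ratios out of your deduplication system depends on the product.

```python
from datetime import date
from statistics import mean

# Hypothetical readings noted down once a day:
# (day, global deduplication ratio, ratio achieved over the past 24 hours)
readings = [
    (date(2012, 3, 5), 11.9, 14.0),
    (date(2012, 3, 6), 12.0, 15.5),
    (date(2012, 3, 7), 11.7, 3.2),   # a poor night worth investigating
    (date(2012, 3, 8), 11.8, 13.1),
    (date(2012, 3, 9), 11.6, 6.4),
]


def rank_by_daily_ratio(readings):
    """Return readings sorted from worst to best 24-hour ratio."""
    return sorted(readings, key=lambda r: r[2])


ranked = rank_by_daily_ratio(readings)
worst, best = ranked[0], ranked[-1]
average_daily = mean(r[2] for r in readings)

print(f"worst: {worst[0]:%A %d %b} at {worst[2]}:1")
print(f"best:  {best[0]:%A %d %b} at {best[2]}:1")
print(f"average 24-hour ratio: {average_daily:.1f}:1")
```

Once the worst and best days are identified, the next step is to match them against what was backed up on those nights.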

7 – Know your space reclamation process and speeds

It’s rare for space reclamation to happen immediately in a deduplication system. It may happen daily, or weekly, but it’s unlikely to be instantaneous. (See here for more details.)

Have a strong, clear understanding of:

-a- When your space reclamation runs (obviously, this should be tweaked to your environment)

-b- How long space reclamation typically takes to complete

-c- The impact that the space reclamation operation has on the performance of your deduplication environment

-d- How much capacity you’re likely to reclaim on average

-e- What factors may block reclamation. (E.g., hung replication, etc.)

If you don’t understand this, you’re flying blind and have the other hand tied behind your back, too.

 

I’ve said it before – auditing should only be done by the experts. I first realised this when a security auditor from one of the (then) “Big 5” accounting companies audited the Solaris servers I was administering 11 years ago. Having checked /etc/passwd, the auditor noted in the report:

All user passwords are set to *, which is highly insecure and should be addressed immediately to ensure continued security compliance.

The fallout from that was briefly atrocious, and resolved only by convincing a manager to try to log onto a list of user accounts using * as the password.

It appears that there’s still room for security auditors who don’t really understand security, as evidenced by “Our security auditor is an idiot, how do I give him the information he wants? – Server Fault”. The system administrator was told he had to hand over the following as part of the audit:

  • A list of current usernames and plain-text passwords for all user accounts on all servers
  • A list of all password changes for the past six months, again in plain-text
  • A list of “every file added to the server from remote devices” in the past six months
  • The public and private keys of any SSH keys
  • An email sent to him every time a user changes their password, containing the plain text password

Up until this point, I thought that it would be impossible for anyone to have an experience to trump my “all user passwords are set to *” experience.

It turns out I was wrong.

What’s this got to do with backups, I hear you ask?

Well, everything. If your company is getting in auditors who aren’t subject matter experts (or at least product experts), then your audit isn’t worth the paper it’s written on. Maybe you’ll get a compliance rubber stamp. Maybe you won’t. But it won’t make one iota of difference as to whether there’s been any valid checking of your environment.

Please, ensure that if you want your backups audited you ask some experts in. Knowing the sorts of prices the “big” auditing companies charge, it’ll likely not only cost you less, but actually give you more!

 

Your backup server is behaving perfectly normally, but you want to make one minor change to it. For example, you’ve read the performance tuning guide and realised you need to double the amount of RAM in the server. So you shut it down, install the extra memory, reboot it and it … goes to hell in a handbasket.

What happened?

Maybe filesystems didn’t mount.

Maybe a tape drive or library didn’t reappear.

Maybe … just maybe, someone made a change previously, but either (a) didn’t commit it to happen permanently or (b) didn’t test it with a reboot.

Your backup server is like any other production system, and therefore there’s a strong risk that uncontrolled change will cause issues. So, always make sure you follow these two rules:

  • If you make a change that takes you from a non-working to a working state, make sure you commit the change and reboot to test;
  • If you make an addition to the system that would be lost or otherwise not present after a reboot, make sure you commit the change and have it peer reviewed. If unsure, reboot.

Peer review is everything in these situations, but reboot tests are quite critical. In particular, the more hardware is involved in the system (and nothing says hardware like “tape library”!), the more you should be rigorously testing change. No ifs, no buts. This is important.
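One way to catch a change that was never committed permanently, short of a reboot, is to compare what is mounted right now against what is configured to mount at boot. The sketch below assumes Linux-style /etc/fstab and /proc/mounts layouts (mount point in the second field); the sample data, device names and function names are hypothetical. It is a sanity check, not a substitute for the reboot test.

```python
def mount_points(table_text: str) -> set:
    """Extract mount points (second field) from fstab- or mounts-style text."""
    points = set()
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def not_persistent(fstab_text: str, mounts_text: str) -> set:
    """Mount points that are mounted now but absent from fstab."""
    return mount_points(mounts_text) - mount_points(fstab_text)


# Hypothetical sample data; on a Linux host you would read
# /etc/fstab and /proc/mounts instead.
fstab = """
# device      mount point  type  options   dump pass
/dev/sda1     /            ext4  defaults  0 1
/dev/sdb1     /nsr         ext4  defaults  0 2
"""
mounts = """
/dev/sda1 / ext4 rw 0 0
/dev/sdb1 /nsr ext4 rw 0 0
/dev/sdc1 /backup_staging ext4 rw 0 0
"""

print(not_persistent(fstab, mounts))
```

On a real host the live mount table also lists pseudo-filesystems (proc, sysfs and the like) that never appear in fstab, so the result would need filtering before it is useful.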
 

Consider the following two questions:

  1. Do you manage your backups, or do your backups manage you?
  2. Does your organisation decide how backups should be done based on SLAs, etc., or do the backups dictate how production operates?

As you can well imagine, the answers to the above questions will very quickly tell you whether you’ve got a healthy, or a sick backup environment.

While it’s obvious how both questions should be answered, I’d wager that at least some readers will be getting that little twinge reading the above knowing that I’ve just described their backup environment as sick. And I don’t mean sick as in Gen-Y “fully sick”, I mean unwell.

If your backup environment manages you (most specifically your time and the amount of hair you’ve got left), or your backup environment dictates how production works, then you’ve got some problems you need to address. Now.

A lazy backup admin is a healthy backup admin

In 1996, I joined a system administration team that had one guiding motto: be lazy. Their attitude towards work was without a doubt the most influential one I’ve ever encountered, and it still guides my work life to this day.

I don’t mean lazy as in “avoid work”.

I mean lazy as in “automate! automate! automate!”

As far as they were concerned, the goal of the system administrator should be to automate all regular activities to the point that they should only ever be doing one of four activities:

  1. Automating processes.
  2. Checking results of automated processes.
  3. Waiting for something to go wrong/intervention to be required.
  4. Working on a project.

The same approach should be taken in backups. You should not be, say, mindlessly doing repetitive tasks that could be automated – you should be automating them and then checking the automation results. You shouldn’t be fixing errors on a daily basis; you should have a zero error policy, and treat error processing as an exceptional rather than an everyday task. Or you should be working on the next phase of expanding or updating the backup environment.

Et tu, defendo?

The backup system shouldn’t be ambushing primary production. It should be there as a guardian, a defender – not the system that stabs from the shadows, or hogs the limelight.

Every backup product, and every backup system, will of course have limitations. But these limitations should not prevent critical activities in production from being undertaken. Instead, limitations should be ameliorated such that what needs to be done in production can still be done, with appropriate workarounds in place. If the limitations are hard ones which require a rethink of how production is done, it should not be at the expense of the business functions or the end users. This may require mitigation with other technologies – for instance, a classic scenario in situations where the backup product can’t run backups as frequently as SLAs require is to mix traditional backups and snapshots.

Some SLAs, in the light of the available budget and technology, should be reassessed. However, that’s not to say all of them should be in such situations. A sick backup system is one where any SLA, no matter how justified, is abandoned if it can’t be immediately met by the backup system “as is”.

You’re not the boss of me

So, are you in charge of your backup system, or is your backup system in charge of you?

If you can’t answer that question the right way, it’s time to seize control and make sure next time someone asks you, you can.

© 2012 The NetWorker Blog