It’s that time of the year where I sit back for a moment and look at what articles have attracted the most readers over the year, and it’s a fairly eclectic bunch. Interestingly, for the first time since forever, the article about fixing NSR Peer Information issues didn’t come first – we have some new winners.

10 – New Micromanual – LinuxVTL and NetWorker

The second micromanual was a step-by-step guide for configuring the open source LinuxVTL system with NetWorker. I had hoped when I started writing micromanuals that I’d get them more frequently delivered, but various factors get in the way of this. Maybe in 2012 I’ll be able to get a couple more out and available.

9 – Killing scheduled cloning operations

When NetWorker’s scheduled clone option was introduced, there were a few bugs relating to stopping a scheduled clone operation from the GUI. Sometimes you could, and sometimes you couldn’t. However, you could always kill a scheduled clone job from the command line, which is what this post explained.

8 – NetWorker Firewall Configuration on Windows

Very early in the year I was doing a lot of work with NetWorker on Windows 2008 R2, and I was noticing a few gaps in the installation process when it came to the process of automated configuration of the Windows Firewall to work with NetWorker daemons. This post explained the lessons I learnt.

7 – Carry a jukebox with you (if you’re using Linux)

This article was my first post about configuring the open source LinuxVTL system with NetWorker. Since then LinuxVTL has evolved quite a lot, and I’ll likely even need to update that micromanual early in the new year as a consequence.

6 – Why I’d choose NetWorker over NetBackup Every Time

Despite the fact that the article was titled “Why I’d choose…”, I had a rather indignant response to this post insisting I was being a jerk by writing it. I stand by every word in that post. I would not, personally, elect to choose NetBackup over NetWorker on the basis that NetBackup only has true image recovery as an option, and that NetBackup doesn’t support dependency chains for backup images. I see both of these factors as critical to a true enterprise backup product, and NetBackup only half supports one of them. That doesn’t make me a jerk, it makes me someone who gives a damn about your data.

5 – Using NetWorker Client with Opensolaris

A guest article written by Ronny Egner, this post covered off getting the NetWorker client working with the OpenSolaris version of Solaris.

4 – Basics – Fixing “NSR peer information” errors

A persistent challenge in NetWorker is when the NSR peer information gets out of whack; usually this can happen when a significant change happens on a client, and the server must have this information reset. I’d still love to see this article become irrelevant by seeing an option appear in NMC to handle it, but until then, this will remain a fairly popular article.

3 – This is wrong

Earlier this year, an Australian hosting service lost thousands of hosted domains and websites due to a “hack attack”. Supposedly the clever hackers destroyed not only the production data, but also all the backups.

What really went wrong was that the company in question had designed a very poor and inadequate backup solution. Rumours were abounding at the time that backups were just simply replicated snapshots. Snapshots may be able to act as backups, but not indefinitely, and not if they’re the only thing configured. (Backups and snapshots are effectively ‘sister’ activities in ILP.)

2 – micromanual: NetWorker Power User Guide to nsradmin

The original micromanual – “NetWorker power user guide to nsradmin” – was and remains extremely popular. There have been thousands of downloads of it since its release, including quite a number from EMC themselves, so it’s clearly a handy resource. If you’ve not downloaded it yourself but you want to boost your NetWorker productivity, it’s a must read.

1 – NetWorker 7.6 SP1

When NetWorker 7.6 SP1 came out, it was a huge release. In my opinion, it should have been numbered NetWorker 7.7 at least; it wasn’t a minor set of changes or a round of bug fixes, it included significant functionality updates (including one of my favourites – support for Boost). As the number one read article of the year, it’s been a big resource for people looking at the functionality of newer releases of NetWorker.

And that, they say, is that

This year has personally been a huge year for me. My partner and I moved state/city in June, going from a regional area just outside of Sydney to the inner west of Melbourne. We also celebrated our 15th anniversary together, surrounded by many of our new friends (who are like family to us) and a few of our old friends. We were even invited to get on the radio to talk about that, both because of the longevity of the relationship and because we ran the anniversary party up against the monthly Melbourne Den night. (There’s a podcast coming…) It was also the year when I sorted a lot of stuff out, and to boil all this down: it was the year that I spent a lot of time focusing on my personal life and not so much on the blog.

There may still be one or two posts left for 2011, but I’m also starting to get my head around changes and new material for 2012, and I believe 2012 will be a big year for NetWorker users.

 

For some time I’ve been debating whether to generate podcasts for the NetWorker blog.

Rather than continue to vacillate, I’ve decided to do a sample podcast, make it available here for downloading, and decide what to do based on feedback received.

While raw technical posts don’t translate well to podcasts (how do you quote screen output, for instance?), there are a lot of backup theory related posts I make which can readily be converted.

So, please follow the link below to the first podcast, in which I go over a topic near and dear to my heart: What is a zero error policy?

If you’re interested in me producing more podcasts, please let me know. Without feedback, I’ll likely leave it at just this trial. If people are interested though, I’ll set up a proper podcast stream within iTunes and get to work.

Podcast 001: What is a zero error policy?

Cheers!

 

Tape Lives

When I first started in backup and recovery, my primary backup medium was DDS-1 tapes, distributed across probably 15 servers in a computer room. Over time the number of hosts with dedicated tape drives dropped as systems were consolidated into NetWorker, and the NetWorker server got a couple of gravity-fed DDS autoloaders.

Needless to say, since that point I’ve watched lots of changes in tape technology, particularly since LTO burst onto the scene. DLT had been seemingly stagnant for years, a practical monopoly in the server space, and suffering a severe lack of innovation.

Despite years of various vendors trying to push that tape is dead, we’ll see it remain for some time yet, mainly because it still represents an incredibly economical way of storing large amounts of backup data. Sure, you can avoid using tape if you’ve got replicated backup-to-disk storage between two sites, but that either requires a substantial MAID-style footprint, or some deduplication unit – and either way it’s going to cost you a lot of money. (My personal belief is that 10TB per week backup is the minimum cut-off for consideration of deduplication technologies; and there are a lot of businesses still backing up less than 10TB per week.)

So, here’s what I see as the key continuing trends for tape:

  1. Minimised usage for primary copy – This is a no-brainer, really. Backup to disk has taken over as the primary mechanism in a significant percentage of businesses – the “B2D2T” model, so to speak. There’s no doubt that model will continue, regardless of what that initial “to disk” looks like.
  2. Fallback/secondary copy – Tape will continue to reign supreme as the preferred fallback/secondary copy of backups for some time to come. This decade is indeed the one where some form of backup to disk will become the norm for the vast majority of businesses, but when it comes to those monthly backups that need to be kept for 7+ years, etc., tape will continue to shine.
  3. Enterprise tape is squeezed down – It used to be that there were two distinct tiers of tape: enterprise technology such as LTO (unless you believed the IBM hype that said LTO was toy-tape) and commercial/consumer tape, such as AIT, DDS, etc. That enterprise technology remained largely out of reach of the smaller businesses, but as backup to disk continues to press into the nearline/immediate recovery arena, use of enterprise tape as a primary backup and recovery source will be pushed down into smaller businesses.
  4. Commercial/consumer tape is squeezed out – Those non-enterprise tape formats, such as AIT, DDS, etc., are dead. Sony discontinued AIT to work with HP et al on DDS development, and DDS effectively died at v5. Oh, HP blather on about DDS still having a future – DDS-6/160 was released a while ago, and DDS-7/320 is supposedly in development, but these are dead duck technologies. These non-enterprise tapes were at best unreliable formats – they actually gave a lot of fodder to the “tape is dodgy” meme, and the way they’re kept on life-support by vendors unwilling to concede their time is past is frankly embarrassing.
  5. Deduplication will not migrate in any usable form to tape – Various companies blather about having “deduplication out” to tape from their products, be they target or source deduplication, but this writing of deduplicated data to tape format is fundamentally flawed and logically incompatible. Why? Deduplication requires massive amounts of random access to be able to rehydrate efficiently, but tape is sequential-access by design. So what is written out to tape in “deduplicated” format is entire deduplication environments, which must be read back and recovered to systems before a regular recovery can be run. The result is a situation where recoveries aren’t done unless they’re hyper-critical, because there’s too much effort involved.
  6. Hardware encryption will become the norm – Initially introduced in LTO-4, we’ll see continued adoption of hardware-encryption at the per-cartridge level as businesses become acutely aware of the potential damage caused by media theft. We’re already seeing various countries legislate requiring encryption of at-rest data in particular industries, and this is driving more businesses to use hardware encryption “just in case”.
  7. We’ll continue to be told tape is dead – As sure as the sun rises each day, we’ll awake almost every day to another story about the imminent death of tape.
  8. Direct iSCSI tape drives are here – Some vendors are already selling them; as the war settles between FC and IP, it’s logical that we’ll see tape drives and tape libraries appearing with 10GbE connections. This should make connectivity simpler and quite possibly more flexible.

Other predictions

OK, the above list covers the things I’m certain about. Here are a few things I’m not certain about, but have been idly speculating on for some time…

  1. QR Barcodes – Personally, I think these are a joke. However, I’m betting that someone will start selling combo tape barcodes where for each regular tape barcode you get a QR barcode so that operators and administrators can scan them from their phones, etc. They’ll be sold as allowing a whole new level of integration, automation and control, and a few businesses will get sucked into buying them. They won’t last long though. That’s assuming that QR barcodes themselves stay popular enough for this to happen.
  2. Tape RFID will get bigger – Some tape vendors are already selling tapes with RFID embedded. This’ll be a low-traction market for some time to come, but I suspect it’ll eventually become standard. I.e., this is an evolutionary rather than revolutionary progression in tape.
  3. Hardware twinning with software recognition – RAIT lost its appeal years ago, though some proprietary control systems such as ACSLS still support it. I suspect we’re going to reach a point though where hardware enabled tape twinning will be offered as a feature from those enterprise tape vendors who are being squeezed down. However, the difference will be that there’ll be APIs between the libraries/drives and the backup software to allow the backup software to see the secondary tapes as registered copies. Why? Tracking and accountability. Auditing and data tracking requirements will see to that. I don’t necessarily think that this will gain a lot of traction, but I do think it’ll become an offering again.
 

(Quick note: I posted this on my personal blog – insufficient coffees thus far this morning, and decided to repost here.)

In case it’s not been immediately obvious to anyone, I’ve done some simple diagrams to explain where RIM went wrong in this catastrophic outage they’ve been suffering.

You see, most companies implement what we call redundant infrastructure. In systems that require high availability, this is often accomplished with something as simple as clustered (either LAN or WAN) hardware and communications. Sometimes it’s designed that each component runs at the same time, sharing the load, but if one fails, the other one takes over and runs all the load. In simple terms, it looks like this:

Active/Active Cluster

That all makes sense, right?

Unfortunately, RIM seemed more focused on having failover capabilities for upper level management, so it instead clustered its CEOs:

Active/Active CEOs

The supposed theory behind this is that the two CEOs, working in an active/active arrangement, could handle load better and get the job done better than a single CEO – and provide resiliency!

 

Unfortunately though, the hardware resiliency wasn’t as up to scratch, and when it started to fail, RIM started having a catastrophic outage.

 

Now, you may have expected at that point for the active/active CEO cluster to step in and help. Unfortunately though, they’ve barely been heard from. So, in cluster terms, we have to assume a sort of reversed split-brain situation has occurred, where both components of the cluster think the other component is still running:

RIM-splitbrain

And there you have it – why RIM is having their current outage.

 

It’s also a lesson for all you other companies out there: you need fault tolerant infrastructure as well as CEOs.

 

There’s a pertinent adage in cooking when it comes to using wine in recipes:

If you wouldn’t drink it, don’t cook with it.

It’s simple: if you don’t like the taste of it in a glass, what makes you think you’ll like the taste of food you’ve added it to?

There are two similar rules for backup, and they’re particularly important when it comes time to do those periodic hardware refreshes in your environment:

If it’s not good enough to run production, don’t use it for DR.

If it’s not good enough to run production, don’t use it for backup.

The way in which both of these come into play is quite simple:

  1. If it’s not good enough to run production, don’t use it for DR. I’ve seen companies have a hardware refresh cycle of “move production equipment to DR, buy new production equipment”. However, invariably that equipment is being pulled out of production because it’s either lacking in capacity, or lacking in performance. That equipment is then going to be replaced with new equipment with planned usage time of (typically) 2-3 years. So let’s assume you get a year down the track – your in-use storage capacity has gone up, your processing load has increased, then there’s a major production fault and you have to failover to DR. At which point, you’re trying to run your production environment on something that was sized to max out 12 months ago. Chances of it adequately running production? Minimal.
  2. If it’s not good enough to run production, don’t use it for backup. Another common mistake is a situation whereby, say, a storage array is pulled out of production and replaced with a new, faster array with more capacity. People invariably hate to see things go to waste, so someone suggests “let’s use the old array as {backup to disk | VTL | etc}”. Again, sounds simple enough on the face of it, except the equipment was either lacking in performance, or lacking in capacity. If it was lacking in performance, you’re putting it into a situation where you’re going to be copying off something that was purchased, from the outset, to be significantly faster than it. It’s similar with capacity – you’re going to be trying to back up a very large bucket to a much smaller bucket.

Whether your company likes the idea of it or not, backup and disaster recovery are not areas that should be assigned “hand me downs” by the rest of the business. They require their own capital budget, and planning that allows for the following two factors:

  1. Performance should at least match the throughput on offer from production;
  2. It should exceed your production capacity.

If either of these conditions is not met, your strategy is insufficient.
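The two conditions above can be sketched as a simple check. This is a minimal illustration in Python only; the `System` class, the function names and all of the figures are hypothetical, and the annual growth rate is an assumption you would replace with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    throughput_mb_s: float  # sustained throughput in MB/s
    capacity_tb: float      # usable capacity in TB


def sizing_problems(production, target):
    """Return the reasons a DR or backup target fails the two rules above."""
    problems = []
    # Rule 1: performance should at least match production.
    if target.throughput_mb_s < production.throughput_mb_s:
        problems.append("throughput is below production")
    # Rule 2: capacity should exceed production.
    if target.capacity_tb <= production.capacity_tb:
        problems.append("capacity does not exceed production")
    return problems


def projected_usage_tb(used_tb, annual_growth, years):
    """In-use capacity after compound growth (the growth rate is an assumption)."""
    return used_tb * (1 + annual_growth) ** years


if __name__ == "__main__":
    new_array = System("new production array", 800, 100)
    old_array = System("hand-me-down array", 400, 60)
    for problem in sizing_problems(new_array, old_array):
        print(f"{old_array.name}: {problem}")
```

Running `projected_usage_tb` against the capacity of a hand-me-down system gives a rough idea of how soon production growth will outstrip it after a failover.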

 

Recently I wrote, “7 common problems with deduplication“. That covered some of the practicalities that you need to be aware of. However, that wasn’t a definitive list, and I wanted to expand on that a little with this post.

The considerations I want to cover are:

  1. Architecture – How will it fit together?
  2. Rehydration – Can your pipe accommodate the data?
  3. Redundancy – Are you putting all your eggs in the one basket?
  4. Replicas – How will your copies be handled and recognised by the server?
  5. Long term storage – What is your strategy for longer-term backups?

Each of these includes factors that you have to consider before you go ahead with data deduplication within an environment, and I’ll go through each one individually.

Architecture

If we look at NetWorker and target based deduplication, we run into an interesting architectural issue. The way NetWorker generates multiplexed savesets can have a direct impact on the compressibility of the datastream. In particular, all VTL based deduplication devices should be configured such that each virtual drive has both target and max sessions set to 1.

In a conventional tape or backup-to-disk environment, it’s common to see configurations where 4 or more sessions are streamed to each device. For physical tape, this may be partly due to the need to keep drives streaming, but it can also be to do with making sure that there’s not a backlog of pending savesets, too – i.e., keeping the backup window as narrow as possible.

If we cut away from that process and move to an architecture that has a 1:1 ratio for streams and virtual drives, the logical solution is to increase the number of virtual drives. Typically I’d suggest that there’s at least a 4:1 ratio of virtual drives to physical drives when a VTL is replacing a PTL. I.e., if you had 4 physical drives, you’ll be configuring a VTL with at least 16 virtual drives.

However, if we look at NetWorker licensing, this has an odd effect. VTLs will either get ‘real’ VTL licenses if they’re of a particular EMC brand, or an alternate VTL license bundle, which grants 3 x Unlimited Autochanger licenses per XTB presented by the VTL.

Neither of those licenses is the issue – the issue is actually with NetWorker’s limitations relating to the number of devices per storage node or server. For NetWorker, Network Edition, you’re entitled to:

  • 16 devices on the server;
  • 16 devices on each storage node.

For NetWorker, Power Edition, you’re entitled to:

  • 32 devices on the server;
  • 32 devices on each storage node.

That’s all well and good for physical tape environments – but once you go virtual, those limitations can get very tight, very quickly. (Hint, EMC: Those limitations should be doubled or quadrupled, please.)

The net effect is that if you have say, a 4-drive PTL and a 16-drive VTL, but just a single server, no storage nodes, you’ll need to do one of the following:

  • Upgrade from Network edition to Power Edition, or
  • Purchase an additional storage node license to ‘stack on’ an extra 16 devices.

Yes – you can purchase and add on storage node licenses to increase the permitted device count within the environment, without adding an actual storage node. This is handy to know in normal situations, but when it comes to deduplicating VTLs in particular, it’s a must.
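The device arithmetic above can be sketched as follows. This is a minimal illustration in Python using the limits quoted in this post; the function names are hypothetical and not part of NetWorker, and it assumes that each add-on storage node license contributes the same number of devices as the edition’s per-storage-node limit.

```python
# Device limits per server / storage node, as quoted in this post.
DEVICE_LIMITS = {"network": 16, "power": 32}


def virtual_drives_needed(physical_drives, ratio=4):
    """Suggested minimum virtual drive count when a VTL replaces a PTL (4:1)."""
    return physical_drives * ratio


def permitted_devices(edition, addon_storage_node_licenses=0):
    """Devices allowed on a single server with no real storage nodes.

    Assumption: each add-on storage node license adds the edition's
    per-storage-node device limit to the permitted count.
    """
    limit = DEVICE_LIMITS[edition]
    return limit * (1 + addon_storage_node_licenses)


def fits(edition, ptl_drives, vtl_drives, addon_storage_node_licenses=0):
    """True if all physical and virtual drives fit within the device limit."""
    total = ptl_drives + vtl_drives
    return total <= permitted_devices(edition, addon_storage_node_licenses)


if __name__ == "__main__":
    vtl = virtual_drives_needed(4)      # 16 virtual drives for a 4-drive PTL
    print(fits("network", 4, vtl))      # 20 devices against a limit of 16
    print(fits("network", 4, vtl, 1))   # one add-on storage node license
    print(fits("power", 4, vtl))        # Power Edition limit of 32
```

With a 4-drive PTL and a 16-drive VTL, the 20 devices only fit once the limit is raised, either by an add-on license or by moving to Power Edition.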

Rehydration

It’s all very well to have a fabulous deduplication ratio. Let’s say you’re achieving 10:1 or something along those lines. However, we don’t just deal in deduplicated data. At some point, that data is going to have to be rehydrated. Typically this’ll be for one of the following:

  • As part of a recovery, or
  • For tape-out functionality.

In either case, you’re no longer concerned about the deduplication ratio you’ve achieved, but the amount of rehydrated data you’ll be streaming out. One immediate consideration is that if you’ve deployed deduplication backups for branch-office scenarios, and you’ve been loving the ‘trickle’ effect of only sending unique data across the WAN, you’re going to be somewhat less enamoured by having to send the entire data stream, rehydrated, back across the WAN.

Unless, of course, you’ve architected for that situation.

If you’re doing tape-out – either cloning or staging, then you need to still factor that actual rehydrated size into any sizing calculations for a physical tape library. In particular, a common mistake I’m seeing is that people think that by implementing deduplication they can substantially reduce the number of physical tape drives in the environment. I would suggest that as a general rule of thumb for most sites, a reduction of between one quarter and one third of the physical devices is the most you can hope to achieve. If you pull out more than that, you’re likely going to suffer serious contention during tape out operations. You’ll also be totally blown out of the water whenever there’s a physical fault.
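As a rough sketch of the rehydration sizing described above, the following Python estimates the rehydrated size, the time to move it across a link, and the fewest physical drives to keep under the one-third rule of thumb. The function names and the example figures are hypothetical, and the transfer estimate ignores protocol overhead and contention.

```python
def rehydrated_tb(stored_tb, dedup_ratio):
    """Size of the data once rehydrated, given its deduplicated size on disk."""
    return stored_tb * dedup_ratio


def transfer_hours(size_tb, link_mbit_per_s):
    """Hours to move size_tb across a link, ignoring protocol overhead."""
    size_megabits = size_tb * 1_000_000 * 8  # 1 TB = 1,000,000 MB (decimal units)
    return size_megabits / link_mbit_per_s / 3600


def min_tape_drives(current_drives):
    """Fewest drives to keep if at most one third of them are removed."""
    return current_drives - current_drives // 3


if __name__ == "__main__":
    size = rehydrated_tb(0.5, 10)  # 0.5 TB stored at 10:1
    print(f"{size} TB rehydrated, {transfer_hours(size, 100):.1f} hours at 100 Mbit/s")
    print(f"Keep at least {min_tape_drives(6)} of 6 physical drives")
```

In this example, half a terabyte of deduplicated storage at 10:1 becomes 5 TB on the wire, which is roughly 111 hours over a 100 Mbit/s WAN link.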

Redundancy

Deduplication should never be deployed on its own. E.g., you can’t just have a single Avamar RAIN or a single target deduplication unit. It’s putting all your eggs in one basket. You need some form of atomic-unit redundancy, be that a second grid you replicate to, or a second DD you replicate to, or tape-out.

I’ve heard of solutions deployed that have a single Avamar RAIN for instance – and just a few nodes in the grid – with no tape out, and no replication to another site. I personally think that’s data-suicide. Sure, any individual node in a RAIN can fail and the grid will continue, but you’ve still got the fundamental problem – what happens if you lose your grid?

The same applies to target based deduplication. For ease of consideration, any deduplication configuration, be it Avamar, Data Domain, Quantum, FalconStor or anything else should be considered to have one unit per physical location. And if, under those definitions, you’ve only got one unit – well, you’ve got insufficient redundancy.

Replicas

In particular with target based deduplication, if you’re using the replication functionality of the deduplication device (to avoid a NetWorker clone rehydrate+deduplicate again scenario), you introduce a new challenge – how do you get NetWorker to actually know about the replicas? Items for consideration here are:

  1. Can both replicas be online at the same time? I.e., does the deduplication environment support this?
  2. Will NetWorker perceive the replicas as the same physical media? I.e., do the replicas have the same volume ID? If so, NetWorker won’t permit them to be mounted in two different locations at once.
  3. How ‘atomically’ can replicas be brought online? If replicas do have the same volume ID, what is the smallest replica that can be brought online? Typically this will be either a single virtual tape, or a single disk backup unit. For virtual tapes, that’ll be more manageable. For disk backup units, it presents more of a problem.

Newer technology, such as DD Boost, which integrates NetWorker’s cloning facilities with the inherent replication capabilities of the hardware, addresses this issue. If you’re not using DD Boost though, you need to come up with your own solution.

Long Term Storage

Want deduplication? Want enough deduplication to handle 7 years of backups? 10 years? 15 years? ‘Forever’ years? Long term storage can’t be left by the wayside; you have to plan and architect this into your solution.

Some deduplication vendors (EMC included) are starting to tout new archive credentials in their deduplication arrays, but to be perfectly frank, the long-term cost of maintaining large amounts of either spinning or partially spun down disks with deduplicated storage, vs a batch of tapes with rehydrated storage, is still not at a point that can be entertained by many businesses. Tape is, and shall continue to be cheap for longer term storage and archival storage. Anyone who tries to tell you otherwise likely has a vested interest in dropping more storage on your datacentre floor.

When planning for longer-term storage in a deduplication environment, you have to make a few decisions in advance:

  • Do longer term backups go direct to tape (or conventional disk staging areas) instead of ever hitting deduplicated storage?
  • If the longer-term backups do sit on deduplicated storage, what will be the additional size requirements?
  • Are those size requirements worth it? E.g., if you have to buy a unit that has an additional 20TB of deduplication capabilities in order to hold all the long-term backups that you want to keep ‘nearline’, is it actually worth it, given it’ll always be staged out/relocated to longer-term storage, or do you go for a cheaper initial storage option as well?

Summing up

Between this and other articles, one might think that I’m actually against deduplication. I’m not. However, I am dead-set against the mis-use of technology. Wasteful spending, particularly in the backup environment, just leads to bigger issues – such as artificial and inaccurate budgetary restraints at a later point in time.

When it comes to deduplication, I guess there can only be one rule: eyes wide open.

 

In yesterday’s post, I suggested that it was time for businesses to recognise and setup a new role – the Data Protection Advocate (DPA). This would be the key person tasked within the organisation to think of data protection scenarios, potential gaps, etc., and be the advocate for ensuring that data generated by or on behalf of the company is protected.

However, a DPA by him or herself is probably not going to achieve much within an organisation, so the next step is to try to work out where the DPA fits within the organisational structure. For that, we need a diagram. And here’s one I prepared earlier:

Data Protection Advocate Org Chart

Assuming there are multiple backup administrators within an organisation, there will be fewer DPAs than there are administrators. So, nominally, backup administrators will in some way or another report through to the DPA.

The DPA would logically need to liaise with a large group of people within the organisation. At bare minimum, this would be:

  • Key users – These are the people in each business group who just “know” what is done. They’re the long-term people, the “go to” people within each department. They’re going to have a lot of intrinsic knowledge that the DPA should be regularly mining.
  • Function owners – Previously we’d have called these people the department heads, but functional ownership within businesses is shifting to be broader as traditional employee/management interaction continues to change, so “function owners” seems more appropriate.
  • IT Team Leaders – IT obviously represents a significant portion of the data iceberg within a company, and therefore the DPA should be liaising with each of the team leaders – including storage, virtualisation, networking, security, etc., as well as the traditional server teams.
  • HR/Finance – Smaller organisations traditionally see HR and Finance as a combined group. In larger organisations this will obviously not be the case. Regardless, both HR and Finance will have a very strong understanding of the types of data they need kept and protected. You could argue that this is no different from any other group, but HR and Finance data is usually at the core of the “business critical” data we protect, and thus deserve to be singled out.
  • Legal – Somewhere, someone has to have an understanding of the legal ramifications of (a) choosing not to protect some data or (b) how long data should be kept for. In larger organisations, IT people should be able to consult with someone from corporate legal to get a very clear and straight forward answer.

The DPA however does not work in isolation once the requirements have been gathered. This person will then coordinate with (and be a voting member of) the Information Protection Advisory Council (IPAC). That will be a group of reasonably senior people within the organisation from across a spectrum including IT, Finance and traditional business functions, who are empowered to make decisions that affect the entire company in relation to data protection policies on behalf of the board. For want of a better term, this is the “policy team” for data and information protection. You’ll note that I’ve switched at this level from referring to Data Protection to referring to Information Protection. That’s quite deliberate. The DPA will be concerned with the minutiae of data within the organisation. The IPAC should be able to focus on the broader information view, instead.

Logically, this group will sit at an organisational level on par with the most senior Change Control Board. That board will, for the average organisation, report directly to the CIO.

So there you have it – a new role, and a new group.

Have you appointed a DPA yet? Have you started forming your IPAC yet? If not, get cracking!

 

I think this is a question that the average company wholly, inadequately, fails to understand. You see, when it’s asked, people start thinking about their servers – “data X is backed up, data Y can be reconstructed, so we don’t backup that…”

At the end of this article though, I hope you’ll want to take a walk.

At this point, the average backup administrator is responsible for just the backups of servers and storage servers to which discrete agents can be connected. Yet this is woefully inadequate and demonstrates a wholly inappropriate level of planning within a company. That is, the person or people responsible for core data protection don’t get buy-in or oversight on all data protection.

What else is there within an environment? Well, quite a lot, potentially.

You’ve got the obvious things of course – end user desktops and laptops. Is there potential for local data storage on those machines? If there is, is that data protected?

You’ve got the slightly less obvious things – smart phones with critical business contacts, memos, etc., on them. Is that data being routinely synced? What is it being synced to? Is that synced data accessible if, say, the person leaves? Is that synced data backed up?

Moving right along past the “easy” questions, we’ve got the start of the really tricky questions – look at all the appliances within the organisation. No, I’m not talking about microwaves and toaster ovens in the kitchenettes on each floor. I’m talking about those boxes in racks that don’t have either a traditional operating system or an NDMP agent on them.

The network switches.

The fibre-channel switches.

The PABXs.

The encryption routers.

The encryption FC routers.

And so on.

All of these sorts of devices have configuration/state data on them. A month or so ago, I was talking to another third party consultant at a site, and that person whispered to me, with a slightly deer-in-the-headlights facial expression, “Their SAN FC zoning hasn’t even been saved to the switches, because they’re older and they can’t schedule the outage to save the config.”

And I thought, what sort of bizarro world have I entered? Because I’d bet money that if the running state wasn’t committed, it certainly wasn’t backed up either.

So, here’s my challenge to you, as a backup administrator – take ownership and become a Data Protection Advocate. I know, EMC have a product called DPA, but IT is rife with overloaded TLAs, so this is just another one. You need to stop being just the backup administrator, and start being the company’s Data Protection Advocate (DPA).

And how do you do that? You take a walk:

  1. Grab a notepad or an iPad and a suitable writing implement, be that pen or finger.
  2. Go into the server room.
  3. Note every bit of non-server equipment in that room.
  4. Next, start wandering around the offices.
  5. Note the electronic devices people are using. Smartphones? Tablets? PDAs? (Don’t laugh – I actually saw someone still using a Palm V just three weeks ago.)
  6. Ask at least two or three random people in each workgroup where they save their files to.
  7. Now go to your manager’s office.
  8. Tell your manager you want to have the title of DPA, and explain why.

I would suggest to you that very few organisations, if any, have actually thought through and formalised just how much data goes unprotected on a daily basis. As such, it’s time for a new breed of backup administrators. Why? Because it’s damn unlikely that anyone else in the organisation will have anywhere near the level of appreciation for data protection that you do – because it’s part of your job.

Do you want to be a Backup Administrator, or do you want to be a Data Protection Advocate?

I previously said that backup administrators should be part of the change control process, but realistically this isn’t the case. In fact, the DPA for the organisation should be part of the change control process. That person should be tasked with speaking out on behalf of the data – how will it be protected? How will it be recovered? If it can’t be protected, how can the risk be ameliorated?

What don’t you back up?

Are you ready to be a DPA?

If you are, read on at “But where does the DPA fit in?”

 

In an earlier article, I suggested some space management techniques that need to be foremost in the minds of any deduplication user. Now, more broadly, I want to mention the top 7 things you need to avoid with deduplication:

1 – Watch your multiplexing

Make sure you take note of what sort of multiplexing you can get away with for deduplication. For instance, when using NetWorker with a deduplication VTL, you must use maximum on-tape multiplexing settings of 1; if you don’t, the deduplication system won’t be able to properly process the incoming data. It’ll get stored, but the deduplication ratios will fall through the floor.

A common problem I’ve encountered is a well-running deduplication VTL system which over time ‘suddenly’ stops getting any good deduplication ratio at all. Nine times out of ten the cause was a situation (usually weeks before) where for one reason or another the VTL had to be dropped and recreated in NetWorker, but the target and max sessions values were not readjusted for each of the virtual drives.
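A quick audit after any VTL rebuild can catch this early. The following is only a minimal sketch: it assumes you've transcribed each virtual drive's target and max sessions values into a list of dictionaries, and the `find_multiplexed_drives` helper, the dictionary keys and the drive names are all hypothetical rather than anything NetWorker produces.

```python
def find_multiplexed_drives(drives, limit=1):
    """Return the names of drives whose target or max sessions exceed the limit."""
    return [
        d["name"]
        for d in drives
        if d["target_sessions"] > limit or d["max_sessions"] > limit
    ]


# Hypothetical values, e.g. copied by hand after the VTL was recreated.
drives = [
    {"name": "vtl_drive_01", "target_sessions": 1, "max_sessions": 1},
    {"name": "vtl_drive_02", "target_sessions": 4, "max_sessions": 32},
]

# Any drive listed here would be multiplexing into the deduplication VTL.
print(find_multiplexed_drives(drives))
```

Anything the check reports is a drive that needs its session values set back to 1 before the next backup cycle.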

2 – Get profiled

Sure you could just sign a purchase order for a very spiffy looking piece of deduplication equipment. Everyone’s raving about deduplication. It must be good, right? It must work everywhere, right?

Well, not exactly. Deduplication can make a big impact on the at-rest data footprint of a lot of backup environments, but it can also be a terrible failure if your data doesn’t lend itself well to deduplication. For instance, if your multimedia content is growing, then your deduplication ratios are likely shrinking as well.

So before you rush out and buy a deduplication system, make sure you have some preliminary assessment done of your data. The better the analysis of your data, the better the understanding you’ll have of what sort of benefit deduplication will bring your environment.

Or to say it another way – people who go into a situation with starry eyes can sometimes be blinded.

3 – Assume lower dedupe ratios

A fact sheet has been thrust in front of you! A vendor fact sheet! It says that you’ll achieve a deduplication ratio of 30:1! It says that some customers have been known to see deduplication ratios of 200:1! It says …

Well, vendor fact sheets say a lot of things, and there’s always some level of truth in them.

But, step back a moment and consider compression ratios stated for tapes. Almost all tape vendors give a 2:1 compression ratio – some actually higher. This is all well and good – but now go and run ‘mminfo -mv’ in your environment, and calculate the sorts of compression ratios you’re really getting.
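The arithmetic itself is simple: the amount written to a full tape divided by that tape's native capacity. Here's a minimal sketch, assuming you've noted the amount written to each full volume by hand. The function name and the sample figures are made up for illustration, and 800 GB is used as the native capacity of an LTO-4 cartridge.

```python
def compression_ratio(written_gb, native_capacity_gb):
    """Ratio of data written to a full tape versus its native capacity."""
    return written_gb / native_capacity_gb


# Hypothetical 'written' figures (in GB) for four full LTO-4 volumes.
full_tapes_written_gb = [1100, 1150, 1120, 1230]
LTO4_NATIVE_GB = 800

ratios = [compression_ratio(w, LTO4_NATIVE_GB) for w in full_tapes_written_gb]
average = sum(ratios) / len(ratios)

print(f"average compression ratio: {average:.2f}:1")
```

Only full volumes are worth including; a half-written tape tells you nothing about the ratio it will finish with.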

Compression ratios don’t really equal deduplication ratios of course – there’s a chunk more complexity in deduplication ratios. However, anyone who has been in backup for a while will know that you’ll occasionally get backup tapes with insanely high compression ratios – say, 10:1 or more, but an average for many sites is probably closer to the 1.4:1 mark.

My general rule of thumb these days is to assume a 7:1 deduplication ratio for an ‘average’ site where a comprehensive data analysis has not been done. Anything more than that is cream on top.

4 – Don’t be miserly

Deduplication is not to be treated as a ‘temporary staging area’. Otherwise you’ll have just bought yourself the most expensive backup to disk solution on the market. You don’t start getting any tangible benefit from deduplication until you’ve been backing up for several weeks. If you scope and buy a system that can only hold, say, 1-2 weeks’ worth of data, you may as well just spend the money on regular disk.

I’m starting to come to the conclusion that your deduplication capacity should be able to hold at least 4x your standard full cycle. So if you do full backups once a week and incrementals all other days, you need 4 weeks’ worth of storage. If you do full backups once a month with incrementals/differentials the rest of the time, you need 4 months’ worth of storage.
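As a rough worked example of that sizing rule, the sketch below turns a full cycle length into a minimum retention window and a ballpark physical capacity. It's a sketch under stated assumptions: the function names are hypothetical, the 14 TB per cycle figure is invented, and the default ratio simply reuses the 7:1 rule of thumb from point 3 rather than anything measured.

```python
def minimum_retention_days(full_cycle_days, cycles=4):
    """Days of backups the deduplication unit should be able to hold."""
    return full_cycle_days * cycles


def estimated_physical_tb(logical_tb_per_cycle, cycles=4, assumed_ratio=7.0):
    """Ballpark physical capacity needed, given an ASSUMED reduction ratio."""
    return logical_tb_per_cycle * cycles / assumed_ratio


# Weekly fulls with daily incrementals: a 7 day cycle.
print(minimum_retention_days(7))      # days to keep on the dedupe unit

# Hypothetical site backing up 14 TB (fulls plus incrementals) per cycle.
print(estimated_physical_tb(14))      # TB of physical capacity, roughly
```

Treat the capacity figure as a floor, since it leaves no headroom for growth or for space awaiting reclamation.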

5 – Have a good cloning strategy

You’ve got deduplication.

You may even have replication between two deduplication units.

But at some point, unless you’re throwing massive amounts of budget at this and have minimal retention times, the chances are that you’re going to have to start writing data out to tape to clear off older content.

Your cloning strategy has to be blazingly fast and damn efficient. A site with 20TB of deduplicated storage should be able to keep at least 4 x LTO-5 drives running at a decent streaming speed in order to push out the data as it’s required. Why? Because it’s rehydrating the data as it streams back out to tape. Oh, I know some backup products offer to write the data out to tape in deduplicated format, but that usually turns out to be bat-shit crazy. Sure, it gets the data out to tape quicker, but once data is on tape you have to start thinking about the amount of time it takes to recover it.
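To put rough numbers on the rehydration problem, here's a back-of-the-envelope sketch. It assumes LTO-5's native streaming rate of 140 MB/s per drive, no benefit from tape compression, drives that never stop streaming, and a 7:1 reduction ratio on the 20TB unit; the function name is hypothetical and the ratio is an assumption, not a measurement.

```python
LTO5_NATIVE_MB_PER_S = 140  # native (uncompressed) LTO-5 streaming rate


def hours_to_clone(rehydrated_tb, drives=4, mb_per_s_per_drive=LTO5_NATIVE_MB_PER_S):
    """Hours needed to stream the rehydrated data out across all drives."""
    total_mb = rehydrated_tb * 1_000_000  # decimal TB to MB
    seconds = total_mb / (drives * mb_per_s_per_drive)
    return seconds / 3600


# 20 TB of deduplicated storage at an ASSUMED 7:1 ratio is 140 TB rehydrated.
rehydrated_tb = 20 * 7

print(f"4 drives: {hours_to_clone(rehydrated_tb):.1f} hours")
print(f"1 drive:  {hours_to_clone(rehydrated_tb, drives=1):.1f} hours")
```

Even under these generous assumptions the full contents take days, not hours, to push out, which is why the drives need to be kept streaming.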

6 – Know your trends

Any deduplication system should support you getting to see what sort of deduplication ratios you’re getting. If it’s got a reporting mechanism, all the better, but in a worst case scenario, be prepared to log in every single day for your backup cycles and see:

-a- What your current global deduplication ratio is

-b- What deduplication ratio you achieved over the past 24 hours

Use that information – store it, map it, and learn from it. When do you get your best deduplication ratios? What backups do they correlate to? More importantly, when do you get your worst deduplication ratios, and what backups do they correlate to?

(The recent addition of DD Boost functionality in NetWorker can make this trivially easy, by the way.)

If you’ve got this information at hand, you can use it to trend and map capacity utilisation within your deduplication system. If you don’t, you’re flying blind with one hand tied behind your back.
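Storing and mapping those two daily figures doesn't need anything elaborate. The sketch below is one minimal way to do it, assuming you note the global ratio, the 24 hour ratio and what ran that day; the `DailySample` structure, the helper names and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DailySample:
    day: str
    global_ratio: float    # -a- current global deduplication ratio
    last_24h_ratio: float  # -b- ratio achieved over the past 24 hours
    backups: str           # what ran that day


def worst_days(samples, count=1):
    """Samples with the lowest 24 hour ratio, worst first."""
    return sorted(samples, key=lambda s: s.last_24h_ratio)[:count]


def best_days(samples, count=1):
    """Samples with the highest 24 hour ratio, best first."""
    return sorted(samples, key=lambda s: s.last_24h_ratio, reverse=True)[:count]


# Hypothetical readings taken from the deduplication unit each morning.
samples = [
    DailySample("Mon", 6.8, 9.1, "file server incrementals"),
    DailySample("Tue", 6.7, 2.3, "multimedia share full"),
    DailySample("Wed", 6.9, 11.4, "database fulls"),
]

for s in worst_days(samples):
    print(f"worst: {s.day} at {s.last_24h_ratio}:1 ({s.backups})")
```

Once a few weeks of samples exist, the worst days point straight at the backups that are eating capacity.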

7 – Know your space reclamation process and speeds

It’s rare for space reclamation to happen immediately in a deduplication system. It may happen daily, or weekly, but it’s unlikely to be instantaneous. (See here for more details.)

Have a strong, clear understanding of:

-a- When your space reclamation runs (obviously, this should be tweaked to your environment)

-b- How long space reclamation typically takes to complete

-c- The impact that the space reclamation operation has on the performance of your deduplication environment

-d- How much capacity you’re likely to reclaim, on average

-e- What factors may block reclamation. (E.g., hung replication, etc.)

If you don’t understand this, you’re flying blind and have the other hand tied behind your back, too.

 

I’m not a storage person, as I’ve been at pains to highlight in the past. My personal focus is at all times ILP, not ILM, and so I don’t get all giddy about array speeds and feeds, or anything along those lines.

Of course, if someone were to touch base with me tomorrow and offer me a free 10TB SSD array that I could fit under my desk, my opinion would change.

Cue the chirping crickets.

But seriously, in my “lay technical” view of arrays, I do have a theory about the problems introduced by hot spot migration, and I’m going to throw the theory out there with my reasoning.

First, the background:

  1. When I was taught to program, the credo was “optimise, optimise, optimise”. With limited memory and CPU functionality, we didn’t have the luxury to do lazy programming.
  2. With the staggering increase in processor speeds and memory, many programmers have lost focus on optimisation.
  3. Many second-rate applications can be deemed as such not by pure bugginess, but a distinct lack of optimisation.
  4. The transition from Leopard to Snow Leopard was a perfect example of the impacts of optimisation – the upgrade was about optimisation, not about major new features. And it made a huge difference.

And now, a classic example:

  1. In my first job, I was a system administrator for a very customised SAP system running on Tru64.
  2. Initially the system ran really smoothly all through the week.
  3. Over the 2-3 years I was administering it, rumblings slowly developed that on Fridays the system would get slower and slower.
  4. This always happened while people were entering their timesheets.
  5. Eventually, as part of Y2K remediation, someone took a look at the SQL commands used for timesheets, and noticed that someone had written a really bad query years ago which basically started by selecting all time sheet entries by all employees, then narrowing down. (Your classic problem of having an SQL query select the wrong results first.)
  6. This was fixed.
  7. System performance leapt through the roof.
  8. Users congratulated everyone on the fantastic “upgrade” that was done.
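The shape of that query problem can be shown with a toy example. This is only a sketch using Python's built-in `sqlite3` module and an in-memory database: the `timesheets` table, its columns and the sample data are invented for illustration and bear no relation to the actual SAP schema or the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timesheets (employee_id INTEGER, week TEXT, hours REAL)")
conn.executemany(
    "INSERT INTO timesheets VALUES (?, ?, ?)",
    [(emp, "1999-W30", 7.5) for emp in range(1, 1001)],
)
conn.execute("CREATE INDEX idx_emp_week ON timesheets (employee_id, week)")


def hours_select_everything(conn, employee_id, week):
    # Anti-pattern: fetch every entry for every employee, then narrow down.
    rows = conn.execute("SELECT employee_id, week, hours FROM timesheets").fetchall()
    return sum(h for e, w, h in rows if e == employee_id and w == week)


def hours_narrow_first(conn, employee_id, week):
    # Narrow first: let the database filter, so only matching rows are read.
    row = conn.execute(
        "SELECT COALESCE(SUM(hours), 0) FROM timesheets "
        "WHERE employee_id = ? AND week = ?",
        (employee_id, week),
    ).fetchone()
    return row[0]


print(hours_select_everything(conn, 42, "1999-W30"))
print(hours_narrow_first(conn, 42, "1999-W30"))
```

Both functions return the same answer; the difference is how much data has to be read to get there, and that gap grows with every timesheet entered.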

So, here’s my concern:

  1. For most applications, even complex ones these days, performance will be IO bound before it becomes CPU or memory bound.
  2. Hot spot migration to faster media will mask, but not solve, performance problems such as those described above.
  3. An application administrator (e.g., DBA) trying to solve an application performance problem will find it challenging to work around hot spot migration, particularly if they run multiple attempts to resolve the problem.

The problem, in short, is two-fold:

  1. First, hot spot migration will mask the problem.
  2. Second, hot spot migration will make problem debugging and resolution more problematic.

Clearly, there are solutions to this. As someone said to me by reply today – a lot of what we do in IT already introduces these problems. It’s why, for instance, I’d never configure a NetWorker storage node as a virtual machine, because it’s using shared resources for performance. It’s why, for instance, I’m always reluctant to use blades in the same situation. The solution, I think, is to always be mindful of the following:

  1. Hot spot migration, while fantastic for handling load spikes, masks rather than solves application architecture/design issues.
  2. Hot spot migration, if supported by the array but unknown to the application administrator, at best makes analysis and rectification extremely challenging, and at worst may actually make it impossible.
  3. It will always be important to have the option of turning off hot spot migration for deep analysis and debugging.

At least, that’s what I think. What do you think?
© 2012 The NetWorker Blog