What’s the ravine?

When we talk about data flow rates into a backup environment, it’s easy to focus on the peak speeds – the maximum write performance you can get to a backup device, for instance.

However, sometimes that peak flow rate is almost irrelevant to the overall backup performance.

Backup ravine

Many hosts will exist within an environment where only a relatively modest percentage of their data can be backed up at peak speed; the vast majority of their data will instead be backed up at suboptimal speeds. For instance, consider the following nsrwatch output:

High Performance

That’s a write speed averaging 200MB/s per tape drive (peaks were actually 265MB/s in the above tests), writing around 1.5-1.6GB/s.

However, unless all your data is highly optimised structured data running on high performance hardware with high performance networking, your real-world experiences will vary considerably on a minute to minute basis. As soon as filesystem overheads become a significant factor in the backup activity (i.e., you hit fileservers, regular OS and application parts of the operating system, etc.), your backup performance is generally going to drop by a substantial margin.

This is easy enough to test in real-world scenarios; take a chunk of a filesystem (at least 2x the memory footprint of the host in question), and compare the time to backup:

  • The actual files;
  • A tar of the files.

You’ll see in that situation that there’s a massive performance difference between the two. If you want to see some real-world examples on this, check out “In-lab review of the impact of dense filesystems“.

Unless pretty much all of your data environment consists of optimised structured data which is optimally available, you’ll likely need to focus your performance tuning activities on the performance ravine – those periods of time where performance is significantly sub-optimal. Or to consider it another way – if absolute optimum performance is 200MB/s, spending a day increasing that to 205MB/s doesn’t seem productive if you also determine that 70% of the time the backup environment is running at less than 100MB/s. At that point, you’re going to achieve much more if you flatten the ravine.

Looking for a quick fix

There’s various ways that you can aim to do this. If we stick purely within the backup realm, then you might look at factoring in some form of source based deduplication as well. Avamar, for instance, can ameliorate some issues associated with unstructured data. Admittedly though, if you don’t already have Avamar in your environment, adding it can be a fairly big spend, so it’s at the upper range of options that may be considered, and even then won’t necessarily always be appropriate, depending on the nature of that unstructured data.

Traditional approaches have included sending multiple streams per filesystem, and (in some occasions) considering block-level backup of filesystem data (e.g., via SnapImage – though, increasing virtualisation is further reducing SnapImage’s number of use-cases), or using NDMP if the data layout is more amenable to better handling by a NAS device.

What the performance ravine demonstrates is that backup is not an isolated activity. In many organisations there’s a tendency to have segmentation along the lines of:

  • Operating system administration;
  • Application/database administration;
  • Virtualisation teams;
  • Storage teams;
  • Backup administration.

Looking for the real fix

In reality, fixing the ravine needs significant levels of communication and cooperation between the groups, and, within most organisations, a merger of the final three teams above, viz:

Crossing the ravine

The reason we need such close communication, and even team merger, is that baseline performance improvement can only come when there’s significant synergy between the groups. For instance, consider the classic dense-filesystem issue. Three core ways to solve it are:

  • Ensure the underlying storage supports large numbers of simultaneous IO operations (e.g., a large number of spindles) so that multistream reads can be achieved;
  • Shift the data storage across to NAS, which is able to handle processing of dense filesystems better;
  • Shift the data storage across to NAS, and do replicated archiving of infrequently accessed data to pull the data out of the backup cycle all together.

If you were hoping this article might be about quick fixes to the slower part of backups, I have to disappoint you: it’s not so simple, and as suggested by the above diagram, is likely to require some other changes within IT.

If merger in itself is too unwieldy to consider, the next option is the forced breakdown of any communication barriers between those three groups.

A ravine of our own making

In some senses, we were spoilt when gigabit networking was introduced; the solution became fairly common – put the backup server and any storage nodes on a gigabit core, and smooth out those ravines by ensuring that multiple savesets would always be running; therefore even if a single server couldn’t keep running at peak performance, there was a high chance that aggregated performance would be within acceptable levels of peak performance.

Yet unstructured data has grown at a rate which quite frankly has outstripped sequential filesystem access capabilities. It might be argued that operating system vendors and third party filesystem developers won’t make real inroads on this until they can determine adequate ways of encapsulating unstructured filesystems in structured databases, but development efforts down that path haven’t as yet yielded any mainstream available options. (And in actual fact just caused massive delays.)

The solution as environments switch over to 10Gbit networking however won’t be so simple – I’d suggest it’s not unusual for an environment with 10TB of used capacity to have a breakdown of data along the lines of:

  • 4 TB filesystem
  • 2 TB database (prod)
  • 3 TB database (Q/A and development)
  • 500 GB mail
  • 500 GB application & OS data

Assuming by “mail” we’ve got “Exchange”, then it’s quite likely that 5.5TB of the 10TB space will backup fairly quickly – the structured components. That leaves 4.5TB hanging around like a bad smell though.

Unstructured data though actually proves a fundamental point I’ve always maintained – that Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP) are two reasonably independent activities. If they were the same activity, then the resulting synergy would ensure the data were laid out and managed in such a way that data protection would be a doddle. Remember that ILP resembles the following:

Components of ILP

One place where the ravine can be tackled more readily is in the deployment of new systems, which is where that merger of storage, backup and virtualisation comes in, not to mention the close working relationship between OS, Application/DB Admin and the backup/storage/virtualisation groups. Most forms and documents used by organisations when it comes to commissioning new servers will have at most one or two fields for storage – capacity and level of protection. Yet, anyone who works in storage, and equally anyone who works in backup will know that such simplistic questions are the tip of the iceberg for determining performance levels, not only for production access, but also for backup functionality.

The obvious solution to this is service catalogues that cover key factors such as:

  • Capacity;
  • RAID level;
  • Snapshot capabilities;
  • Performance (IOPs) for production activities;
  • Performance (MB/s) for backup/recovery activities (what would normally be quantified under Service Level Agreements, also including recovery time objectives);
  • Recovery point objectives;
  • etc.

But what has all this got to do with the ravine?

I said much earlier in the piece that if you’re looking for a quick solution to the poor-performance ravine within an environment, you’ll be disappointed. In most organisations, once the ravine appears, there’ll need to be at least technical and process changes in order to adequately tackle it – and quite possibly business structural changes too.

Take (as always seems to be the bad smell in the room) unstructured data. Once it’s built up in a standard configuration beyond a certain size, there’s no “easy” fix because it becomes inherently challenging to manage. If you’ve got a 4TB filesystem serving end users across a large department or even an entire company, it’s easy enough to think of a solution to the problem, but thinking about a problem and solving it are two entirely different things, particularly when you’re discussing production data.

It’s here where team merger seems most appropriate; if you take storage in isolation, a storage team will have a very specific approach to configuring a large filesystem for unstructured data access – the focus there is going to be on maximising the number of concurrent IOs and ensuring that standard data protection is in place. That’s not, however, always going to correlate to a configuration that lends itself to traditional backup and recovery operations.

Looking at ILP as a whole though – factoring in snapshot, backup and replication, you can build an entirely different holistic data protection mechanism. Hourly snapshots for 24-48 hours allow for near instantaneous recovery – often user initiated, too. Keeping one of those snapshots per day for say, 30 days, extends this considerably to cover the vast number of recovery requests a traditional filesystem would get. Replication between two sites (including the replication of the snapshots) allows for a form of more traditional backup without yet going to a traditional backup package. For monthly ‘snapshots’ of the filesystem though, regular backup may be used to allow for longer term retention. Suddenly when the ravine only has to be dealt with once a month rather than daily, it’s no longer much of an issue.

Yet, that’s not the only way the problem might be dealt with – what if 80% of that data being backed up is stagnant data that hasn’t been looked at in 6 months? Shouldn’t that then require deleting and archiving? (Remember, first delete, then archive.)

I’d suggest that a common sequence of problems when dealing with backup performance runs as follows:

  1. Failure to notice: Incrementally increasing backup runtimes over a period of weeks or months often don’t get noticed until it’s already gone from a manageable problem to a serious problem.
  2. Lack of ownership: Is a filesystem backing up slowly the responsibility of the backup administrators or the operating system administrators, or the storage administrators? If they are independent teams, there will very likely be a period where the issue is passed back and forth for evaluation before a cooperative approach (or even if a cooperative approach) is decided upon.
  3. Focus on the technical: The current technical architecture is what got you into the mess – in and of itself, it’s not necessarily going to get you out of the mess. Sometimes organisations focus so strongly on looking for a technical solution that it’s like someone who runs out of fuel on the freeway running to the boot of their car, grabbing a jerry can, then jumping back in the driver’s seat expecting to be able to drive to the fuel station. (Or, as I like to put it: “Loop, infinite: See Infinite Loop; Infinite Loop: See Loop, Infinite”.)
  4. Mistaking backup for recovery: In many cases the problem ends up being solved, but only for the purposes of backup, without attention to the potential impact that may make on either actual recoverability or recovery performance.

The first issue is caused by a lack of centralised monitoring. The second, by a lack of centralised management. The third, by a lack of centralised architecture, and the fourth, by a lack of IT/business alignment.

If you can seriously look at all four of those core issues and say replacing LTO-4 tape drives with LTO-5 tape drives will 100% solve a backup-ravine problem every time, you’re a very, very brave person.

If we consider that backup-performance ravine to be a real, physical one, the only way you’re going to get over it is to build a bridge, and that requires a strong cooperative approach rather than a piecemeal approach that pays scant regard for anything other than the technical.

I’ve got a ravine, what do I do?

If you’re aware you’ve got a backup-performance ravine problem plaguing your backup environment, the first thing you’ve got to do is to pull back from the abyss and stop staring into it. Sure, in some cases, a tweak here or a tweak there may appear to solve the problem, but likely it’s actually just addressing a symptom, instead. One symptom.

Backup-performance ravines should in actual fact be viewed as an opportunity within a business to re-evaluate the broader environment:

  1. Is it time to consider a new technical architecture?
  2. Is it time to consider retrofitting an architecture to the existing environment?
  3. Is it time to evaluate achieving better IT administration group synergy?
  4. Is it time to evaluate better IT/business alignment through SLAs, etc.?

While the problem behind a backup-performance ravine may not be as readily solvable as we’d like, it’s hardly insurmountable – particularly when businesses are keen to look at broader efficiency improvements.

 

There’s a pertinent adage in cooking when it comes to using wine in recipes:

If you wouldn’t drink it, don’t cook with it.

It’s simple: if you don’t like the taste of it in a glass, what makes you think you’ll like the taste of food you’ve added it to?

There are two similar rules for backup, and they’re particularly important when it comes time to do those periodic hardware refreshes in your environment:

If it’s not good enough to run production, don’t use it for DR.

If it’s not good enough to run production, don’t use it for backup.

The way in which both of these come into play is quite simple:

  1. If it’s not good enough to run production, don’t use it for DR. I’ve seen companies have a hardware refresh cycle of “move production equipment to DR, buy new production equipment”. However, invariably that equipment is being pulled out of production because it’s either lacking in capacity, or lacking in performance. That equipment is then going to be replaced with new equipment with planned usage time of (typically) 2-3 years. So let’s assume you get a year down the track – your in-use storage capacity has gone up, your processing load has increased, then there’s a major production fault and you have to failover to DR. At which point, you’re trying to run your production environment on something that was sized to max out 12 months ago. Chances of it adequately running production? Minimal.
  2. If it’s not good enough to run production, don’t use it for backup. Another common mistake is a situation whereby say, a storage array is pulled out of production and replaced with a new, faster array with more capacity. People invariably hate to see things go to waste, so someone suggests “let’s use the old array as {backup to disk | VTL | etc}”. Again, sounds simple enough on the face of it, except the equipment was either lacking in performance, or lacking in capacity. If it was lacking in performance, you’re putting it into a situation where you’re going to be copying off something that is purchased, on the outset, to be significantly faster than it. It’s similar with capacity – you’re going to be trying to backup a very large bucket to a much smaller bucket.

Whether your company likes the idea of it or not, backup and disaster recovery are not areas that should be assigned “hand me downs” by the rest of the business. They require their own capital budget, and a planning that allows for the following two factors:

  1. Performance should at least match the throughput on offer from production;
  2. It should exceed your production capacity.

If either of these conditions are not met, your strategy is insufficient.

 

I’m not a storage person, as I’ve been at pains to highlight in the past. My personal focus is at all times ILP, not ILM, and so I don’t get all giddy about array speeds and feeds, or anything along those lines.

Of course, if someone were to touch base with me tomorrow and offer me a free 10TB SSD array that I could fit under my desk, my opinion would change.

Queue the chirping crickets.

But seriously, in my “lay technical” view of arrays, I do have this theory and the problems introduced by hot spot migration, and I’m going to throw the theory out there with my reasoning.

First, the background:

  1. When I was taught to program, the credo was “optimise, optimise, optimise”. With limited memory and CPU functionality, we didn’t have the luxury to do lazy programming.
  2. With the staggering increase in processor speeds and memory, many programmers have lost focus on optimisation.
  3. Many second-rate applications can be deemed as such not by pure bugginess, but a distinct lack of optimisation.
  4. The transition from Leopard to Snow Leopard was a perfect example of the impacts of optimisation – the upgrade was about optimisation, not about major new features. And it made a huge difference.
And now, a classic example:
  1. In my first job, I was a system administrator for a very customised SAP system running on Tru64.
  2. Initially the system ran really smoothly all through the week.
  3. Over the 2-3 years I was administering, rumbling slowly developed that on Friday the system would get slower and slower.
  4. This always happened while people were entering their timesheets.
  5. Eventually, as part of Y2K remediation, someone took a look at the SQL commands used for timesheets, and noticed that someone had written a really bad query years ago which basically started by selecting all time sheet entries by all employees, then narrowing down. (Your classic problem of having an SQL query select the wrong results first.)
  6. This was fixed.
  7. System performance leapt through the roof.
  8. Users congratulated everyone on the fantastic “upgrade” that was done.
So, here’s my concern:
  1. For most applications, even complex ones these days, performance will be first IO bound before they become CPU or memory bound.
  2. Hot spot migration to faster media will mask, but not solve performance problems such as those described above.
  3. An application administrator (e.g., DBA) trying to solve application performance will find it challenging to resolve it around hot spot migration, particularly if they run multiple attempts to resolve the problem.
The problem, in short, is two-fold:
  1. First, hot spot migration will mask the problem.
  2. Second, hot spot migration will make problem debugging and resolution more problematic.
Clearly, there’s solutions to this. As someone said to me by reply today – a lot of what we do in IT already introduces these problems. It’s why, for instance, I’d never configure a NetWorker storage node as a virtual machine, because it’s using shared resources for performance. It’s why for instance, I’m always reluctant to use blades in the same situation. The solution, I think, is to to always be mindful of the following:
  1. Hot spot migration, while fantastic for handling load spikes, masquerades rather than solves application architecture/design issues.
  2. Hot spot migration, if supported by the array, but unknown by the application administrator, at best makes analysis and rectification extremely challenging, and at worst may actually make it impossible.
  3. It will always be important to have the option of turning off hot spot migration for deep analysis and debugging.
At least, that’s what I think. What do you think?
 

We are approaching the point where it would be conceivable for someone to build a PB disk system in their home. It wouldn’t be cheap, but it’s a hell of a lot cheaper than at any point in computing history.

Do you think I’ve gone crazy? Do the math – using 3TB drives, you’d only need 342 drives to get to 1PB. If you wanted to do it really nasty, you could use USB drives, 7-port hubs and a series of PCIe USB cards.

You can get standard motherboards now with 8 PCIe ports. 8 x 4-port USB-2 cards would yield 32 incoming USB channels. Throw a 7-port USB hub onto the end of each of those channels, and you’d get 224 USB connections. Not quite 342, so expand your design a little bit, deploy 2 hosts, split the drives between them and throw in a clustered filesystem across them.

Violá! It’ll be a messy pile of cables and you’ll need decent 3-phase power coming into your house, but you’ll have a PB of storage! Here’s the start of the system diagram before I got too bored and realised I’d need a much bigger screen:

Performance or it performs?

And if you haven’t lost your current meal laughing at this yet – I’ll state the brutally obvious. Performance would suck. For very, very large values of suck. Reliability would suck too, for equally large values of suck.

But you’d have a PB of storage.

That’s what it’s all about, isn’t it? Well, no. And we all know that.

So if we recognise that, when we look at a blatantly absurd example, why is it that we can be sucked into equally absurd configurations?

I remember in the early noughties, when the CLARiiON line got its first ATA line. The sales guys at the company I worked for at the time started selling FC arrays to customers with snapshots going to ATA disk to make snapshots cheap.

The snapshots were indeed cheap.

But as soon as the customers started using their storage with active snapshots, their performance sucked – for large values of suck.

And the customers got angry.

And there was much running around and gnashing of teeth.

And I, a software-only guy sat there thinking “who would be insane enough to think that this would have been OK?”

It’s the old saying – you can have cheap, good or fast. Pick two.

Sometimes, cheap means you don’t even get to pick a second option.

By now you’re probably think that I’m on some meandering ramble that doesn’t have a point. And, if you’re about to give up and close the browser tab, you’d be right.

But, if you went onto this paragraph, you’d get to read my point: if you wouldn’t use a particular server, storage array or configuration in a primary production server, you shouldn’t be considering it for the backup server that has to protect those primary production servers either.

No, not in scenario X, or possibility Y. If you wouldn’t use it for primary production, you don’t use it for support production.

It’s that simple.

And if you want to, feel free to go build that cheap and cheerful 1PB storage array for your production storage. I’m sure your users will love it – about as much as they’d love it if you used that 1PB array to backup your production systems and then had to do a recovery from it. That’s my point; it’s not just about buying something that performs for backup – it’s about having something that has sufficient performance for recovery.

So if someone starts talking to you about deploying X for backup, and X is a bit slower, but it’s a lot cheaper, just consider this: would you use X for your primary production system? If the answer is no, then you’d better have some damn good performance data at hand to show that it’s appropriate for the backup of those primary production systems.

 

Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has done any surveys would claim (rightly so) that my leading questions into the survey may have coloured the results of the survey, and I’d not disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to backup to disk to the backup server, but not using other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with a question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past; and other than offering better rack density I see them having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to backup to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than say, a virtual machine accessing a physical tape drive does.

So, the survey showed me fairly much what I was expecting I’d see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

- if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
- the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
- I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of all the things I hadn’t considered, it was the simple ability to move your backup server between virtual servers wasn’t what I’d considered. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to a critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations of running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold:

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. Secondary to this: DR should not be dependent on the original and the destination host having the same names.)

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.

 

Pumping data

The age-old consideration in backup is the most simple one: how to pump the required data through in the required time frame in such a way that it can be readily recovered. This challenges us to constantly find the best way to achieve the data throughput required. What worked 10 years ago was not always applicable 5 years ago; what worked 5 years ago is not always applicable now. Consider for instance the adage:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

(Andrew Tanenbaum, 1996.)

What surprises me, to a degree, is that still, in 2011, we’re having discussions about data throughput where people focus on the wrong thing. I would humbly respect, that you shouldn’t give a flying fracas about how fast  you can back your data up when compared to how fast you can recover it.

That’s right: when talking feeds and speeds, the only one to give a damn about in backup is how quickly you can recover the data once it’s been captured.

This is, in fact, why the terms RPO and RTO were invented. In particular for the topic of “pumping data”, RTO – Recovery Time Objective – is most important. How quickly do you need to get the data back?

In this scenario, Andrew Tanenbaum’s caution about a station wagon full of tapes hurtling down the highway is entirely appropriate. In fact, so much so that when companies start talking about how fast they need to backup (or how fast they can backup) without reference to recovery, I unfortunately go into this loop:

Why? Because it’s like when my grandmother wants to tell me a story about how she bumped into someone she hadn’t seen for 57 years in the supermarket, but gets stuck on an irrelevant detail. “Peaches or pears!” I used to say to her as a kid, perhaps a little disrespectfully – it didn’t matter whether she was out shopping for peaches or pears before the important thing happened! Same here – it doesn’t matter how fast you can pump data into the backup system – it’s how fast you can pump data out of it that is the only number worth focusing on.

We have to, as storage industry insiders, experts, advisors, consultants – whatever we want to call ourselves – keep vendors and customers focused on the real important metric: how fast they can recover. We have a duty of care to stand between the FUD and the hype and steer companies on a safe trajectory. The safe trajectory in this case is talking about recovery speeds rather than backup speeds.

This is, for instance, why I rarely get excited about remote office backup strategies. For instance, a current meme in remote office backup strategy is the use of deduplication – most likely source based. The goal? Reduce the amount of data you have to transfer from the remote office to the head office to a small trickle, and all your problems are solved … until, of course, you need to recover that data.

Don’t get me wrong, I’m not against remote office backups – I’m also not against centralised remote office backups, regardless of whether they’re achieved by deduplication, compression, magic pixies or faerie dust. In this example though there’s a simple fact: to talk about remote office backup without discussing remote office recovery is reprehensible.

Yes, reprehensible. I’ll use that term. It’s not a nice term, I know, but nor is the practice of ignoring the elephant in the room – recovery.

Look folks, do you really want me to prance around a stage doing the monkey dance shouting “Recovery! Recovery! Recovery!”? Is that what it has to take? Because, if it is, I’ll do it. (I might, if you don’t mind, try to avoid the flop sweat though.)

What am I asking for? Maybe it’s this simple thought:

Starting this year, let no company (vendor or otherwise) talk about a product’s backup performance without citing real world recovery scenarios and performance in those scenarios.

There is not a guaranteed 1:1 mapping between backup and recovery performance, and to imply there is, either by obfuscation or omission is disrespectful to the data protection industry.

 

There’s a lot of talk about tracking data growth by watching SAN and NAS usage, counting allocated storage by the gigabyte, etc., but I’ve always thought that backup and recovery systems offered an elegant way of closely monitoring data growth within an environment.

Recently I was asked to contribute some articles about how backup and recovery can help to improve IT processes and performance within an organisation, and the first thing that occurred to me was to write about this very topic.

If you’re worried about tracking and trending data growth within your environment, and want to see some simple examples of how to account for peaks and troughs in backup capacity while still predicting data growth, please head over to “Using Backup and Recovery to Track and Forecast Data Growth” at IT Performance Improvement.

 

While NetWorker 7.6 is not available for download as of the time I write this, the documentation is available on PowerLink. For those of you chomping at the bit to at least read up on NetWorker 7.6, now is the time to wander over to PowerLink delve into the documentation.

The last couple of releases of NetWorker have been interesting for me when it comes to beta testing. In particular, I’ve let colleagues delve into VCB functionality, etc., and I’ve stuck to “niggly” things – e.g., checking for bugs that have caused us and our customers problems in earlier versions, focusing on the command line, etc.

For 7.6 I also decided to revisit the documentation, particularly in light of some of the comments that regularly appear on the NetWorker mailing list about the sorry state of the Performance Tuning and Optimisation Guide.

It’s pleasing, now that the documentation is out, to read the revised and up to date version of the Performance Tuning Guide. Regularly critics of the guide for instance will be pleased to note that FDDI does not appear once. Not once.

Does it contain every possible useful piece of information that you might use when trying to optimise your environment? No, of course not – nor should it. Everyone’s environment will differ in a multitude of ways. Any random system patch can affect performance. A single dodgy NIC can affect performance. A single misconfigured LUN or SAN port can affect performance.

Instead, the document now focuses on providing a high level overview of performance optimisation techniques.

Additionally, recommendations and figures have been updated to support current technology. For instance:

  • There’s a plethora of information on PCI-X vs PCIeXpress.
  • RAM guidelines for the server based on the number of clients has been updated.
  • NMC finally gets a mention as a resource hog! (Obviously, that’s not the words used, but it’s the implication for larger environments. I’ve been increasingly encouraging larger customers to put NMC on a separate host for this reason.)
  • There’s a whole chunk on client parallelism optimisation, both for the clients and the backup server itself.

I don’t think this document is perfect, but if we’re looking at the old document vs the new, and the old document scored a 1 out of 10 on the relevancy front, this at least scores a 7 or so, which is a vast improvement.

Oh, one final point – with the documentation now explicitly stating:

The best approach for client parallelism values is:

– For regular clients, use the lowest possible parallelism settings to best balance between the number of save sets and throughput.

– For the backup server, set highest possible client parallelism to ensure that index backups are not delayed. This ensures that groups complete as they should.

Often backup delays occur when client parallelism is set too low for the NetWorker server. The best approach to optimize NetWorker client performance is to eliminate client parallelism, reduce it to 1, and increase the parallelism based on client hardware and data configuration.

(My emphasis)

Isn’t it time that the default client parallelism value were decreased from the ridiculously high 12 to 1, and we got everyone to actually think about performance tuning? I was overjoyed when I’d originally heard that the (previous) default parallelism value of 4 was going to be changed, then horrified when I found out it was being revised up, to 12, rather than down to 1.

Anyway, if you’ve previously dismissed the Performance Tuning Guide as being hopelessly out of date, it’s time to go back and re-read it. You might like the changes.

 

On Linu, filesystems typically have two settings regarding getting complete checks on boot. These are:

  • Maximum number of mounts before a check
  • Interval between checks

The default settings, while reasonably suitable for smaller partitions, are very unsuitable for large partitions, such as what you find in disk backup units. In fact, if you don’t pay particular attention to these settings, you may find after a routine reboot that your backup server (or storage node) can take hours to become available. For instance, it’s not unheard of to see even sub-20TB DBU environments (as say, 10 x 2TB filesystems) take several hours to complete mandatory checks on filesystems after what should have just been a routine reboot.

There are two approaches that you can take to this:

  • If you want to leave the checks enabled, it’s reasonably imperative to ensure that at most only one disk backup unit filesystem will be checked at one time after a reboot; this will at least reduce the size of any check-on-reboot. Thus, ensure you:
    • Configure each filesystem so that it will have a different number of maximum mounts before check than any other filesystem, and,
    • Configure the interval (days) between checks for each filesystem to be a significantly different number.
  • If you don’t want periodic filesystem checks to ever interfere with the reboot process, you need to:
    • Ensure that following a non-graceful restart of the server the DBU filesystems are unmounted and checked before any new backup or recovery activities are done, and,
    • Ensure that there are processes – planned maintenance windows if you will – for manual running of the filesystem checks that are being skipped.

Neither option is particularly “attractive”. In the first case, you can still, if you cherish uptime or don’t need to reboot your backup server often, get into a situation where multiple filesystems need to be checked on reboot if they’ve all exceeded their days-between-checks parameter. In the second instance, you’re having to insert human driven processes into what should normally be a routine operating system function. In particular with the manual option, there must be a process in place to NetWorker shutdown + checking even in the middle of the night if an OS crash occurs.

Actually, the above list is a little limited – there’s a couple of other options that you can consider as well – though they’re a little more left of field:

  • Build into the change control process the timings for complete filesystem checks in case they happen, or
  • Build into the change control process or reboot procedure for the backup server/storage nodes the requirement to temporarily disable filesystem checks (using say, tune2fs) so that you know the reboot to be done won’t be costly in terms of time.

Personally, I’m looking forward to btrfs – in reality, a modern filesystem such as that should solve most, if not all, of the problems discussed above.

 

The Register has some coverage at the moment of Intel demonstrating a (highly customised/optimised) 7 disk SSD configuration which delivered 1 million IOPS on a desktop configuration. As the article says, regardless of the level of tweaking to get there, this is a fabulous example of the world that is to come with SSD.

Clearly this is still a while off from regular commercial use for the “average” business, but regardless, it’s a fascinating development.

There’s good cause, when you look at these sorts of figures, to see why most storage vendors are getting onto the SSD bandwagon and declaring SSDs to be the “zeroth tier” in storage performance.

© 2012 The NetWorker Blog Suffusion theme by Sayontan Sinha