I’m not a storage person, as I’ve been at pains to highlight in the past. My personal focus is at all times ILP, not ILM, and so I don’t get all giddy about array speeds and feeds, or anything along those lines.

Of course, if someone were to touch base with me tomorrow and offer me a free 10TB SSD array that I could fit under my desk, my opinion would change.

Cue the chirping crickets.

But seriously, in my “lay technical” view of arrays, I do have a theory about the problems introduced by hot spot migration, and I’m going to throw the theory out there with my reasoning.

First, the background:

  1. When I was taught to program, the credo was “optimise, optimise, optimise”. With limited memory and CPU functionality, we didn’t have the luxury to do lazy programming.
  2. With the staggering increase in processor speeds and memory, many programmers have lost focus on optimisation.
  3. Many second-rate applications can be deemed as such not by pure bugginess, but a distinct lack of optimisation.
  4. The transition from Leopard to Snow Leopard was a perfect example of the impacts of optimisation – the upgrade was about optimisation, not about major new features. And it made a huge difference.
And now, a classic example:
  1. In my first job, I was a system administrator for a very customised SAP system running on Tru64.
  2. Initially the system ran really smoothly all through the week.
  3. Over the 2-3 years I was administering it, rumblings slowly developed that on Fridays the system would get slower and slower.
  4. This always happened while people were entering their timesheets.
  5. Eventually, as part of Y2K remediation, someone took a look at the SQL commands used for timesheets, and noticed that someone had written a really bad query years ago which basically started by selecting all time sheet entries by all employees, then narrowing down. (Your classic problem of having an SQL query select the wrong results first.)
  6. This was fixed.
  7. System performance leapt through the roof.
  8. Users congratulated everyone on the fantastic “upgrade” that was done.
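
To make the query problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names and data are entirely hypothetical; nothing here reflects the actual SAP schema. It only illustrates the shape of the mistake: selecting everything first and narrowing afterwards, versus letting the database narrow first.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timesheet (employee_id INTEGER, week TEXT, hours REAL)")
conn.execute("CREATE INDEX idx_emp_week ON timesheet (employee_id, week)")
rows = [(emp, f"1999-W{wk:02d}", 38.0) for emp in range(1, 201) for wk in range(1, 53)]
conn.executemany("INSERT INTO timesheet VALUES (?, ?, ?)", rows)


def slow_lookup(employee_id, week):
    # Anti-pattern: fetch every entry for every employee, then narrow down afterwards.
    everything = conn.execute(
        "SELECT employee_id, week, hours FROM timesheet"
    ).fetchall()
    return [r for r in everything if r[0] == employee_id and r[1] == week]


def fast_lookup(employee_id, week):
    # Let the database narrow first, so only the matching rows are ever returned.
    return conn.execute(
        "SELECT employee_id, week, hours FROM timesheet "
        "WHERE employee_id = ? AND week = ?",
        (employee_id, week),
    ).fetchall()
```

Both functions return the same rows; the difference is how much data has to be read to get there. Faster media would speed up `slow_lookup`, but the wasted work would still be there.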
So, here’s my concern:
  1. For most applications, even complex ones these days, performance will be I/O bound before it becomes CPU or memory bound.
  2. Hot spot migration to faster media will mask, but not solve, performance problems such as those described above.
  3. An application administrator (e.g., DBA) trying to solve application performance will find it challenging to resolve it around hot spot migration, particularly if they run multiple attempts to resolve the problem.
The problem, in short, is two-fold:
  1. First, hot spot migration will mask the problem.
  2. Second, hot spot migration will make problem debugging and resolution more problematic.
Clearly, there are solutions to this. As someone said to me by reply today – a lot of what we do in IT already introduces these problems. It’s why, for instance, I’d never configure a NetWorker storage node as a virtual machine, because it’s using shared resources for performance. It’s why, for instance, I’m always reluctant to use blades in the same situation. The solution, I think, is to always be mindful of the following:
  1. Hot spot migration, while fantastic for handling load spikes, masquerades rather than solves application architecture/design issues.
  2. Hot spot migration, if supported by the array, but unknown by the application administrator, at best makes analysis and rectification extremely challenging, and at worst may actually make it impossible.
  3. It will always be important to have the option of turning off hot spot migration for deep analysis and debugging.
At least, that’s what I think. What do you think?
 

We are approaching the point where it would be conceivable for someone to build a PB disk system in their home. It wouldn’t be cheap, but it’s a hell of a lot cheaper than at any point in computing history.

Do you think I’ve gone crazy? Do the math – using 3TB drives, you’d only need 342 drives to get to 1PB. If you wanted to do it really nasty, you could use USB drives, 7-port hubs and a series of PCIe USB cards.

You can get standard motherboards now with 8 PCIe ports. 8 x 4-port USB-2 cards would yield 32 incoming USB channels. Throw a 7-port USB hub onto the end of each of those channels, and you’d get 224 USB connections. Not quite 342, so expand your design a little bit, deploy 2 hosts, split the drives between them and throw in a clustered filesystem across them.
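
For anyone who wants to check the arithmetic, here is a quick sketch. It assumes binary units (1PB = 1024TB), which is what makes 342 the drive count, and it assumes every hub port carries exactly one drive.

```python
import math

TB_PER_PB = 1024          # binary units, assumed; this is what yields 342 drives
DRIVE_TB = 3
PCIE_SLOTS = 8
USB_PORTS_PER_CARD = 4
HUB_PORTS = 7

drives_needed = math.ceil(TB_PER_PB / DRIVE_TB)
ports_per_host = PCIE_SLOTS * USB_PORTS_PER_CARD * HUB_PORTS
hosts_needed = math.ceil(drives_needed / ports_per_host)

print(drives_needed, ports_per_host, hosts_needed)  # prints: 342 224 2
```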

Voilà! It’ll be a messy pile of cables and you’ll need decent 3-phase power coming into your house, but you’ll have a PB of storage! Here’s the start of the system diagram before I got too bored and realised I’d need a much bigger screen:

Performance or it performs?

And if you haven’t lost your current meal laughing at this yet – I’ll state the brutally obvious. Performance would suck. For very, very large values of suck. Reliability would suck too, for equally large values of suck.

But you’d have a PB of storage.

That’s what it’s all about, isn’t it? Well, no. And we all know that.

So if we recognise that, when we look at a blatantly absurd example, why is it that we can be sucked into equally absurd configurations?

I remember in the early noughties, when the CLARiiON line got its first ATA line. The sales guys at the company I worked for at the time started selling FC arrays to customers with snapshots going to ATA disk to make snapshots cheap.

The snapshots were indeed cheap.

But as soon as the customers started using their storage with active snapshots, their performance sucked – for large values of suck.

And the customers got angry.

And there was much running around and gnashing of teeth.

And I, a software-only guy, sat there thinking “who would be insane enough to think that this would have been OK?”

It’s the old saying – you can have cheap, good or fast. Pick two.

Sometimes, cheap means you don’t even get to pick a second option.

By now you’re probably thinking that I’m on some meandering ramble that doesn’t have a point. And, if you’re about to give up and close the browser tab, you’d be right.

But, if you went on to this paragraph, you’d get to read my point: if you wouldn’t use a particular server, storage array or configuration in a primary production server, you shouldn’t be considering it for the backup server that has to protect those primary production servers either.

No, not in scenario X, or possibility Y. If you wouldn’t use it for primary production, you don’t use it for support production.

It’s that simple.

And if you want to, feel free to go build that cheap and cheerful 1PB storage array for your production storage. I’m sure your users will love it – about as much as they’d love it if you used that 1PB array to back up your production systems and then had to do a recovery from it. That’s my point; it’s not just about buying something that performs for backup – it’s about having something that has sufficient performance for recovery.

So if someone starts talking to you about deploying X for backup, and X is a bit slower, but it’s a lot cheaper, just consider this: would you use X for your primary production system? If the answer is no, then you’d better have some damn good performance data at hand to show that it’s appropriate for the backup of those primary production systems.

 

Pumping data

The age-old consideration in backup is the most simple one: how to pump the required data through in the required time frame in such a way that it can be readily recovered. This challenges us to constantly find the best way to achieve the data throughput required. What worked 10 years ago was not always applicable 5 years ago; what worked 5 years ago is not always applicable now. Consider for instance the adage:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

(Andrew Tanenbaum, 1996.)

What surprises me, to a degree, is that still, in 2011, we’re having discussions about data throughput where people focus on the wrong thing. I would humbly suggest that you shouldn’t give a flying fracas about how fast you can back your data up when compared to how fast you can recover it.

That’s right: when talking feeds and speeds, the only one to give a damn about in backup is how quickly you can recover the data once it’s been captured.

This is, in fact, why the terms RPO and RTO were invented. In particular for the topic of “pumping data”, RTO – Recovery Time Objective – is most important. How quickly do you need to get the data back?
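
As a rough sketch of what that question looks like in numbers, the check below compares an estimated recovery time against an RTO. The function names and the example figures are my own assumptions for illustration; a real estimate would use measured restore throughput, not a guess.

```python
def recovery_hours(data_gb, restore_mb_per_s):
    """Hours needed to restore data_gb at a sustained restore rate (1GB = 1024MB)."""
    return data_gb * 1024 / restore_mb_per_s / 3600


def meets_rto(data_gb, restore_mb_per_s, rto_hours):
    """True if the estimated recovery time fits inside the recovery time objective."""
    return recovery_hours(data_gb, restore_mb_per_s) <= rto_hours


# Hypothetical example: 2000GB to recover against a 4 hour RTO.
print(round(recovery_hours(2000, 80), 2))   # prints: 7.11
print(meets_rto(2000, 80, 4))               # prints: False
print(meets_rto(2000, 160, 4))              # prints: True
```

Note that the backup speed doesn't appear anywhere in the calculation.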

In this scenario, Andrew Tanenbaum’s caution about a station wagon full of tapes hurtling down the highway is entirely appropriate. In fact, so much so that when companies start talking about how fast they need to back up (or how fast they can back up) without reference to recovery, I unfortunately go into this loop:

Why? Because it’s like when my grandmother wants to tell me a story about how she bumped into someone she hadn’t seen for 57 years in the supermarket, but gets stuck on an irrelevant detail. “Peaches or pears!” I used to say to her as a kid, perhaps a little disrespectfully – it didn’t matter whether she was out shopping for peaches or pears before the important thing happened! Same here – it doesn’t matter how fast you can pump data into the backup system – it’s how fast you can pump data out of it that is the only number worth focusing on.

We have to, as storage industry insiders, experts, advisors, consultants – whatever we want to call ourselves – keep vendors and customers focused on the really important metric: how fast they can recover. We have a duty of care to stand between the FUD and the hype and steer companies on a safe trajectory. The safe trajectory in this case is talking about recovery speeds rather than backup speeds.

This is, for instance, why I rarely get excited about remote office backup strategies. For instance, a current meme in remote office backup strategy is the use of deduplication – most likely source based. The goal? Reduce the amount of data you have to transfer from the remote office to the head office to a small trickle, and all your problems are solved … until, of course, you need to recover that data.

Don’t get me wrong, I’m not against remote office backups – I’m also not against centralised remote office backups, regardless of whether they’re achieved by deduplication, compression, magic pixies or faerie dust. In this example though there’s a simple fact: to talk about remote office backup without discussing remote office recovery is reprehensible.
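
Here's a back-of-the-envelope sketch of why. Every figure in it is an assumption chosen for illustration (data size, daily change, deduplication ratio, link speed), and it assumes a full recovery has to pull the whole data set back across the same link with nothing useful left at the remote office.

```python
def transfer_hours(gigabytes, link_mbit_per_s):
    """Hours to move the given amount of data over a link, ignoring protocol overhead."""
    megabits = gigabytes * 1024 * 8
    return megabits / link_mbit_per_s / 3600


# All figures below are assumptions chosen for illustration.
FULL_DATA_GB = 500
DAILY_CHANGE_GB = 10
DEDUP_RATIO = 10
WAN_MBIT_PER_S = 10

nightly_backup = transfer_hours(DAILY_CHANGE_GB / DEDUP_RATIO, WAN_MBIT_PER_S)
full_recovery = transfer_hours(FULL_DATA_GB, WAN_MBIT_PER_S)

print(f"nightly backup: {nightly_backup:.2f} h, full recovery: {full_recovery:.1f} h")
```

With these made-up numbers the nightly backup is a trickle of about a quarter of an hour, while the full recovery runs to well over four days.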

Yes, reprehensible. I’ll use that term. It’s not a nice term, I know, but nor is the practice of ignoring the elephant in the room – recovery.

Look folks, do you really want me to prance around a stage doing the monkey dance shouting “Recovery! Recovery! Recovery!”? Is that what it has to take? Because, if it is, I’ll do it. (I might, if you don’t mind, try to avoid the flop sweat though.)

What am I asking for? Maybe it’s this simple thought:

Starting this year, let no company (vendor or otherwise) talk about a product’s backup performance without citing real world recovery scenarios and performance in those scenarios.

There is not a guaranteed 1:1 mapping between backup and recovery performance, and to imply there is, either by obfuscation or omission, is disrespectful to the data protection industry.

 

While NetWorker 7.6 is not available for download as of the time I write this, the documentation is available on PowerLink. For those of you chomping at the bit to at least read up on NetWorker 7.6, now is the time to wander over to PowerLink and delve into the documentation.

The last couple of releases of NetWorker have been interesting for me when it comes to beta testing. In particular, I’ve let colleagues delve into VCB functionality, etc., and I’ve stuck to “niggly” things – e.g., checking for bugs that have caused us and our customers problems in earlier versions, focusing on the command line, etc.

For 7.6 I also decided to revisit the documentation, particularly in light of some of the comments that regularly appear on the NetWorker mailing list about the sorry state of the Performance Tuning and Optimisation Guide.

It’s pleasing, now that the documentation is out, to read the revised and up to date version of the Performance Tuning Guide. Regular critics of the guide, for instance, will be pleased to note that FDDI does not appear once. Not once.

Does it contain every possible useful piece of information that you might use when trying to optimise your environment? No, of course not – nor should it. Everyone’s environment will differ in a multitude of ways. Any random system patch can affect performance. A single dodgy NIC can affect performance. A single misconfigured LUN or SAN port can affect performance.

Instead, the document now focuses on providing a high level overview of performance optimisation techniques.

Additionally, recommendations and figures have been updated to support current technology. For instance:

  • There’s a plethora of information on PCI-X vs PCI Express.
  • RAM guidelines for the server based on the number of clients have been updated.
  • NMC finally gets a mention as a resource hog! (Obviously, that’s not the words used, but it’s the implication for larger environments. I’ve been increasingly encouraging larger customers to put NMC on a separate host for this reason.)
  • There’s a whole chunk on client parallelism optimisation, both for the clients and the backup server itself.

I don’t think this document is perfect, but if we’re looking at the old document vs the new, and the old document scored a 1 out of 10 on the relevancy front, this at least scores a 7 or so, which is a vast improvement.

Oh, one final point – with the documentation now explicitly stating:

The best approach for client parallelism values is:

– For regular clients, use the lowest possible parallelism settings to best balance between the number of save sets and throughput.

– For the backup server, set highest possible client parallelism to ensure that index backups are not delayed. This ensures that groups complete as they should.

Often backup delays occur when client parallelism is set too low for the NetWorker server. The best approach to optimize NetWorker client performance is to eliminate client parallelism, reduce it to 1, and increase the parallelism based on client hardware and data configuration.

(My emphasis)

Isn’t it time that the default client parallelism value were decreased from the ridiculously high 12 to 1, and we got everyone to actually think about performance tuning? I was overjoyed when I’d originally heard that the (previous) default parallelism value of 4 was going to be changed, then horrified when I found out it was being revised up, to 12, rather than down to 1.

Anyway, if you’ve previously dismissed the Performance Tuning Guide as being hopelessly out of date, it’s time to go back and re-read it. You might like the changes.

 

Back when I first started doing enterprise backup, DLT 7000 had just been introduced. There were a few systems I had to administer that still had DLT 4000 drives attached, but DLT 7000 was rapidly becoming the standard.

With DLT 7000 came a batch of additional headaches, most notably: how do I keep the damn thing streaming? With a 5MB/s write speed and at least half of the servers in my environment still connected by 10Mbit rather than 100Mbit ethernet, keeping a drive of that speed streaming was a challenge involving juggling of backup timings and parallelism.
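
To put numbers on that juggling act, here's a small sketch. It assumes each client can saturate its own link and ignores protocol overhead entirely, so the real-world answer would be worse; the function name is just something I've made up for the example.

```python
import math


def streams_needed(drive_mb_per_s, client_link_mbit_per_s):
    """Minimum number of clients, each saturating its own link, needed to feed the drive."""
    client_mb_per_s = client_link_mbit_per_s / 8   # line rate, no protocol overhead
    return math.ceil(drive_mb_per_s / client_mb_per_s)


print(streams_needed(5, 10))    # DLT 7000 fed by 10Mbit clients, prints: 4
print(streams_needed(5, 100))   # DLT 7000 fed by 100Mbit clients, prints: 1
```

So even in the best case, four 10Mbit clients had to be backing up flat out at the same time just to keep one drive busy.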

Fast forward 13 years, and we’ve come full circle. For a while systems and networks leapfrogged tape, or at least were able to mostly keep up with tape, but we’re now, with high speed tape like LTO-4, back to a situation where the average site will struggle to keep tape streaming.

First, I guess I should qualify – what’s this streaming that I refer to? If you want to get down to the utter nuts and bolts of it, it refers to keeping the tape running through the drive mechanism at a consistent (and high) number of metres per second. (For instance, several LTO-4 drives are rated at 7 metres per second.) In backup terms, what we’re talking about is keeping a consistently high number of MB/s running to the drive.

When we’re unable to keep a consistently high number of MB/s running to the drive, one of two things will typically happen – if the drive is able to (and it depends entirely on the manufacturer and tape format), it may “step down” its streaming speed to a number that is more suitable to the environment. This has variable success. You might be able to argue it’s like only ever going up to 3rd gear in a Ferrari, but I don’t know cars so that’s likely to be a terrible analogy for a whole suite of reasons I don’t understand … :-)

The second thing that may happen is that the tape will start to shoe-shine. Shoe-shining is where the minimum threshold throughput for drive streaming can’t be achieved. The drive eventually starts stopping and starting when its buffers are emptied, etc., and this slows the backup down even further, plus creates additional wear and tear both on drives and on media.

To be blunt – the minimum goal of any backup administrator when it comes to performance tuning an environment should be to eliminate shoe-shining wherever possible.

So, back to that “full circle”: just as we were years ago, we’re now at the point again where keeping media streaming is a real challenge.

One problem that frequently occurs on new sites is that when evaluating tape formats for purchase, they look at that magic “bang for buck” number – the size of the media, in GB. For this reason, LTO-4 looks appealing to a large number of sites – 800 GB native, 1.6TB compressed (assuming 2:1 compression), it just seems like a great media format.

The problem that frequently happens though is that the streaming speed isn’t taken into consideration. LTO-4 on average has an uncompressed streaming speed of 120MB/s. This is not easy to achieve, and as you can imagine, achieving faster with compression is even more challenging.

Now, there are undoubtedly big environments that can easily keep LTO-4 streaming with direct backups from client to tape. But these aren’t your average environments. Look at the speed – 120MB/s – that’s more than a single gigabit ethernet link can deliver in practice (its theoretical ceiling is 125MB/s before protocol overhead). We’re immediately talking either large trunked environments at both the server and the clients, or stepping up to 10 gigabit ethernet. We’re talking lots of spindles on high speed disk. Or to be perhaps a little crass, we’re talking buckets of $$$.

To me then the primary impact of high speed tape on backup is the need for organisations to rethink backup when using high speed tape. Using even LTO-3, it was possible for a gigabit based environment to achieve a modicum of tape streaming just by using higher levels of parallelism, etc. However, once you reach the point where your average streaming speed for native/uncompressed backups exceeds your average network speed, you must adjust the backup architecture.
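
That rule of thumb can be sketched as a simple check. The 0.9 efficiency factor is an assumed allowance for protocol overhead, not a measured figure, and the function names are made up for this example; the native speeds used are 120MB/s for LTO-4 and 80MB/s for LTO-3.

```python
def usable_mb_per_s(link_mbit_per_s, efficiency=0.9):
    """Approximate usable network throughput in MB/s (efficiency is an assumption)."""
    return link_mbit_per_s / 8 * efficiency


def needs_disk_tier(tape_native_mb_per_s, link_mbit_per_s, efficiency=0.9):
    """True when the tape's native streaming speed exceeds what the network can feed it."""
    return tape_native_mb_per_s > usable_mb_per_s(link_mbit_per_s, efficiency)


print(needs_disk_tier(120, 1000))    # LTO-4 over gigabit, prints: True
print(needs_disk_tier(80, 1000))     # LTO-3 over gigabit, prints: False
print(needs_disk_tier(120, 10000))   # LTO-4 over 10 gigabit, prints: False
```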

The most common, and most appropriate way to achieve this is to move to a 2-tier storage system, comprising a layer of disk and then the layer of tape.

Within NetWorker, there are two ways to achieve this:

  • First backup to disk backup units (ADV_FILE devices), then clone/stage to tape.
  • First backup to virtual tape libraries (VTLs), then clone/stage to tape.

The purpose of either of these mechanisms is to put all the backups that would be done overnight, etc., into a single location where once it is streamed to tape the network is no longer a factor.

So, if we go down the disk backup unit option, this would mean attaching some high speed storage to the backup server (or a storage node – let’s assume in this instance that every time I say “backup server”, I could equally mean “storage node”), and also attaching the LTO-4 drives to the backup server. When the backup is initially done though, it is run across the network to the backup server’s disk backup units. Once the backup completes, the backup server first runs cloning operations to write tape copies – without the network in play, and assuming we have suitable hardware connectivity, we should be able to easily keep LTO-4 streaming from one consistent and uninterrupted read from high speed disk. At a later point, we then stage that data – write a second copy which, when it completes, removes the copy from the disk backup unit.
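
As a rough sizing sketch for that clone window, the function below estimates how long it takes to write a night's backups from disk to tape when the drives really are kept streaming. The 4TB figure is an assumption for illustration, and the calculation assumes the disk can feed every drive at the full 120MB/s native rate with no compression.

```python
def clone_hours(backup_tb, drive_mb_per_s=120, drives=1):
    """Hours to clone backup_tb from disk to tape with all drives streaming (1TB = 1024 * 1024MB)."""
    total_mb = backup_tb * 1024 * 1024
    return total_mb / (drive_mb_per_s * drives) / 3600


# Hypothetical example: 4TB of overnight backups sitting on the disk backup units.
print(round(clone_hours(4), 2))             # one LTO-4 drive, prints: 9.71
print(round(clone_hours(4, drives=2), 2))   # two LTO-4 drives, prints: 4.85
```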

(I should note, there’s a raft of other options that can be deployed to assist with getting high speed tape streaming, many of which I discuss in the performance tuning section of my book. I’ve just picked the most common scenario here.)

If we go down the VTL path, we’re still essentially relying on the same mechanism, but in a different format. That is, we’re relying on the scenario that once all the data we want to transfer out to physical tape is on one “chunk” of high speed disk, we can do that transfer at streaming speed.

My first recommendation then to any site that is using LTO-4* in a direct-to-tape scheme, and can’t get drives streaming, is that they need to rethink their backup architecture. In the end it doesn’t matter how much time you spend tweaking software settings here and there: if the hardware can’t cut it, you won’t get it.


* More generally, as you may have imagined, this can apply to any tape format where, as I mentioned earlier in the article, the native streaming speed exceeds the native network speed.

© 2012 The NetWorker Blog