I’m not a storage person, as I’ve been at pains to highlight in the past. My personal focus is at all times ILP, not ILM, and so I don’t get all giddy about array speeds and feeds, or anything along those lines.

Of course, if someone were to touch base with me tomorrow and offer me a free 10TB SSD array that I could fit under my desk, my opinion would change.

Cue the chirping crickets.

But seriously, in my “lay technical” view of arrays, I do have a theory about the problems introduced by hot spot migration, and I’m going to throw the theory out there with my reasoning.

First, the background:

  1. When I was taught to program, the credo was “optimise, optimise, optimise”. With limited memory and CPU functionality, we didn’t have the luxury to do lazy programming.
  2. With the staggering increase in processor speeds and memory, many programmers have lost focus on optimisation.
  3. Many second-rate applications can be deemed as such not because of pure bugginess, but because of a distinct lack of optimisation.
  4. The transition from Leopard to Snow Leopard was a perfect example of the impacts of optimisation – the upgrade was about optimisation, not about major new features. And it made a huge difference.
And now, a classic example:
  1. In my first job, I was a system administrator for a very customised SAP system running on Tru64.
  2. Initially the system ran really smoothly all through the week.
  3. Over the 2-3 years I was administering it, rumblings slowly developed that on Fridays the system would get slower and slower.
  4. This always happened while people were entering their timesheets.
  5. Eventually, as part of Y2K remediation, someone took a look at the SQL commands used for timesheets, and noticed that someone had written a really bad query years ago which basically started by selecting all time sheet entries by all employees, then narrowing down. (Your classic problem of having an SQL query select the wrong results first.)
  6. This was fixed.
  7. System performance leapt through the roof.
  8. Users congratulated everyone on the fantastic “upgrade” that was done.
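
To make the timesheet story concrete, here is a minimal sketch of the two query shapes using Python's built-in sqlite3 module. The table name, columns and row counts are all hypothetical (the original SAP query isn't available), so treat it as an illustration of the pattern rather than a reconstruction of what was actually fixed.

```python
import sqlite3


def build_db(n_employees=50, entries_each=40):
    """Create a hypothetical in-memory timesheet table (illustrative schema only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE timesheet (emp_id INTEGER, week INTEGER, hours REAL)")
    rows = [(e, w, 7.5) for e in range(n_employees) for w in range(entries_each)]
    conn.executemany("INSERT INTO timesheet VALUES (?, ?, ?)", rows)
    # An index lets the database find one employee's week without a full scan.
    conn.execute("CREATE INDEX idx_emp_week ON timesheet (emp_id, week)")
    return conn


def hours_select_everything(conn, emp_id, week):
    """Anti-pattern: pull every entry for every employee, then narrow down."""
    all_rows = conn.execute("SELECT emp_id, week, hours FROM timesheet").fetchall()
    matching = [h for (e, w, h) in all_rows if e == emp_id and w == week]
    return sum(matching), len(all_rows)  # (answer, rows read from the database)


def hours_select_narrow(conn, emp_id, week):
    """Narrow first: let the WHERE clause and index do the filtering."""
    rows = conn.execute(
        "SELECT hours FROM timesheet WHERE emp_id = ? AND week = ?",
        (emp_id, week),
    ).fetchall()
    return sum(h for (h,) in rows), len(rows)


conn = build_db()
print(hours_select_everything(conn, 3, 5))
print(hours_select_narrow(conn, 3, 5))
```

Both functions return the same answer, but the first one reads the whole table to get it. Faster media would shorten that read, yet the wasted IO would still be there.
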
So, here’s my concern:
  1. For most applications, even complex ones these days, performance will be IO bound before it becomes CPU or memory bound.
  2. Hot spot migration to faster media will mask, but not solve, performance problems such as those described above.
  3. An application administrator (e.g., DBA) trying to solve application performance problems will find it challenging to do so around hot spot migration, particularly if they make multiple attempts to resolve the problem.
The problem, in short, is two-fold:
  1. First, hot spot migration will mask the problem.
  2. Second, hot spot migration will make problem debugging and resolution more problematic.
Clearly, there are solutions to this. As someone said to me by reply today – a lot of what we do in IT already introduces these problems. It’s why, for instance, I’d never configure a NetWorker storage node as a virtual machine, because it’s using shared resources for performance. It’s why, for instance, I’m always reluctant to use blades in the same situation. The solution, I think, is to always be mindful of the following:
  1. Hot spot migration, while fantastic for handling load spikes, masquerades rather than solves application architecture/design issues.
  2. Hot spot migration, if supported by the array but unknown to the application administrator, at best makes analysis and rectification extremely challenging, and at worst may actually make it impossible.
  3. It will always be important to have the option of turning off hot spot migration for deep analysis and debugging.
At least, that’s what I think. What do you think?
 

We all have appliances, right? Teapots and toasters and microwaves and automatic coffee machines, etc. They’re all appliances. So are clock radios, electric razors, heaters and fans.

They’re appliances.

VTLs, SANs and NASs are not appliances, despite what any vendor would try to tell you. As soon as you’ve got an OS + software layer, you’re moving beyond “appliance” into “black box”. Or maybe we’re talking the difference between an appliance and an Appliance. If a vendor wants to tell you otherwise, they’re not telling you the whole story.

There’s a simple test on whether you’re being sold an appliance, or an Appliance – a simple yes/no question:

Is there a training course for the unit or an instruction manual with more than 1 page of instructions per language?

If the answer is “no”, then congratulations, you’ve got an appliance; if the answer is “yes”, then despite whatever your vendor wants to tell you, you’ve got an Appliance.

Now, there’s nothing wrong with having an Appliance within your organisation, and in fact I’d suggest that frequently they add a lot of value. VTLs, SANs and NASs, to use the example I previously provided, are all capable of greatly extending the storage and data protection options within your environment and should of course be considered in many architectures.

Knowing that they’re Appliances rather than appliances though means that you can treat them appropriately. I personally don’t care about backing up my toaster, or keeping a close eye on the logs from my microwave. As the appliance complexity increases, I pay more attention – so for instance the most critical appliance in my home is arguably the automatic espresso machine, and since it has blinking lights that can tell me whether I’m able to get a cup of coffee from it or not, I pay attention to it.

Extending this process, when you move from having appliances in your organisation to having Appliances, it’s critical that they are treated as full blown systems that require the same level of support, administration and consideration as any other system when it comes to problem resolution. Or another way to consider it, from a support perspective – if there’s an error happening in your environment, don’t ignore the “black boxes” when it comes to problem diagnosis. This means being aware of at least the following:

  • How to view basic status;
  • How to extract logs;
  • Any caveats to reading logs (e.g., are they time/date stamped using a different GMT offset to your environment?);
  • How to review the logs;
  • How to escalate requests to the Appliance vendor.
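
As an illustration of the GMT offset caveat in the list above, here is a small sketch that shifts an Appliance log timestamp into the offset used by the rest of the environment, so events can be lined up against other logs. The timestamp format and both offsets are assumptions made for the example; a real unit's log format and clock settings need to be checked in its own documentation.

```python
from datetime import datetime, timedelta, timezone


def normalise_timestamp(stamp, appliance_utc_offset_hours, site_utc_offset_hours,
                        fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a log timestamp from the Appliance's UTC offset to the site's.

    The format string and both offsets are hypothetical inputs, not values
    taken from any particular vendor's logs.
    """
    appliance_tz = timezone(timedelta(hours=appliance_utc_offset_hours))
    site_tz = timezone(timedelta(hours=site_utc_offset_hours))
    # The log line carries no zone information, so attach the Appliance's offset.
    stamped = datetime.strptime(stamp, fmt).replace(tzinfo=appliance_tz)
    return stamped.astimezone(site_tz).strftime(fmt)


# An Appliance logging in GMT, read from a site running at GMT+10.
print(normalise_timestamp("2009-11-15 23:30:00", 0, 10))
```

In this example an error logged late on the 15th by the Appliance's clock happened on the morning of the 16th local time, which is exactly the kind of mismatch that sends diagnosis down the wrong path.
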

Once you’ve been working with Appliances for a while, all of these start to come naturally. The big trick for beginners in the Appliance realm though is to ignore the “black box” you’ve been sold and instead be aware of the components and how to access the diagnostic information for the unit. If you can’t, you’ve created a “black hole” – and that’s not something you’ll get a lot of satisfaction from.

 

With their recent acquisition of Data Domain, some people at EMC have become table thumping experts overnight on why it’s absolutely imperative that you backup to Data Domain boxes as disk backup over NAS, rather than a fibre-channel connected VTL.

Their argument seems to come from the numbers – the wrong numbers.

The numbers constantly quoted are the number of sales of disk backup Data Domain vs VTL Data Domain. That is, some EMC and Data Domain reps will confidently assert that by the numbers, a significantly higher percentage of Data Domain for Disk Backup has been sold than Data Domain with VTL. That’s like saying that Windows is superior to Mac OS X because it sells more. Or to perhaps pick a slightly less controversial topic, it’s like saying that DDS is better than LTO because there have been more DDS drives and tapes sold than there have ever been LTO drives and tapes.

I.e., an argument by those numbers doesn’t wash. It rarely has, it rarely will, and nor should it. (Otherwise we’d all be afraid of sailing too far from shore because that’s how it had always been done before…)

Let’s look at the reality of how disk backup currently stacks up in NetWorker. And let’s preface this by saying that if backup products actually started using disk backup properly tomorrow, I would be the first to shout “Don’t let the door hit your butt on the way out” to every VTL on the planet. As a concept, I wish VTLs didn’t have to exist, but in the practical real world, I recognise their need and their current ascendency over ADV_FILE. I have, almost literally at times, been dragged kicking and screaming to that conclusion.

Disk Backup, using ADV_FILE type devices in NetWorker:

  • Can’t move a saveset from a full disk backup unit to a non-full one; you have to clear the space first.
  • Can’t simultaneously clone from, stage from, backup to and recover from a disk backup unit. No, you can’t do that with tape either, but when disk backup units are typically in the order of several terabytes, and virtual tapes are in the order of maybe 50-200 GB, that’s a heck of a lot less contention time for any one backup.
  • Use tape/tape drive selection algorithms for deciding which disk backup unit gets used in which order, resulting in worst case capacity usage scenarios in almost all instances.
  • Can’t accept a saveset bigger than the disk backup unit. (It’s like, “Hello, AMANDA, I borrowed some ideas from you!”)
  • Can’t be part-replicated between sites. If you’ve got two VTLs and you really need to do back-end replication, you can replicate individual pieces of media between sites – again, significantly smaller than entire disk backup units. When you define disk backup units in NetWorker, that’s the “smallest” media you get.
  • Are traditionally space wasteful. NetWorker’s limited staging routines encourage clumps of disk backup space by destination pool – e.g., “here’s my daily disk backup units, I use them 30 days out of 31, and those over there that occupy the same amount of space (practically) are my monthly disk backup units, I use them 1 day out of 31. The rest of the time they sit idle.”
  • Have poor staging options (I’ll do another post this week on one way to improve on this).
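
The space-waste point in the list above can be put into rough numbers. This is a back-of-the-envelope sketch only: the capacities are made up, and “active” here simply means the unit is a backup target on a given day, following the 30-out-of-31 and 1-out-of-31 split in the example.

```python
def average_active_fraction(pools, days_in_cycle=31):
    """Share of total disk backup capacity that is a backup target on an average day.

    `pools` is a list of (capacity_tb, days_used_per_cycle) pairs; the figures
    passed in below are illustrative, not measurements.
    """
    total_capacity = sum(capacity for capacity, _ in pools)
    weighted = sum(capacity * days / days_in_cycle for capacity, days in pools)
    return weighted / total_capacity


# Daily and monthly disk backup units of (practically) the same size.
pools = [(10.0, 30), (10.0, 1)]
print(round(average_active_fraction(pools), 3))
```

With equally sized daily and monthly pools, only about half of the purchased capacity is doing any work on an average day.
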

If you get a table thumping sales person trying to tell you that you should buy Data Domain for Disk Backup for NetWorker, I’d suggest thumping the table back – you want the VTL option instead, and you want EMC to fix ADV_FILE.

Honestly EMC, I’ll lead the charge once ADV_FILE is fixed. I’ll champion it until I’m blue in the face, then suck from an oxygen tank and keep going – like I used to, before the inadequacies got too much. Until then though, I’ll keep skewering that argument of superiority by sales numbers.

 

Over at SearchStorage, there’s an article at the moment about using NAS disk as a disk backup target – i.e., where (in NetWorker), the ADV_FILE device would be created.

I have to say, I strongly disagree with the notion of using NAS mounted filesystems for disk backup, even if NetWorker lets you. In short, it’s a very bad idea, and primarily for performance reasons.

Consider this – the optimal backup configuration for NAS is to use NDMP wherever possible; otherwise, if we backup the volume(s) as they are mounted on another host, every backup involves a double network transfer – once to retrieve the data from the NAS device to the mounter, and then a second transfer to have the backup product copy the data from the mounter to backup storage.

So, let me ask the obvious question – if performance issues act as a primary reason to not backup NAS via mounts, are there any compelling performance reasons why the reverse would be acceptable?

I don’t believe there are. If wishing to use array presented storage for disk backup, it would be far more advisable to use SAN storage, where the volume(s) are presented and attached as just another form of local storage.
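
For a rough feel of what the extra network hop costs, here is a small sketch that counts the traffic and the ideal per-hop transfer time. The data size and link speed are assumed figures, and line rate is never reached in practice, so the output is a lower bound for comparison rather than a prediction.

```python
def network_cost(data_tb, link_gbit_per_s, hops):
    """Return total network traffic and the ideal time for one hop.

    Assumes decimal units (1 TB = 8e12 bits) and a link running at full
    line rate; whether the hops overlap depends on the network layout, so
    only the per-hop time is reported.
    """
    bits = data_tb * 8e12
    seconds_per_hop = bits / (link_gbit_per_s * 1e9)
    return {
        "traffic_tb": data_tb * hops,
        "hours_per_hop": seconds_per_hop / 3600,
    }


# 2 TB backup: client -> backup server with SAN-attached disk (one hop),
# versus client -> backup server -> NAS-mounted filesystem (two hops).
print(network_cost(2, 1, hops=1))
print(network_cost(2, 1, hops=2))
```

The NAS-mounted target puts twice as much data on the network for the same backup, which is the same objection as backing up NAS via a mount, only in the opposite direction.
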

Backing up to NAS is one of those activities that falls into the realm of “just because you can do something doesn’t mean you should do it.”

[Edit, 2009-11-15]

In recent discussions with a couple of vendors, I’m willing to entertain the notion that backing up to NAS may be acceptable in an enterprise environment, but my caveat would still be a dedicated 10 Gbit Ethernet link between the NAS server and the backup server.

© 2012 The NetWorker Blog