Aug 09 2016
 

I’ve recently been doing some testing around Block Based Backups, and specifically recoveries from them. This has acted as an excellent reminder of two things for me:

  • Microsoft killing TechNet is a real PITA.
  • You backup to recover, not backup to backup.

The first is just a simple gripe: running up an eval Windows server every time I want to run a simple test puts a real crimp in my style, but $1,000+ licenses for a home lab just can’t be justified. (A “hey this is for testing only and I’ll never run a production workload on it” license would be really sweet, Microsoft.)

The second is the real point of the article: you don’t backup for fun. (Unless you’re me.)


You ultimately backup to be able to get your data back, and that means deciding your backup profile based on your RTOs (recovery time objectives), RPOs (recovery point objectives) and compliance requirements. As a general rule of thumb, this means you should design your backup strategy to meet at least 90% of your recovery requirements as efficiently as possible.

For many organisations this means backup requirements can come down to something like the following: “All daily/weekly backups are retained for 5 weeks, and are accessible from online protection storage”. That’s why a lot of smaller businesses in particular get Data Domains sized for say, 5-6 weeks of daily/weekly backups and 2-3 monthly backups before moving data off to colder storage.

But while online is online, we still have to think of local requirements, SLAs and flow-on changes for LTR/Compliance retention when we design backups.

This is something we can consider with things even as basic as the humble filesystem backup. These days there are all sorts of things that can be done to improve the performance of dense (and dense-like) filesystem backups – by dense I’m referring to very large numbers of files in relatively small storage spaces. That applies regardless of whether the density is in local knots on the filesystem (e.g., a few directories that are massively oversubscribed in terms of file counts), or whether it’s just a big, big filesystem in terms of file count.
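
If you’re not sure where those knots are on a given filesystem, a quick brute-force check is simply to count files per directory. A minimal sketch, assuming GNU find and a filesystem mounted at /data (a placeholder – substitute your own path):

#!/bin/bash
# List the 20 directories under /data with the highest direct file counts.
# /data is a placeholder - point it at the filesystem you suspect is dense.
find /data -xdev -type f -printf '%h\n' 2>/dev/null \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -20

Anything showing hundreds of thousands of entries in a single directory is a prime candidate for one of the techniques below.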

We usually think of dense filesystems in terms of the impact on backups – and this is not a NetWorker problem; this is an architectural problem that operating system vendors have not solved. Filesystems struggle to scale their operational performance for sequential walking of directory structures when the number of files starts exponentially increasing. (Case in point: Cloud storage is efficiently accessed at scale when it’s accessed via object storage, not file storage.)

So there are a number of techniques that can be used to speed up filesystem backups. Let’s consider the three most readily available ones (in terms of being built into NetWorker):

  • PSS (Parallel Save Streams) – Dynamically builds multiple concurrent sub-savestreams for individual savesets, speeding up the backup process by having multiple walking/transfer processes.
  • BBB (Block Based Backup) – Bypasses the filesystem entirely, performing a backup at the block level of a volume.
  • Image Based Backup – For virtual machines, a VBA based image level backup reads the entire virtual machine at the ESX/storage layer, bypassing the filesystem and the actual OS itself.

So which one do you use? The answer is a simple one: it depends.

It depends on how you need to recover, how frequently you might need to recover, what your recovery requirements are from longer term retention, and so on.

For virtual machines, VBA is usually the method of choice as it’s the most efficient backup method you can get, with very little impact on the ESX environment. It can recover a sufficient number of files in a single session for most use requirements – particularly if file services have been pushed (where they should be) into dedicated systems like NAS appliances. You can do all sorts of useful things with VBA backups – image level recovery, changed block tracking recovery (very high speed in-place image level recovery), instant access (when using a Data Domain), and of course file level recovery. But if your intent is to recover tens of thousands of files in a single go, VBA is not really what you want to use.

It’s the recovery that matters.

For compatible operating systems and volume management systems, Block Based Backups work regardless of whether you’re in a virtual machine or whether you’re on a physical machine. If you’re needing to backup a dense filesystem running on Windows or Linux that’s less than ~63TB, BBB could be a good, high speed method of achieving that backup. Equally, BBB can be used to recover large numbers of files in a single go, since you just mount the image and copy the data back. (I recently did a test where I dropped ~222,000 x 511 byte text files into a single directory on Windows 2008 R2 and copied them back from BBB without skipping a beat.)
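
If you want to reproduce that sort of dense-directory test on a Linux BBB client, generating the sample data is trivial. A minimal bash sketch – the path, file size and file count are just placeholders mirroring my test:

#!/bin/bash
# Create ~222,000 x 511-byte text files in a single directory - a deliberately
# dense test area along the lines of the test described above.
# /testfs/dense is a placeholder; point it at scratch storage.
mkdir -p /testfs/dense
cd /testfs/dense || exit 1
head -c 511 /dev/zero | tr '\0' 'x' > seed.txt        # 511 bytes of text
for i in $(seq 1 222000)
do
        cp seed.txt "file_${i}.txt"
done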

BBB backups aren’t readily searchable though – there’s no client file index constructed. They work well for systems where content is of a relatively known quantity and users aren’t going to be asking for those “hey I lost this file somewhere in the last 3 weeks and I don’t know where I saved it” recoveries. It’s great for filesystems where it’s OK to mount and browse the backup, or where there’s known storage patterns for data.

It’s the recovery that matters.

PSS is fast, but in any smack-down test BBB and VBA backups will beat it for backup speed. So why would you use it? For a start, it’s available on a wider range of platforms – VBA requires ESX virtualised backups, BBB requires Windows or Linux and ~63TB or smaller filesystems, while PSS will pretty much work on everything other than OpenVMS – and its recovery options work with any protection storage as well. Both BBB and VBA are optimised for online protection storage and being able to mount the backup; PSS is an extension of the classic filesystem agent and is less specific.

It’s the recovery that matters.

So let’s revisit that earlier question: which one do you use? The answer remains: it depends. You pick your backup model not on the basis of “one size fits all” (always a flawed approach in data protection), but on your requirements around questions like:

  • How long will the backups be kept online for?
  • Where are you storing longer term backups? Online, offline, nearline or via cloud bursting?
  • Do you have more flexible SLAs for recovery from Compliance/LTR backups vs Operational/BAU backups? (Usually the answer will be yes, of course.)
  • What’s the required recovery model for the system you’re protecting? (You should be able to form broad groupings here based on system type/function.)
  • Do you have any externally imposed requirements (security, contractual, etc.) that may impact your recovery requirements?

Remember there may be multiple answers. Image level backups like BBB and VBA may be highly appropriate for operational recoveries, but for long term compliance your business may have needs that trigger filesystem/PSS backups for those monthlies and yearlies. (Effectively that comes down to making the LTR backups as robust in terms of future infrastructure changes as possible.) That sort of flexibility of choice is vital for enterprise data protection.

One final note: the choices, once made, shouldn’t stay rigidly inflexible. As a backup administrator or data protection architect, your role is to constantly re-evaluate changes in the technology you’re using to see how and where they might offer improvements to existing processes. (When it comes to release notes: constant vigilance!)

Client Load: Filesystem and Database Backups

Feb 03 2016
 

A question I get asked periodically is “can I backup my filesystem and database at the same time?”

As is often the case, the answer is: “it depends”.

Or, to put it another way: it depends on what the specific client can handle at the time.

For the most part, backup products have a fairly basic design requirement: get the data from the source (let’s say “the client”, ignoring options like ProtectPoint for the moment) to the destination (protection storage) as quickly as possible. The faster the better, in fact. So if we want backups done as fast as possible, wouldn’t it make sense to backup the filesystem and any databases on the client at the same time? Well – the answer is “it depends”, and it comes down to the impact it has on the client and the compatibility of the client to the process.

First, let’s consider compatibility – if both the filesystem and database backup process use the same snapshot mechanism for instance, and only one can have a snapshot operational at any given time, that immediately rules out doing both at once. That’s the most obvious scenario, but the more subtle one almost comes back to the age-old parallelism problem: how fast is too fast?

If we’re conducting a complete filesystem read (say, in the case of a full backup) while simultaneously reading an entire database, and the database and filesystem we’re reading from both reside on the same physical LUN, there’s the potential for the two reads to be counter-productive: if the underlying physical LUN is in fact a single disk, you’re practically guaranteed that’s the case. We wouldn’t normally want RAID-less storage for pretty much anything in production, but just slipping RAID into the equation doesn’t guarantee we can achieve both reads simultaneously without impact to the client – particularly if the client is already doing other things. Production things.
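
You can see this contention effect for yourself without any backup software involved: time two large sequential reads from the same underlying disk, first one after the other and then simultaneously. A rough sketch – the file paths are placeholders, and the files should be larger than the host’s RAM so the page cache doesn’t mask the result:

#!/bin/bash
# Compare back-to-back vs simultaneous sequential reads from the same disk.
# bigfile1 and bigfile2 are placeholders - use files larger than system RAM.
echo "=== Reads one after the other ==="
time cat /data/bigfile1 > /dev/null
time cat /data/bigfile2 > /dev/null

echo "=== Reads at the same time ==="
time ( cat /data/bigfile1 > /dev/null & cat /data/bigfile2 > /dev/null & wait )

On a single spindle the simultaneous run will usually take noticeably longer than the two back-to-back reads combined; on well-provisioned RAID or flash the gap narrows considerably.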

Virtualisation doesn’t write a blank cheque, either; image level backup with databases in the image is a bit of a holy grail in the backup industry, but even in those situations where it may be supported, it’s not supported for every database type. So it’s still more common than not to see situations where you have virtual/image level backups of the guest for crash consistency on the file and operating system components, and then an in-guest database agent running for that true, guaranteed database recoverability. Do you want a database and image based backup happening at the same time? Your hypervisor is furiously reading the image file while the in-guest agent is furiously reading the database.

In each case that’s just at a per-client level. Zoom out to a datacentre with hundreds or thousands of hosts all accessing shared storage via shared networking – usually via shared compute resources as well – and “how long is a piece of string?” becomes an exponentially harder question as the number of shared resources, and the number of items sharing them, comes into play.

Unless you have an overflow of compute resources and SSD offering more IO than your systems can ever need, “can I backup my filesystem and databases at the same time?” is very much a non-trivial question. In fact, it becomes a bit of an art, as does all performance tuning. So rather than directly answering the question, I’ll make a few suggestions to be considered along the way as you answer the question for your environment:

  • Recommendation: Particularly for traditional filesystem agent + traditional database agent backups, never start the two within five minutes of each other, and preferably leave a half hour gap between starts. I.e., overlap is OK; concurrent starts should be avoided where possible. (A minimal wrapper sketch follows this list.)
  • Recommendation: Make sure the two functions can be concurrently executed. I.e., if one blocks the other from running at the same time, you have your answer.
  • Remember: It’s all parallelism. Rather than a former CEO leaping around stage shouting “developers, developers, developers!” imagine me leaping around shouting “parallelism, parallelism, parallelism!”* – at the end of the day each concurrent filesystem backup uses a unit of parallelism and each concurrent database backup uses a unit of parallelism, so if you exceed what the client can naturally do based on memory, CPU resources, network resources or disk resources, you have your answer.
  • Remember: Backup isn’t ABC, it’s CDE – Compression, Deduplication, Encryption. Each function will adjust the performance characteristics of the host you’re backing up – sometimes subtly, sometimes not so subtly. Compression and encryption are easier to understand: if you’re doing either as a client-CPU function you’re likely going to be hammering the host. Deduplication gets trickier of course – you might be doing a bit more CPU processing on the host, but over a shorter period of time if the net result is a 50-99% reduction in the amount of data you’re sending.
  • Remember: You need the up-close and big picture view. It’s rare we have systems so isolated any more that you can consider this in the perspective of a single host. What’s the rest of the environment doing or likely to be doing?
  • Remember: ‘More magic’ is better than ‘magic’. (OK, it’s unrelated, but it’s always a good story to tell.)
  • Most importantly: Test. Once you’ve looked at your environment, once you’ve worked out the parallelism, once you’re happy the combined impact of a filesystem and database backup won’t go beyond the operational allowances on the host – particularly on anything remotely approaching mission critical – test it.
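
On that first recommendation: if your scheduler can’t express offset start times directly, even a trivial wrapper script will do the staggering for you. A minimal sketch – the two functions are stand-ins for whatever actually triggers your filesystem and database backups:

#!/bin/bash
# Stagger the start of a filesystem backup and a database backup by 30 minutes.
# The two functions below are placeholders - replace their bodies with whatever
# actually kicks off your filesystem and database backups.
filesystem_backup() { echo "filesystem backup would start here"; }
database_backup()   { echo "database backup would start here"; }

filesystem_backup &
/bin/sleep 1800          # 30 minute gap between the two starts
database_backup &
wait                     # don't exit until both have finished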

If you were hoping there was an easy answer, the only one I can give you is “don’t” – but that’s just making a blanket assumption that you can never, or should never, do it. It’s the glib/easy answer; the real answer is: only you can answer the question.

But trust me: when you do, it’s immensely satisfying.

On another note: I’m pleased to say I made it into the EMC Elect programme for another year – that’s every year since it started! If you’re looking for some great technical people within the EMC community (partners, employees, customers) to keep an eye on, make sure you check out the announcement page.


* Try saying “parallelism, parallelism, parallelism!” three times fast when you had a speech impediment as a kid. It doesn’t look good.

Oct 26 2015
 

As mentioned in my introductory post about it, NetWorker 9 introduces the option to perform Block Based Backups (BBB) for Linux systems. (This was introduced in NetWorker 8 for Windows, and has actually had its functionality extended for Windows in v9 as well, with the option to now perform BBB for Hyper-V and Exchange systems.)

BBB is a highly efficient mechanism for backing up without worrying about the cost of walking the filesystem. Years ago I showed just how much of a detrimental impact filesystem density can have on the performance of a backup. While often the backup product is blamed for being “slow”, the fault sits completely with operating system and filesystem vendors for not having produced structures that scale sufficiently.

BBB gets us past that problem by side-stepping the filesystem and reading directly from the underlying disk or LUN. Instead of walking files, we just have to traverse the blocks. In cases where filesystems are really dense, the cost of walking the filesystem can increase the run-time of the backup by an order of magnitude or more. Taking that out of the picture allows businesses to protect these filesystems much faster than via conventional means.

Since BBB needs to integrate at a reasonably low level within a system structure in order to successfully operate, NetWorker currently supports only the following systems:

  • CentOS 7
  • RedHat Enterprise Linux v5 and higher
  • SLES Linux 11 SP1 and higher

In all cases, you need to be running LVM2 or Veritas Volume Manager (VxVM), and be using ext3 or ext4 filesystems.

To demonstrate the benefits of BBB in Linux, I’ve set up a test SLES 11 host and used my genfs2 utility on it to generate a really (nastily) dense filesystem. I actually aborted the utility when I had 1,000,000+ files on the filesystem – consuming just 11GB of space:

genfs2 run

I then configured a client resource and policy/workflow to do a conventional backup of the /testfs filesystem. That’s without any form of performance enhancement. From NetWorker’s perspective, this resulted in about 8.5GB of backup, and with 1,178,358 files (and directories) total took 36 minutes and 37 seconds to backup. (That’s actually not too bad, all things considered – but my lab environment was pretty much quiesced other than the test components.)

Conventional Backup Performance

Next, I switched over to parallel savestreams – which has become more capable in NetWorker 9 given NetWorker will now dynamically rebalance remaining backups all the way through to the end of the backup. (Previously the split was effectively static, meaning you could have just one or two savestreams left running by themselves after others had completed. I’ll cover dynamic parallel savestreams in more detail in a later post.)

With dynamic parallel savestreams in play, the backup time dropped by over ten minutes – a total runtime of 23 minutes and 46 seconds:

Dynamic Parallel Savestream Runtime

The next test, of course, involves enabling BBB for the backup. So long as you’ve met the compatibility requirements, this is just a trivial checkbox selection:

Enabling Block Based Backup

With BBB enabled the workflow executed in just 6 minutes and 48 seconds:

Block Based Backup Performance

That’s a substantially shorter runtime – the backups have dropped from over 36 minutes for a single savestream to under 7 minutes using BBB and bypassing the filesystem. While Dynamic Parallel Savestreams did make a substantial difference (shaving around a third from the backup time), BBB was the undisputed winner for maximising backup performance.

One final point – if you’re doing BBB to Data Domain, NetWorker now automatically executes a synthetic full (using the Data Domain virtual synthetic full functionality) at the end of every incremental BBB backup you perform:

Automatic virtual synthetic full

The advantage of this is that recovery from BBB is trivial – just point your recovery process (either command line, or via NMC) at the date you want to recover from, and you have visibility of the entire filesystem at that time. If you’re wondering what FLR from BBB looks like on Linux, by the way, it’s pretty straightforward. Once you identify the saveset (based on date – remember, it’ll contain everything), you can just fire up the recovery utility and get:

BBB FLR

Logging in using another terminal session, it’s just a simple case of browsing to the directory indicated above and copying the files/data you want:

BBB FLR directory listing

And there you have it. If you’ve got highly dense Linux filesystems, you might want to give serious thought to upgrading to NetWorker 9 so you can significantly increase their backup performance. NetWorker + Linux + BBB is a winning combination.

Sampling device performance

Aug 03 2015
 

Data Protection Advisor is an excellent tool for producing information about your backup environment, but not everyone has it. So if you need to go back to basics and monitor device performance unattended, without DPA, you need to look at nsradmin.

High Performance

Of course, if you’ve got realtime access to the NetWorker environment you can simply run nsrwatch or NMC. In either of those systems, you’ll see device performance information such as, say:

writing at 154 MB/s, 819 MB

It’s that same information that you can get by running nsradmin. At its most basic, the command will look like the following:

nsradmin> show name:; message:
nsradmin> print type: NSR device

Now, nsradmin itself isn’t intended to be a full scripting language à la bash, Perl, PowerShell or even (heaven forbid) the DOS batch processing system. So if you’re going to gather monitoring details about device performance from your NetWorker server, you’ll need to wrap your own local operating system scripting skills around the process.

You start with your nsradmin script. For easy recognition, I always name them with a .nsri extension. I saved mine at /tmp/monitor.nsri, and it looked like the following:

show name:; message:
print type: NSR device

I then created a basic bash script. Now, the thing to be aware of here is that you shouldn’t run this sort of script too regularly. While NetWorker can sustain a lot of interactions with administrators while it’s running without an issue, why add to it by polling too frequently? My general feeling is that polling every 5 minutes is more than enough to get a view of how devices are performing overnight.

If I wanted to monitor for 12 hours with a five minute pause between checks, that would be 12 checks an hour – 144 checks overall. To accomplish this, I’d use a bash script like the following:

#!/bin/bash
# Poll device status every 5 minutes (300 seconds), 144 times = 12 hours.
for i in `/usr/bin/seq 1 144`
do
        /bin/date
        /usr/sbin/nsradmin -i /tmp/monitor.nsri
        /bin/echo
        /bin/sleep 300
done >> /tmp/monitor.log

You’ll note from the commands above that I’m writing to a file called /tmp/monitor.log, using >> to append to the file each time.

When executed, this will produce output like the following:

Sun Aug 02 10:40:32 AEST 2015
                        name: Backup;
                     message: "reading, data ";
 
                        name: Clone;
                     message: "writing at 94 MB/s, 812 MB";
 
 
Sun Aug 02 10:45:32 AEST 2015
                        name: Backup;
                     message: "reading, data ";
 
                        name: Clone;
                     message: "writing at 22 MB/s, 411 MB";
 
 
Sun Aug 02 10:50:32 AEST 2015
                        name: Backup;
                     message: "reading, data ";
 
                        name: Clone;
                     message: "writing at 38 MB/s, 81 MB";
 
 
Sun Aug 02 10:55:02 AEST 2015
                        name: Clone;
                     message: "writing at 8396 KB/s, 758 MB";
 
                        name: Backup;
                     message: "reading, data ";

There you have it. In actual fact, this was the easy bit. The next challenge you’ll have will be to extract the data from the log file. That’s scriptable too, but I’ll leave that to you.
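
If you’d like a head start on that extraction, the log format above lends itself to a simple awk pass. A minimal sketch that pairs each timestamp with any write rates reported beneath it (assuming the log lives at /tmp/monitor.log, as per the script above):

#!/bin/bash
# Summarise /tmp/monitor.log: one line per device write-rate sample.
awk '
    /^[A-Z][a-z][a-z] [A-Z][a-z][a-z]/ { stamp = $0 }     # date lines from date(1)
    /name:/  { gsub(/;/, ""); dev = $2 }                  # remember the device name
    /writing at/ {
        line = $0
        sub(/.*writing at /, "", line)                    # strip up to the rate
        sub(/".*/, "", line)                              # strip trailing quote/semicolon
        print stamp ": " dev " - " line
    }
' /tmp/monitor.log

Against the sample output above, that yields lines like “Sun Aug 02 10:45:32 AEST 2015: Clone - 22 MB/s, 411 MB”, which are easy enough to graph or eyeball.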

Oct 27 2014
 

One of the great features in NetWorker 8.1 was Parallel Save Streams (PSS). This allows for a single High Density File System (HDFS) to be split into multiple concurrent savesets to speed up the backup walk process and therefore the overall backup.

In NetWorker 8.2 this was expanded to also support Windows filesystems.

Traditionally, of course, a single filesystem or single saveset, if left to NetWorker, will be backed up as a single save operation:

Traditional Saveset Breakdown

With PSS enabled, what would otherwise be a single saveset is split automatically by NetWorker and ends up looking like:

Parallel Save Streams

I’ve previously mentioned parallel save streams, but it occurred to me that the periodic test backups I do on my home lab server against a Synology filesystem might be the perfect way of seeing the difference PSS can make.

Now, we all know how fun Synology storage is, and I have a 1513+ with 5 x Hitachi 3TB HDS723030ALA640 drives in a RAID-5 configuration, which is my home NAS server*. It’s connected to my backbone gigabit network via a TP-Link SG2216 16 port managed switch, as is my main lab server, an HP Microserver N40L with a dual-core 1.5GHz AMD Turion and 4GB of RAM. Hardly a power-house server, and certainly not even a recommended NetWorker server configuration.

Synology, of course (curse them), don’t support NDMP, so the Synology filesystem is mounted on the backup server via read-only NFS and backed up via the mount point.
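
For what it’s worth, there’s nothing special about the mount itself – something along these lines does the job (the NAS hostname and export path are placeholders; the mount point matches the saveset path in the results below):

#!/bin/bash
# Mount the Synology share read-only on the backup server so it can be
# backed up via the mount point.
mkdir -p /synology/homeshare
mount -t nfs -o ro synology.local:/volume1/homeshare /synology/homeshare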

In a previous backup attempt using a standard single save stream, the backup device was an AFTD consisting of RAID-0 SATA drives plugged into the server directly. Here was the backup result:

 orilla.turbamentis.int: /synology/homeshare level=full, 310 GB 48:47:48  44179 files

48 hours, 47 minutes. With saveset compression turned on.

It occurred to me recently to see whether I’d get a performance gain by switching such a backup to parallel save streams. Keeping saveset compression turned on, this was the result:

orilla.turbamentis.int:/synology/homeshare parallel save streams summary
orilla.turbamentis.int: /synology/homeshare level=full, 371 GB 04:00:14  40990 files

Around 3,200 fewer files, to be sure, but a drop in backup time from 48 hours and 47 minutes down to 4 hours and 14 seconds.

If you’re needing to do traditional backups with high density filesystems, you really should evaluate parallel save streams.


* Yes, I gave in and bought a home NAS server.

Apr 21 2012
 

What’s the ravine?

When we talk about data flow rates into a backup environment, it’s easy to focus on the peak speeds – the maximum write performance you can get to a backup device, for instance.

However, sometimes that peak flow rate is almost irrelevant to the overall backup performance.

Backup ravine

Many hosts will exist within an environment where only a relatively modest percentage of their data can be backed up at peak speed; the vast majority of their data will instead be backed up at suboptimal speeds. For instance, consider the following nsrwatch output:

High Performance

That’s a write speed averaging 200MB/s per tape drive (peaks were actually 265MB/s in the above tests), writing around 1.5-1.6GB/s.

However, unless all your data is highly optimised structured data running on high performance hardware with high performance networking, your real-world experiences will vary considerably on a minute to minute basis. As soon as filesystem overheads become a significant factor in the backup activity (i.e., you hit fileservers, the regular OS and application areas of hosts, etc.), your backup performance is generally going to drop by a substantial margin.

This is easy enough to test in real-world scenarios; take a chunk of a filesystem (at least 2x the memory footprint of the host in question), and compare the time to backup:

  • The actual files;
  • A tar of the files.

You’ll see in that situation that there’s a massive performance difference between the two. If you want to see some real-world examples of this, check out “In-lab review of the impact of dense filesystems”.
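
If you want to see the difference without involving a backup product at all, a rough approximation is to time a file-by-file read of the chunk against a sequential read of a tar of it. A minimal sketch – the source path is a placeholder, and the chunk should stay larger than RAM so caching doesn’t mask the result:

#!/bin/bash
# Compare a file-by-file read (filesystem walk) with a sequential read of
# the same data as a single tar. SRC is a placeholder path.
SRC=/data/chunk

echo "=== File-by-file read (filesystem walk) ==="
time tar cf - "$SRC" > /dev/null

echo "=== Building the tar (one-off cost, not part of the comparison) ==="
tar cf /tmp/chunk.tar "$SRC"

echo "=== Sequential read of the tar ==="
time cat /tmp/chunk.tar > /dev/null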

Unless pretty much all of your data environment consists of optimised structured data which is optimally available, you’ll likely need to focus your performance tuning activities on the performance ravine – those periods of time where performance is significantly sub-optimal. Or to consider it another way – if absolute optimum performance is 200MB/s, spending a day increasing that to 205MB/s doesn’t seem productive if you also determine that 70% of the time the backup environment is running at less than 100MB/s. At that point, you’re going to achieve much more if you flatten the ravine.

Looking for a quick fix

There’s various ways that you can aim to do this. If we stick purely within the backup realm, then you might look at factoring in some form of source based deduplication as well. Avamar, for instance, can ameliorate some issues associated with unstructured data. Admittedly though, if you don’t already have Avamar in your environment, adding it can be a fairly big spend, so it’s at the upper range of options that may be considered, and even then won’t necessarily always be appropriate, depending on the nature of that unstructured data.

Traditional approaches have included sending multiple streams per filesystem, on some occasions considering block-level backup of filesystem data (e.g., via SnapImage – though increasing virtualisation is further reducing SnapImage’s number of use-cases), or using NDMP if the data layout is more amenable to better handling by a NAS device.

What the performance ravine demonstrates is that backup is not an isolated activity. In many organisations there’s a tendency to have segmentation along the lines of:

  • Operating system administration;
  • Application/database administration;
  • Virtualisation teams;
  • Storage teams;
  • Backup administration.

Looking for the real fix

In reality, fixing the ravine needs significant levels of communication and cooperation between the groups, and, within most organisations, a merger of the final three teams above, viz:

Crossing the ravine

The reason we need such close communication, and even team merger, is that baseline performance improvement can only come when there’s significant synergy between the groups. For instance, consider the classic dense-filesystem issue. Three core ways to solve it are:

  • Ensure the underlying storage supports large numbers of simultaneous IO operations (e.g., a large number of spindles) so that multistream reads can be achieved;
  • Shift the data storage across to NAS, which is able to handle processing of dense filesystems better;
  • Shift the data storage across to NAS, and do replicated archiving of infrequently accessed data to pull the data out of the backup cycle all together.

If you were hoping this article might be about quick fixes to the slower part of backups, I have to disappoint you: it’s not so simple, and as suggested by the above diagram, is likely to require some other changes within IT.

If merger in itself is too unwieldy to consider, the next option is the forced breakdown of any communication barriers between those three groups.

A ravine of our own making

In some senses, we were spoilt when gigabit networking was introduced; the solution became fairly common – put the backup server and any storage nodes on a gigabit core, and smooth out those ravines by ensuring that multiple savesets would always be running; therefore even if a single server couldn’t keep running at peak performance, there was a high chance that aggregated performance would be within acceptable levels of peak performance.

Yet unstructured data has grown at a rate which quite frankly has outstripped sequential filesystem access capabilities. It might be argued that operating system vendors and third party filesystem developers won’t make real inroads on this until they can determine adequate ways of encapsulating unstructured filesystems in structured databases, but development efforts down that path haven’t as yet yielded any mainstream available options. (And in actual fact have just caused massive delays.)

The solution as environments switch over to 10Gbit networking however won’t be so simple – I’d suggest it’s not unusual for an environment with 10TB of used capacity to have a breakdown of data along the lines of:

  • 4 TB filesystem
  • 2 TB database (prod)
  • 3 TB database (Q/A and development)
  • 500 GB mail
  • 500 GB application & OS data

Assuming by “mail” we’ve got “Exchange”, then it’s quite likely that 5.5TB of the 10TB space will backup fairly quickly – the structured components. That leaves 4.5TB hanging around like a bad smell though.

Unstructured data though actually proves a fundamental point I’ve always maintained – that Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP) are two reasonably independent activities. If they were the same activity, then the resulting synergy would ensure the data were laid out and managed in such a way that data protection would be a doddle. Remember that ILP resembles the following:

Components of ILP

One place where the ravine can be tackled more readily is in the deployment of new systems, which is where that merger of storage, backup and virtualisation comes in, not to mention the close working relationship between OS, Application/DB Admin and the backup/storage/virtualisation groups. Most forms and documents used by organisations when it comes to commissioning new servers will have at most one or two fields for storage – capacity and level of protection. Yet, anyone who works in storage, and equally anyone who works in backup will know that such simplistic questions are the tip of the iceberg for determining performance levels, not only for production access, but also for backup functionality.

The obvious solution to this is service catalogues that cover key factors such as:

  • Capacity;
  • RAID level;
  • Snapshot capabilities;
  • Performance (IOPs) for production activities;
  • Performance (MB/s) for backup/recovery activities (what would normally be quantified under Service Level Agreements, also including recovery time objectives);
  • Recovery point objectives;
  • etc.

But what has all this got to do with the ravine?

I said much earlier in the piece that if you’re looking for a quick solution to the poor-performance ravine within an environment, you’ll be disappointed. In most organisations, once the ravine appears, there’ll need to be at least technical and process changes in order to adequately tackle it – and quite possibly business structural changes too.

Take (as always seems to be the bad smell in the room) unstructured data. Once it’s built up in a standard configuration beyond a certain size, there’s no “easy” fix because it becomes inherently challenging to manage. If you’ve got a 4TB filesystem serving end users across a large department or even an entire company, it’s easy enough to think of a solution to the problem, but thinking about a problem and solving it are two entirely different things, particularly when you’re discussing production data.

It’s here where team merger seems most appropriate; if you take storage in isolation, a storage team will have a very specific approach to configuring a large filesystem for unstructured data access – the focus there is going to be on maximising the number of concurrent IOs and ensuring that standard data protection is in place. That’s not, however, always going to correlate to a configuration that lends itself to traditional backup and recovery operations.

Looking at ILP as a whole though – factoring in snapshot, backup and replication, you can build an entirely different holistic data protection mechanism. Hourly snapshots for 24-48 hours allow for near instantaneous recovery – often user initiated, too. Keeping one of those snapshots per day for say, 30 days, extends this considerably to cover the vast number of recovery requests a traditional filesystem would get. Replication between two sites (including the replication of the snapshots) allows for a form of more traditional backup without yet going to a traditional backup package. For monthly ‘snapshots’ of the filesystem though, regular backup may be used to allow for longer term retention. Suddenly when the ravine only has to be dealt with once a month rather than daily, it’s no longer much of an issue.

Yet, that’s not the only way the problem might be dealt with – what if 80% of that data being backed up is stagnant data that hasn’t been looked at in 6 months? Shouldn’t that then require deleting and archiving? (Remember, first delete, then archive.)

I’d suggest that a common sequence of problems when dealing with backup performance runs as follows:

  1. Failure to notice: Incrementally increasing backup runtimes over a period of weeks or months often don’t get noticed until it’s already gone from a manageable problem to a serious problem.
  2. Lack of ownership: Is a filesystem backing up slowly the responsibility of the backup administrators or the operating system administrators, or the storage administrators? If they are independent teams, there will very likely be a period where the issue is passed back and forth for evaluation before a cooperative approach (or even if a cooperative approach) is decided upon.
  3. Focus on the technical: The current technical architecture is what got you into the mess – in and of itself, it’s not necessarily going to get you out of the mess. Sometimes organisations focus so strongly on looking for a technical solution that it’s like someone who runs out of fuel on the freeway running to the boot of their car, grabbing a jerry can, then jumping back in the driver’s seat expecting to be able to drive to the fuel station. (Or, as I like to put it: “Loop, infinite: See Infinite Loop; Infinite Loop: See Loop, Infinite”.)
  4. Mistaking backup for recovery: In many cases the problem ends up being solved, but only for the purposes of backup, without attention to the potential impact that may make on either actual recoverability or recovery performance.

The first issue is caused by a lack of centralised monitoring. The second, by a lack of centralised management. The third, by a lack of centralised architecture, and the fourth, by a lack of IT/business alignment.

If you can seriously look at all four of those core issues and say replacing LTO-4 tape drives with LTO-5 tape drives will 100% solve a backup-ravine problem every time, you’re a very, very brave person.

If we consider that backup-performance ravine to be a real, physical one, the only way you’re going to get over it is to build a bridge, and that requires a strong cooperative approach rather than a piecemeal approach that pays scant regard for anything other than the technical.

I’ve got a ravine, what do I do?

If you’re aware you’ve got a backup-performance ravine problem plaguing your backup environment, the first thing you’ve got to do is to pull back from the abyss and stop staring into it. Sure, in some cases, a tweak here or a tweak there may appear to solve the problem, but likely it’s actually just addressing a symptom, instead. One symptom.

Backup-performance ravines should in actual fact be viewed as an opportunity within a business to re-evaluate the broader environment:

  1. Is it time to consider a new technical architecture?
  2. Is it time to consider retrofitting an architecture to the existing environment?
  3. Is it time to evaluate achieving better IT administration group synergy?
  4. Is it time to evaluate better IT/business alignment through SLAs, etc.?

While the problem behind a backup-performance ravine may not be as readily solvable as we’d like, it’s hardly insurmountable – particularly when businesses are keen to look at broader efficiency improvements.

If you wouldn’t drink it, don’t cook with it…

Sep 28 2011
 

This blog article has been moved across to the sister site, Enterprise Systems Backup and Recovery. Read it here.

Arrays, auto hot-spot migration and application performance tuning

Jul 21 2011
 

I’m not a storage person, as I’ve been at pains to highlight in the past. My personal focus is at all times ILP, not ILM, and so I don’t get all giddy about array speeds and feeds, or anything along those lines.

Of course, if someone were to touch base with me tomorrow and offer me a free 10TB SSD array that I could fit under my desk, my opinion would change.

Cue the chirping crickets.

But seriously, in my “lay technical” view of arrays, I do have a theory about the problems introduced by hot spot migration, and I’m going to throw it out there with my reasoning.

First, the background:

  1. When I was taught to program, the credo was “optimise, optimise, optimise”. With limited memory and CPU functionality, we didn’t have the luxury to do lazy programming.
  2. With the staggering increase in processor speeds and memory, many programmers have lost focus on optimisation.
  3. Many second-rate applications can be deemed as such not by pure bugginess, but a distinct lack of optimisation.
  4. The transition from Leopard to Snow Leopard was a perfect example of the impacts of optimisation – the upgrade was about optimisation, not about major new features. And it made a huge difference.

And now, a classic example:

  1. In my first job, I was a system administrator for a very customised SAP system running on Tru64.
  2. Initially the system ran really smoothly all through the week.
  3. Over the 2-3 years I was administering, rumbling slowly developed that on Friday the system would get slower and slower.
  4. This always happened while people were entering their timesheets.
  5. Eventually, as part of Y2K remediation, someone took a look at the SQL commands used for timesheets, and noticed that someone had written a really bad query years ago which basically started by selecting all time sheet entries by all employees, then narrowing down. (Your classic problem of having an SQL query select the wrong results first.)
  6. This was fixed.
  7. System performance leapt through the roof.
  8. Users congratulated everyone on the fantastic “upgrade” that was done.

So, here’s my concern:

  1. For most applications, even complex ones these days, performance will be first IO bound before they become CPU or memory bound.
  2. Hot spot migration to faster media will mask, but not solve performance problems such as those described above.
  3. An application administrator (e.g., DBA) trying to solve application performance will find it challenging to resolve it around hot spot migration, particularly if they run multiple attempts to resolve the problem.

The problem, in short, is two-fold:

  1. First, hot spot migration will mask the problem.
  2. Second, hot spot migration will make problem debugging and resolution more problematic.

Clearly, there are solutions to this. As someone said to me by reply today – a lot of what we do in IT already introduces these problems. It’s why, for instance, I’d never configure a NetWorker storage node as a virtual machine, because it’s using shared resources for performance. It’s also why I’m always reluctant to use blades in the same situation. The solution, I think, is to always be mindful of the following:

  1. Hot spot migration, while fantastic for handling load spikes, masquerades rather than solves application architecture/design issues.
  2. Hot spot migration, if supported by the array, but unknown by the application administrator, at best makes analysis and rectification extremely challenging, and at worst may actually make it impossible.
  3. It will always be important to have the option of turning off hot spot migration for deep analysis and debugging.

At least, that’s what I think. What do you think?

Well, it performs – or performance?

Apr 19 2011
 

We are approaching the point where it would be conceivable for someone to build a PB disk system in their home. It wouldn’t be cheap, but it’s a hell of a lot cheaper than at any point in computing history.

Do you think I’ve gone crazy? Do the math – using 3TB drives, you’d only need 342 drives to get to 1PB. If you wanted to do it really nasty, you could use USB drives, 7-port hubs and a series of PCIe USB cards.

You can get standard motherboards now with 8 PCIe ports. 8 x 4-port USB-2 cards would yield 32 incoming USB channels. Throw a 7-port USB hub onto the end of each of those channels, and you’d get 224 USB connections. Not quite 342, so expand your design a little bit, deploy 2 hosts, split the drives between them and throw in a clustered filesystem across them.

Voilà! It’ll be a messy pile of cables and you’ll need decent 3-phase power coming into your house, but you’ll have a PB of storage! Here’s the start of the system diagram before I got too bored and realised I’d need a much bigger screen:

Performance or it performs?

And if you haven’t lost your current meal laughing at this yet – I’ll state the brutally obvious. Performance would suck. For very, very large values of suck. Reliability would suck too, for equally large values of suck.

But you’d have a PB of storage.

That’s what it’s all about, isn’t it? Well, no. And we all know that.

So if we recognise that, when we look at a blatantly absurd example, why is it that we can be sucked into equally absurd configurations?

I remember in the early noughties, when the CLARiiON line got its first ATA option. The sales guys at the company I worked for at the time started selling FC arrays to customers with snapshots going to ATA disk, to make snapshots cheap.

The snapshots were indeed cheap.

But as soon as the customers started using their storage with active snapshots, their performance sucked – for large values of suck.

And the customers got angry.

And there was much running around and gnashing of teeth.

And I, a software-only guy sat there thinking “who would be insane enough to think that this would have been OK?”

It’s the old saying – you can have cheap, good or fast. Pick two.

Sometimes, cheap means you don’t even get to pick a second option.

By now you’re probably thinking that I’m on some meandering ramble that doesn’t have a point. And, if you’re about to give up and close the browser tab, you’d be right.

But if you’ve made it to this paragraph, you get to read my point: if you wouldn’t use a particular server, storage array or configuration for primary production, you shouldn’t be considering it for the backup server that has to protect those primary production servers either.

No, not in scenario X, or possibility Y. If you wouldn’t use it for primary production, you don’t use it for support production.

It’s that simple.

And if you want to, feel free to go build that cheap and cheerful 1PB storage array for your production storage. I’m sure your users will love it – about as much as they’d love it if you used that 1PB array to backup your production systems and then had to do a recovery from it. That’s my point; it’s not just about buying something that performs for backup – it’s about having something that has sufficient performance for recovery.

So if someone starts talking to you about deploying X for backup, and X is a bit slower, but it’s a lot cheaper, just consider this: would you use X for your primary production system? If the answer is no, then you’d better have some damn good performance data at hand to show that it’s appropriate for the backup of those primary production systems.

Psst! Want to touch my backup server?

Apr 05 2011
 

Last month, I posted a survey with the following questions:

  1. What is your backup server (currently)?
    1. Physical server
    2. Virtual server, backing up directly
    3. Virtual server, in director mode only
    4. Blade server, backing up directly
    5. Blade server, director mode only
  2. Would you run a virtual backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.
  3. Would you run a blade backup server?
    1. Yes – backing up to disk only.
    2. Yes – backing up to any device.
    3. Yes – only as a director.
    4. No.
    5. Already do.

Now, I did preface this survey with my own feelings at the time:

I have to admit, I have great personal reservations towards virtualising backup servers. There’s a simple, fundamental reason for this: the backup server should have as few dependencies as possible in an environment. Therefore to me it seems completely counter-intuitive to make the backup server dependent on an entire virtualisation layer existing before it can be used.

For this reason I also have some niggling concerns with running a backup server as a blade server.

Personally, at this point in time, I would never willingly advocate deploying a NetWorker server as a virtual machine (except in a lab situation) – even when running in director mode.

At the time of the survey, I already knew from a few different sources that EMC run virtualised NetWorker servers as part of their own environment, and are happy to recommend it. I however, wasn’t. (And let’s face it, I’ve been working with NetWorker for longer than EMC’s owned it.) That being said, I wasn’t looking for confirmation that I was right – I was looking for justifiable reasons why I might be wrong.

First, I want to present the survey findings, and then I’ll discuss some of the comments and where I now stand.

There were 122 respondents to the survey, and the answers were:

Current Backup Server

Did this number surprise me? Not really – by its very nature, backup operations and administration is about being conservative: keep things simple, don’t go bleeding edge, and trust what is known. As such, the majority of sites are running a physical backup server. Of the respondents, only 10% were running any form of virtualised backup server, regardless of whether that was a software or hardware virtualised server, and regardless of whether it was directly doing backups or backing up in director mode only.

Would you run a virtual backup server?

So this question was a simple one – would you run a backup server that was virtual? Anyone who has done any surveys would claim (rightly so) that my leading questions into the survey may have coloured the results of the survey, and I’d not disagree with them.

Yet, let’s look at those numbers – less than 50% (admittedly only by a small margin) gave an outright “No” response to this question. I was pleased though that those who would run a virtualised backup server seemed to mirror my general thoughts on the matter – the majority would only do so in director mode, with the next biggest group being willing to backup to disk to the backup server, but not using other devices.

Would you run a blade backup server?

The final question asked the same about blade servers. To be fair to those using blade servers, this probably should have been prefaced with the question “Do you use blade servers in your environment already?”, since it would seem logical that anyone currently not using blade servers probably wouldn’t answer yes to this. But I was still curious – as you may be aware, I’ve had some questions about blade servers in the past, and other than offering better rack density I see them as having no tangible benefits. (Then again, I am in a country that has no lack of space.)

The big difference between a software virtualised backup server and a hardware virtualised backup server though was that people who would run a backup server in a blade environment were more willing to backup to any device. That’s probably understandable. It smells like and looks like regular hardware, so it feels easier than say, a virtual machine accessing a physical tape drive does.

So, the survey showed me fairly much what I was expecting I’d see – a high level of users with physical backup servers. I was hoping though that I might see some comments from people who were either using, or considering using virtual servers, and get some feedback on what they found to be the case.

One of the best comments that came through was from Alex Kaasjager. He started with this:

I agree with you that a backup server (master, director) should be as independent as possible – and right for that specific reason, I’d prefer the server virtualised. Virtualisation solves the problem of a hardware, a hardware-bound OS, location and redundancy.

That immediately got my attention – and so Alex followed with these examples:

– if my hardware breaks (and it will at a certain point in time) I will have to keep a spare machine or go with reinstall-recovery, which, as you will agree, poses its own very peculiar set of problems
– the OS, regardless which one, is bound to the hardware, be it for licensing, MAC address, or drivers. A change in the OS (because of a move to another datacenter for example) may hurt (although it probably won’t, in all fairness)
– I can move my VM anywhere, to another rack, datacenter, or country without much hassle, I can copy, make a snap and even export it. Hardware will prevent this.

Of all the things I hadn’t considered, the simple ability to move your backup server between virtual servers was the one that stood out. Alex’s first point – about protection from hardware failure – is very cogent on its own, but being able to just move the backup server around without impacting any operations, or disrupting licenses – now that’s the kind of “bonus” argument I was looking for. (It’s why, for instance, I’ve advocated that if you’re going to have a License Manager server, you make that virtual.)

Another backup administrator (E. O’S) advocated:

It absolutely has to be in director mode as you describe. All the benefits of hardware abstraction and HA/FT that you get with VM are just as relevant to a critical an app as NetWorker, especially for storage mobility and expansion for a growing and changing datazone. Snapshots before major upgrades? Cloning for testing or redeployment to another site? Yes please. You have to be more confident than ever in your ability to recover NetWorker with bootstraps and indices (even onto a physical host if you need to, to solve your virtualisation layer dependency conundrum) if and when the time comes. Plan for it, practice it, and sleep easy.

The final part of what I’ve quoted there comes to the heart of my reservations about running NetWorker virtualised, even in a director role – how do you do an mmrecov of it? In particular, even when running as a backup director, the NetWorker server still has to back its own bootstrap information up to a local device. Ensuring that you can still recover from such a device would become of paramount importance.

I think the solution here is three-fold:

  • (Already available) Design a virtualised backup server such that the risk of having to do a bootstrap recovery in DR is as minimal as possible.
  • (Already available) Assuming you’re doing those bootstrap backups to disk/virtual disk, be sure to keep them as a separate disk file to the standard disk file for the VM, so that you can run any additional cloning/copying of that you want at a lower level, or attach it to another VM in an emergency.
  • (EMC please take note) It’s time that we no longer needed to do any backups to devices directly attached to the backup server. NetWorker does need architectural enhancements to allow bootstrap backup/recovery to/from storage node devices. (Secondary to this: DR should not be dependent on the original and the destination host having the same names.)

So, has this exercise changed my mind or reinforced my belief that you should always run a physical backup server?

I’m probably now awkwardly sitting on the fence – facing the “virtual is OK for director mode only” camp. That would be with strong caveats to do with recoverability arrangements for the virtual machine. In particular, what I’d suggest is that I would not agree with virtualising the backup server if you were in such a small environment that there’s no provisioning for moving the guest machine between virtual servers. The absolute minimum, for me, in terms of reliability of such a solution is being able to move the backup server from one physical host to another. If you can do that, and you can then have a very well practiced and certain recovery plan in the event of a DR, then yeah, I’m sold on the merits of having a virtualised backup director server.

(If EMC updated NetWorker as per that final bullet point above? I’d be very happy to pitch my tent in that camp.)

I’ve got a couple of follow-up points and questions I’ll be making over the coming week, but I wanted to at least get this initial post out.