Preston de Guise

 

Virtualisation.

It’s a fantastic blade to wield through a datacentre. Sweeping and scything, whole racks of equipment are reduced to single servers presenting dozens of hosts. All those driver disks? All those complex and fiddly options for hardware components during OS installation? Brushed aside – all the virtual components are simple and have rock solid drivers. Virtual machine host failing? That’s OK, just push the virtual machines across to another server without the users even noticing.

The improvements virtualisation has made to system efficiency, reliability, etc., in the x86/x86_64 field have been unquestionable.

Yet, like any other sword, it’s double edged.

Virtualisation is about cramming as many systems as is practical within a single bucket.

Backup is something that virtualisation has always handled poorly. And there’s a reason for this – virtualisation is designed for environments where the hosts cooperatively share access to resources. Thin provisioning isn’t just about storage – it’s also about CPU, networking and memory.

Backup isn’t about cooperative sharing of CPU, networking or memory. It’s about needing to get as much data from A to B as possible as quickly as can be done:

The problem with virtualisation backup

Backup at the guest level wants to suck as much data from the virtual network pipes provided by all those machines on the same host at the same time. You want to see the biggest, most powerful virtualisation server your company has ever bought grind to a halt and saturate the network as well? There’s a good chance backing up every guest it runs simultaneously will do the trick just nicely.

When VMware first came up with VCB, it was meant to be the solution. Pull the backup away from the guest, make it part of the hypervisor, and voilà, the problem is solved!

Except it was written by people who believed virtualisation applied only to Windows systems. And thus, it was laughably sad. No, I’m not having a dig at Windows here. But I am having a dig at the notion of homogeneous virtual environments. Sure, they exist, but designing products around them when you’re the virtualisation vendor is … well, I have to say, short sighted.

Perhaps for this reason, or perhaps for less desirable reasons, VCB never really gained the traction VMware likely hoped for, and so something else had to be developed. Something more expansive.

So, VADP was meant to be the big, grand solution to this. And indeed, the VADP API allows more than just Windows systems backups to be performed in such a way that file level recovery from those backups is possible.

What’s the vendor support like though? Haphazard, irregular and inconsistent would probably be the best description. Product X: “Oh, you want to backup a database as well? You need to revert to a guest agent.” Product Y: “Huh? Linux? Guest agent.” Product Z: “Linux? Sure! For any system – well, any that uses ext2 or ext3 filesystems” … you get the picture.

So the problem with VADP is that it’s only a partial solution. In fact, it’s less than half the solution for backing up virtual machines on VMware. It’s maybe 40%. The other 40% is provided by whatever backup product you’re using, and there’s 20% glue.

Between that 40%, 20% and 40%, there’s a lot of scope for things to fall through the cracks.

Where “things” are:

  • Guests using operating systems the backup product doesn’t support VADP with;
  • Guests using filesystems the backup product doesn’t support VADP with;
  • Guests using databases or applications the backup product doesn’t support VADP with.

VADP is the emperor’s new clothes. Everyone is sold on it until the discussions start around what they can’t do with it.

I’m tired of VADP being seen as a silver bullet. That’s the real problem – it doesn’t matter how many hoozits a widget has – if it doesn’t have the hoozit you need, the widget is not fit for your purposes.

I’m not pointing the finger at EMC here. I don’t see a single backup vendor, enterprise or otherwise, providing complete backup solutions under VADP. There’s always something missing.

Until that isn’t the case, you’ll excuse me if I don’t drink the VADP koolaid.

After all, my job is to make sure we can backup everything, not just the easy bits.

 

Periodically, I talk about backup being just a part of a broader set of strategies that I refer to as Information Lifecycle Protection (ILP). This is distinct from Information Lifecycle Management (ILM), and has components as follows:

Components of ILP

A common mistake within an organisation, sometimes triggered by not having merged Backup, Storage and Virtualisation administration, is to approach all backup requirements and challenges only from a backup perspective. When approached from just a backup technology perspective, sometimes it doesn’t matter how elegant your solution is – it just may not be optimal.

Optimal solutions sometimes require extending the umbrella. A classic example of this is NAS. Consider for instance an enterprise environment that has a NAS in the production datacentre, replicating to a disaster recovery datacentre:

Replicated NAS

This is a fairly standard strategy, yet NAS often presents significant challenges to backup environments. Even with NDMP in place, coming up with a nightly data protection strategy for fileservers presenting tens of millions of files is not easy. Various NDMP techniques may allow for speeding up the backup process via block level strategies, but file level recovery from these styles of backups tends to be challenging at best, and not even possible in the worst case scenario.

As is always the case, whether you can even get a backup done is irrelevant if you can’t recover the data in an appropriately usable way.

What’s more, unstructured data doesn’t really lend itself well to more frequent backups than every 24 hours. While database logs can be captured on an almost continual basis, if it takes 8 hours to do an incremental walk of a highly dense filesystem for traditional backup, but the business requires a Recovery Point Objective (RPO) of just 1 hour, your traditional nightly-incremental strategy just doesn’t cut it.
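To make that arithmetic concrete, here is a minimal sketch. The function names are hypothetical, and it makes one simplifying assumption of my own: a new backup can't start until the previous one has finished, so the gap between recoverable points can never be shorter than the backup's own run time.

```python
def achievable_rpo_hours(backup_duration_hours: float,
                         schedule_interval_hours: float) -> float:
    """Smallest gap between recoverable points that a schedule can deliver.

    Assumption: runs do not overlap, so the effective interval is the
    larger of the scheduled interval and the run time of one backup.
    """
    return max(backup_duration_hours, schedule_interval_hours)


def meets_rpo(backup_duration_hours: float,
              schedule_interval_hours: float,
              rpo_hours: float) -> bool:
    """True if the schedule can satisfy the required Recovery Point Objective."""
    achievable = achievable_rpo_hours(backup_duration_hours,
                                      schedule_interval_hours)
    return achievable <= rpo_hours


# The nightly incremental from the text: an 8 hour walk, run every 24 hours.
nightly_ok = meets_rpo(8, 24, rpo_hours=1)

# Scheduling the same incremental hourly doesn't help: the walk still takes 8 hours.
hourly_incremental_ok = meets_rpo(8, 1, rpo_hours=1)
```

Under that assumption neither schedule satisfies a 1 hour RPO, which is why the answer has to come from somewhere other than the traditional backup.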

So, we turn to other aspects in ILP.

The first step is to start using snapshots:

NAS and snapshots

Once configured at the storage layer, NAS snapshots happen pretty much automatically. If the business requires an RPO of 1 hour, then the most obvious protection strategy is to have the NAS take a snapshot every hour. These copy-on-write style snapshots are typically browsable by end-users, and in that situation they have an added advantage – if users can browse a snapshot and find the file they want, they don’t need to ask the backup team to recover the file(s) they need.

However – snapshots on their own represent a poor data protection strategy, since they’re only as safe as the array they’re sitting on, and relying solely on snapshots to protect data on an array, when the snapshots are also on that array, is … well, insane.

So, we have to make use of that replication strategy, and ensure that the snapshots are replicated as well:

Replicated Snapshots

So at this point, we’ve got:

  • Snapshots providing an hourly RPO;
  • Snapshots providing a user-directed, near-instantaneous recovery process;
  • Replication providing protection for snapshots in case of total array failure.

Now, some storage manufacturers would like to suggest that at this point you’ve got a valid backup solution. Not so fast, though! It’s only a valid backup solution if you’re prepared to burn through money to buy enough storage to provide long-term recoverability from snapshot. It’s around this point that you’ll want a backup product inserted into the protection strategy.

However, we don’t just insert a daily backup and leave it at that; if the NAS snapshots are configured correctly we can extend that convenience factor for end-users whilst still getting a copy out to off-line storage. In this scenario, we might end up with a solution such as the following:

Snapshots with Daily Backup

In this scenario, hourly snapshots are kept for 24 hours, with the final snapshot of each day kept in turn as the “daily” backup for n days. In many businesses this will extend to more than a week – e.g., 28 or 31 days. In the above example, those “daily” snapshots are each written out to tape. Keep in mind that we’re still replicating the NAS and its snapshots from one site to another, so we hit a new benefit of combining snapshot, replication and backup into a comprehensive ILP strategy – when the traditional backup is run, it can be run from the replicated data, offloading the impact of the backup from the production NAS:
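The retention rule described here (every hourly snapshot for 24 hours, then the final snapshot of each day for n days) can be sketched as a small selection function. This is an illustration only: the function name, the list-of-timestamps representation and the 28 day default are assumptions, and a real NAS would apply its own snapshot policy engine rather than anything like this.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(snapshots, now,
                      hourly_window=timedelta(hours=24),
                      daily_days=28):
    """Return the snapshot timestamps that the retention rule would keep.

    Rule: keep every snapshot taken within `hourly_window` of `now`, plus
    the last snapshot of each calendar day for `daily_days` days.
    """
    keep = set()
    last_of_day = {}
    for ts in snapshots:
        # Hourly tier: anything taken within the last 24 hours.
        if timedelta(0) <= now - ts <= hourly_window:
            keep.add(ts)
        # Track the final snapshot seen for each calendar day.
        day = ts.date()
        if day not in last_of_day or ts > last_of_day[day]:
            last_of_day[day] = ts
    # Daily tier: the final snapshot of each day, for `daily_days` days.
    for day, ts in last_of_day.items():
        if 0 <= (now.date() - day).days <= daily_days:
            keep.add(ts)
    return sorted(keep)
```

Those retained "daily" snapshots are also the natural candidates to write out to tape, since each one already represents a consistent end-of-day view of the filesystem.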

Replica Snapshot Backups

Of course, this isn’t the only way the backup strategy can work. If sufficient protection is available on both the production and replica NAS units, and the filesystems are large enough, only weekly backups might get output to tape:

Snapshots with Weekly Backups

With that strategy, no incremental backups of the NAS are ever written to tape – just weekly fulls.

Nothing in the above data protection strategy is particularly complex – but equally, none of it is really all that possible when considering backups in isolation. As soon as backups are considered alongside the other activities in ILP (RAID, Replication and Snapshots), advanced and flexible strategies such as the above become available.

So before you approach your next data protection challenge, ask yourself the following question:

Does this need a backup strategy, or does it need an Information Lifecycle Protection strategy?

 

How much time do your staff take to monitor backups?

The answer should be: very little.

Not because they don’t care, or you’re not tasking someone with the responsibility, but because your system should be designed such that your staff can see a “big picture” overview of all backups in a very short period of time. Assuming you do all your full backups on the weekend, your staff don’t arrive until 08.55 and spend the first 10 minutes grabbing a coffee, chatting, logging on, firing up email, browsers, etc., then if your staff can’t tell you by 09.15 what your percentage success rate for weekend backups was, you’re monitoring backups wrong.
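As a rough illustration of what that "big picture" number looks like, here is a minimal sketch. It assumes the backup product's results have already been exported as (client, status) pairs; the status labels "succeeded" and "failed" and the function names are hypothetical, not any product's actual report format.

```python
from collections import Counter


def success_rate(results):
    """Percentage of backups that succeeded, from (client, status) pairs."""
    counts = Counter(status for _client, status in results)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # nothing ran, so nothing succeeded
    return 100.0 * counts["succeeded"] / total


def failed_clients(results):
    """Clients that need troubleshooting, which is a separate activity."""
    return sorted(client for client, status in results if status != "succeeded")
```

The first function is the monitoring answer; the second is merely the starting list for troubleshooting, which is where the real time gets spent.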

Don’t get this confused with troubleshooting. If backups encountered problems, troubleshooting may take considerably longer.

What unfortunately happens all too regularly is that monitoring and troubleshooting are seen as the same activity, or worse, they occupy the same amount of time. Nothing should be further from the truth.

 

How many datacentres do you have across your organisation?

Server Rooms

It’s a simple enough question, but the answer is sometimes not as simple as some organisations think.

The reason for this is it’s too easy to imagine different physical locations equating to different datacentres.

There are actually two conditions that must be met before a server room can be considered a fully independent datacentre. These are:

  1. Physical separation – The room/building must be sufficiently physically separated from other datacentres. By “sufficiently”, I mean that any disaster situation the company designs into its contingency plans should not be able to take out more than one datacentre on the basis of physical proximity to one another.
  2. Technical separation – The room/building must be able to operate at its full production potential without the direct availability of any other datacentre within the environment.

So what does it mean if you have a datacentre that doesn’t meet both of those requirements? Quite simply, it’s not an independent datacentre at all, and likely should be considered just a remote server room which is part of a geographically dispersed datacentre.
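The recount is simple enough to express as a sketch. The class and field names are hypothetical, and reducing each condition to a yes/no flag is my simplification: in practice physical separation is judged between pairs of sites against the disasters in the contingency plan.

```python
from dataclasses import dataclass


@dataclass
class ServerRoom:
    name: str
    physically_separate: bool   # no planned-for disaster also takes out another site
    technically_separate: bool  # runs at full production potential on its own

    def is_independent_datacentre(self) -> bool:
        # Both conditions must hold; meeting only one is not enough.
        return self.physically_separate and self.technically_separate


def count_datacentres(rooms) -> int:
    """Count only the rooms that meet both conditions."""
    return sum(1 for room in rooms if room.is_independent_datacentre())
```

Any room that fails either test drops out of the count and gets treated as a remote server room instead.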

If you’re wondering what the advantage of making this distinction is, it’s this: unless they’re truly independent, considering geographically dispersed server rooms to be datacentres results in the business often making highly incorrect assumptions about the resiliency of the IT systems, and by extension, the business itself.

You might think that we have enough differentiation by referring to simply datacentres and independent datacentres. This, I believe, compounds the problem rather than introducing clarity; many people, particularly those who are budget conscious, will assume the best possible scenario for the least possible price. We all do it – that’s why getting a bargain when shopping can be such a thrill. So while-ever a non-independent datacentre is referred to as a datacentre, it’s going to be read by a plethora of people within a business, or the customers of that business, as an independent one. The solution is to take the word away.

So, on that basis, it’s time to recount, and answer: how many datacentres does your business truly have?

 

What’s the ravine?

When we talk about data flow rates into a backup environment, it’s easy to focus on the peak speeds – the maximum write performance you can get to a backup device, for instance.

However, sometimes that peak flow rate is almost irrelevant to the overall backup performance.

Backup ravine

Many hosts will exist within an environment where only a relatively modest percentage of their data can be backed up at peak speed; the vast majority of their data will instead be backed up at suboptimal speeds. For instance, consider the following nsrwatch output:

High Performance

That’s a write speed averaging 200MB/s per tape drive (peaks were actually 265MB/s in the above tests), writing around 1.5-1.6GB/s.

However, unless all your data is highly optimised structured data running on high performance hardware with high performance networking, your real-world experiences will vary considerably on a minute to minute basis. As soon as filesystem overheads become a significant factor in the backup activity (i.e., you hit fileservers, regular OS and application areas of the system, etc.), your backup performance is generally going to drop by a substantial margin.

This is easy enough to test in real-world scenarios; take a chunk of a filesystem (at least 2x the memory footprint of the host in question), and compare the time to backup:

  • The actual files;
  • A tar of the files.

You’ll see in that situation that there’s a massive performance difference between the two. If you want to see some real-world examples on this, check out “In-lab review of the impact of dense filesystems”.
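If you'd like a feel for the difference before involving the backup product at all, the comparison can be approximated with a plain read test. The sketch below is only a stand-in for a backup's read path: it times reading every file in a tree individually against reading one tar archive of the same tree. All function names are hypothetical, the tar should be written outside the tree being measured, and the sample still needs to be large enough (the 2x memory footprint above) that caching doesn't flatter the results.

```python
import os
import tarfile
import time

CHUNK = 1 << 20  # read in 1 MiB blocks


def read_stream(path):
    """Read one file sequentially and return the number of bytes read."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(CHUNK)
            if not block:
                return total
            total += len(block)


def read_tree(root):
    """Walk a directory tree and read every file found in it."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += read_stream(os.path.join(dirpath, name))
    return total


def make_tar(root, tar_path):
    """Pack the tree into a single uncompressed tar archive."""
    with tarfile.open(tar_path, "w") as archive:
        archive.add(root, arcname="tree")


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Calling `timed(read_tree, root)` and `timed(read_stream, tar_path)` gives the two figures to compare; dividing bytes by seconds for each gives a throughput that can be set against what the backup product reports for the same data.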

Unless pretty much all of your data environment consists of optimised structured data which is optimally available, you’ll likely need to focus your performance tuning activities on the performance ravine – those periods of time where performance is significantly sub-optimal. Or to consider it another way – if absolute optimum performance is 200MB/s, spending a day increasing that to 205MB/s doesn’t seem productive if you also determine that 70% of the time the backup environment is running at less than 100MB/s. At that point, you’re going to achieve much more if you flatten the ravine.
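A quick worked version of that comparison, as a sketch. The profile figures are assumptions for illustration: I've been generous and treated the slow 70% of the time as running at a full 100MB/s, and the 150MB/s "flattened" figure is simply a hypothetical improvement, not a measured one.

```python
def average_throughput(profile):
    """Time-weighted average in MB/s for (fraction_of_time, mb_per_s) pairs."""
    total_fraction = sum(fraction for fraction, _rate in profile)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(fraction * rate for fraction, rate in profile)


# 30% of the time at peak, 70% of the time in the ravine.
baseline = [(0.3, 200.0), (0.7, 100.0)]

# Option 1: a day spent tuning the peak from 200MB/s to 205MB/s.
faster_peak = [(0.3, 205.0), (0.7, 100.0)]

# Option 2: the peak left alone, the ravine lifted to a hypothetical 150MB/s.
flatter_ravine = [(0.3, 200.0), (0.7, 150.0)]
```

Comparing `average_throughput` across the three profiles shows why the ravine is the place to spend effort: the peak improvement barely moves the average, while the same profile with a shallower ravine moves it substantially.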

Looking for a quick fix

There are various ways that you can aim to do this. If we stick purely within the backup realm, then you might look at factoring in some form of source based deduplication as well. Avamar, for instance, can ameliorate some issues associated with unstructured data. Admittedly though, if you don’t already have Avamar in your environment, adding it can be a fairly big spend, so it’s at the upper range of options that may be considered, and even then won’t necessarily always be appropriate, depending on the nature of that unstructured data.

Traditional approaches have included sending multiple streams per filesystem, and (on some occasions) considering block-level backup of filesystem data (e.g., via SnapImage – though, increasing virtualisation is further reducing SnapImage’s number of use-cases), or using NDMP if the data layout is more amenable to better handling by a NAS device.

What the performance ravine demonstrates is that backup is not an isolated activity. In many organisations there’s a tendency to have segmentation along the lines of:

  • Operating system administration;
  • Application/database administration;
  • Virtualisation teams;
  • Storage teams;
  • Backup administration.

Looking for the real fix

In reality, fixing the ravine needs significant levels of communication and cooperation between the groups, and, within most organisations, a merger of the final three teams above, viz:

Crossing the ravine

The reason we need such close communication, and even team merger, is that baseline performance improvement can only come when there’s significant synergy between the groups. For instance, consider the classic dense-filesystem issue. Three core ways to solve it are:

  • Ensure the underlying storage supports large numbers of simultaneous IO operations (e.g., a large number of spindles) so that multistream reads can be achieved;
  • Shift the data storage across to NAS, which is able to handle processing of dense filesystems better;
  • Shift the data storage across to NAS, and do replicated archiving of infrequently accessed data to pull the data out of the backup cycle altogether.

If you were hoping this article might be about quick fixes to the slower part of backups, I have to disappoint you: it’s not so simple, and as suggested by the above diagram, is likely to require some other changes within IT.

If merger in itself is too unwieldy to consider, the next option is the forced breakdown of any communication barriers between those three groups.

A ravine of our own making

In some senses, we were spoilt when gigabit networking was introduced; the solution became fairly common – put the backup server and any storage nodes on a gigabit core, and smooth out those ravines by ensuring that multiple savesets would always be running; therefore even if a single server couldn’t keep running at peak performance, there was a high chance that aggregated performance would be within acceptable levels of peak performance.

Yet unstructured data has grown at a rate which quite frankly has outstripped sequential filesystem access capabilities. It might be argued that operating system vendors and third party filesystem developers won’t make real inroads on this until they can determine adequate ways of encapsulating unstructured filesystems in structured databases, but development efforts down that path haven’t as yet yielded any mainstream available options. (And in actual fact just caused massive delays.)

The solution as environments switch over to 10Gbit networking however won’t be so simple – I’d suggest it’s not unusual for an environment with 10TB of used capacity to have a breakdown of data along the lines of:

  • 4 TB filesystem
  • 2 TB database (prod)
  • 3 TB database (Q/A and development)
  • 500 GB mail
  • 500 GB application & OS data

Assuming by “mail” we’ve got “Exchange”, then it’s quite likely that 5.5TB of the 10TB space will backup fairly quickly – the structured components. That leaves 4.5TB hanging around like a bad smell though.

Unstructured data though actually proves a fundamental point I’ve always maintained – that Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP) are two reasonably independent activities. If they were the same activity, then the resulting synergy would ensure the data were laid out and managed in such a way that data protection would be a doddle. Remember that ILP resembles the following:

Components of ILP

One place where the ravine can be tackled more readily is in the deployment of new systems, which is where that merger of storage, backup and virtualisation comes in, not to mention the close working relationship between OS, Application/DB Admin and the backup/storage/virtualisation groups. Most forms and documents used by organisations when it comes to commissioning new servers will have at most one or two fields for storage – capacity and level of protection. Yet, anyone who works in storage, and equally anyone who works in backup will know that such simplistic questions are the tip of the iceberg for determining performance levels, not only for production access, but also for backup functionality.

The obvious solution to this is service catalogues that cover key factors such as:

  • Capacity;
  • RAID level;
  • Snapshot capabilities;
  • Performance (IOPS) for production activities;
  • Performance (MB/s) for backup/recovery activities (what would normally be quantified under Service Level Agreements, also including recovery time objectives);
  • Recovery point objectives;
  • etc.
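As a sketch of what one entry in such a catalogue might capture, here is a simple record type. Every field name and figure is hypothetical, and the recovery estimate makes a deliberately crude assumption: that a full recovery runs at the same sustained rate quoted for backup, which is often optimistic.

```python
from dataclasses import dataclass


@dataclass
class StorageServiceTier:
    name: str
    capacity_tb: float
    raid_level: str
    snapshots: bool
    production_iops: int
    backup_mb_per_s: float
    rto_hours: float  # recovery time objective
    rpo_hours: float  # recovery point objective

    def full_recovery_hours(self) -> float:
        """Estimated hours to recover the full capacity at the quoted rate."""
        megabytes = self.capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
        return megabytes / self.backup_mb_per_s / 3600

    def recovery_fits_rto(self) -> bool:
        """True if a full recovery at the quoted rate fits inside the RTO."""
        return self.full_recovery_hours() <= self.rto_hours
```

Even a check this simple makes the point: once capacity, throughput and recovery objectives sit in the same record, it becomes obvious at commissioning time when the numbers can't possibly line up.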

But what has all this got to do with the ravine?

I said much earlier in the piece that if you’re looking for a quick solution to the poor-performance ravine within an environment, you’ll be disappointed. In most organisations, once the ravine appears, there’ll need to be at least technical and process changes in order to adequately tackle it – and quite possibly business structural changes too.

Take (as always seems to be the bad smell in the room) unstructured data. Once it’s built up in a standard configuration beyond a certain size, there’s no “easy” fix because it becomes inherently challenging to manage. If you’ve got a 4TB filesystem serving end users across a large department or even an entire company, it’s easy enough to think of a solution to the problem, but thinking about a problem and solving it are two entirely different things, particularly when you’re discussing production data.

It’s here where team merger seems most appropriate; if you take storage in isolation, a storage team will have a very specific approach to configuring a large filesystem for unstructured data access – the focus there is going to be on maximising the number of concurrent IOs and ensuring that standard data protection is in place. That’s not, however, always going to correlate to a configuration that lends itself to traditional backup and recovery operations.

Looking at ILP as a whole though – factoring in snapshot, backup and replication, you can build an entirely different holistic data protection mechanism. Hourly snapshots for 24-48 hours allow for near instantaneous recovery – often user initiated, too. Keeping one of those snapshots per day for say, 30 days, extends this considerably to cover the vast number of recovery requests a traditional filesystem would get. Replication between two sites (including the replication of the snapshots) allows for a form of more traditional backup without yet going to a traditional backup package. For monthly ‘snapshots’ of the filesystem though, regular backup may be used to allow for longer term retention. Suddenly when the ravine only has to be dealt with once a month rather than daily, it’s no longer much of an issue.

Yet, that’s not the only way the problem might be dealt with – what if 80% of that data being backed up is stagnant data that hasn’t been looked at in 6 months? Shouldn’t that then require deleting and archiving? (Remember, first delete, then archive.)

I’d suggest that a common sequence of problems when dealing with backup performance runs as follows:

  1. Failure to notice: Incrementally increasing backup runtimes over a period of weeks or months often don’t get noticed until they’ve already gone from a manageable problem to a serious problem.
  2. Lack of ownership: Is a filesystem backing up slowly the responsibility of the backup administrators or the operating system administrators, or the storage administrators? If they are independent teams, there will very likely be a period where the issue is passed back and forth for evaluation before a cooperative approach is decided upon (if one is decided upon at all).
  3. Focus on the technical: The current technical architecture is what got you into the mess – in and of itself, it’s not necessarily going to get you out of the mess. Sometimes organisations focus so strongly on looking for a technical solution that it’s like someone who runs out of fuel on the freeway running to the boot of their car, grabbing a jerry can, then jumping back in the driver’s seat expecting to be able to drive to the fuel station. (Or, as I like to put it: “Loop, infinite: See Infinite Loop; Infinite Loop: See Loop, Infinite”.)
  4. Mistaking backup for recovery: In many cases the problem ends up being solved, but only for the purposes of backup, without attention to the potential impact that may make on either actual recoverability or recovery performance.

The first issue is caused by a lack of centralised monitoring. The second, by a lack of centralised management. The third, by a lack of centralised architecture, and the fourth, by a lack of IT/business alignment.
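The first of those, failure to notice, is the easiest to illustrate. The sketch below compares the average runtime of the most recent runs against the earliest runs on record and flags growth beyond a threshold. The function names, the 7-run window and the 20% threshold are all arbitrary assumptions, and it presumes the runtimes have already been collected somewhere central, which is of course the real point.

```python
from statistics import mean


def runtime_growth(runtimes_minutes, window=7):
    """Fractional growth of recent runtimes over the earliest ones recorded.

    Compares the mean of the last `window` runs with the mean of the
    first `window` runs; 0.3 means the recent runs take 30% longer.
    """
    if len(runtimes_minutes) < 2 * window:
        raise ValueError("need at least two full windows of history")
    baseline = mean(runtimes_minutes[:window])
    recent = mean(runtimes_minutes[-window:])
    return (recent - baseline) / baseline


def is_creeping(runtimes_minutes, window=7, threshold=0.2):
    """True if recent runtimes have grown by more than `threshold`."""
    return runtime_growth(runtimes_minutes, window) > threshold
```

A check like this only catches the first problem in the list; ownership, architecture and business alignment still need people rather than scripts.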

If you can seriously look at all four of those core issues and say replacing LTO-4 tape drives with LTO-5 tape drives will 100% solve a backup-ravine problem every time, you’re a very, very brave person.

If we consider that backup-performance ravine to be a real, physical one, the only way you’re going to get over it is to build a bridge, and that requires a strong cooperative approach rather than a piecemeal approach that pays scant regard for anything other than the technical.

I’ve got a ravine, what do I do?

If you’re aware you’ve got a backup-performance ravine problem plaguing your backup environment, the first thing you’ve got to do is to pull back from the abyss and stop staring into it. Sure, in some cases, a tweak here or a tweak there may appear to solve the problem, but likely it’s actually just addressing a symptom, instead. One symptom.

Backup-performance ravines should in actual fact be viewed as an opportunity within a business to re-evaluate the broader environment:

  1. Is it time to consider a new technical architecture?
  2. Is it time to consider retrofitting an architecture to the existing environment?
  3. Is it time to evaluate achieving better IT administration group synergy?
  4. Is it time to evaluate better IT/business alignment through SLAs, etc.?

While the problem behind a backup-performance ravine may not be as readily solvable as we’d like, it’s hardly insurmountable – particularly when businesses are keen to look at broader efficiency improvements.

 

Training is an uphill battle

To be perfectly blunt, staff training in backup and recovery products is somewhat of an uphill battle.

There’s a commonly held belief in many organisations that knowledge and understanding of backup products, even enterprise ones, should be acquired on the job via a review of product manuals and online forums.

Yet data protection is somewhat unique in this assumption – there are few organisations that believe storage administrators should learn on the job how to manage the arrays that critical production data resides on: after all, one mistake and significant data loss can occur. If not data loss, significant production issues – slow downs, outright stalls, reduced failure capabilities, etc.

Backup and recovery systems touch on even more components of an environment than storage – arguably, in terms of IT, they may touch on more items of an environment than even the IP network (after all, they can encompass fibre networking as well). The reach of an enterprise backup system, fully deployed and fully protecting an organisation, is staggering in its breadth.

Trying to manage that using untrained staff is like trying to manage the fleet maintenance for an airline using self-taught mechanics who have excellent access to instruction manuals. Sure, they may muddle through regularly – but how well do they really understand what they’re doing?

10-15 years ago, the real struggle in IT was to get management to recognise the need for backups.

Now, the struggle is to make sure IT and business management understand they don’t really have a backup system until they have trained administrators. After all, if you look at what goes into making a backup system, the technology itself only plays a small part:

Backup system

All those components are actually fairly disparate – and there needs to be a unifying factor. That unifying factor is actually training; knowledge-empowered staff are able to appropriately test, are able to utilise the documentation and the technology to integrate with the processes, are able to liaise on the establishment of SLAs, etc.

Without training, everything comes with a higher risk factor. Sure, with training there still is a risk factor, but training can significantly diminish it.

 

 

World Backup Day March 2012

World Backup Day was established last year as a means of trying to encourage everyone to focus on backups.

Personally I disagree with it – and yes, I know that means I’ll probably sound a bit like the Grinch, but I just can’t bring myself to believe a “World Backup Day” works. Of course, the end goal of getting people and corporations to be cognisant of the need for backups and ensuring they’re done is admirable. However, declaring a day to be “World Backup Day” equally sends the wrong message that for 364 days a year (or in this case, 365 days), you don’t really have to think too much about backups.

The simple fact is that every day should be world recovery day. After all, backups aren’t done for the sake of using media, they’re done in case we need to later recover from them.

 

Stop

For the last 6 weeks my life has seemingly been about constant interruptions. The house we’re renting has just been sold, and while I appreciate as a landlord myself the constraints of home ownership, I’ve also been made acutely aware of the challenges of trying to live a normal life while you’re constantly being asked to facilitate inspections, access, etc. The simple fact is that for 6 weeks, I’ve not been able to do anything much at all on weekends. Sure, the interruptions may only take an hour or two each day they occur, but since they happen in the middle of the day, there’s a whole bunch of things that you just can’t get to. Such as, a couple of weeks ago, a festival over a long weekend that was entirely unattainable.

Which brings me to the topic of this post – how much does your backup system interrupt you from your work?

If you’re a backup administrator, you probably question the logic of my question – after all, having to spend time on the backup system is just a case of doing your job.

However, this isn’t really the full story. Even if you’re a dedicated backup administrator, your job shouldn’t really be interruption based. An interruption based job, in that respect, implies a firefighting role – and a firefighting role is going to occur because of any combination of the following:

  • Architectural issues;
  • Procedural issues;
  • Hardware/software issues.

None of these should be all-encompassing enough that they become a dominating factor. Timesheets often demonstrate this in terms of how we start notating our used time. For more years than I can count I’ve worked in jobs where time has to be accounted for, and usually in 15 minute increments. But timesheets never account for spin-down and spin-up time. That is, if you’re working on something already, and a new task comes up that you have to switch across to, that switch-time is not instantaneous. (For further details, check here.)

So if your backup system is regularly acting as an interrupt system, are you working productively, or do you have an annoy-a-tron in your environment?

If you’re suffering high levels of interrupts in your backup environment, it’s time to look at changing the environment, even if that change means a temporary spike in work load or a requirement to bring some temporary staff on. With the possible exception of recoveries, no backup environment should be interrupt driven.

With the exception of recoveries, all other activities within a backup environment should be handled either as:

  • Change requests – a formal system tracking and monitoring successful implementation of non-major updates and alterations to the environment. This would cover new clients, new backup modules, etc.
  • Projects – a formal process for delivering substantial changes to the backup environment. (E.g., replacing an existing tape library with a combined backup to disk + long-term tape solution.)

Now I said “with the exception of recoveries” because, quite frankly, recoveries are the most important activity that can be done in a backup environment. As such, I want to note their processes explicitly. Recoveries should fall into one of three different categories:

  • User serviced – Recoveries that end-users or people other than backup administrators/operators can initiate, monitor and complete without intervention. This may be file recoveries from NAS units that integrate with snapshot/rollback functionality, it may be access to a NetWorker recovery GUI, or it may be the ability to initiate recovery from within an application module. These should be practically invisible to the backup administrators/operators.
  • Scheduled – Non-urgent recoveries that are requested via a formal process and submitted to the appropriate recovery facilitator to complete. These would be slotted into the facilitator’s work schedule on a priority basis.
  • Emergency – Critical recoveries that require immediate action. (You could call these priority 1 recoveries, regardless of whether the official recovery request has been submitted or not.)
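
As a rough illustration of the three categories above, a recovery request could be triaged along the following lines. This is only a sketch: the RecoveryRequest fields, the classify function and the priority numbering are hypothetical names chosen for the example, not part of NetWorker or any other product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class RecoveryCategory(Enum):
    USER_SERVICED = "user serviced"
    SCHEDULED = "scheduled"
    EMERGENCY = "emergency"


@dataclass
class RecoveryRequest:
    # All fields are hypothetical, chosen only to illustrate the triage.
    description: str
    critical: bool = False               # business-critical loss or outage
    self_service_possible: bool = False  # requester can run it unaided
    priority: int = 3                    # lower number = more urgent


def classify(request: RecoveryRequest) -> RecoveryCategory:
    """Map a request onto one of the three recovery categories."""
    if request.critical:
        # Emergencies are acted on immediately, paperwork or not.
        return RecoveryCategory.EMERGENCY
    if request.self_service_possible:
        # Should be practically invisible to the backup administrators.
        return RecoveryCategory.USER_SERVICED
    return RecoveryCategory.SCHEDULED


def work_queue(requests: List[RecoveryRequest]) -> List[RecoveryRequest]:
    """Scheduled recoveries only, ordered by priority for the facilitator."""
    scheduled = [r for r in requests
                 if classify(r) is RecoveryCategory.SCHEDULED]
    return sorted(scheduled, key=lambda r: r.priority)
```

Only the scheduled requests end up in the facilitator's queue; emergencies bypass it, and user serviced recoveries never reach it.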

In any environment, no matter how well architected, there will always be the risk of emergency situations requiring immediate action – critical faults don’t tend to be something you can just schedule into your work day, for instance.

However, in a well architected backup environment with functioning equipment, fire-fighting should be a minimal aspect of the job, rather than an all-encompassing part of the backup administrator’s role.

 

I’ve been pretty quiet of late on the site, and it’s not through a lack of interest. Unfortunately there are several major challenges that I’m currently dealing with that are diverting most of my attention from all kinds of writing – not just this blog, but my personal blog too.

Most of the articles I’m thinking of for the blog at the moment require more extensive testing and research – or are multi-part ones that require a lot more spare time to be allocated to them. However, I’m not getting a lot of that spare time at the moment – the place we’re renting is up for sale and as a result of that it seems every second day I’m dealing with another open-for-inspection, etc. It’s also fairly draining, so between that and work projects, I’m just not getting a lot of energy to devote to the site.

Things should start to ease in a couple of weeks; we should know by that stage whether an investor or an occupier has bought the property, and even if that means a move, it’ll at least be a period of action rather than a draining holding pattern.

So, bear with me: there is new content being worked on, it’s just my window of opportunity to work on it is a little small at the moment.

Cheers!

 

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash in hand jobs, a business not relying on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet, there are improvements that can be made to data protection attitudes within many organisations, and I wanted to outline things that are still done incorrectly within organisations in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that is never claimed on.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
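
The first problem in the list above comes down to a timing check that is easy to automate. The sketch below assumes you can obtain the time each database dump finished and the time the filesystem backup started; where those timestamps come from, and the function names themselves, are hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def dump_captured(dump_finished: datetime,
                  fs_backup_started: datetime) -> bool:
    """True if the dump was complete before the filesystem backup began."""
    return dump_finished <= fs_backup_started


def at_risk_databases(dump_finish_times: Dict[str, datetime],
                      fs_backup_started: datetime) -> List[str]:
    """Names of databases whose dumps finished too late to be reliably
    picked up by the filesystem backup, in alphabetical order."""
    return sorted(
        name for name, finished in dump_finish_times.items()
        if not dump_captured(finished, fs_backup_started)
    )
```

A check like this only reports the symptom. Integrating the database backup into the backup product removes the dependency on dump timing altogether.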

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet, in practically no organisation will administrators of those specific systems get such free rein over the data protection activities, keeping them silo’d off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I’ve seen made in backup systems planning is assuming that what is bought today for backup will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
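
To make the arithmetic concrete, here is a small sketch of a forecast plus the regular reassessment described above. It assumes a simple compound annual growth rate, which is itself only a planning assumption; the function names and the figures in the usage note are hypothetical.

```python
from typing import List


def project_capacity(current_tb: float,
                     annual_growth_rate: float,
                     years: int) -> List[float]:
    """Projected protected capacity (TB) for year 0 through `years`,
    assuming the same compound growth rate every year."""
    return [current_tb * (1 + annual_growth_rate) ** year
            for year in range(years + 1)]


def forecast_drift(projected_tb: float, actual_tb: float) -> float:
    """Fractional difference between actual and projected capacity.
    Positive means growth is running ahead of the forecast."""
    return (actual_tb - projected_tb) / projected_tb
```

For example, 100 TB growing at 50% a year projects to 150 TB after one year and 225 TB after two. If the environment actually holds 180 TB after the first year, forecast_drift(150, 180) reports the forecast is running 20% behind reality, which is the cue to re-plan rather than wait for the system to fill.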

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.
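
As a minimal sketch of that structure, the register below captures each failure, records how it was resolved, and reports anything still outstanding. The class and field names are hypothetical; a real implementation would more likely sit in a ticketing or issue tracking system than in a script.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BackupFailure:
    client: str
    error: str
    resolution: Optional[str] = None  # documented fix, once known

    @property
    def resolved(self) -> bool:
        return self.resolution is not None


@dataclass
class FailureRegister:
    failures: List[BackupFailure] = field(default_factory=list)

    def capture(self, client: str, error: str) -> BackupFailure:
        """Record a failure; nothing is silently ignored."""
        failure = BackupFailure(client, error)
        self.failures.append(failure)
        return failure

    def resolve(self, failure: BackupFailure, resolution: str) -> None:
        """Document the fix so it can be referred to in future."""
        failure.resolution = resolution

    def outstanding(self) -> List[BackupFailure]:
        return [f for f in self.failures if not f.resolved]

    def zero_error(self) -> bool:
        """True only when every captured failure has been resolved."""
        return not self.outstanding()
```

Resolved entries stay in the register rather than being deleted, since the catalogue of past failures and fixes is part of the point.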

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies; their processes without a doubt play a part in disaster recovery planning, and a fairly important part, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – where staff would not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse; if anything, it makes formal policies more necessary, since they have to outlast the person who wrote them.
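
As a sketch of how some of this could be automated, the functions below flag cartridges that are overdue for verification or old enough to be migrated. The Cartridge fields and the default thresholds (verify yearly, migrate after five years) are assumptions made for the example, not recommendations; appropriate values depend on the media type and vendor guidance.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Cartridge:
    # Hypothetical record; batch is kept so a bad batch can be traced.
    barcode: str
    batch: str
    first_used: date
    last_verified: date


def due_for_verification(cartridges: List[Cartridge], today: date,
                         interval_days: int = 365) -> List[Cartridge]:
    """Cartridges not verified within the last `interval_days`."""
    cutoff = today - timedelta(days=interval_days)
    return [c for c in cartridges if c.last_verified < cutoff]


def due_for_migration(cartridges: List[Cartridge], today: date,
                      max_age_years: int = 5) -> List[Cartridge]:
    """Cartridges in service for longer than roughly `max_age_years`."""
    cutoff = today - timedelta(days=365 * max_age_years)
    return [c for c in cartridges if c.first_used < cutoff]
```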

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also for continuance and alignment. If this isn’t the case yet, initial steps towards that shift can be taken by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.

© 2012 The NetWorker Blog