Virtualisation.

It’s a fantastic blade to wield through a datacentre. Sweeping and scything, whole racks of equipment are reduced to single servers presenting dozens of hosts. All those driver disks? All those complex and fiddly options for hardware components during OS installation? Brushed aside – all the virtual components are simple and have rock solid drivers. Virtual machine host failing? That’s OK, just push the virtual machines across to another server without the users even noticing.

The improvements virtualisation has made to system efficiency, reliability, etc., in the x86/x86_64 field have been unquestionable.

Yet, like any other sword, it’s double edged.

Virtualisation is about cramming as many systems as is practical within a single bucket.

Backup is something that virtualisation has always handled poorly. And there’s a reason for this – virtualisation is designed for environments where the hosts cooperatively share access to resources. Thin provisioning isn’t just about storage – it’s also about CPU, networking and memory.

Backup isn’t about cooperative sharing of CPU, networking or memory. It’s about needing to get as much data from A to B as possible as quickly as can be done:

The problem with virtualisation backup

Backup at the guest level wants to suck as much data from the virtual network pipes provided by all those machines on the same host at the same time. You want to see the biggest, most powerful virtualisation server your company has ever bought grind to a halt and saturate the network as well? There’s a good chance backing up every guest it runs simultaneously will do the trick just nicely.

When VMware first came up with VCB, it was meant to be the solution. Pull the backup away from the guest, make it part of the hypervisor, and voilà, the problem is solved!

Except it was written by people who believed virtualisation applied only to Windows systems. And thus, it was laughably sad. No, I’m not having a dig at Windows here. But I am having a dig at the notion of homogeneous virtual environments. Sure, they exist, but designing products around them when you’re the virtualisation vendor is … well, I have to say, short sighted.

Perhaps for this reason, or perhaps for less desirable reasons, VCB never really gained the traction VMware likely hoped for, and so something else had to be developed. Something more expansive.

So, VADP was meant to be the big, grand solution to this. And indeed, the VADP API allows more than just Windows systems backups to be performed in such a way that file level recovery from those backups is possible.

What’s the vendor support like though? Haphazard, irregular and inconsistent would probably be the best description. Product X: “Oh, you want to backup a database as well? You need to revert to a guest agent.” Product Y: “Huh? Linux? Guest agent.” Product Z: “Linux? Sure! For any system – well, any that uses ext2 or ext3 filesystems” … you get the picture.

So the problem with VADP is that it’s only a partial solution. In fact, it’s less than half the solution for backing up virtual machines on VMware. It’s maybe 40%. The other 40% is provided by whatever backup product you’re using, and there’s 20% glue.

Between that 40%, 20% and 40%, there’s a lot of scope for things to fall through the cracks.

Where “things” are:

  • Guests using operating systems the backup product doesn’t support VADP with;
  • Guests using filesystems the backup product doesn’t support VADP with;
  • Guests using databases or applications the backup product doesn’t support VADP with.

VADP is the emperor’s new clothes. Everyone is sold on it until the discussions start around what they can’t do with it.

I’m tired of VADP being seen as a silver bullet. That’s the real problem – it doesn’t matter how many hoozits a widget has – if it doesn’t have the hoozit you need, the widget is not fit for your purposes.

I’m not pointing the finger at EMC here. I don’t see a single backup vendor, enterprise or otherwise, providing complete backup solutions under VADP. There’s always something missing.

Until that isn’t the case, you’ll excuse me if I don’t drink the VADP koolaid.

After all, my job is to make sure we can backup everything, not just the easy bits.

 

Periodically, I talk about backup being just a part of a broader set of strategies that I refer to as Information Lifecycle Protection (ILP). This is distinct from Information Lifecycle Management (ILM), and has components as follows:

Components of ILP

A common mistake within an organisation, sometimes triggered by not having merged Backup, Storage and Virtualisation administration, is to approach all backup requirements and challenges only from a backup perspective. When approached from just a backup technology perspective, sometimes it doesn’t matter how elegant your solution is – it just may not be optimal.

Optimal solutions sometimes require extending the umbrella. A classic example of this is NAS. Consider for instance an enterprise environment that has a NAS in the production datacentre, replicating to a disaster recovery datacentre:

Replicated NAS

This is a fairly standard strategy, yet NAS often presents significant challenges to backup environments. Even with NDMP in place, coming up with a nightly data protection strategy for fileservers presenting tens of millions of files is not easy. Various NDMP techniques may allow for speeding up the backup process via block level strategies, but file level recovery from these styles of backups tends to be challenging at best, and not even possible in the worst case.

As is always the case, whether you can even get a backup done is irrelevant if you can’t recover the data in an appropriately usable way.

What’s more, unstructured data doesn’t really lend itself well to more frequent backups than every 24 hours. While database logs can be captured on an almost continual basis, if it takes 8 hours to do an incremental walk of a highly dense filesystem for traditional backup, but the business requires a Recovery Point Objective (RPO) of just 1 hour, your traditional nightly-incremental strategy just doesn’t cut it.

So, we turn to other aspects in ILP.

The first step is to start using snapshots:

NAS and snapshots

Once configured at the storage layer, NAS snapshots happen pretty much automatically. If the business requires an RPO of 1 hour, then the most obvious protection strategy is to have the NAS take a snapshot every hour. These copy-on-write style snapshots are typically browsable by end-users, and in that situation they have an added advantage – if users can browse a snapshot and find the file they want, they don’t need to ask the backup team to recover the file(s) they need.

However – snapshots on their own represent a poor data protection strategy, since they’re only as safe as the array they’re sitting on, and relying solely on snapshots to protect data on an array, when the snapshots are also on that array, is … well, insane.

So, we have to make use of that replication strategy, and ensure that the snapshots are replicated as well:

Replicated Snapshots

So at this point, we’ve got:

  • Snapshots providing an hourly RPO;
  • Snapshots providing a user-directed recovery process;
  • Replication providing protection for snapshots in case of total array failure.

Now, some storage manufacturers would like to suggest that at this point you’ve got a valid backup solution. Not so fast, though! It’s only a valid backup solution if you’re prepared to burn through money to buy enough storage to provide long-term recoverability from snapshot. It’s around this point that you’ll want a backup product inserted into the protection strategy.

However, we don’t just insert a daily backup and leave it at that; if the NAS snapshots are configured correctly we can extend that convenience factor for end-users whilst still getting a copy out to off-line storage. In this scenario, we might end up with a solution such as the following:

Snapshots with Daily Backup

In this scenario, hourly snapshots are kept for 24 hours, with the final snapshot of each day kept in turn as the “daily” backup for n days. In many businesses this will extend to more than a week – e.g., 28 or 31 days. In the above example, those “daily” snapshots are each written out to tape. Keep in mind that we’re still replicating the NAS and its snapshots from one site to another, so we hit a new benefit of combining snapshot, replication and backup into a comprehensive ILP strategy – when the traditional backup is run, it can be run from the replicated data, offloading the impact of the backup from the production NAS:

Replica Snapshot Backups

Of course, this isn’t the only way the backup strategy can work. If sufficient protection is available on both the production and replica NAS units, and the filesystems are large enough, only weekly backups might get output to tape:

Snapshots with Weekly Backups

With that strategy, no incremental backups of the NAS are ever written to tape – just weekly fulls.
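The retention scheme described earlier (hourly snapshots kept for 24 hours, with the final snapshot of each day kept as the “daily” for n days) boils down to a simple pruning rule. What follows is a minimal sketch in Python, not any NAS vendor’s actual snapshot scheduler; the function name `snapshots_to_keep` and its parameters are hypothetical, and 28 days is just one of the example retention periods mentioned above.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(snapshots, now, hourly_hours=24, daily_days=28):
    """Return the set of snapshot timestamps to retain.

    snapshots    -- iterable of datetime objects, one per snapshot taken
    hourly_hours -- keep every snapshot no older than this many hours
    daily_days   -- keep the last snapshot of each day for this many days
    """
    keep = set()
    last_of_day = {}
    for snap in snapshots:
        age = now - snap
        if age < timedelta(0):
            continue  # ignore timestamps from the future
        if age <= timedelta(hours=hourly_hours):
            keep.add(snap)  # still inside the hourly window
        if age <= timedelta(days=daily_days):
            day = snap.date()
            # Track the latest snapshot seen for each calendar day.
            if day not in last_of_day or snap > last_of_day[day]:
                last_of_day[day] = snap
    keep.update(last_of_day.values())
    return keep
```

Anything not in the returned set is a candidate for expiry. The “daily” snapshots in that set are the ones a backup product would pick up, ideally from the replica NAS.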

Nothing in the above data protection strategy is particularly complex – but equally, none of it is really all that possible when considering backups in isolation. As soon as backups are considered alongside the other activities in ILP (RAID, Replication and Snapshots), advanced and flexible strategies such as the above become available.

So before you approach your next data protection challenge, ask yourself the following question:

Does this need a backup strategy, or does it need an Information Lifecycle Protection strategy?

 

How many datacentres do you have across your organisation?

Server Rooms

It’s a simple enough question, but the answer is sometimes not as simple as some organisations think.

The reason for this is it’s too easy to imagine different physical locations equating to different datacentres.

There are actually two conditions that must be met before a server room can be considered a fully independent datacentre. These are:

  1. Physical separation – The room/building must be sufficiently physically separated from other datacentres. By “sufficiently”, I mean that any disaster situation the company designs into its contingency plans should not be able to take out more than one datacentre on the basis of physical proximity to one another.
  2. Technical separation – The room/building must be able to operate at its full production potential without the direct availability of any other datacentre within the environment.

So what does it mean if you have a datacentre that doesn’t meet both of those requirements? Quite simply, it’s not an independent datacentre at all, and likely should be considered just a remote server room which is part of a geographically dispersed datacentre.

If you’re wondering what the advantage of making this distinction is, it’s this: unless they’re truly independent, considering geographically dispersed server rooms to be datacentres results in the business often making highly incorrect assumptions about the resiliency of the IT systems, and by extension, the business itself.

You might think that we have enough differentiation by referring to simply datacentres and independent datacentres. This, I believe, compounds the problem rather than introducing clarity; many people, particularly those who are budget conscious, will assume the best possible scenario for the least possible price. We all do it – that’s why getting a bargain when shopping can be such a thrill. So while-ever a non-independent datacentre is referred to as a datacentre, it’s going to be read by a plethora of people within a business, or the customers of that business, as an independent one. The solution is to take the word away.

So, on that basis, it’s time to recount, and answer: how many datacentres does your business truly have?

 

What’s the ravine?

When we talk about data flow rates into a backup environment, it’s easy to focus on the peak speeds – the maximum write performance you can get to a backup device, for instance.

However, sometimes that peak flow rate is almost irrelevant to the overall backup performance.

Backup ravine

Many hosts will exist within an environment where only a relatively modest percentage of their data can be backed up at peak speed; the vast majority of their data will instead be backed up at suboptimal speeds. For instance, consider the following nsrwatch output:

High Performance

That’s a write speed averaging 200MB/s per tape drive (peaks were actually 265MB/s in the above tests), writing around 1.5-1.6GB/s.

However, unless all your data is highly optimised structured data running on high performance hardware with high performance networking, your real-world experiences will vary considerably on a minute to minute basis. As soon as filesystem overheads become a significant factor in the backup activity (i.e., you hit fileservers, regular OS and application parts of the operating system, etc.), your backup performance is generally going to drop by a substantial margin.

This is easy enough to test in real-world scenarios; take a chunk of a filesystem (at least 2x the memory footprint of the host in question), and compare the time to backup:

  • The actual files;
  • A tar of the files.

You’ll see in that situation that there’s a massive performance difference between the two. If you want to see some real-world examples on this, check out “In-lab review of the impact of dense filesystems”.
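If you’d like to get a feel for the comparison before involving the backup product, the test above can be approximated with plain file reads. This is only a rough sketch in Python: it stands in for a backup client by reading data sequentially, the function names are hypothetical, and the synthetic tree built by `demo()` is far too small to defeat filesystem caching (hence the 2x memory guidance above for real tests).

```python
import os
import tarfile
import tempfile
import time


def read_file(path, chunk=1 << 20):
    """Read one file sequentially and return the number of bytes read."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk)
            if not block:
                return total
            total += len(block)


def time_tree_read(root):
    """Read every file under root; return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            total += read_file(os.path.join(dirpath, name))
    return total, time.perf_counter() - start


def time_tar_read(root, tar_path):
    """Tar up root (untimed), then read the tar; return (bytes_read, seconds)."""
    with tarfile.open(tar_path, "w") as tar:
        tar.add(root, arcname=".")
    start = time.perf_counter()
    total = read_file(tar_path)
    return total, time.perf_counter() - start


def demo(file_count=2000, file_size=4096):
    """Build a synthetic dense tree and compare the two read patterns."""
    with tempfile.TemporaryDirectory() as work:
        root = os.path.join(work, "tree")
        os.mkdir(root)
        payload = b"x" * file_size
        for i in range(file_count):
            with open(os.path.join(root, f"file{i:06d}.dat"), "wb") as fh:
                fh.write(payload)
        # The tar lives outside the tree so it is not swept up into itself.
        tar_path = os.path.join(work, "tree.tar")
        return {
            "files": time_tree_read(root),
            "tar": time_tar_read(root, tar_path),
        }
```

For a meaningful result, point `time_tree_read` and `time_tar_read` at a real, suitably large chunk of filesystem rather than relying on `demo()`.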

Unless pretty much all of your data environment consists of optimised structured data which is optimally available, you’ll likely need to focus your performance tuning activities on the performance ravine – those periods of time where performance is significantly sub-optimal. Or to consider it another way – if absolute optimum performance is 200MB/s, spending a day increasing that to 205MB/s doesn’t seem productive if you also determine that 70% of the time the backup environment is running at less than 100MB/s. At that point, you’re going to achieve much more if you flatten the ravine.
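The arithmetic behind that claim is worth spelling out. The sketch below, in Python, treats overall performance as a time-weighted average. It generously assumes the “ravine” runs at a flat 100MB/s for 70% of the window, and the 150MB/s figure for a flattened ravine is purely hypothetical.

```python
def average_throughput(profile):
    """Time-weighted average throughput in MB/s.

    profile -- list of (percent_of_backup_window, mb_per_second) pairs
               whose percentages sum to 100.
    """
    if sum(pct for pct, _rate in profile) != 100:
        raise ValueError("percentages must sum to 100")
    return sum(pct * rate for pct, rate in profile) / 100


# 30% of the window at peak, 70% in the ravine.
baseline = average_throughput([(30, 200), (70, 100)])          # 130.0 MB/s
# A day spent lifting the peak from 200MB/s to 205MB/s.
faster_peak = average_throughput([(30, 205), (70, 100)])       # 131.5 MB/s
# The same effort spent lifting the ravine to a hypothetical 150MB/s.
flatter_ravine = average_throughput([(30, 200), (70, 150)])    # 165.0 MB/s
```

Tuning the peak buys about 1% overall; flattening the ravine, under these assumed numbers, buys about 27%.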

Looking for a quick fix

There are various ways that you can aim to do this. If we stick purely within the backup realm, then you might look at factoring in some form of source based deduplication as well. Avamar, for instance, can ameliorate some issues associated with unstructured data. Admittedly though, if you don’t already have Avamar in your environment, adding it can be a fairly big spend, so it’s at the upper range of options that may be considered, and even then won’t necessarily always be appropriate, depending on the nature of that unstructured data.

Traditional approaches have included sending multiple streams per filesystem, and (on some occasions) considering block-level backup of filesystem data (e.g., via SnapImage – though, increasing virtualisation is further reducing SnapImage’s number of use-cases), or using NDMP if the data layout is more amenable to better handling by a NAS device.

What the performance ravine demonstrates is that backup is not an isolated activity. In many organisations there’s a tendency to have segmentation along the lines of:

  • Operating system administration;
  • Application/database administration;
  • Virtualisation teams;
  • Storage teams;
  • Backup administration.

Looking for the real fix

In reality, fixing the ravine needs significant levels of communication and cooperation between the groups, and, within most organisations, a merger of the final three teams above, viz:

Crossing the ravine

The reason we need such close communication, and even team merger, is that baseline performance improvement can only come when there’s significant synergy between the groups. For instance, consider the classic dense-filesystem issue. Three core ways to solve it are:

  • Ensure the underlying storage supports large numbers of simultaneous IO operations (e.g., a large number of spindles) so that multistream reads can be achieved;
  • Shift the data storage across to NAS, which is able to handle processing of dense filesystems better;
  • Shift the data storage across to NAS, and do replicated archiving of infrequently accessed data to pull the data out of the backup cycle altogether.

If you were hoping this article might be about quick fixes to the slower part of backups, I have to disappoint you: it’s not so simple, and as suggested by the above diagram, is likely to require some other changes within IT.

If merger in itself is too unwieldy to consider, the next option is the forced breakdown of any communication barriers between those three groups.

A ravine of our own making

In some senses, we were spoilt when gigabit networking was introduced; the solution became fairly common – put the backup server and any storage nodes on a gigabit core, and smooth out those ravines by ensuring that multiple savesets would always be running; therefore even if a single server couldn’t keep running at peak performance, there was a high chance that aggregated performance would be within acceptable levels of peak performance.

Yet unstructured data has grown at a rate which quite frankly has outstripped sequential filesystem access capabilities. It might be argued that operating system vendors and third party filesystem developers won’t make real inroads on this until they can determine adequate ways of encapsulating unstructured filesystems in structured databases, but development efforts down that path haven’t as yet yielded any mainstream available options. (And in actual fact just caused massive delays.)

The solution as environments switch over to 10Gbit networking however won’t be so simple – I’d suggest it’s not unusual for an environment with 10TB of used capacity to have a breakdown of data along the lines of:

  • 4 TB filesystem
  • 2 TB database (prod)
  • 3 TB database (Q/A and development)
  • 500 GB mail
  • 500 GB application & OS data

Assuming by “mail” we’ve got “Exchange”, then it’s quite likely that 5.5TB of the 10TB space will backup fairly quickly – the structured components. That leaves 4.5TB hanging around like a bad smell though.
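To put rough numbers on that split, here’s a small Python sketch that totals the breakdown above by type and estimates how long each portion would take to stream. The rates are assumptions, not measurements: 200MB/s borrows the per-drive average quoted earlier, 50MB/s for unstructured data is purely hypothetical, and the calculation pretends each portion is a single serial stream using decimal terabytes.

```python
# Sizes in TB, taken from the breakdown above, with mail treated as structured.
DATA_TB = {
    "filesystem": (4.0, "unstructured"),
    "database (prod)": (2.0, "structured"),
    "database (Q/A and development)": (3.0, "structured"),
    "mail": (0.5, "structured"),
    "application & OS data": (0.5, "unstructured"),
}

# Assumed sustained rates in MB/s; substitute figures from your own environment.
ASSUMED_RATE_MB_S = {"structured": 200, "unstructured": 50}


def summarise(data, rates):
    """Return (TB per data type, estimated hours per data type)."""
    totals = {}
    for size_tb, kind in data.values():
        totals[kind] = totals.get(kind, 0.0) + size_tb
    hours = {
        kind: tb * 1_000_000 / rates[kind] / 3600  # TB -> MB -> seconds -> hours
        for kind, tb in totals.items()
    }
    return totals, hours


totals, hours = summarise(DATA_TB, ASSUMED_RATE_MB_S)
```

Under those assumptions the 5.5TB of structured data needs a little under 8 hours, while the 4.5TB of unstructured data needs around 25, which is the ravine in a nutshell.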

Unstructured data though actually proves a fundamental point I’ve always maintained – that Information Lifecycle Management (ILM) and Information Lifecycle Protection (ILP) are two reasonably independent activities. If they were the same activity, then the resulting synergy would ensure the data were laid out and managed in such a way that data protection would be a doddle. Remember that ILP resembles the following:

Components of ILP

One place where the ravine can be tackled more readily is in the deployment of new systems, which is where that merger of storage, backup and virtualisation comes in, not to mention the close working relationship between OS, Application/DB Admin and the backup/storage/virtualisation groups. Most forms and documents used by organisations when it comes to commissioning new servers will have at most one or two fields for storage – capacity and level of protection. Yet, anyone who works in storage, and equally anyone who works in backup will know that such simplistic questions are the tip of the iceberg for determining performance levels, not only for production access, but also for backup functionality.

The obvious solution to this is service catalogues that cover key factors such as:

  • Capacity;
  • RAID level;
  • Snapshot capabilities;
  • Performance (IOPs) for production activities;
  • Performance (MB/s) for backup/recovery activities (what would normally be quantified under Service Level Agreements, also including recovery time objectives);
  • Recovery point objectives;
  • etc.
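As a sketch of what one entry in such a catalogue might capture, here’s a small Python data structure covering the factors listed above. The class name, field names and units are all hypothetical; a real catalogue would be shaped by whatever your storage and backup platforms can actually deliver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageServiceTier:
    """One offering in a (hypothetical) storage service catalogue."""
    name: str
    capacity_gb: int
    raid_level: str
    snapshots_supported: bool
    production_iops: int                   # performance for production activities
    backup_mb_per_s: int                   # performance for backup/recovery activities
    recovery_time_objective_hours: float
    recovery_point_objective_hours: float


def meets_request(tier, capacity_gb, iops, rpo_hours, rto_hours):
    """True if the tier satisfies a commissioning request on every axis."""
    return (
        tier.capacity_gb >= capacity_gb
        and tier.production_iops >= iops
        and tier.recovery_point_objective_hours <= rpo_hours
        and tier.recovery_time_objective_hours <= rto_hours
    )
```

The point of the structure is that a commissioning form asks for all of these values up front, rather than just capacity and level of protection.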

But what has all this got to do with the ravine?

I said much earlier in the piece that if you’re looking for a quick solution to the poor-performance ravine within an environment, you’ll be disappointed. In most organisations, once the ravine appears, there’ll need to be at least technical and process changes in order to adequately tackle it – and quite possibly business structural changes too.

Take (as always seems to be the bad smell in the room) unstructured data. Once it’s built up in a standard configuration beyond a certain size, there’s no “easy” fix because it becomes inherently challenging to manage. If you’ve got a 4TB filesystem serving end users across a large department or even an entire company, it’s easy enough to think of a solution to the problem, but thinking about a problem and solving it are two entirely different things, particularly when you’re discussing production data.

It’s here where team merger seems most appropriate; if you take storage in isolation, a storage team will have a very specific approach to configuring a large filesystem for unstructured data access – the focus there is going to be on maximising the number of concurrent IOs and ensuring that standard data protection is in place. That’s not, however, always going to correlate to a configuration that lends itself to traditional backup and recovery operations.

Looking at ILP as a whole though – factoring in snapshot, backup and replication, you can build an entirely different holistic data protection mechanism. Hourly snapshots for 24-48 hours allow for near instantaneous recovery – often user initiated, too. Keeping one of those snapshots per day for say, 30 days, extends this considerably to cover the vast number of recovery requests a traditional filesystem would get. Replication between two sites (including the replication of the snapshots) allows for a form of more traditional backup without yet going to a traditional backup package. For monthly ‘snapshots’ of the filesystem though, regular backup may be used to allow for longer term retention. Suddenly when the ravine only has to be dealt with once a month rather than daily, it’s no longer much of an issue.

Yet, that’s not the only way the problem might be dealt with – what if 80% of that data being backed up is stagnant data that hasn’t been looked at in 6 months? Shouldn’t that then require deleting and archiving? (Remember, first delete, then archive.)
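Finding out how much of a filesystem is stagnant doesn’t require anything exotic. The Python sketch below walks a tree and totals the bytes in files untouched for roughly 6 months (180 days). One loud assumption: it uses modification time as a proxy for “hasn’t been looked at”, since access times are frequently disabled or relaxed at mount time. The function name is hypothetical.

```python
import os
import stat
import time


def stale_data_report(root, max_age_days=180, now=None):
    """Return (stale_bytes, total_bytes) for regular files under root.

    A file counts as stale if it was last modified more than
    max_age_days ago (modification time is a proxy for access here).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets and other special files
            total += info.st_size
            if info.st_mtime < cutoff:
                stale += info.st_size
    return stale, total
```

Dividing the first value by the second gives the stagnant percentage. Bear in mind that on a dense filesystem this walk will itself be slow, for the same reasons the backup is.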

I’d suggest that a common sequence of problems when dealing with backup performance runs as follows:

  1. Failure to notice: Incrementally increasing backup runtimes over a period of weeks or months often don’t get noticed until they’ve already gone from a manageable problem to a serious problem.
  2. Lack of ownership: Is a filesystem backing up slowly the responsibility of the backup administrators or the operating system administrators, or the storage administrators? If they are independent teams, there will very likely be a period where the issue is passed back and forth for evaluation before a cooperative approach (or even if a cooperative approach) is decided upon.
  3. Focus on the technical: The current technical architecture is what got you into the mess – in and of itself, it’s not necessarily going to get you out of the mess. Sometimes organisations focus so strongly on looking for a technical solution that it’s like someone who runs out of fuel on the freeway running to the boot of their car, grabbing a jerry can, then jumping back in the driver’s seat expecting to be able to drive to the fuel station. (Or, as I like to put it: “Loop, infinite: See Infinite Loop; Infinite Loop: See Loop, Infinite”.)
  4. Mistaking backup for recovery: In many cases the problem ends up being solved, but only for the purposes of backup, without attention to the potential impact that may make on either actual recoverability or recovery performance.

The first issue is caused by a lack of centralised monitoring. The second, by a lack of centralised management. The third, by a lack of centralised architecture, and the fourth, by a lack of IT/business alignment.
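On the first of those, even very simple monitoring will catch runtime creep long before it becomes serious. Here’s a minimal Python sketch that compares recent backup runtimes with an earlier baseline. The function name, the 7-run window and the 25% threshold are all hypothetical choices, and the runtimes themselves would need to come from whatever reporting your backup product provides.

```python
from statistics import mean


def runtime_creep(runtimes_minutes, window=7, threshold=1.25):
    """Flag gradual growth in backup runtimes.

    Compares the mean of the most recent `window` runs with the mean of
    the oldest `window` runs and returns (ratio, flagged).
    """
    if len(runtimes_minutes) < 2 * window:
        raise ValueError("need at least two full windows of history")
    baseline = mean(runtimes_minutes[:window])
    recent = mean(runtimes_minutes[-window:])
    ratio = recent / baseline
    return ratio, ratio >= threshold
```

A backup that grows by two minutes a night looks harmless on any given morning. Over two months it’s more than 40% longer than the baseline, and a check like this would have flagged it weeks earlier.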

If you can seriously look at all four of those core issues and say replacing LTO-4 tape drives with LTO-5 tape drives will 100% solve a backup-ravine problem every time, you’re a very, very brave person.

If we consider that backup-performance ravine to be a real, physical one, the only way you’re going to get over it is to build a bridge, and that requires a strong cooperative approach rather than a piecemeal approach that pays scant regard for anything other than the technical.

I’ve got a ravine, what do I do?

If you’re aware you’ve got a backup-performance ravine problem plaguing your backup environment, the first thing you’ve got to do is to pull back from the abyss and stop staring into it. Sure, in some cases, a tweak here or a tweak there may appear to solve the problem, but likely it’s actually just addressing a symptom, instead. One symptom.

Backup-performance ravines should in actual fact be viewed as an opportunity within a business to re-evaluate the broader environment:

  1. Is it time to consider a new technical architecture?
  2. Is it time to consider retrofitting an architecture to the existing environment?
  3. Is it time to evaluate achieving better IT administration group synergy?
  4. Is it time to evaluate better IT/business alignment through SLAs, etc.?

While the problem behind a backup-performance ravine may not be as readily solvable as we’d like, it’s hardly insurmountable – particularly when businesses are keen to look at broader efficiency improvements.

 

Stop

For the last 6 weeks, my life has seemingly been about constant interruptions. The house we’re renting has just been sold, and while I appreciate as a landlord myself the constraints of home ownership, I’ve also been made acutely aware of the challenges of trying to live a normal life while you’re constantly being asked to facilitate inspections, access, etc. The simple fact is that for 6 weeks, I’ve not been able to do anything much at all on weekends. Sure, the interruptions may only take an hour or two each day they occur, but since they happen in the middle of the day, there’s a whole bunch of things that you just can’t get to. Such as, a couple of weeks ago, a festival over a long weekend that was entirely unattainable.

Which brings me to the topic of this post – how much does your backup system interrupt you from your work?

If you’re a backup administrator, you probably question the logic of my question – after all, having to spend time on the backup system is just a case of doing your job.

However, this isn’t really the full story. Even if you’re a dedicated backup administrator, your job shouldn’t really be interruption based. An interruption based job, in that respect, implies a firefighting role – and a firefighting role is going to occur because of any combination of the following:

  • Architectural issues;
  • Procedural issues;
  • Hardware/software issues.

None of these should be all-encompassing enough that they become a dominating factor. Timesheets often demonstrate this in terms of how we start notating our used time. For more years than I can count I’ve worked in jobs where time has to be accounted for, and usually in 15 minute increments. But timesheets never account for spin-down and spin-up time. That is, if you’re working on something already, and a new task comes up that you have to switch across to, that switch-time is not instantaneous. (For further details, check here.)

So if your backup system is regularly acting as an interrupt system, are you working productively, or do you have an annoy-a-tron in your environment?

If you’re suffering high levels of interrupts in your backup environment, it’s time to look at changing the environment, even if that change means a temporary spike in work load or a requirement to bring some temporary staff on. With the possible exception of recoveries, no backup environment should be interrupt driven.

With the exception of recoveries, all other activities within a backup environment should be handled either as:

  • Change requests – a formal system tracking and monitoring successful implementation of non-major updates and alterations to the environment. This would cover new clients, new backup modules, etc.
  • Projects – a formal process for delivering substantial changes to the backup environment. (E.g., replacing an existing tape library with a combined backup to disk + long-term tape solution.)

Now I said “with the exception of recoveries” because, quite frankly, recoveries are the most important activity that can be done in a backup environment. As such, I want to note their processes explicitly. Recoveries should fall into one of three different categories:

  • User serviced – Recoveries that end-users or people other than backup administrators/operators can initiate, monitor and complete without intervention. This may be file recoveries from NAS units that integrate with snapshot/rollback functionality, it may be access to a NetWorker recovery GUI, or it may be the ability to initiate recovery from within an application module. These should be practically invisible to the backup administrators/operators.
  • Scheduled – Non-urgent recoveries that are requested via a formal process and submitted to the appropriate recovery facilitator to complete. These would be slotted into the facilitator’s work schedule on a priority basis.
  • Emergency – Critical recoveries (you could call these priority 1 recoveries – regardless of whether the official recovery request has been submitted or not)

In any environment, no matter how well architected, there will always be the risk of emergency situations requiring immediate action – critical faults don’t tend to be something you can just schedule into your work day, for instance.

However, in a well architected backup environment with functioning equipment, it should be the case that fire-fighting is a minimum job aspect, rather than an all-encompassing part of the backup administrator’s role.

 

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days where computers were often considered non-essential to ongoing business operations. Now, unless you’re a tradesperson who does all your work as cash in hand jobs, a business not relying on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet, there are improvements that can be made to data protection attitudes within many organisations, and I wanted to outline things that can still be done incorrectly within organisations in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that doesn’t get claimed on is.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times for systems without discussing with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objective and Recovery Point Objective);
  • Retention policies not set for business practice and legal/audit requirements.
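
To make the Recovery Point Objective point a little more concrete, here’s a minimal sketch (in Python) of checking a backup history against an agreed RPO. The function names, the dates and the 24-hour RPO are all hypothetical and purely illustrative; they don’t come from any particular backup product.

```python
from datetime import datetime, timedelta

def worst_case_data_loss(backup_times):
    """Return the longest gap between consecutive successful backups."""
    ordered = sorted(backup_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps) if gaps else None

def meets_rpo(backup_times, rpo):
    """True if no gap between successful backups exceeds the agreed RPO."""
    gap = worst_case_data_loss(backup_times)
    return gap is not None and gap <= rpo

# Hypothetical history: nightly backups, with one night missed.
backups = [
    datetime(2012, 1, 23, 22, 0),
    datetime(2012, 1, 24, 22, 0),
    datetime(2012, 1, 26, 22, 0),  # the backup on the 25th did not run
]
print(meets_rpo(backups, rpo=timedelta(hours=24)))
```

The point isn’t the code itself, but that an RPO only means something once it has been agreed with the system owner and is actually checked against what the backup system did.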

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they’re tasked with administering and protecting. And that’s entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product;
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in data corruption if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
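
The first of those problems is easy enough to detect if someone is actually looking for it. As a rough sketch only, assuming you can get at the finish time of each database dump and the start time of the filesystem backup (the function and the database names below are made up for illustration):

```python
from datetime import datetime

def uncaptured_dumps(dump_finish_times, filesystem_backup_start):
    """Return the databases whose dump finished after the filesystem
    backup had already started, i.e. dumps that may not be fully captured."""
    return sorted(
        name
        for name, finished in dump_finish_times.items()
        if finished > filesystem_backup_start
    )

# Hypothetical schedule: filesystem backup starts at 22:00.
fs_start = datetime(2012, 1, 30, 22, 0)
dumps = {
    "finance": datetime(2012, 1, 30, 21, 40),
    "crm": datetime(2012, 1, 30, 22, 25),  # overran into the backup window
}
print(uncaptured_dumps(dumps, fs_start))
```

A check like this only papers over the gap, of course; integrating the database backups into the backup product removes the race entirely.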

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet, in practically no organisation will administrators of those specific systems get such free rein over the data protection activities, keeping them silo’d off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

Some of the worst mistakes I’ve seen made in backup systems planning come from assuming that what is bought today for backup will be equally suitable next year, or three to five years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
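
As a simple illustration of why a straight-line assumption falls over, here’s a small sketch that projects capacity with a compound annual growth rate plus one-off adjustments for new projects or decommissioned systems. The function name, the 25% growth rate and the capacities are hypothetical, chosen only to show the shape of the calculation.

```python
def forecast_capacity(current_tb, annual_growth_rate, years, adjustments=None):
    """Project backup capacity (in TB) at the end of each year.

    adjustments maps a year number (1-based) to a one-off change in TB,
    positive for a new project, negative for decommissioned systems.
    """
    adjustments = adjustments or {}
    projection = []
    capacity = current_tb
    for year in range(1, years + 1):
        capacity = capacity * (1 + annual_growth_rate) + adjustments.get(year, 0.0)
        projection.append(capacity)
    return projection

# Hypothetical: 100 TB today, 25% annual growth, a 20 TB project landing in year 2.
print(forecast_capacity(100.0, 0.25, 3))
print(forecast_capacity(100.0, 0.25, 3, adjustments={2: 20.0}))
```

Re-running a projection like this whenever a project is approved or a system is retired is one inexpensive way of doing the regular reassessment.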

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.
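
The mechanics behind this don’t need to be complicated. Below is a bare-bones sketch of an issues register in which every failure stays outstanding until it’s either resolved or recorded as an agreed exception. The class and field names are my own invention for illustration, not a feature of any backup product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class BackupFailure:
    client: str
    detected: date
    description: str
    resolution: Optional[str] = None   # filled in once the failure is resolved
    accepted_exception: bool = False   # True if documented as an agreed exception

@dataclass
class IssuesRegister:
    entries: List[BackupFailure] = field(default_factory=list)

    def record(self, failure: BackupFailure) -> None:
        """Capture a failure; nothing is ever silently dropped."""
        self.entries.append(failure)

    def outstanding(self) -> List[BackupFailure]:
        """Failures that are neither resolved nor accepted as exceptions."""
        return [
            f for f in self.entries
            if f.resolution is None and not f.accepted_exception
        ]

    def zero_error_state(self) -> bool:
        """True only when every captured failure has been dealt with."""
        return not self.outstanding()

register = IssuesRegister()
register.record(BackupFailure("mailserver", date(2012, 1, 30), "backup timed out"))
print(register.zero_error_state())
```

Resolved entries stay in the register rather than being deleted, which is what gives you the catalogue for future reference.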

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies; without a doubt their processes play into disaster recovery planning, and play a fairly important part, too.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

As this is typically exemplified by organisations that don’t have formalised procedures, the easiest way to ensure new systems deployment allows for inclusion into backup strategies is to have build forms – where staff would not only request storage, RAM and user access, but also backup.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse; if anything, it points to quite the opposite requirement.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons, but also continuance and alignment. If this isn’t the case yet, initial steps towards that shift can be achieved by ensuring there’s an information protection advisory council within the business – a grouping of IT staff and core business staff.

 

RIP Old Backup Software

Much of what I deal with relates to active backup systems, but sometimes a backup system will reach an end-point in its lifecycle. To be fair, this isn’t something that should necessarily happen regularly. If chosen correctly, a backup system (particularly an enterprise one) should evolve with the needs of business. Indeed, it could be argued that in order to even be classified as an enterprise backup product, software must feature both growth and scalability so it can remain useful and relevant in a deployment.

That being said, there are still times when a company will decide to decommission a backup system. Reasons I’ve seen in the past include:

  1. Business is purchased by another company that has a backup software standard;
  2. Critical feature set<->requirements gap develops, necessitating re-evaluation;
  3. Backup product is discontinued (or subsumed by another product);
  4. OS platform shift necessitates a product change;
  5. New manager has a beef against existing product or vendor (sadly, while this shouldn’t come into play, it really does sometimes).

There are going to be other reasons from time to time, of course, but those represent the most common reasons I’ve seen (not in any real particular order, I should note).

These days it’s actually extremely rare to encounter a business that doesn’t have any long-term recovery requirements. (Indeed, typically businesses that believe they don’t have any long-term recovery requirements are mistaken.) Out of all my current customers, there’s only one that I can immediately think of that has short-term retention policies only and proof that’s all they need.

It’s the transitioning between backup products that sees us lose the insurance policy analogy. We can compare a lot of backup and recovery system operations to insurance policies – backing up is taking out the policy, recovery is making a claim, cloning your backups is like ensuring your policy is up to date and your insurer is liquid, and having a support contract is like making sure your insurer has an underwriter.

Switching backup products? You might say that it’s like switching insurance companies, except when you switch insurance companies you don’t have to keep your old policy around “just in case”. It’s a very rare situation to be able to switch without any legacy considerations.

And so, the net result when it comes time to decommission a backup product is that a full decommissioning may in fact take months, or even years, to complete, depending on the retention requirements on the backups.

When a backup environment is due to be decommissioned, you can typically choose one or more of the following actions:

  1. Migrate all backups, or at least the critical long-term ones, to the new product. This is typically a costly and fairly manual process involving recoveries and new backups, often requiring third party certification that no data was changed during the process, etc.;
  2. Maintain the old backup environment ‘as-is’, with appropriate support contracts, which may be costly;
  3. Maintain the old backup environment ‘as-is’, without support contracts (i.e., an Icarus support contract process), which will be risky;
  4. Virtualise the essential components of the backup environment, and reduce to a bare minimum the hardware requirements necessary for a recovery (e.g., replace a large tape library with just one or two standalone drives, etc.);
  5. Decommission the environment, archiving the requisite hardware and systems to facilitate a “cold” startup and recovery (possibly exporting the meta-data necessary for long-term backup tracking beforehand to facilitate those recoveries).

To be perfectly honest, none of these options is inherently ideal, and each carries its own risks, costs and compromises. (I believe the most flexible choice, if it’s available to the business, is virtualisation.)

If migration isn’t performed, then there’s another aspect to decommissioning which needs to be considered. Like everything to do with backups, the technology isn’t likely to be the biggest challenge; in this case, the challenge will centre around staff knowledge.

At the best of times, backup product expertise is best acquired by regular use of the product, and moving to a new product will obviously draw attention away from the old product. If a recovery needs to be performed three months after decommissioning, a backup administrator will likely have no issue performing that recovery. But after six months? Twelve months? Three years? People who are rusty with the product will work slower and are more likely to make mistakes.

The simple fact is that there’s no really easy way to decommission a backup system in favour of a new one. That lack of simplicity should, by rights, factor into any decision process relating to the decommissioning itself; namely:

  1. Will we migrate, decommission or retain a reduced, active form of the old system?
  2. What will be the costs associated with each option?
  3. What will be the risks associated with each option?
  4. What are the benefits (both direct and indirect) from the transition?
  5. Do the costs and risks of the transition outweigh the benefits?

The last question is not flippant – any decision to change a backup product must be closely and carefully weighed up. (This is why the “new manager hates vendor X/product Y and insists on change” transition reason is particularly challenging and unpleasant to deal with – there’ll likely be few, if any, benefits to that transition.)

Make sure that all of the above questions can be answered clearly and accurately; if they can’t, then in all likelihood the decommissioning will get very messy.

 

The cockatrice was a legendary beast: a two-legged dragon with the head of a rooster that could, amongst other things, turn people to stone with a glance. So it was somewhat similar to a basilisk, but a whole lot uglier, and it looked like it had been designed by a committee.

You may be surprised to know that there are cockatrice backup environments out there. Such an environment can be just as ugly as the mythical cockatrice, and just as dangerous, turning even a hardened backup expert to stone as he or she tries to sort through the “what-abouts?”, the “where-ares?” and the “who-does?”

These environments are typically quite organic, and have grown and developed over years, usually with multiple staff having been involved and/or responsible, but no one staff member having had sufficient ownership (or longevity) to establish a single unifying factor within the environment. That in itself would be challenging enough, but to really make the backup environment a cockatrice, there’ll also be a lack of documentation.

In such environments, it’s quite possible that the environment is largely acting like a backup system, but only through a combination of sheer luck and a certain level of procedural adherence, typically by operators who have remained in the environment for long enough. These are the systems for which, when the question “But why do you do X?” is asked, the answer is simply, “Because we’ve always done X.”

In this sort of system, new technologies have typically just been tacked on, sometimes shoe-horned into “pretending” they work just as the old systems did, and sometimes not used at their peak efficiency because of the general reluctance to change that such systems engender. (A classic example can be seen where a deduplication system is tacked onto an existing backup environment, but is treated like a standard VTL or a standard backup-to-disk region, without any consideration for the particularities involved in using deduplication storage.)

The good news is, these environments can be fixed, and turned into true backup systems. To do so, four decisions need to be made:

  1. To embrace change. The first essential step is to eliminate the “it’s always been done this way before” mentality. This doesn’t allow for progress, or change, at all, and if there’s one common factor in any successful business, it’s the ability to change. This applies not just to the business itself, but to each component of the business – and that includes backup.
  2. To assign ownership. A backup system requires both a technical owner and a management owner. Ideally, the technical owner will be the Data Protection Advocate for the company or business group, and the management owner will be both an individual, and the Information Protection Advisory Council. (See here.)
  3. To document. The first step to pulling order out of chaos (or even general disarray and disconnectedness) is to start documenting the environment. “Document! Document! Document!”, you might hear me cry as I write this line – and you wouldn’t be too far wrong. Document the system configuration. Document the rebuild process. Document the backup and recovery processes. Sometimes this documentation will be references to external materials, but a good chunk of it will be material that your staff have to develop themselves.
  4. To plan. Organic growth is fine. Uncontrolled organic or haphazard growth is not. You need to develop a plan for the backup environment. This will be possible once the above aspects have been tackled, but two key parts to that plan should be:
    • How long will the system, in its current form, continue to service our requirements?
    • What are some technologies we should be starting to evaluate now, or at least stay abreast of, for consideration when the system has to be updated?

With those four decisions made, and implemented, the environment can be transfigured from a hodge-podge of technologies with no real unifying principle other than conformity to prior usage patterns into a collection of synergistic tools working seamlessly to optimise the data backup and recovery operations of the company.

 

Resolutions Check-in

In December last year I posted “7 new years backup resolutions for companies”. Since it’s the end of January 2012, I thought I’d check in on those resolutions and suggest where a company should be up to on them, as well as offering some next steps.

  1. Testing – The first resolution related to ensuring backups are tested. By now at least an informal testing plan should be in place if none were before. The next step will be to deal with some of the aspects below so as to allow a group to own the duty of generating an official data protection test plan, and then formalise that plan.
  2. Duplication – There should be documented details of what is and what isn’t duplicated within the backup environment. Are only production systems duplicated? Are only production Tier 1 systems duplicated? The first step towards achieving satisfactory duplication/cloning of backups is to note the current level of protection and expand outwards from that. The next step will be to develop tier guidelines to allow a specification of what type of backup receives what level of duplication. If there are already service tiers in the environment, this can serve as a starting point, slotting existing architecture and capability onto those tiers. Where existing architecture is insufficient, it should be noted and budgets/plans should be developed next to deal with these short-falls.
  3. Documentation – As I mentioned before, the backup environment should be documented. Each team that is involved in the backup process should have assigned at least one individual to write documentation relating to their sections (e.g., Unix system administrators would write Unix backup and recovery guidelines, Windows system administrators would do the same for Windows, and so on). This should actually include 3 people: the writer, the peer reviewer, and the manager or team leader who accepts the documentation as sufficiently complete. The next step after this will be to hand over documentation to the backup administrator(s) who will be responsible for collation, contribution of their sections, and periodic re-issuing of the documents for updates.
  4. Training – If staff (specifically administrators and operators) had previously not been trained in backup administration, a training programme should be in the works. The next step, of course, will be to arrange budget for that training.
  5. Implementing a zero error policy – First step in implementing a zero error policy is to build the requisite documents: an issues register, an exceptions register, and an escalations register. Next step will be to adjust the work schedules of the administrators involved to allow for additional time taken to resolve the ‘niggly’ backup problems that have been in the environment for some time as the switchover to a zero error policy is enacted.
  6. Appointing a Data Protection Advocate – The call should have gone out for personnel (particularly backup and/or system administrators) to nominate themselves for the role of DPA within the organisation, or if it is a multi-site organisation, one DPA per site. By now, the organisation should be in a position to decide who becomes the DPA for each site.
  7. Assembling an Information Protection Advisory Council (IPAC) – Getting the IPAC in place is a little more effort because it’s going to involve more groups. However, by now there should be formal recognition of the need for this council, and an informal council membership. The next step will be to have the first formal meeting of the council, where the structure of the group and the roles of the individuals within the group are formalised. Additionally, the IPAC may very well need to make the final decision on who is the DPA for each site, since that DPA will report to them on data protection activities.

It’s worth remembering at this point that while these tasks may seem arduous at first, they’re absolutely essential to a well running backup system that actually meshes with the needs of the business. In essence: the longer they’re put off, the more painful they’ll be.

How are you going?

 

Continuing on from my post relating to dark data last week, I want to spend a little more time on data awareness classification and distribution within an enterprise environment.

Dark data isn’t the end of the story, and it’s time to introduce the entire family of data-awareness concepts. These are:

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote the previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data which is subsequently classified out of grey data state into a state where the data is known to have value, but is neither managed nor protected, because it can be recreated. It could be that the decision is made that the cost (in time) of recreating the data is less expensive than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but rather all the “bits” (no pun intended) that are left which are neither grey data, data nor utility data. In essence, this is irrelevant data, which someone or some group may be keeping for unnecessary reasons, and in actual fact should be considered eligible for either deletion or archival and deletion.
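
The definitions above can be read as a small decision procedure. Here’s a sketch of one possible reading in Python; the `Awareness` enum, the `classify` function and its parameters are names I’ve made up for illustration, and treating data that is known but only partly managed or protected as grey is my own assumption.

```python
from enum import Enum
from typing import Optional

class Awareness(Enum):
    DATA = "data"
    DARK = "dark data"
    GREY = "grey data"
    UTILITY = "utility data"
    NOISE = "noise"

def classify(known: bool, managed: bool, protected: bool,
             decision: Optional[Awareness] = None) -> Awareness:
    """Classify a data set by awareness.

    decision is the outcome of reviewing discovered data, if a review has
    happened: Awareness.UTILITY (valuable but recreatable) or Awareness.NOISE.
    """
    if not known:
        return Awareness.DARK      # not yet discovered or accounted for
    if managed and protected:
        return Awareness.DATA      # known, under ILM and ILP
    if decision in (Awareness.UTILITY, Awareness.NOISE):
        return decision            # explicitly classified out of grey
    return Awareness.GREY          # discovered, but no decision made yet

print(classify(known=True, managed=False, protected=False))
```

Note that grey is the fall-through state: data only leaves it when someone makes an explicit decision.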

The distribution of data by awareness within the enterprise may resemble something along the following lines:

Data Awareness Percentage Distribution

That is, ideally the largest percentage of data should be regular data which is known, managed and protected. In all likelihood for most organisations, the next biggest percentage of data is going to be dark data – the data that hasn’t been discovered yet. Ideally however, after regular and dark data have been removed from the distribution, there should be at most 20% of data left, and this should be broken up such that at least half of that remaining data is utility data, with the last 10% split evenly between grey data and noise.
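
Those rules of thumb can be checked against whatever figures an organisation comes up with for itself. The sketch below does just that; the function name and the sample percentages are hypothetical (they are not taken from the chart), and are there only to exercise the checks.

```python
def check_distribution(percentages):
    """Check a data awareness distribution against the rules of thumb above.

    percentages maps 'data', 'dark', 'utility', 'grey' and 'noise' to their
    share of all data, as percentages. Returns a list of problems found.
    """
    problems = []
    if abs(sum(percentages.values()) - 100.0) > 1e-9:
        problems.append("percentages do not sum to 100")
    if percentages["data"] < max(percentages.values()):
        problems.append("regular data is not the largest category")
    remainder = percentages["utility"] + percentages["grey"] + percentages["noise"]
    if remainder > 20.0:
        problems.append("more than 20% is left after regular and dark data")
    if percentages["utility"] < remainder / 2:
        problems.append("utility data is less than half of the remainder")
    return problems

# Hypothetical figures for illustration only.
sample = {"data": 60.0, "dark": 20.0, "utility": 10.0, "grey": 5.0, "noise": 5.0}
print(check_distribution(sample))
```

An empty list means the distribution fits the pattern described; anything else is a prompt to go looking for where the grey data and noise are accumulating.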

The logical implications of this layout should be reasonably straightforward:

  1. At all times the majority of data within an organisation should be known, managed and protected.
  2. It should be expected that at least 20% of the data within an organisation is undiscovered, or decentralised.
  3. Once data is discovered, it should exist in a ‘grey’ state for a very short period of time; ideally it should be reclassified as soon as possible into data, utility data or noise. In particular, data left in a grey state for an extended period of time represents just as dangerous a potential data loss situation as dark data.

It should be noted that regular data, even in this awareness classification scheme, will still be subject to regular data lifecycle decisions (archive, tiering, deletion, etc.). In that sense, primary data eligible for deletion isn’t really noise, because it’s previously been managed and protected; noise really is ex dark-data that will end up being deleted, either as an explicit decision, or due to a failure at some future point after the decision to classify it as ‘noise’, having never been managed or protected in a centralised, coordinated manner.

Equally, utility data won’t refer to, say, Q/A or test databases that replicate the content of production databases. These types of databases will again have fallen under the standard data umbrella in that there will have been information lifecycle management and protection policies established for them, regardless of what those policies actually were.

If we bring this back to roles, then it’s clear that a pivotal role of both the DPAs (Data Protection Advocates) and the IPAC (Information Protection Advisory Council) within an organisation should be the rapid coordination of classification of dark data as it is discovered into one of the data, utility data or noise states.

© 2012 The NetWorker Blog