Oct 02, 2015
 

Introduction

When NetWorker 8 was released I said at the time it represented the biggest consolidated set of changes to NetWorker in all the years I’d been working with it. It represented a radical overhaul and established the groundwork for further significant changes with NetWorker 8.1 and NetWorker 8.2.

Into the Future

NetWorker 9 – Leaping Into the Future

NetWorker 9 is not a similarly big set of changes: it’s a bigger set of changes.

There’s a good reason why it’s NetWorker 9. This year we celebrated the 25th birthday of NetWorker, and NetWorker has done an excellent job protecting data in those 25 years, but with the changing datacentre and changing IT environment, it was time for NetWorker to change again.

NetWorker 9 NMC Splash Screen

 

NetWorker 9 NMC Login

The changes are more than cosmetic, of course. (Much, much more.) A while ago I posted about the need for an evolved, modern approach to data protection activities – namely, orienting data protection policies and processes around service catalogues. This is something I’ve advocated for years, but it was also something I deliberately hinted at with a view towards what was coming with NetWorker 9.

The way in which we’ve configured backups in NetWorker for the last couple of decades has been much the same. When I started using NetWorker in 1996, it was by configuring groups, retention policies, schedules and clients. That’s changing.

A bright new world – Policies

NetWorker 9 represents a move towards a simpler, more containerised approach to configuration, with an emphasis on the service catalogue approach – and here’s what it looks like:

NetWorker 9 Configuration Engine

The changes in NetWorker 9 are sweeping – classic configuration components such as savegroups, scheduled staging and scheduled cloning are being replaced with a new policy engine that borrows much from the virtual machine protection engine introduced in NetWorker 8.1. This simultaneously makes it easier and faster to maintain data protection configurations, and develop more complex data protection configurations for the modern business. The policy engine is a containerised configuration system that makes it straightforward to identify and modify components of NetWorker configuration, and even have parts of the configuration dynamically adjust as required.

The core configuration process now in NetWorker 9 consists of:

  • A policy, which is a container for workflows
  • One or more workflows, which have:
    • A set of actions and
    • A list of data sources to run those actions against

If you’re upgrading NetWorker from an earlier version, your existing NetWorker configuration will be migrated for you into the new policy engine configuration. I’ll get to that in a little while. Before that though, we need to talk more about the policy engine.

Regardless of whether you’re setting up a brand new NetWorker server or upgrading an existing NetWorker server, you’ll get 5 default policies created for you:

  • Server Protection
  • Bronze
  • Silver
  • Gold
  • Platinum

Each of these policies does distinctly different things. (If you’re migrating, you’ll get some additional policies. More on that in a while.)

NetWorker 9 Protection Window

In this case, the server protection policy consists of two workflows:

  • NMC server backup – Performs a backup of the NetWorker management console database
  • Server backup – Performs a bootstrap backup and a media database expiration

You can see straight away that’s two entirely different things being done within the same policy. In the world of NetWorker 8.x and lower, each Group was effectively an atomic component that did only one particular thing. With policies, you’ve got a container that encapsulates multiple logically similar activities. For instance, let’s look at the difference between the default Bronze policy and the default Silver policy:

NetWorker 9 Bronze Policy

The Bronze policy has two workflows – one for Applications, and one for Filesystem backups. Each workflow does a backup to the Default pool (which of course you can change), and that’s it. By comparison, the Silver policy looks like the following:

NetWorker 9 Silver Policy

You can see the difference immediately – a Silver policy is about backing up, then cloning. The policy engine is geared very much towards a service catalogue design – set up a small number of policies with the required workflows and consolidate your configuration accordingly.

Oh – and here’s a cool thing about the visual policy engine – you can right-click within the visualisation of a policy to change settings, such as:

NetWorker 9 Right Clicking in Visual Policy

The policy engine is not a like-for-like translation from older versions of NetWorker configuration (though your existing configuration is migrated). For instance, here’s an “Emerald” policy I created on my lab server:

Sample policy with advanced cloning

That policy backs up to the Daily pool and then does something new for NetWorker – clones simultaneously to two different pools – “Site-A Clone” and “Site-B Clone”. There’s also something different about the selection process for what gets backed up. The group here is…

…wait, I need to explain Groups in NetWorker 9. Don’t think they’re like the old NetWorker groups. A group in NetWorker 9 is simply a selection of data sources. That could be a collection of clients, a collection of virtual machines, a collection of NAS systems or a collection of savesets (for cloning/staging). That’s it though: groups don’t start backups, control cloning, etc.

…the group here is a dynamic group. This is a new option for traditional clients. Rather than being an explicit list of clients, a dynamic group is assembled at the time the workflow is executed, based on a list of tags defined in the group. Any client with a matching tag is automatically included in the backup process. This allows hosts to be moved easily between different policies and workflows just by changing the tags associated with them. (Alternatively, a dynamic group might be configured to automatically select every client.)

NetWorker 9 Dynamic Groups
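
To make the tag idea a little more concrete, here’s a rough conceptual sketch of what tag-based selection amounts to. To be clear, this is just illustrative shell/awk – it is not NetWorker syntax, and the clients.tags file is an invented stand-in for the tags you’d actually assign to client resources within NetWorker:

#!/bin/sh
# Hypothetical input file (clients.tags), one client per line, tags comma-separated:
#   dbserver01 gold,oracle
#   webserver01 bronze
#   fileserver01 silver,filesystem
#
# Resolve membership for a workflow whose dynamic group carries the tag "gold":
TAG="gold"
awk -v tag="$TAG" '{
    split($2, tags, ",")
    for (i in tags) {
        if (tags[i] == tag) {
            print $1    # this client would be included in this run
            break
        }
    }
}' clients.tags

The point is simply that the membership list doesn’t exist until the workflow runs – it’s resolved from the tags at execution time rather than from a hard-coded client list.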

There’s a lot more to the policy engine than just what I’ve covered above, but there’s also a lot more I need to cover, so I’ll stop for now and come back to the new policy engine in more detail in a future blog post.

Policy Migration

Actually, there’s one other thing I’ll mention about policies before I continue, and that’s the policy migration process. When you upgrade a NetWorker server to NetWorker 9, your existing configuration is migrated (and as you might imagine this migration process is something that’s received a lot of attention and testing). Consider, for example, a “classic” NetWorker environment that consists of a raft of groups. On migration, each group is converted into a workflow of the same name and placed under a new policy called Backup. So a basic group list of, say, “Daily Dev Servers”, “Daily Filesystem” and “Monthly Filesystem” will get converted accordingly. Here’s what the group list looks like under v8 (with the default Default group):

NetWorker 8 Group List

Under version 9, this becomes the following policy and workflows:

NetWorker 9 Converted Policy

The workflow visualisation for the groups above converted into policy format is:

NetWorker 9 Converted Policy Workflow Visualisation

(By the way, that “Monthly Filesystem” workflow cloning to the “Default Clone” pool was just a lazy shortcut on my part while setting up a test server – not a migration error.)

I know lots of people tested some fairly hairy configuration migrations. If I recall correctly, the biggest configuration I tested had over 1000 clients defined and around 300 groups, schedules, etc., associated with those clients. I’d also used a whole bunch of shortcuts and tricks in schedules, and they all converted successfully.

The back-end changes

I’ll undoubtedly do some additional blog articles about the NetWorker 9 policy engine, but it’s time to move on to other topics and other changes within NetWorker. I’ll start with some back-end changes to the environment.

Media database

The “WISS” database format has been around for as long as I can recall with NetWorker. It’s served NetWorker well, but it’s also had some limitations. As of version 9, the NetWorker media database format is now SQLite, which gives NetWorker a big boost for performance and parallelisation of media activities. As per the policy engine, this migration happens automatically for you as part of the upgrade process. (Depending on the size of your media database this may take a little while to complete, but the media database is usually fairly small for most organisations.)

NetWorker Management Console (NMC) Database

Previous versions of NetWorker have used the Sybase embedded SQLAnywhere database for NMC. NetWorker version 9 switches the NMC database to PostgreSQL. If you want to keep your existing NMC database, you’ll need to take some pre-upgrade steps to export the Sybase embedded database content into a format that can be imported into the PostgreSQL database. Be sure to read the upgrade documentation – but you were going to do that anyway, right?

License Server

Other than the options around traditional vs NetWorker capacity vs DPS capacity, NetWorker licensing has remained mostly the same for the entire 19 years I’ve been dealing with it. The Legato License Manager was introduced some time ago, but it had mainly been pushed as a means of centralising management of traditional licensing across multiple datazones. Since the capacity formats aren’t so bothered about datazone counts, LLM usage has fallen away.

With a lot of customers deploying multiple EMC products and EMC moving towards transformative enterprise licensing models, a move to a new licensing service that can handle licensing for multiple products makes sense. On a day-to-day basis the licensing server won’t really change how you interact with NetWorker, but you’ll want to deal with your sales/pre-sales team or your integrator (depending on which way you procure NetWorker licenses) in order to prepare for the license changes. It’s not a change to the functionality of traditional vs capacity licenses, and it doesn’t signal a move away from traditional licenses either, but it is a much needed change.

Authentication System

NetWorker has by and large used OS provided user-authentication for authorisation. That might be localised on a per-system basis or it might leverage Active Directory/etc. This however left somewhat of a split between authorisation supported by NetWorker Management Console and authorisation supported from the command line. The new authentication system is effectively a single sign-on approach providing integrated authentication between NMC related activities and command line activities.

Restricted Data Zones

Restricted datazones get a few tweaks with NetWorker 9, too. I’ve had very little direct cause to use RDZs myself, so I’ll let the release notes speak for themselves on this front:

  • You can now associate an RDZ resource to an individual resource (for example, to a client, protection policy, protection group, and so on) from the resource itself. As a result, RDZ resources can no longer effect resource associations directly.

  • Non-default resources that were previously associated with the global zone, and therefore unusable by an RDZ, are now shared resources that can be used by an RDZ. However, these resources cannot be modified by restricted administrators.

If you’re using RDZs in your environment, be sure to understand the implications of the above changes as part of the upgrade process.

Scaling

With a raft of under-the-hood changes and enhancements, NetWorker servers – already highly scaleable – become even more scaleable. If your NetWorker environment has been getting large enough that you’ve considered deploying additional datazones, now is the time to talk to your local EMC teams to see whether you still need to go down that path. (Chances are you don’t.)

NetWorker Server Platform

There are actually very few environments left where the NetWorker server itself runs on what I’d refer to as “classic” Unix systems – i.e., Solaris, HPUX or AIX. As of NetWorker 9, the NetWorker server processes (and similarly, NMC processes) will now run only on Windows 64-bit or Linux 64-bit systems. This allows a concentration of development, leveraging the substantially (I’d say massively) reduced use of these platforms for better development efficiencies. However, NetWorker client support is still extremely healthy and those platforms are also still fully supported as storage nodes.

From a migration perspective, this is actually relatively easy to handle. EMC for some time has supported cross platform migration, wherein the NetWorker media database, configuration and index (i.e., the NetWorker server) is moved from say, Linux to Windows, Solaris to Linux, Solaris to Windows, etc. If you are one of those sites still using the NetWorker server services on Solaris, HPUX or AIX, you can engage cross platform migration services and transfer across to Windows or Linux. To keep things simple (I’ve done this dozens of times myself over the years), consider even keeping the old server around, renaming it and turning it into a storage node so you don’t really have to change any device connectivity. Then, elevate the backup server to a “director only” mode where it’s not actually doing any client backup itself. All up, this sort of transition can be seamlessly achieved in a very short period of time. In short: it may be a small interruption and change to your processes, but having executed it many times myself in the past, I can honestly say it’s a very small change in the grand scheme of things, and very manageable.

In summary, the options along this front if you’re using a non-Windows/non-Linux NetWorker server are:

  • Do a platform migration of your NetWorker server to Windows or Linux using your current NetWorker version, then upgrade to the new version
  • Stand up a new NetWorker datazone on Windows or Linux and retain the existing one for legacy recoveries, migrating clients across

I’m actually a big fan of the former rather than the latter – I really have done enough platform migrations to know they work well and they allow you to retain everything you’ve been doing. (IMHO the only reason to not do a platform migration is if you have a very short retention period for all of your backups and you want to start with a brand new configuration approach.)

(Cross platform migrations do have to be done by an authorised party – if you’re not sure who near you can do cross platform migrations, reach out to your local EMC team and find out.)

One more thing: with the additional services now running on a NetWorker server, you may need more RAM/CPU in your server. Check out the release notes for some details on this front. Environments that have been sized with room to spare likely won’t need to worry about this at all – but if you’ve got an environment where an older piece of hardware is running as your NetWorker server, you might need to increase its performance characteristics a little.

[Clarifying point: I’m only talking about the NetWorker server platform. Traditional Unix systems remain fully supported for storage nodes and clients.]

Cloning

NetWorker gets a performance and optimisation boost with cloning. Cloning has previously been a reasonably isolated process compared to regular save or recovery operations. With NetWorker 9, cloning is now a more integrated function, leveraging the in-place recovery technology implemented in NetWorker 8.2 to speed up cloning of synthetic backups.

This has some advantages relating to parallelising clones and limiting the need for additional nsrmmd processes to handle the cloning operation, and introduces scope for exciting changes in future versions of NetWorker, too.

With continuing advances in how you can configure and manage cloning from within NetWorker policies, manual command line driven cloning is becoming less necessary, but if you do still use it you’ll notice some difference in the output. For instance:

[root@sirius ~]# mminfo -q "name=/usr,savetime>=24 hours ago" -r ssid
4278951844
[root@sirius ~]# nsrclone -b "Site-A Clone" -S 4278951844
140988:nsrclone: launching backend job on host sirius.turbamentis.int
140990:nsrclone: Backend started: job Id(160004).
85401:nsrrecopy: Input client or saveset is NULL, information not updated in jobdb
09/30/15 18:48:04.652904 Clone pool size used:4
09/30/15 18:48:04.756405 Init Clone PARAMS: Network constant(73400320) Saveset computation overhead(2000000 microsec) Threshold(600000000 microsec) MIN-Threads(16) MAX-Threads(32)
09/30/15 18:48:04.757495 Adjust Clone param: Total overhead(50541397 microsec) Threshold(12635349 microsec) MIN-threads(1) MAX-Threads(4)
09/30/15 18:48:04.757523 Add New saveset group(0x0x3fe5db0): Group overhead(50541397 microsec) Num ss(1)
129290:nsrrecopy: Successfully established direct file retrieve session for save-set ID '4278951844' with adv_file volume 'Daily.001'.
09/30/15 18:49:30.765647 nsrrecopy exiting
140991:nsrclone: Backend exited: job Id(160004).
 [ORIGINAL REQUESTED SAVESETS]
4278951844;
 [CLONE SUCCESS SAVESETS]
4278951844/1443603606;

Note that while the command line output is a little different, the command line options remain the same, so your scripts can continue to work without change there. However, with enhanced support for concurrent cloning operations you’ll likely be able to speed up those scripts … or replace them entirely with new policies.
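
For instance, if you do still script your cloning, a minimal sketch built purely from the mminfo and nsrclone options shown above might look like the following. Treat the query and pool name as placeholders for your own environment, and note there’s no error handling or locking here:

#!/bin/sh
# Clone the last 24 hours of /usr backups to a clone pool.
POOL="Site-A Clone"
QUERY="name=/usr,savetime>=24 hours ago"

# Keep only numeric saveset IDs, in case mminfo emits a header line.
for ssid in $(mminfo -q "$QUERY" -r ssid | grep -E '^[0-9]+$')
do
    echo "Cloning ssid $ssid to pool '$POOL'"
    nsrclone -b "$POOL" -S "$ssid"
done

Of course, under NetWorker 9 the better long-term option is to express this as a clone action within a workflow and let the policy engine worry about scheduling and concurrency for you.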

Performance tuners win too

The performance tuning and optimisation guide has been gaining more detailed information over recent versions, and the one that accompanies NetWorker 9 is no exception. For example, there’s an entirely new section on TCP window size and network latency considerations, with a bunch of examples (and graphs) relating to the impact of latency on backup and cloning operations of varying sizes and filesystem densities. If you’re someone who likes to see what tuning and adjustment options there are in NetWorker, you’ll definitely want to peruse the new Performance Tuning/Optimisation guide, available with the rest of the reference documentation.

(On that front, NDMP has now been broken out into its own document: the NDMP User Guide. Keep an eye on it if you’re working with NAS systems.)

Additional Features

Block Based Backup (BBB) for Linux

Several Linux operating systems and filesystems now get the option of performing block based backups. This can significantly speed up the backup of large/dense filesystems – even more so than parallel save streams – by actually bypassing the filesystem entirely. It’s been available in Windows backups for a while now, but it’s hopped over the fence to Linux as well. Like the Windows variant, BBB doesn’t require image level recovery – you can do file level recovery from block based backups. If you’ve got really dense filesystems (I’m looking at large scale IMAP servers as a classic example), BBB could increase your backup performance by up to an order of magnitude.

Parallel Save Streams

Parallel Save Streams certainly aren’t forgotten about in NetWorker 9. There are now options to go beyond 4 parallel save streams per saveset for PSS enabled clients, and we’ve seen the introduction of automatic stream reclaiming, which will dynamically increase the number of active streams for a saveset already running in PSS mode to maximise the utilisation of client parallelism settings. (That’s a mouthful. The short: PSS is more intelligent and more reactive to fluctuations in used parallelism on clients.)

ProtectPoint

ProtectPoint is a pretty exciting new technology being rolled out by EMC across its storage arrays, integrating with Data Domain for the back-end storage. To understand what ProtectPoint does, consider a situation where you’ve got, say, a 100TB Oracle database sitting on a VMAX3 system, and you need to back it up as fast as possible with as little impact on the actual database server as possible. In conventional agent-based backups, it doesn’t matter what tricks and techniques you use to mitigate the amount of data flowing from the Oracle server to the backup environment – the Oracle server still has to read the data from the storage system. ProtectPoint is an application-aware and application-integrated system that allows you to seamlessly have the storage array and the Data Domain handle pretty much the entire backup, with the data transfer going directly from the storage array to the Data Domain. Suddenly that entire database-server read load associated with a conventional backup disappears.

NetWorker v9 integrates management of ProtectPoint policies in a very similar way to how NetWorker v8.2 introduced highly advanced NAS snapshot service integration into the data protection management. This further grows NetWorker’s capabilities in orchestrating the overall data protection process in your environment.

(There’s a good overview demo of ProtectPoint over at YouTube.)

NVE

Some people want to be able to stand up and completely control a NetWorker environment themselves, and others want to be able to deploy an appliance, answer a couple of questions, and have a fully functioning backup environment ready for use. NetWorker Virtual Edition (NVE) addresses the needs and desires of the latter. For service providers or businesses deploying remote office protection solutions, NVE will be a boon – and it won’t eat into any operating system licensing costs, as the OS (Linux) is bundled with the virtual machine template file.

Base vs Extended Client Installers

For Unix systems, NetWorker now splits out the client package into two separate installers – the base version and the extended version – lgtoclnt and lgtoxtdclnt respectively. You install the base client on clients that need to get fairly standard filesystem backups. It doesn’t include binaries like mminfo, nsrwatch or nsradmin – they’re now in the extended package. This allows you to keep regular client installs streamlined – particularly useful if you’re a service provider or dealing with larger environments.
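
As a rough illustration on an RPM-based Linux client – and treat the exact package file names as placeholders, since they vary by NetWorker version, platform and packaging format:

# Base client only, for hosts that just need standard filesystem backups:
rpm -ivh lgtoclnt-*.rpm

# On hosts where you also want utilities such as mminfo, nsrwatch and
# nsradmin (e.g., admin or utility hosts), add the extended package as well:
rpm -ivh lgtoxtdclnt-*.rpm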

VBA

There’s been a variety of changes made to the Virtual Backup Appliance (introduced in NetWorker 8.1), but the two I want to particularly single out are the two that users have mentioned most to me over the last 18 months or so:

  • Flash is no longer required for the File Level Recovery (FLR) web interface
  • There’s a command line interface for FLR

If you’ve been leery about using VBA for either of the above reasons, it’s time to jump on the bandwagon and see just how useful it is. Note that in order to achieve command line FLR you’ll need to install the basic NetWorker client package on the relevant hosts – but you need to get a binary from somewhere, so that makes sense.

Module Enhancements

Both the NetWorker Module for Microsoft Applications (NMM) and NetWorker Module for Databases and Applications (NMDA) have received a bunch of updates, including (but not limited to):

  • NMM:
    • Simpler use of VSS.
    • Block based support for HyperV and Exchange – yes, and Exchange. (This speeds up both types of backups considerably.)
    • Federated backups for SharePoint, allowing non-primary databases to be leveraged for the backup process.
    • I love the configuration checker – it makes getting NMM up and running with minimum effort so much easier. It’s been further enhanced in NetWorker 9 to grow its usefulness even more.
    • HyperV support for the Partial VSS writer – previously, if a single VM failed to back up under HyperV, the backup group running the process would register as a failure. Now the backups will continue and only the VM that fails to back up will be declared a failure. This aligns HyperV backups much more closely to traditional filesystem or VMware style backups.
    • Improved support for Federated backups of HyperV SMB 3 clusters.
    • File Level Recovery GUI for HyperV virtual machine backups.
    • Full integration of policy support for NMM.
  • NMDA:
    • Support for DDBoost over Fibre-Channel for AIX.
    • Full integration of policy support for NMDA.
    • Support for log-only backups for Lotus Notes systems.
    • NetWorker Snapshot Manager support for features like ProtectPoint.
    • Various DB2 enhancements/improvements.
    • Oracle RAC discovery in the NMC configuration wizards.
    • Optional use of a CONFIG_FILE parameter for RMAN scripts so you can put all the NMDA related customisations for RMAN backups into a single file (or small number of files) and keep that file/those files updated rather than having to make changes to individual RMAN scripts.

Policies, Redux

Before I wrap up: just one more thing. With the transition to a policy configuration engine, the nsrpolicy command previously introduced in NetWorker 8.1 to support Virtual Machine Protection Policies has been extensively enhanced to be able to handle all aspects of policy creation, configuration adjustment and policy/workflow execution. This does mean that if you’ve previously used nsradmin or savegrp to handle configuration/group execution processes, you’ll have to adjust some of your scripts accordingly. (It also means I’ll have to work on a new version of the Turbocharged NetWorker Administration Guide.)

Wrapping Up

I wasn’t joking at the start when I said NetWorker 9 represents the biggest set of changes I’ve ever seen in my 19 years of using NetWorker. What I will say is that these are necessary changes to prepare NetWorker for the rapidly changing datacentre. (Or even the rapidly changing datacenter if you’re so minded.)

This upgrade will require very careful review of the release notes and changed functionality, as well as potentially revisiting any automation scripts you’ve done in the past. (But you can do it.) If you’ve got a heavily scripted environment, my advice is to run up a test NetWorker 9 server and review your scripts against the changes, first evaluating whether you actually need to continue using those scripts, and then if you do, adjusting them accordingly. EMC has also prepared some video training for NetWorker 9 which I’d advise looking into (and equally I’d suggest leveraging your local EMC partner or EMC resources for the upgrade process).

It’s also an excellent time to consider revisiting your overall backup configuration and look for optimisations you can achieve based on the new policy engine and the service-catalogue approach. As I’ve been saying to my colleagues, this is the perfect opportunity to introduce policies that align to service catalogues that more precisely define and meet business requirements. If you’re not ready to do it from day zero, that’s OK – NetWorker will migrate your configuration and you’ll be able to continue to offer your existing backup and recovery services. But if you find the time to re-evaluate your configuration and reset it to a service catalogue approach, you can migrate yourself from being the “backup admin” to being the “data protection architect” within your organisation.

This is a big set of changes in NetWorker, but it’s also very much an exciting and energising set of changes, too.

As you might expect, this won’t be my only blog post on NetWorker 9 – it’s equally an energising time for me and I’m looking forward to diving into a variety of topics in more detail and providing some screen casts and videos of changes, upgrades and improvements.

(And don’t forget to wear your sunglasses: the future’s looking bright.)

The lazy admin

Jul 11, 2015
 

Are you an industriously busy backup administrator, or are you lazy?

Asleep at desk

When I started in IT in 1996, it wasn’t long before I joined a Unix system administration team that had an ethos which has guided me throughout my career:

The best sysadmins are lazy.

Even more so than system administration, this applies to anyone who works in data protection. The best people in data protection are lazy.

Now, there’s two types of lazy:

  • Slothful lazy – What we normally think of when we think of ‘lazy’; people who just don’t really do much.
  • Proactively lazy – People who do as much as they can in advance in order to have more time for the unexpected (or longer term projects).

If you’d previously thought I’d gone nuts suggesting I’ve spent my career trying to be lazy (particularly when colleagues read my blog), you’ll hopefully be having that “ah…ha!” moment realising I’m talking about being proactively lazy. This was something I learnt in 1996 – and almost twenty years down the track I’m pleased to see whole slabs of the industry (particularly infrastructure and data protection) are finally following suit and allowing me to openly talk about the virtues of being lazy.

Remember that embarrassingly enthusiastic dance Steve Ballmer was recorded doing years and years ago at a Microsoft conference while he chanted “Developers! Developers! Developers!”? A proactively lazy data protection administrator chants “Automate! Automate! Automate!” in his or her head throughout the day.

Automation is the key to being operationally lazy yet proactively efficient. It’s also exactly what we see being the focus of DevOps, of cloud service providers, and massive scale converged infrastructure. So what are the key areas for automation? There’s a few:

  • Zero error policies – I’ve been banging the drum about zero error policies for over a decade now. If you want the TL;DR summary, a zero error policy is the process of automating the review of backup results such that the only time you get an alert is when a failure happens. (That also means treating any new “unknown” as a failure/review situation until you’ve included it in the review process.) There’s a small shell sketch of this idea after this list.
  • Service Catalogues and Policies – Service catalogues allow standard offerings that have been well-planned, costed and associated clearly with an architected system. Policies are the functional structures that enact the service catalogue approach and allow you to minimise the effort (and therefore the risk of human error) in configuration.
  • Visual Dashboards – Reports are OK, notifications are useful, but visual dashboards are absolutely the best at providing an “at a glance” view of a system. I may joke about Infographics from time to time, but there’s no questioning we’re a visual species – a lot of information can be pushed into a few simple glyphs or coloured charts*. There’s something to be said for a big tick to indicate everything’s OK, or an equally big X to indicate you need to dig down a little to see what’s not working.
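
To give a flavour of the zero error policy idea mentioned in the first bullet, here’s a deliberately simple sketch. Everything in it – the results file, the list of known messages, the “failed” keyword and the mail command – is an assumption you’d replace with whatever your own reporting produces; the point is the shape of the logic: silence on success, an alert on failures and on anything not previously seen:

#!/bin/sh
# Zero error policy sketch: only speak up on failures or unknowns.
RESULTS=/nsr/reports/last_night.txt      # one backup result per line (assumed)
KNOWN=/nsr/reports/known_messages.txt    # patterns already reviewed and accepted
ALERT=backup-admins@example.com
TMP=/tmp/zep_review.$$

# Anything explicitly marked as failed needs review...
grep -i "failed" "$RESULTS" > "$TMP"

# ...and so does anything we've never seen before.
grep -v -f "$KNOWN" "$RESULTS" >> "$TMP"

# Only send an alert if there's something to review.
if [ -s "$TMP" ]
then
    mail -s "Backup review required" "$ALERT" < "$TMP"
fi
rm -f "$TMP"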

There’s potentially a lot of work behind achieving that – but there are shortcuts. The fastest way to achieve it is sourcing solutions that have already been built. I still see the not-built-here syndrome plaguing some IT environments, and while sometimes it may have a good rationale, it’s an indication of that perennial problem of companies thinking their use cases are unique. The combination of the business, the specific employees, their specific customers and the market may make each business potentially unique, but the core functional IT requirements (“deploy infrastructure”, “protect data”, “deploy applications”, etc.) are standard challenges. You can spend 100% of the time building it yourself from the ground up to do exactly what you need, or you can get something that does 80% and simply extend the last 20% – which is going to be faster? Paraphrasing Isaac Newton:

If I have seen further it is by standing on the shoulders of giants.

As you can see, being lazy properly is hard work – but it’s an inevitable requirement of the pressures businesses now place on IT to be adaptable, flexible and fast. The proactively lazy data protection service provider can step back out of the way of business functions and offer services that are both readily deployable and reliably work, focusing his or her time on automation and real problem solving rather than all that boring repetitive busyness.

Be proudly lazy: it’s the best way to work.


* Although I think we have to be careful about building too many simplified reports around colour without considering the usability to the colour-blind.

Jan 25, 2011
 

One of the core concepts I try to drive home in my book is that you don’t get a backup system by installing enterprise backup software.

Here’s a diagram to help explain what really goes into making a backup system:

Backup system

In short, you can have as much technology as you want, but without the rest of those pieces all you’ve got is a budget sink-hole.

If you want to understand how all these concepts fit together, you really should take the time to invest in my book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”.

Jan 14, 2011
 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abandon was at a University that instituted filesystem quotas but, due to their interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory and email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abandon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that has been deleted but is known to be required later builds to the point where it’s not physically possible to have all of it residing on primary storage any longer should it be required. All we do is create a new headache while implementing, at best, a crude workaround.

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart way to avoid archiving, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive 🙂

What a day!

Mar 30, 2010
 

This morning we went to the funeral of our best friends’ father. It was, as funerals go, a lovely service, and after the funeral and the burial we headed off to the wake, only to have someone’s Hilux slam into the driver’s side of our car on a tight bend. They’d skidded and come onto the wrong side of the road by just enough, given the tight corner, to make the impact. Thankfully speed, alcohol or drugs weren’t in play, just the wet, and even more importantly, no-one was injured. Dignity will be lost any time it’s driven without a passenger though – the driver’s door can’t be opened from the inside:

Alas, poor car, I hardly knew ye

The case is with the insurers, and we’re waiting for an assessment next Tuesday to find out whether the car will be repaired or written off. It would be a shame if it’s written off; it’s a Toyota Avalon, circa 2001, and while those cars were frumpy they were damn good cars. With only around 120,000km on the clock it’s not really all that old. About 3 or 4 years ago it was almost completely totalled in a massive hail storm on the central coast; as I recall the repair was in the order of about $12,000, and it only scraped through for repair on an insurance value of around $14,000. Now, with insurance of $7,500 and the repair estimate saying that it’ll top $5,000, age is against the car and it doesn’t look good.

But, this blog isn’t about my hassles, or my car.

It is however about insurance, and insurance is something I’ll be dealing with quite a bit over the coming days. Or I will be, once we hit next Tuesday and the car gets checked out by the assessors.

When we think of “backup as insurance”, there’s some fairly close analogies:

  • Backup is insurance because it’s about having a solution when something goes wrong;
  • Making a claim is performing a recovery;
  • Your excess is how easy (or hard) it is to make a recovery.

Given what’s happened today, it made me wonder what the analogy to “written off” is. That’s a little bit more unpleasant to deal with, but it’s still something that has to be considered.

In this case I’d suggest that the analogy for the insured item being “written off” is one of the following:

  • Having clones – seems simple, but if one recovery fails due to media, having clones that you can recover from instead is the cheapest, most logical solution.
  • Having an alternate recovery strategy – so for items with really high availability requirements or minimal data loss requirements, this would refer to having some other replica system in place.
  • Having insurance that can get you through the worst of events – sometimes no matter what you do to protect yourself, you can have a disaster that exceeds all your preparation. So in the absolute worst case scenario, you need something that will help you pay your bills, or ameliorate your building debt while you get yourself back on-board.

Of course, it remains preferable to not have to rely on any of these options, but the case remains that it’s always important to have an idea what your “worst case scenario” recovery situation will be. If you haven’t prepared for one, I’ll suggest what it’s likely to be: going out of business. Yes, it’s that critical that you have an idea what you’ll do in a worst-case scenario. It’s not called “business continuity” for the heck of it – when that critical situation occurs, not having plans usually results in the worst kind of failure.

Me? I’ll be visiting a few car-yards on the weekend to scope up what options I have in the event the car gets written off on Tuesday.

Mar 03, 2010
 

While I touched on this in the second blog posting I made (Instantiating Savesets), it’s worthwhile revisiting this topic more directly.

Using ADV_FILE devices can play havoc with conventional tape rotation strategies; if you aren’t aware of these implications, it could cause operational challenges when it comes time to do recovery from tape. Let’s look at the lifecycle of a saveset in a disk backup environment where a conventional setup is used. It typically runs like this:

  1. Backup to disk
  2. Clone to tape
  3. (Later) Stage to tape
  4. (At rest) 2 copies on tape

    Looking at each stage of this, we have:

    Saveset on ADV_FILE device

    The saveset, once written to an ADV_FILE volume, has two instances. The instance recorded as being on the read-only part of the volume will have an SSID/CloneID of X/Y. The instance recorded as being on the read-write part of the volume will have an SSID/CloneID of X/Y+1. This higher CloneID is what causes NetWorker, upon a recovery request, to seek the “instance” on the read-only volume. Of course, there’s only one actual instance (hence why I object so strongly to the ‘validcopies’ field introduced in 7.6 reporting 2) – the two instances reported are “smoke and mirrors” to allow simultaneous backup to and recovery from an ADV_FILE volume.

    The next stage sees the saveset cloned:

    ADV_FILE + Tape Clone

    This leaves us with 3 ‘instances’ – 2 physical, one virtual. Our SSID/CloneIDs are:

    • ADV_FILE read-only: X/Y
    • ADV_FILE read-write: X/Y+1
    • Tape: X/Y+n, where n > 1.

    At this point, any recovery request will still call for the instance on the read-only part of the ADV_FILE volume, so as to help ensure the fastest recovery initiation.
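
    If you want to see these instances for yourself, a simple mminfo query will list every SSID/CloneID pairing NetWorker holds for a given saveset – the SSID below is just a placeholder, and ssid, cloneid and volume are standard mminfo report attributes (add others such as sumflags or pool to taste):

    # mminfo -q "ssid=1234567890" -r "ssid,cloneid,volume"

    The instance showing the lowest CloneID is the one NetWorker will prefer when a recovery is requested.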

    At some future point, as disk capacity starts to run out on the ADV_FILE device, the saveset will typically be staged out:

    ADV_FILE staging to tape

    At the conclusion of the staging operation, the physical + virtual instances of the saveset on the ADV_FILE device are removed, leaving us with:

    Savesets on tape only

    So, at this point, we end up with:

    • A saveset instance on a clone volume with SSID/CloneID of: X/Y+n.
    • A saveset instance on (typically) a non-clone volume with SSID/CloneID of: X/Y+n+m, where m > 0.

    So, where does this leave us? (Or if you’re not sure where I’ve been heading yet, you may be wondering what point I’m actually trying to make.)

    Note what I’ve been saying each time – NetWorker, when it needs to read from a saveset for recovery purposes, will want to pick the saveset instance with the lowest CloneID. At the point where we’ve got a clone copy and a staged copy, both on tape, the clone copy will have the lowest CloneID.

    The net result is that NetWorker will, in these circumstances, when both tapes aren’t online, request the clone volume for recovery – even though in a great many cases this will be the volume that’s offsite.

    For NetWorker versions 7.3.1 and lower, there was only one solution to this – you had to hunt down the actual clone saveset instances NetWorker was asking for, mark them as suspect, and reattempt the recovery. If you managed to mark them all as suspect, then you’d be able to ‘force’ NetWorker into facilitating the recovery from the volume(s) that had been staged to. However, after the recovery you had to make sure you backed out of those changes, so that both the clones and the staged copies would be considered not-suspect.

    Some companies, in this situation, would instigate a tape rotation policy such that clone volumes would be brought back from off-site before savesets were likely to be staged out, with subsequently staged media sent offsite. This has a dangerous side-effect of temporarily leaving all copies of backups on-site, jeopardising disaster recovery situations, and hence it’s something that I couldn’t in any way recommend.

    The solution introduced around 7.3.2 however is far simpler – an mminfo flag called offsite. This isn’t to be confused with the convention of setting a volume location field to ‘offsite’ when the media is removed from site. Annoyingly, this flag remains unqueryable; you can set it, and NetWorker will use it, but you can’t, say, search for volumes with the ‘offsite’ flag set.

    The offsite flag has to be manually set, using the command:

    # nsrmm -o offsite volumeName

    (where volumeName typically equals the barcode).

    Once this is set, then NetWorker’s standard saveset (and therefore volume) selection criteria is subtly adjusted. Normally if there are no online instances of a saveset, NetWorker will request the saveset with the lowest CloneID. However, saveset instances that are on volumes with the offsite flag set will be deemed ineligible and NetWorker will look for a saveset instance that isn’t flagged as being offsite.

    The net result is that when following a traditional backup model with ADV_FILE disk backup (backup to disk, clone to tape, stage to tape), it’s very important that tape offsiting procedures be adjusted to set the offsite flag on clone volumes as they’re removed from the system.
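
    If your offsiting procedure is scripted, the change can be as small as the following sketch – the ejected_volumes.txt file is purely a placeholder for however your media handling process records the barcodes being sent away:

    #!/bin/sh
    # For each clone volume leaving site, set the offsite flag so NetWorker
    # prefers the onsite (staged) instance for any recovery requests.
    while read volume
    do
        echo "Flagging $volume as offsite"
        nsrmm -o offsite "$volume"
    done < ejected_volumes.txt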

    The good news is that you don’t normally have to do anything when it’s time to pull the tape back onsite. The flag is automatically cleared* for a volume as soon as it’s put back into an autochanger and detected by NetWorker. So when the media is recycled, the flag will be cleared.

    If you come from a long-term NetWorker site and the convention is still to mark savesets as suspect in this sort of recovery scenario, I’d suggest that you update your tape rotation policies to instead use the offsite flag. If on the other hand, you’re about to implement an ADV_FILE based backup to disk policy, I’d strongly recommend you plan in advance to configure a tape rotation policy that uses the offsite flag as cloned media is sent away from the primary site.


    * If you did need to explicitly clear the flag, you can run:

    # nsrmm -o notoffsite volumeName

    Which would turn the flag back off for the given volumeName.

    Who is your backup administrator, and who is your archive administrator?

    Mar 02, 2010
     

    My boss, on his blog, has raised a pertinent question – if it’s so important, according to some vendors, that backup and archive are all achieved through the same product interface, then how many companies out there assign the role of archive administrator to the backup administrator? (Or vice versa).

    I like this question; it’s kind of like the old conundrum of whether the dog wags the tail, or whether the tail wags the dog. That is, are companies that heavily push an integrated backup and archive interface:

    • Responding to the needs of IT to meet current desired business functionality, or,
    • Are they trying to drive IT in a way that perhaps doesn’t meet desired business functionality?

    (Or indeed, something else entirely).

    [Edit, further thoughts, 2010-03-03] I’ve been thinking more about this, and I have to say I can’t think of a single customer environment off-hand where the backup administrator is also responsible for archiving. Archiving seems to remain primarily the purview of the storage administration teams in sites that I’m aware of, so it does beg the question – how beneficial is an integrated backup and archive administration process?

    [Original wrap-up] So if you’ve got any thoughts on the integration of backup and archive administration, either at the software or the human resources layer, I’d encourage you to jump across to Mike’s blog and make your voice heard.

    (As a first, I’ve disabled comments on this blog posting, so as to encourage discussion to remain in one location – the source article.)

    Dec 28, 2009
     

    So you’re a busy backup administrator and you’re getting ready to go on leave. It’s 4pm on your final day before the holiday, you’ve finally got everything off your plate, and you think to yourself, “Now I’ve finally got the time, I’ll just quickly upgrade NetWorker before I leave.”

    This unfortunately is an alternative form of that Friday change rule violation known as POETS.

    There’s three distinctly wrong things with this scenario:

    • Infrastructure upgrade done without change control.
    • Infrastructure upgrade done at the last minute.
    • Infrastructure upgrade done without follow-up monitoring.

    Any one of those scenarios is enough to cause a nightmare situation – either for yourself, getting call-outs when you’re meant to be on holidays, or for your colleagues, left in the lurch after you switch your phone off for two weeks and go on a holiday to the East Islands.

    All three though? That’s just asking for trouble.

    (This lesson doesn’t actually just apply to NetWorker – it applies across the board for system, application and storage administration. Don’t modify the system just before going away for a while.)

    Just before this holiday season, I had a customer upgrade* their NetWorker server from 7.3.x to 7.5 before going on leave. Not 7.5.1, not 7.5.1.8, 7.5. This didn’t go so well, and a few days later when the fill-in administrators noticed the issue**, there was a bit of work to rectify the various issues and some backups during that time didn’t work.

    This however is by no means unique. Following Twitter I noticed one on-call person suffer a hideous xmas day and following day working on a call-out from what appeared to be an untested change done by someone else before that other person went on holiday.

    And non-betting man that I am, I’d bet a considerable wad of money (and win) that this fellow’s experience wasn’t unique for IT workers over xmas 2009.

    In short: choosing to do an untested/uncontrolled upgrade just before going on holidays can be either self-destructive or selfish (or even both) – it may lose you your holiday, depending on the level of the fail and the backup (or lack thereof) within your company, or it may cause a colleague to have an insufferably unpleasant time. (Alternatively, if you can be reached, it may result in you having a bad time on your holiday in order to help out a colleague having a bad time as well.)

    The problem with rushing through upgrades at the last minute is that they tend to be poorly done, even if they seem simple enough. Even if change control is being followed, if that change control has been rushed through (as it can sometimes be done as a “last minute” activity), then it provides no guarantee that the change will work smoothly. And don’t forget: Murphy’s Law works in the datacentre as well. Something that looks easy, that you should be able to do with your eyes closed, when done as a rush job at the last minute can come unstuck quite easily.

    So please – for your sake, for your colleagues sake, for NetWorkers’ sake and for the sake of your company: please don’t upgrade just before you go on holidays.


    * upgrade = “update” in NetWorker speak

    ** Which should serve as a reminder that you should never only have one backup administrator.

    Dec 14, 2009
     

    Over the years I’ve dealt with a lot of different environments, and a lot of different usage requirements for backup products. Most of these fall into the “appropriate business use” categories. Some fall into the “hmmm, why would you do that?” category. Others fall into the “please excuse my brain it’s just scuttled off into the corner to hide – tell me again” category.

    This is not about the people, or the companies, but the crazy ideas that sometimes get hold within companies that should be watched for. While I could have expanded this list to cover a raft of other things outside of backups, I’ve forced myself to just keep it to the backup process.

    In no particular order then, these are the crazy things I never want to hear again:

    1. After the backups, I delete all the indices, because I maintain a spreadsheet showing where files are, and that’s much more efficient than proprietary databases.
    2. We just backup /etc/passwd on that machine.
    3. But what about /etc/shadow? (My stupid response to the above statement, blurted out after my brain stalled in response to statement #2.)
    4. Oh, hadn’t thought about that (In response to #3).
    5. Can you fax me some cleaning cartridge barcodes?
    6. To save money on barcodes at the end of every week we take them off the tapes in the autochanger and put them on the new ones about to go in.
    7. We only put one tape in the autochanger each night. We don’t want <product> to pick the wrong tape.
    8. We need to upgrade our tape drives. All our backups don’t fit on a single tape any more. (By same company that said #7.)
    9. What do you mean if we don’t change the tape <product> won’t automatically overwrite it? (By same company that said #7 and #8.)
    10. Why would I want to match barcode labels to tape labels? That’s crazy!
    11. That’s being backed up. I emailed Jim a week ago and asked him to add it to the configuration. (Shouted out from across the room: “Jim left last month, remember?”)
    12. We put disk quotas on our academics, but due to government law we can’t do that to their mail. So when they fill up their home directories, they zip them up and email it to themselves then delete it all.
    13. If a user is dumb enough to delete their file, I don’t care about getting it back.
    14. Every now and then on a Friday afternoon my last boss used to delete a filesystem and tell us to have it back by Monday as a test of the backup system.
    15. What are you going to do to fix the problem? (Final question asked by an operations manager after explaining (a) robot was randomly dropping tapes when picking them from slots; (b) tapes were covered in a thin film of oily grime; (c) oh that was probably because their data centre was under the area of the flight path where planes are advised to dump excess fuel before landing; (d) fuel is not being scrubbed by air conditioning system fully and being sucked into data centre; (e) me reminding them we just supported the backup software.)

    I will say that numbers #1 and #15 are my personal favourites for crazy statements.

    Dec 07, 2009
     

    Something I’ve periodically mentioned to various people over the years is that when it comes to data protection, simplicity is King. This can be best summed up with the following rule to follow when designing a backup system:

    If you can’t summarise your backup solution on the back of a napkin, it’s too complicated.

    Now, the first reaction a lot of people have to that is “but if I do X and Y and Z and then A and B on top, then it’s not going to fit, but we don’t have a complex environment”.

    Well, there’s two answers to that:

    1. We’re not talking a detailed technical summary of the environment, we’re talking a high level overview.
    2. If you still can’t give a high level overview on the back of a napkin, it is too complicated.

    Another way to approach the complexity issue – if you happen to have a phobia about using the back of a napkin – is this: if you can’t give a 30 second elevator summary of your solution, it’s too complicated.

    If you’re struggling to think of why it’s important you can summarise your solution in such a short period of time, or such limited space, I’ll give you a few examples:

    1. You need to summarise it in a meeting with senior management.
    2. You need to summarise it in a meeting with your management and a vendor.
    3. You’ve got 5 minutes or less to pitch getting an upgrade budget.
    4. You’ve got a new assistant starting and you’re about to go into a meeting.
    5. You’ve got a new assistant starting and you’re about to go on holiday.
    6. You’ve got consultant(s) (or contractors) coming in to do some work and you’re going to have to leave them on their own.
    7. The CIO asks “so what is it?” as a follow-up question when (s)he accosts you in the hallway and asks, “Do we have a backup policy?”

    I can think of a variety of other reasons, but the point remains – a backup system should not be so complex that it can’t be easily described. That’s not to ever say that it can’t either (a) do complex tasks or (b) have complex components, but if the backup administrator can’t readily describe the functioning whole, then the chances are that there is no functioning whole, just a whole lot of mess.