Aug 05 2017

It may be something to do with my long Unix background, or maybe it’s because my first system administration job saw me administer systems over insanely low link speeds, but I’m a big fan of being able to use the CLI whenever I’m in a hurry or just want to do something small. GUIs may be nice, but CLIs are fun.

Under NetWorker 8 and below, if you wanted to run a server initiated backup job from the command line, you’d use the savegrp command. Under NetWorker 9 onwards, groups are there only as containers, and what you really need to work on are workflows.

bigStock Workflow

There’s a command for that – nsrworkflow.

At heart it’s a very simple command:

# nsrworkflow -p policy -w workflow

That’s enough to kick off a backup job. But there are some additional options that make it more useful, particularly in larger environments. To start with, you’ve got the -a option, which I really like. That tells nsrworkflow you want to perform an ‘adhoc’ execution of a job. Why is that important? Say you’ve got a job you really need to run today but it’s configured to skip … running it in adhoc will disregard the skip for you.

The -A option allows you to specify specific overrides to actions. For instance, if I wanted to run a job workflow today from the command line as a full rather than an incremental, I might use something like the following:

# nsrworkflow -p Gold -w Finance -A "backup -l full"

The -A option there effectively allows me to specify overrides for individual actions – name the action (backup) and name the override (-l full).

Another useful option is -c component, which lets you run the job on just a single component or a small list of components – e.g., clients. Extending from the above, if I wanted to run a full for a single client called orilla, it might look as follows:

# nsrworkflow -p Gold -w Finance -c orilla -A "backup -l full"

Note that specifying the action there doesn’t mean it’s the only action you’ll run – you’ll still run the other actions in the workflow (e.g., a clone operation, if it’s configured) – it just means you’re specifying an override for the nominated action.
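If you end up wrapping these invocations in a script, the options covered so far can be assembled programmatically. What follows is a minimal Python sketch, not anything that ships with NetWorker: the function name build_nsrworkflow_args and its parameters are made up for illustration, it only uses the -p, -w, -a, -c and -A options described above, and it assumes -a is a bare switch that takes no argument.

```python
import shlex


def build_nsrworkflow_args(policy, workflow, component=None,
                           action_override=None, adhoc=False):
    """Return the argument list for one nsrworkflow run (hypothetical helper)."""
    args = ["nsrworkflow", "-p", policy, "-w", workflow]
    if adhoc:
        # Assumption: -a is a bare switch with no argument.
        args.append("-a")
    if component:
        # A single component, e.g. a client name.
        args += ["-c", component]
    if action_override:
        # The override travels as one argument, e.g. "backup -l full".
        args += ["-A", action_override]
    return args


args = build_nsrworkflow_args("Gold", "Finance", component="orilla",
                              action_override="backup -l full")
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(args))
```

Keeping the command as a list, rather than one long string, means a workflow name with a space in it (such as "Virtual Machines") stays a single argument when the list is handed to something like subprocess.run.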

For virtual machines, the way I’ve found easiest to start an individual client is using the vmid flag – effectively what the saveset name is for a virtual machine started via a proxy. Now, to get that name, you have to do a bit of mminfo scripting:

# mminfo -k -r vmname,name

 vm_name name

What you’re looking for is the vm:a-b-c-d set, stripping out the :vcenter at the end of the ID.
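That stripping step is easy to script. Here’s a minimal Python sketch, assuming the saveset name has the layout vm:<id>:<vcenter> as described above; the function name vm_component_id is hypothetical and the vCenter host name in the example is a placeholder.

```python
def vm_component_id(saveset_name):
    """Reduce 'vm:<id>:<vcenter>' to 'vm:<id>' (assumed name layout)."""
    parts = saveset_name.strip().split(":")
    if len(parts) < 2 or parts[0] != "vm" or not parts[1]:
        raise ValueError(f"not a vm: saveset name: {saveset_name!r}")
    # Keep only the 'vm' prefix and the ID; drop the vCenter suffix if present.
    return f"vm:{parts[1]}"


# The vCenter host name below is a placeholder, not a real system.
print(vm_component_id("vm:5029e15e-3c9d-18be-a928-16e13839f169:vcenter.example.com"))
```

A name that has already been stripped passes through unchanged, so the function is safe to apply twice.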

Now, I’m a big fan of not running extra commands unless I really need to, so I’ve actually got a Perl script which you’re free to download and adapt/use as you need to streamline that process. Since my lab is pretty basic, the script is too, though I’ve done my best to make the code straightforward. You simply run as follows:

[root@orilla bin]# -c krell

With ID in hand, we can invoke nsrworkflow as follows:

# nsrworkflow -p VMware -w "Virtual Machines" -c vm:5029e15e-3c9d-18be-a928-16e13839f169
133550:nsrworkflow: Starting Protection Policy 'VMware' workflow 'Virtual Machines'.
123316:nsrworkflow: Starting action 'VMware/Virtual Machines/backup' with command: 'nsrvproxy_save -s -j 705080 -L incr -p VMware -w "Virtual Machines" -A backup'.
123321:nsrworkflow: Action 'VMware/Virtual Machines/backup's log will be in '/nsr/logs/policy/VMware/Virtual Machines/backup_705081.raw'.
123325:nsrworkflow: Action 'VMware/Virtual Machines/backup' succeeded.
123316:nsrworkflow: Starting action 'VMware/Virtual Machines/clone' with command: 'nsrclone -a "*policy name=VMware" -a "*policy workflow name=Virtual Machines" -a "*policy action name=clone" -s -b BoostClone -y "1 Months" -o -F -S'.
123321:nsrworkflow: Action 'VMware/Virtual Machines/clone's log will be in '/nsr/logs/policy/VMware/Virtual Machines/clone_705085.raw'.
123325:nsrworkflow: Action 'VMware/Virtual Machines/clone' succeeded.
133553:nsrworkflow: Workflow 'VMware/Virtual Machines' succeeded.
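If the run is being driven from a script, output like the above is regular enough to check automatically. Here’s a minimal Python sketch (not part of NetWorker) that pulls out the per-action results. The function names are hypothetical, and only the ‘succeeded’ wording is taken from the run above; since the wording of other outcomes isn’t shown here, anything that isn’t ‘succeeded’ is simply flagged for review.

```python
import re

# Matches lines such as:
#   123325:nsrworkflow: Action 'VMware/Virtual Machines/backup' succeeded.
ACTION_LINE = re.compile(
    r"^\d+:nsrworkflow: Action '(?P<action>[^']+)' (?P<result>\w+)\.$"
)


def action_results(output):
    """Map each action path to the result word reported for it."""
    results = {}
    for line in output.splitlines():
        match = ACTION_LINE.match(line.strip())
        if match:
            results[match.group("action")] = match.group("result")
    return results


def needs_review(output):
    """List actions whose result is anything other than 'succeeded'."""
    return [a for a, r in action_results(output).items() if r != "succeeded"]


sample = """\
133550:nsrworkflow: Starting Protection Policy 'VMware' workflow 'Virtual Machines'.
123321:nsrworkflow: Action 'VMware/Virtual Machines/backup's log will be in '/nsr/logs/policy/VMware/Virtual Machines/backup_705081.raw'.
123325:nsrworkflow: Action 'VMware/Virtual Machines/backup' succeeded.
123325:nsrworkflow: Action 'VMware/Virtual Machines/clone' succeeded.
133553:nsrworkflow: Workflow 'VMware/Virtual Machines' succeeded.
"""
print(action_results(sample))
print(needs_review(sample))
```

The ‘log will be in’ lines and the final workflow line are deliberately not matched; only the per-action result lines are collected.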

Of course, if you are in front of NMC, you can start individual clients from the GUI if you want to:

Starting an Individual Client

But it’s always worth knowing what your command line options are!

Basics – Using the vSphere Plugin to Add Clients for Backup

 NetWorker, NVP, vProxy
Jul 24 2017

It’s a rapidly changing trend – businesses increasingly want the various Subject Matter Experts (SMEs) running applications and essential services to be involved in the data protection process. In fact, in the 2016 Data Protection Index, somewhere in the order of 93% of respondents said this was extremely important to their business.

It makes sense, too. Backup administrators do a great job, but they can’t be expected to know everything about every product deployed and protected within the organisation. The old way of doing things was to force the SMEs to learn how to use the interfaces of the backup tools. That doesn’t work so well. Like the backup administrators having their own sphere of focus, so too do the SMEs – they understandably want to use their tools to do their work.

What’s more, if we do find ourselves in a disaster situation, we don’t want backup administrators to become overloaded and a bottleneck to the recovery process. The more those operations are spread around, the faster the business can recover.

So in the modern data protection environment, we have to work together and enable each other.

Teams working together

In a distributed control model, the goal will be for the NetWorker administrator to define the protection policies needed, based on the requirements of the business. Once those policies are defined, enabled SMEs should be able to use their tools to work with those policies.

One of the best examples of that is for VMware protection in NetWorker. Using the plugins provided directly into the vSphere Web Client, the VMware administrators can attach and detach virtual machines from protection policies that have been established in NetWorker, and initiate backups and recoveries as they need.

In the video demo below, I’ll take you through the process whereby the NetWorker administrator defines a new virtual machine backup policy, then the VMware administrator attaches a virtual machine to that policy and kicks it off. It’s really quite simple, and it shows the power that you get when you enable SMEs to interact with data protection from within the comfort of their own tools and interfaces. (Don’t forget to ensure you switch to 720p/HD in order to see what’s going on within the session.)

Don’t forget – if you find the NetWorker Blog useful, you’ll be sure to enjoy Data Protection: Ensuring Data Availability.

Would you buy a dangerbase?

 Backup theory, Policies
Jun 07 2017

Databases. They’re expensive, aren’t they?

What if I sold you a Dangerbase instead?

What’s a dangerbase!? I’m glad you asked. A dangerbase is functionally almost exactly the same as a database, except it may be a little bit more lax when it comes to controls. Referential integrity might slip. Occasionally an insert might accidentally trigger a background delete. Nothing major though. It’s twenty percent cheaper, with only four times the risk of one of those pesky ‘databases’! (Oh, you might need 15% more infrastructure to run it on, but you don’t have to worry about that until implementation.)

Dangerbases. They’re the next big thing. They have a marketshare that’s doubling every two years! Two years! (Admittedly that means they’re just at 0.54% marketshare at the moment, but that’s double what it was last year!)

A dangerbase is a stupid idea. Who’d trust storing their mission critical data in a dangerbase? The idea is preposterous.

Sadly, dangerbases get considered all too often in the world of data protection.

Destroyed Bridge

What’s a dangerbase in the world of data protection? Here are just some examples:

  • Relying solely on an on-platform protection mechanism. Accidents happen. Malicious activities happen. You need to always ensure you’ve got a copy of your data outside of the original production platform it is created and maintained on, regardless of what protection you’ve got in place there. And you should at least have one instance of each copy in a different physical location to the original.
  • Not duplicating your backups. Whether you call it a clone or a copy or a duplication doesn’t matter to me here – it’s the effect we’re looking for, not individual product nomenclature. If your backup isn’t copied, it means your backup represents a single point of failure in the recovery process.
  • Using post-process deduplication. (That’s something I covered in detail recently.)
  • Relying solely on RAID when you’re doing deduplication. Data Invulnerability Architecture (DIA) isn’t just a buzzterm, it’s essential in a deduplication environment.
  • Turning your databases into dangerbases by doing “dump and sweep”. Plugins have existed for decades. Dump and sweep is an expensive waste of primary storage space and introduces a variety of risk into your data protection environment.
  • Not having a data lifecycle policy! Without it, you don’t have control over capacity growth within your environment. Without that, you’re escalating your primary storage costs unnecessarily, and placing strain on your data protection environment – strain that can easily break it.
  • Not having a data protection advocate, or data protection architect, within your organisation. If data is the lifeblood of a company’s operations, and information is money, then failing to have a data protection architect/advocate within the organisation is like not bothering with having finance people.
  • Not having a disaster recovery policy that integrates into a business continuity policy. DR is just one aspect of business continuity, but if it doesn’t actually slot into the business continuity process smoothly, it’s as likely to hinder as help the company.
  • Not understanding system dependencies. I’ve been talking about system dependency maps or tables for years. Regardless of what structure you use, the net effect is the same: the only way you can properly protect your business services is to know what IT systems they rely on, and what IT systems those IT systems rely on, and so on, until you’re at the root level.
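To make the system dependency point concrete, here’s a minimal Python sketch of walking a dependency map down to the root level. It’s an illustration only: the map, the system names and the function name all_dependencies are hypothetical, and a real map would come from whatever structure you already maintain.

```python
def all_dependencies(service, depends_on):
    """Return every system a service relies on, directly or indirectly."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, ()):
            if dep not in seen:
                # Record the dependency, then look at what it relies on in turn.
                seen.add(dep)
                stack.append(dep)
    return seen


# Hypothetical map: each key relies on the systems listed against it.
depends_on = {
    "payroll": ["app-server", "database"],
    "app-server": ["dns", "storage"],
    "database": ["storage", "dns"],
    "storage": [],
    "dns": [],
}
print(sorted(all_dependencies("payroll", depends_on)))
```

Everything in the returned set has to be protected, and recoverable, before the business service at the top can be considered protected.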

That’s just a few things, but hopefully you understand where I’m coming from.

I’ve been living and breathing data protection for more than twenty years. It’s not just a job, it’s genuinely something I’m passionate about. It’s something everyone in IT needs to be passionate about, because it can literally make the difference between your company surviving or failing in a disaster situation.

In my book, I cover all sorts of considerations and details from a technical side of the equation, but the technology in any data protection solution is just one aspect of a very multi-faceted approach to ensuring data availability. If you want to take data protection within your business up to the next level – if you want to avoid having the data protection equivalent of a dangerbase in your business – check my book out. (And in the book there’s a lot more detail about integrating into IT governance and business continuity, a thorough coverage of how to work out system dependencies, and all sorts of details around data protection advocates and the groups that they should work with.)

The lazy admin

 Best Practice, Policies, Scripting
Jul 11 2015

Are you an industriously busy backup administrator, or are you lazy?

Asleep at desk

When I started in IT in 1996, it wasn’t long before I joined a Unix system administration team that had an ethos which has guided me throughout my career:

The best sysadmins are lazy.

Even more so than system administration, this applies to anyone who works in data protection. The best people in data protection are lazy.

Now, there are two types of lazy:

  • Slothful lazy – What we normally think of when we think of ‘lazy’; people who just don’t really do much.
  • Proactively lazy – People who do as much as they can in advance in order to have more time for the unexpected (or longer term projects).

If you’d previously thought I’d gone nuts suggesting I’ve spent my career trying to be lazy (particularly when colleagues read my blog), you’ll hopefully be having that “ah…ha!” moment realising I’m talking about being proactively lazy. This was something I learnt in 1996 – and almost twenty years down the track I’m pleased to see whole slabs of the industry (particularly infrastructure and data protection) are finally following suit and allowing me to openly talk about the virtues of being lazy.

Remember that embarrassingly enthusiastic dance Steve Ballmer was recorded doing years and years ago at a Microsoft conference while he chanted “Developers! Developers! Developers!”? A proactively lazy data protection administrator chants “Automate! Automate! Automate!” in his or her head throughout the day.

Automation is the key to being operationally lazy yet proactively efficient. It’s also exactly what we see being the focus of DevOps, of cloud service providers, and massive scale converged infrastructure. So what are the key areas for automation? There are a few:

  • Zero error policies – I’ve been banging the drum about zero error policies for over a decade now. If you want the TL;DR summary, a zero error policy is the process of automating the review of backup results such that the only time you get an alert is when a failure happens. (That also means treating any new “unknown” as a failure/review situation until you’ve included it in the review process.)
  • Service Catalogues and Policies – Service catalogues allow standard offerings that have been well-planned, costed and associated clearly with an architected system. Policies are the functional structures that enact the service catalogue approach and allow you to minimise the effort (and therefore the risk of human error) in configuration.
  • Visual Dashboards – Reports are OK, notifications are useful, but visual dashboards are absolutely the best at providing an “at a glance” view of a system. I may joke about Infographics from time to time, but there’s no questioning we’re a visual species – a lot of information can be pushed into a few simple glyphs or coloured charts*. There’s something to be said for a big tick to indicate everything’s OK, or an equally big X to indicate you need to dig down a little to see what’s not working.
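The zero error policy idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a finished tool: it assumes each backup result has already been reduced to a (client, status) pair, the status words ‘succeeded’ and ‘failed’ are placeholders for whatever your backup product reports, and the function name alerts is made up.

```python
# Assumption: statuses have been normalised to simple words upstream.
KNOWN_OK = {"succeeded"}
KNOWN_FAILED = {"failed"}


def alerts(results):
    """Return only the results a human needs to look at."""
    flagged = []
    for client, status in results:
        if status in KNOWN_OK:
            continue  # Success is silent: no alert is raised.
        # Anything not explicitly known to be OK is reviewed, including
        # statuses the script has never seen before.
        reason = "failure" if status in KNOWN_FAILED else "unknown status"
        flagged.append((client, status, reason))
    return flagged


results = [
    ("orilla", "succeeded"),
    ("krell", "failed"),
    ("centaur", "interrupted"),  # Not in either known set.
]
for client, status, reason in alerts(results):
    print(f"REVIEW {client}: {status} ({reason})")
```

The important design choice is the default: a status the script has never seen lands in the alert list rather than being silently ignored, and it stays there until someone decides which set it belongs in.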

There’s potentially a lot of work behind achieving that – but there are shortcuts. The fastest way to achieving it is sourcing solutions that have already been built. I still see the not-built-here syndrome plaguing some IT environments, and while sometimes it may have a good rationale, it’s an indication of that perennial problem of companies thinking their use cases are unique. The combination of the business, the specific employees, their specific customers and the market may make each business potentially unique, but the core functional IT requirements (“deploy infrastructure”, “protect data”, “deploy applications”, etc.) are standard challenges. If you can spend 100% of the time building it yourself from the ground up to do exactly what you need, or you can get something that does 80% and all you have to do is extend the last 20%, which is going to be faster? Paraphrasing Isaac Newton:

If I have seen further it is by standing on the shoulders of giants.

As you can see, being lazy properly is hard work – but it’s an inevitable requirement of the pressures businesses now place on IT to be adaptable, flexible and fast. The proactively lazy data protection service provider can step back out of the way of business functions and offer services that are both readily deployable and reliably work, focusing his or her time on automation and real problem solving rather than all that boring repetitive busyness.

Be proudly lazy: it’s the best way to work.

* Although I think we have to be careful about building too many simplified reports around colour without considering the usability to the colour-blind.

Basics – Running VMware Protection Policies from the Command Line

 Basics, NetWorker, VBA
Mar 10 2015

If you’ve been adopting VMware Protection Policies via VBA in your environment (like so many businesses have been!), you’ll likely reach a point where you want to be able to run a protection policy from the command line. Two immediate example scenarios would be:

  • Quick start of a policy via remote access*
  • External scheduler control

(* May require remote command line access. You can tell I’m still a Unix fan, right?)

Long-term users of NetWorker will know a group can be initiated from the backup server by using the savegrp command. When EMC introduced VMware Protection Policies, they also introduced a new command, nsrpolicy.

The simplest way to invoke a policy is as follows:

# nsrpolicy -p policyName

For example:

[root@centaur ~]# nsrpolicy -p SqueezeProtect
99528:nsrpolicy: Starting Vmware Protection Policy 'SqueezeProtect'.
97452:nsrpolicy: Starting action 'SqueezeProtect/SqueezeBackup' with command: 'nsrvba_save -s centaur -j 544001 -L incr -p SqueezeProtect -a SqueezeBackup'.
97457:nsrpolicy: Action 'SqueezeProtect/SqueezeBackup's log will be in /nsr/logs/policy/SqueezeProtect/544002.
97461:nsrpolicy: Action 'SqueezeProtect/SqueezeBackup' succeeded.
99529:nsrpolicy: Vmware Protection Policy 'SqueezeProtect' succeeded.

There you go – it’s that easy.
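For the external scheduler scenario, something has to decide whether the run worked. Here’s a minimal Python sketch of a wrapper around the command above. It’s hedged accordingly: the function name run_policy and the runner parameter are hypothetical, the success message wording is copied from the sample run, and the idea that nsrpolicy exits non-zero on failure is an assumption, which is why the sketch insists on both signals before reporting success.

```python
import subprocess


def run_policy(policy, runner=subprocess.run):
    """Run 'nsrpolicy -p <policy>' and report whether it clearly succeeded."""
    proc = runner(["nsrpolicy", "-p", policy],
                  capture_output=True, text=True)
    # Check both streams, since it isn't stated which one carries the messages.
    output = (proc.stdout or "") + (proc.stderr or "")
    # Success message wording copied from the sample run above.
    marker = f"Vmware Protection Policy '{policy}' succeeded."
    # Assumption: nsrpolicy also exits non-zero on failure. Requiring both
    # signals means an unexpected result is never reported as success.
    return proc.returncode == 0 and marker in output


def fake_runner(args, **kwargs):
    # Stand-in for subprocess.run so the sketch can be tried without NetWorker.
    text = "99529:nsrpolicy: Vmware Protection Policy 'SqueezeProtect' succeeded.\n"
    return subprocess.CompletedProcess(args, 0, stdout=text, stderr="")


print(run_policy("SqueezeProtect", runner=fake_runner))
```

On a real backup server you’d call run_policy("SqueezeProtect") without the runner argument and have the scheduler treat a False result as a job that needs attention.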

Of accidental architectures

 Architecture, Backup theory
Jul 20 2013

Accidental architectures


EMC’s recent big backup announcements included a variety of core product suite enhancements in the BRS space – Data Domain got a substantial refresh, Avamar jumped up to v7, and NetWorker to 8.1. For those of us who work in the BRS space, it was like christmas in July*.

Anyone who has read my NetWorker 8.1 overview knows how much I’m going to enjoy working with that release. I’m also certainly looking forward to getting my hands on the new Data Domains, and it’ll be interesting to deep dive into the new features of Avamar 7, but one of the discussion points from EMC caught my attention more than the technology.

Accidental architecture.

Accidental architecture describes incredibly succinctly and completely so many of the mistakes made in enterprise IT, particularly around backup and recovery, archive and storage. It also perfectly encapsulates the net result of siloed groups and teams working independently and at times even at odds from one another, rather than synergistically meeting business requirements.

That sort of siloed development is a macrocosm of course of what I talk about in my book in section – the difference between knowledge-based and person-based groups, viz.:

[T]he best [group] is one where everyone knows at least a little bit about all the systems, and all the work that everyone else does. This is a knowledge-sharing group. Another type … is where everyone does their own thing. Knowledge sharing is at a minimum level and a question from a user about a particular system gets the response, “System X? See Z about that.” This is a person-centric group.

Everyone has seen a person-centric group. They’re rarely the fault of the people in the groups – they speak to a management or organisational failure. Yet, they’re disorganised and dangerous. They promote task isolation and stifle the development of innovative solutions to problems.

Accidental architecture comes when the groups within a business become similarly independent of one another. This happens at two levels – the individual teams within the IT arm, and it can happen at the business group level, too.

EMC’s approach is to work around business dysfunction and provide a seamless BRS experience regardless of who is partaking in the activity. The Data Domain plug-in for RMAN/Boost is a perfect example of this: it’s designed to allow database administrators to take control of their backup processes, writing Oracle backups with a Data Domain as target, completely bypassing whatever backup software is in the field.

Equally, the VMware vCenter plugins that allow provisioning of backup and recovery activities from within vSphere are about trying to work around the silos.

It’s an admirable goal, and I think for a lot of businesses it’s going to be the solution they’re looking for.

I also think it’s a goal that shouldn’t need to exist. EMC’s products help to mitigate the problem, but a permanent solution needs to also come from within business change.

Crossing the ravine

As I mentioned in Rage against the Ravine, a lot of the silo issues that exist within an organisation – effectively, the accidental architectures – result from the storage, virtualisation and backup/data protection teams working too independently. These three critical back-of-house functions are so interdependent that there is rarely any good reason to keep them entirely independent. In small to medium enterprises, they should be one team. In the largest of enterprises there may be a need for independent teams, but they should rotate staff between each other for maximised knowledge sharing, and they should be required to fully collaborate with one another.

In itself, that speaks again for the need of a stronger corporate approach to data protection, which requires the appointment of Data Protection Advisors and, of course, the formation of an Information Protection Advisory Council.

As I’ve pointed out on more than one occasion, technology is rarely the only solution:

Rest of the iceberg

Technology is the tip of the iceberg in an accidental architecture environment, and deploying new technology doesn’t technically solve the problem, it merely masks it.

EMC’s goal of course is admirable – empower each team to achieve their own backup and recovery requirements, and I’ll fully admit there’ll always be situations where it’s necessary, so it was a direction they had to take. That’s not to say they’re looking in the wrong direction – EMC isn’t a management consulting company, after all. A business following the EMC approach does, however, get a critical advantage: breathing space. When accidental architectures have led to a bunch of siloed deployments and groups within an organisation, those groups end up spending most of their time fighting fires rather than proactively planning in a way that suits the entire organisation. Slot the EMC product suite in and those teams can start pulling back from firefighting. They can start communicating, planning and collaborating more effectively.

If you’ve got an accidental architecture for data protection, your first stop is EMC BRS’s enablement of per-technology/team solutions. Then, once you’ve had time to regroup, your next stop is to develop a cohesive and holistic approach at the personnel, process and business function layer.

At that point … boy, will your business fly.

* The term “christmas in July”, if you’re not aware of it, is fairly popular in Australia in some areas. It’s about having a mock christmas party during our coldest part of the year, mimicking in some small way the sorts of christmas those in the Northern Hemisphere get every year.

Apr 12 2011

One of the stories I sometimes hear from companies is that some technology X doesn’t work in their environment because X sucks, or X is broken, or X … well, you get the picture.

Years ago, when I first got into backup, the main reasons I had to do recovery were due to system or hardware failures. Hard drive reliability was IMHO much lower, operating systems were frequently less stable, etc. Reliability was about getting to 99% availability, let alone 99.9% or anything grandiose like that.

These days, hardware/OS/app failure is, I’d suggest, one of the least likely reasons for a recovery being conducted in most organisations. Instead, it’s mainly related to soft issues – user error, audits, compliance checking, etc.

There’s a point here, and I’m almost ready to make it.

Back when I first started with backup, I’d have agreed that technology could be firmly blamed for a lot of errors. These days? Rarely – even when I blame it.

I periodically go on a rant about just how painful Linux is sometimes, but at the core I also admit that it’s a lack of training and time on my part – I’ve not made learning the ins and outs of Linux firewalls a field of study in the past, so now that I’m having to construct them by hand for a personal project it’s about as fun as tasering myself in the genitals. Technology is partly the problem – as is always the case with Linux, it’s designed for programmers and developers to manipulate, not for end users, or people like me who have concentrated on other things and just want the damn thing to work.

Ahem, where was I?

The simple fact is that we often blame technology because it’s easy. It’s like kids picking on the “easy target” at school with bullying; we bully technology and blame it for all our woes and issues because well, it doesn’t really fight back. (Hopefully we’ll get out of this habit before the singularity…)

As techos though, let’s be honest. The technology is rarely the issue. Or to be more accurate, if there’s an issue, technology is the tip of the iceberg – the visible tip. And using the iceberg analogy, you know I mean that technology is rarely going to be the majority of the issue.

The ‘issue’ iceberg in IT looks like this:

The issue iceberg

It’s probably best here that I stop and differentiate between issues and problems. A problem, to me, is an isolated or atomic failure – like a faulty tape drive, or a failed hard drive. They’re clearly technology related, but they’re not really issues. An issue is a deeper, systemic and compound failure. E.g., something like “on any one day, 30% of my backups fail”, or “Performance across all systems is generally 50% worse at end of month”, etc.

When technology gets blamed in those instances, I’m reminded of someone who, say, never has their car serviced, then when it eventually breaks down complains that the car was a lemon. Was it that the car failed the person, or more accurately that the person failed the car?

As I said, it’s easy to blame the thing that can’t defend itself.

In environments with ongoing, long-term issues, there reaches a point where you have to sit back and ponder – is the technology causing the issue, or is the environment causing the technology to have an issue?

The inevitable and hard truth is that in some cases, it’s the latter, not the former.

Let’s consider a basic scenario – the “on any given day 30% of our backups fail” scenario. So, does that mean that on any given day 30% of servers crash and reboot during the backup? Or does the backup software agent crash on 30% of servers when a backup is attempted? Maybe, in the most exceptional of circumstances, this may be the case.

In reality though? In reality we have to start looking at the rest of that iceberg:

Rest of the iceberg

High systemic failure rates, if attributed to the deployed technology, should result in a lawsuit. How often do we see that happening?

>cue the cicadas<

That’s right.

When there are systemic failure rates, a business must, eventually, turn to face the truth that they have to review their:

  • Policies – Are there any governing rules to the company which are contributing to the problem? For instance, does the company require the technology to be adapted in a way that it wasn’t designed for? These can be hard and real policies, or they can be implicitly allowed policies – such as empire building.
  • Processes – Are there operating methods which are triggering the issue? Imagine a business for instance where change control has become such a consuming process that backup failures are repeatedly allowed to occur because a change window isn’t available. Is that the fault of the backup technology?
  • People and Education – I’m not suggesting that staff at sites are incompetent. Far from it. Incompetent is such a harsh, unpleasant word that in the 15+ years I’ve been consulting, it’s been a very rarely used word. Education though is a factor. No, I’m not picking on people without tertiary skills, but training is a factor. For example, managers who have no day to day technical experience may decide that some technology, based on a half hour vendor pitch, is easy enough that staff won’t need training in it. If said staff then go on to, say, accidentally delete a LUN from a production server because they weren’t trained, how is that the fault of the SAN?

Navel gazing, introspection, call it what you will, it’s not always a pleasant task. It’s about objectively looking at how we’re doing things, and asking, “are we partly to blame?”

Yet, if you aren’t prepared to do this, you’re doomed (yes, doomed) to keep making the same mistake again, and again, and again. The pile of failed technology builds up, the quest for the silver bullet becomes more frenetic, and the chances of a major failure happening increase. In the worst scenarios, it can become decidedly toxic.

But it doesn’t need to be. Evaluating your processes, your policies and your people (particularly the training of your people) can be – well, cathartic. And the benefits to the business, in terms of literal cost savings and efficiencies, ensure that the introspection is well worth it.

As a consultant, you might assume that it’s my job to ensure that customers buy the best and the most expensive technology out there that I can sell them. That’s a cynical attitude that comes from a few shoddy operators. As a consultant, my job is to partner with you and your company and help you achieve your best. (If you think I’m just blowing smoke up your proverbial, check my “13 traits of a great consultant” article.)

Sometimes that means highlighting that there are issues, not problems, and those issues require a deeper fix than plugging in a new piece of technology.