Jan 24 2017

In 2013 I undertook the endeavour to revisit some of the topics from my first book, “Enterprise Systems Backup and Recovery: A Corporate Insurance Policy”, and expand it based on the changes that had happened in the industry since the publication of the original in 2008.

A lot had happened since that time. At the point I was writing my first book, deduplication was an emerging trend, but tape was still entrenched in the datacentre. While backup to disk was an increasingly common scenario, it was (for the most part) mainly used as a staging activity (“disk to disk to tape”), and backup to disk use was either dumb filesystems or Virtual Tape Libraries (VTL).

The Cloud, seemingly ubiquitous now, was still emerging. Many (myself included) struggled to see how the Cloud was any different from outsourcing with a bit of someone else’s hardware thrown in. Now, core tenets of Cloud computing that made it so popular (e.g., agility and scalability) have been well and truly adopted as essential tenets of the modern datacentre, as well. Indeed, for on-premises IT to compete against Cloud, on-premises IT has increasingly focused on delivering a private-Cloud or hybrid-Cloud experience to their businesses.

When I started as a Unix System Administrator in 1996, at least in Australia, SANs were relatively new. In fact, I remember around 1998 or 1999 having a couple of sales executives from this company called EMC come in to talk about their Symmetrix arrays. At the time the datacentre I worked in was mostly DAS with a little JBOD and just the start of very, very basic SANs.

When I was writing my first book the pinnacle of storage performance was the 15,000 RPM drive, and flash memory storage was something you (primarily) used in digital cameras only, with storage capacities measured in the hundreds of megabytes more than gigabytes (or now, terabytes).

When the first book was published, x86 virtualisation was well and truly growing into the datacentre, but traditional Unix platforms were still heavily used. Their decline and fall started when Oracle acquired Sun and killed low-cost Unix, with Linux and Windows gaining the ascendency – with virtualisation a significant driving force by adding an economy of scale that couldn’t be found in the old model. (Ironically, it had been found in an older model – the mainframe. Guess what folks, mainframe won.)

When the first book was published, we were still thinking of silo-like infrastructure within IT. Networking, compute, storage, security and data protection all as separate functions – separately administered functions. But business, having spent a decade or two hammering into IT the need for governance and process, became hamstrung by IT governance and process and needed things done faster, cheaper, more efficiently. Cloud was one approach – hyperconvergence in particular was another: switch to a more commodity, unit-based approach, using software to virtualise and automate everything.

Where are we now?

Cloud. Virtualisation. Big Data. Converged and hyperconverged systems. Automation everywhere (guess what? Unix system administrators won, too). The need to drive costs down – IT is no longer allowed to be a sunk cost for the business, but has to deliver innovation and, for many businesses, profit too. Flash systems are now offering significantly more IOPS than a traditional array could – Dell EMC for instance can now drop a 5RU system into your datacentre capable of delivering 10,000,000+ IOPS. To achieve ten million IOPS on a traditional spinning-disk array you’d need … I don’t even want to think about how many disks, rack units, racks and kilowatts of power you’d need.

The old model of backup and recovery can’t cut it in the modern environment.

The old model of backup and recovery is dead. Sort of. It’s dead as a standalone topic. When we plan or think about data protection any more, we don’t have the luxury of thinking of backup and recovery alone. We need holistic data protection strategies and a whole-of-infrastructure approach to achieving data continuity.

And that, my friends, is where Data Protection: Ensuring Data Availability is born from. It’s not just backup and recovery any more. It’s not just replication and snapshots, or continuous data protection. It’s all the technology married with business awareness, data lifecycle management and the recognition that Professor Moody in Harry Potter was right, too: “constant vigilance!”

Data Protection: Ensuring Data Availability

This isn’t a book about just backup and recovery because that’s just not enough any more. You need other data protection functions deployed holistically with a business focus and an eye on data management in order to truly have an effective data protection strategy for your business.

To give you an idea of the topics I’m covering in this book, here’s the chapter list:

  1. Introduction
  2. Contextualizing Data Protection
  3. Data Lifecycle
  4. Elements of a Protection System
  5. IT Governance and Data Protection
  6. Monitoring and Reporting
  7. Business Continuity
  8. Data Discovery
  9. Continuous Availability and Replication
  10. Snapshots
  11. Backup and Recovery
  12. The Cloud
  13. Deduplication
  14. Protecting Virtual Infrastructure
  15. Big Data
  16. Data Storage Protection
  17. Tape
  18. Converged Infrastructure
  19. Data Protection Service Catalogues
  20. Holistic Data Protection Strategies
  21. Data Recovery
  22. Choosing Protection Infrastructure
  23. The Impact of Flash on Data Protection
  24. In Closing

There’s a lot there – you’ll see the first eight chapters are not about technology, and for a good reason: you must have a grasp on the other bits before you can start considering everything else, otherwise you’re just doing point-solutions, and eventually just doing point-solutions will cost you more in time, money and risk than they give you in return.

I’m pleased to say that Data Protection: Ensuring Data Availability is released next month. You can find out more and order direct from the publisher, CRC Press, or order from Amazon, too. I hope you find it enjoyable.

10 NetWorker Configuration Rules

Dec 12 2011

Over the last 15 years, I’ve administered, configured and supported a huge number of NetWorker installs across a very broad range of business types including mining, finance, insurance, media, telecommunications, agriculture, education, government and health, just to name a small few.

As you may imagine, in my time I’ve picked up a few ways of working with NetWorker, and I thought I’d share some of these as my “golden” configuration rules. These are the things that I stick to, regardless of individual design considerations. I.e., don’t consider them to be design rules; they’re different again.

1. Don’t use the default resources

I make it a policy not to use the default resources that NetWorker provides. This isn’t to say that they’re not appropriate at times, but simply that you should own your own configuration. You should establish a naming standard for each of the core configuration items (groups, pools, policies, schedules, etc.) and use that to keep a consistent, uniform configuration, rather than mixing in bits of your own configuration and the default configuration.

Further, there are some default resources that you can’t modify – pools are a classic example. And in those cases, you want to be able to enable certain settings, such as auto media verify. Since you can’t modify bootstrap pools, you may as well start from scratch there.

Exception: Notifications. While there are a couple of notifications whose alerts you can’t change, for the most part, start by modifying the existing ones for things like savegroup completion, cleaning alerts, etc.

2. Don’t name groups after their start time

I see this time and time again – groups get named after their start time. If you have an entirely static and unchanging configuration, you may sometimes get away with this. However, for the most part, you’re going to need to be flexible on shuffling around group start times from time to time. E.g., you may need to pull a group forward five minutes, or push its start time back ten minutes, etc.

If the group is named after the start time, you’ve got two options:

  1. Give it a different start time to the name, making the configuration violate the law of least astonishment; or
  2. Create a new group, move the clients across to it, delete the old group, adjust pool configurations, etc.

Either way it’s messy and unpleasant. The best approach is to just not insert the group start time into the name of the group – after all, if you go into NMC and look at a listing of all groups, you’ll see the start time immediately anyway!

If for some reason you really need to include some form of time in the group name, keep it as fuzzy as possible; e.g., “pre midnight” or “post midnight” might be one way of doing it, or even just “Early” and “Late”.
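As a rough way of auditing an existing configuration against this rule, here is a minimal Python sketch that flags group names with a clock time embedded in them. The regular expression, the function name and the sample names are all assumptions made for illustration; nothing here is a NetWorker facility, and the pattern is a heuristic that will miss some formats and occasionally flag an innocent four-digit number.

```python
import re

# Heuristic (an assumption, not a NetWorker feature): treat "22:30", "2230"
# or "10pm" style fragments as an embedded start time.
TIME_PATTERN = re.compile(
    r"(?<!\d)(?:[01]\d|2[0-3])[:.]?[0-5]\d(?!\d)"   # 22:30, 22.30, 2230
    r"|(?<!\d)(?:1[0-2]|0?[1-9])\s*(?:am|pm)\b",     # 10pm, 1 am
    re.IGNORECASE,
)


def names_with_start_time(group_names):
    """Return the group names that appear to contain a clock time."""
    return [name for name in group_names if TIME_PATTERN.search(name)]


# Hypothetical group names, purely for illustration.
sample_groups = ["Daily_2230", "Servers 10pm", "Pre Midnight", "Early", "DB_Late"]
flagged = names_with_start_time(sample_groups)
print(flagged)
```

Fuzzy names such as “Pre Midnight” or “Early” pass untouched, which is the point: they stay true even after the start time is nudged by a few minutes.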

3. Use as few pools as possible

The more pools you have, the more media you’ll need to use (for the most part). It also introduces drive contention and makes performance tuning and tweaking of an environment more challenging. Therefore, keep pools to a minimum, focusing on using them for any/all of the following:

  • Segregating backups based on retention periods/frequency of backup (e.g., a “Daily” pool and a “Monthly” pool);
  • Segregating backups based on locality (e.g., “Daily Offsite” and “Daily Onsite”).

If you keep the number of pools you use in your environment to a minimum (while still having the number you need), you’ll have a much easier to maintain environment.

4. Avoid adjusting pools while the server is active

In the dim dark days of NetWorker history, you couldn’t edit pools while the server was backing up. Over time that restriction has been lifted. However, there are still all sorts of situations that can trigger NetWorker to log the dreaded message about pools being edited while the server was busy. And if this gets logged, your pool changes won’t take effect until you can stop and restart NetWorker.

The solution? Avoid it. Plan the changes that you need to make to pools, and slot them into change windows where backups will be minimised or not happen. Equally, design your solution around the knowledge that pool modifications while the server is active can be a bit painful – e.g., having clients and/or savesets explicitly specified in a pool selection criteria should be an exception, not a rule.

5. Always enable Monitor RAP

NetWorker has a facility to track changes to the configuration which is called “Monitor RAP” – it’s a server resource setting, and it’s disabled by default. Once you enable it though, a RAP log is generated in the server’s log directory which maintains details of everything that gets changed (either by an administrator, or NetWorker itself) in the configuration. This not only helps in any audit situation, it also lets you back-trace configuration changes and stay apprised of changes to the environment when you have more than one person with administrative privileges in the datazone.

6. Don’t use wildcards for the admin usergroup

No, don’t.

It’s that simple.

7. Use schedule overrides to establish better monthly schedules

When creating schedules where you need, say, monthly backups that skip all days of the month except the last Friday, switch out of calendar view and use fuzzy time definitions for overrides – e.g., an override of “full last friday every month”. It’ll save you a lot of hassle!

If you want to know how to do this, check out the examples in the Power User’s Guide to nsradmin micromanual.
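To see why a fuzzy override beats clicking through a calendar, it helps to look at the dates that “last Friday of every month” actually resolves to. The Python sketch below works them out; it only illustrates the date arithmetic and is not how NetWorker evaluates overrides internally. The function name is an assumption.

```python
import calendar
from datetime import date


def last_friday(year, month):
    """Return the date of the last Friday in the given month."""
    last_day = calendar.monthrange(year, month)[1]   # number of days in month
    end_of_month = date(year, month, last_day)
    # weekday(): Monday is 0 and Friday is 4. Step back to the nearest Friday.
    days_back = (end_of_month.weekday() - calendar.FRIDAY) % 7
    return end_of_month.replace(day=last_day - days_back)


# The day of the month shifts from month to month, which is exactly what
# makes maintaining these dates by hand in a calendar view tedious.
for month in range(1, 13):
    print(last_friday(2012, month))
```

A single fuzzy override covers every one of those dates, this year and every following year, without anyone having to revisit the schedule.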

8. Give jukeboxes sensible names

When you run a configuration wizard, NetWorker will name the jukebox by the SCSI port that it finds the jukebox on. This is all well and good, but that port isn’t necessarily static – it can be moved around due to various operating system changes, etc. What’s more, it’s usually not all that human-friendly in terms of remembering, etc.

However, you can temporarily disable the jukebox and rename it.

I tend to rename the jukebox to the model type – e.g., “i500” or something simple along those lines. This is also the case when there’s only one jukebox attached to each storage node – the “rd=hostname” component of the jukebox name will give you some separation between what’s configured on the storage node and what’s configured on the server. If you have multiple jukeboxes in the same location – particularly if they’re the same model, you might append numbers to the end of the names (e.g., DD_VTL1, DD_VTL2, etc.)

If you’ve got multiple similar jukeboxes in disparate locations (say, fibre-channel connected to a single host), you might include an abbreviation of the location in the jukebox name – e.g., “DD_VTL_DC1” and “DD_VTL_DC2” for “Data Centre 1” and “Data Centre 2” – you get the drift…

9. Use the comment field

I’ll sound like an old person here, but let me put it to you this way:

Us old time NetWorker users campaigned long and hard to get a comment field – use it, or you’re being ungrateful!

Seriously though, the comment field is a great way of recording easy-to-use annotations to help make your configuration even easier to understand. Don’t go crazy with it and try to encapsulate the entire configuration for each resource in its comment field, but use it like a good programmer would use code comments.

10. Don’t mix special and filesystem savesets in the same group

Sometimes I’ll see sites where you have the same client in a group twice; the first time the client has savesets of say:

  • /
  • /opt
  • /usr
  • /usr/local
  • /home

And the second instance of the client is for a module – e.g.,

  • RMAN:OracleDB_Details

You see, NetWorker doesn’t let you have two instances of a client in the same group when one of the clients has an “All” saveset; so, in the example above, by rights the first client instance should have an “All” saveset rather than an explicit listing of savesets. But when both clients are shoe-horned into the backups, you move to an inclusive rather than exclusive backup policy, and that’s just dangerous for data protection, and introduces a much higher risk of human error.

Not only that, EMC recommends against doing it anyway – for the reason above, and for the purposes of stability and performance.

11. And a bonus: Don’t have groups that start at the same time

Keep at least five minutes between start times for groups. This is 100% “best practice” and having multiple groups that start at the same time should be absolutely avoided. If you’re in a situation where you don’t have room to configure any more groups with a 5-minute gap between groups, then, well, you’ve got too many groups and you should look at consolidating them.
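If you want to check an existing set of groups against the five-minute rule, something along these lines would do it. This is a minimal Python sketch under stated assumptions: the group names and start times are invented, the times are assumed to be “HH:MM” strings that you have already pulled out of your configuration by whatever means you prefer, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=5)


def groups_too_close(start_times, min_gap=MIN_GAP):
    """Return (earlier, later) pairs of groups starting less than min_gap apart.

    start_times maps a group name to a start time string in "HH:MM" form.
    """
    ordered = sorted(
        (datetime.strptime(start, "%H:%M"), name)
        for name, start in start_times.items()
    )
    clashes = []
    # Compare each group with the next one in start-time order.
    for (t1, n1), (t2, n2) in zip(ordered, ordered[1:]):
        if t2 - t1 < min_gap:
            clashes.append((n1, n2))
    # Also compare the last group of the day with the first, across midnight.
    if len(ordered) > 1:
        (first_t, first_n), (last_t, last_n) = ordered[0], ordered[-1]
        if first_t + timedelta(days=1) - last_t < min_gap:
            clashes.append((last_n, first_n))
    return clashes


# Hypothetical groups, purely for illustration.
groups = {
    "Daily Servers": "21:00",
    "Daily Databases": "21:03",
    "Monthly": "22:00",
    "Late": "23:58",
    "Early": "00:01",
}
print(groups_too_close(groups))
```

Only neighbouring start times are compared, so three groups at the same time are reported as two overlapping pairs rather than every combination.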

The law of least astonishment and backups

Jan 15 2010

A recent Twitter posting by Matt over at Standalone Sysadmin reminded me of the law of least astonishment.

If you’re not familiar with this law/principle, and you work in IT (not to mention backup!), you should be. Over at Wikipedia, it’s defined thusly:

[W]hen two elements of an interface conflict, or are ambiguous, the behaviour should be that which will least surprise the human user or programmer at the time the conflict arises.

I can’t stress just how important it is that this rule is applied, both to general IT architecture, and to backups as a specific instance.

This is why, for instance, I recently covered the idea that if you can’t diagram your backup environment on the back of a napkin, it’s too complex.

The more arbitrarily complex a system is, the more chance there is of misunderstanding what it does. In data protection in particular, misunderstandings can lead to data loss. Thus, arbitrarily introducing complexity at the cost of comprehension is a very, very bad idea.

Take, for instance, a script that arbitrarily removes all indices for backups older than 3 months. No, I don’t know why you’d have such a script, but I want to use it as an example regardless. You don’t normally run this, but in an emergency, if a fileserver does an absolutely huge backup with millions upon millions of files day after day, you may periodically find yourself in the situation of needing to scrub old index data to reclaim space. (Obviously, there should be more space allocated to indices. I’m using this as an example, remember…)

You might think that for such a simple script, there’s no “law of least astonishment” to follow, but trust me, there is, and in this case, it’s all in the name.

Consider a few potential names for such a script:

  • index-maintenance
  • scrub-indices
  • clean-indices
  • purge-indices-3months-and-older

I would argue that all bar the last proposed script name are violations of the law of least astonishment. Why? The first 3 names could easily be misinterpreted by someone as doing something else. Who’s that someone? Maybe it’s a contractor who comes in when you’re unexpectedly sick for a month. Or maybe it’s a colleague who takes over when you’re away on holidays but you didn’t get a chance to train him or her before you left. Maybe it’s a new person you’re training.

Of course, backup and system administrators should review scripts before they run them, but let’s be honest: it doesn’t always happen. Some people as well will automatically run scripts/etc., with a “-h” option to see what they do (i.e., to get usage information), and if you haven’t programmed that in and your script just starts blowing away old indices, it’s not a good result.
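Here is a minimal sketch, in Python, of what a less astonishing version of such a script might look like: “-h” prints usage and exits without touching anything, and nothing is removed unless the caller explicitly asks for it. The “--really-purge” flag and the purge_old_indices() placeholder are assumptions made up for this illustration; the actual purge commands are deliberately left out.

```python
import argparse


def purge_old_indices():
    # Placeholder only: the real purge commands are site-specific and are
    # deliberately not part of this sketch.
    print("Purging index entries older than 3 months...")


def main(argv=None):
    parser = argparse.ArgumentParser(
        prog="purge-indices-3months-and-older",
        description="Permanently removes index entries for backups "
                    "older than 3 months. This cannot be undone.",
    )
    parser.add_argument(
        "--really-purge",
        action="store_true",
        help="actually delete; without this flag nothing is removed",
    )
    # argparse handles "-h"/"--help" itself: it prints usage and exits here.
    args = parser.parse_args(argv)

    if not args.really_purge:
        print("Nothing removed. Re-run with --really-purge to delete.")
        return 0

    purge_old_indices()
    return 0
```

Saved under the long name and wired up with a call such as sys.exit(main()), this means someone who runs the script bare, or with “-h” out of habit, gets an explanation rather than a smaller index.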

There is little – practically no – cost to using more meaningful script names. Sure, it means that you may have to type a little more, and maybe a few more bytes here and there are used in directory storage within filesystems, but this is so trivial it’s not worth talking about.

The benefits to using better naming structures though are significantly more pronounced – scripts are named by their function, which means a significant reduction in the chances that someone new to your system will accidentally run them when they shouldn’t, or misinterpret what they do.

In backup and in NetWorker, I’d argue that the law of least astonishment should be applied at every level of the system. This means that groups, policies, pools, schedules, etc. – all the configuration resources – should be named appropriately. Another way of considering it is that if you need a comment for every single resource, your system is too complex. Some resources should be completely obvious. Of course, comments are important at times, but that doesn’t mean that every single aspect of the system should be commented.

It also means when you’re documenting the system, or talking about the system, you should use the local nomenclature. I really dislike the complexity of the terms “cumulative incremental” and “differential incremental” in NetBackup, but when I’m talking NetBackup with people, I recognise that referring to them as “differentials” and “incrementals” respectively will just muddy the discussion. So I adjust to suit their nomenclature. Failing to follow the local nomenclature for a system just introduces more confusion and makes mistakes more likely. In terms of documentation, it means clearly following the local terms. If you can’t always follow those terms, it means you have to establish the exceptions from the outset, and periodically remind readers of them, so that chances of confusion are minimised. Preferably, deviating from the local terms should be avoided, but when it can’t be, it must be accounted for.

Within backup and system administration, one could argue that the primary purpose of the law of least astonishment is to eliminate, or at least substantially reduce, the risk of human errors. When people are confronted with one choice that’s clearly elucidated, they’re unlikely to choose the wrong thing. When they’ve got multiple choices, and they’re all clear as mud, the chances of them making the wrong choices or doing something that leads to error just keep on ramping up with each fork.