I’ll presume for the moment that you’re aware of your actual end of support contract period. (Though, I’ll admit a lot of companies tend to lose track of this – something of ongoing concern.)

The question I’m really asking, though, is: do you know when support finishes for the products that you’re using? In order to have a smooth operating environment, you must know the cut-off point at which you’ll be:

  • Given assistance on that version or
  • Told to upgrade or
  • Told to patch or
  • Told to replace the product

Now, I’ll admit that the challenge here is particularly around the “Told to upgrade” part. Vendors (and EMC is no different) have a tendency to want you to always upgrade to the latest version to see if the problem goes away there. This (in my opinion) is only acceptable with a very small developer team (e.g., from a small company or individual developer), or if there are release notes or a known bug list that clearly states the problem will go away.

For NetWorker, your key to knowing your end of support dates with both the primary product and the modules comes from the PowerLink NetWorker Product Support Page. To summarise, currently the end of support dates for key NetWorker versions are:

  • Version 7.2 – Expired June 30, 2008. That’s why still being on 7.2 is not a great option. You can still get extended support*, but not for long.
  • Version 7.3 – Expired March 31, 2009. It’s about half-way through end of extended support.
  • Version 7.4 – Expires September 30, 2010. It’s time to at least start planning when you’re going to upgrade. I’m not suggesting you have to rush – quite the contrary!
  • Version 7.5 – Expires December 31, 2011.
  • Version 7.6 – Expires November 30, 2012.

You should have these sorts of end of support dates flagged in calendars, noted on post-it notes on the wall, tattooed onto your arm, or in general, recorded in such a way that you’ll continue to be aware of them.

As a general rule of thumb, I’d suggest that you should always aim to upgrade at least 3 months before end of primary support for a product. There’s a very important reason why I’d recommend that length of time: if there is a serious issue with the upgrade and you need to temporarily downgrade until it’s resolved, the version you drop back down to will continue to be fully supported in the interim.
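To make the rule of thumb concrete, here is a minimal sketch that turns the end of primary support dates listed above into "upgrade by" dates. The three-month lead time is the rule of thumb from this post; the function names are my own for illustration and aren't part of any NetWorker tooling.

```python
import calendar
from datetime import date

# End of primary support dates, as listed above for versions still in support.
END_OF_SUPPORT = {
    "7.4": date(2010, 9, 30),
    "7.5": date(2011, 12, 31),
    "7.6": date(2012, 11, 30),
}


def months_before(d, months):
    """Return the date `months` calendar months before d, clamping the day
    to the last day of the target month where necessary."""
    index = d.year * 12 + (d.month - 1) - months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def upgrade_deadlines(end_of_support, lead_months=3):
    """Map each version to the date by which an upgrade should be complete."""
    return {
        version: months_before(eos, lead_months)
        for version, eos in end_of_support.items()
    }


for version, deadline in sorted(upgrade_deadlines(END_OF_SUPPORT).items()):
    print(f"NetWorker {version}: aim to be upgraded by {deadline.isoformat()}")
```

Dates like these are exactly what belongs in the calendar as long range change control markers.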

Support providers – both vendors and third party ones – do, to varying degrees, tend to be fairly flexible, particularly in emergency situations. Remember though, backup is insurance. Running on an old, unsupported backup product is like taking out an insurance policy but then losing the paperwork. Sure, you’re covered, but you may not be able to make a claim when things go wrong.

End of life and end of support dates should effectively be long range markers in change control processes. If they’re not managed that far in advance, you run the risk of missing easy upgrade windows and instead having to do emergency upgrades without ample preparation.

For what it’s worth, none of this is specific to backup and recovery software. It applies equally to operating system software, or clustering software, or any other critical infrastructure software you may deal with. The message remains the same: always know your end of life/end of support dates.


* Extended support = pay more for running old versions.

 

There was a recent posting on the NetWorker mailing list regarding manual backups and whether they’re incrementals or not. The short answer of course is they’re not. The more challenging answer is whether or not you can actually generate a manual incremental backup.

You may think that from 7.5 onwards, where the level is expressly ignored for manual backups, this isn’t possible:

[root@tara ~]# save -l incr -b Default /tmp
Client initiated backup.Option '-l' is ignored and backup is performed at level adhoc

After all, in 7.4 and below, if you ran the above command, you wouldn’t have actually got an incremental backup of /tmp anyway – sure, it would have been tagged as an incremental backup, but that’s not the way a non-full backup is actually generated in NetWorker. You see, NetWorker needs a timestamp to base a non-full backup against. That timestamp is going to be the nsavetime of a previous backup. (For an incremental, it will be the nsavetime of whatever the most recent backup for the saveset was – for differentials, it may vary.)

I’ll walk through an example of getting an incremental manual backup. It will still be tagged in NetWorker as a manual backup (that’s just unavoidable these days), but it will at least be an incremental in effect. To start with, I need a full backup of something. I’ve got a full backup of my /usr/share directory as its own saveset here:

[root@tara ~]# mminfo -q "name=/usr/share" -r volume,level,sumsize,nsavetime
 volume          lvl   size  save time
800803L4        full 1244 MB 1263844861

Now, in order to be able to run a ‘manual’ incremental backup against this, I need to run save with a -t (for time) option – and the time I use will be 1263844861, which will back up all changes to that directory since the last backup.

So the command becomes:

[root@tara ~]# save -q -LL -t 1263844861 /usr/share
66135:save: NSR directive file (/.nsr) parsed
save: /usr/share  251 KB 00:00:20    588 files
completed savetime=1263880379

Note there that I haven’t included a level. If I had, even with the “-t” option included, NetWorker would have still generated the warning/error about ignoring the level for client initiated backups. However, I can confirm that it’s effectively an incremental backup by checking mminfo and looking at the sumsize field again:

[root@tara ~]# mminfo -q "name=/usr/share" -r volume,level,sumsize,nsavetime
 volume          lvl   size  save time
800803L4        full 1244 MB 1263844861
800803L4      manual 251 KB 1263880379

As you can see, we’ve got a full backup, and a subsequent manual backup that is effectively an incremental against the full.
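If you need to walk someone through this, the lookup of the most recent nsavetime can be scripted. What follows is a minimal sketch only: it assumes the mminfo report was generated with the same "-r volume,level,sumsize,nsavetime" field list used above (so that nsavetime is the last column), and it just builds the save command shown earlier rather than running it. The function names are mine, purely for illustration.

```python
def latest_nsavetime(mminfo_output):
    """Return the largest nsavetime found in the last column of the report."""
    times = []
    for line in mminfo_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose last field is not numeric.
        if fields and fields[-1].isdigit():
            times.append(int(fields[-1]))
    if not times:
        raise ValueError("no nsavetime values found in mminfo output")
    return max(times)


def manual_incremental_command(path, nsavetime):
    """Build the save command used above; no level option is passed."""
    return ["save", "-q", "-LL", "-t", str(nsavetime), path]


# The mminfo report from the walkthrough above, before the manual backup ran.
SAMPLE = """\
 volume          lvl   size  save time
800803L4        full 1244 MB 1263844861
"""

cmd = manual_incremental_command("/usr/share", latest_nsavetime(SAMPLE))
print(" ".join(cmd))
```

Taking the largest nsavetime means a second manual run would be based on the previous manual backup rather than the full, which matches how an incremental behaves.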

Where is this useful? I wouldn’t imagine that it’s something you should be making use of in normal operations. However, in an emergency, when there’s an upgrade about to be done and you need to walk someone through doing an incremental backup before the upgrade without giving them administrative access to the backup server, this would be the sort of technique that can come in handy.

 

Having recently encountered a situation where a NetWorker client on a customer site repeatedly failed its full backup, I wanted to take a few moments to stress the absolute importance – no, extreme criticality – of always being on top of your full backups.

Specifically:

  • You should always know whether your full backups have succeeded or not for each and every client of your backup system.
  • Unless there are specific management directives to the contrary, you should always re-run full backups in the event of failure as soon as possible.

To put it another way – a set of backups without a full, when it comes to performing a complete filesystem or system recovery, is about as useful as a chocolate teapot. Perhaps even less so.

I’ve described previously the importance of having a zero error policy, and always knowing if failures occur. So this topic could be summarised as being a subset of the zero error policy. However, if I were to be asked what backup I could “afford to lose” in terms of complete system recoverability, I’d pick an incremental any day over a full. (It’s actually a fine line, but it’s still an important differentiation.)

Without a full backup, at best you can pull back bits and pieces of a filesystem. Sure, they might be the most recently modified bits, which in themselves are important, but they’re not the entire filesystem. For most organisations, they barely touch the surface of the filesystem. Incrementals (and for that matter, differentials) are like the proverbial tip of the iceberg – perhaps without the penguins though*. The real monstrosity in a backup environment – the rest of the iceberg – are the fulls.

Let’s consider it this way – in most environments (discounting say, backups of database dump regions) you’ll find that an incremental backup covers somewhere between 5% and 10% of the filesystem. Not only that, the delta change on a day to day basis will also be quite small. That is, in many situations the files that are backed up each day in incremental backup regimes are the same files, modified day after day for working purposes. So while you may have incrementals of even up to 10% per day of your fulls, in turn 90% or more of those files may be the same files each day that are getting backed up in incrementals.

If we look at a 200GB filesystem though, even 10% of that filesystem is just 20GB. So if your full is somehow lost, that’s 180GB that you can’t readily recover. Additionally, the 20GB or so that you can recover is going to be a pig’s breakfast as far as getting it back in any consistent state.
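The arithmetic is worth spelling out, because extra days of incrementals don’t help much. Here is a rough sketch under a deliberately simple model of my own: each incremental covers 10% of the filesystem, and 90% of that is the same files as the day before, so only the remainder is new coverage. Real change rates will vary; the numbers are the illustrative ones from this post.

```python
def recoverable_without_full(fs_gb, daily_change=0.10, repeat_fraction=0.90, days=1):
    """Estimate the distinct data (in GB) covered by `days` of incrementals alone.

    Simplified model (an assumption, not a measurement): the first incremental
    covers daily_change of the filesystem, and each later one adds only the
    portion of its files that were not already backed up the day before.
    """
    first_day = fs_gb * daily_change
    new_each_later_day = first_day * (1 - repeat_fraction)
    return first_day + new_each_later_day * (days - 1)


fs_gb = 200
for days in (1, 6):
    covered = recoverable_without_full(fs_gb, days=days)
    print(f"{days} incremental(s): ~{covered:.0f}GB recoverable, "
          f"~{fs_gb - covered:.0f}GB lost without the full")
```

Under that model, almost a week of incrementals still leaves the overwhelming majority of the filesystem unrecoverable.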

NetWorker, through its use of saveset dependency chains, will do its utmost to protect you from regular saveset failures. If a full filesystem backup fails, subsequent incrementals will be chained onto the previous dependency set, retaining the previous full backup for a longer period of time.
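As a way of picturing that chaining, here is a toy model (my own simplification, not NetWorker’s actual retention logic) in which every successful incremental hangs off the most recent successful full. When a full fails, the chain from the previous full simply keeps growing.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    day: int
    level: str       # "full" or "incr"
    succeeded: bool


def dependency_chains(backups):
    """Group successful backups into chains, each starting at a successful full."""
    chains = []
    for backup in sorted(backups, key=lambda b: b.day):
        if not backup.succeeded:
            continue  # a failed backup contributes nothing to any chain
        if backup.level == "full":
            chains.append([backup])
        elif chains:
            chains[-1].append(backup)
        # An incremental with no earlier successful full has nothing to chain to.
    return chains


# Full on day 1, daily incrementals, then the day 8 full fails.
history = [Backup(1, "full", True)]
history += [Backup(day, "incr", True) for day in range(2, 8)]
history += [Backup(8, "full", False), Backup(9, "incr", True), Backup(10, "incr", True)]

for chain in dependency_chains(history):
    print(f"full from day {chain[0].day} is needed by {len(chain) - 1} incremental(s)")
```

In this example the day 1 full ends up with eight incrementals depending on it, which is why re-running the failed full promptly matters.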

It’s important we don’t let those dependency chains just keep building and building. They need to be broken and restarted so that we don’t get into messy situations or use up too much media. That’s why you should have a policy to rerun a full backup as soon as possible if it fails, rather than just waiting for the next one. (Further, I’ve far too often seen that sites with a “just wait until the next full backup runs” policy continually miss full backup failures, often for months at a time, because that sort of attitude also seems to be accompanied by informal record keeping.)

The next thing to consider is that we mustn’t just arbitrarily break dependency chains ourselves. By this, I’m referring to manually recycling media without regard to what may depend on that media, just because we need to free up volumes or have policies that media should be recycled after a certain length of time.

More than anything else, I see this as the reason companies find themselves in situations where NetWorker returns an “Unknown” volume being required for recovery. In this situation, NetWorker knows there should be a full backup, but it doesn’t have access to it, and therefore it can’t do anything to get the complete filesystem (or other type of data) recovered. Or, if there’s going to be a significant recovery error

Your full backups are like gold. No, gold isn’t special enough. Platinum, maybe. Or some combination of gold, platinum and saffron. They’re not to be cavalierly deleted, they’re not to be ignored, and they’re not to be left unchecked. (They’re not to be uncloned, either.)

In actual fact, it really doesn’t matter what your backup product is. What always matters is that your full backups are done, they’re done as soon as possible around the scheduled time, they’re successful, they’re known to be successful, and they’re successfully cloned. If any of those factors aren’t in play, you’ve got to get it fixed straight away.


* Unless they’re incrementals from a Linux system, of course.

 

Needing a few interesting things to read at the end of the week?

Here’s a few things I’ve found fascinating this week:

  • Why do IT operations suck? An insightful article by Steve O’Donnell. Steve asks why our staff who have primary involvement with systems 24×7 (operators) are often the least skilled, least trained and least paid. (As a consultant, I’ve frequently experienced companies who consider it a waste of time to properly train operators, and as a result their systems usually suffer for it.)
  • Over at Daring Fireball, John Gruber has an article called The Original Tablet. (It’s a great historical perspective on why Microsoft can’t exclusively claim ownership of the tablet idea.)
  • Like many others, I found Google’s slap in the face to China’s net censorship and cyber-warfare activities well timed and highly appropriate. On the other hand, others such as John Obeto over at Absolutely Windows found it not much more than petty PR. Somewhere in the middle is probably the whole story…
  • Over at IT Depends, I found Terri McClure’s views on Microsoft’s requirements for accessing their Azure SLAs to be the same as mine – staggeringly stupid. (According to Microsoft Fanboy site The Register, Microsoft are reviewing their decision on that one.)
  • Storagebod got me thinking again about Availability and Uptime with his article about how availability is measured.
  • Not technically reading, but I’ve finally jumped on board the growing number of listeners to Infosmack. This podcast is run by Greg Knieriemen and Marc Farley, and frequently has guests from many of the storage vendors and other storage bloggers. I’m really regretting that I haven’t been listening to it for longer. It’s definitely going to be a regular podcast for me from now on.
  • Over at Storage Monkeys, Sunshine Mugrabi’s article on EMC’s heavy involvement in social networking is definitely worth reviewing. (For what it’s worth, if you haven’t ever read it, you need to read The Cluetrain Manifesto if you think that all this social networking stuff is rubbish or just a passing fad. It isn’t. Written years before its time, The Cluetrain Manifesto is a clear and articulate series of essays about exactly how important social networking is.)
  • Finally, there have been some interesting discussions on VMware and application level VSS backups through VCB/vSphere. Check my posting here for the summary of the important links to be following about it.

Finishing up, a little about what you’ve been reading: the NetWorker Power Users Guide to nsradmin. The number of downloads has been staggering – far more than I hoped for, and I hope that, like the main blog, the guide proves useful to many a NetWorker administrator.

 

There’s currently a bug within NetWorker whereby if you’re using a 32-bit Windows client that has a filesystem large enough such that the savesets generated are larger than 2TB, you’ll get a massively truncated size reported in the savegroup completion. In fact, for a 2,510 GB saveset, the savegroup completion report will look like this:

Start time:   Sat Nov 14 17:42:52 2009
End time:     Sun Nov 15 06:58:57 2009

--- Successful Save Sets ---
* cyclops:Probe savefs cyclops: succeeded.
* cyclops:C:\bigasms 66135:(pid 3308): NSR directive file (C:\bigasms\nsr.dir) parsed
 cyclops: C:\bigasms               level=full,   1742 MB 13:15:56    255 files
 trash.pmdg.lab: index:cyclops     level=full,     31 KB 00:00:00      7 files
 trash.pmdg.lab: bootstrap         level=full,    213 KB 00:00:00    198 files

However, when checked through NMC, nsrwatch or mminfo, you’ll find that the correct size for the saveset is actually shown:

[root@trash ~]# mminfo
 volume        client       date      size   level  name
XFS.002        cyclops   11/14/2009 2510 GB   full  C:\bigasms
XFS.002.RO     cyclops   11/14/2009 2510 GB   full  C:\bigasms

The reporting doesn’t affect recoverability, but if you’re reviewing savegroup completion reports the data sizes will likely (a) be a cause for concern or (b) affect any auto parsing that you’re doing of the savegroup completion report.
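If you are auto parsing savegroup completion reports, one defensive option is to cross-check the reported sizes against mminfo. The following is a sketch only: the regular expressions are written against the two sample outputs shown above and will likely need adjusting for other report layouts, and the 5% tolerance is an arbitrary assumption of mine.

```python
import re

TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0, "TB": 1024.0 * 1024.0}

# Matches lines such as " cyclops: C:\bigasms   level=full,   1742 MB ..."
SAVESET_LINE = re.compile(
    r"^\s*(?P<client>[^\s:]+):\s+(?P<name>.+?)\s+level=(?P<level>\w+),\s+"
    r"(?P<size>\d+)\s+(?P<unit>[KMGT]B)\s"
)
# Matches default mminfo rows: volume, client, date, size, level, name.
MMINFO_LINE = re.compile(
    r"^(?P<volume>\S+)\s+(?P<client>\S+)\s+(?P<date>\S+)\s+"
    r"(?P<size>\d+)\s+(?P<unit>[KMGT]B)\s+(?P<level>\S+)\s+(?P<name>.+?)\s*$"
)


def sizes_in_mb(text, pattern):
    """Map (client, saveset name) to the largest size seen, in MB."""
    sizes = {}
    for line in text.splitlines():
        m = pattern.match(line)
        if m:
            key = (m["client"], m["name"])
            size_mb = int(m["size"]) * TO_MB[m["unit"]]
            sizes[key] = max(size_mb, sizes.get(key, 0.0))
    return sizes


def find_mismatches(report, mminfo, tolerance=0.05):
    """Return savesets whose reported size is well below the media database size."""
    reported = sizes_in_mb(report, SAVESET_LINE)
    recorded = sizes_in_mb(mminfo, MMINFO_LINE)
    return sorted(
        (client, name, reported[(client, name)], media_mb)
        for (client, name), media_mb in recorded.items()
        if (client, name) in reported
        and reported[(client, name)] < media_mb * (1 - tolerance)
    )


REPORT = r"""
* cyclops:Probe savefs cyclops: succeeded.
 cyclops: C:\bigasms               level=full,   1742 MB 13:15:56    255 files
 trash.pmdg.lab: index:cyclops     level=full,     31 KB 00:00:00      7 files
"""
MMINFO = r"""
 volume        client       date      size   level  name
XFS.002        cyclops   11/14/2009 2510 GB   full  C:\bigasms
XFS.002.RO     cyclops   11/14/2009 2510 GB   full  C:\bigasms
"""

for client, name, report_mb, media_mb in find_mismatches(REPORT, MMINFO):
    print(f"{client}:{name} reported {report_mb:.0f} MB, mminfo shows {media_mb:.0f} MB")
```

Savesets that appear in only one of the two outputs are ignored here; a production check would probably want to flag those as well.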

I’ve managed to secure a fix for 7.4.4 for this, with requests in to get it ported to 7.5.1 as well, and to get it integrated into the main trees for permanent inclusion upon the next service packs, etc. If you’ve been putting up with this problem for a while or have just noticed it and want it fixed, the escalation patch number was NW110493.

(It’s possible that this problem affects more than just 32-bit Windows clients – i.e., it could affect other 32-bit clients as well. I’d be interested in knowing if someone has spotted it on another operating system. I’d test, but my lab environment is currently otherwise occupied and generating 2+TB of data, even at 90MB/s, is a wee bit long.)

 

A recent Twitter posting by Matt over at Standalone Sysadmin reminded me of the law of least astonishment.

If you’re not familiar with this law/principle, and you work in IT (not to mention backup!), you should be. Over at Wikipedia, it’s defined thusly:

[W]hen two elements of an interface conflict, or are ambiguous, the behaviour should be that which will least surprise the human user or programmer at the time the conflict arises.

I can’t stress just how important it is that this rule is applied, both to general IT architecture, and to backups as a specific instance.

This is why, for instance, I recently covered the idea that if you can’t diagram your backup environment on the back of a napkin, it’s too complex.

The more arbitrarily complex a system is, the more chance there is of misunderstanding what it does. In data protection in particular, misunderstandings can lead to data loss. Thus, arbitrarily introducing complexity at the cost of comprehension is a very, very bad idea.

Say, for instance, you’ve got a script that removes all indices for backups older than 3 months. No, I don’t know why you’d have such a script, but I want to use it as an example regardless. You don’t normally run this, but in an emergency, if a fileserver does an absolutely huge backup with millions upon millions of files day after day, you may periodically find yourself in the situation of needing to scrub old index data to reclaim space. (Obviously, there should be more space allocated to indices. I’m using this as an example, remember…)

You might think that for such a simple script, there’s no “law of least astonishment” to follow, but trust me, there is, and in this case, it’s all in the name.

Consider a few potential names for such a script:

  • index-maintenance
  • scrub-indices
  • clean-indices
  • purge-indices-3months-and-older

I would argue that all bar the last of the proposed script names are violations of the law of least astonishment. Why? The name in the first 3 could easily be misinterpreted by someone to do something else. Who’s that someone? Maybe it’s a contractor that comes in when you’re unexpectedly sick for a month. Or maybe it’s a colleague who takes over when you’re away on holidays but you didn’t get a chance to train him or her before you left. Maybe it’s a new person you’re training.

Of course, backup and system administrators should review scripts before they run them, but let’s be honest: it doesn’t always happen. Some people will also automatically run scripts with a “-h” option to see what they do (i.e., to get usage information), and if you haven’t programmed that in and your script just starts blowing away old indices, it’s not a good result.

There is little – practically no – cost to using more meaningful script names. Sure, it means that you may have to type a little more, and maybe a few more bytes here and there are used in directory storage within filesystems, but this is so trivial it’s not worth talking about.

The benefits to using better naming structures though are significantly more pronounced – scripts are named by their function, which means a significant reduction in the chances that someone new to your system will accidentally run them when they shouldn’t, or misinterpret what they do.
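The “-h” point above is cheap to guard against as well. Here is a minimal sketch of how such a script might be structured so that it does nothing destructive unless explicitly told to. The purge step itself is a hypothetical placeholder (I’m not suggesting any particular NetWorker command for it), and the “--confirm” flag name is my own choice.

```python
import argparse


def build_parser():
    # argparse answers "-h"/"--help" with usage text and exits before any work is done.
    parser = argparse.ArgumentParser(
        prog="purge-indices-3months-and-older",
        description="Remove client file index entries for backups older than 3 months.",
    )
    parser.add_argument(
        "--confirm",
        action="store_true",
        help="actually purge; without this flag the script only says what it would do",
    )
    return parser


def purge_old_indices():
    # Hypothetical placeholder: the site-specific purge logic would live here.
    raise NotImplementedError("site-specific purge logic goes here")


def main(argv=None, purge=purge_old_indices):
    args = build_parser().parse_args(argv)
    if not args.confirm:
        print("Dry run: would purge indices for backups older than 3 months. "
              "Re-run with --confirm to proceed.")
        return 0
    purge()
    return 0

# A real script would finish by calling: raise SystemExit(main())
```

Running it with no arguments, or with “-h”, leaves the indices untouched, so the least surprising behaviour is also the default.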

In backup and in NetWorker, I’d argue that the law of least astonishment should be applied at every level of the system. This means that groups, policies, pools, schedules, etc. – all the configuration resources – should be named appropriately. Another way of considering it is that if you need a comment for every single resource, your system is too complex. Some resources should be completely obvious. Of course, comments are important at times, but that doesn’t mean that every single aspect of the system should be commented.

It also means when you’re documenting the system, or talking about the system, you should use the local nomenclature. I really dislike the complexity of the terms “cumulative incremental” and “differential incremental” in NetBackup, but when I’m talking NetBackup with people, I recognise that referring to them as “differentials” and “incrementals” respectively will just muddy the discussion. So I adjust to suit their nomenclature. Failing to follow the local nomenclature for a system just introduces more confusion and makes mistakes more likely. In terms of documentation, it means clearly following the local terms. If you can’t always follow those terms, it means you have to establish the exceptions from the outset, and periodically remind readers of them, so that chances of confusion are minimised. Preferably, departing from the local terms should be avoided, but when it can’t be, it must be accounted for.

Within backup and system administration, one could argue that the primary purpose of the law of least astonishment is to eliminate, or at least substantially reduce, the risk of human errors. When people are confronted with one choice that’s clearly elucidated, they’re unlikely to choose the wrong thing. When they’ve got multiple choices, and they’re all clear as mud, the chances of them making the wrong choices or doing something that leads to error just keep on ramping up with each fork.

 

In a previous article, I discussed how deduplication is one of those technologies that still straddles the gap between bleeding edge and leading edge, and thus needs to be classified as bleeding edge.

Putting aside the bleeding edge/leading edge argument for the moment (though my view there remains the same), a growing concern I have for deduplication is that it’s popping up everywhere in little islands rather than as a fully integrated option.

The net result? Dedupe on primary storage. Rehydrate to access. Modify, then dedupe to save again. Rehydrate for next access. Dedupe for saved changes. Rehydrate to backup. Dedupe the backup. Rehydrate for recovery.

All this dedupe is making me thirsty. Worse, it’s starting to look like a roller-coaster ride, and I always have the same reaction to them – horror, then an urge to throw up a little. The cycle doesn’t even look nice:

[Figure: Dedupe/Rehydrate Cycle]

So, what’s the solution?

There’s certainly no easy solution – and currently no integrated solution. Not without some serious consideration to standards. Let’s accept, for the moment, that there’s no real option to keep in-OS/RAM data deduplicated. (I.e., at the per-operating system level – maybe there would be at a cross-OS virtualisation level within the hypervisor, but we’re not really there yet.)

One obvious factor that springs to mind is that the first, best approach to some normalisation would be to come up with a technique to transfer deduped primary storage in its deduped format to a deduped backup storage. There are already techniques for synchronising deduplicated data (e.g., when replicating between say, two Data Domain hosts). Why rehydrate when the next step is going to be a new dedupe algorithm being applied, for instance?

If we look at NetWorker, there are a number of places where dedupe can happen, either as part of the backup cycle, or a larger strategy:

  • Primary storage deduplication via say, a Data Domain storage box or something along those lines.
  • Archive/single instance deduplication for less frequently accessed files (say, Centera).
  • Source based dedupe backup (via an Avamar node).
  • Dedupe VTL (Data Domain or the DL4000 with a deduplication add-on).

(No, I won’t put dedupe backup to disk there. Not until ADV_FILE starts working better.)

Within the EMC product kit, there’s a lot of chance for interoperability of deduplicated data without the need to rehydrate. If anything, EMC is one of the few vendors out there (HP and IBM are the only others that spring to mind) that offer reasonably complete verticals on storage, running from the base array to the backup solution.

Based on EMC’s strong focus on deduplication with the acquisition of both Avamar and Data Domain, it seems a distinct possibility that this is at least a part of their planning. Shifting deduplicated data between disparate products without needing to rehydrate does have potential to be a game changer in terms of how we work with data, but I’ll promise you this: you won’t see this level of integration this year, and possibly not for the next few years. That level of integration is not going to be easy, it’s not going to come quickly, and it’s going to require extreme levels of testing to make sure that it actually works when it is implemented.

So for the time being, we’ll have to continue to put up with deduplication being done in little islands within our IT environments, and continue to ride the deduplication/rehydration roller-coaster. Let’s hope we all don’t get sick before solutions start to appear.

 

Over at Storagebod, Martin Glassborow currently has a short and insightful post, How do you measure availability?

Martin’s point is thus:

If a vendors says that that their array is 99.999% available, what does that really mean to you? Probably not a lot in practical terms. Does it mean that individual components are 99.999% available? Or does it mean that the array itself in some shape or form is available?

This cuts to the heart of insufficiently quantifiable availability/uptime measurements.

Availability isn’t a sufficient measuring stick. Access is. To put it more accurately, availability by itself isn’t a sufficient measurement – what is important is availability of user services. The difference? An array may be completely available in that it is servicing IO requests and all drives are functional. However, it may be simultaneously unavailable, as far as users are concerned, because some esoteric bug is causing it to service those IO requests at say, one tenth the normal speed. It’s up but, from an end user perspective, not available.

True availability is a series of distinct measurements against locally defined requirements, not something that you get just by buying an array (or any other piece of hardware) that a vendor quotes an availability percentage for. It can’t be bought, it can only be architected and implemented.
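To illustrate the difference between “up” and “available to users”, here is a small sketch that scores the same set of monitoring samples both ways. Everything in it is an assumption for illustration: the sample format, the baseline throughput, and the idea that anything below half of normal speed counts as unusable are all locally defined requirements you would set yourself.

```python
def availability(samples, baseline_mb_s, min_fraction=0.5):
    """Return (uptime, usable) as fractions of the samples taken.

    samples: list of (is_up, throughput_mb_s) tuples from periodic checks.
    A sample counts as usable only if the system is up AND delivering at
    least min_fraction of the baseline throughput.
    """
    if not samples:
        raise ValueError("no samples supplied")
    total = len(samples)
    up = sum(1 for is_up, _ in samples if is_up)
    usable = sum(
        1 for is_up, throughput in samples
        if is_up and throughput >= baseline_mb_s * min_fraction
    )
    return up / total, usable / total


# Ten checks: the array answers every time, but four of them at one tenth speed.
samples = [(True, 100.0)] * 6 + [(True, 10.0)] * 4
uptime, usable = availability(samples, baseline_mb_s=100.0)
print(f"uptime: {uptime:.0%}, usable for end users: {usable:.0%}")
```

The vendor’s percentage describes the first number; your users only ever experience the second.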

For a complete outline of my argument on this, check out an article I wrote some time ago: Uptime is an inappropriate metric.

 

Over at Backup Central, Curtis Preston has written a couple of excellent blog posts to do with VSS.

The first, What is Windows VSS and why should you care? is an excellent overview of how the VSS process works within Windows. Even if you’ve been using VSS within your environment, if you’re not quite sure how it works, this is a great piece to read.

The second delves into issues relating to VMware VCB’s (in)ability to perform consistent application backups – i.e., via VSS for say, an Exchange or Microsoft SQL guest. Titled Hyper-V ahead of VMware in the Backup Race, it’s a justifiable kick in the pants to VMware, and a pointed warning regarding VMware/VCB backups of applications.

(These two articles, Curtis mentions, came about from some posts by Scott Waterhouse on The Backup Blog, which talked about vSphere backups.)

 

On the NetWorker Mailing List, I still frequently see a lot of posts from people who are having various problems with their NetWorker 7.2.x servers.

It’s time to move away from 7.2. I know, it was the last version before nsrjobd; the move to nsrjobd in 7.3, then raw daemon logs in 7.4 can both be a bit shocking, but 7.2 is now critically old and critically out of support. Equally, there are still a lot of people out there running 7.3 releases of NetWorker. That, too, exited support some time ago, and it’s time to move on from it too.

I’ll agree that within backup, there is a strong logic to the statement “if it ain’t broke, don’t fix it”, but you have to weigh that up against the simple fact that 7.2.x releases in particular are very old, and 7.3.x releases are fairly aged as well.

Since I’ve been watching more and more of Top Gear, I’ll use a car analogy. Let’s say you’ve got a brand new, top of the line Ferrari. When it needs servicing, do you take it to the official Ferrari shop that provides a 100% warranty on all repairs and whose repairs keep the original vehicle warranty intact, or do you take it to Bill & Joes Motor Fixits ‘R’ Us, who not only might leave you with a car in a worse condition than when you drove it in, but who aren’t certified by Ferrari and thus lose you your new car warranty?

Continuing to back up your environment with a backup product which is long out of support is like outsourcing to Bill & Joes Motor Fixits ‘R’ Us IT Service.

I’ll be the first to admit that even on simple updates you can run into a few hassles. Particularly as you move up the NetWorker version chain you’ll find changes to authentication and name resolution requirements alone that may necessitate some additional work around the time of the update. If your clients are old you’ll also be needing to plan an update for them as soon as possible too, and in some cases, you may find yourself definitely having to update clients if there turns out to be some particularly odd issue.

But I’ll be honest: that little bit of up-front pain is much, much better than hitting a critical backup or recovery problem that can’t be solved without upgrading (or worse, can’t be solved due to incompatibilities between ancient NetWorker versions and modern operating system versions). Planning and implementing a controlled upgrade, even if it does end up having a few hassles, is infinitely better than doing an emergency upgrade without any planning to facilitate a recovery or a backup that has to be done.

© 2012 The NetWorker Blog