Late last year while in New Zealand, I attended a {Product} Overview session. Unfortunately, I had to leave after an hour, and during that time only managed to hear a couple of new things about {Product}. Instead, my colleagues and I spent most of the time trying to correct the {Vendor} technical specialist, who was regurgitating some old FUD about NetWorker.

One of the funniest pieces of FUD that I heard from the {Vendor} rep was that “workgroup” (ha!) products like NetWorker will waste time during the recovery process when differential backups have been run. Drawing up a simple table like the following, the rep ran the argument along these lines:

  • Weekend – Full backup (i.e., 100%)
  • Monday – Backup 5% change
  • Tuesday – Backup 10% change
  • Wednesday – Backup 15% change
  • Thursday – Backup 20% change
  • Friday – Backup 25% change

Now, as I point out in my book, while one must allow for the possibility that the files changed on each day of a differential cycle are 100% unique to that day, that’s not always going to be the case. In fact, only in fairly niche areas or situations will this be so. To be more accurate, a differential backup model may look more like:

  • Weekend – Full backup (i.e., 100%)
  • Monday – Backup 5% change.
  • Tuesday – Backup 7% change.
  • Wednesday – Backup 9% change.
  • Thursday – Backup 10% change.
  • Friday – Backup 11% change.

(That is – in most sites where differentials are used, the unique files that change each day will be minimal.)
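
To make the difference between the two tables concrete, here is a minimal Python sketch. It assumes a hypothetical filesystem of 100 equally sized files; the function name differential_sizes and the sample change sets are made up for illustration. The point it shows is that a differential backs up everything changed since the full, so its size is the union of the daily change sets, not their sum.

```python
def differential_sizes(daily_changes, total_files):
    """Return each day's differential size as a percentage of the full.

    daily_changes: list of sets of file ids touched on each day.
    A differential captures every file changed since the full backup,
    so each day's backup is the union of all change sets so far.
    """
    changed_since_full = set()
    sizes = []
    for todays_changes in daily_changes:
        changed_since_full |= todays_changes
        sizes.append(100 * len(changed_since_full) // total_files)
    return sizes


# Worst case: every day touches 5 files nobody has touched before.
disjoint = [set(range(day * 5, day * 5 + 5)) for day in range(5)]

# More typical (assumed) case: mostly the same files change each day.
overlapping = [
    {0, 1, 2, 3, 4},
    {0, 1, 2, 5, 6},
    {0, 1, 5, 7, 8},
    {0, 2, 6, 8, 9},
    {1, 3, 7, 9, 10},
]

print(differential_sizes(disjoint, 100))     # [5, 10, 15, 20, 25]
print(differential_sizes(overlapping, 100))  # [5, 7, 9, 10, 11]
```

The first result reproduces the {Vendor} table, the second the more realistic one.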

Now, regardless of which model happens within an environment, the {Vendor} representative bravely then tried to assert that “with NetWorker, that means a full recovery on Friday would need to pull back 125% of the data!”

That statement is of course about as accurate as “croc shoes are cool”.

There are two types of implied FUD in this statement – and both are incorrect. They are:

  • The FUD that if you back up the same file in both a full and a differential, NetWorker would recover both files, first the one from the full, then the one from the differential, in order to complete the recovery.
  • The FUD that a filesystem recovery from fulls + X might pull back all files that were backed up, rather than a point in time view of the filesystem as of the last backup.

Thankfully, like Elmer, both of these FUDs are relatively easy to put to rest. I’ll do them in reverse order, since disproving the second puts us in an easy position to disprove the first FUD.

Scenario:

  • Schedule called “TestDiff”: full, 5, 5, 5, 5, 5, 5
  • Group called “TestDiff”: Using schedule “TestDiff”
  • Client tara in group “TestDiff” has save set: /root/casestudy
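
This schedule behaves as a differential because of how numbered levels work: a level N backup saves everything changed since the most recent backup at a lower level, and a full counts as lower than any number. The Python sketch below models just that rule; reference_backup and the history list are hypothetical names for illustration, not a NetWorker API.

```python
FULL = 0  # treat a full as level 0, lower than any numbered level


def reference_backup(history, new_level):
    """Return the index of the backup a new level-N backup is taken against.

    history: list of levels of earlier backups, oldest first.
    The reference is the most recent backup with a strictly lower level.
    """
    for index in range(len(history) - 1, -1, -1):
        if history[index] < new_level:
            return index
    return None  # nothing to compare against, so a full is needed


# The "TestDiff" schedule: a full, then level 5 on each remaining day.
history = [FULL]
for _ in range(6):
    print(reference_backup(history, 5))  # always 0, the initial full
    history.append(5)
```

Because every level 5 is taken against the same full, each one contains all changes since that full rather than only the changes since the previous day.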

Initial content of /root/casestudy:

[root@tara ~]# ls -al /root/casestudy
total 30796
drwxr-xr-x  2 root root     4096 Feb  2 03:41 .
drwxr-x--- 22 root root    20480 Feb  2 03:34 ..
-rw-r--r--  1 root root 10485760 Feb  2 03:41 full1.dat
-rw-r--r--  1 root root 10485760 Feb  2 03:41 full2.dat
-rw-r--r--  1 root root 10485760 Feb  2 03:41 full3.dat

So that’s 30MB of data. Our first backup will by necessity be a full, and we’ll follow that with an mminfo so we can see how much data has been backed up:

[root@tara ~]# savegrp -l full TestDiff
Feb  2 03:44:35 tara logger: NetWorker media: (waiting) Waiting for 1 writable volume(s) to backup pool 'TestDiff' tape(s) on tara.pmdg.lab
[root@tara ~]# mminfo -q "name=/root/casestudy"
volume        client       date      size   level  name
800844L4       tara.pmdg.lab 02/02/2011 30 MB full  /root/casestudy

Now that we’ve got that initial backup done, we’ll add a couple more files to the directory, and run two level 5 differential backups:

[root@tara ~]# dd if=/dev/zero bs=1024k count=10 of=/root/casestudy/1stdiff-1.dat
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.02353 seconds, 446 MB/s

[root@tara ~]# dd if=/dev/zero bs=1024k count=10 of=/root/casestudy/1stdiff-2.dat
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.080027 seconds, 131 MB/s

[root@tara ~]# ls -al /root/casestudy
total 51308
drwxr-xr-x  2 root root     4096 Feb  2 03:45 .
drwxr-x--- 22 root root    20480 Feb  2 03:34 ..
-rw-r--r--  1 root root 10485760 Feb  2 03:45 1stdiff-1.dat
-rw-r--r--  1 root root 10485760 Feb  2 03:45 1stdiff-2.dat
-rw-r--r--  1 root root 10485760 Feb  2 03:41 full1.dat
-rw-r--r--  1 root root 10485760 Feb  2 03:41 full2.dat
-rw-r--r--  1 root root 10485760 Feb  2 03:41 full3.dat

[root@tara ~]# savegrp -l5 TestDiff
[root@tara ~]# !mminfo
mminfo -q "name=/root/casestudy"
volume        client       date      size   level  name
800844L4       tara.pmdg.lab 02/02/2011 30 MB full  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy

[root@tara ~]# savegrp -l5 TestDiff
[root@tara ~]# mminfo -q "name=/root/casestudy"
volume        client       date      size   level  name
800844L4       tara.pmdg.lab 02/02/2011 30 MB full  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy

All of that looks completely normal. So, now we’ll put a couple more files in the directory – this time of a different size – and run a differential backup as well as the mminfo command again:

[root@tara ~]# dd if=/dev/zero bs=512k count=10 of=/root/casestudy/3rddiff-1.dat
10+0 records in
10+0 records out
5242880 bytes (5.2 MB) copied, 0.022319 seconds, 235 MB/s
[root@tara ~]# dd if=/dev/zero bs=512k count=10 of=/root/casestudy/3rddiff-2.dat
10+0 records in
10+0 records out
5242880 bytes (5.2 MB) copied, 0.011584 seconds, 453 MB/s

[root@tara ~]# savegrp -l5 TestDiff
[root@tara ~]# !mminfo
mminfo -q "name=/root/casestudy"
volume        client       date      size   level  name
800844L4       tara.pmdg.lab 02/02/2011 30 MB full  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 30 MB    5  /root/casestudy

Right, next step is to delete some files – I’m going to delete the “1stdiff*” files, then run a new backup:

[root@tara ~]# rm /root/casestudy/1stdiff-*
rm: remove regular file `/root/casestudy/1stdiff-1.dat'? y
rm: remove regular file `/root/casestudy/1stdiff-2.dat'? y

[root@tara ~]# ls -l /root/casestudy
total 41032
-rw-r--r-- 1 root root  5242880 Feb  2 03:48 3rddiff-1.dat
-rw-r--r-- 1 root root  5242880 Feb  2 03:48 3rddiff-2.dat
-rw-r--r-- 1 root root 10485760 Feb  2 03:41 full1.dat
-rw-r--r-- 1 root root 10485760 Feb  2 03:41 full2.dat
-rw-r--r-- 1 root root 10485760 Feb  2 03:41 full3.dat

[root@tara ~]# savegrp -l5 TestDiff
[root@tara ~]# !mminfo
mminfo -q "name=/root/casestudy"
volume        client       date      size   level  name
800844L4       tara.pmdg.lab 02/02/2011 30 MB full  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 30 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 10 MB    5  /root/casestudy
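
As a rough cross-check of the saveset sizes reported so far, this Python sketch replays the lab steps. It tracks only file names and sizes in MB and ignores any metadata overhead NetWorker adds; differential_mb is a hypothetical helper, and the model (save every file that still exists and was written after the full) is an assumption about the behaviour seen above, not NetWorker code.

```python
files = {"full1.dat": 10, "full2.dat": 10, "full3.dat": 10}  # name -> MB
changed_since_full = set()  # names written after the full backup


def differential_mb():
    """A differential saves files that still exist and changed since the full."""
    return sum(files[name] for name in changed_since_full if name in files)


sizes = []

# Two 10 MB files appear, followed by two level 5 backups.
for name in ("1stdiff-1.dat", "1stdiff-2.dat"):
    files[name] = 10
    changed_since_full.add(name)
sizes.append(differential_mb())
sizes.append(differential_mb())

# Two 5 MB files appear, followed by another level 5 backup.
for name in ("3rddiff-1.dat", "3rddiff-2.dat"):
    files[name] = 5
    changed_since_full.add(name)
sizes.append(differential_mb())

# The 1stdiff files are deleted, followed by another level 5 backup.
for name in ("1stdiff-1.dat", "1stdiff-2.dat"):
    del files[name]
sizes.append(differential_mb())

print(sizes)  # [20, 20, 30, 10]
```

The four values line up with the four level 5 rows in the mminfo output.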

Right, {Vendor} FUD – are you with us now? We’ll now delete all the files in the directory and do a recovery and see what we pull back. By rights, it should be only those files in the directory as of the time of backup – full1.dat, full2.dat, full3.dat, 3rddiff-1.dat and 3rddiff-2.dat:

[root@tara casestudy]# cd /root/casestudy
[root@tara casestudy]# rm *
rm: remove regular file `3rddiff-1.dat'? y
rm: remove regular file `3rddiff-2.dat'? y
rm: remove regular file `full1.dat'? y
rm: remove regular file `full2.dat'? y
rm: remove regular file `full3.dat'? y
[root@tara casestudy]# recover -s tara
Current working directory is /root/casestudy/
recover> ls
 3rddiff-1.dat   3rddiff-2.dat   full1.dat       full2.dat       full3.dat
recover> add *
5 file(s) marked for recovery
recover> volumes
Volumes needed (all on-line):
        800844L4 at /dev/nst1
recover> recover
Recovering 5 files into their original locations
Volumes needed (all on-line):
        800844L4 at /dev/nst1
Total estimated disk space needed for recover is 41 MB.
Requesting 5 file(s), this may take a while...
Requesting 1 recover session(s) from server.
./full1.dat
./full2.dat
./full3.dat
./3rddiff-1.dat
./3rddiff-2.dat
Received 5 file(s) from NSR server `tara'
Recover completion time: Wed 02 Feb 2011 03:51:43 AM EST
recover> quit
[root@tara casestudy]# ls -l
total 41032
-rw-r--r-- 1 root root  5242880 Feb  2 03:48 3rddiff-1.dat
-rw-r--r-- 1 root root  5242880 Feb  2 03:48 3rddiff-2.dat
-rw-r--r-- 1 root root 10485760 Feb  2 03:41 full1.dat
-rw-r--r-- 1 root root 10485760 Feb  2 03:41 full2.dat
-rw-r--r-- 1 root root 10485760 Feb  2 03:41 full3.dat

So, {Vendor} FUD #2 is toast. If we do multiple differential backups (or for that matter, incrementals!) with file deletes happening between backups, NetWorker just recovers the filesystem as of the last point it was backed up – it doesn’t try to repopulate files that didn’t exist as of the last backup.
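
One way to picture why the deleted files stay deleted: each backup records which files existed at that moment, and a recovery browses that view rather than the union of everything ever saved. The following Python sketch is a conceptual model only, not NetWorker's actual index format; the Backup class and both functions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    level: str
    present: frozenset  # every file that existed when the backup ran
    saved: frozenset    # the files this backup actually wrote to media


FULLS = frozenset({"full1.dat", "full2.dat", "full3.dat"})
FIRST = frozenset({"1stdiff-1.dat", "1stdiff-2.dat"})
THIRD = frozenset({"3rddiff-1.dat", "3rddiff-2.dat"})

backups = [
    Backup("full", FULLS, FULLS),
    Backup("5", FULLS | FIRST, FIRST),
    Backup("5", FULLS | FIRST, FIRST),
    Backup("5", FULLS | FIRST | THIRD, FIRST | THIRD),
    Backup("5", FULLS | THIRD, THIRD),  # taken after 1stdiff-* were deleted
]


def browse_view(backups):
    """Files offered for recovery: those present at the last backup."""
    return sorted(backups[-1].present)


def everything_ever_saved(backups):
    """What the FUD version of a recovery would pull back."""
    return sorted(set().union(*(b.saved for b in backups)))


print(browse_view(backups))            # the five files the recover session listed
print(everything_ever_saved(backups))  # seven files, including the deleted pair
```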

Let’s return now to {Vendor} FUD #1 about differential backups in NetWorker. We’ve got a bunch of files that we’ve run differential backups against, and so far all of those files have been backed up to a single volume – 800844L4. So, what I’m going to do is unmount that tape, mark it as full, then overwrite the file ‘full3.dat’, which will mean it’ll need a new backup:

[root@tara casestudy]# nsrjb -u 800844L4
Info: Operation `Eject' in progress on device `/dev/nst1'
Jukebox operation finished with status: succeeded
[root@tara casestudy]# nsrmm -o full 800844L4
Mark LTO Ultrium-4 tape 800844L4 as full? y
[root@tara casestudy]# dd if=/dev/zero bs=1024k count=50 of=full3.dat
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 0.249461 seconds, 210 MB/s
[root@tara casestudy]# !savegrp
savegrp -l5 TestDiff
Feb  2 03:56:29 tara logger: NetWorker media: (waiting) Waiting for 1 writable volume(s) to backup pool 'TestDiff' tape(s) on tara.pmdg.lab
[root@tara casestudy]# !mminfo
mminfo -q "name=/root/casestudy"
volume        client       date      size   level  name
800843L4       tara.pmdg.lab 02/02/2011 81 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 30 MB full  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 20 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 30 MB    5  /root/casestudy
800844L4       tara.pmdg.lab 02/02/2011 10 MB    5  /root/casestudy

Our new backup is sitting on 800843L4 – a different tape. I’ll now delete and recover full3.dat, and demonstrate that NetWorker doesn’t do anything so stupid as recovering the file twice:

[root@tara casestudy]# rm full3.dat
rm: remove regular file `full3.dat'? y
[root@tara casestudy]# recover
Current working directory is /root/casestudy/
recover> add full3.dat
/root/casestudy
1 file(s) marked for recovery
recover> volumes
Volumes needed (all on-line):
        800843L4 at /dev/nst2
recover> recover
Recovering 1 file into its original location
Volumes needed (all on-line):
        800843L4 at /dev/nst2
Total estimated disk space needed for recover is 51 MB.
Requesting 1 file(s), this may take a while...
Requesting 1 recover session(s) from server.
./full3.dat
Received 1 file(s) from NSR server `tara.pmdg.lab'
Recover completion time: Wed 02 Feb 2011 03:57:57 AM EST
recover> quit

Now, just to prove that I’m not incorrectly trusting NetWorker, here’s the nsr_render_log output for the daemon.raw from the time of recovery – see if you can spot how many tapes we used:

70920 02/02/2011 03:57:42 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:root browsing
70919 02/02/2011 03:57:51 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:root done browsing
70920 02/02/2011 03:57:51 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:root browsing
70911 02/02/2011 03:57:53 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:/root/casestudy (2/02/11) starting read from 800843L4 of 51 MB
70904 02/02/2011 03:57:57 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:/root/casestudy (2/02/11) done reading 51 MB
70919 02/02/2011 03:57:57 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:root done browsing
70920 02/02/2011 03:57:57 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:root browsing
42506 02/02/2011 03:57:57 AM  2 0 0 2426823904 26733 0 tara.pmdg.lab nsrd recover info: User root on tara.pmdg.lab successfully recovered tara.pmdg.lab's files
70919 02/02/2011 03:58:00 AM  0 0 2 2426823904 26733 0 tara.pmdg.lab nsrd tara.pmdg.lab:root done browsing

Wait for it, wait for it … now let’s see, it used 800843L4. That’s one tape. That was our second backup of the file. Hmmm, but it didn’t pull back the first copy of the file, because that was on 800844L4, and the logs tell us it only read from a single tape.

{Vendor} FUD #1 put to rest too.
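
The same result can be sketched as a lookup: for each requested file, take only the newest saveset that contains it, and ask only for the volumes holding those savesets. This Python sketch is an illustrative model with hypothetical names (Saveset, plan_recovery), using the volume labels from the test above. Sizes are file sizes in MB, and the contents of the newer saveset are simplified to the one file that matters here.

```python
from dataclasses import dataclass


@dataclass
class Saveset:
    volume: str
    files: dict  # file name -> size in MB as saved in this backup


# Oldest first. Only the savesets that hold full3.dat matter here.
savesets = [
    Saveset("800844L4", {"full1.dat": 10, "full2.dat": 10, "full3.dat": 10}),
    Saveset("800843L4", {"full3.dat": 50}),
]


def plan_recovery(savesets, wanted):
    """Map each wanted file to the newest saveset holding it.

    Returns (volumes needed, MB to read). Older copies are never read.
    """
    volumes, megabytes = set(), 0
    for name in wanted:
        for saveset in reversed(savesets):  # newest first
            if name in saveset.files:
                volumes.add(saveset.volume)
                megabytes += saveset.files[name]
                break
    return volumes, megabytes


print(plan_recovery(savesets, ["full3.dat"]))  # ({'800843L4'}, 50)
```

The model asks for one volume and 50 MB, which is in line with the single tape and 51 MB read that the log reports.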

The real pity about vendors flinging FUD about other vendors’ products is that it takes away from time that could otherwise be used productively. In this case, I had been looking forward to getting at least an hour of a {Product} technical briefing. Lamentably, that’s not what I got.

Maybe next time.

 

I want to spend a few minutes discussing something that drives me nuts. It’s something I see quite regularly on technical websites that discuss data protection, and it’s about time I make my opinion clear on it.

The latest instance comes from an article at SearchStorage called “How tiering can improve your backup strategies”. Marc Staimer wrote:

In one example, all data is commonly backed up once a day, put on tape, then shipped offsite. This methodology means that the RPO is 24 hours, and the RTO is a few days or longer. This is not a good idea for an organization’s mission-critical data. First, the process in recovering the data takes much too long, bringing all of the correct tapes back from offsite, and then recovering them in order, (which is subject to common human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion. Second, it assumes all data on all tapes are recoverable. In the end, both introduce unacceptable risks to mission-critical data.

Now, I’m not going to dispute the fact that daily backups to tape can give RPOs of 24 hours or more, and can result in RTOs of more than 24 hours. However, I don’t agree that an RPO of 24 hours is always the case, and I certainly don’t agree that an RTO of 24 hours (or more) is a 100% inevitability. Instead, I want to spend some time picking apart the rest of this junk statement.

Let’s first consider:

[T]he process in recovering the data takes much too long, bringing back all of the correct tapes from offsite, and then recovering them in order, (which is subject to human error). This can be incredibly tiresome and annoying if all that is being recovered is a single file caused by an accidental deletion.

This would be true if we were using archaic backup scripts (perhaps in a completely decentralised environment) with no automation. On the other hand, if you’re using decent, enterprise backup software, there is absolutely no reason why this should be the case. Enterprise class backup software will:


  • Identify which media is required for a recovery.
  • Read only from the media required for a recovery.
  • Seek to a position as close as possible to the recovery point, so as to avoid reading redundant data.

If we look at NetWorker for instance, we know it’s no slouch when it comes to seeking to the right spot on media for rapid single-file recovery. Between file records and media record markers, NetWorker can very quickly direct a tape drive to seek to the optimum location to commence recovery.

So my first thought is – if that’s the sort of experience that Marc Staimer has with tape based backup and recovery systems, he’s using the wrong ones, and shouldn’t blame that on tape.

Now let’s cover the second point:

[I]t assumes all data on all tapes are recoverable.

This can only be interpreted to mean one thing: the old “tape is unreliable” mantra. If tape were half as unreliable as every second article on tape makes it out to be, there wouldn’t be a single tape vendor left in the market – they’d have all been sued out of business for deceptive trading and terribly unreliable products.

I’m not claiming that tape is fault free – if I did, I’d have a heck of a lot less cause to do the Ballmer Monkey Dance shouting “Cloning! Cloning! Cloning!” than I do. Tapes aren’t infallible, but I’ve not seen a single published paper citing extreme fault rates of enterprise class media*. On a yearly basis, the number of cases I see at customer sites of tape failure could be counted on a butcher’s right hand**. And you know what? Those instances are almost always at the backup point, not the recovery point.

So where does this leave us? At FUD central.

I’m the first to admit that the role of tape is changing within backup environments – I stated my thoughts on this previously in the article “Direct to Tape is Dead, Long Live Tape”, and I stand by this; so any overall discussion about backup media tiering with a model along the lines of disk->disk->tape or disk->vtl->tape will be the sort of thing I’ll usually heartily agree with.

If someone can point out independent studies showing high tape failure rates for enterprise class tapes – I’d like to know. Until then, let’s talk about valid, non-FUD reasons for pulling tape out of the immediate backup path. These include (but are not limited to):


  • Inability of most environments to stream tape.
  • SLAs requiring faster recovery starts, which in turn necessitate recovery from disk.
  • To allow for more streamlined backup cloning operations.
  • To support target deduplication for nearline backup storage.

Tape “unreliability” is not in that list. Maybe it is in limited environments that are currently using non-enterprise tape.

* On the other hand, the easiest way of storing DAT media after generating your backup is to throw it into the bin. I might trust a DAT with a backup a little more than I’d trust a monkey with a pen to take notes in a court case, but not by much.

** I’m talking an old-style butcher. Before they had to start wearing chain mail gloves.

© 2012 The NetWorker Blog