There have been a number of discussions on various storage blogs, both previously and again now, on whether a copy (e.g., a tarball, or a snapshot, etc.) is a backup. There have been arguments on both sides of the fence, and I’m going to contribute equally to both sides of those arguments now.

You see, a copy is a backup, and it’s not a backup.

It’s almost like Schrödinger’s Cat – it may be a backup, or it may not be a backup, and you won’t know for sure until you look more closely at it.

In my book, I set out early in the process to define a backup, and define it as follows:

A backup is a copy of any data that can be used to restore the data as/when required to its original form. That is, a backup is a valid copy of data, files, applications or operating systems that can be used for the purposes of recovery.

So it would seem then that I come down fairly heavily in favour of the notion that a copy is a backup. Well, yes – and no.

In the broadest sense of the term, a random copy of data such as a tarball, an rsync, a zip file or a read-only snapshot is indeed a “backup”, as it can be used, in a single instance, for the purposes of recovery. However, so too could a binary print-out/dump of the exact state of every bit on a LUN. Few would argue, though, that data needing such an arduous, manual re-entry process is really recoverable, even though in theory it is.

The reason it’s not really recoverable is that we’re all aware of the timeframes required for recovery – recoveries must be completed in a timeframe that is useful to the business (or the end user) who needs the data back. Without that, we don’t really have a backup at all – just a random copy of the data.

If we look past the broad term “backup” though, and actually evaluate the term backup system, then I would suggest that a single “backup”, unless it’s an instantiation of protection from the backup system, is not a backup at all, but instead is just a random (or pseudo-random) copy.

To me this boils down to the need to work with the notion of Information Lifecycle Protection. As you may recall, in a previous blog entry I suggested that there’s a need to break off data protection activities from ILM and define a new process that revolves around keeping data available in order to be managed by ILM. It may seem a small distinction, but it’s one which helps in these sorts of discussions. At the time I suggested that conceptually, ILP may be represented as follows:

Components of ILP

Under this definition, we can cease to worry about whether a copy is a backup, because clearly, a copy will be part of an overall ILP strategy. It’s still data protection, but it doesn’t have to be backup in order to be data protection.

My personal opinion is that a single, isolated copy is technically a backup, but is logically not a backup. “Technically is” because it can be used to restore data. “Logically not” because it’s not in itself a guarantee of a correctly designed backup system. I.e., unless we can say that the copy came from the backup system, we can’t be guaranteed it’s a backup.

One last quote from my book – this time from the back page:

A well-designed backup system comes about only when several key factors coalesce: business involvement, IT acceptance, best practice designs, enterprise software and reliable hardware.

So the answer, I guess, to “is a copy a backup” is another question – “did the copy come from a backup system?” If the answer to that question is yes, then the answer to the original question is the same. If the answer is no, we can’t reliably answer “yes” to the original question.

 

While it turned out to be unrelated, a recent customer question made me think back to the impact of client side compression on the reported saveset size, and for the life of me I couldn’t remember how client side compression affected saveset size reporting.

Of course, it’s relatively simple to test. So I created a 1GB file on my backup server using:

# dd if=/dev/zero bs=1024k count=1024 of=/root/test.dat

Next, to test, I configured a client entry with a saveset of just ‘/root/test.dat’, and set the backup running without any client side compression. The savegroup completion email showed the sort of size you’d expect:

--- Successful Save Sets ---

* tara.pmdg.lab:Probe savefs tara.pmdg.lab: succeeded.
 tara.pmdg.lab: /root/test.dat     level=full,   1048 MB 00:00:13      3 files
 tara.pmdg.lab: index:tara.pmdg.lab level=full,     3 KB 00:00:00      4 files
 tara.pmdg.lab: bootstrap          level=full,     91 KB 00:00:01    177 files

The next step was to enable client side compression. Being lazy and not wanting to launch NMC, I created /root/.nsr with the following content:

<< . >>
compressasm: test.dat

With the backup re-run, I got conclusive evidence that the saveset size reported is the data written to media (or transferred from the client), not the size of the data itself:

--- Successful Save Sets ---

* tara.pmdg.lab:Probe savefs tara.pmdg.lab: succeeded.
* tara.pmdg.lab:/root/test.dat 66135:save: NSR directive file (/root/.nsr) parsed
* tara.pmdg.lab:/root/test.dat 66135:save: NSR directive file (/root/.nsr) parsed
 tara.pmdg.lab: /root/test.dat     level=full,    124 MB 00:00:07      3 files
 tara.pmdg.lab: index:tara.pmdg.lab level=full,     5 KB 00:00:00      5 files
 tara.pmdg.lab: bootstrap          level=full,    102 KB 00:00:01    186 files

So the next question is – is this a good thing?

The answer is a little fluid. The correct answer I think is that both sizes should be recorded. Clearly for the purposes of backwards compatibility, current sizing values need to continue to report the data written to media. However, logically, there is significant merit in adding another field to the database – e.g., clsize that would report the amount of data the client reads for the backup. This would save a lot of hassle. (The “totalsize” field is not used for this, by the way.)

In the meantime, we just have to keep in mind that the size reported by mminfo, the savegroup completion, etc., is the size written to media – or, if you will, the size transferred from the client to the storage node.
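To put numbers on the test above, here is a rough sketch that compares the two sizes from the savegroup completion reports (1048 MB without compression, 124 MB with it). The function name is my own invention for illustration; it isn’t anything NetWorker provides.

```shell
# Compare the saveset size reported without client side compression to the
# size reported with it, and print the ratio and the saving.
compression_summary() {
    uncompressed_mb=$1   # size reported without client side compression
    compressed_mb=$2     # size reported with compressasm enabled
    awk -v u="$uncompressed_mb" -v c="$compressed_mb" 'BEGIN {
        printf "ratio %.1f:1, %.0f%% less data written to media\n", u / c, (1 - c / u) * 100
    }'
}

compression_summary 1048 124
# prints: ratio 8.5:1, 88% less data written to media
```

Since the reported size is only what was written to media, a calculation like this is only possible when you already know the original size by some other means.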

 

Despite recent claims that LTO-5 is at risk of being a dead format due to Imation being the first vendor to sign on for it, over the last week there have been stories everywhere about SpectraLogic announcing a pre-purchase program for their LTO-5 offerings. SpectraLogic’s programme is intended to allow companies to continue to purchase LTO-4 drives and replace them with LTO-5 when they become available.

Given that SpectraLogic is in the library business rather than the tape drive manufacturing business, the most important part of this announcement is that one of the key drive manufacturers is preparing to commence production. (Since SpectraLogic has apparently had a history of sourcing drives from IBM, there’s a historical reason why IBM drives may be sourced by SpectraLogic.)

I think it’s fair to say that LTO-4 still has some legs left in it – thus, I’m not surprised that the LTO-5 take up is building more slowly than previous generation formats. That shouldn’t be seen as a negative towards the format – just a sign of continuing maturity in the industry.

 

On Linux, filesystems typically have two settings that control when a complete check is forced on boot. These are:

  • Maximum number of mounts before a check
  • Interval between checks

The default settings, while reasonably suitable for smaller partitions, are very unsuitable for large partitions, such as what you find in disk backup units. In fact, if you don’t pay particular attention to these settings, you may find after a routine reboot that your backup server (or storage node) can take hours to become available. For instance, it’s not unheard of to see even sub-20TB DBU environments (as say, 10 x 2TB filesystems) take several hours to complete mandatory checks on filesystems after what should have just been a routine reboot.

There are two approaches that you can take to this:

  • If you want to leave the checks enabled, it’s reasonably imperative to ensure that at most only one disk backup unit filesystem will be checked at one time after a reboot; this will at least reduce the size of any check-on-reboot. Thus, ensure you:
    • Configure each filesystem so that it will have a different number of maximum mounts before check than any other filesystem, and,
    • Configure the interval (days) between checks for each filesystem to be a significantly different number.
  • If you don’t want periodic filesystem checks to ever interfere with the reboot process, you need to:
    • Ensure that following a non-graceful restart of the server the DBU filesystems are unmounted and checked before any new backup or recovery activities are done, and,
    • Ensure that there are processes – planned maintenance windows if you will – for manual running of the filesystem checks that are being skipped.
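As a rough sketch of the staggering approach in the first option, assuming ext2/ext3/ext4 filesystems managed with tune2fs (its -c option sets the maximum mount count and -i the interval between checks). The device names, starting values and step sizes are made up for illustration, and the function only prints the commands rather than running them:

```shell
# Print (do not run) tune2fs commands that give each filesystem a different
# maximum mount count and a different interval between checks.
# Starting values and step sizes are arbitrary choices for this sketch.
stagger_checks() {
    mounts=20     # maximum mounts before check for the first filesystem
    days=90       # days between checks for the first filesystem
    for dev in "$@"; do
        echo "tune2fs -c $mounts -i ${days}d $dev"
        mounts=$((mounts + 3))
        days=$((days + 14))
    done
}

# Hypothetical disk backup unit devices.
stagger_checks /dev/sdb1 /dev/sdc1 /dev/sdd1
# prints:
#   tune2fs -c 20 -i 90d /dev/sdb1
#   tune2fs -c 23 -i 104d /dev/sdc1
#   tune2fs -c 26 -i 118d /dev/sdd1
```

Printing the commands first means they can be reviewed (or put through change control) before anything is actually altered.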

Neither option is particularly “attractive”. In the first case, you can still, if you cherish uptime or don’t need to reboot your backup server often, get into a situation where multiple filesystems need to be checked on reboot if they’ve all exceeded their days-between-checks parameter. In the second instance, you’re having to insert human driven processes into what should normally be a routine operating system function. In particular with the manual option, there must be a process in place to handle the NetWorker shutdown and filesystem checking, even in the middle of the night, if an OS crash occurs.

Actually, the above list is a little limited – there are a couple of other options that you can consider as well – though they’re a little more left of field:

  • Build into the change control process the timings for complete filesystem checks in case they happen, or
  • Build into the change control process or reboot procedure for the backup server/storage nodes the requirement to temporarily disable filesystem checks (using say, tune2fs) so that you know the reboot to be done won’t be costly in terms of time.
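For the temporary disable option, a minimal sketch might look like the following, again assuming ext2/ext3/ext4 and tune2fs, where a value of 0 for -c and -i switches the respective check off. The device name and the values to restore are hypothetical; record your real ones (tune2fs -l shows them) before changing anything. The function only prints the commands.

```shell
# Print (do not run) the commands to record, disable and later restore the
# periodic check settings around a planned reboot.
# usage: check_toggle_commands <device> <max_mounts_to_restore> <days_to_restore>
check_toggle_commands() {
    dev=$1; mounts=$2; days=$3
    echo "tune2fs -l $dev"                      # show current settings first
    echo "tune2fs -c 0 -i 0 $dev"               # before the reboot: disable both checks
    echo "tune2fs -c $mounts -i ${days}d $dev"  # after the reboot: restore them
}

check_toggle_commands /dev/sdb1 30 180
```

The restore step matters as much as the disable step; if it’s forgotten, you’ve silently moved to never checking the filesystems at all.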

Personally, I’m looking forward to btrfs – in reality, a modern filesystem such as that should solve most, if not all, of the problems discussed above.

 

On Wednesday, a large part of NSW was engulfed in a significant dust storm event. Having grown up in the country, and spent some time living in practically desert regions, I’m no stranger to dust storms. However, that doesn’t change the fact that the visual effect of them is amazing.

On Wednesday morning, my partner and I did what any two crazy photographers would do – ran out, hopped in the car and went looking for good photo opportunities. Here’s some of the results from my 50D:

Gosford waterfront during dust storm, 2009-09-03

Terrigal during dust storm, 2009-09-23

Incidentally, I completely forgot in my rush to get photos that my hayfever is caused by … dust. You’d think that never normally experiencing hayfever on the coast, but getting it as soon as I cross over the Blue Mountains, would result in a better recollection of such things, but no, that wasn’t to be the case, and so instead I spent the entire day sneezing after an hour’s efforts in the dust.

 

In environments with satellite offices, a common “backup” technique described in a lot of situations is what I’d generically call a “trickle backup” technique. This uses either some form of asynchronous replication (either block or file), or some other state of file/block deduplication, etc., to achieve very small backups (after the first) back to a central site.

These are inevitably done for one or both of the following two reasons:

  1. Staff at the satellite office do not have the technical skills to manage local media or backup storage nodes/media servers.
  2. The WAN bandwidth is too little (or too costly) to do full-scale backups.

Remembering my definition of recoverable in The 7 Procedural Obligations of Backup Administrators, I’d like to suggest that for most situations, there’s no such thing as a trickle recovery. To reiterate, my definition of recoverable is:

  1. The item that was backed up can be retrieved from the backup media.
  2. The item that is retrieved from the backup media is usable as a replacement to the data that was backed up.
  3. The item can be retrieved within the required window.

Trickle backups, if not considered properly, cease to be valid backups if they violate item 3 above.

Whenever trickle backups are considered, there must be rigorous planning conducted (including discussions with appropriate stakeholders) to determine what recovery methods will be considered valid. Let’s discuss briefly what might need to be considered:

  1. Will individual file level recovery back across the WAN connection be possible?
  2. What will be the maximum amount of data that can be recovered across the WAN connection?
  3. How will larger data recoveries be facilitated?
  4. How will complete system recoveries be facilitated?
  5. Can all recoveries be completed within SLAs?
  6. Are HR and IT policies in place to prevent situations where satellite office recovery requirements may be abandoned or delayed due to staff shortages or local workloads?

Unless all 6 of those questions have answers that are compatible with SLAs and business requirements, there is no valid satellite backup system in place.
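To make questions 2 and 5 a little more concrete, here’s a back-of-the-envelope sketch of how long a recovery would take across a WAN link. The function name, the 80% link efficiency default and the example figures are all assumptions of mine, not measurements:

```shell
# Estimate the hours needed to pull a recovery back across a WAN link.
# usage: recovery_hours <data_gb> <link_mbit_per_sec> [efficiency]
# efficiency is the usable fraction of the link (assumed 0.8 if not given).
recovery_hours() {
    awk -v gb="$1" -v mbps="$2" -v eff="${3:-0.8}" 'BEGIN {
        megabits = gb * 1000 * 8            # decimal GB to megabits
        printf "%.1f\n", megabits / (mbps * eff) / 3600
    }'
}

recovery_hours 200 10
# prints: 55.6
```

In other words, a 200 GB recovery over a 10 Mbit/s link would take more than two days under these assumptions, which is exactly the sort of figure that needs to be compared against the recovery window before the trickle backup design is accepted.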

 

Something I’ve seen a few people complain about – and indeed that I’ve also complained about in the past – is that in high security environments, NetWorker allows end users on one host to see the backups done for other hosts. This is obviously a security concern.

After a brief discussion with EMC, it was also obviously something that is readily changeable with only a couple of clicks of the mouse button – so I feel somewhat sheepish that I hadn’t picked up on it before. All you have to do is take away the “Monitor NetWorker” privilege from the Users usergroup.

Here’s the (to some environments) offending setting:

Monitor users privilege

Once that setting is unchecked, end users won’t be able to view the backups for other hosts – just their own.

 

The Register has some coverage at the moment of Intel demonstrating a (highly customised/optimised) 7 disk SSD configuration which delivered 1 million IOPS on a desktop configuration. As the article says, regardless of the level of tweaking to get there, this is a fabulous example of the world that is to come with SSD.

Clearly this is still a while off from regular commercial use for the “average” business, but regardless, it’s a fascinating development.

There’s good cause, when you look at these sorts of figures, to see why most storage vendors are getting onto the SSD bandwagon and declaring SSDs to be the “zeroth tier” in storage performance.

 

I don’t have many customers with standalone tape drives. Usually when they do, it’s due to one of two reasons:

  • Purchased to support recovery from previous-format media during a format change.
  • Used in remote or satellite offices for local backups.

In the first instance, a company may, say, replace SDLT with LTO, but decide not to stage their long-term backups from SDLT to the replacement media. Instead, they may just purchase a standalone SDLT drive so that future recovery requests can be met (albeit more slowly) through protein based autoloading.

In the second instance, a company may either run multiple NetWorker servers, or a WAN based datazone with storage nodes in satellite offices. In smaller offices, an autochanger may be either undesirable or represent too high a cost, and therefore one or more standalone tape drives may be deployed.

One of the questions that does inevitably come up whenever I do encounter people with standalone drives is “how can I make NetWorker just automatically load and use the tape that’s put in by the <janitor|secretary>?”

There are limits to what you can achieve when your tape operators have either (a) no technical skill or (b) no access to the hosts they are replacing media for, but there’s one thing that you can enable which will make your life slightly easier in these situations – standalone device auto media management.

When we normally think of auto media management, we think of tape libraries. In tape libraries, auto media management refers to one thing alone – having NetWorker automatically label previously unlabeled media when it gets to a point that no labeled media is available.

However, when auto media management is enabled for standalone tape drives, it fulfills two very useful functions. These are:

  • Recyclable volumes loaded into the drive are automatically recycled.
  • Unlabeled volumes loaded into the drive are automatically labeled. (From memory, this is to the Default pool, but in small satellite offices, that often ends up being used.)

These are done whenever the device is idle – i.e., when it’s not being used, NetWorker monitors the device for the above two situations and acts accordingly.

While this doesn’t solve all problems with tape management at satellite offices using standalone drives, it does at least help.

 

When it comes to backup and data protection, I like to think of myself as being somewhat of a stickler for accuracy. After all, without accuracy, you don’t have specificity, and without specificity, you can’t reliably say that you have what you think you have.

So on the basis of wanting vendors to be more accurate, I really do wish vendors would stop talking about archive when they actually mean hierarchical storage management (HSM). It confuses journalists, technologists, managers and storage administrators, and (I must admit to some level of cynicism here) appears to be mainly driven from some thinking that “HSM” sounds either too scary or too complex.

HSM is neither scary nor complex – it’s just a variant of tiered storage, which is something that any site with 3+ TB of presented primary production data should be at least aware of, if not actively implementing and using. (Indeed, one might argue that HSM is the original form of tiered storage.)

By “presented primary production”, I’m referring to available-to-the-OS high speed, high cost storage presented in high performance LUN configurations. At this point, storage costs are high enough that tiered storage solutions start to make sense. (Bear in mind that 3+ TB of presented storage in such configurations may represent between 6 and 10TB of raw high speed, high cost storage. Thus, while it may not sound all that expensive initially, the disk-to-data ratio increases the cost substantially.) It should be noted that whether that tiering is done with a combination of different speeds of disks and levels of RAID, or with disk vs tape, or some combination of the two, is largely irrelevant to the notion of HSM.

Not only is HSM easy to understand and nothing to be afraid of, the difference between HSM and archive is equally easy to understand. It can even be explained with diagrams.

Here’s what archive looks like:

The archive process and subsequent data access

So, when we archive files, we first copy them out to archive media, then delete them from the source. Thus, if we need to access the archived data, we must read it back directly from the archive media. There is no reference left to the archived data on the filesystem, and data access must be managed independently from previous access methods.

On the other hand, here’s what the HSM process looks like:

The HSM process and subsequent data access

So when we use HSM on files, we first copy them out to HSM media, then delete (or truncate) the original file but put in its place a stub file. This stub file has the same file name as the original file, and should a user attempt to access the stub, the HSM system silently and invisibly retrieves the original file from the HSM media, providing it back to the end user. If the user saves the file back to the same source, the stub is replaced with the original+updated data; if the user doesn’t save the file, the stub is left in place.

Or if you’re looking for an even simpler distinction: archive deletes, HSM leaves a stub. If a vendor talks to you about archive, but their product leaves a stub, you can know for sure that they actually mean HSM.
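The distinction can also be shown with a toy model. This is a sketch only: real HSM products recall stubbed data transparently at the filesystem layer, whereas here a zero-length file simply stands in for the stub, and the function and directory names are invented for the example.

```shell
# Toy model: "media" is just another directory, and a stub is an empty file.
archive_file() {    # copy to archive media, then delete the source
    cp "$1" "$2/" && rm "$1"
}

hsm_migrate() {     # copy to HSM media, then truncate the source to a stub
    cp "$1" "$2/" && : > "$1"
}

work=$(mktemp -d)
mkdir "$work/primary" "$work/media"
echo "report" > "$work/primary/a.txt"
echo "report" > "$work/primary/b.txt"

archive_file "$work/primary/a.txt" "$work/media"
hsm_migrate  "$work/primary/b.txt" "$work/media"

ls "$work/primary"    # only b.txt remains, as an empty stub
rm -r "$work"
```

After the run, a.txt has no reference left on the primary filesystem, while b.txt is still there by name but holds no data.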

Honestly, these two concepts aren’t difficult, and they aren’t the same. In the never ending quest to save user bytes, you’d think vendors would appreciate that it’s cheaper to refer to HSM as HSM rather than Archive. Honestly, that’s a 4 byte space saving alone, every time the correct term is used!

[Edit - 2009-09-23]

OK, so it’s been pointed out by Scott Waterhouse that the official SNIA definition for archive doesn’t mention having to delete the source files, so I’ll accept that I was being stubbornly NetWorker-centric on this blog article. So I’ll accept that I’m wrong and (grudgingly yes) be prepared to refer to HSM as archive. But I won’t like it. Is that a fair compromise? :-)

I won’t give up on ILP though!

© 2012 The NetWorker Blog