It struck me recently while working on a report that there are 7 distinct challenges in data protection, and that we can only address those challenges when we’re completely across them.

Most sites with enterprise backup will be aware of a few of these challenges, but as soon as you lose sight of some of them, you’ve lost focus on the goal.

They are:

  1. Budget
  2. Communication
  3. Regulatory Compliance
  4. Age
  5. Volume
  6. Search
  7. Formalisation

Each of these on its own represents a particular obstacle or hurdle that needs to be overcome. I should also stress that these are issues for data protection as a whole; they’re not limited to backup and recovery.

What’s even more important is that, when you look at that list, it’s clear that any issues your site is having are not unique. Every company has to deal with the same challenges, and therefore you don’t have to feel that your solution must be unique. It simply has to fit.

And there’s a world of difference – and cost – between “unique” and “fit”.

Let’s look at each of those challenges individually and explain what I mean.

Budget

Something I mention a bit in my book, and when I run training courses, is that I could take the entire budget, for an entire organisation, spend it solely on data protection activities, and still not come up with a solution that is 100% proof positive against any form of data loss or contingency that may happen. There’s always another contingency or potential problem looming around the corner. Sure, it might end up being something like “asteroid hits the earth” or “pandemic kills 99% of the human population”, but the net fact is: you can’t pre-emptively deal with every single possible scenario that may occur.

So it all becomes a game of “risk vs cost”. What’s the risk of it happening? What’s the cost of preparing for it? What’s the cost of it happening and not being prepared? What’s the risk that there’s nothing you can do about it?

As soon as you can start boiling everything down to “risk vs cost” you can actually prepare your data protection needs appropriately.
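As a rough illustration of that “risk vs cost” weighing, here’s a minimal sketch that compares what preparing for a scenario is expected to save each year against what the preparation costs. The scenario names, probabilities and dollar figures are all invented for the example; they don’t come from any real assessment.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float   # assumed chance of the event in any given year
    loss_if_unprepared: float   # assumed cost of it happening with no preparation
    loss_if_prepared: float     # assumed residual cost even when prepared
    preparation_cost: float     # assumed yearly cost of preparing for it

    def net_benefit(self) -> float:
        """Expected yearly loss avoided by preparing, minus the cost of preparing."""
        avoided = self.annual_probability * (
            self.loss_if_unprepared - self.loss_if_prepared
        )
        return avoided - self.preparation_cost


# Hypothetical figures, purely for illustration.
scenarios = [
    Scenario("Single disk failure", 0.5, 40_000, 1_000, 5_000),
    Scenario("Site loss (fire/flood)", 0.01, 2_000_000, 200_000, 60_000),
]

for s in scenarios:
    if s.net_benefit() > 0:
        verdict = "preparation pays for itself on average"
    else:
        verdict = "preparation costs more than the expected loss"
    print(f"{s.name}: net benefit {s.net_benefit():,.0f} ({verdict})")
```

An expected-value figure like this is only a starting point for the conversation. A rare event that would end the business may still be worth preparing for even when the arithmetic says otherwise.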

Communication

Except in the smallest of businesses, there’ll be different departments. And as soon as you have different departments, you have to factor in communications between those departments. Effectively, at this point, we’re talking about IS – Information Services – rather than IT (Information Technology) getting involved. You need to have clear and effective communication between the various departments within the business and the IT group in order to ensure that everyone understands the data protection requirements. In fact, you need to have that communication for pretty much everything to work. (Otherwise you end up in a situation where people think the muck described by the 37 Signals essay is a realistic portrayal of IT.)

To form effective communication, you need a bridge between a department and IT. That bridge is IS; the IS people may actually be the same people as the IT people, but the fact remains that the communication must be held at the policy level rather than the technical level. It’s not the role of someone in department X to understand how Y is done. It’s the role of IS to take their requirements, take IT options, and present strategy and requirements to the business.

Or if you want to phrase it another way – imagine someone prancing around stage like a monkey with bad flop sweat screaming out “Communicate! Communicate! Communicate!”

It’s that important.

Regulatory Compliance

Like it or not, we’re in an age where there is regulatory compliance attached to a lot of data protection. How long should information be kept for? Does it need to be destroyed at the end of that lifetime, or can it just be kept ‘forever’ if that’s easier?

Someone, somewhere in the company, needs to be aware of the regulatory compliance requirements that affect the company. You might say this is part of communication, but usually there’s somewhat of a gulf between how long departments want to retain data for, and what they’re required to keep data for. As to which one is longer: well, flip a coin. You need to know both.
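One way to keep both numbers in view is to record them side by side for each dataset. The sketch below is only an assumption about how that might look: the field names, the optional “must be destroyed by” limit and the sample figures are all hypothetical, and real regulatory rules are rarely this tidy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetentionPolicy:
    dataset: str
    department_years: int                        # how long the department wants it kept
    regulatory_min_years: int                    # how long it is required to be kept
    regulatory_max_years: Optional[int] = None   # destroy after this, if a limit applies

    def keep_for(self) -> int:
        """Years to retain: the longer of want and requirement, capped by any limit."""
        wanted = max(self.department_years, self.regulatory_min_years)
        if self.regulatory_max_years is not None:
            return min(wanted, self.regulatory_max_years)
        return wanted

    def conflict(self) -> bool:
        """True when the department wants data kept past a mandatory destruction date."""
        return (
            self.regulatory_max_years is not None
            and self.department_years > self.regulatory_max_years
        )


# Hypothetical examples.
payroll = RetentionPolicy("payroll", department_years=3, regulatory_min_years=7)
footage = RetentionPolicy("door camera footage", 5, 0, regulatory_max_years=1)

for policy in (payroll, footage):
    note = " (department wants longer than is allowed)" if policy.conflict() else ""
    print(f"{policy.dataset}: keep for {policy.keep_for()} year(s){note}")
```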

Age

Go to a museum or library. Find an old book in your language, pick it up, open it to a random page, and I bet you’ll still be able to mostly grasp what was written. As an example, I’ve read Leviathan (Thomas Hobbes) several times. It’s not necessarily easy going, but you can do it.

Can you confidently say that a document written by someone in say, WordStar 1.1, hanging around in a tired old directory on a fileserver somewhere within your environment is still readable?

While age presents particular problems to paper-based record keeping, it’s never been easier to preserve and replicate such information. Grab it early enough, and you can photocopy the original, or scan/OCR it. Suddenly you’ve got the information all over again, in relatively pristine format. It might even be from several hundred years ago, if not longer. There are fictional works out there going back 2000+ years that people just casually read, for instance.

But age presents a particular problem to data protection in a digital age: it doesn’t matter squat if you can recover, or keep online, a document going back 5, 10, 15 years, if you can’t actually retrieve the data within it.

So age becomes a significant planning factor. How do you ensure not only that you can retrieve a file or chunk of data from 7 years ago, or 10 years ago, but that it is actually still meaningful to someone?
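A first step in planning for age is simply knowing which old formats are lurking on your fileservers. The sketch below walks a directory tree and counts files that haven’t been modified for a given number of years, grouped by file extension. It’s a starting point under obvious assumptions: modification time stands in for “age”, and an extension is only a hint at the real format.

```python
import os
import time
from collections import Counter
from pathlib import Path
from typing import Optional


def old_files_by_extension(
    root: str, min_age_years: float, now: Optional[float] = None
) -> Counter:
    """Count files under root not modified for min_age_years, per extension."""
    now = time.time() if now is None else now
    cutoff = now - min_age_years * 365.25 * 24 * 3600
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                continue  # unreadable or vanished file: skip it, keep walking
            if mtime <= cutoff:
                # Lower-case the extension so ".DOC" and ".doc" are counted together.
                counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling old_files_by_extension("/some/share", 7) returns a Counter you can sort with most_common() to see which formats most need checking for readability. The path here is just a placeholder.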

Volume

Without a doubt, the amount of data we’re storing each year grows at a fantastic rate. Data is somewhere between air and liquid – it seems to want to expand to fill whatever storage is available, within reason. The explosion in digital media is just further exacerbating this. I’d suggest that we’re moving from the first digital age into the second at the moment; the first digital age was where data was almost naturally structured – databases are a classic example. Now though, the second digital age is all about unstructured data. Educational facilities for instance are increasingly making every lecture done by every academic available – not as a bunch of PowerPoint slides, but the actual presentation, as a video file, and often as a separate audio file, to assist people with disabilities, or distant students.

That data growth is not slowing down. I don’t see it slowing down or plateauing any time soon – and nor does most of the storage industry.

Search

It used to be that finding data stored ‘somewhere’ was akin to finding a needle in a haystack. Now, it’s a case of finding a needle in dozens or hundreds of haystacks.

It doesn’t matter how much data you store online, or retain in backups, archive, etc., if you can’t find it when you need it. It’s the sister problem to the ‘age’ issue – there’s far more than just storage involved here.

Search is big business. We see that with Google every day, but let’s consider a prime example – it used to be that filesystem/OS search tools were primarily around filename search. “Tell me part of the file name, and I’ll have a hunt around for it”, was the old approach. Now, it’s “tell me something that’s in the file, and I’ll have a hunt around for it.” I use it every day. If anything, tools like Apple’s Spotlight, for instance, have devolved my previously anal retentive approach to file storage because I don’t have to rely so much on structure any longer. I can search by content.

That works for text. What’s coming next is searching by content for complex data and media. For instance, you can already search for audio – point your iPhone at a speaker, turn on Shazam, capture 11 seconds or so of a song and voilà, you’ve suddenly found a song based on a snippet. I imagine in 10 years’ time people who have some sense of pitch will be able to hum, sing or whistle a few bars and do the same thing. Image search is a growing area too – you can upload an image to some websites and find copies of it online, even to the point of finding larger, higher resolution copies of it.

Video? Undoubtedly coming.

The first vs second digital age analogy works well here too, I think. Search was able to be relatively simple when data was mostly structured. However, with that move to unstructured data, search becomes vitally important.

Make sure you have a search strategy.

(Finally) Formalisation

Most IT departments have grown from ad-hoc, informal processes within the average company. Start with a few people hired to keep systems running, and eventually as the company grows you’ve suddenly got a team of IT staff in a full time department.

What often doesn’t grow is the formality of the documentation and processes. It’s only natural that people will want to keep these as informal as possible, and I’m not suggesting that they need to be miracles of modern communication, but the simple fact remains: if it’s not written down, it doesn’t get done.

There comes a point in any organisation where you have to be prepared to bite the bullet and admit “we have to take a more formal approach to things”. Implementing change control is a classic example; most big businesses take this for granted, yet most small businesses will start out with almost no change control process at all. Eventually, though, the business will hit a critical size and it becomes vitally important to actually have a real change control process.

That same jump from informal to formal is required on every level. You need formal documentation about how the network hangs together, you need formal documentation about creating new user accounts, etc. And you definitely need formal documentation about how data protection is handled within the company.

Summarising

Coming back to the original list, I can reiterate that the challenges faced in data protection are:

  1. Budget
  2. Communication
  3. Regulatory Compliance
  4. Age
  5. Volume
  6. Search
  7. Formalisation

None of those, individually, should be any surprise to anyone. Again, they’re not unique to anyone either. We all have these same issues, regardless of whether we’re a customer, an integrator, a vendor, a whatever.

As soon as you acknowledge the challenges though, you can plan to overcome them.

 

You want my #1 prediction for this coming decade in storage? It’s not going to be dedupe, it’s not going to be a fundamental storage shift to SSDs or a (yeah right) death of tape. It certainly won’t be the winner between iSCSI, FCoE, FC, SCSI, SATA and every other connection technology you can come up with.

It’s going to be search: deep, fast, realtime search.

We talk about how storage continues to grow, but we often don’t talk about the real implications of that statement. Oh, terms such as “cost per TB” and “ease of management” are bandied about, as are (increasingly), “carbon footprint” and “deduplication”. People throw “Cloud” about as if it’s going to be the magical solution to unparalleled storage growth, but that’s still not thinking of the real implications (even if I were to agree, which I don’t).

None of this has even the slightest iota to do with the information that we’re using the storage for. That’s right, we’re not just buying the storage and plonking it down and it’s magically mystically growing. It’s not the storage that’s growing, it’s the information.

This next decade, I predict (and I bloody hope!) is going to be increasingly all about search. After all, what good is being able to store stuff if you can’t find it later?

Search is certainly growing in focus, as evidenced by storage companies periodically gobbling up indexing/eDiscovery companies. I’ve also mentioned periodically on this blog my interest in data visualisation – presenting complex data at a high level in an easy to understand way that then facilitates data mining. Hell, even one of my first postings on this blog was about search within NetWorker.

All the storage in the world doesn’t do squat for you if you don’t have information on it, and all the information in the world doesn’t do squat for you if you can’t find what you’ve previously stored.

It doesn’t matter whether content is in a database, or in an email, or in a file, or (the next great frontier) in a picture, video or soundclip: if you can’t find that content once you’ve stored it, you may as well have deleted it.

As a backup consultant, I’m well aware of the impact of increased storage: increased backup times, dense filesystem issues and longer recoveries are just a few things. But one of the more interesting impacts – the one that speaks more than anything else about the need for search-focus – is the frequency with which backup/system administrators are asked to recover data because the user can’t find where they put it. I.e., if your search system sucks, your users will use your backup system for search.

I’ll suggest something that should be blindingly obvious: dedicated search appliances and portals are insufficient. There should be no difference to the end user between searching for a file locally on his/her desktop and searching for files and content on a dozen fileservers and other hosts within an organisation. Having to go to a dedicated portal to conduct the search is a failure.

The future of search is simple: it must be integrated with the primary user interface, the desktop. It should be as simple as clicking a checkbox called “also search network”, or something along those lines, when filling out a search query.

This leads to the next issue – centralised search. That’s not counter to what I just said. When Google Desktop Search was released a few years ago, lots of people raved about it, until it started being run in corporate environments. Pretty quickly, network and system administrators in many companies demanded that it be removed entirely from the organisation. Why? Fileservers that hundreds or thousands of people might access were being repeatedly brought to their e-knees by dozens or hundreds of people with Google Desktop Search doing indexing and scanning of content.

Individual search database/engine building is not the way of the future – well, other than for home users.

In a corporate or shared storage environment, what’s key to search is a centralised index building system that represents only one accessing ‘user’ footprint reviewing and indexing data, with its database being accessible from within the primary user interface. We’ve been starting to see the edges of this in the last year or two, but it’s still very early days, and hardly homogeneous.

What occurs to me, when I see all the different indexing companies being snapped up, and every second storage and archiving system having its own specialist search utility/system, is that there needs to be consideration for a standard approach to building index databases that can then be accessed by any tool. I.e., search urgently needs an open, ratified format for indexes/catalogues which can be subsequently accessed or probed by an OS-triggered search request.

Once we get a standard for storing this indexing information, we have our best chance of achieving the holy grail of search – realtime non-impactive search. At this point, with a standard for the meta-data required for search available and understood by storage vendors, application vendors and operating system vendors, the real magic can happen. Every time a file gets written, the application writing the file can submit the meta-data to the search database, have that updated, and voilà, you have realtime search. The user should not have to manually update this content; it should seamlessly become part of the File->Save operation, for want of a better simplification.

That’s my prediction for this decade that’s being called the teens. And it’s highly appropriate to make that prediction for the teens, because it’s like saying that I believe this will be the decade when storage grows up.
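To make the idea a little more concrete, here’s a toy sketch of a central index that gets updated as part of the save operation rather than by crawling. Everything in it is hypothetical: the CentralIndex class, the save_file helper and the dictionary standing in for a fileserver are invented for illustration, and no such standard format or API exists yet.

```python
from collections import defaultdict


class CentralIndex:
    """A toy, in-memory stand-in for a shared search database."""

    def __init__(self):
        self._terms = defaultdict(set)   # term -> paths containing it
        self._by_path = {}               # path -> terms currently indexed for it

    def submit(self, path, text):
        """Called by the writing application at save time."""
        new_terms = set(text.lower().split())
        # Drop terms that were in the previous version but are no longer present.
        for term in self._by_path.get(path, set()) - new_terms:
            self._terms[term].discard(path)
        for term in new_terms:
            self._terms[term].add(path)
        self._by_path[path] = new_terms

    def search(self, term):
        """Return the paths whose latest saved content contains the term."""
        return sorted(self._terms.get(term.lower(), set()))


def save_file(index, storage, path, text):
    """Hypothetical File->Save: write the data, then tell the index about it."""
    storage[path] = text          # 'storage' is a dict standing in for a fileserver
    index.submit(path, text)


index = CentralIndex()
fileserver = {}
save_file(index, fileserver, "/shares/finance/q3.txt", "Quarterly budget forecast")
save_file(index, fileserver, "/shares/it/backup.txt", "Backup budget review")
print(index.search("budget"))    # both documents are findable straight after saving
```

The point of the design is that the index only ever hears from the writer, once per save, so there’s no crawler hammering the fileserver on behalf of every desktop.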

 

There was recently a discussion on the NetWorker mailing list regarding a situation whereby a company was skipping certain media files (e.g., *.mp3) from a fileserver, but still wanted to know when those files were present. In this case, the backup administrator didn’t have administrative rights to the fileserver, so doing a straight search of the fileserver wasn’t really an option.

One proposed solution was to change the “skip” to “null”, which does cause some index information to be stored that can be searched with nsrinfo. How useful that “some” is, though, is open to debate. The reason is that if you “null” a file in a backup, it will only be reported via an nsrinfo -v command, and it won’t be reported with the full path to the file. That means it’s necessary to walk the nsrinfo -v output, noting each change of path, and construct the null’d file paths from there.

As an exercise for the list, I tried this out – here’s what I reported at the time:

Create the test area using the following commands:

# mkdir /testing
# cp bigfile /testing
# cd /testing
# dd if=/dev/zero bs=1024k count=1024 of=test.dat

Following this, configure a /testing/.nsr directive with the following content:

<< . >>
null: test.dat

Now a backup can be run of the “/testing” directory; because of the directive, “test.dat” will be excluded.

Finding test.dat in nsrinfo output however is a little more tricky:

[root@nox testing]# nsrinfo -t `mminfo -q "name=/testing" -r nsavetime` nox
scanning client `nox' for savetime 1254265148(Wed 30 Sep 2009 08:59:08 AM EST) from the backup namespace
/testing/.nsr
/testing/bigfile
/testing/
/
4 objects found

There’s no test.dat listed there. Resorting to ‘-v’ on nsrinfo, we do get the information, but as you can see, it’s more challenging to isolate the full path to the file:

[root@nox testing]# nsrinfo -t `mminfo -q "name=/testing" -r nsavetime` -v nox
scanning client `nox' for savetime 1254265148(Wed 30 Sep 2009 08:59:08 AM EST) from the backup namespace
UNIX ASDF v2 file `/testing/.nsr', NSR size=188, fid = 2304.586641, file size=23
UNIX ASDF v2 file `/testing/bigfile', NSR size=196195024, fid = 2304.97665, file size=196176812
UNIX ASDF v2 file `/testing/', NSR size=244, fid = 2304.97664, file size=4096
 ndirentry->586641    .nsr
 ndirentry->97669    test.dat          <----- Here it is ------>
 ndirentry->2    ..
 ndirentry->97665    bigfile
UNIX ASDF v2 file `/', NSR size=772, fid = 2304.2, file size=4096
 ndirentry->3677476    .vmware/
 ndirentry->846145    mnt/
 ndirentry->97633    .autorelabel
 ndirentry->11    lost+found/
 ndirentry->2343169    lib64/
 ndirentry->2701153    media/
 ndirentry->1985185    opt/
 ndirentry->2668609    etc/
 ndirentry->1431937    sbin/
 ndirentry->4133089    srv/
 ndirentry->618337    boot/
 ndirentry->97638    .bash_history
 ndirentry->1366849    bin/
 ndirentry->2766241    selinux/
 ndirentry->3417121    tmp/
 ndirentry->97644    .autofsck
 ndirentry->585793    root/
 ndirentry->1692289    lib/
 ndirentry->97647    nsr
 ndirentry->0    sys/
 ndirentry->97648    home
 ndirentry->1887553    usr/
 ndirentry->2147905    var/
 ndirentry->2245537    d/
 ndirentry->0    dev/
 ndirentry->0    net/
 ndirentry->0    misc/
 ndirentry->0    proc/
 ndirentry->2    ..
 ndirentry->97664    testing/
4 objects found

So while this solution works, I’m not convinced it’s ideal in all instances.
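If you do go down this path, the walk through the nsrinfo -v output can at least be scripted. The sketch below is a rough attempt, assuming the output always looks like the transcript above: a “file `path'” record for each saved object, with “ndirentry->” lines listed under each directory record. It reports directory entries inside the saveset that have no record of their own. That format is inferred purely from my test output rather than from any documentation, and an entry could lack a record for reasons other than a null directive, so treat the results as candidates only.

```python
import re
from typing import List

# Patterns inferred from the sample transcript; not an official format description.
FILE_RE = re.compile(r"file `(.+)', NSR size=")
ENTRY_RE = re.compile(r"^\s*ndirentry->\d+\s+(.+?)\s*$")


def entries_without_records(nsrinfo_verbose: str, saveset: str) -> List[str]:
    """Return full paths listed in a directory but lacking their own file record."""
    prefix = saveset.rstrip("/") + "/"
    saved = set()          # every path that has its own "file `...'" record
    listed = []            # every directory entry seen inside the saveset
    current_dir = None
    for line in nsrinfo_verbose.splitlines():
        match = FILE_RE.search(line)
        if match:
            path = match.group(1)
            saved.add(path)
            # Only directory records inside the saveset are walked; parent
            # directories such as "/" list entries that were never in scope.
            inside = path.endswith("/") and path.startswith(prefix)
            current_dir = path if inside else None
            continue
        match = ENTRY_RE.match(line)
        if match and current_dir is not None and match.group(1) not in (".", ".."):
            listed.append(current_dir + match.group(1))
    return [path for path in listed if path not in saved]


# Trimmed copy of the transcript above, including a wrapped line.
SAMPLE = """\
UNIX ASDF v2 file `/testing/.nsr', NSR size=188, fid = 2304.586641, file size=23
UNIX ASDF v2 file `/testing/bigfile', NSR size=196195024, fid = 2304.97665, file
 size=196176812
UNIX ASDF v2 file `/testing/', NSR size=244, fid = 2304.97664, file size=4096
 ndirentry->586641    .nsr
 ndirentry->97669    test.dat
 ndirentry->2    ..
 ndirentry->97665    bigfile
UNIX ASDF v2 file `/', NSR size=772, fid = 2304.2, file size=4096
 ndirentry->846145    mnt/
 ndirentry->97664    testing/
4 objects found
"""

print(entries_without_records(SAMPLE, "/testing"))
```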

The other solution I came up with works, I think, a little better and more reliably, and has the advantage of doing a live search on the filesystem even if the backup administrator doesn’t have administrator privileges.

So, the other solution is to make use of the RUSER/RCMD functionality in NetWorker to “have a chat” to the client daemons and get them to do something useful. Note that this is reasonably secure – you can only ask to run commands starting with “nsr” or “save”, and those commands must reside in the same directory as the save binary. In this case, we want to invoke save. All you have to do is turn off whatever directives are in place for the client, then set up a no-save command execution for the client.

In the example below, we’re going to get the client “asgard” to do a no-save backup of its /root folder, reporting back to the server “nox” but without actually transferring any data:

[root@nox testing]# RCMD="save -s nox -n /root"
[root@nox testing]# RUSER=root
[root@nox testing]# export RCMD RUSER
[root@nox testing]# nsrexec -c asgard
Warning: Could not determine job id: Connection timed out. Continuing ...
/root/.elinks/globhist
/root/.elinks/cookies
/root/.elinks/gotohist
/root/.elinks/
/root/.bash_profile
/root/.tcshrc
/root/aralathan.pub
/root/.my.cnf
/root/anaconda-ks.cfg
/root/nmsql521_win_x86.zip
/root/nw_linux_x86.tar.gz
<snip>
32477:(pid 29247):
save: /root 155 records 32 KB header 855 MB data

save: /root 855 MB estimated

This style of command will work equally for Windows and Unix systems – indeed, I’ve done similar things on both Windows and Unix.

Obviously, once you’re done gathering the file list, it’s important to then re-enable any directives turned off for the test/file walk.
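Once the file list has been captured (redirected to a file, say), picking out the media files is a simple filtering job. Here’s a minimal sketch, assuming Unix-style paths, one per line, as in the output above. The function name is my own, and the track01.mp3 line in the sample is a made-up entry added so there’s something to find.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def matching_paths(save_output: str, patterns: Iterable[str]) -> List[str]:
    """Return paths from no-save output whose file name matches any pattern."""
    wanted = [pattern.lower() for pattern in patterns]
    hits = []
    for line in save_output.splitlines():
        path = line.strip()
        if not path.startswith("/"):
            continue  # skips warnings, <snip> markers and the save summary lines
        name = path.rsplit("/", 1)[-1].lower()
        if any(fnmatchcase(name, pattern) for pattern in wanted):
            hits.append(path)
    return hits


# Lines taken from the transcript above, plus one hypothetical mp3 entry.
SAVE_OUTPUT = """\
Warning: Could not determine job id: Connection timed out. Continuing ...
/root/.elinks/globhist
/root/.elinks/
/root/music/track01.mp3
/root/nmsql521_win_x86.zip
/root/nw_linux_x86.tar.gz
save: /root 155 records 32 KB header 855 MB data
"""

print(matching_paths(SAVE_OUTPUT, ["*.mp3"]))
```

For Windows clients the paths won’t start with “/”, so the line filter would need adjusting to suit.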

© 2012 The NetWorker Blog