If you’re not familiar with the term jumping the shark, you might want to read up on the history of it over at Wikipedia. The basic premise, though, comes directly from the decline and fall of Happy Days, and is summed up as:

Jumping the shark is a widely used idiom, first employed to describe a moment in the evolution of a television show, characterized by absurdity, when a particular show abandons its core premises and begins a decline in quality that is beyond recovery.

Now, I’ve worked for companies that have had partnerships with EMC for the last 11 years, and you can take what I’m about to say with whatever grain of salt or assumption of bias you like, but I’d like to think that I’m actually speaking more from the Australian perspective of not tooting one’s own horn in a way that goes completely overboard.

It started with a tweet by John Martin, aka @life_no_borders:

When a customer says “NetApp is contributing more auditable financial benefit than every other technology vendor combined” it makes me proud

If you’re not familiar with him, John is a senior NetApp employee in Australia. (Obviously John, as an Australian, had a different perspective here to me.)

Once this was tweeted by John, it was picked up and retweeted by a couple of other NetApp twitterers, and I want to make myself 100% clear here:

I am not, in any way, questioning what John has said. A customer may very well have said that. That’s not my point.

My point is that the statement jars. It’s odd. It’s jarring – there are some compliments that … let’s just say … come out wrong. Like the time someone told one of the directors at my previous company, “You should bottle Preston’s blood.” Sure, there’s a compliment, but sometimes – particularly in the IT world – an unbelievably good compliment just comes across as jumping the shark.

I wasn’t the only person who thought this. Matt Davis aka @da5is, an extremely good techo in the storage industry, tweeted:

“one vendor saved me more money than all others” is either fanboism or incompetence.

Now, to be accused of fanboism in say, the competition between Apple, Microsoft and Google is a daily event for some people. I know people call me an Appletard and an Apple Fanboi regularly, and I’ve argued that this isn’t the case. And despite their opinions of me on that front, I value their insight.

However, raising it against a person or a company on the storage front isn’t something you regularly hear, yet it comes close to explaining that jarring sensation when reading John’s original quote.

John replied to Matt at this point:

@da5is Neither fanboi or incompetence, just brilliant execution from a NetApp reseller and services team delivering on their promises.

Phil Jaenke aka @rootwyrm also weighed in, helping to highlight the silliness of the entire thing:

@life_no_borders @da5is No, it really isn’t. No vendor would sit down and take it, and I know two that would cut off NetApp’s legs to win.

So what’s the story here? Did NetApp earn such a quote from a customer? Very likely – and very likely that every other vendor out there and any decent systems integrator worth their salt out there has also received a similar compliment at some point or another.

Accurate customer quote or not, I just don’t see an average manager looking at the quote and saying “OK, you’ve sold me”. If I were a betting man, I’d probably look at the average Australian manager and lay $10 that they’d think something along the lines of: “NetApp, did you just jump the shark?”

NetApp Jumps the Shark

What do you think? Did they jump the shark?

 

I periodically see indignant tweets and comments from people claiming that if you sell something to a client, then you’re at worst being unethical, or at best being idiotic, to say that you like to consider customer relations as partnerships.

This has reached the point where I’ll no longer sit back and listen to cynics who think that as soon as you start selling you either cease being human, or cease being able to think symbiotically.

Insisting that companies cannot, and should not, refer to clients as partners is at worst toxic and at best demeaning to all parties.

Now, I’m not going to say that there aren’t instances where some companies jump on the bandwagon and like to insinuate a partnership but stick to a traditional “stick whatever badge you need on that widget to sell it” sales approach. Of course that is going to happen.

But to tar all companies that sell, or all integrators, with that brush? Pah! Think again.

I’ve worked in some form of consulting pretty much all my career. I started as a trainee consultant, and when that programme was dying I transferred across to a Unix system administration team. Even as an “end customer” I still had my own customers, and as the company I was working for started taking on outsourcing contracts, I started being a consultant again. That was followed by a brief stint in the less than compatible world of finance, and since then I’ve remained in consulting.

Consult! Consult! Consult!

Consulting, systems integration, however you want to think about it, does not work well when customers are treated as meat – as paying clients to service the next bill. That leads to a succession of one-off engagements and implementations. Rape a company of budget, move on to the next and pillage that, too. It’s not a sustainable model. Or rather, unless you’re a global company and trade on some pre-established name, that model doesn’t get you very far. Pretty soon you get a crap name in the market and you start driving yourself out of business. You’ll blame the technology you’re using, and switch to another product, or another vendor, exhaust a new set of customers, and move on again.

There’s only one sustainable model in consulting and systems integration, and that’s the model where you engage with clients in a partnership. I’m not talking about looking for joint ventures; I’m talking about basic recognition of fundamental business cooperation, viz:

  • I want to help you succeed at what you do;
  • If you succeed at what you do, you’ll be able to help me succeed by buying things from me.

Symbiotic? Or parasitic? A cynic would say parasitic, and they’d be wrong. Or they’d come from the “everything should be free except for what I do” school of business. You know – the people who think that the only company entitled to put markup on a widget, or make a profit, is themselves.

It’s actually a symbiotic relationship, because it recognises that a relationship can actually be of mutual benefit to both parties. It doesn’t have to be about one “winning” and one “losing”, or “one making money” and “one spending money”.

The absolute basis of my belief in this is covered in my “13 traits of a great consultant” post. In particular, point 11 sums up exactly why a customer/client relationship should become a partnership:

Solve the problem, don’t answer the question – From an IT perspective, I use this example: an engineer, if asked a question by a customer, will do his or her utmost to answer the question as exactingly as possible. A consultant will look past the direct question and aim to solve the problem that led the customer to ask the question. Or in other words: if it doesn’t have a yes/no answer, no question is asked in isolation.

If you just have a customer/client relationship, then all you get is an engineering relationship. “Yes, we can sell you widget X. What, you thought widget X did Y? But you didn’t ask? Thank you for shopping, no refunds!” Do you really want that sort of relationship? Going down that path, you get a plethora of situations where technology is blamed for non-technical issues – and indeed, it happens on both the client and the sales side.

Form a symbiotic partnership though, and the relationship is far more wholesome and useful. From the sales side of it, satisfied customers whom you consistently deliver expected results to are repeat customers; repeat customers form the basis of predictable sales and earnings, and as time goes on provide valuable feedback to your growth as a company, too. From the client side, you get solutions that are tailored to your needs by people who you know and trust – and you know and trust them because they’re very much aware of your business requirements, constraints and operational models. A partner in fact will be able to help you through the rougher times – regardless of whether that’s unexpected staff changes without handover, or simply when needing a leaner approach that sacrifices scope only, rather than quality and scope. A partner will have the experience of working within your organisation and be able to deliver faster, more efficiently, and with less impact to your operational processes.

So, the next time someone suggests to you that you can’t have a partnership in a sales/client model, or that consultants/system integrators can’t form symbiotic relationships with your business, consider this one question:

Do you want a supplier you can trust, or a box dropper?

Rarely, if ever, will the answer be the latter.

 

One of the stories I sometimes hear from companies is that some technology X doesn’t work in their environment because X sucks, or X is broken, or X … well, you get the picture.

Years ago, when I first got into backup, the main reasons I had to do recovery were system or hardware failures. Hard drive reliability was IMHO much lower, operating systems were frequently less stable, etc. Reliability was about getting to 99% availability, let alone 99.9% or anything grandiose like that.

These days, hardware/OS/app failure is, I’d suggest, one of the least likely reasons for a recovery being conducted in most organisations. Instead, it’s mainly related to soft issues – user error, audits, compliance checking, etc.

There’s a point here, and I’m almost ready to make it.

Back when I first started with backup, I’d have agreed that technology could be firmly blamed for a lot of errors. These days? Rarely – even when I blame it.

I periodically go on a rant about just how painful Linux is sometimes, but at the core I also admit that it’s a lack of training and time on my part – I’ve not made learning the ins and outs of Linux firewalls a field of study in the past, so now that I’m having to construct them by hand for a personal project it’s about as fun as tasering myself in the genitals. Technology is partly the problem – as is always the case with Linux, it’s designed for programmers and developers to manipulate, not for end users, or people like me who have concentrated on other things and just want the damn thing to work.

Ahem, where was I?

The simple fact is that we often blame technology because it’s easy. It’s like kids picking on the “easy target” at school with bullying; we bully technology and blame it for all our woes and issues because well, it doesn’t really fight back. (Hopefully we’ll get out of this habit before the singularity…)

As techos though, let’s be honest. The technology is rarely the issue. Or to be more accurate, if there’s an issue, technology is the tip of the iceberg – the visible tip. And using the iceberg analogy, you know I mean that technology is rarely going to be the majority of the issue.

The ‘issue’ iceberg in IT looks like this:

The issue iceberg

It’s probably best here that I stop and differentiate between issues and problems. A problem to me, is an isolated or an atomic failure – like, a faulty tape drive, or a failed hard drive. They’re clearly technology related, but they’re not really issues. An issue is a deeper, systemic and compound failure. E.g., something like “on any one day, 30% of my backups fail”, or “Performance across all systems is generally 50% worse at end of month”, etc.

When technology gets blamed in those instances, I’m reminded of someone who say, never has their car serviced, then when it eventually breaks down complains that the car was a lemon. Was it that the car failed the person, or more accurately that the person failed the car?

As I said, it’s easy to blame the thing that can’t defend itself.

In environments with ongoing, long-term issues, there reaches a point where you have to sit back and ponder – is the technology causing the issue, or is the environment causing the technology to have an issue?

The inevitable and hard truth is that in some cases, it’s the latter, not the former.

Let’s consider a basic scenario – the “on any given day 30% of our backups fail” scenario. So, does that mean that on any given day 30% of servers crash and reboot during the backup? Or does the backup software agent crash on 30% of servers when a backup is attempted? Maybe, in the most exceptional of circumstances, this may be the case.

In reality though? In reality we have to start looking at the rest of that iceberg:

Rest of the iceberg

High systemic failure rates, if attributed to the deployed technology, should result in a lawsuit. How often do we see that happening?

>cue the cicadas<

That’s right.

When there are systemic failure rates, a business must, eventually, turn to face the truth that they have to review their:

  • Policies – Are there any governing rules to the company which are contributing to the problem? For instance, does the company require the technology to be adapted in such a way that it wasn’t designed for? This can be hard and real policies, or they can be implicitly allowed policies – such as empire building.
  • Processes – Are there operating methods which are triggering the issue? Imagine a business for instance where change control has become such a consuming process that backup failures are repeatedly allowed to occur because a change window isn’t available. Is that the fault of the backup technology?
  • People and Education – I’m not suggesting that staff at sites are incompetent. Far from it. Incompetent is such a harsh, unpleasant word that in the 15+ years I’ve been consulting, it’s been a very rarely used word. Education though is a factor. No, I’m not picking on people without tertiary skills, but training is a factor. For example, managers who have no day to day technical experience may decide that some technology, based on a half hour vendor pitch, is easy enough that staff won’t need training in it. If said staff then go on to, say, accidentally delete a LUN from a production server because they weren’t trained, how is that the fault of the SAN?

Navel gazing, introspection, call it what you will, it’s not always a pleasant task. It’s about objectively looking at how we’re doing things, and asking, “are we partly to blame?”

Yet, if you aren’t prepared to do this, you’re doomed (yes, doomed) to keep making the same mistake again, and again, and again. The pile of failed technology builds up, the quest for the silver bullet becomes more frenetic, and the chances of a major failure happening increase. In the worst scenarios, it can become decidedly toxic.

But it doesn’t need to be. Evaluating your processes, your policies and your people (particularly the training of your people) can be – well, cathartic. And the benefits to the business, in terms of literal cost savings and efficiencies, ensure that the introspection is well worth it.

As a consultant, you might assume that it’s my job to ensure that customers buy the best and the most expensive technology out there that I can sell them. That’s a cynical attitude that comes from a few shoddy operators. As a consultant, my job is to partner with you and your company and help you achieve your best. (If you think I’m just blowing smoke up your proverbial, check my “13 traits of a great consultant” article.)

Sometimes that means highlighting that there are issues, not problems, and those issues require a deeper fix than plugging in a new piece of technology.

 

I love my Drobo. I have a 4-drive FW800/USB-2 unit, purchased mid last year. It has pride of place sitting on top of my Mac Pro. But I do have a few questions for Drobo regarding its capabilities, performance and compatibility:

  1. Why are your logs in binary format? It’s my unit – why do I have to submit a support case just to get the logs read?
  2. Why don’t you use SMART status? OK, this is more just a techo curiosity question, but still…
  3. Why does performance fall through the floor for certain applications? For instance, I can’t run my Aperture library against the Drobo. This isn’t just the difference between SATA and FW800 performance, it’s the difference between an import of 5 x 30MB photos happening near instantaneously, or taking >3 minutes.
  4. When will my Drobo get an update to support 3TB drives? Or if it’s not going to get such an update, will you at least tell us?
  5. Have you, as I requested on a couple of support cases, run any compatibility tests with WD20EARS green drives? I went through 4 drives in less than 6 months, and while 2 of them had physical faults, the other two, since being pulled from the Drobo, have worked flawlessly in other systems. If there’s an issue with their wake-up speed or spin-speed, surely that didn’t just affect me.

 

 

[Update]

False positives – that’s the update following the investigation by Samsung and others. I’m personally very glad to hear that – it effectively seems to have come down to malware detection software making the wrong call, and then a supervisor at Samsung making a really wrong call. Anyway, we can all relax now. Privacy – yep, Samsung believes in that, clearly. (Thanks to adrift_in_space for pointing out the correction on the Samsung article over at Network World.)

[Original article below]

It’s articles like this that make me glad that the only Samsung product I have in my house is a TV – and it’s not connected to the internet. It’s downright spooky and creepy.

Samsung responds to installation of keylogger on its laptop computers.

Honestly, you have to read it to believe it. Apparently a Samsung technical support supervisor admits that Samsung knowingly (i.e., deliberately) installed keyboard logging software on their laptops before they were sold so as to:

“monitor the performance of the machine and to find out how it is being used.”

If this is true, I see a big and nasty class action lawsuit looming for Samsung. And they’d totally deserve it.

 

It struck me recently while working on a report that there are 7 distinct challenges in data protection, and that we can only address those challenges when we’re completely across them.

Most sites with enterprise backup will be aware of a few of these challenges, but as soon as you lose sight of some of them, you’ve lost focus on the goal.

They are:

  1. Budget
  2. Communication
  3. Regulatory Compliance
  4. Age
  5. Volume
  6. Search
  7. Formalisation

Each of these on their own represents a particular obstacle or hurdle that needs to be overcome. I should also stress – these are issues for data protection as a whole, and that’s not necessarily limited just to backup and recovery.

What’s even more important, is when you look at that list, it’s clear that any issues your site is having are not unique. Every company has to deal with the same challenges, and therefore you don’t have to feel that your solution must be unique. It just simply has to fit.

And there’s a world of difference – and cost – between “unique” and “fit”.

Let’s look at each of those challenges individually and explain what I mean.

Budget

Something I mention a bit in my book, and when I run training courses, is that I could take the entire budget, for an entire organisation, spend it solely on data protection activities, and still not come up with a solution that is 100% proof positive against any form of data loss or contingency that may happen. There’s always another contingency or potential problem looming around the corner. Sure, it might end up being something like “asteroid hits the earth” or “pandemic kills 99% of the human population”, but the net fact is: you can’t pre-emptively deal with every single possible scenario that may occur.

So it all becomes a game of “risk vs cost”. What’s the risk of it happening? What’s the cost of preparing for it? What’s the cost of it happening and not being prepared? What’s the risk that there’s nothing you can do about it?

As soon as you can start boiling everything down to “risk vs cost” you can actually prepare your data protection needs appropriately.
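To make the “risk vs cost” questions a little more concrete, here is a minimal sketch of the arithmetic. It assumes the simplest possible model: the expected yearly loss is the chance of an event happening in a year multiplied by what it costs if you are not prepared. The function names and the dollar figures in the usage lines are purely illustrative, and a real assessment would need to weigh far more than a single number.

```shell
# Sketch only: function names and figures are illustrative assumptions,
# not a formal risk methodology.

# Expected yearly loss = chance of the event in a year x cost if unprepared.
expected_loss() {
    awk -v p="$1" -v loss="$2" 'BEGIN { printf "%.0f\n", p * loss }'
}

# Prints "prepare" when preparing costs less than the expected loss,
# otherwise "accept" (i.e. knowingly carry the risk).
risk_vs_cost() {
    probability=$1
    cost_unprepared=$2
    cost_to_prepare=$3
    awk -v p="$probability" -v loss="$cost_unprepared" -v prep="$cost_to_prepare" \
        'BEGIN { print ((p * loss > prep) ? "prepare" : "accept") }'
}

# Hypothetical usage: a 5% yearly chance of a $200,000 loss,
# against $8,000 to prepare for it.
#   expected_loss 0.05 200000
#   risk_vs_cost 0.05 200000 8000
```

The point is not the precision of the numbers but that every scenario ends up being compared on the same two axes.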

Communication

Except in the smallest of businesses, there’ll be different departments. And as soon as you have different departments, you have to factor in communications between those departments. Effectively, at this point, we’re talking about IS – Information Services – rather than IT (Information Technology) getting involved. You need to have clear and effective communication between the various departments within the business and the IT group in order to ensure that everyone understands the data protection requirements. In fact, you need to have that communication for pretty much everything to work. (Otherwise you end up in a situation where people think the muck described by the 37 Signals essay is a realistic portrayal of IT.)

To form effective communication, you need a bridge between a department and IT. That bridge is IS; the IS people may actually be the same people as the IT people, but the fact remains that the communication must be held at the policy level rather than the technical level. It’s not the role of someone in department X to understand how Y is done. It’s the role of IS to take their requirements, take IT options, and present strategy and requirements to the business.

Or if you want to phrase it another way – imagine someone prancing around stage like a monkey with bad flop sweat screaming out “Communicate! Communicate! Communicate!”

It’s that important.

Regulatory Compliance

Like it or not, we’re in an age where there is regulatory compliance attached to a lot of data protection. How long should information be kept for? Does it need to be destroyed at the end of that life time, or can it just be kept ‘forever’ if that’s easier?

Someone, somewhere in the company, needs to be aware of the regulatory compliance requirements that affect the company. You might say this is part of communication, but usually there’s somewhat of a gulf between how long departments want to retain data for, and what they’re required to keep data for. As to which one is longer: well, flip a coin. You need to know both.

Age

Go to a museum or library. Find an old book in your language, pick it up, open it to a random page, and I bet you’ll still be able to mostly grasp what was written. As an example, I’ve read Leviathan (Thomas Hobbes) several times. It’s not necessarily easy going, but you can do it.

Can you confidently say that a document written by someone in say, WordStar 1.1, hanging around in a tired old directory on a fileserver somewhere within your environment is still readable?

While age presents particular problems to paper based record keeping, it’s never been easier to preserve and replicate such information. Grab it early enough, and you photocopy the original, or scan/OCR it. Suddenly you’ve got the information all over again, in relatively pristine format. It might be from several hundred years ago even, if not longer. There are fictional works out there going back 2000+ years that people just casually read, for instance.

But age presents a particular problem to data protection in a digital age: it doesn’t matter squat if you can recover, or keep online a document going back 5, 10, 15 years, if you can’t actually retrieve the data within it.

So age becomes a significant planning factor. How do you ensure that not only can you retrieve a file or chunk of data from 7 years ago, or 10 years ago, but that it is actually still meaningful to someone?

Volume

Without a doubt, the amount of data we’re storing each year grows at a fantastic rate. Data is somewhere between air and liquid – it seems to want to expand to fill whatever storage is available, within reason. The explosion in digital media is just further exacerbating this. I’d suggest that we’re moving from the first digital age into the second at the moment; the first digital age was where data was almost naturally structured – databases are a classic example. Now though, the second digital age is all about unstructured data. Educational facilities for instance are increasingly making every lecture done by every academic available – not as a bunch of PowerPoint slides, but the actual presentation, as a video file, and often as a separate audio file, to assist people with disabilities, or distant students.

That data growth is not slowing down. I don’t see it slowing down or plateauing any time soon – and nor does most of the storage industry.

Search

It used to be that finding data stored ‘somewhere’ was akin to finding a needle in a haystack. Now, it’s a case of finding a needle in dozens or hundreds of haystacks.

It doesn’t matter how much data you store online, or retain in backups, archive, etc., if you can’t find it when you need it. It’s the sister problem to the ‘age’ issue – there’s far more than just storage involved here.

Search is big business. We see that with Google every day, but let’s consider a prime example – it used to be that filesystem/OS search tools were primarily around filename search. “Tell me part of the file name, and I’ll have a hunt around for it”, was the old approach. Now, it’s “tell me something that’s in the file, and I’ll have a hunt around for it.” I use it every day. If anything, tools like Apple’s Spotlight, for instance, have devolved my previously anal retentive approach to file storage because I don’t have to rely so much on structure any longer. I can search by content.
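As a rough illustration of the difference between those two approaches, the sketch below contrasts a filename search with a content search using standard command-line tools. The function names are made up for this example, and it is a brute-force scan rather than an indexed search of the kind Spotlight performs.

```shell
# Sketch only: helper names are illustrative assumptions.

# Old approach: "tell me part of the file name".
search_by_name() {
    dir=$1
    fragment=$2
    find "$dir" -type f -name "*${fragment}*"
}

# Newer approach: "tell me something that's in the file".
# -r recurses, -l lists matching files only, -F treats the text literally.
search_by_content() {
    dir=$1
    text=$2
    grep -rlF -- "$text" "$dir"
}

# Hypothetical usage:
#   search_by_name    ~/Documents invoice
#   search_by_content ~/Documents "purchase order 1234"
```

The second form only works for text, which is exactly the limitation the next step in search has to overcome.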

That works for text. What’s coming next is searching by content for complex data and media. For instance, you can already search for audio – point your iPhone at a speaker, turn on Shazam, capture 11 seconds or so of a song and voilà, you’ve suddenly found a song based on a snippet. I imagine in 10 years’ time people who have some sense of pitch will be able to hum, sing or whistle a few bars and do the same thing. Image search is a growing area too – you can upload an image to some websites and find copies of it online – even to the point of say, finding larger, higher resolution copies of it online, etc.

Video? Undoubtedly coming.

The first vs second digital age analogy works well here too, I think. Search was able to be relatively simple when data was mostly structured. However, with that move to unstructured data, search becomes vitally important.

Make sure you have a search strategy.

(Finally) Formalisation

Most IT departments have grown from ad-hoc, informal processes within the average company. Start with a few people hired to keep systems running, and eventually as the company grows you’ve suddenly got a team of IT staff in a full time department.

What often doesn’t grow is the formality of the documentation and processes. It’s only natural that people will want to keep these as informal as possible, and I’m not suggesting that they need to be miracles of modern communication, but the simple fact remains: if it’s not written down, it doesn’t get done.

There reaches a point in any organisation where you have to be prepared to bite the bullet and admit “we have to take a more formal approach to things”. Implementing change control is a classic example; most big businesses take this for granted – yet most small businesses will start out with almost no change control process at all. Eventually though the business will hit a critical size and it becomes vitally important to actually have a real change control process.

That same jump from informal to formal is required on every level. You need formal documentation about how the network hangs together, you need formal documentation about creating new user accounts, etc. And you definitely need formal documentation about how data protection is handled within the company.

Summarising

Coming back to the original list, I can reiterate that the challenges faced in data protection are:

  1. Budget
  2. Communication
  3. Regulatory Compliance
  4. Age
  5. Volume
  6. Search
  7. Formalisation

None of those, individually, should be any surprise to anyone. Again, they’re not unique to anyone either. We all have these same issues, regardless of whether we’re a customer, an integrator, a vendor, a whatever.

As soon as you acknowledge the challenges though, you can plan to overcome them.

 

As a consultant, you get attuned to (or as some would have it, “cynical” about) certain key phrases and statements when you’re in meetings. Sometimes these statements are innocent and exactly what the person says, but usually they set the alarm bells ringing.

As a bit of winding down after a hectic 7 days, I thought I’d share the top 15 statements that cause me to start immediately trying to get deep qualification of what I’ve just been told…

| What they say... | What I worry it means... |
| --- | --- |
| "Our backup results get filed automatically and someone reviews them." | "We have a server that hasn't successfully backed up for 6 months, but no-one's been checking the notifications." |
| "All our backups fit on a single tape" | "We upgrade our hardware every time this isn't the case." |
| "We're very selective about what we backup." | "We have critical production systems we forgot to add to our schedule." |
| "We don't want to get backup notifications." | "Backup? Meh." |
| "Our DBAs do their own backups." | "The DBAs don't believe in enterprise backup software and think dumps are better" ... OR ... "The backup administrators have lost control of the system and it's spiralling out of control." |
| "We don't have SLAs" | "No one wants ownership of establishing SLAs" |
| "We don't need SLAs" | "We trust in luck, and hope we don't ever need SLAs" |
| "Our users are responsible for backing up their laptops" | "Every day we're losing critical data that may be legally or fiscally required by the company." |
| "We don't have to do monthly backups." | "Even though we know we SHOULD do monthly backups, until someone puts it in writing, we're not going to." |
| "We've been asked to shrink our backup budget..." | "The business has this crazy idea that backup is an IT function and problem." |
| "Tape is dead" | "Someone with a vested interest in selling lots of HDD storage has visited lately." |
| "We do per-incident support." | "We have an Icarus support contract." |
| "It's too busy here to do capacity planning." | "We're wasting money as fast as we can get the budget for it." |
| "We don't need to {clone or otherwise duplicate} our backups." | "We're going to suffer a critical data loss situation." |
| "We only backup production data." | "A lot of people's work within the company is unprotected." |

 

The folks over at 37 Signals published a little piece of what I would have to describe as crazy fiction, about how the combination of cloud and more technically savvy users means that we’re now seeing the end of the IT department.

I thought long and hard about writing a rebuttal here, but quite frankly, their lack of logic made me too mad to publish the article on my main blog, where I try to be a little more polite.

So, if you don’t mind a few strong words and want to read a rebuttal to 37 Signals, check out my response here.

 

A while ago I got rather frustrated with the performance of compression utilities on my Mac Pro. It’s a bit of a beast; it was the last model before the Nehalem-based systems, with 8 x 3.2GHz cores and a respectable 20GB of RAM, but compression isn’t always as fast as I’d have liked.

Coming from a long-term Unix background, I have tended mostly to use bzip2 – brilliant compression ratios, but slow, slow, slow.

Eventually it occurred to me that the problem was simple: bzip2 is single-threaded. I can compress a file and, using Activity Monitor, see a single core sit at a high utilisation rate – but that’s all.

Apple’s Grand Central though made me think: how much better would compression utilities work if they could run against multiple cores? Not running multiple compression activities at the same time – but splitting it up and hitting as many cores as possible at once.

As you’d expect, the answer is: much, much better. With a little bit of searching, I found pbzip2 – a parallel processor version of bzip2. Check it out, and be sure to donate to the programmer – he deserves the support.

Without a doubt, it’s a much faster way of compressing files when you have a bunch of cores available to throw at the problem. Here’s a test scenario:

  • Generate an 8GB, highly random file.
  • Time a regular bzip2 on the file.
  • Time a pbzip2 on the file.
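
The post doesn’t show how the test file was generated, so the following is only one possible way to do it: a minimal sketch that reads from /dev/urandom with dd. The helper name make_random_file and its arguments are made up for illustration, and the demo deliberately builds a small file rather than the full 8GB.

```shell
# make_random_file PATH SIZE_KB
# Hypothetical helper: fills PATH with SIZE_KB kilobytes of random data,
# which should be essentially incompressible.
make_random_file() {
    dd if=/dev/urandom of="$1" bs=1024 count="$2" 2>/dev/null
}

# Demo with a small 1MB file; an 8GB file would need SIZE_KB=8388608.
tmp=$(mktemp)
make_random_file "$tmp" 1024
wc -c < "$tmp"    # size in bytes of the generated file
rm -f "$tmp"
```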

The results? See for yourself:

# du -hs test.dat
8.0GB   test.dat
# date ; bzip2 < test.dat > test-bzip.dat.bz2; date
Sun 13 Feb 2011 12:33:11 EST
Sun 13 Feb 2011 13:02:16 EST
# date; pbzip2 < test.dat > test-pbzip.dat.bz2; date
Sun 13 Feb 2011 13:06:58 EST 
Sun 13 Feb 2011 13:12:05 EST

Going by the date stamps, that’s 29 minutes, 5 seconds for bzip2 against 5 minutes, 7 seconds for pbzip2 – roughly 5.7 times faster. In the above case, the compressed files are both still effectively 8GB, since the source file was designed not to be susceptible to compression. Looking at a “real world” example then, I’ll pick a virtual machine. In actual fact, being able to quickly compress copies of virtual machines was the reason I first looked around for a better compression utility, so it makes sense to do so. Picking a small virtual machine, I can see:

[Sun Feb 13 13:20:42]
preston@aralathan /Volumes/Data/VMs
$ du -hs test02.pvm/
6.1G	test02.pvm/

Now, compressing with regular bzip2 via tar, we get:

$ date; tar cf - test02.pvm | bzip2 -c > ~/Desktop/test-bzip2.pvm.bz2; date
Sun 13 Feb 2011 13:25:37 EST
Sun 13 Feb 2011 13:40:45 EST
$ du -ms ~/Desktop/test-bzip2.pvm.bz2
2087	/Users/preston/Desktop/test-bzip2.pvm.bz2 

That was a total of 15 minutes, 8 seconds to compress 6.1GB down to 2087MB using conventional bzip2.

Moving on to tar/pbzip2:

$ date; tar cf - test02.pvm | pbzip2 -c > ~/Desktop/test-pbzip2.pvm.bz2; date
Sun 13 Feb 2011 13:47:23 EST
Sun 13 Feb 2011 13:49:53 EST
$ du -ms ~/Desktop/test-pbzip2.pvm.bz2 
2092	/Users/preston/Desktop/test-pbzip2.pvm.bz2

So while it cost an extra 5MB in storage, pbzip2 compressed the same data in 150 seconds – or two and a half minutes, if you will. (I should note that this was reading from a 2 x 7200 RPM stripe, and writing to SSD. At this point the compression seems to be IO bound on the read, if anything – I had previously compressed the same data in 104 seconds to a FireWire 800 drive when reading from a 3 x 7200 RPM stripe.)
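
To put rough numbers on that trade-off, here’s a small sketch using awk for the arithmetic. The elapsed times are taken from the date stamps above (13:25:37 to 13:40:45 is 908 seconds; 13:47:23 to 13:49:53 is 150 seconds) and the sizes from the du output. The helper names speedup and overhead_pct are invented for this example.

```shell
# speedup BASELINE_SECONDS PARALLEL_SECONDS
# Hypothetical helper: prints how many times faster the second run was.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1fx\n", a / b }'
}

# overhead_pct BASELINE_MB PARALLEL_MB
# Hypothetical helper: prints the extra space used by the second file,
# as a percentage of the first.
overhead_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f%%\n", (b - a) * 100 / a }'
}

speedup 908 150          # bzip2 vs pbzip2 elapsed seconds -> 6.1x
overhead_pct 2087 2092   # bzip2 vs pbzip2 output size in MB -> 0.24%
```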

If you need high-speed compression, be sure to check out pbzip2.

NOTE: You may be wondering why I didn’t use the Unix ‘time’ command. Ever since encountering a bug in Tru64 Unix ‘time’, I’ve steered clear of it. (That bug resulted in ‘time’ blowing out the runtime of what it was monitoring by several orders of magnitude.) I know, it would be safe to go back to ‘time’ by now, but for the purposes of what I needed to demonstrate, “date; command; date” was more than sufficient.
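
If you’d rather not subtract the date stamps by hand, a variation on the same idea is to capture epoch seconds before and after the command. This is just a sketch: the helper name elapsed_seconds is made up, and it assumes a date that supports +%s (the ones shipped with Mac OS X and Linux do, though it isn’t strictly POSIX).

```shell
# elapsed_seconds COMMAND [ARGS...]
# Hypothetical helper: runs the command, then prints the wall-clock seconds
# it took, using epoch timestamps from date rather than the time command.
elapsed_seconds() {
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    echo "$((end - start))"
}

# Demo: time a one-second sleep.
elapsed_seconds sleep 1
```

A pipeline has to be wrapped so that it counts as a single command, for example: elapsed_seconds sh -c 'tar cf - test02.pvm | pbzip2 -c > test.tar.bz2'. Resolution is whole seconds only, which is plenty for runs measured in minutes.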

 

Deduplication can create fantastic space saving opportunities within an environment, but it does also create the need for a much closer eye on space management.

We’re used, in conventional backup or storage situations, to the following two facts:

  • There is a 1:1 mapping between amount of data deleted and amount of space reclaimed.
  • Space reclamation after delete is near instantaneous.

Data deduplication systems throw both those facts out. In other words, there’s no free lunch: you may be able to store staggeringly large amounts of data on relatively small amounts of storage, but there are always swings and roundabouts.

With deduplication systems, you must carefully, aggressively monitor storage utilisation since:

  • There is no longer a 1:1 mapping between amount of data deleted and amount of space reclaimed: You might, if you’re running out of space, selectively delete several TB of data, but due to the nature of deduplication, reclaim only a very small amount of actual physical space as a consequence.
  • Space reclamation is not immediate: whenever data is deleted from a deduplication system, the system must scan remaining data to see if there are any dependencies. Only if the data deleted was completely unique will it actually be reclaimed in earnest; otherwise all that happens is that pointers to unique data are cleared. (It may be that the only space you get back is the equivalent of what you’d pull back from a Unix filesystem when you delete a symbolic link.) Not only that, reclamation is rarely run on a continuous basis on deduplication systems – instead, you either have to wait for the next scheduled process, or manually force it to start.
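
A toy model makes the first point concrete. The sketch below is an assumption for illustration only, not how any particular deduplication product works: each backup is a text file listing the IDs of the chunks it references, and the made-up helper reclaimable_chunks counts how many chunks of a deleted backup are not referenced by any surviving backup.

```shell
# reclaimable_chunks DELETED_BACKUP [SURVIVING_BACKUP...]
# Toy model (an assumption, not any vendor implementation): each backup is a
# text file listing the chunk IDs it references, one per line.
# Prints how many chunks of the deleted backup no surviving backup still needs.
reclaimable_chunks() {
    awk '
        FNR == 1 { file++ }                  # count input files as they start
        file == 1 { mine[$1] = 1; next }     # chunks referenced by the deleted backup
        { shared[$1] = 1 }                   # chunks referenced by surviving backups
        END {
            for (c in mine) { total++; if (!(c in shared)) freed++ }
            printf "%d of %d chunks reclaimable\n", freed, total
        }
    ' "$@"
}

# Demo: two backups that share most of their chunks.
dir=$(mktemp -d)
printf '%s\n' a b c d e > "$dir/monday"
printf '%s\n' a b c d f > "$dir/tuesday"
reclaimable_chunks "$dir/monday" "$dir/tuesday"   # only chunk e is unique to monday
rm -r "$dir"
```

Deleting the "monday" backup here removes five chunk references but frees only one chunk, which is the same effect as deleting several TB and getting very little physical space back.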

The net lesson? Eternal vigilance! It’s not enough to monitor and start to intervene when there’s, say, 5% of capacity remaining. Depending on the deduplication system, you may find that 5% remaining space is so critically low that space reclamation becomes a complete nightmare. In reality, you want to have alerts, processes and procedures targeting the following watermarks:

  • 60% utilisation – be on the lookout for unexpected data growth.
  • 70% utilisation – be actively monitoring daily consumption rates.
  • 75% utilisation – you should know by now whether you have to expand the storage, or whether usage will stabilise again.
  • 80% utilisation – start forcing space reclamation to occur more frequently.
  • 85% utilisation – If you have to expand the storage, the purchase process should be complete and you should be ready to install/configure.
  • 90% utilisation – have emergency processes in place and ready to activate for storage redirection.
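
As a starting point for alerting, the watermarks above could be encoded along these lines. It’s a sketch only: the function name watermark_action and the message wording are my own, and how you obtain the utilisation percentage depends entirely on what your deduplication system reports.

```shell
# watermark_action PERCENT_USED
# Hypothetical helper: maps a utilisation percentage (whole number) to the
# highest watermark it has crossed, using the thresholds from the list above.
watermark_action() {
    pct=$1
    if   [ "$pct" -ge 90 ]; then echo "90%: emergency redirection processes should be ready to activate"
    elif [ "$pct" -ge 85 ]; then echo "85%: expansion purchase should be complete and ready to install"
    elif [ "$pct" -ge 80 ]; then echo "80%: force space reclamation more frequently"
    elif [ "$pct" -ge 75 ]; then echo "75%: decide whether to expand or whether usage will stabilise"
    elif [ "$pct" -ge 70 ]; then echo "70%: monitor daily consumption rates"
    elif [ "$pct" -ge 60 ]; then echo "60%: watch for unexpected data growth"
    else echo "ok: below all watermarks"
    fi
}

watermark_action 82
```

Run from a scheduled job, the output could be mailed or logged whenever it changes from the previous run, so that each watermark triggers its process once rather than every day.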

With these watermarks noted and understood, deduplication will serve your environment well.

© 2012 The NetWorker Blog