Questions about “big data”

I’ve been watching the “big data” discussion happen in a variety of circles, with a slightly cynical concern that this may be like Cloud 2.0 – another sad meme for technology that’s already been in use for some time, but with an excuse to slap a 30% markup on it.

So, the simple question really is this – is “big data” a legitimate or an illegitimate problem?

By legitimate, I mean: is it a problem which truly exists in and of itself? Has data growth in some areas hit a sufficiently steep curve that existing technologies and approaches can’t keep up …

OR

… is it an illegitimate problem, in that it speaks of (a) a dumbing down of computer science which has resulted in a lack of developmental foresight into problems we’ve seen coming for some time, and/or (b) a failure of IT companies (from base component manufacturers through to vendors across the board) to innovate sufficiently?

For me, the jury is still out, and I’ll use a simple example as to why. I deal with big data regularly – since “big data” is defined as being anything outside of a normal technical scope, if I get, say, a 20 GB log file from a customer that I have to analyse, none of my standard tools assist with this. So instead, I have to start working on pattern analysis – rather than trying to extract what may be key terms or manually read the file, I’ll skim through it – I’ll literally start by “cat”ting the file and just letting it stream in front of me. At that level, if the software has been written correctly, you’ll notice oddities in the logs that start pointing you to the area you have to delve into. You can then refine the skimming, and eventually drill down to the point where you actually just analyse a very small fragment of the file.
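
To make that concrete, here’s a minimal sketch, in Python, of what that skim-then-drill-down process might look like if automated. The file name, the sampling stride and the “oddity” heuristics are all assumptions for illustration, not a description of any particular tool.

```python
# Sketch of the "skim, refine, drill down" approach described above.
# The log path and the anomaly heuristics below are illustrative assumptions.
import re
from collections import Counter

LOG_PATH = "customer.log"  # hypothetical 20 GB log file
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")  # assumed log format

def skim(path, stride=10_000):
    """First pass: sample every Nth line so the whole file streams past quickly."""
    with open(path, errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if lineno % stride == 0:
                yield lineno, line.rstrip()

def oddities(path, stride=10_000):
    """Second pass: flag sampled lines that look unusual -- very long lines,
    lines without a timestamp, or rare message 'shapes' (digits stripped out)."""
    shapes = Counter()
    samples = []
    for lineno, line in skim(path, stride):
        shape = re.sub(r"\d+", "#", line)[:80]
        shapes[shape] += 1
        samples.append((lineno, line, shape))
    return [
        (lineno, line)
        for lineno, line, shape in samples
        if shapes[shape] == 1 or len(line) > 2000 or not TIMESTAMP.match(line)
    ]

def drill_down(path, target_lineno, context=200):
    """Final pass: pull a small window of lines around a flagged region --
    the fragment you actually read in full."""
    lo, hi = target_lineno - context, target_lineno + context
    with open(path, errors="replace") as fh:
        return [line for lineno, line in enumerate(fh, start=1) if lo <= lineno <= hi]

if __name__ == "__main__":
    for lineno, line in oddities(LOG_PATH)[:20]:
        print(f"{lineno}: {line[:120]}")
    # Then, for whichever flagged line number looks interesting:
    # print("".join(drill_down(LOG_PATH, 1_234_567)))
```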

So I look at big data and think – is this a problem caused by a lack of AI being applied to standard data processing techniques? Of admitting that we need to build a level of heuristic decision making into standard products so they can scale up to deal with ever-increasing data sets? That the solution is more intelligence and self-management capability in the software and hardware? And equally, of developers failing to produce systems that generate data in such a way that it’s amenable to automated pattern analysis?
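
On that last point, here’s a small, purely illustrative sketch of the difference it makes when software emits structured records rather than free-text lines. The helper and field names are invented for the example; the point is only that, given self-describing output, even a trivial heuristic can rank events by rarity without anyone skimming the raw stream.

```python
# Illustrative sketch: emit self-describing JSON log records, then let a
# trivial heuristic surface the rare events. Helper and field names are
# assumptions for the example, not any particular product's API.
import json
import logging
from collections import Counter

logger = logging.getLogger("app")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, **fields):
    """Emit one structured JSON record per event (hypothetical helper)."""
    logger.info(json.dumps({"event": event, **fields}))

def summarise(lines):
    """Trivial heuristic: count event types and list the rarest first,
    since rare events are often the interesting ones."""
    counts = Counter(json.loads(line)["event"] for line in lines if line.strip())
    return counts.most_common()[::-1]

if __name__ == "__main__":
    log_event("request_completed", status=200, ms=12)
    log_event("request_completed", status=200, ms=15)
    log_event("disk_write_failed", device="/dev/sdb", errno=5)

    sample = [
        '{"event": "request_completed", "status": 200, "ms": 12}',
        '{"event": "request_completed", "status": 200, "ms": 15}',
        '{"event": "disk_write_failed", "device": "/dev/sdb", "errno": 5}',
    ]
    print(summarise(sample))
```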

Of course, this is, to a good degree, what people are talking about when they’re talking about big data.

But why? Do we gain any better management and analysis by cleaving “data” and “big data” into two separate categories?

Or is this a self-fulfilling meme that arose out of poor approaches to information science?
