A Database By Any Other Name...

Some friends have just adopted a rather cute ginger cat and decided to name it Juno, perhaps after the Queen of the Roman Gods. Though it regularly leads to the interesting conversation: "What's your cat's name?" - "Juno" - "No I don't, that's why I'm asking"...

Meanwhile, here at p&p we're just starting on a project named after one of the new religions of information technology: Big Data. It seems like a confusing name for a technology if you ask me (though you probably didn't). Does it just consist of numbers higher than a billion, or words like "floccinaucinihilipilification" and "pseudopseudohypoparathyroidism"?

Or maybe what they really mean is Voluminous Data, where there's a lot of it. Too much, in fact, for an ordinary database to be able to handle and query in a respectable time. Though most of the examples I've seen so far revolve around analyzing web server log files. It's hard to see why you'd want to invent a whole new technology just for that.

Of course, what's at the root of all this excitement is the map/reduce pattern for querying large volumes of distributed data, though the technology now encompasses everything from HDFS (the Hadoop Distributed File System) to connectors for Excel and other products that allow analysis of the data. And, of course, the furry elephant named Hadoop that sits in the middle remembering everything.
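
In case you've not met it before, the pattern itself is simple enough to sketch in a few lines. Here's a toy version in Python - not Hadoop, just the shape of the idea, with an invented web-log format and a made-up status-code count: map emits a key/value pair per record, a shuffle step groups the pairs by key, and reduce aggregates each group.

```python
from collections import defaultdict

# --- map: emit a (key, value) pair for each input record ---
def map_fn(log_line):
    # Count requests per HTTP status code in a (made-up) web server log.
    parts = log_line.split()
    status = parts[-2]          # assume the status code is the next-to-last field
    yield (status, 1)

# --- shuffle: group the emitted values by key ---
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# --- reduce: aggregate the values for each key ---
def reduce_fn(key, values):
    return (key, sum(values))

log_lines = [
    '10.0.0.1 - - "GET /index.html" 200 512',
    '10.0.0.2 - - "GET /missing.html" 404 128',
    '10.0.0.1 - - "GET /about.html" 200 256',
]

pairs = (pair for line in log_lines for pair in map_fn(line))
results = [reduce_fn(key, values) for key, values in shuffle(pairs).items()]
print(results)   # [('200', 2), ('404', 1)]
```

The clever part in a real cluster is that map_fn runs on the node where each chunk of the log lives, and only the small (key, value) pairs travel over the network to the reducers.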

Thankfully Microsoft has adopted a new name for its collection of technologies previously encompassed by Big Data. Now it's HDInsight, where I assume the "HD" means "highly distributed". There's a preview in Windows Azure and a local server-based version you can play with.

What's interesting is that when I first started playing with real computers (an IBM 360), all data was text files with fixed-width columns that the code had to open and iterate through, parsing out the values. The company where I worked used to have four distinctly separate divisions, each with its own data formats, but these had since been merged into one company-wide sales division. To be able to assemble sales data we had a custom program written in RPG II that opened a couple of dozen files, read through them extracting data, and assembled the summaries we needed - we'd built something vaguely resembling the map/reduce pattern. Though we could only run it at night because it prevented most other things from working by locking all the files and soaking up all of the processing resources.

Thankfully relational databases and Structured Query Language (SQL) put paid to all that palaver. Now we had a proper system that could store vast amounts of data and run fast queries to extract exactly what we needed. In fact we could even do it from a PC. And yet here we are, with our highly distributed data and file systems, going back to the world of reading multiple files and aggregating the results by writing bits of custom code to generate map and reduce algorithms.

But I guess when you appreciate the reasons behind it, and start to grasp just how vast the data involved can be, our new (old) approach starts to make sense. By taking the processing to the data, rather than moving the data to the processing, you get distributed parallel execution across multiple nodes, and faster responses. At that scale, our modern relational and SQL-based approach just doesn't cut it.

Though there are some interesting questions that nobody I've spoken to so far has answered satisfactorily. What happens when you need more than just a simple aggregate result? It seems likely that the map function needs to produce a result set that is considerably smaller than the data it's working on, and if there is little correspondence between the data on each node, the reduce function won't be able to do much reducing.
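
To see what I'm worried about, here's another toy Python sketch (the two "nodes" and their records are entirely made up): when every key turns up only once, grouping and reducing leaves you with a result set just as big as what map produced.

```python
from collections import defaultdict

# Two hypothetical nodes emit (key, value) pairs from their local data.
# Here the keys barely repeat - the situation described above.
node_a = [("user-001", 512), ("user-002", 128)]
node_b = [("user-003", 256), ("user-004", 64)]

# Shuffle: group the values by key across both nodes.
groups = defaultdict(list)
for key, value in node_a + node_b:
    groups[key].append(value)

# Every key occurs exactly once, so "reducing" each group changes nothing:
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)
# {'user-001': 512, 'user-002': 128, 'user-003': 256, 'user-004': 64}
# The output is as large as the mapped data - no real reduction happened.
```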

Maybe I just don't get it yet. And maybe that's why being just a "database programmer" is no longer good enough. Now, it seems, you need to be a "data scientist". According to DataScientists.net, you not only need to know about database theory, but about the Agile Manifesto and Spiral Dynamics as well. You're going to spend the rest of your life organizing, packaging, and delivering data rather than writing programs that simply run SQL queries.

But it does seem that data scientists get paid a lot more, so maybe this Big Data thing really is a good idea after all...