NOTE This post is one in a series on Hadoop for .NET Developers.
Big Data has been a source of excitement in the analytics community for a few years now. For the purpose of this blog series, I’ll loosely define the term to mean an expansion of focus from data originating from core operational systems - the domain of traditional Business Intelligence - to include new data sources either historically neglected or newly available.
That is a gross simplification of the term Big Data, but it is the challenges inherent in working with these new data that have driven the adoption of new data platforms. And as we’re focusing on Hadoop, the most widely recognized of these new data platforms, such a limited definition of Big Data should suffice.
Hadoop is an elastic, distributed, schemaless data processing platform. It is ideal for scenarios where you have large data sets with low per-record value, e.g. log files, because it provides a low-investment way to get at the data. It is also a good platform for complex data that require sophisticated parsing and interpretation, e.g. XML or JSON documents or image files, and for data that may be subject to variable interpretations, e.g. customer Tweets (in a JSON document).
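To make the "schemaless" and "variable interpretations" points a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the sample records, field names, and function names are all hypothetical and chosen only for illustration. The idea it shows is that the raw records are stored as-is, and each analysis applies its own interpretation when it reads them.

```python
import json

# Hypothetical raw records, kept exactly as they arrived.
# No schema was enforced when they were written.
RAW_TWEETS = [
    '{"user": "alice", "text": "Love the new phone!", "lang": "en"}',
    '{"user": "bob", "text": "Battery died again...", "retweets": 3}',
    'not valid json',
]


def parse(lines):
    """Yield each line as a dict, skipping lines that do not parse."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed input instead of failing the job
        if isinstance(record, dict):
            yield record


def tweets_per_user(lines):
    """Interpretation 1: count tweets by author."""
    counts = {}
    for record in parse(lines):
        user = record.get("user", "unknown")
        counts[user] = counts.get(user, 0) + 1
    return counts


def total_retweets(lines):
    """Interpretation 2: sum retweets, treating a missing field as zero."""
    return sum(record.get("retweets", 0) for record in parse(lines))


if __name__ == "__main__":
    print(tweets_per_user(RAW_TWEETS))
    print(total_retweets(RAW_TWEETS))
```

Both functions read the same unmodified input and neither requires the records to share a fixed set of fields, which is the flexibility the paragraph above is describing.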
In addition, Hadoop is a great platform when you need massive scalability beyond what can be accomplished with a traditional relational database platform. That said, I’m not finding this last scenario to be applicable to many of my customers (though it is applicable to some). For the vast majority of the folks with whom I work, the economics and flexibility of Hadoop tend to be the most compelling reasons to explore this platform.