What’s so BIG about “Big Data”?

As announced during the PASS Summit 2011 Day One Keynote, we are diving deeper into the world of Big Data by embracing and contributing to the open source community and Hadoop.

We’ve had a lot of good coverage on this topic with some examples below.


Openness – yes, we’re serious about it!

A key aspect is openness and our commitment to give back to the Open Source community. While this is just a small start, we took a first step forward with SQL Community ambassadors Lara Rubbelke (@sqlgal) and Bronwyn McNutt (@MissBronwyn) leading the effort at one of the ultimate open source conferences – ApacheCon North America 2011.

Hadoop on Azure / Windows Server Founder and General Manager Alexander Stojanovic (@stojanovic), Dave Vronay (@davevr), and I also joined the festivities to introduce ourselves and listen to the Apache community (and we learned a lot!). Luis Daniel (@luisdans) from our product planning team joined as well. This is just the beginning, but it should show our commitment to openness! More to come on that in future blog posts.

@dsfnet had appropriately tweeted  “@dennylee You are really taking this Hadoop/Apache thing to heart :)”. 
Yes – we are!

BTW, for more thoughts on the openness aspect, see my personal blog post on the topic: The aggressive optimism of Hadoop.


But why is Hadoop important to us?

From a CAT perspective, why is Hadoop important? The straightforward answer is that it is important for our customers.

As you may know from our team’s blog posts, technical notes, and whitepapers, we work on some of the most complex Tier-1 Enterprise SQL Server implementations. Over the last few years, we’ve seen more and more customers talk about their Big Data problems. Buck Woody (@buckwoody) described these problems best in his post Big Data and the Cloud – More Hype or a Real Workload?: “Big data is defined as a large set of computationally expensive data that is worked on simultaneously”.

As Alexander Stojanovic, Founder and General Manager of Hadoop on Azure / Windows Server, noted during the ApacheCon 2011 North America Meetup:

It’s not just your “Big Data” problems, it’s about your BIG “Data Problems”

A great example of one of these BIG “Data Problems” is the Yahoo! 24TB Analysis Services cube – the largest known cube!  The cube’s data source is 2PB from a huge Hadoop cluster.   


In terms of the 4 V’s of Big Data, Yahoo! not only processes a massive volume of campaign web analytics data, but also has to make decisions quickly based on the incoming events (i.e., velocity). Both the variety (many different formats of data) and the variability (many different ways of interpreting this data) of the incoming data can be intense, hence the importance of using Hadoop. For fast interactive querying, Yahoo! uses Analysis Services to pivot against all of this interesting data.

Another great post on this very topic comes from Klout: Big Data, Bigger Brains
…Yes, you read it right.  We use a Windows based multi-dimensional (MOLAP) product from Microsoft to load 350 million rows of Hive data per day and achieve an average query response time of under 10 seconds on 35 billion rows of data

For more information, you can also refer to the PASS 2010 Summit Day One Keynote and/or the PASS 2011 Session “SQLCAT: Tier-1 BI in a world of Big Data”.   And yes, we are working on a case study as we speak!