Takeaways from the Seattle Technical Forum: Case of Big Data

Yesterday I attended the Seattle Technical Forum: Case of Big Data, and I would like to post my takeaways here for your reference.

  • Data Visualization: Tools and Techniques by Avkash Chauhan, Sr. Escalation Engineer, Windows Azure and HDInsight, Microsoft

    I arrived late last night, so I only attended the second half of Avkash's talk.  Avkash did a cool demo on data visualization using HDInsight.  I have included Avkash on this e-mail in case you want more information about his talk.

  • Big Data for Erlang VM and Manycore OS by Ying Li, Chief Scientist, Concurix (Former GM of Microsoft AdCenter)

    Ying was at Microsoft for 15 years, always working on data mining.  She left Microsoft last year and co-founded Concurix, which aims to take full advantage of multi-core hardware.

    Ying's talk focused on how she uses her knowledge as a data scientist to tune the OS and language stacks.  A few highlights:

  1. Today we can scale roughly linearly up to 8 to 16 cores, but it is very hard to scale up to 64 or 128 cores.  Her goal is that the customer hands over their app, and Concurix tunes the OS and the language runtime to make it scale without any changes to the app.
  2. They treat this as a problem of measuring and tuning a lot of independent variables, such as heap size, garbage collection settings, etc. (see the sketch after this list).
  3. She claims they can handle 3,600 (I forget whether this was 3,600 or 36,000) typical web requests per second on a typical Amazon VM.
  4. They use Erlang as the language since it is well suited to multicore programming.
  5. They are running on Amazon EC2.
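
Point 2 above is essentially a black-box parameter sweep. As a rough illustration (not anything Concurix actually showed), here is a minimal Python sketch of measuring and tuning a few independent runtime variables; the knob names, value ranges, and the run_benchmark placeholder are all my own assumptions:

```python
import itertools
import random

# Hypothetical tuning knobs, loosely inspired by Erlang VM settings
# (heap size, fullsweep_after for GC, scheduler threads). The actual
# variables Concurix tunes were not described in the talk.
HEAP_SIZES_MB = [64, 256, 1024]
GC_FULLSWEEP_AFTER = [0, 10, 65535]
SCHEDULER_THREADS = [8, 16, 64]

def run_benchmark(heap_mb, fullsweep_after, schedulers):
    """Placeholder: run the customer's app with these settings and
    return the measured requests per second."""
    random.seed(hash((heap_mb, fullsweep_after, schedulers)))
    return random.uniform(1000, 4000)

best = None
for heap, sweep, sched in itertools.product(
        HEAP_SIZES_MB, GC_FULLSWEEP_AFTER, SCHEDULER_THREADS):
    rps = run_benchmark(heap, sweep, sched)
    if best is None or rps > best[0]:
        best = (rps, heap, sweep, sched)

print("Best observed: %.0f req/s (heap=%d MB, fullsweep_after=%d, schedulers=%d)" % best)
```

The interesting part is that the app itself never changes; only the runtime settings are varied and measured.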

I asked her a question about the data scientist career path: since big data is so hot and so many people are interested in this area (this is the first time the meeting room of Bellevue Hall was full, with more than 10 people standing), what is your advice for people?

She answered my question with three key points:

    • Be an outlier.  When she joined the Microsoft MSN team from an academic background, there were only two people in MSR who knew data mining, so it was quite unusual that she joined a product team rather than MSR.  Also, her managers always supported her work and kept such an outlier happy to work there.  Personally, I don't fully understand this point; I think what she meant is that you should be unique, be passionate, and do things differently.
    • As a data scientist, you should always remind yourself that you provide a SERVICE to your customer.  You should know your customer's product better than they do, but don't let them know that you know more than they do.  She gave an example: when doing data mining for MSN Music, her team understood the whole stack and knew all the details of how a user clicks through.
    • Love data, live with data.

She also touched on another topic: how to make people see the value of data and make decisions based on it.  She said it is very hard; she has roughly a 10:1 ratio when convincing people to make data-driven decisions.  On average it takes 10 to 18 months for a manager to fully embrace data-driven decision making (you can imagine that a lot of factors influence how people make decisions today, and it is not easy to get there).  By the time they get there, it is often too late:

  1. Your data is out of date, and useless.
  2. Your target has changed, and what you were measuring is no longer relevant.
  3. And you yourself have changed: you might have been re-orged, or your product might have changed direction.

My takeaway from this is that speed is one of the most important factors: you should get your data out as soon as possible.  If we focus too much on building a complete data analytics solution, it might be useless by the time it finally arrives.  Start small, ask the questions you want to answer, generate some reports from your data, drive some actions from your manager, and keep improving from there.

  • Big Data for the Masses: How We Opened Up the Doors to Google's Dremel by Jim Caputo, Engineering Manager, BigQuery, Google

Jim talked about Google's Dremel query engine and BigQuery.  Unlike MapReduce, which targets batch processing over large data sets, Dremel aims to answer complex ad-hoc queries in a very short time.  He showed a live demo querying all Wikipedia documents for the top pages whose description matches G*g??l?; the query returned in about 30 seconds after scanning 600 GB of data.  Dremel uses a lot of traditional RDBMS techniques, including column stores, which I think some of the DBMS folks at Microsoft might find interesting.
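
I did not write down the exact query, but below is a rough sketch of that kind of ad-hoc query issued through the BigQuery Python client; the bigquery-public-data.samples.wikipedia table, the grouping by revision count, and my regex translation of the G*g??l? pattern are assumptions for illustration, not the exact demo query:

```python
# Rough sketch only: assumes a GCP project with BigQuery enabled and the
# google-cloud-bigquery client installed (pip install google-cloud-bigquery).
from google.cloud import bigquery

client = bigquery.Client()

# Glob G*g??l? translated to a regex: * -> .*, ? -> . (single character).
sql = """
    SELECT title, COUNT(*) AS revisions
    FROM `bigquery-public-data.samples.wikipedia`
    WHERE REGEXP_CONTAINS(title, r'G.*g..l.')
    GROUP BY title
    ORDER BY revisions DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.title, row.revisions)
```

The point of the demo was that a full scan of hundreds of gigabytes comes back interactively, with no indexes or pre-aggregation set up in advance.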