Getting started with big data (HDInsight, Hadoop, etc.)

I'm currently focused on learning about big data. Over a series of posts I'll show the path I took from being new to big data to eventually doing something worthwhile with it, specifically using HDInsight, Microsoft's Hadoop distribution on Azure. If big data is new to you, I hope you can learn a few things below.

Also, this blog will contain my personal notes. Rather than keep my notes locally, I thought I would share them so that others can learn from them too. Since these are my notes, they will appear rough at times.

Getting started with HDInsight and Hadoop

I started using HDInsight a few months ago. I found the following info helpful:    

  • Hortonworks has great tutorials.  

  • Microsoft offers this learning map. See the "real world scenarios" column at the bottom of that page for hands-on tutorials.

If you like the overviews above, you'll need a place to try the tutorials. I'm aware of three options: your own cluster in Azure, the HDInsight Emulator, or the Hortonworks Sandbox.

  • Sign up and provision your own Hadoop cluster at https://azure.microsoft.com/. You can use the free trial or, once that ends, pay for what you use.

  • Use an emulator or sandbox.  

    I wish I had known about the HDInsight Emulator a few months ago. It's good to know how to set up an HDInsight cluster in Azure and how to get remote access to it. However, once you've done these things the first time, they're simple to repeat.

    When you're not using your HDInsight cluster in Azure, you should delete it; otherwise you'll be charged (unless you're using the free trial). Each time I wanted to learn something new about Hadoop I had to spin up a cluster, walk through the tutorials, and then delete the cluster. With the HDInsight Emulator I could have avoided the spin-up and delete steps, plus a few bucks in charges.

    Other than the emulator not having HBase, I'm not aware of any limitations... yet...

    The emulator was very easy to install, very "hands off" and worked the first time.  

    Another option is to install the Hortonworks Sandbox linked above. I haven't installed it yet since the HDInsight Emulator is working fine for me. Hortonworks has plenty of docs and forum support for the sandbox.
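To get a rough feel for why deleting an idle Azure cluster matters, here is a minimal Python sketch of the arithmetic. The node count and hourly rate below are made-up placeholders, not real Azure prices, and the function names are my own; check the Azure pricing page for actual numbers.

```python
# Rough cost comparison: short tutorial sessions vs. a cluster left running.
# All rates and sizes here are hypothetical placeholders, not Azure pricing.

def session_cost(nodes, hourly_rate_per_node, hours):
    """Cost of one spin-up / tutorial / delete session."""
    return nodes * hourly_rate_per_node * hours


def total_session_cost(sessions, nodes, hourly_rate_per_node, hours_per_session):
    """Cost of several sessions where the cluster is deleted after each one."""
    return sessions * session_cost(nodes, hourly_rate_per_node, hours_per_session)


def always_on_cost(nodes, hourly_rate_per_node, days):
    """Cost of forgetting to delete the cluster for a number of days."""
    return nodes * hourly_rate_per_node * 24 * days


if __name__ == "__main__":
    nodes = 4    # hypothetical cluster size
    rate = 0.5   # hypothetical price per node per hour
    print("10 two-hour sessions:", total_session_cost(10, nodes, rate, 2))
    print("left running 30 days:", always_on_cost(nodes, rate, 30))
```

With a local emulator or sandbox both of these numbers drop to zero, which is the main reason I'd start there.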