In my previous posts I discussed the concepts; now I am going to focus on the platform. Here are the platform options from Microsoft for handling big data.
1) HDInsight on-premises (Preview)
2) HDInsight service on cloud (Preview)
3) PDW v2.0
HDInsight is Microsoft's brand name for its Hadoop distribution. The on-premises version of HDInsight, which is a one-box solution, can be downloaded from here.
The Microsoft HDInsight service is the cloud-based offering. That means no worries about installation, configuration, deployment, or manageability, and it costs less. That’s the benefit of a cloud platform. As I explained in my previous post, Hadoop normally stores data on the data (slave) nodes; on HDInsight, however, the data is stored in Windows Azure Storage Blob. Refer to the diagram below…
So during query processing, if a compute node needs data, it goes to Windows Azure Storage Blob, which receives the request and returns the requested data. Windows Azure Storage Blob has its own data replication option, called “Enable Geo-Replication”. One benefit of storing data outside the cluster is that if the cluster goes down for any reason, the data is still available. The other benefit is the cost of storing data in blob storage. Does storing data outside the cluster mean that read/write performance suffers? Actually it does not, because of the high-speed network between the compute nodes and storage.
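To make the storage side a bit more concrete, here is a minimal Python sketch of how a path in blob storage is usually addressed from Hadoop on Azure, using the wasb:// URI scheme. The container, account, and file names are hypothetical, and a preview-era cluster may expect a different scheme, so treat this as an illustration rather than the exact form your cluster uses.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a wasb:// (or wasbs://) URI for a blob path.

    Assumed layout: wasb[s]://<container>@<account>.blob.core.windows.net/<path>
    """
    scheme = "wasbs" if secure else "wasb"
    # Strip a leading slash so the result never contains "//" after the host.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only.
print(wasb_uri("mycontainer", "mystorageaccount", "example/data/sample.log"))
```

A compute node resolves a URI like this against the storage account rather than against local disks, which is why the data survives even if the cluster itself is deleted.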
Pricing details for the HDInsight service (Preview) can be found here.
Now let’s get started with setting up the HDInsight cluster. First, I will set up the storage account. Go to the Azure portal and click New -> Storage -> Quick Create.
Give the storage account a name and choose a location. It’s up to the end user whether to select the “Enable Geo-Replication” option.
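The portal rejects storage account names that do not follow Azure's naming rules (3 to 24 characters, lowercase letters and digits only). As a small sketch, the check below lets you validate a name before typing it into the portal; the function names and the sample names are my own, not part of any Azure SDK.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if the name satisfies the storage account naming rules."""
    return _ACCOUNT_NAME_RE.fullmatch(name) is not None


def blob_endpoint(name: str) -> str:
    """Return the blob endpoint URL that a valid account name maps to."""
    if not is_valid_storage_account_name(name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return f"https://{name}.blob.core.windows.net"


# Hypothetical names, for illustration only.
for candidate in ("hdinsightdemo01", "HDInsight-Demo", "ab"):
    print(candidate, is_valid_storage_account_name(candidate))
```

The name also has to be unique across Azure, which only the portal can confirm.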
Next, click HDINSIGHT (PREVIEW) -> QUICK CREATE.
Give your cluster a name and choose the cluster size, that is, how many compute/data nodes are required.
Next, wait a while for the portal to set up the HDInsight cluster. Once the cluster is set up, you can browse its various options. Below is the dashboard, which gives information about active map/reduce tasks, usage, associated storage, etc.
The next screen is Monitor. It shows information about the map/reduce jobs run during a given time period.
The next screen is CONFIGURATION. By default, HDP 2.0 does not enable RDP to the head node, so before connecting over RDP, enable access through the configuration page.
*Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.