Analyze semi-structured data using HDInsight


Why should you care?

Recent Gartner reports suggest that 85% of modern data is generated in new data types, those that were traditionally not stored in data warehouses. By 2015, organizations integrating high-value, diverse, new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20%. Organizations today are ramping up their IT infrastructure and skills to take on this challenge. Big Data Centers of Excellence (COEs) are being formed to design, architect, and implement such solutions.

To illustrate why this matters, consider how organizations today can analyze the huge pile of emails sitting on their Exchange servers to find out how teams collaborate with each other, if they collaborate at all. You would agree that most projects in an organization fail due to poor or missing communication. There are numerous other examples: sentiment analysis for marketing and political campaigns; click-stream analysis for digital marketing, customer behavior, personalized advertising, targeted surveys, and so on. The opportunities for organizations to innovate are endless.


The following example shows how IT communities are connected in different parts of the country. Communities that are closely connected collaborate better. This in some manner influences how skilled the resources are. Most recruitment companies today do this kind of analysis.


The data shown below is extracted from GitHub, a social networking site for open source code. The network graph shows that Bangalore is much better connected than Delhi/NCR or Pune.

Also, Bangalore and Singapore are somewhat similar, and both have fairly close-knit IT communities.
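One simple way to put a number on how "well connected" a community is would be graph density: the share of possible links between members that actually exist. The sketch below is only an illustration of that idea, not the method used to produce the graph above. The `density` helper and the tiny edge lists (`city_a`, `city_b`) are hypothetical and are not real GitHub data.

```python
def density(edges):
    """Return the fraction of possible undirected links that are present.

    `edges` is a list of (member, member) pairs. Duplicate pairs and
    self-links are ignored.
    """
    nodes = {member for edge in edges for member in edge}
    unique_links = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    possible_links = len(nodes) * (len(nodes) - 1) / 2
    return len(unique_links) / possible_links if possible_links else 0.0


# Hypothetical toy data: each pair means two developers follow or
# collaborate with each other. These are made-up values for illustration.
city_a = [("dev1", "dev2"), ("dev1", "dev3"), ("dev2", "dev3"),
          ("dev3", "dev4"), ("dev1", "dev4")]
city_b = [("dev5", "dev6"), ("dev6", "dev7"), ("dev7", "dev8")]

print("city_a density:", density(city_a))
print("city_b density:", density(city_b))
```

A higher density means a larger share of the members are directly linked, which is one rough proxy for a closely knit community. On real data you would likely want a graph library and richer measures, but the comparison works the same way.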




How do you build these?

HDInsight is Microsoft’s implementation of Hadoop in the cloud: a 100% Apache Hadoop service that you can spin up in just a few minutes. HDInsight offers enterprise-class security, scalability, and manageability. Thanks to a dedicated Secure Node, HDInsight helps you secure your Hadoop cluster. You can also take full advantage of the elastic scalability of Windows Azure. In addition, HDInsight simplifies manageability of your Hadoop cluster through extensive support for PowerShell scripting.

Analysis is seldom done in isolation. It is almost always done in conjunction with structured data. How do you combine the two without transporting either the results or the raw data? Microsoft Power Query and PolyBase are innovative, next-generation BI tools that allow organizations to mash up semi-structured, unstructured, and structured datasets into a single unified data model. Power Query can connect directly to the Hadoop Distributed File System (HDFS) running on-premises or on Azure (HDInsight). PolyBase, in its v1 incarnation, can connect to HDFS running on-premises. This truly redefines Business Analytics in the modern data warehouse.



