Last week I attended The 2012 Eduserv Symposium. The event was focused on 'Big Data' and discussed whether Big Data represents a challenge or an opportunity and how we can best make use of it. This year's event was held at the Royal College of Physicians.
The key message from the introduction was the importance of uncoupling the issues of Big Data from the industry hype. Throughout the event the Gartner 3 Vs (velocity, volume and variety of data) were discussed in depth.
Volume as a defining attribute of Big Data
Whilst it is fairly obvious that data volume is the primary attribute of big data, people often ask for a definitive quantity in GB, TB, PB etc. that would qualify as big data.
The simplest answer is to give a data volume, for example 50 TB, which today represents a reasonably large and expensive data warehouse – though this answer of course changes as the technology changes over time due to Moore's Law.
But it is also worth thinking about what it is you are treating as data. For instance, in a large library of photographs the data contained in the images themselves is very large: a RAW file from a Nikon D300 is about 25 MB, so a library of 2 million such images would be about 50 TB. The metadata describing those images, however, isn't that large – perhaps 2 GB.
So to someone actually searching the images for content, e.g. using facial recognition to find all the photos of Adele, that is a big data problem; but if the photos have already been labelled and tagged as being of Adele, then it isn't really a big data challenge, as you are only searching the metadata.
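The arithmetic behind the photo-library example is worth making explicit. A quick back-of-the-envelope sketch, using the figures quoted above (2 million RAW images at roughly 25 MB each; the 1 KB of metadata per image is my assumption to arrive at the ~2 GB total):

```python
# Back-of-the-envelope calculation for the photo-library example.
RAW_SIZE_MB = 25          # approx. size of one Nikon D300 RAW file
IMAGE_COUNT = 2_000_000
META_SIZE_KB = 1          # assumed per-image metadata (tags, labels)

image_data_tb = RAW_SIZE_MB * IMAGE_COUNT / 1_000_000   # MB -> TB
metadata_gb = META_SIZE_KB * IMAGE_COUNT / 1_000_000    # KB -> GB

print(f"Image data: ~{image_data_tb:.0f} TB")   # ~50 TB
print(f"Metadata:   ~{metadata_gb:.0f} GB")     # ~2 GB
```

A 25,000-fold difference in volume, which is why searching the tags is an everyday query while searching the pixels is a big data problem.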
Data Feed Velocity as a defining attribute of Big Data
Big data can be described by its velocity or speed. Or you may prefer to think of it as the frequency of data generation or frequency of data delivery. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. This isn’t new; many firms have been collecting click stream data off of Web sites for years, using streaming data to make purchase recommendations to Web visitors. With sensor and Web data flying at you relentlessly in real time, data volumes get big in a hurry.
Technologies such as Microsoft StreamInsight are well positioned for certain types of streaming data, whilst other applications may need specialist development or tools.
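To make the velocity point concrete, here is a minimal sketch (not tied to StreamInsight or any particular product) of the pattern streaming systems use: process each reading as it arrives, keeping only a small window in memory rather than storing the whole feed first. The sensor values and thresholds are invented for illustration.

```python
from collections import deque

def rolling_alerts(readings, window=5, threshold=3.0):
    """Flag readings that jump more than `threshold` degrees above
    the average of the previous `window` readings.

    Processes the stream one value at a time with bounded memory --
    the essence of handling high-velocity data."""
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            avg = sum(recent) / window
            if value - avg > threshold:
                alerts.append((i, value))
        recent.append(value)
    return alerts

# Simulated thermometer feed: steady around 20 degrees with one spike.
stream = [20.1, 19.9, 20.0, 20.2, 19.8, 20.1, 27.5, 20.0]
print(rolling_alerts(stream))  # the spike at index 6 is flagged
```

Real deployments differ in scale, not in shape: the same windowed computation runs continuously over millions of events per second.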
Data Feed Variety
A proliferation of data types from social, machine-to-machine, and mobile sources adds new data types to traditional transactional data. Data no longer fits into neat, easy-to-consume structures. New types include content, geospatial, hardware data points, location-based data, log data, machine data, metrics, mobile, physical data points, process, RFID, search, sentiment, streaming data, social, text, and web. The addition of unstructured data such as speech, text, and language increasingly complicates the ability to categorize data. Some technologies that deal with unstructured data include data mining, text analytics, and noisy text analytics.
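What text analytics does, at its simplest, is turn an unstructured blob into structured fields that can be queried. A toy sketch of the idea (the sentiment lexicon and field names here are invented for illustration, not any product's API):

```python
import re
from collections import Counter

POSITIVE = {"great", "love", "fast"}    # toy sentiment lexicon (assumed)
NEGATIVE = {"slow", "broken", "hate"}

def analyse(text):
    """Turn one piece of unstructured text into structured fields:
    a token count, the most frequent words, and a crude
    lexicon-based sentiment score."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    score = (sum(counts[w] for w in POSITIVE)
             - sum(counts[w] for w in NEGATIVE))
    return {"tokens": len(tokens),
            "top": counts.most_common(3),
            "sentiment": score}

print(analyse("Love the new phone, great camera, but the app is slow"))
```

Production systems replace the toy lexicon with trained models and handle the "noisy" part (typos, slang, mixed languages), but the structural move is the same: unstructured in, queryable rows out.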
A key message was that Big Data, and the role of the Data Scientist, is not limited to the computer scientist of the future, as Big Data is of use to all disciplines.
Please find below a quick overview deck of how we see the Big Data opportunity at Microsoft.
An interesting case study is Klout: Data Services Firm Uses Microsoft BI and Hadoop to Boost Insight into Big Data.
Klout wanted to give consumers, brands, and partners faster, more detailed insight into hundreds of terabytes of social-network data. It also wanted to boost efficiency. To do so, Klout deployed a business intelligence solution based on Microsoft SQL Server 2012 Enterprise and Apache Hadoop. As a result, Klout processes data queries in near real time, minimizes costs, boosts efficiency, increases insight, and facilitates innovation.
Klout helps clients make sense of the hundreds of terabytes of data generated each day by more than 1 billion signals on 15 leading social networks including Facebook and LinkedIn. The data that Klout analyzes is generated by the more than 100 million people who are indexed by the firm. This includes Klout members and the people that they interact with on social sites. Individuals join Klout to understand their influence on the web, which is rated on a scale from 1 to 100. They also sign up to participate in campaigns where they can receive gifts and free services. More than 3,500 data partners also join Klout to better understand consumers and network trends, including changes in demand and how people's influence might affect word-of-mouth advertising.
To deliver the level of insight that customers seek and yet meet the budget constraints of a startup firm, Klout maintained a custom infrastructure based on the open-source Apache Hadoop framework, which provides distributed processing of large data sets. The solution included a separate silo for the data from each social network. To manage queries, Klout used custom web services, each with distinct business logic, to extract data from the silos and deliver it as a data mashup.
Maintaining Hadoop and the custom web services to support business intelligence (BI) was complex and time-consuming for the team. The solution also hindered data insight. For example, accessing detailed information from Hadoop required extra development, and so mashups often lacked the level of detail that users sought. In addition, people often waited minutes, or sometimes hours, for queries to process, and they could only obtain information based on predetermined templates.
Klout wanted to update its infrastructure to speed efficiency and support custom BI. Engineers sought technologies that could deliver mission-critical availability and still scale to meet big-data growth and performance requirements.
In 2011, Klout decided to implement a BI solution based on Microsoft SQL Server 2012 Enterprise data management software and the open-source Hive data warehouse system. Based on employees’ previous experience with the Microsoft BI platform, Klout also knew that SQL Server offers excellent compatibility with third-party software and it can handle the data scale and query performance needed to manage big-data sets.
In August 2011, engineers implemented a data warehouse with Hive, which consolidates data from all of the network silos hosted by Hadoop. In addition, Klout deployed SQL Server 2012 on a system that runs the Windows Server 2008 R2 Enterprise operating system to take advantage of Microsoft SQL Server 2012 Analysis Services. Engineers use it to manage all business logic required to facilitate multidimensional online analytical processing (MOLAP). Data is stored in multidimensional cubes, which helps preserve detail and speed analysis. To provide high availability, Klout replicates the database to a secondary system using SQL Server 2012 AlwaysOn.
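The reason multidimensional cubes "preserve detail and speed analysis" is that aggregates across every combination of dimensions are computed up front, so queries become lookups rather than scans. A rough illustration of the idea (a toy sketch with invented figures, not how Analysis Services is actually implemented):

```python
from itertools import product
from collections import defaultdict

# Toy fact table: (network, day, signal_count) rows.
facts = [
    ("facebook", "2012-05-01", 120),
    ("facebook", "2012-05-02", 90),
    ("linkedin", "2012-05-01", 40),
]

# Pre-aggregate every combination of dimensions, with "*" meaning
# "all values" -- the core MOLAP idea: pay the cost once at load
# time, then answer any roll-up question with a single lookup.
cube = defaultdict(int)
for network, day, count in facts:
    for n, d in product((network, "*"), (day, "*")):
        cube[(n, d)] += count

print(cube[("facebook", "*")])   # 210 -- all days for facebook
print(cube[("*", "*")])          # 250 -- grand total
```

With real dimensions (network, member, day, signal type) the number of cells grows quickly, which is why MOLAP engines compress storage and choose which aggregations to materialise, but the query-time win is the same.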
At the time that Klout was initially deploying its solution, SQL Server 2012 and Hive could not communicate directly. To work around this issue, engineers set up a temporary relational database that runs MySQL 5.5 software. It includes data from the previous 30 days and serves as a staging area for data exchange and analysis. Klout engineers are currently working to implement the new open database connectivity (ODBC) driver in SQL Server 2012 to directly join Hive with SQL Server 2012 Analysis Services. In addition, to enhance insight Klout plans to work with Microsoft to incorporate other Microsoft BI tools into its solution, such as Microsoft SQL Server PowerPivot for Microsoft Excel.
With its new solution, Klout expects to boost efficiency, reduce expenses, expand insight, and support innovation.
Speeds Efficiency and Cuts Costs
By taking advantage of the Microsoft platform for BI, users will be able to get the data they seek in near real time.
Klout is implementing the flexible and scalable infrastructure it needs to continue to push the limits of data analysis.