24HOP/SQLRally - Fitting Microsoft Hadoop Into Your Enterprise BI Strategy

Small Bites of Big Data

Cindy Gross, SQLCAT PM

The world of #bigdata and in particular #Hadoop is going mainstream. At 24HOP 2012 I talked about how a SQL Server professional fits into this big data world.

Hadoop generally falls into the NoSQL realm. NoSQL systems use basically available, soft-state, eventual consistency (BASE) instead of requiring immediate, 100% consistency via ACID properties. This allows certain types of queries and operations to be much faster since locking and consistency rules don't have to be followed, and it works well when a given program or query scans the entire dataset rather than filtering it or querying a subset of it. Basically, with Hadoop you take data that may not have a known, consistent structure and impose a structure on it only for the duration of your MapReduce program (query). This is all based on the concept of scaling out rather than up: new commodity (low-end enterprise server) hardware is added to an existing Hadoop cluster to meet dynamic demands. With Hadoop you spread hardware costs out over a longer time, possibly pay less overall for hardware, and shift your IT costs to different departments and parts of the lifecycle.
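
To make that "structure only for the duration of the job" idea concrete, here is a minimal sketch (not from the talk) of a MapReduce program written against Hadoop's native Java API. The tab-delimited web log layout and the class and field names are made up for illustration; the point is that the raw lines only get a schema inside the mapper, while the job runs.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageHitCount {

      // Map: the raw log line only gets a "schema" here, while the job runs.
      // Hypothetical tab-delimited layout: timestamp, userId, page, httpStatus.
      public static class PageMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\t");
          if (fields.length >= 3) {            // skip malformed lines instead of failing
            page.set(fields[2]);
            context.write(page, ONE);          // emit (page, 1) for each hit
          }
        }
      }

      // Reduce: sum the 1s emitted for each page across all map tasks.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text page, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int total = 0;
          for (IntWritable c : counts) total += c.get();
          context.write(page, new IntWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "page hit count");
        job.setJarByClass(PageHitCount.class);
        job.setMapperClass(PageMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }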

Hive is a database that sits on top of Hadoop's HDFS (Hadoop Distributed File System). The data stays in HDFS, but Hive stores metadata about the imposed structure and may store intermediate files. HiveQL looks a lot like T-SQL and, like most SQL dialects, makes an effort to align with the SQL standard when it makes sense. A HiveQL query is compiled into (usually multiple) MapReduce jobs, which makes filters, joins, and aggregates over a large dataset much easier to express. It will still be slower than a relational database for most filters and joins, but it gives you access to data that may be too expensive (for whatever reason) to put into a relational or multidimensional database at this point in the lifecycle.
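
As a rough sketch of what that looks like, the HiveQL below defines an external table over raw files already sitting in HDFS and then runs a familiar-looking aggregate over it. The table, columns, and HDFS path are made up for illustration.

    -- Hypothetical example: expose raw tab-delimited web logs already in HDFS
    -- as a Hive table (the data stays in place; only metadata is stored).
    CREATE EXTERNAL TABLE weblogs (
      log_time    STRING,
      user_id     STRING,
      page        STRING,
      http_status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/user/demo/weblogs';

    -- Looks like T-SQL, but Hive compiles it into one or more MapReduce jobs.
    SELECT page, COUNT(*) AS hits
    FROM weblogs
    WHERE http_status = 200
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10;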

You may keep your source data outside HDFS and bring it in only for the duration of a project. You can Sqoop data to/from a relational database (there is a Sqoop connector for SQL Server and PDW), pull data from blob stores like Windows Azure storage and Amazon S3, use secure FTP, or query OData sources like the Windows Azure DataMarket. Going the other direction, the Hive ODBC driver (often via the Excel Hive Add-in) and Sqoop can be used to pull data from Hadoop into other tools and systems.
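
For a feel of the Sqoop side, here is a hedged sketch of an import from SQL Server into HDFS and an export back out. The server, database, table, and path names are placeholders, and the exact options depend on the Sqoop/connector version you have installed.

    # Pull a SQL Server table into an HDFS folder (names are illustrative)
    sqoop import \
      --connect "jdbc:sqlserver://sqlprod01;databaseName=Sales" \
      --username hadoop_reader -P \
      --table FactInternetSales \
      --target-dir /user/demo/factinternetsales \
      --num-mappers 4

    # Push summarized results back to a SQL Server table that already exists
    sqoop export \
      --connect "jdbc:sqlserver://sqlprod01;databaseName=Sales" \
      --username hadoop_reader -P \
      --table PageHitSummary \
      --export-dir /user/demo/pagehits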

So far I’ve been talking as if big data = Hadoop. However, that’s not really true. Big data is data that is too big, expensive, or complex to process in your existing environment (often traditional RDBMSs with expensive SANs and high-end hardware). While MapReduce, most commonly via Hadoop, is a very common solution, it’s not the only one. There are streaming technologies like StreamInsight and HStreaming, machine learning like Mahout, massively parallel processing databases like PDW, and more. Generally big data at least starts out as unstructured, semi-structured, or perhaps of variable structure. It may also flow very quickly, with large amounts of data to process in a very small time window where decisions have to be made in real time.

At its core Hadoop has the file system HDFS, which sits on top of the Windows or Linux file system and distributes data over many nodes in a Hadoop cluster. A head node (the name node) maintains metadata about where each piece of data resides. MapReduce programs “map” work out to the many nodes, then “reduce” the output based on the required logic, resulting in an output file that is the end result of the program/query. Other pieces of the Hadoop ecosystem include things like Hive, HBase, HCatalog, Lucene, Mahout, ZooKeeper, and more.
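
End to end, the day-to-day workflow looks roughly like the commands below: copy data into HDFS, run a MapReduce job (here, the hypothetical page-hit-count sketch from earlier), and read the reducer output files. The paths and jar name are illustrative.

    # Copy a local log file into HDFS
    hadoop fs -mkdir /user/demo/weblogs
    hadoop fs -put weblog_2012-05-01.log /user/demo/weblogs/

    # Run the MapReduce job; output lands in an HDFS folder
    hadoop jar pagehitcount.jar PageHitCount /user/demo/weblogs /user/demo/pagehits

    # Look at the reducer output file(s)
    hadoop fs -cat /user/demo/pagehits/part-r-00000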

So when would you use Hadoop? It’s good for exploring data and finding out what you don’t know. If you need to surface your unknown unknowns or look for unexpected correlations, Hadoop may be what you need. It lets you stay very flexible and not lock yourself into a rigid schema until after you’ve fully explored your options. Some common use cases are risk modeling, machine learning, cyber security, sentiment analysis, recommendation engines, log analysis, and sensor data. However, don’t think big data is the solution to all your problems. Often it is used to find and define the most useful bits of data or the most useful algorithms; then a more traditional design process may be used to create a relational or multidimensional system for faster querying during day-to-day operations. Think of Hadoop as the foundation for a very specialized type of database that meets very specific needs, not as a replacement for relational. It’s also important to note that not every VLDB is a fit for big data. “Big” is somewhat of a misnomer; size is only one of the characteristics, and it is relative to what your environment is prepared to handle.

When deciding whether to use Hadoop, you will often ask about the four “Vs”: volume, velocity, variety, and variability. Think about big data when you describe a problem with terms like tame the chaos, reduce the complexity, explore, I don’t know what I don’t know, unknown unknowns, unstructured, changing quickly, too much for what my environment can handle now, or unused data.

  • Volume = More data than the current environment can handle with vertical scaling, or the need to make use of data that is currently too expensive to use
  • Velocity = Small decision window compared to data change rate, ask how quickly you need to analyze and how quickly data arrives
  • Variety = Many different formats that are expensive to integrate, probably from many data sources/feeds
  • Variability = Many possible interpretations of the data

Microsoft is taking the existing Apache Hadoop code and making sure it runs on Windows. We have contributed HDFS, MapReduce, and Hive code back to the Apache open source community, and more is on the way. We are adding things like the Excel add-in for the Hive ODBC driver, JavaScript support, the cloud service (https://HadoopOnAzure.com), C# samples, etc. Where Microsoft adds value is in making Hadoop more approachable and allowing the reuse of existing skill sets. We offer self-service, visualization, easier data movement, elasticity, and familiarity for faster completion of projects. This allows data scientists (people who know the business and statistics first and tech second), data teams, information workers, and anyone seeking insights through BI to do their jobs better and faster.

We offer visualization through PowerPivot, Power View, and the Excel Hive ODBC Add-in. For many used to Hadoop on Linux these things are new and exciting, even though they seem familiar to many in the SQL Server world.

In the end, big data is really all about discovering insights and then taking action. If you don’t act on what you find, why did you spend all that effort mining the data? Sometimes the action is to go back and refine your process or change the way you looked at the data, but the end goal should be to make decisions that impact your business. Big data should add value to the business proposition.

References

My full presentation from #24HOP can be downloaded below; there are notes on many of the slides with further details. The recording from 24 Hours of PASS will be posted at https://www.sqlpass.org/LearningCenter/SessionRecordings/24HoursSpring2012.aspx as soon as it is available. Download their player, then choose the session by title - "Fitting Microsoft Hadoop into your Enterprise BI Strategy (Cindy Gross)".

I updated the attachment to include the updated slides from my May 11, 2012 talk at #SQLRally. There are few changes other than the background.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

MSHadoopEntBIStrategy24HOPandSQLRally2012.zip