Hadoop in the Cloud or: the evolution of Hadoop into an API, Capabilities and Components

Note: In an effort to share more often and react to feedback I might receive, I'm going to break this up into a series of posts rather than writing a book.

Part 1: An Introduction
Part 2: Capabilities and Components (you are here)
Part 3: Storage & Processing at Scale

Capabilities

Whenever I talk to a customer or look to solve a challenge, I think it's best to start with a discussion around capabilities. Capabilities speak to the actual needs and keep us from getting caught up in a technology discussion, or from trying to solve the challenge before we fully understand it. When we need to transform or process large amounts of data in preparation for business-centered objectives, knowing the capability needs is essential.

Transforming, querying, or processing large amounts of data means storing large amounts of data (we also need to know other characteristics like where the data is coming from, the nature of data movement, security requirements, etc., but we'll start with storing). Historically this was done in a file system, in some kind of database, or both. Storing data is only valuable if you can retrieve the data you need when you need it, so relational databases were created to make that easier. As data volumes grew, columnar-store databases and massively parallel processing databases emerged that eased some of the growing pains of changing data characteristics. All along, capabilities grew and changed, and new data stores arose, including distributed databases and distributed file systems. From a capability perspective, we still need to store (or write) data to a store, and we need to retrieve (or query or read) data from that store. As complexity grows and technologies change, these capabilities remain constant.

Another key capability is the retrieval and processing of data from a data store; with Hadoop that means MapReduce or YARN… or Hive, or Pig, or Drill, or Spark. The list keeps growing, but Spark is a general execution engine that has been so successful from an industry adoption point of view that it's a de facto standard. YARN is the back-end service that provides resources for processing, and in most cases a service like Spark will create a YARN application to get the work done in a distributed fashion, as in the sketch below. I like to think of YARN as an interface through which applications can be executed on a cluster of resources.
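
To make that concrete, here is a minimal PySpark sketch of what "Spark creates a YARN application" looks like from the application side. It assumes a cluster where the Hadoop/YARN client configuration is visible to Spark (for example, HADOOP_CONF_DIR is set); the application name, dataset path, and column name are hypothetical.

    # Minimal PySpark sketch: asking for "yarn" as the master is what turns
    # this script into a YARN application; YARN allocates the containers the
    # executors run in. Assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at the
    # cluster configuration. Paths and names are illustrative only.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("capability-demo")   # shows up in the YARN ResourceManager UI
             .master("yarn")               # Spark negotiates resources through YARN
             .getOrCreate())

    # Read from the distributed store and run a simple transformation; the
    # work is split across whatever executors YARN granted us.
    df = spark.read.parquet("hdfs:///data/events")   # hypothetical dataset
    df.groupBy("event_type").count().show()

    spark.stop()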

Other capabilities most organizations need include high availability and fault tolerance. I won't deep dive into them here, but they are just as important as massive scalability. This is not an exhaustive capability list.

Components

After understanding the capabilities needed to solve a challenge, we can start looking at ways to meet those capabilities via services or tools (only look for a hammer if you need the capability of driving a nail; it won't help much if you need to use glue). For example, when there is a need to store petabytes of data, one might consider the most simplistic route and build a computer/server that can store petabytes of data, maybe even something like the Titan in use at Oak Ridge National Laboratory, which can store 40 PB of data (which is good) but costs $97 million and takes up over 4,300 square feet of floor space. Most organizations don't have that kind of budget or don't want to give up that much floor space. Since more and more organizations see a need to store volumes of data closer to that size but don't have that kind of budget (remember, there are also people costs in maintaining an infrastructure like that) or expertise in systems of that scale, other options need to be explored.

There are other distributed file systems and distributed data stores that might make sense. Over the past decade Apache Hadoop HDFS has been the star of the distributed file system game, although others (Lustre, GlusterFS, etc.) are popular in some circles. HDFS made it possible to buy inexpensive server hardware and achieve capabilities like massive scale (many petabytes), high availability and fault tolerance, and more, all through the application layer (re-read that sentence and mentally highlight inexpensive hardware).
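
To illustrate what "capabilities through the application layer" can look like in practice, here is a small sketch of talking to HDFS from Python via PyArrow's HDFS binding. It assumes a reachable NameNode and a working Hadoop client/libhdfs installation on the machine running it; the host name and paths are hypothetical. Notice that replication, the basis of HDFS fault tolerance on inexpensive hardware, is just a parameter handled by the software.

    # Sketch only: requires pyarrow plus the Hadoop native client (libhdfs).
    # Host, port, and paths below are hypothetical.
    from pyarrow import fs

    # Connect to the NameNode. replication=3 asks HDFS to keep three copies of
    # every block on different DataNodes; that is the application-layer answer
    # to disks and servers failing.
    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020, replication=3)

    # Write a small file into the distributed store...
    with hdfs.open_output_stream("/demo/hello.txt") as out:
        out.write(b"hello from the application layer\n")

    # ...and read it back, much like a local file from the application's point of view.
    with hdfs.open_input_stream("/demo/hello.txt") as src:
        print(src.read().decode())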

In the next post, I'll go more in-depth on HDFS and YARN, how they meet the capability needs, and how those capabilities can be implemented in the cloud. As always, please share any thoughts or feedback below.