An overview of the Hadoop Distributed File System, HCatalog, Hive, and map-reduce.

Yet another step in my journey learning about big data. Below are some things I've learned along the way.

The image here provides a high-level overview of Hadoop. (I'm still trying to find a more detailed architecture diagram that I can freely re-use).

After setting up an HDInsight cluster on Azure, I dove in and wrote some Hive commands. However, before going too far there are a few key pieces to understand, namely Hive, map-reduce, the Hadoop Distributed File System, and HCatalog.

In summary:  

  • Hive is built on Hadoop. Hive provides a SQL-like query language, which I've seen referred to as both HQL and HiveQL. That language is not the same as SQL, but it looks very similar, as I'll show in a later post.

Hive functions differently than a relational database management system. Hive is an abstraction layer on top of map-reduce that turns your HQL queries into map-reduce jobs. "So... why not just write map-reduce jobs?" you might ask. Well, you could, but you'd need to write the actual code in Java, C#, or another language rather than just writing SQL-like statements. I'm sure there are times when an HQL query would suffice and other times when an actual map-reduce job would need to be written.

Also, there are many details behind Hive and HQL that I won't go into here. Hortonworks University has a really good introductory tutorial on Hive.
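To give a feel for the syntax, here's a minimal HQL sketch. The weblogs table and its columns are made up for illustration (I'll walk through a real example in a later post):

    -- Define a table over tab-delimited text files in HDFS
    CREATE TABLE weblogs (
      ip    STRING,
      url   STRING,
      hits  INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- A SQL-like query; behind the scenes Hive turns this into a map-reduce job
    SELECT url, SUM(hits) AS total_hits
    FROM weblogs
    GROUP BY url
    ORDER BY total_hits DESC;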

  • The Hadoop Distributed File System (HDFS) is a Java-based, distributed, fault-tolerant file system. HDFS includes a master node, which tracks the file system metadata, and data nodes, which store the actual data.

Since HDFS is a file system, you will find the commands you would expect: copying files, creating subdirectories, deleting files, and so on. The HDFS commands I used most frequently from the Hadoop command line are:

    • hadoop dfs -mkdir HDFS_destination_folder --> Creates an HDFS subdirectory named "HDFS_destination_folder"

    • hadoop dfs -put c:\local_subdir\filename.txt HDFS_destination_folder --> Copies the local file filename.txt into HDFS_destination_folder

    • hadoop dfs -ls HDFS_destination_folder --> Lists all of the files in HDFS_destination_folder

    • hadoop dfs -get /HDFS_destination_folder/file.txt c:\local_subdir --> Gets file.txt from HDFS and places it in your local c:\local_subdir

    • A full list of commands can be found by entering "hadoop dfs -help". In addition, this seems to be a good summary of commands.
  • Map-reduce utilizes HDFS. For example, when an HQL query is submitted, the master node uses a "map" process to assign parts of that query to the data nodes. Each data node executes its assigned portion against the data it holds and returns its individual result, and those results are then aggregated, or "reduced", into the cohesive result returned to the user who executed the HQL query. (The EXPLAIN sketch after this list shows the translation from HQL to map-reduce.)
  • HCatalog provides a table abstraction over the files in HDFS. Hive, map-reduce, and other parts of Hadoop use HCatalog to simplify reading and writing HDFS data, referring to tables by name rather than by file path and format. I found this to be a good tutorial on HCatalog; there's also a small sketch below.
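If you want to see the map-reduce translation for yourself, Hive's EXPLAIN statement prints a query's execution plan, including its map and reduce stages, without actually running it. A minimal sketch, reusing the hypothetical weblogs table from above:

    -- Show the execution plan (map and reduce stages) instead of running the query
    EXPLAIN
    SELECT url, SUM(hits) AS total_hits
    FROM weblogs
    GROUP BY url;

The plan output includes a map operator tree and a reduce operator tree, which line up with the assign-then-aggregate flow described above.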
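To make the HCatalog point concrete: a table defined once in HQL is registered in the shared metastore, so HCatalog-aware tools can refer to it by name. A minimal sketch, with a made-up page_views table:

    -- Define the table once; the schema and storage details are recorded in the metastore
    CREATE TABLE page_views (user_ip STRING, url STRING)
    PARTITIONED BY (dt STRING)
    STORED AS TEXTFILE;

    -- Tools that speak HCatalog (e.g., Pig via HCatLoader, map-reduce via
    -- HCatInputFormat) can now read page_views by name, without knowing
    -- the underlying file paths or format
    DESCRIBE page_views;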

I'm skipping over many other parts of Hadoop, mainly because I haven't used them yet. Realistically, you may not need to use everything in the Hadoop toolset. Some of the other parts include Pig, Sqoop, etc. This is a good description of each of the Hadoop tools.

In the next post I'll import a file into a Hive table and create some queries.