In my previous post I described Hive at a high level, the Hive query language and how Hive works with map-reduce.
In this post we'll create a hive table, get a data file and query the data in that file. These steps will also demonstrate schema on read. In addition, we'll create a view. Lastly, we'll delete that table.
1) Create an HDInsight cluster
- After signing into Microsoft Azure choose HDInsight -> Create an HDInsight Cluster -> Hadoop.
- Give your cluster a name for *.azurehdinsight.net.
- A single data node cluster is fine for testing purposes.
- In that process you'll be asked to associate your Hadoop cluster with a storage account. This is necessary so that your data can live even after your HDInsight cluster has been deleted.
- After you choose "Create HDInsight Cluster" it will take a few minutes to create your cluster.
You can use the sources below to get more details on creating an HDInsight cluster:
- Getting started with Microsoft Azure HDinsight.
- HDInsight background info and creating your first Hadoop cluster with HDInsight in minutes.
- Create an Azure storage account.
2) Access your HDInsight cluster
To access your HDInsight cluster you can use the query console or enable remote access. The remaining steps below assume you are using remote access to your HDInsight cluster.
I noticed that each HDInsight cluster has Hive samples under \apps\dist\examples\hive. However, I also noticed that the Azure Emulator has different samples.
- Once you gain access to your cluster, on the desktop open the Hadoop command prompt and enter cd %hive_home%\bin which will take you to \apps\dist\hive-<version>\bin.
- Run "hive" at the command prompt.
- You should see the hive command prompt as "hive>"
3) Execute the following to create and query a Hive table
Note, there are other ways to run Hive queries. I only describe one option below.
Also, ensure that you have a semicolon at the end of each Hive statement.
-- At the hive prompt run the following to create a table called "HiveSampleTable":
CREATE TABLE IF NOT EXISTS HiveSampleTable (
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- The statement above will create an INTERNAL table by default. This blog post gives an in-depth description of internal vs. external hive tables, how they function and when to use each.
-- Run the following at the hive prompt. This command will get the file "HiveSampleData.txt" from your local file system and place it into the default hive directory at /hive/warehouse.
LOAD DATA LOCAL INPATH '\Apps\dist\examples\hive\HiveSampleData.txt' OVERWRITE INTO TABLE HiveSampleTable;
-- Get some data to ensure it was created successfully.
SELECT * FROM HiveSampleTable LIMIT 10;
Note: What you just witnessed is an example of schema on read. The schema is applied when the data is read from the data store (in this case HDFS). When we ran the select query above it applied the previously defined "HiveSampleTable" schema to the data file "HiveSampleData.txt" and showed the results. Alternatively, you could collect additional data file(s) and define one or more schemas later. You can also change your schema without affecting the data files. In addition, here is a great explanation of schema on read.
-- See how many rows we have in the HiveampleTable overall.
SELECT COUNT(*) FROM HiveSampleTable;
-- Towards the bottom of the command window you'll see something like the following, "59793" is the answer we're looking for.
Total MapReduce CPU
Time Spent: 5 seconds 483 msec
Time taken: 33.338
seconds, Fetched: 1 row(s)
-- Run the following to see the field names and field types of a table.
-- Need more info about the table? Run:
Describe formatted HiveSampleTable;
-- Get a list of all table names
-- Count the number of rows where the state is Nevada. The result should be 732.
select count (*) from HiveSampleTable where state = 'Nevada';
-- You get the idea...
select DeviceMake, DeviceModel from HiveSampleTable where state = 'Nevada';
select DeviceMake, DeviceModel from HiveSampleTable where state = 'Nevada' order by DeviceMake;
-- Rather than just selecting rows let's total one of the columns called "QueryDwellTime". Dwell time is the length of a users stay within a particular cell. After running the SELECT below you should see "1.314205…" in the command window.
SELECT SUM(QueryDwellTime) FROM HiveSampleTable;
-- Let’s compare the average query dwell time from an iPhone 4.2.x and iPhone 4.3.x. You should see 38.92 and 54.04 respectively.
SELECT AVG(QueryDwellTime) FROM HiveSampleTable where devicemodel='iPhone 4.2.x';
SELECT AVG(QueryDwellTime) FROM HiveSampleTable where devicemodel='iPhone 4.3.x';
One of the benefits of using views is that they provide a level of abstraction from the underlying tables. For example, with views you can include a subset of data from one or more tables. Run the command below to create a view called "DeviceMake_View" for the devices used in Nevada.
Create VIEW DeviceMake_View AS select DeviceMake, DeviceModel from HiveSampleTable where state = 'Nevada';
-- Select some data using that view using the following queries.
Select * from DeviceMake_View where devicemake = 'Apple';
Select count (*) from devicemake_view where devicemake = 'Apple';
--- Towards the bottom of the command window you'll see something like the following, "242" is the answer we're looking for.
Total MapReduce CPU
Time Spent: 6 seconds 374 msec
Time taken: 34.235
seconds, Fetched: 1 row(s)
-- Finally, get rid of that view.
drop view devicemake_view;
-- Get rid of the table.
Drop table HiveSampleTable;
Be sure to delete your HDInsight cluster or else you'll be charged $$ for it. The data can still live in your storage account that was associated with the Hadoop cluster.
In my next post I'll provide a quick overview of HBase and some queries as well.