HDInsight: Hive Internal and External Tables Intro

Article
02/05/2013

Small Bites of Big Data

Cindy Gross, SQLCAT PM

HDInsight is Microsoft's distribution, in partnership with Hortonworks, of Hadoop. Hive is the component of the Hadoop ecosystem that imposes structure on Hadoop data in a way that makes it usable from BI tools that expect rows and columns with defined data types. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
You want to use a custom location such as ASV.
Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).

Use INTERNAL tables when:

The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.

We'll walk through creating basic tables with a few rows of data so you can see some of the differences between EXTERNAL and INTERNAL tables. The demo data files are attached at the bottom of the blog. Alternatively you can simply open notepad and create your own files with a series of single column rows. If you create your own files make sure you have a carriage return/line feed at the end of all rows including the last one. The files should be in a Windows directory called c:\data on the HDInsight Head Node. For HDInsight Server (on-premises) that's the machine where you ran setup. For HDInsight Services (Azure) you can create a Remote Desktop connection (RDP) to the head node from the Hadoop portal.

Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

Create an HDInsight cluster. You can do this on your own Windows machine by installing HDInsight Server or by signing up for HDInsight Services on Azure. For the CTP of HDInsight Services as of February 2013 you fill out a form to request access and receive access within a few days. Soon the service will be available from the Azure portal via your Azure subscription. Since the portal interface will be changing soon and all the commands are straightforward I will show you how to do all the steps through the Hive CLI (command line interface).

Open a Hadoop Command Prompt:

Change to the Hive directory (necessary in early preview builds of Hive):

cd %hive_home%\bin

Load some data (hadoop file system put) and then verify it loaded (hadoop file system list recursively):

hadoop fs -put c:\data\bacon.txt /user/demo/food/bacon.txt

hadoop fs -lsr /user/demo/food

The put command doesn't return a result, the list command returns one row per file or subdirectory/file:

-rw-r--r-- 1 cgross supergroup 124 2013-02-05 22:41 /user/demo/food/bacon.txt

Enter the Hive CLI (command line interface):

hive

Tell Hive to show the column names above the results (all Hive commands require a semi-colon as a terminator, no result is returned from this set command):

Set hive.cli.print.header=true;

Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:

CREATE INTERNAL TABLE internal1 (col1 string) LOCATION '/user/demo/food';

Oops... that failed because INTERNAL isn't a keyword, the absence of EXTERNAL makes it a managed, or internal, table.

FAILED: Parse Error: line 1:7 Failed to recognize predicate 'INTERNAL'.

So let's create it without the invalid INTERNAL keyword. Normally we would let an INTERNAL table default to the default location of /hive/warehouse but it is possible to specify a particular directory:

CREATE TABLE internal1 (col1 string) LOCATION '/user/demo/food';

That will return the time taken but no other result. Now let's look at the schema that was created:. Note that the table type is MANAGED_TABLE.

DESCRIBE FORMATTED internal1;

col_name data_type comment
# col_name data_type comment

col1 string None

# Detailed Table Information
Database: default
Owner: cgross
CreateTime: Tue Feb 05 22:45:57 PST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:8020/user/demo/food Table Type: MANAGED_TABLE Table Parameters:
transient_lastDdlTime 1360133157

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1

And now look at some rows:

SELECT * FROM internal1;

col1
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon

What happens if we don't specify a directory for an INTERNAL table?

CREATE TABLE internaldefault (col1 string);

It is created in the default Hive directory, which by default is in /hive/warehouse (dfs shells back out to Hadoop fs):

dfs -lsr /hive/warehouse;

We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table Hive would put it in this directory: drwxr-xr-x - cgross supergroup 0 2013-02-05 22:52 /hive/warehouse/internaldefault

However, we won't use this table for the rest of the demo so let's drop it to avoid confusion. The drop also removes the subdirectory.

DROP TABLE internaldefault;

dfs -lsr /hive/warehouse;

Once we dropped the internaldefault table the directory that Hive created was automatically cleaned up. Now let's add a 2nd file to the first internal table and check that it exists:

dfs -put c:\data\bacon2.txt /user/demo/food/bacon2.txt;

dfs -lsr /user/demo/food;

-rw-r--r-- 1 cgross supergroup 124 2013-02-05 23:04 /user/demo/food/bacon.txt
-rw-r--r-- 1 cgross supergroup 31 2013-02-05 23:03 /user/demo/food/bacon2.txt

Since the CREATE TABLE statement points to a directory rather than a single file any new files added to the directory are immediately visible (remember that the column name col1 is only showing up because we enabled showing headers in the output - there is no row value of col1 in the data as headers are not generally included in Hadoop data):

SELECT * FROM internal1;

col1
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn

Now let's create an EXTERNAL table that points to the same directory and look at the schema:

CREATE EXTERNAL TABLE external1 (colE1 string) LOCATION '/user/demo/food';

DESCRIBE FORMATTED external1;

col_name data_type comment
# col_name data_type comment

cole1 string None

# Detailed Table Information
Database: default
Owner: cgross
CreateTime: Tue Feb 05 23:07:12 PST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:8020/user/demo/food
Table Type: EXTERNAL_TABLE Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1360134432

This time the table type is EXTERNAL_TABLE. You can see that the location was expanded to include the default settings which in this case are the localhost machine using the default HDFS (as opposed to ASV or Azure Storage Vault).

Now look at the data:

SELECT * FROM external1;

The result set is a combination of the two bacon files:

HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn

That table returns the same data as the first table - we have two tables pointing at the same data set! We can add another one if we want:

CREATE EXTERNAL TABLE external2 (colE2 string) LOCATION '/user/demo/food';

DESCRIBE FORMATTED external2;

SELECT * FROM external2;

You may create multiple tables for the same data set if you are experimenting with various structures/schemas.

Add another data file to the same directory and see how it's visible to all the tables that point to that directory:

dfs -put c:\data\veggies.txt /user/demo/food/veggies.txt;

SELECT * FROM internal1;

SELECT * FROM external1;

SELECT * FROM external2;

Each table will return the same results:

HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456

Now drop the INTERNAL table and then look at the data from the EXTERNAL tables which now return only the column name:

DROP TABLE internal1;

SELECT * FROM external1;

SELECT * FROM external2;

dfs -lsr /user/demo/food;

Result: lsr: Cannot access /user/demo/food: No such file or directory.

Because the INTERNAL (managed) table is under Hive's control, when the INTERNAL table was dropped it removed the underlying data. The other tables that point to that same data now return no rows even though they still exist!

Clean up the demo tables and directory:

DROP TABLE external1;

DROP TABLE external2;

exit;

This should give you a very introductory level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the TAP and CTP builds may change rapidly.

bacon.zip

HDInsight: Hive Internal and External Tables Intro

Small Bites of Big Data

Cindy Gross, SQLCAT PM

Additional resources