Exploring the Hive IMPORT and EXPORT Commands

In a recent post, I mentioned I am working with a customer to migrate from a native Hortonworks (HDP) cluster to Azure HDInsight. As part of this work, I’ve had need to use the Hive IMPORT and EXPORT commands as part of the migration of tables from one cluster to another.

The Hive IMPORT and EXPORT commands aren’t terribly difficult to use, but still, I’ve found their documentation to be a bit sparse, leaving me to figure a few things out by trial and error. In addition, I’ve found that the use of the WASB file system, the default file system for Azure HDInsight, changes some of the commands’ behaviors.

In this post, I’ll document the basics of the commands, assuming the use of the HDFS file system, and then address some of the differences that arise when used with WASB. Be before doing this, I’ll introduce a few tables which I’ll use to illustrate some of the commands’ functionality.

 

The Sample Hive Tables

To illustrate some differences in behavior between managed and external as well as partitioned and unpartitioned tables, I’ll create and populate four simple Hive tables, representing a combination of each of these elements:

CREATE TABLE ManagedUnpartitioned ( column1 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/demo/managedunpartitioned/';CREATE TABLE ManagedPartitioned ( column1 STRING ) PARTITIONED BY (partition1 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/demo/managedpartitioned/';CREATE EXTERNAL TABLE ExternalUnpartitioned ( column1 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/demo/externalunpartitioned/';CREATE EXTERNAL TABLE ExternalPartitioned ( column1 STRING ) PARTITIONED BY (partition1 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/demo/externalpartitioned/';INSERT INTO TABLE ManagedUnpartitioned VALUES ('this is a test'), ('this is also a test');INSERT INTO TABLE ManagedPartitioned PARTITION(partition1='A') VALUES ('this is a test'); INSERT INTO table ManagedPartitioned PARTITION(partition1='B') VALUES ('this is also a test');INSERT INTO TABLE ExternalUnpartitioned VALUES ('this is a test'), ('this is also a test');INSERT INTO TABLE ExternalPartitioned PARTITION(partition1='A') VALUES ('this is a test'); INSERT INTO TABLE ExternalPartitioned PARTITION(partition1='B') VALUES ('this is also a test');

NOTE The code above should be executed as a block. Subsequent code samples include redundant statements that if executed as a block will produce an error. The redundant statements are provided below to illustrate supported variations in command syntax.

 

The Hive EXPORT Command

The Hive EXPORT command is relatively simple. The basic form is:

EXPORT TABLE tablename TO 'targeted export location';

Under the directory identified as the targeted export location, the export command will create a _metadata file containing table definition information. If the table is unpartitioned, a subdirectory named data will be created under the target export location and within it will be copied the data for the table. If the table is partitioned, then a subdirectory for each partition is created under the targeted export location to hold that partition’s data.

When submitting the command against a cluster using HDFS as its default file system, the syntax is very straightforward. You can use a fully-qualified targeted export location or a relative location path. The targeted export location does not need to be created in advance, but if it has, you only need to ensure the directory to which you are exporting data is empty. Otherwise, you will receive an error message specifying this as the problem.

Here are examples of the EXPORT command for each of the four sample tables on a cluster using HDFS as its default file system. Please note that the elements in the fully-qualified path will vary with the configuration of your cluster:

EXPORT TABLE ManagedUnpartitioned TO '/export/managedunpartitioned'; EXPORT TABLE ManagedUnpartitioned TO 'hdfs://master3:8020/export/managedunpartitioned';EXPORT TABLE ManagedPartitioned TO '/export/managedpartitioned'; EXPORT TABLE ManagedPartitioned TO 'hdfs://master3:8020/export/managedpartitioned';EXPORT TABLE ExternalUnpartitioned TO '/export/externalunpartitioned'; EXPORT TABLE ExternalUnpartitioned TO 'hdfs://master3:8020/export/externalunpartitioned';EXPORT TABLE ExternalPartitioned TO '/export/externalpartitioned'; EXPORT TABLE ExternalPartitioned TO 'hdfs://master3:8020/export/externalpartitioned';

When working with a partitioned table, you have the option of exporting a single partition by adding a PARTITION clause to the EXPORT command. Here is each partition of the two partitioned tables being exported individually (using relative targeted location paths for brevity):

EXPORT TABLE ManagedPartitioned PARTITION (partition1='A') TO '/export/managedpartitioned/a'; EXPORT TABLE ManagedPartitioned PARTITION (partition1='B') TO '/export/managedpartitioned/b';EXPORT TABLE ExternalPartitioned PARTITION (partition1='A') TO '/export/externalpartitioned/a'; EXPORT TABLE ExternalPartitioned PARTITION (partition1='B') TO '/export/externalpartitioned/b';

As before, the targeted export location should either not exist or should be an empty folder. Under this location, the command will create a _metadata file as well as a subdirectory for the targeted partition’s data. If individually exporting multiple partitions like this, it’s important that each partition have its own, unique targeted export location. On the import, you’ll need to bring these partitions into the target table one at a time.

 

The Hive IMPORT Command

The basic form of the Hive IMPORT command identifies the target table and the directory containing the _metadata file and data subdirectories that will be used with the operation. This is the form of the command used when importing a table that has been exported in its entirety, i.e. not one partition at a time. The optional EXTERNAL clause indicates the newly created table should be created as an external table as opposed to a managed table:

IMPORT TABLE tablename FROM 'targeted import location'; IMPORT EXTERNAL TABLE tablename FROM 'targeted import location';

A few points about the IMPORT command as presented here:

  • The table being created should not already exist as this command will attempt to create a new table using information in the _metadata file.
  • The name of the table being created does not need to match the name of the original (exported) table.
  • The table being created can be either managed or external, regardless of the status of the original table.

With that in mind, here are a few examples of the command illustrating some of its possible variations:

-- original export command -- EXPORT TABLE ManagedPartitioned TO '/export/managedpartitioned';IMPORT TABLE managedpartitioned_imported FROM '/export/managedpartitioned';IMPORT EXTERNAL TABLE managedpartitioned_imported FROM '/export/managedpartitioned';IMPORT TABLE managedpartitioned_imported FROM 'hdfs://master3:8020/export/managedpartitioned';IMPORT EXTERNAL TABLE managedpartitioned_imported FROM 'hdfs://master3:8020/export/managedpartitioned';

To import individual partitions, a PARTITION clause can be added as demonstrated earlier. Again, here are a few examples of a partitioned table being imported by partition (using a relative path for the targeted import location for brevity):

-- original export commands -- EXPORT TABLE ManagedPartitioned PARTITION (partition1='A') TO '/export/managedpartitioned/a'; -- EXPORT TABLE ManagedPartitioned PARTITION (partition1='B') TO '/export/managedpartitioned/b';IMPORT TABLE managedpartitioned_imported PARTITION (partition1='A') FROM '/export/managedpartitioned/a'; IMPORT TABLE managedpartitioned_imported PARTITION (partition1='B') FROM '/export/managedpartitioned/b';IMPORT EXTERNAL TABLE managedpartitioned_imported PARTITION (partition1='A') FROM '/export/managedpartitioned/a'; IMPORT EXTERNAL TABLE managedpartitioned_imported PARTITION (partition1='B') FROM '/export/managedpartitioned/b';

Unlike with the previous IMPORT statements, when importing a partition, the targeted table may already exist. IMPORT will throw an error if the targeted partition already exists in the targeted table or if the partition value specified with the PARTITION clause does not match the partition values of the original table as defined in the _metadata file. For example, if you attempt to import the partition with the partition1 value of ‘A’ to the partition with partition1 value of ‘B’ or a previously unknown partition with partition1 value of ‘C’, the IMPORT command will return an error.

Also interesting to note is that if you import the first partition as EXTERNAL, you can omit the EXTERNAL keyword on subsequent partition imports to that table and all partitions will be handled as part of an EXTERNAL table. However, if you omit the EXTERNAL keyword on the first partition import, then attempting to specify EXTERNAL with subsequent partition imports will generate an error complaining that the table is defined as managed.

Last thing to note about the IMPORT statement is that it allows you to copy data from the targeted import location to a new location which will serve as the go-forward path for the table. This is accomplished using the LOCATION clause:

IMPORT TABLE managedpartitioned_importedrelo FROM '/export/managedpartitioned' LOCATION '/relo/managedpartitioned';IMPORT EXTERNAL TABLE managedpartitioned_importedrelo FROM '/export/managedpartitioned' LOCATION '/relo/managedpartitioned';

As with the previous statements, the path associated with the LOCATION clause should not exist prior to the import (or if it does, it should be empty).

When using LOCATION with a partition import, the path given should reference the path for the table and not the partition itself. The IMPORT command will create the partition folder automatically under this path:

IMPORT TABLE managedpartitioned_importedrelo PARTITION (partition1='A') FROM '/export/managedpartitioned/a' LOCATION '/relo/managedpartitioned_importedrelo'; IMPORT TABLE managedpartitioned_importedrelo PARTITION (partition1='B') FROM '/export/managedpartitioned/b' LOCATION '/relo/managedpartitioned_importedrelo';

Hive IMPORT and EXPORT with the WASB File System

In the previous examples, directory path information could be provided using either a relative path or a fully-qualified path. When using the WASB file system, you must provide paths to the IMPORT and EXPORT command using the fully-qualified syntax only. If you do not, the Hive IMPORT and EXPORT commands will hang. Assuming you are using an Azure Storage account named mystorageaccount and a container named mycontainer for your WASB file system, a proper EXPORT and IMPORT command would look like this:

EXPORT TABLE ManagedUnpartitioned TO 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/export/managedunpartitioned';IMPORT TABLE ManagedUnpartitioned_imported FROM 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/export/managedunpartitioned';

Surprisingly, when performing an IMPORT with the LOCATION clause, the path associated with the LOCATION clause does not need to be fully-qualified:

IMPORT TABLE ManagedUnpartitioned_imported FROM 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/export/managedunpartitioned' LOCATION '/relo/managedunpartitioned_importedrelo';