Loading data into HBase tables on HDInsight using the built-in ImportTsv utility


Apache HBase provides random access to very large tables: billions of rows by millions of columns. But how do you upload that kind of data into HBase tables in the first place? HBase includes several methods of loading data into tables. The most straightforward is to use the TableOutputFormat class from a MapReduce job, or to use the normal client APIs; however, these are not always the most efficient methods.

Overview

HBase ships with a built-in ImportTsv utility, and in many cases it is much faster and easier to upload data into HBase with ImportTsv than with other methods. As the name suggests, ImportTsv loads data in TSV format into HBase. In a TSV file, each field value of a record is separated from the next by a tab character. However, the tool has an importtsv.separator option that lets you specify a different separator if the fields are delimited by something other than a tab, for example a pipe or a comma. ImportTsv has two distinct usages.

  1. Loading data from TSV format in HDFS into HBase via Puts (i.e., non-bulk loading)
  2. Preparing StoreFiles to be loaded via completebulkload (bulk loading)

If you don't have a huge amount of data, you can load it directly into HBase via Puts (#1). Bulk loading (#2), on the other hand, comes in handy when you have a large amount of data to upload; it is faster because it uses less CPU and network resources than going through the HBase API. However, keep in mind that bulk loading bypasses the write path: the Write Ahead Log (WAL) is not written as part of the process, which can cause issues for other processes such as replication. To find out more about HBase bulk loading, please review the Bulk Loading page in the Apache HBase reference guide. The HBase bulk load process consists of two main steps.

  1. The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat.
  2. After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload.
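
For reference, the general form of those two usages looks like the following (a sketch based on the Apache HBase ImportTsv documentation; the placeholders in angle brackets are mine):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<column mapping> <tablename> <input dir in HDFS>

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<column mapping> -Dimporttsv.bulk.output=<output dir for StoreFiles> <tablename> <input dir in HDFS>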

 

Examples

HBase on HDInsight (Hadoop on Microsoft Azure) is at its core the same as HBase in any other environment. However, someone not familiar with the Microsoft Azure environment may get tripped up by some minor differences when interacting with an HBase cluster on HDInsight. That is why the examples provided in this blog are specific to HBase clusters on HDInsight, and I hope they will make your experience smoother. We will provide detailed steps for both usage scenarios of the ImportTsv utility.

Prerequisites

Before uploading the data to HBase, we first need to move the data to Windows Azure Storage Blob (WASB) and create an empty HBase table to receive it. So let's go through the following steps to get ready to upload the data into HBase using the ImportTsv utility.

  1. For this blog we will use the sample data.tsv shown below, where each field in a row is separated by a tab.

    row1    c1    c2

    row2    c1    c2

    row3    c1    c2

    row4    c1    c2

    row5    c1    c2

    row6    c1    c2

    row7    c1    c2

    row8    c1    c2

    row9    c1    c2

    row10    c1    c2

    row11    c1    c2

    row12    c1    c2

  2. Follow any of the methods/tools described in the Upload data for Hadoop jobs in HDInsight Azure document to upload the data.tsv file to WASB. For example, I used the PowerShell script sample provided in the above link to upload data.tsv to example/data/data.tsv and then used the Azure Storage Explorer tool to verify that the file landed in the right location; a minimal sketch of such an upload is shown below.
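
    The sketch below uses the classic Azure PowerShell storage cmdlets; the storage account name, key, container name, and local file path are placeholders you would replace with your own values.

    # Placeholders: substitute your cluster's storage account, key, and default container
    $ctx = New-AzureStorageContext -StorageAccountName "<storageaccount>" -StorageAccountKey "<storagekey>"
    Set-AzureStorageBlobContent -File "C:\temp\data.tsv" -Container "<container>" -Blob "example/data/data.tsv" -Context $ctx
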
  3. Now we need to create the table from the HBase shell. We will call the table 't1' and our row key will be the first column. We will put the two remaining columns in a column family called 'cf1'.

    If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately. The best practice when creating a table is to split it according to the row key distribution. If your row keys start with a letter or number, you can split the table at letter or number boundaries. Our sample data.tsv file has only 12 rows, but we will pre-split it into three regions just to show how it works.

    To open the HBase shell, we need to RDP to the head node, open the Hadoop command line, navigate to %hbase_home%\bin, and then type the following.

    C:\apps\dist\hbase-0.98.0.2.1.6.0-2103-hadoop2\bin>hbase shell

    Then run the following from the HBase shell to create the table pre-split into three regions.

    hbase(main):008:0> create 't1', {NAME => 'cf1'}, {SPLITS => ['row5', 'row9']}

  4. Now let's browse to the HBase dashboard from the head node using the link below to check the table we just created.

    http://zookeeper2.MyHbaseCluster.d3.internal.cloudapp.net:60010/master-status

    In the dashboard, go to the Table Details tab and you will see the list of all tables, including the one we just created, 't1'. The table names are hyperlinked; click 't1' and you should be able to view the three regions and other details as shown in the screenshot below.
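
    If you prefer to check from the HBase shell instead of the dashboard, the standard list and describe commands work as well (a quick sketch; describe shows the column family configuration rather than the regions):

    list

    describe 't1'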

Usage 1: Upload the data from TSV format in HDFS into HBase via Puts (i.e., non-bulk loading)

Open a new Hadoop command line and type 'cd %hbase_home%\bin' to navigate to the HBase home, and then run the following to upload the data from the TSV file data.tsv in HDFS to the HBase table t1.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" t1 /example/data/data.tsv

Note: If the fields in the file were separated by a comma instead of a tab and the corresponding file name were data.csv, we would have used the following to upload the data to the HBase table 't1', with the comma separator (",") specified via the importtsv.separator option.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.separator="," t1 /example/data/data.csv

To verify that the data was uploaded, open the HBase shell again and run the following.

scan 't1'

You should see the rows as below.
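
For a quicker sanity check than reading the full scan output, you can also count the rows or fetch a single one from the HBase shell (12 is the expected count for our sample file):

count 't1'

get 't1', 'row1'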

Usage 2: Preparing StoreFiles to be loaded via completebulkload (bulk loading)

We will use the same table 't1' to bulk load the data from the same input file, so let's disable, drop, and recreate table 't1' from the HBase shell as shown in the screenshot below. Our input data file data.tsv remains in the same location in WASB.
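
From the HBase shell, those steps look like the following (the recreate is the same pre-split create statement we used in the Prerequisites section):

disable 't1'

drop 't1'

create 't1', {NAME => 'cf1'}, {SPLITS => ['row5', 'row9']}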

Now that table 't1' is recreated, let's follow these steps to prepare the StoreFiles and then load them into the HBase table via the completebulkload tool.

  1. Run the following to transform the data file into StoreFiles and store them at the relative path specified by -Dimporttsv.bulk.output.

    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" t1 /example/data/data.tsv

    You should see the output as below in WASB (this screenshot was taken using Azure Storage Explorer). Notice there are three files under "example/data/storeDataFileOutput/cf1/", one per region; you can also list them from the command line, as noted after step 2 below.

Note: If the fields in the file were separated by a comma instead of a tab and the corresponding file name were data.csv, we would have used the following.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.separator="," -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" t1 /example/data/data.csv

  2. Now we need to use the completebulkload tool to complete the bulk upload. Run the following to upload the data from the HFiles located at /example/data/storeDataFileOutput to the HBase table t1.

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /example/data/storeDataFileOutput t1
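
As referenced in step 1, you can verify the StoreFiles that ImportTsv generated by listing the output directory from the Hadoop command line (do this before running completebulkload, since the load moves the files into HBase's own directories):

hadoop fs -ls /example/data/storeDataFileOutput/cf1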

Again, to verify that the data was uploaded, open the HBase shell and run the following.

scan 't1'

You can also use Hive and Pig to upload data into HBase tables on HDInsight; I intend to blog about those in the future. That is it for today, and I hope it was helpful.

