Getting familiar with Apache HBase.


This post is a first step toward understanding Apache HBase. Consider it a knowledge nugget chock-full o' goodness! Specifically, I am using HDInsight, which is the Microsoft Azure Hadoop distribution. My prior posts can be found here.

 

What is HBase and why would you use it? 

HBase is one example of a "NoSQL" (aka "Not Only SQL") database. NoSQL refers to a variety of non-relational DBs. I spent about an hour reading several definitions of the "Not Only SQL" part and they don’t really converge; search for yourself. That said, Martin Fowler's definition of NoSQL is the most in-depth explanation I've seen.

Rather than continue to spend time on a phrase, let's see what HBase can do.

HBase:

  • can handle massive amounts of structured or messy, unstructured data that relational DBs would struggle with. "Massive" here means billions of rows and more than a million columns.
  • runs on top of the Hadoop Distributed File System (HDFS)
  • allows data formats to vary from row to row
  • supports very fast updates and inserts to smaller data sets without "locking" tables the way a relational DB would, and provides nearly instant access to the inserted or updated data.
  • can do much more than I've summarized here

 

Generally, there are 4 types of NoSQL DBs: 

  • key/value
  • structured documents (e.g.: HTML, JSON, XML)
  • graph (think of nodes with relationships and directions between those nodes)  
  • columnar (data is stored in columns rather than rows).
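
To make the "columns rather than rows" idea a little more concrete, here is a minimal sketch in plain Python. It is only an illustration of the two layouts, not how HBase stores anything on disk (as far as I understand, HBase groups data by column family rather than by individual column). The field names are made up for the example.

```python
# Toy illustration only: plain Python lists and dicts, not HBase storage.
records = [
    {"id": "id1", "first": "Bob", "state": "WA"},
    {"id": "id2", "first": "Jane", "state": "WA"},
]

# Row-oriented layout: all fields of one record sit together.
row_layout = [tuple(r.values()) for r in records]

# Column-oriented layout: all values of one field sit together.
column_layout = {
    field: [r[field] for r in records] for field in records[0]
}

# Reading a single field touches one list instead of every record.
states = column_layout["state"]
print(row_layout)
print(states)
```

The point of the second layout is that a query interested in only one field can skip everything else.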

 

To try HBase you can provision a cluster from the Microsoft Azure portal. As I mentioned in an earlier post, the Azure emulator doesn’t come with HBase.

After signing in to the portal above, choose Data Services -> HDInsight -> HBase. For my purposes I simply wanted to become familiar with HBase, so I created a single-node cluster.

 

Note: the steps below have been simplified into a quick tutorial. I would NOT follow them in a production environment. Also, I haven't yet explored other ways to load data into HBase. Briefly, I've read that you can also load data into HBase with Hive, Pig, and tools such as CompleteBulkLoad or ImportTsv. There's also the option of writing code to load data into HBase. I'll keep this as an action item for a future post.

 

After your HBase cluster is running, enable remote access, log in to your cluster, and launch the Hadoop command line. Then enter:

cd %HBASE_HOME%\bin

 

That took me to \apps\dist\hbase-0.98.0.2.1.6.0-2103-hadoop2\bin>

 

Next, I ran

hbase shell

 

You should be at a prompt like this

hbase(main):001:0>

 

Tutorial

Create a table and add data

Hypothetically speaking, let's say that you want to create an HBase table to track the myriad employees working at your imaginary company. To create an HBase table the syntax is:

 create '<table name>', '<column family 1>', '<column family 2>', ...

 

Enter the following at the command prompt to create a sample table with 2 column families, "name" and "address":

create 'employee', 'name', 'address'

 

To see whether the table has been created use this syntax:

 describe '<table name>'

 

In this case enter the following at the command prompt:

describe 'employee'

 

Or use the following to get a list of all tables:

 list

You should see "employee" as a table.

 

To manually add info to the table I ran the statements below at the HBase command prompt. You should be able to copy and paste all of the put statements below and run them at once. You can delete the empty lines and leading spaces; I added them only for readability.

 

put 'employee', 'id1', 'name:first', 'Bob'

put 'employee', 'id1', 'name:last', 'Jones'

put 'employee', 'id1', 'address:street', '123 Main Street'

put 'employee', 'id1', 'address:city', 'Seattle'

put 'employee', 'id1', 'address:state', 'WA'

put 'employee', 'id1', 'address:country', 'USA'

put 'employee', 'id1', 'address:zip', '98102'

put 'employee', 'id2', 'name:first', 'Jane'

put 'employee', 'id2', 'name:last', 'Smith'

put 'employee', 'id2', 'address:street', '456 4th Ave'

put 'employee', 'id2', 'address:city', 'Renton'

put 'employee', 'id2', 'address:state', 'WA'

put 'employee', 'id2', 'address:country', 'USA'

put 'employee', 'id2', 'address:zip', '98104'

Above, 'name' is column family 1 and 'address' is column family 2. Personally, I find it interesting that you can cram a bunch of fields into a single column family. Also, a column family can have a varying number of columns.
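
Here is a minimal sketch of how I picture this, written as a toy Python class rather than anything from the HBase client API. The class name ToyTable, the 'name:middle' qualifier, the 'phone' family and the values "Q" and "555-0100" are all hypothetical. The assumption it illustrates is that column families are fixed when the table is created, while the columns inside a family can be invented on the fly with each put.

```python
# Toy model only: a dict standing in for an HBase table. Not the HBase client API.
class ToyTable:
    def __init__(self, *families):
        self.families = set(families)   # fixed when the table is created
        self.rows = {}                  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value


employee = ToyTable("name", "address")
employee.put("id1", "name:first", "Bob")
employee.put("id1", "name:middle", "Q")   # new qualifier, no schema change needed

try:
    # The 'phone' family was never created, so this put is rejected.
    employee.put("id1", "phone:mobile", "555-0100")
    rejected = False
except KeyError:
    rejected = True

print(employee.rows)
print(rejected)
```

In other words, adding a brand-new column to one row is just another put, but adding a brand-new column family means changing the table itself.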

So far every row above has exactly the same columns. What about the "messy" data that HBase can handle well but a relational DB can't? I spent about an hour looking but couldn't find a great, messy example. If you happen to find an example of messy data that HBase handles well, I'd be interested in seeing it.
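
The closest thing I can offer is a toy picture of what "messy" might mean here: rows that simply don't share the same columns. The sketch below is plain Python, not HBase, and the third employee (id3) and the qualifiers 'name:nickname' and 'address:province' are hypothetical. It counts how many empty slots a fixed-schema table would need to carry for the same data.

```python
# Toy model only: each row holds just the cells that were actually written.
rows = {
    "id1": {"name:first": "Bob", "name:last": "Jones", "address:state": "WA"},
    "id2": {"name:first": "Jane", "address:city": "Renton"},
    # Hypothetical row with qualifiers no other row uses.
    "id3": {"name:nickname": "Sam", "address:province": "BC"},
}

# A relational table would need one column for every qualifier seen anywhere.
all_columns = sorted({col for cells in rows.values() for col in cells})

# Count the empty (NULL) slots a fixed-schema table would carry.
stored_cells = sum(len(cells) for cells in rows.values())
null_slots = len(rows) * len(all_columns) - stored_cells

print(all_columns)
print(stored_cells, null_slots)
```

With only three rows the gap is small, but the same arithmetic on billions of rows and many rarely used columns is where a sparse layout starts to pay off.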

 

Sample queries

To dump the entire table to the screen the syntax is:

scan '<table name>'

 

Enter the following at the command prompt:

scan 'employee'

 

To get a specific row the syntax is:

get '<table name>', '<row key>'

 

For this example, enter the following at the command prompt: 

get 'employee', 'id1'
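
One thing worth knowing when you scan: HBase keeps rows sorted by row key, and the keys are compared as bytes, not as numbers. The sketch below is plain Python showing what that ordering does to keys like the ones in this tutorial. The extra keys (id9, id10) and the zero-padded key scheme are hypothetical.

```python
# Toy model only: row keys are compared as byte strings, not as numbers.
row_keys = [b"id1", b"id2", b"id10", b"id9"]

# A scan walks rows in sorted row-key order.
scan_order = sorted(row_keys)

# Zero-padding the numeric part (an assumed key scheme) restores numeric order.
padded = sorted(f"id{n:04d}".encode() for n in (1, 2, 10, 9))

print(scan_order)
print(padded)
```

So 'id10' would show up in a scan before 'id2' unless the keys are padded or otherwise designed with byte order in mind.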

 

To get a count of the rows in a table the syntax is:

count '<table name>'

 

Enter the following at the command prompt:

count 'employee'

 

To query the table for a specific value, like any employee named "Bob", I ran the following (it seems to be case sensitive):

scan 'employee', {FILTER => "ValueFilter(=, 'binary:Bob')"}

 

Or any employee with a Washington state address:

scan 'employee', {FILTER => "ValueFilter(=, 'binary:WA')"}

 

Note, though, that this returns only the matching cells rather than the whole row. The HBase filters are something that I need to become more familiar with since they are unlike anything I've used before.
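
To convince myself why only single cells come back, here is a toy sketch in plain Python of the two behaviours. It is not the HBase API. value_filter mimics what ValueFilter appears to do (keep individual cells whose value matches), while single_column_value_filter mimics my understanding of HBase's SingleColumnValueFilter (test one column, keep the whole row). One simplification to be aware of: as I understand it, the real SingleColumnValueFilter by default also lets through rows that lack the tested column, which this toy version does not.

```python
# Toy model only: mimics the filter semantics in plain Python, not the HBase API.
rows = {
    "id1": {"name:first": "Bob", "name:last": "Jones", "address:state": "WA"},
    "id2": {"name:first": "Jane", "name:last": "Smith", "address:state": "WA"},
}

def value_filter(table, wanted):
    """Keep only the individual cells whose value equals `wanted`."""
    result = {}
    for key, cells in table.items():
        hits = {col: val for col, val in cells.items() if val == wanted}
        if hits:
            result[key] = hits
    return result

def single_column_value_filter(table, column, wanted):
    """Keep every cell of each row whose `column` equals `wanted`."""
    return {k: dict(c) for k, c in table.items() if c.get(column) == wanted}

cell_hits = value_filter(rows, "Bob")
row_hits = single_column_value_filter(rows, "name:first", "Bob")
print(cell_hits)
print(row_hits)
```

In the shell, the whole-row version would presumably look something like scan 'employee', {FILTER => "SingleColumnValueFilter('name', 'first', =, 'binary:Bob')"}, but treat that exact syntax as an assumption to check against the filter documentation.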

 

Delete

To delete specific cell values, like "Bob" and "Jones", enter:

delete 'employee', 'id1', 'name:last'

delete 'employee', 'id1', 'name:first'

 

To verify these results, run a 'scan' on the employee table:

scan 'employee'

...and you should see the following, although the timestamp will be different for obvious reasons:

 

ROW                COLUMN+CELL

 id1                  column=address:city, timestamp=1420868942823, value=Seattle

 id1                  column=address:country, timestamp=1420868942901, value=USA

 id1                  column=address:state, timestamp=1420868942870, value=WA

 id1                  column=address:street, timestamp=1420868942776, value=123 Main Street

 id1                  column=address:zip, timestamp=1420868942948, value=98102

[Note, above there is no column=name:first or column=name:last but there is one below for the next employee.]

  id2                  column=address:city, timestamp=1420868943104, value=Renton

  id2                  column=address:country, timestamp=1420868943167, value=USA

  id2                  column=address:state, timestamp=1420868943136, value=WA

  id2                  column=address:street, timestamp=1420868943058, value=456 4th Ave

  id2                  column=address:zip, timestamp=1420868944739, value=98104

  id2                  column=name:first, timestamp=1420868942995, value=Jane

  id2                  column=name:last, timestamp=1420868943026, value=Smith

 

Alternatively, enter the command below to show everything associated with id1. You'll notice that neither column=name:first nor column=name:last exists:

get 'employee', 'id1'

 

To prepare this post I forced myself to find out how to delete an entire row (which is one reason why I'm blogging…). Since we've munged the data associated with id1, let's delete that entire row. The syntax is:

deleteall '<table name>', '<row>'   (There are some optional params that I won't mention here.)

 

Enter the following at the command prompt:

deleteall 'employee', 'id1'

 

Run a 'scan' on our employee table and see that the data associated with id1 no longer exists. The scan results only show data for id2 at this point.
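
From what I've read, a delete in HBase doesn't erase data right away. It writes a marker (often called a tombstone) that hides the older cells, and the hidden data is physically removed later during a major compaction. The sketch below is a toy Python model of that idea, not HBase code. The small integer timestamps and the function names are hypothetical, and the masking rule (a marker hides cells with an equal or older timestamp) is a simplification.

```python
# Toy model only: a delete adds a marker; nothing is erased until compaction.
cells = [  # (row, column, timestamp, value)
    ("id1", "name:first", 100, "Bob"),
    ("id1", "address:city", 101, "Seattle"),
    ("id2", "name:first", 102, "Jane"),
]
tombstones = []  # (row, column or None for the whole row, timestamp)

def delete(row, column, ts):
    tombstones.append((row, column, ts))

def is_masked(cell):
    row, column, ts, _ = cell
    return any(
        t_row == row and t_col in (None, column) and ts <= t_ts
        for t_row, t_col, t_ts in tombstones
    )

def scan():
    return [c for c in cells if not is_masked(c)]

def major_compact():
    """Physically drop masked cells, then discard the markers."""
    cells[:] = scan()
    tombstones.clear()

delete("id1", "name:first", 200)   # like: delete 'employee', 'id1', 'name:first'
visible_after_delete = scan()
delete("id1", None, 201)           # like: deleteall 'employee', 'id1'
visible_after_deleteall = scan()
stored_before_compaction = len(cells)
major_compact()
stored_after_compaction = len(cells)
print(visible_after_deleteall, stored_before_compaction, stored_after_compaction)
```

The scan results change as soon as the marker is written, even though the underlying cells are still stored until the compaction step runs.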

 

To delete all data in a table the syntax is:

truncate '<table name>'

 

Enter the following at the command prompt:

truncate 'employee'

 

To delete an HBase table, first disable it and then drop it. The syntax is:

disable '<table name>'

drop '<table name>'

 

In this example enter the following at the command prompt:

disable 'employee'

drop 'employee'
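
The order of the two commands above matters: HBase refuses to drop a table that is still enabled. Here is a tiny Python sketch of that rule. The ToyAdmin class is hypothetical and only models the ordering, not the real admin API or its exception types.

```python
# Toy model only: enforces the disable-before-drop order used above.
class ToyAdmin:
    def __init__(self):
        self.tables = {}            # table name -> True if enabled

    def create(self, name):
        self.tables[name] = True

    def disable(self, name):
        self.tables[name] = False

    def drop(self, name):
        if self.tables[name]:
            raise RuntimeError(f"table {name} is still enabled; disable it first")
        del self.tables[name]

admin = ToyAdmin()
admin.create("employee")
try:
    admin.drop("employee")          # skipped the disable step
    dropped_while_enabled = True
except RuntimeError:
    dropped_while_enabled = False

admin.disable("employee")
admin.drop("employee")
print(dropped_while_enabled, admin.tables)
```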

 

Lastly, delete the HBase test cluster we created above.

 

Maybe I'm a geek, but I enjoyed researching HBase and putting together the small tutorial above. I hope this information was valuable to you.

 
