In my previous post, we discussed the map, array and struct data types and their implementation in Hive. Continuing on the Hive theme, this post will introduce partitioning and bucketing as method for segmenting large data sets to improve query performance.
If you have previous experience working in the relational database world then inevitably the concept of partitions and partitioning is not new. Partitions are fundamentally horizontal slices of data which allow larges sets of data to be segmented into more manageable chunks.
Partitioning in this manner takes many different forms with boundaries defined on either a single or range of values for the one or more columns that act as a splitter. This is commonly seen in a data warehouse environments using dates (such as transaction or order date) and occasionally geography to partition large fact tables.
In SQL Server, this partitioning support (for single columns) is built in through the use of partitioning schemas and functions. For Hive, partitioning is also built into for both managed and external tables through the table definition as seen below.
CREATE TABLE customer ( id INT, name STRING, address1 STRING, address2 STRING, city STRING, state STRING, zip STRING ) PARTITIONED BY ( region STRING, country STRING );
A couple of interesting things become apparent as you digest the sample above:
- Multiple 'slicer' or partitioning columns are supported (i.e. Region/Country)
- The partitioning columns do not need to be included in the table definition as the function both as the slicer and column. You will still be able to include them in the query projection.
Behind the scenes, the partitioning statement alters the way managed tables are stored. If you browse HDFS or whatever underlying file system you are using the normal file structure for a given database and table looks like <DB>.db/<TABLE NAME> or in the case of our example ERP.db/Customer. Data files are then written directly to the Customer directory.
When you define the table with partitions, the underlying structure is changed such that sub-directories are created for each slicer or partitioning column. If multiple slicers are specified, the sub-directories will be nested based on the order of of the columns in the definition statement. To provide a better visual picture, lets circle back to our Customer table with Region and Country partitions and see how the directories now enumerate:
/erp.db/customer/region=North America/country=US /erp.db/customer/region=North America/country=CA /erp.db/customer/region=South America/country=BR
Now when a Hive query is issued that contains either one or both of our partition slicers in the criteria or predicate (the WHERE clause) what effectively happens is partition elimination where only those data directories that are needed are scanned. In the event that the slicers are not used all directories will be scanned (i.e. a table scan) and the partitioning has no effect.
A few final important notes on partitioning:
- Be careful not to over partition your data within Hive. If your partitions are relatively small then the expense of recursing directories becomes more expensive than simply scanning the data. Likewise, partitions should be roughly similar in size to prevent a single long running thread from holding things up (see bucketing below).
- If your Hive table is extremely large and table scans (i.e. queries without partition slicers) need to be blocked you can put Hive into strict mode using the command below. In this mode, when a querying that would result in a table scan is issued an error is thrown back to the user.
- Partitioning works both managed and external tables.
- When loading partitioned data, you must explicitly or dynamically (dynamic partitions must be enabled, disabled by default) define the targeted partition.
Although you the term Bucketing may not be familiar to you, you are already familiar with the concept behind it. Much like partitioning, bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance. The syntax to create a table with bucketing is listed below:
CREATE TABLE order ( username STRING, orderdate STRING, amount DOUBLE, tax DOUBLE, ) PARTITIONED BY (company STRING) CLUSTERED BY (username) INTO 25 BUCKETS;
Now that you've see it let's talk about how it works. Unlike partitioning where each value for the slicer or partition key gets its own space, in clustering a hash is taken for the field value and then distributed across buckets. In the example above, we created 25 buckets and are clustering on username. Each bucket then would contain multiple users, while restricting each username to a single bucket.
There's a danger with this technique though. Bucketing is entirely dependent on data correctly being loaded to the table. To properly load data to a table utilizing this technique you need to either:
- Set the maximum number of reducers to the same number of buckets specified in the table metadata (i.e. 25)
set map.reduce.tasks = 25
- Use the following command to enforce bucketing:
set hive.enforce.bucketing = true
In this post, we introduced the horizontal partitioning capabilities within Hive. In addition to the partitioning capabilities you are familiar with, Hive also supports the ability to bucket data to more evenly partition data when a more natural slicer or key is not available. In the next post we will introduce Hive Indexes and Views.
Till next time!