In this bog post we have captured some common questions and answers related to Hive in Azure HDInsight.
How do I check DAG counters for a query?
Go to Tez View. Find the DAG and click on Dag counters tab.
How do I check that amount of data being read/writted/shuffled by a query?
Go to Dag counters and look for HDFS_BYTES_READ/HDFS_BYTES_WRITTEN/SHUFFLE_BYTES.
How do I download all YARN container logs?
yarn logs -applicationId <application_id> > logs
How do I download YARN container logs for a particular container?
yarn logs -applicationId <application_id> -containerId <container_id> > containerlogs
How do I download YARN container logs for all Application Master (AM) container?
yarn logs -applicationId <application_id> -am ALL > amlogs
How do I download YARN container logs for only the latest Application Master (AM) container?
yarn logs -applicationId <application_id> -am -1 > latestamlogs
How do I download YARN container logs for first 2 Application Master (AM) container?
yarn logs -applicationId <application_id> -am 1,2 > first2amlogs
Where is hive client logs?
Where is hive metastore logs?
Where is hiveserver logs?
How to I start hive shell with debug logging enabled on console?
hive -hiveconf hive.root.logger=ALL,console
How do I use tez engine?
How do I download tez DAG data?
hadoop jar /usr/hdp/current/tez-client/tez-history-parser-*.jar org.apache.tez.history.ATSImportTool -downloadDir . -dagId <DagId>
From Ambari tez view:
Go to Ambari --> Go to Tez view (hidden under tiles icon in upper right corner) --> Click on the dag you are interested in --> Click on Download data.
How do I get CrticalPath for a Tez DAG?
hadoop jar /usr/hdp/current/tez-client/tez-job-analyzer-*.jar CriticalPath --saveResults --dagId <DagId> --eventFileName <DagData.zip>
What are the other analyzers available for Tez DAG?
You can list all the available analyzers with hadoop jar /usr/hdp/current/tez-client/tez-job-analyzer-*.jar. Currently it has the following analyzers available:
- ContainerReuseAnalyzer: Print container reuse details in a DAG
- CriticalPath: Find the critical path of a DAG
- LocalityAnalyzer: Print locality details in a DAG
- ShuffleTimeAnalyzer: Analyze the shuffle time details in a DAG
- SkewAnalyzer: Analyze the skew details in a DAG
- SlowNodeAnalyzer: Print node details in a DAG
- SlowTaskIdentifier: Print slow task details in a DAG
- SlowestVertexAnalyzer: Print slowest vertex details in a DAG
- SpillAnalyzer: Print spill details in a DAG
- TaskAssignmentAnalyzer: Print task-to-node assignment details of a DAG
- TaskConcurrencyAnalyzer: Print the task concurrency details in a DAG
- VertexLevelCriticalPathAnalyzer: Find critical path at vertex level in a DAG
How do I kill an application?
From the command line:
yarn application -kill <application_id>
From the Yarn UI:
Go to Ambari --> Yarn UI --> click on application that you want to kill --> Click on "kill application" towards top left.
This will kill any query that is being run as part of the Tez session also.
How do I check currently running application on the cluster?
How to I export hive metastore and import it on new cluster?
Run the following where you want to export the metastore:
for d in `hive -e "show databases"`; do echo "create database $d; use $d;" >> alltables.sql ; for t in `hive --database $d -e "show tables"` ; do ddl=`hive --database $d -e "show create table $t"`; echo "$ddl ;" >> alltables.sql ; echo "$ddl" | grep -q "PARTITIONED\s*BY" && echo "MSCK REPAIR TABLE $t ;" >> alltables.sql ; done; done
This will generate a allatables.sql file. Copy this file to new cluster and run the following on new cluster:
hive -f alltables.sql
This assumes that data path on new cluster are same as on old. If not, you can manually edit the generated alltables.sql file to reflect any changes.
How do I specify the database while starting hive?
hive -database <databaseName>
How do I specify a config or variable while starting hive?
hive -hiveconf a=b
How do I list all columns and their types for a table?
Within hive/beeline shell run desc <tablename>
How do I list all properties of a table?
Within hive/beeline shell run desc extended <tablename>. The output includes all the column names, their types, data format, compression properties, table statistics, columns statistics etc.
How do I disable mapjoin?
How do I decide which joins are converted to mapjoins?
set hive.auto.convert.join.noconditionaltask.size to value of the largest join you want to convert to mapjoin. For example, set hive.auto.convert.join.noconditionaltask.size=1000000 will convert all joins to mapjoins where hive "estimates" the sum of smallest n-1 tables in n-join is less then 1MB.
Does hive support Transactions?
Yes. Look here
How do I change the container size for hive on tez?
Set hive.tez.container.size to the desired value in MB. While changing the container size you should make sure that values of hive.tez.java.opts (0.8*hive.tez.container.size) and tez.runtime.io.sort.mb (0.4*hive.tez.container.size) are adjusted accordingly.
How do I turn off vectorization in hive?
set hive.vectorized.execution.enabled to false. Hive vectorization has been source of many many errors and usually the error trace when the query fails, has the vectorization operator name.
My cluster disk space is filled what do I do?
Go to Ambari -> HDFS. Check the values of Disk Usage (DFS Used) and Disk Usage (Non DFS Used) and decide which one is taking most of the space on the cluster. If most data is used for,
- DFS : You have lot of data on HDFS, which is consuming disk space. The solution is to delete the data from the HDFS that you no longer require. FOr example, you can use hdfs dfs -rm -r -skipTrash hdfs://mycluster/tmp to remove the /tmp HDFS directory.
- Non DFS : You have lot of intermediate job data that is consuming the disk space. You have to kill the application(s) responsible for writing lots of data. You can check the values of FILE_BYTES_WRITTEN counter to infer which is the bad application(s). Once you have killed the bad application(s), the cluster should become healthy in few minutes. If you re-run the hive query without fixing it to not to produce large intermediate data, the cluster will again end up in the same situation. One of the very common example of queries which result into a cross join. If you do explain of the query, the cross join would be flagged as [WARNING].
In both cases, using a larger cluster or cluster with more disk spaces on nodes can help mitigate the problem for short time.
How do I use connect automatically when starting beeline?
beeline -u "jdbc:hive2://localhost:10001/;transportMode=http" -n <usernmae> -p <password>
How do I connect to a different database(mydb) when starting beeline?
beeline -u "jdbc:hive2://localhost:10001/mydb;transportMode=http" -n <usernmae> -p <password>
How do I set hive scrath directory?
How do I configure the trash interval for HDFS?
Trash interval decides how long the data deleted from HDFS is kept around. You can configure it with fs.trash.interval. The deleted data is kept under .Trash/Current directory on HDFS and can be restored from there.
How do I use Hplsql?
How do I know version of various components?
How do I check the version of hadoop?
hadoop version should return detailed version information as follows:
Subversion firstname.lastname@example.org:hortonworks/hadoop.git -r f3d31dd3dc216fff8acd4da01075656131947e1d
Compiled by jenkins on 2016-11-07T18:55Z
Compiled with protoc 2.5.0
From source with checksum a6c8c64bfe734b01a46c7c480ebc7ff
This command was run using /usr/hdp/22.214.171.124-1/hadoop/hadoop-common-126.96.36.199.5.2.1-1.jar
How do I check if native compression codecs are properly setup?
hadoop checknative -a should list output similar to
Native library checking:
hadoop: true /usr/hdp/188.8.131.52-1/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib/x86_64-linux-gnu/libz.so.1
snappy: true /usr/lib/x86_64-linux-gnu/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib/x86_64-linux-gnu/libbz2.so.1
openssl: true /usr/lib/x86_64-linux-gnu/libcrypto.so
How do I dump metadata of an ORC file?
hive --orcfiledump <path-to-ORC-file>
How do I dump content of an ORC file?
hive --orcfiledump -d <path-to-ORC-file>
How do I specify the ORC compression?
orc.compress. Possible values are NONE, ZLIB and SNAPPY. ZLIB is a good default choice.
How do I specify ORC strip size?
How do I specify ORC encoding strategy for integers?
hive.exec.orc.encoding.strategy. Possible values are SPEED and COMPRESSION.
Where can I find more information about Hive ORC configurations?
How can I list all the effective configurations?
How do I get the query plan?
run explain followed by the query string.
How do I get machine readble query plan?
run explain formatted followed by the query string.
How do I get very detailed query plans?
run explain extended followed by the query string. This includes things like what partitions were pruned.
How do I know if the a perticular join in the query is a map join or a merge join?
The query plan will have Map Join Operator and Merge Join Operarator to indicate that.
How do I specify ORC split strategy?
set hive.exec.orc.split.strategy. Possible values are BI, ETL and HYBIRD.
Where are all hive configurations listed?
Credit Dharmesh Kakadia for helping capture commonly asked questions