Our customers often use compression technologies like ORC and Snappy that can compress data and offer high performance. The expectation is that since the data is compressed, the job should run faster. However, more often than not, the job still takes a long time to run.
The main cause of this is that Hive often only spins up one reducer to run the job. An example is if the original input data is 100GB, and using ORC with Snappy compression brings it to 2GB. It turns out that Hive will often estimate the # of reducers in this case to be one - causing a bottleneck to performance.
Hive estimates # of reducers as:
# of reducers = (# of bytes input to mappers) / hive.exec.reducers.bytes.per.reducer
Decreasing the value of
hive.exec.reducers.bytes.per.reducer should increase # of reducers.
Having too many reducers can also be detrimental. Here is some guidance on how many reducers you should choose.