Apache Sqoop is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. HDInsight is a Hadoop cluster deployed in Microsoft Azure, and it includes Sqoop. When transferring a small amount of data, Sqoop performance is not an issue. However, when transferring a huge amount of data, it is important to consider the things that can improve performance and keep the execution time within a desirable limit.
Increase the number of parallel tasks by using an appropriate value for the –m parameter
A Sqoop job essentially boils down to a bunch of map tasks (there is no reducer), so tuning a Sqoop job is much the same as optimizing a map-reduce job, or at least that is where one should start. Therefore, the first thing to consider when improving the performance of a Sqoop job is to increase the number of parallel tasks; in other words, increase the number of mappers to utilize the maximum available resources in the cluster. This may require some experimentation given the user's dataset and the system on which Sqoop is running. The argument is "-m, --num-mappers". By default –m is set to 4, so if it is not specified Sqoop will use only four map tasks in parallel. In general you would want a higher value of –m to increase the degree of parallelism and hence the performance. However, it is not recommended to increase the degree of parallelism beyond the resources available in the cluster, because the extra mappers will run serially and will likely increase the amount of time required to complete the job.
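As a sketch, the mapper count is simply passed on the Sqoop command line; the connection string, table, and directories below are placeholders for illustration, not values from this post:

```shell
# Illustrative only: raise the number of parallel map tasks from the
# default of 4. All names below are placeholders.
sqoop import \
  --connect "jdbc:sqlserver://yourserver.database.windows.net;database=yourdb" \
  --username youruser --password-file /user/you/db.password \
  --table yourtable \
  --target-dir /example/data/yourtable \
  --num-mappers 8   # same as "-m 8"; the default is 4
```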
Now the question arises: how do you determine the right value for –m? There is no perfect way to find that magic number, but you can determine an approximate range based on your cluster size and then test to find out which value gives you the best results. Hadoop 2.x (HDI 3.x) uses YARN, and each YARN task is assigned a container which has a memory limit. In other words, each mapper requires a container to run. So if we can get a rough estimate of the maximum number of containers available in your cluster, then you can use that number for –m as a starting point, assuming there is no other job running in the cluster and you do not want to run multiple sets of mappers in serial. The number of available containers in a cluster depends on a few configuration settings. Based on the HDInsight release notes, the following are the default settings in HDInsight for the mapper, reducer and AM (Application Master) as of the 10/7/2014 release:
mapreduce.map.memory.mb = 768
mapreduce.reduce.memory.mb = 1536
yarn.app.mapreduce.am.resource.mb = 768
Currently each data node of an HDInsight cluster uses a Large size Azure PaaS VM, which has 4 cores and 7 GB RAM, of which about 1 GB is used by the node manager (the NodeManager daemon's heap size is set via yarn-env.sh, via its YARN_NODEMANAGER_HEAPSIZE environment variable). So with the remaining 6 GB you can have a maximum of (6*1024)/768 = 8 mapper containers per worker node. The reducers are configured to use twice as much memory (1536 MB) as mappers, but in a Sqoop job there is no reducer. Let's assume we have a 16 node cluster and there is no other job running. The total number of available containers, or maximum number of parallel map tasks, is then 8x16=128. So if you do not want to run multiple sets of map tasks in serial and there is no other job running in the cluster, you can set –m to 128.
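The container arithmetic above can be sketched in a few lines of shell. The node count and memory figures are the example values from this post, not numbers queried from a live cluster:

```shell
# Rough estimate of the maximum number of parallel map tasks:
# containers per worker node, times the number of worker nodes.
nodes=16                  # worker nodes in the example cluster
usable_mb=$((6 * 1024))   # RAM per node left after the NodeManager
map_container_mb=768      # mapreduce.map.memory.mb default quoted above

containers_per_node=$((usable_mb / map_container_mb))
max_parallel_mappers=$((containers_per_node * nodes))

echo "$containers_per_node containers per node"   # 8
echo "$max_parallel_mappers parallel mappers"     # 128
```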
Use a smaller fs.azure.block.size to increase the number of mappers further
However, the value passed for the –m parameter is a guide only, and the actual number of mappers may be different based on other factors such as the input file size and count, dfs.block.size (which is represented by fs.azure.block.size in Windows Azure Storage Blob, WASB, and set to 512 MB by default), max split size, etc. If an individual input file is smaller than the block size then we will have one map task for that input file. However, if an input file is bigger than the block size, the number of mappers for that input file would be (file size / block size). Therefore, if you have resources available in the cluster, you can try to increase the number of mappers by setting a smaller value for the block size and see if that improves the performance; this is the second thing you should consider when tuning the performance of a Sqoop job. For Hive or map-reduce jobs we can set the fs.azure.block.size property to a different value when running the job. Unfortunately, the HDInsight PowerShell cmdlet New-AzureHDInsightSqoopJobDefinition doesn't include the [-Defines <Hashtable>] parameter, which allows Hadoop configuration values to be set during job execution. However, we can always provision a customized HDInsight cluster and set fs.azure.block.size to a smaller value when creating the cluster if needed. This changes the default at the cluster level, and that value will be used for all jobs running in the cluster.
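A quick sketch of the (file size / block size) rule, using the 512 MB WASB default from this post; halving the block size doubles the mapper count for a file larger than one block:

```shell
# Mappers for one input file: ceiling of (file size / block size),
# with a minimum of one mapper per file.
mappers_for_file() {
  local file_mb=$1 block_mb=$2
  local n=$(( (file_mb + block_mb - 1) / block_mb ))   # integer ceiling
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}

mappers_for_file 2048 512   # 2 GB file, default block size -> 4
mappers_for_file 2048 256   # same file, halved block size  -> 8
mappers_for_file 100  512   # file smaller than a block     -> 1
```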
Is my cluster too small to handle the data?
After you have tested enough to find optimum values for the –m parameter and fs.azure.block.size and feel that you can't improve the performance of your Sqoop job any further, it may be time to think about increasing the cluster resources, in other words increasing the size of your cluster; this is the third thing you should consider when performance tuning a Sqoop job. You should especially consider this when you are transferring a huge amount of data and need to bring down the execution time significantly. To give you some idea, I recently worked on a case where the customer wanted to export about 75 GB of data and each input file was ~500 MB. Initially we used a 24 node cluster with –m set to 160, and the export took about 58 hours to complete. Then we tested with a 36 node cluster with –m set to 300, and it took about 34 hours to complete. This customer didn't want to try a smaller value for fs.azure.block.size, as they didn't want to change the default at the cluster level. If you are transferring a reasonably huge amount of data then you should start with a reasonably sized cluster even before starting to play with the –m or fs.azure.block.size parameters. I hope the example of my customer's data and cluster size gives you some idea in that regard.
Is the database a bottleneck?
Increasing the cluster size or the degree of parallelism will not improve performance indefinitely. For example, if you increase the degree of parallelism beyond what your database can reasonably support, it will not improve the overall performance. Sqoop exports are performed by multiple writers in parallel; each writer uses a separate connection to the database, and these have separate transactions from one another. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result. This brings us to the fourth thing you should consider while tuning Sqoop job performance: check whether the database is the bottleneck. If logs captured from the database side show that is indeed the case, then you need to figure out whether there is a way to scale up the database. The customer I mentioned earlier was using Azure SQL Database, and we found that the database was a significant bottleneck. We scaled up the Azure SQL Database to a higher performance level, measured in Database Throughput Units (DTUs), and as a result we were able to improve the overall performance by 50%. The Azure SQL Database Service Tiers and Performance Levels MSDN article has more information on the different scale-up options for Azure SQL Database.
Is the storage a bottleneck?
HDInsight uses Windows Azure Blob Storage, WASB, for storing data, and this Azure document details the benefits of using WASB. However, the HDInsight cluster can be throttled when the throughput rate of Windows Azure Storage blobs exceeds the limits detailed in this blog post. Therefore, when running Sqoop jobs in an HDInsight cluster, another potential bottleneck is WASB throughput, and that is the fifth thing you should consider while tuning the performance of a Sqoop job. You can use the Windows Azure Storage Log Analysis Tool detailed in this blog post to determine whether that is the case, and then take appropriate measures to mitigate it. While importing data into WASB you also want to make sure the data size does not cross the WASB block size limits described in this MSDN document; otherwise you may see an error like the one below.
Caused by: com.microsoft.windowsazure.services.core.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
Two other scenario-specific Sqoop performance tips
Let's briefly discuss two other scenario-specific Sqoop performance tips. For Sqoop export you can use the --batch argument, which enables batch mode for the underlying JDBC statement execution and thus may improve performance; the number of records grouped per statement can be tuned with the sqoop.export.records.per.statement property (for example, 200 or higher). Note that if the table has many columns and you use a high batch value, you may end up seeing OOM errors. The second tip is specific to running Sqoop jobs from Oozie. Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time a Sqoop job starts. When the job is launched by Oozie this is unnecessary, since Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies in the distributed cache. Oozie localizes the Sqoop dependencies on each worker node only once, during the first Sqoop job, and reuses the jars on the worker nodes for subsequent jobs. Using the --skip-dist-cache option in the Sqoop command when it is launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache, and can save massive I/O.
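A hedged sketch of both tips in one command; the connection details and paths are placeholders, and you should verify the batch settings against your own database's limits:

```shell
# Tip 1: enable JDBC batch mode for an export. The records-per-statement
# property tunes the batch size; too high a value on a wide table can
# cause OOM errors. All names below are placeholders.
sqoop export \
  -D sqoop.export.records.per.statement=200 \
  --connect "jdbc:sqlserver://yourserver.database.windows.net;database=yourdb" \
  --username youruser --password-file /user/you/db.password \
  --table yourtable \
  --export-dir /example/data/yourtable \
  --batch

# Tip 2: when the same command is launched from an Oozie workflow action,
# append --skip-dist-cache so Sqoop does not re-copy its jars to the job
# cache; Oozie's Sqoop share lib already distributes them.
```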
I am sure there are other ways to optimize the performance of a Sqoop job in HDInsight. I tried to cover the main ones in this blog post, and I hope it either helps you improve the performance of your Sqoop job or at least serves as a starting point for you.