导出HDInsight数据到MySQL(Azure)

在大数据计算中,经常会有一个场景:在Hadoop分布式存储DFS和关系型数据库(MySQL、Oracle、SQL Server)间导入(导出)数据。由于Azure HDInsight是基于公有云的托管式Hadoop,用户在执行数据转移时,与云上Hadoop相连的关系型数据库往往不是本地的数据库引擎,取而代之的是公有云提供的关系型数据库服务,如SQL Database和MySQL on Azure。针对如何使用HDInsight中的工具Sqoop导入(导出)数据到SQL Database,Azure官方文档给出了参考教程,本文具体实践如何使用Sqoop(HDInsight)导出数据到MySQL on Azure的场景。

关于HDInsight:https://azure.microsoft.com/en-us/documentation/articles/hdinsight-high-availability/,本实践中将会提交任务到HDInsight集群,集群中的Head Node分配任务到Worker Node,Worker Node具体实施数据的读取、处理和转移。


 

关于MySQL on Azure:https://www.windowsazure.cn/home/features/mysql/

关于Sqoop: https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html

关于HDInsight和Azure Storage:https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/,HDInsight中的数据以冗余的形式存储在Azure Storage中,HDInsight节点可以通过分布式存储路径(WASB)来读写目标数据。同时,用户也可以直接访问Azure Storage对应容器中的HDInsight数据文件。

 

注:本实践通过Azure PowerShell来实现,主要包括:

  1. 创建MySQL数据表,用于接收Sqoop导出的数据。
  2. 上传数据样本文件到HDInsight的存储(Azure Storage)
  3. 提交数据导出任务(Sqoop Job)到HDInsight集群
  4. 跟踪Sqoop任务执行情况并检查输出结果。

 

第一步:准备MySQL数据表

#MySQL variables = "mysqlservices-sha.chinacloudapp.cn"

$MySQLLogin = "login1@jianwmysql"  #注意格式: 登录用户@MySQL服务器名

$MySQLPassword = "password"

$MySQLExportedDBName = "testdb01"

$tableName = "jtable3"

 

$confirmation= Read-Host "To create a new MySQL table, input 'y'; use existing MySQL table, input anything else ..."

if ($confirmation -eq 'y') {

    $MySQLAdminUserName =$MySQLLogin;

    $MySQLAdminPassword =$MySQLPassword;

    $MySQLDatabase =$MySQLExportedDBName;

    $MySQLHost =$MySQLServerName;

    $ConnectionString ="server=" +$MySQLHost +";port=3306;uid=" + $MySQLAdminUserName+ ";pwd="+ $MySQLAdminPassword + ";database="+$MySQLDatabase

    $Query = "CREATE TABLE " + $tableName + "(id VARCHAR(40),name VARCHAR(40))"

 

    Try {

      [void][System.Reflection.Assembly]::LoadWithPartialName("MySql.Data")

      $Connection = New-Object MySql.Data.MySqlClient.MySqlConnection

      $Connection.ConnectionString = $ConnectionString

      $Connection.Open()

 

      $Command = New-Object MySql.Data.MySqlClient.MySqlCommand($Query, $Connection)

      $DataAdapter = New-Object MySql.Data.MySqlClient.MySqlDataAdapter($Command)

      $DataSet = New-Object System.Data.DataSet

      $RecordCount = $dataAdapter.Fill($dataSet, "data")

      $DataSet.Tables[0]

 

      $Query = "select count(*) from " + $tableName

      $Command = New-Object MySql.Data.MySqlClient.MySqlCommand($Query, $Connection)

      $DataAdapter = New-Object MySql.Data.MySqlClient.MySqlDataAdapter($Command)

      $DataSet = New-Object System.Data.DataSet

      $RecordCount = $dataAdapter.Fill($dataSet, "data")

      $DataSet.Tables[0].Rows[0].ItemArray[0].ToString();

      [System.Console]::WriteLine("MySQL table was created successfully.");

      }

    Catch {

      Write-Host "ERROR : Unable to run query : $query `n$Error[0]"

     }

    Finally {

      $Connection.Close()

      }

  }

 

第二步:上传样本数据

Add-AzureAccount -Environment azurechinacloud

# Define the cluter variables

$clusterName = "hadoopjianw3"

$storageAccountName = "jianwstorage"

$containerName = "hadoopjianw3"

 

#Upload a file into Azure storage (DFS of Hadoop)

$StorageAccountKey = Get-AzureStorageKey -StorageAccountName $storageAccountName

$Ctx = New-AzureStorageContext
$StorageAccountName -StorageAccountKey $StorageAccountKey.Primary

Get-ChildItem -Path d:\log2copy.txt | Set-AzureStorageBlobContent -Container $containerName -Blob tutorials2/log2copy.txt -Context $Ctx

此处用到的样本文件log2copy.txt,含有以下样本数据(分隔符为tab跳格),该样本文件可见附件。

mytest  123

world    456

 

第三步:提交Sqoop任务到HDInsight

[System.Console]::WriteLine("Start Sqoop-MySQL test ...");

 #Add-AzureAccount

Add-AzureAccount -Environment azurechinacloud

 

# Define the cluter variables

$clusterName = "hadoopjianw3"

$storageAccountName = "jianwstorage"

$containerName = "hadoopjianw3"

 

# Connection string for Azure MySQL Database.

$conStr = "jdbc:mysql://$MySQLServerName/$MySQLExportedDBName"

#Hive file location

$exportDir = "wasb://$containerName@$storageAccountName.blob.core.chinacloudapi.cn/tutorials2"

$sqoopDef = New-AzureHDInsightSqoopJobDefinition -Command "export --connect $conStr --username $MySQLLogin --password $MySQLPassword --table $tableName --export-dir $exportDir --fields-terminated-by \t -m 1"

 

$sqoopJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $sqoopDef -Debug -Verbose

 

Wait-AzureHDInsightJob -WaitTimeoutInSeconds 3600 -Job $sqoopJob

Write-Host "Standard Error" -BackgroundColor Green

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $sqoopJob.JobId -StandardError

Write-Host "Standard Output" -BackgroundColor Green

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $sqoopJob.JobId -StandardOutput  

 

第四步:跟踪Sqoop任务并检查输出

查看HDInsight的控制台主页

 

查看Sqoop Job的执行结果:

 

查看MySQL中的结果数据

 

 

调试与问题解决:

如果建立的HDInsight集群采用的是较低版本的Hadoop,Sqoop任务执行过程中可能抛出以下错误,该错误的原因是在HDInsight集群中找不到对应的MySQL Driver,此类问题的手动解决方法是:用户手动copy相应的MySQL Driver(mysql-connector-java-5.1.36-bin.jar)到HDInsight的C:\apps\dist\sqoop-<version>\lib目录下。MySQL Driver可以在https://www.mysql.com/products/connector/下载。

15/09/21 03:35:12 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5.2.2.7.1-0004

15/09/21 03:35:12 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

15/09/21 03:35:13 INFO manager.SqlManager: Using default fetchSize of 1000

15/09/21 03:35:13 INFO tool.CodeGenTool: Beginning code generation

15/09/21 03:35:13 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

      at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:848)

      at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)

      at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:736)

      at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:759)

      at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:269)

      at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:240)

      at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:226)

      at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:295)

      at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1773)

      at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1578)

      at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)

      at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)

      at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)

      at org.apache.sqoop.Sqoop.run(Sqoop.java:143)

      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

      at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)

      at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)

      at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)

      at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

[DBG]: PS C:\WINDOWS\system32>> 

此问题也可以通过HDInsight中的action script来解决,使用脚本的方法安装MySQL Driver到目标目录。

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-script-actions/

 

本实践的完整PowerShell脚本如附件,Azure用户修改其中的(MySQL、Storage和HDInsight)参数既可应用。

 

参考文档:

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-sqoop/

https://azure.microsoft.com/en-us/documentation/articles/storage-powershell-guide-full/

https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/

sqoop-mysql-summary-3.2.ps1.txt