How to allow Spark to access Microsoft SQL Server

Today we will look at configuring Spark to access Microsoft SQL Server through JDBC. On HDInsight the Microsoft SQL Server JDBC jar is already installed. On Linux the path is /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar. If you need more information or want to download the driver, you can start here: Microsoft SQL Server JDBC.

Spark needs to know the path to the sqljdbc4.jar. There are multiple ways to add it to Spark's classpath. Spark has two runtime environment properties that can do this: spark.driver.extraClassPath and spark.executor.extraClassPath. To review all the properties available, see Spark's Configuration - Spark 1.4.1 Documentation.

If you use spark-shell or spark-submit you can pass these properties with --conf. I like to add the properties to Spark's default configuration file at /etc/spark/conf/spark-defaults.conf. A third option is to include the sqljdbc4.jar in your assembly jar. This same technique works for other jars that your Spark application might need. Whichever technique you choose, Spark needs to know where to find the sqljdbc4.jar for both the driver application and the executors.
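
For example, using the HDInsight jar path above, the spark-defaults.conf entries would look like this (adjust the path to match your cluster's HDP version):

spark.driver.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar
spark.executor.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar

The equivalent on the command line:

spark-shell --conf "spark.driver.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar" --conf "spark.executor.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar"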

You can check the Environment tab in the Spark web UI, under the Spark Properties section, to verify the properties are set.
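
You can also verify from the shell itself; sc.getConf exposes the merged configuration:

sc.getConf.getOption("spark.driver.extraClassPath") // Some(path) if the property took effect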

 

Spark's API is very dynamic and changes are made with each new release, especially around JDBC. If you are going to use Spark with JDBC, I suggest reviewing the API documentation for the version of Spark you are using (Spark 1.4.1 API) to make sure the methods are still valid and behave the same way. Depending on the release, there are a few places to look for JDBC-related methods, including SQLContext, DataFrame, and JdbcRDD. Also notice that some methods are marked experimental and/or deprecated. Make sure you test your code.
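
For example, in Spark 1.4.1 the non-deprecated path is the DataFrameReader/DataFrameWriter API. A minimal sketch, assuming a DataFrame df and the same example connection string used in the snippets below:

//Spark 1.4.1: read/write through DataFrameReader/DataFrameWriter
import java.util.Properties

val props = new Properties()
props.setProperty("user", "ReportUser")
props.setProperty("password", "ReportUser")
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"

df.write.mode("overwrite").jdbc(url, "t2", props) // create/overwrite t2 and insert df's rows
val t2 = sqlContext.read.jdbc(url, "t2", props)   // load t2 back into a DataFrame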

Some issues to consider are:

  • Make sure the firewall allows traffic on port 1433 (the default SQL Server port).
  • If using Microsoft Azure SQL Database, tables require a primary key. Some of the methods create the table for you, but Spark's code does not create the primary key, so the table creation fails. One workaround is to pre-create the table yourself, as sketched after this list.
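
A minimal workaround sketch: pre-create the table with a primary key over plain JDBC, then let Spark append to it. The column definitions here are hypothetical, matching the t2 example below:

//Pre-create t2 with a primary key, then append from Spark
import java.sql.DriverManager

val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val conn = DriverManager.getConnection(url)
try {
  conn.createStatement().execute(
    "CREATE TABLE t2 (mykey NVARCHAR(10) NOT NULL PRIMARY KEY, myvalue NVARCHAR(128))")
} finally {
  conn.close()
}

// overwrite = false appends to the existing table instead of recreating it
// df.insertIntoJDBC(url, "t2", false)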

 

Here are some code snippets. A DataFrame is used to create the table t2 and insert data into it. The sqlContext is then used to load the data from the t2 table back into a DataFrame. I added spark.driver.extraClassPath and spark.executor.extraClassPath to my spark-defaults.conf file.

//Spark 1.4.1
//Insert data from a DataFrame
case class Conf(mykey: String, myvalue: String)

val data = sc.parallelize(Seq(Conf("1", "Delaware"), Conf("2", "Virginia"), Conf("3", "Maryland"), Conf("4", "South Carolina")))
val df = data.toDF() // toDF() comes from sqlContext.implicits._, auto-imported in spark-shell

val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val table = "t2"

// overwrite = true: Spark drops/creates t2 before inserting the rows
df.insertIntoJDBC(url, table, true)

//Load from the database using sqlContext
//(sqlContext.load is deprecated as of 1.4; see read.jdbc above)
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

// partitionColumn, lowerBound, upperBound and numPartitions control how
// Spark splits the read across parallel JDBC connections
val tbl = sqlContext.load("jdbc", Map(
  "url" -> url, "driver" -> driver, "dbtable" -> "t2",
  "partitionColumn" -> "mykey", "lowerBound" -> "0",
  "upperBound" -> "100", "numPartitions" -> "1"))

tbl.show()

If you run a Microsoft SQL Server Profiler trace while running the spark-shell, you can see the table being created, the data inserted, and then the data being read.

HDInsight and Spark make a great platform for processing and analyzing your data, but data often resides in a relational database system like Microsoft SQL Server. Allowing Spark to read and write data from Microsoft SQL Server lets you create a richer pipeline.

Hope this helps,

 

Bill

Comments (5)

  1. Ignacio López says:

    Hi there! Great article! I have set up an HDInsight Linux cluster and used a custom script to load Spark 1.5.1. I'm trying to connect to SQL Azure, but even though I followed the instructions I'm still getting an error stating that no suitable driver was found. I only configured the classpath on the head node, because I have no way of SSHing into the worker nodes. Any help would be appreciated.

    Thanks!

  2. Bill says:

    Hi, once you SSH to the head node you can then SSH to the worker nodes. From the head node try "ssh workernode0". You will be prompted for your SSH password.

    Bill

  3. Veera M says:

    Is this faster than BCPing the data out of SQL Server and copying it onto HDFS?

    My use case: I need to transfer 10 GB of data from SQL Server to HDFS.

    Veera

  4. Avnish Singh says:

    We are getting an issue with the SQLJDBC driver and the Spark SQL context: it retrieves dates from the DB as two days back.

    Example:

    if a DB table has a date like 16-12-2015:XXXX, when we fetch the data using the sqljdbc driver we get a date like 14-12-2015.

    Any help would be appreciated.

  5. Sri says:

    Hi Bill,

    I am trying to append data using the Spark code below, but I am hitting a "table already exists" exception. Is this a bug in Spark? I am using Spark 1.5.

    val sourcedfmode = sourcedf.write.mode("append")
    sourcedfmode.jdbc(TargetDBinfo.url, TargetDBinfo.table, targetprops)

    There is already an object named 'customer_spark' in the database.

    Thanks

    Sri
