Understanding Spark’s SparkConf, SparkContext, SQLContext and HiveContext


 

The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager. The resource manager can be YARN, or Spark’s cluster manager. In order to create a SparkContext you should first create a SparkConf. The SparkConf stores configuration parameters that your Spark driver application will pass to SparkContext. Some of these parameters define properties of your Spark driver application and some are used by Spark to allocate resources on the cluster. Such as, the number, memory size and cores uses by the executors running on the workernodes. setAppName() gives your Spark driver application a name so you can identify it in the Spark or Yarn UI. You can review the documentation for Spark 1.3.1 for SparkConf to get a complete list of parameters. SparkConf documentation

 

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName(“MySparkDriverApp”).setMaster(“spark://master:7077”).set(“spark.executor.memory”, “2g”)

 

Now that we have a SparkConf we can pass it into SparkContext so our driver application knows how to access the cluster.

 

import org.apache.spark.SparkConf

import org.apache.spark.SparkContext

val conf = new SparkConf().setAppName(“MySparkDriverApp”).setMaster(“spark://master:7077”).set(“spark.executor.memory”, “2g”)

val sc = new SparkContext(conf)

 

You can review the documentation for Spark 1.3.1 for SparkContext to get a complete list of parameters. SparkContext documentation

Now that your Spark driver application has a SparkContext it knows what resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop’s resourcemanager (headnode) and nodemanager (workernode) will work to allocate a container for the executors. If the resources are available on the cluster the executors will allocate memory and cores based your configuration parameters. If you are using Sparks cluster manager, the SparkMaster (headnode) and SparkSlave (workernode) will be used to allocate the executors. Below is a diagram showing the relationship between the driver applications, the cluster resource manager and executors.

 

 

 

Each Spark driver application has its own executors on the cluster which remain running as long as the Spark driver application has a SparkContext. The executors run user code, run computations and can cache data for your application. The SparkContext will create a job that is broken into stages. The stages are broken into tasks which are scheduled by the SparkContext on an executor.

One of Sparks’s modules is SparkSQL. SparkSQL can be used to process structured data, so with SparkSQL your data must have a defined schema. In Spark 1.3.1, SparkSQL implements dataframes and a SQL query engine. SparkSQL has a SQLContext and a HiveContext. HiveContext is a super set of the SQLContext. Hortonworks and the Spark community suggest using the HiveContext. You can see below that when you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands. The same behavior occurs for pyspark. You can review the Spark 1.3.1 documentation for SQLContext and HiveContext at SQLContext documentation and HiveContext documentation

 

 

HDInsight provides multiple end user experiences to interact with your cluster. It has a Spark Job Server HDInsight spark job server that allows you to remotely copy and submit your jar to the cluster. This submission will run your driver application. In your jar you have to implement SparkConf, SparkContext and HiveContext. HDInsight also provides a notebook experience with Jupyter and Zeppelin. For the Zeppelin notebook it automatically creates the SparkContext and HiveContext for you. For Jupyter you must create them yourself. You can read more about Spark on HDInsight at HDInsight Spark Overview and HDInsight Spark Resource Management

 

I hope understanding the relationship between SparkConf, SparkContext, SQLContext, and HiveContext and how a Spark driver application uses them will makes your Spark on HDInsight project a successful one!

 

 

Bill

 

 

 

Comments (1)

  1. Ilya Geller says:

    SQL, Spark – they are obsolete.

    For instance, there are two sentences:

    a) ‘Pickwick!’

    b) 'That, with the view just mentioned, this Association has taken into its serious consideration a proposal, emanating from the aforesaid, Samuel Pickwick, Esq., G.C.M.P.C., and three other Pickwickians hereinafter named, for forming a new branch of United Pickwickians, under the title of The Corresponding Society of the Pickwick Club.'

    Evidently, that the ' Pickwick' has different importance into both sentences, in regard to extra information in both. This distinction is reflected as the phrases, which contain 'Pickwick', weights: the first has 1, the second – 0.11; the greater weight signifies stronger emotional ‘acuteness’; where the weight refers to the frequency that a phrase occurs in relation to other phrases.

    SQL cannot produce the above statistics – SQL is obsolete and out of business.

Skip to main content