Understanding Spark’s SparkConf, SparkContext, SQLContext and HiveContext


The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager, which can be YARN or Spark's standalone cluster manager. To create a SparkContext you should first create a SparkConf. The SparkConf stores the configuration parameters that your Spark driver application will pass to the SparkContext. Some of these parameters define properties of your Spark driver application, and some are used by Spark to allocate resources on the cluster, such as the number of executors running on the worker nodes and the memory and cores each one uses. setAppName() gives your Spark driver application a name so you can identify it in the Spark or YARN UI. You can review the Spark 1.3.1 documentation for SparkConf to get a complete list of parameters: SparkConf documentation


import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("MySparkDriverApp").setMaster("spark://master:7077").set("spark.executor.memory", "2g")


Now that we have a SparkConf, we can pass it to the SparkContext constructor so our driver application knows how to access the cluster.


import org.apache.spark.SparkConf

import org.apache.spark.SparkContext

val conf = new SparkConf().setAppName("MySparkDriverApp").setMaster("spark://master:7077").set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)


You can review the Spark 1.3.1 documentation for SparkContext to get a complete list of parameters: SparkContext documentation

Now that your Spark driver application has a SparkContext, it knows which resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop's ResourceManager (head node) and NodeManagers (worker nodes) work together to allocate containers for the executors. If the resources are available on the cluster, the executors allocate memory and cores based on your configuration parameters. If you are using Spark's standalone cluster manager, the Spark master (head node) and Spark slaves (worker nodes) are used to allocate the executors. Below is a diagram showing the relationship between the driver applications, the cluster resource manager, and the executors.
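As a sketch of how those resource parameters are expressed, executor sizing is just additional entries on the same SparkConf. The values below are hypothetical; choose numbers your cluster can actually supply.

```scala
import org.apache.spark.SparkConf

// Hypothetical sizing values: tune these to your cluster's capacity.
val conf = new SparkConf()
  .setAppName("MySparkDriverApp")
  .set("spark.executor.instances", "4") // number of executors (YARN mode)
  .set("spark.executor.memory", "2g")   // memory per executor
  .set("spark.executor.cores", "2")     // cores per executor
```

Anything not set here can also be supplied at submission time, so the jar does not have to hard-code cluster-specific values.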




Each Spark driver application has its own executors on the cluster, which remain running as long as the Spark driver application has a SparkContext. The executors run user code, perform computations, and can cache data for your application. When an action is called, the SparkContext creates a job that is broken into stages; the stages are broken into tasks, which are scheduled by the SparkContext on the executors.
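To make the job/stage/task breakdown concrete, here is a minimal local-mode sketch (the local[2] master and the sample data are assumptions for illustration only). The reduceByKey shuffle splits the job into two stages, and collect() is the action that actually submits it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[2] keeps the example self-contained; on a cluster you would
// pass the real master URL instead.
val sc = new SparkContext(
  new SparkConf().setAppName("StagesDemo").setMaster("local[2]"))

val counts = sc.parallelize(Seq("spark hdinsight spark"))
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)  // shuffle boundary: the job gets a second stage here

val result = counts.collect()  // the action: this call submits the job
// result holds (spark,2) and (hdinsight,1), in either order
sc.stop()
```

Until collect() runs, the transformations only build a lineage graph; nothing is scheduled on the executors.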

One of Spark's modules is SparkSQL. SparkSQL can be used to process structured data, so with SparkSQL your data must have a defined schema. In Spark 1.3.1, SparkSQL implements DataFrames and a SQL query engine. SparkSQL has a SQLContext and a HiveContext; HiveContext is a superset of SQLContext. Hortonworks and the Spark community suggest using the HiveContext. You can see below that when you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands. The same behavior occurs for pyspark. You can review the Spark 1.3.1 documentation for SQLContext and HiveContext at SQLContext documentation and HiveContext documentation
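In your own driver application (outside spark-shell) you build the HiveContext from an existing SparkContext yourself. A minimal sketch, assuming a SparkContext named sc already exists and a hypothetical people.json file sits in your default storage:

```scala
import org.apache.spark.sql.hive.HiveContext

// spark-shell pre-creates this as sqlContext; here we build it ourselves.
val sqlContext = new HiveContext(sc)

// people.json is a hypothetical input; each line holds one JSON record,
// e.g. {"name":"Ann","age":30}. The schema is inferred from the data.
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// The HiveContext accepts SQL queries as well as Hive commands.
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```

Registering the DataFrame as a temp table is what makes it visible to the SQL engine for the lifetime of this context.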



HDInsight provides multiple end-user experiences for interacting with your cluster. It has a Spark Job Server (HDInsight spark job server) that allows you to remotely copy your jar to the cluster and submit it; this submission runs your driver application, so in your jar you have to create the SparkConf, SparkContext, and HiveContext yourself. HDInsight also provides a notebook experience with Jupyter and Zeppelin. The Zeppelin notebook automatically creates the SparkContext and HiveContext for you; in Jupyter you must create them yourself. You can read more about Spark on HDInsight at HDInsight Spark Overview and HDInsight Spark Resource Management
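For the jar-submission path, a minimal driver might look like the following sketch. The object name and the query are placeholders; the point is that a submitted jar wires up SparkConf, SparkContext, and HiveContext itself rather than having them pre-created, as spark-shell and Zeppelin do.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical skeleton for a jar submitted to the cluster.
object MySparkDriverApp {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master is supplied at submission time.
    val conf = new SparkConf().setAppName("MySparkDriverApp")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    sqlContext.sql("SHOW TABLES").show() // placeholder workload

    sc.stop() // release the executors when the driver is done
  }
}
```

Leaving the master out of the code keeps the same jar usable in local mode, on Spark standalone, and on YARN.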


I hope understanding the relationship between SparkConf, SparkContext, SQLContext, and HiveContext, and how a Spark driver application uses them, will make your Spark on HDInsight project a successful one!







Comments (2)

  1. Ilya Geller says:

    SQL, Spark – they are obsolete.

    For instance, there are two sentences:

    a) ‘Pickwick!’

    b) 'That, with the view just mentioned, this Association has taken into its serious consideration a proposal, emanating from the aforesaid, Samuel Pickwick, Esq., G.C.M.P.C., and three other Pickwickians hereinafter named, for forming a new branch of United Pickwickians, under the title of The Corresponding Society of the Pickwick Club.'

    Evidently, 'Pickwick' has a different importance in the two sentences, owing to the extra information in each. This distinction is reflected in the weights of the phrases that contain 'Pickwick': the first has 1, the second 0.11; the greater weight signifies stronger emotional 'acuteness', where the weight refers to the frequency with which a phrase occurs relative to other phrases.

    SQL cannot produce the above statistics – SQL is obsolete and out of business.

    1. David Howell says:

      Any Turing complete language can in theory calculate anything. ANSI SQL may not be Turing complete, but T-SQL and Spark and Hive SQL certainly are. Spark is an in-memory cluster execution engine, written in Scala, which runs on the Java virtual machine. Scala is a functional language with object-oriented features, but Spark implements what is essentially a MySQL-inspired SQL DSL. Why? Because SQL does all the things you usually need to do with relational tables.

      The problem you pose is as simple as counting words or phrases in two strings and fuzzy matching to your target 'Pickwick'. I can use either T-SQL or SparkSQL to accomplish that quite easily. In the case of T-SQL, I would probably use a C# CLR function to split the string first, or turn on full-text search in the database to make it easier. In the case of Spark, well, seriously, this trivial example you've put forward is the essence of the classic Twitter scraping exercise, which is the analytics equivalent of Hello World.

      SQL and declarative query languages as a paradigm are not dead at all. Why are all these hadoop and nosql technologies re-implementing SQL features if you are right?

      The truth is that unstructured data as a panacea is a lie. There is only data, and to extract meaning, structure must be applied. SQL is an amazing abstraction that separates what the analyst wants from how the query is executed, and in mature products like SQL server, trust me, the query optimizer is way smarter than you.

      Spark is currently one of the most popular real time analytics engines for hadoop. How is it obsolete? Maybe in a few years when the next thing takes over. Maybe the NSA or the Russian equivalent have long since moved on and it’s obsolete there, but in the real world it’s pretty cutting edge.
