Run your PySpark Interactive Query and Batch Job in Visual Studio Code

We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. This interactivity brings the best properties of Python and Spark to developers and empowers you to gain faster insights.


Spark Job Submission on HDInsight 101

This article is part two of the Spark Debugging 101 series we initiated a few weeks ago. Here we discuss the ways Spark jobs can be submitted to HDInsight clusters, along with some common troubleshooting guidelines. Livy Batch Job Submission: Livy is an open-source REST interface for interacting with Apache Spark remotely from…
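As a rough sketch of what a Livy batch submission looks like, the snippet below builds the JSON body for a `POST /batches` request. The cluster URL, credentials, and application path are placeholders, not values from the article:

```python
import json

# Hypothetical endpoint: replace <cluster> with your HDInsight cluster name.
LIVY_URL = "https://<cluster>.azurehdinsight.net/livy/batches"

def build_batch_payload(app_path, args=None):
    """Build the JSON body for a Livy batch submission (POST /batches)."""
    payload = {"file": app_path}
    if args:
        payload["args"] = args
    return payload

payload = build_batch_payload(
    "wasbs:///example/jars/spark-examples.jar", args=["10"]
)
print(json.dumps(payload))

# Actually submitting the job would look roughly like this
# (requires the `requests` package and your cluster credentials):
# requests.post(LIVY_URL, json=payload,
#               headers={"Content-Type": "application/json"},
#               auth=("admin", "<password>"))
```

Livy responds with a batch id that you can poll via `GET /batches/{id}` to track the job's state.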

Spark Debugging on HDInsight 101

Apache Spark is an open-source processing framework for running large-scale data analytics applications. Built on an in-memory compute engine, Spark enables high-performance querying on big data. It leverages a parallel data processing framework that persists data in memory, and to disk if needed. This article details common ways of submitting Spark applications on our HDInsight…

PySpark: Appending columns to DataFrame when DataFrame.withColumn cannot be used

The sample Jupyter Python notebook described in this blog is available for download. In many Spark applications, a common use case is appending columns derived from one or more existing columns of a DataFrame during the data preparation or data transformation stages. DataFrame provides a convenient method of the form DataFrame.withColumn([string] columnName,…