HDInsight provides numerous ways of executing Spark applications on your cluster. This blogpost outlines how to run Spark SQL queries on your cluster remotely from Visual Studio using C#. The examples explained below is intended to serve as a framework on which you can extend it to build your custom Spark SQL queries.
Before you begin, you must have the following:
- An Azure Subscription: See Get Azure free trial.
- A HDInsight cluster: See Get Started with HDInsight on Linux.
- Visual Studio 2015: See Get Visual Studio 2015.
- Microsoft Spark ODBC driver: HDInsight Spark uses Simba driver to connect to the cluster and run queries remotely. Download the driver here.
The project that contains the sample scenario is hosted in Microsoft GitHub. Download the source from here: https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql
The sample described in the source is intended to do the following:
- Connect to your HDInsight Spark cluster using ODBC connection string.
- Take a list of queries as input and run each of the them serially on the cluster.
- Measure time taken by each query to complete.
- Measure average, min and max query running time.
The queries used in this example perform the following operations on your cluster:
- List all tables in your cluster.
- Drop the table mysampletable if it exists.
- Create an external table mysampletable which contains device platform and number of devices using that platform.
- Extract all records from hivesampletable which come built in with your HDInsight cluster, group them by device platform and write to mysampletable.
- Collect all entries from mysampletable and print them.
Executing the Sample
- Go to the source directory downloaded from Github – cd hdinsight-dotnet-odbc-spark-sql
- Open the ODBC Client csproj file using Visual Studio 2015
- Navigate to Program.cs using Solution Explorer
- Input your username, password and cluster name in the main function.
- Execute the project by clicking the “Start” button at the top navigation bar.
Customizing the Sample
This sample provides a framework which can be customized to fit your needs. An example scenario for customization could be as follows:
- Running your custom queries: The sample has a queryList which contains the queries explained in this section. Replace the queries in this list by you’re the queries you want to run.
- Changing timeouts: The sample has currently set command timeout to 5 minutes and the connection timeout to 10 minutes. While this should work perfectly fine for the sample described here, you might want to change it when running long running queries.