How to allow Spark to access Microsoft SQL Server

  Today we will look at configuring Spark to access Microsoft SQL Server through JDBC. On HDInsight the Microsoft SQL Server JDBC jar is already installed. On Linux the path is /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar. If you need more information or to download the driver you can start here Microsoft SQL Server JDBC Spark needs to know the…

5

Multi-Stream support in SCP.NET Storm Topology

Streams are in the core of Apache Storm. In most cases topologies are based on a single input stream, however there are situations when one may need to start the topology with two or more input steams. User code to emit or receive from distinct streams at the same time is supported in SCP. To…

0

Using Azure SDK for Python

  Python is a great scripting tool with a large user base. In a recent support case I needed a way to constantly generate files with some random data in windows azure storage (wasb) in order to process them with Spark on HDInsight. Python, the Azure SDK for Python and a few lines of code…

0

A KMeans example for Spark MLlib on HDInsight

  Today we will take a look at Sparks’s module for MLlib or its built-in machine learning library Sparks MLlib Guide . KMeans is a popular clustering method. Clustering methods are used when there is no class to be predicted but instances are divided into groups or clusters. The clusters hopefully will represent some mechanism…

2

Understanding Spark’s SparkConf, SparkContext, SQLContext and HiveContext

  The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager. The resource manager can be YARN, or Spark’s cluster manager. In order to create a SparkContext you should first create a SparkConf. The SparkConf stores configuration…

1

Dealing with RequestRateTooLarge errors in Azure DocumentDB and testing performance

In Azure DocumentDB support, one of the most common errors we have seen as reported by our customers is RequestRateTooLargeException or HTTP Status code 429. For example, from an application using DocumentDB .Net SDK, we may see an error like this – System.AggregateException: One or more errors occurred. —> Microsoft.Azure.Documents.DocumentClientException: Exception: Microsoft.Azure.Documents.RequestRateTooLargeException, message: {“Errors”:[“Request rate…

2

How to configure Hortonworks HDP to access Azure Windows Storage

  Recently I was asked how to configure a Hortonworks HDP 2.3 cluster to access Azure Windows Storage. In this post we will go through the steps to accomplish this. The first step is to create an Azure Storage account from the Azure portal. My storage account is named clouddatalake. I choose the “local redundant”…

2

Collecting logs from Apache Storm cluster in HDInsight

While running an Apache Storm topology in a multi node storm cluster different components of the topology log in different files that are saved in different nodes in the cluster, depending on where that component is running. Today in this blog I will discuss the log files that are available in a storm cluster and…

0

Troubleshooting Oozie or other Hadoop errors with DEBUG logging

In troubleshooting Hadoop issues, we often need to review the logging of a specific Hadoop component. By default, the logging level is set to INFO or WARN for many Hadoop components like Oozie, Hive etc. and in many cases this level of logging is sufficient to trace the issue. However, in certain cases, INFO or…

1

Some things to consider for your Spark on HDInsight workload

  When it comes time to provision your Spark cluster on HDInsight we all want our workloads to execute fast. The Spark community has made some strong claims for better performance compared to mapreduce jobs. In this post I want to discuss two topics to consider when deploying your Spark application on an HDInsight cluster.  …

0