How does Spark determine partitions for an RDD?

The most fundamental data structure in Spark is the RDD (Resilient Distributed Dataset). An RDD can have one or many partitions, and each partition is a subset of the data that resides on one node in your Spark cluster. The number of partitions is one of the factors that determine the parallelism of Spark task execution…
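To make the idea of a partition concrete, here is a minimal sketch in plain Python (not Spark itself) that splits a local collection into contiguous, roughly equal slices. The helper names `slice_positions` and `partition` are hypothetical, and the even-split rule is an assumption made for illustration; how Spark actually chooses partitions depends on the data source and configuration.

```python
def slice_positions(length, num_slices):
    """Yield (start, end) index pairs that cut `length` items into
    `num_slices` contiguous ranges of nearly equal size."""
    for i in range(num_slices):
        start = (i * length) // num_slices
        end = ((i + 1) * length) // num_slices
        yield start, end


def partition(data, num_slices):
    """Return a list of sub-lists; each sub-list stands in for one partition."""
    return [data[start:end] for start, end in slice_positions(len(data), num_slices)]


# Ten elements split into four illustrative "partitions".
parts = partition(list(range(10)), 4)
for index, part in enumerate(parts):
    print(index, part)
```

Each sub-list could be processed independently of the others, which is why the partition count bounds how many tasks can run in parallel on the data.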


Understanding and Using HDInsight Spark Streaming

There are plenty of blogs and materials out there about Spark Streaming. Most of them focus on the internals of Spark Streaming and how it works in detail. I think those are not best suited for developers or data scientists who want to use Spark Streaming. As I worked to enable Spark Streaming…


Performance Tuning for HDInsight Storm and Microsoft Azure EventHubs

Apache Storm is a popular real-time data processing framework. Microsoft Azure HDInsight provides a service to deploy a Storm cluster in the cloud. Customers can readily use HDInsight Storm clusters to process data from Azure EventHubs using a Java-based spout implementation. The EventHubSpout source code has been integrated into the Apache Storm trunk. You…


HDInsight Storm Topology Submission Via VNet

1. Introduction

To submit a Storm topology to an HDInsight cluster, a user can RDP to the headnode of the cluster and run the storm command. This is not always convenient. It is actually possible to submit a Storm topology from outside of an HDInsight cluster. The idea is to create an HDInsight Storm cluster with…


Hadoop Yarn memory settings in HDInsight

(Edit: thanks Mostafa for the valuable feedback; I updated this post with an explanation of the relationship between Yarn-based and Java-based memory settings.) There are several related memory settings for jobs running in an HDInsight cluster that most customers need to pay close attention to. When not set correctly, they will cause obscure failures…