Nodes in HDInsight

Knowing the types and functions of nodes in HDInsight is key to taking full advantage of the service. This article is aimed at users who are familiar with big data concepts but are newer to HDInsight. Please feel free to read the article and provide feedback even if you’re beyond the target audience for…


How WebHCat Works and How to Debug (Part 2)

Link to Part 1. 2. How to debug WebHCat 2.1. BadGateway (HTTP status code 502) This is a very generic message from the gateway nodes; we will cover some common cases and possible mitigations. This is the most common Templeton problem customers are seeing right now. 2.1.1. WebHCat service down This happens when the WebHCat server on…
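Before digging into gateway-side causes of a 502, it is worth confirming whether WebHCat (Templeton) itself is answering. A healthy server responds to a GET on its `/templeton/v1/status` endpoint with `{"status": "ok", "version": "v1"}`. As a rough sketch (the cluster DNS name and user below are placeholders, and on HDInsight the request goes through the cluster gateway over HTTPS rather than directly to WebHCat's port 50111):

```python
from urllib.parse import urlencode

# Hypothetical values: replace with your cluster's DNS name and gateway user.
CLUSTER = "mycluster.azurehdinsight.net"
USER = "admin"

def webhcat_status_url(cluster, user):
    """Build the WebHCat (Templeton) status-check URL.

    A healthy WebHCat server answers this GET with
    {"status": "ok", "version": "v1"}; a 502 from the gateway
    instead suggests the service behind it is down.
    """
    query = urlencode({"user.name": user})
    return f"https://{cluster}/templeton/v1/status?{query}"

print(webhcat_status_url(CLUSTER, USER))
```

You can issue the resulting URL with any HTTP client (supplying the gateway credentials); if the gateway returns 502 while this endpoint is unreachable, restarting the WebHCat service on the active headnode is a common first mitigation.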


How WebHCat Works and How to Debug (Part 1)

1. Overview and Goals One of the common questions our customers face is: why are my Hive, Pig, or Sqoop job submissions failing? Most likely something is wrong with your WebHCat service. In this article, we will try to answer some of the common questions, such as: What is WebHCat, sometimes also referred to as…


Garbage Collection and its performance impact

Hadoop is a beautiful abstraction that allows us to deal with the numerous complexities of data without delving into the details of the infrastructure. But once in a while, to see why the performance of your applications has stalled, one has to look under the hood and find ways to extract performance, or find why…


Restarting Storm EventHub Topology on a new cluster

Azure EventHub is a popular, highly scalable data streaming platform. More about Azure EventHub can be found here: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs It is very common to use Storm in conjunction with EventHub as a platform for downstream event processing. As part of this, an EventHub Spout is created, which receives events from the EventHub. The current implementation…


Using Oozie SLA on HDInsight clusters

Introduction Often we have several jobs running on our HDInsight clusters that have tight timeline requirements associated with them. These could be in terms of how much time it takes for a job to start, how long the job runs, the maximum time by which the job should complete, etc. Oozie…
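Those timeline requirements map onto Oozie's SLA fields: should-start, max-duration, and should-end. As a hedged sketch of what this can look like (the workflow name, action name, and contact address are hypothetical placeholders, and the action body is elided), an action in a workflow using the `uri:oozie:sla:0.2` schema can carry an `sla:info` block:

```xml
<workflow-app name="sla-demo-wf" xmlns="uri:oozie:workflow:0.5"
              xmlns:sla="uri:oozie:sla:0.2">
    <!-- start node, other actions, end node elided -->
    <action name="demo-action">
        <!-- action definition and ok/error transitions go here -->
        <sla:info>
            <sla:nominal-time>${nominalTime}</sla:nominal-time>
            <!-- alert if the action has not started 10 minutes past nominal time -->
            <sla:should-start>${10 * MINUTES}</sla:should-start>
            <!-- alert if it has not finished 30 minutes past nominal time -->
            <sla:should-end>${30 * MINUTES}</sla:should-end>
            <!-- alert if it runs longer than 30 minutes -->
            <sla:max-duration>${30 * MINUTES}</sla:max-duration>
            <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
            <sla:alert-contact>joe@example.com</sla:alert-contact>
        </sla:info>
    </action>
</workflow-app>
```

Oozie then tracks these SLA events per action, and they can be inspected through the Oozie web console or REST API.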


Spark Job Submission on HDInsight 101

This article is part two of the Spark Debugging 101 series we initiated a few weeks ago. Here we discuss ways in which Spark jobs can be submitted on HDInsight clusters, along with some common troubleshooting guidelines. So here goes. Livy Batch Job Submission Livy is an open source REST interface for interacting with Apache Spark remotely from…
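With Livy, a batch job is submitted by POSTing a JSON body to the cluster's `/livy/batches` endpoint. As a minimal sketch (the jar path, class name, and argument below are hypothetical placeholders, assuming the jar has already been uploaded to the cluster's default storage):

```python
import json

def livy_batch_payload(jar_path, main_class, args=None):
    """Build the JSON body for POST /livy/batches.

    jar_path   -- storage path of the application jar,
                  e.g. "wasbs:///example/jars/app.jar" (placeholder)
    main_class -- entry-point class of the Spark application
    args       -- optional list of string arguments for the application
    """
    payload = {
        "file": jar_path,
        "className": main_class,
    }
    if args:
        payload["args"] = list(args)
    return payload

body = livy_batch_payload("wasbs:///example/jars/app.jar",
                          "com.example.SparkApp", ["10"])
print(json.dumps(body))
```

Sending this body with `Content-Type: application/json` (and the cluster's gateway credentials) to `https://<clustername>.azurehdinsight.net/livy/batches` returns a batch id you can poll for status at `/livy/batches/<id>`.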


Spark Debugging on HDInsight 101

Apache Spark is an open source processing framework that runs large-scale data analytics applications. Built on an in-memory compute engine, Spark enables high-performance querying on big data. It leverages a parallel data processing framework that persists data in memory, and on disk if needed. This article details common ways of submitting Spark applications on our HDInsight…


Executing Spark SQL Queries using dotnet ODBC driver

Introduction HDInsight provides numerous ways of executing Spark applications on your cluster. This blog post outlines how to run Spark SQL queries on your cluster remotely from Visual Studio using C#. The examples explained below are intended to serve as a framework that you can extend to build your custom Spark SQL queries. Prerequisite…


OozieBot: Automated Oozie Workflow and Coordinator Generation

Introducing OozieBot, a tool to help customers automate Oozie job creation. Learn how to use OozieBot to generate Apache Oozie coordinators and workflows for Hive, Spark, and Shell actions, and run them on a Linux-based HDInsight cluster. Introduction Apache Oozie is a workflow/coordination system that manages Hadoop jobs. It is integrated with the…