Multi-Stream support in SCP.NET Storm Topology

Streams are at the core of Apache Storm. In most cases topologies are based on a single input stream; however, there are situations when one may need to start the topology with two or more input streams. SCP supports user code that emits to or receives from distinct streams at the same time. To…

Collecting logs from Apache Storm cluster in HDInsight

When an Apache Storm topology runs in a multi-node Storm cluster, its components log to different files that are saved on different nodes in the cluster, depending on where each component is running. In this blog post I will discuss the log files that are available in a Storm cluster and…

Troubleshooting Oozie or other Hadoop errors with DEBUG logging

When troubleshooting Hadoop issues, we often need to review the logs of a specific Hadoop component. By default, the logging level is set to INFO or WARN for many Hadoop components such as Oozie and Hive, and in many cases this level of logging is sufficient to trace the issue. However, in certain cases, INFO or…

Troubleshooting Hive query performance in HDInsight Hadoop cluster

One of the common support requests we get from customers using Apache Hive is: "my Hive query is running slow and I would like the job/query to complete much faster", or in more quantifiable terms, "my Hive query is taking 8 hours to complete and my SLA is 2 hours". Improving or tuning Hive…

Spark or Hadoop

Spark is the most active Apache project and gets a lot of press in the big data world. So how do you know if Spark is right for your project, and what is the difference between Spark and Hadoop when run on HDInsight? I’ll cover some of the differences between Spark and Hadoop…

How to access Hive using JDBC on HDInsight

While following up on a customer question on this topic recently, I realized that we have seen the same question come up from other users a few times, so I thought I would share a simple example of how to connect to HiveServer2 on Azure HDInsight using JDBC. For background, please review the Apache wiki and the…

Azure PowerShell 0.8.14 Released, fixes problems with pipelining HDInsight configuration cmdlets

We recently pushed out the 0.8.14 release of Azure PowerShell. This release includes some updates to the following cmdlets to ensure that values passed in via the PowerShell pipeline, or via the -Config parameter, are maintained: Set-AzureHDInsightDefaultStorage, Add-AzureHDInsightStorage, Add-AzureHDInsightMetastore. Previously, if you had done something like: $myconfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 2 -ClusterType HBase $myconfig =…

Some Commonly Used YARN Memory Settings

We were recently working on an out-of-memory issue that was occurring with certain workloads on HDInsight clusters. I thought it might be a good time to write about this topic, based on our recent experience troubleshooting memory issues. There are a few memory settings that can be tuned to suit your specific…


How to use HBase Java API with HDInsight HBase cluster, part 1

Recently we worked with a customer who was trying to use the HBase Java API to interact with an HDInsight HBase cluster. Having worked with the customer and tried to follow our existing documentation here and here, we realized that it may be helpful if we clarify a few things around HBase Java API connectivity to…

How to use parameter substitution with Pig Latin and PowerShell

When running Pig in a production environment, you’ll likely have one or more Pig Latin scripts that run on a recurring basis (daily, weekly, monthly, etc.) that need to locate their input data based on when or where they are run. For example, you may have a Pig job that performs daily log ingestion by…
