SSIS Catalog and Project Deployment with PowerShell

This may be my shortest blog post ever as I get ready to sign off from work for the next three weeks. But before I do, I wanted to share a quick script to automate deployment for SSIS 2012 (and 2014). I can’t take full credit for this script as the foundation was taken from…


Programmatically Executing SSIS Packages

While working on the next iteration of my SSIS ETL Framework, I’ve discovered that the capabilities of the out-of-the-box Execute Package task are quite lacking. Luckily, with SQL Server 2012, it has never been easier to execute SSIS packages programmatically. In this post, we will look at two different options for executing SSIS packages from…


3 Little Piggy’s: Advanced #Pig Join Scenarios

One of the most common operations in any Pig job is the join. A join, much like the ones you work with in SQL Server, brings together two sets of data into one. These joins can happen in multiple different ways and Pig supports most of what you would expect including inner…


Streaming #Pig

As a C# developer, there are a number of opportunities available for writing code that is either used by or interacts with a Hadoop/HDInsight cluster. A number of these have been well publicized and documented. In fact, there is an entire .NET SDK for Hadoop (HERE) that will allow you to easily write streaming MapReduce…


Oink: Improving #Pig Development

Over the last couple (ok more than a couple) of months, we’ve taken a meandering stroll through the different parts and pieces that form the foundation of the Hadoop ecosystem. We’ve covered Hive, Mahout, Recommendation Engines and even a little bit about Pig. In this post we are going to again circle back to Pig…


Indexes & Views in #Hive

In my last Hive post, we introduced partitions and bucketing, both of which allow you to horizontally slice data to make it more manageable and easier to query. Staying the course, in this post we will introduce two more techniques to improve your experience in Hive through the use of indexes and views. Indexes In…


Partitions & Buckets in #Hive

In my previous post, we discussed the map, array and struct data types and their implementation in Hive. Continuing on the Hive theme, this post will introduce partitioning and bucketing as methods for segmenting large data sets to improve query performance. Partitions If you have previous experience working in the relational database world then inevitably…


Introduction to #Hive Collections

After a much-needed vacation in the sunny Florida Keys and some time away from the work and blogosphere world, it’s time to get back on the hamster wheel. Like most RDBMSs, Hive supports a number of different primitive data types including various size integers, precision floating point, boolean, timestamp and of course strings….


MapReduce Ninja Moves: Combiners, Shuffle & Doing A Sort

Who’s driving this car? At first glance it appears that as a developer, you have very little, if any, control over how MapReduce behaves. In some regards this is an accurate assessment. You have no control over when or where a MapReduce job runs, what data a specific map job will process or which reducer…


Azure Machine Learning – Data Preparation

In my last post (HERE) we started a more pragmatic look at the Azure Machine Learning service using the CRISP Data Mining methodology as our outline. We began by looking at Data Understanding, which is Phase 2 and includes data acquisition, data exploration and insight. In this post we will move on to Phase 3,…