Implementing a MapReduce Join with Hadoop and the .Net Framework

11/13/2012

I have often been asked how one implements a Join whilst writing MapReduce code. As such, I thought it would be useful to add a further sample demonstrating how this is achieved. There are multiple mechanisms one can employ to perform a Join operation, and the one discussed here is a Reduce Side 1-to-many join.

As always this sample, amongst others, can be found in the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code download, within the Samples folder.

Join Semantics

Say one wants to join two sets of data, A and B, via a common attribute. Set A could be defined as a collection of tuples of the form (k_i, a_x, A_x), where k represents the key value on which we want to do the join, a the set item unique identifier, and A the other attributes. For this set each k value would correspond to a unique value of a.

Set B would be similarly defined as a collection of tuples of the form (k_i, b_y, B_y). In this case each value of k would equate to multiple values of b.

As an example, set A could be thus represented by an OrderHeader type and B the corresponding OrderDetail types. The k and a values in this case would represent the sales order identifier. The b value would represent the sales order detail identifier.

The basic concept is that the MapReduce job will create a new set C, which would be defined by the tuple collection (k_i, a_x, b_y, A_x, B_y). For each key identifier k, there would be a single value of a, and multiple values for b.

(k₆₄, a₅₂, b₁₀₆, A₅₂, B₁₀₆)
(k₆₄, a₅₂, b₁₂₁, A₅₂, B₁₂₁)
…
(k₆₄, a₅₂, b₂₃₄, A₅₂, B₂₃₄)

Continuing the example, this new set C could be represented by a new type that aggregates the OrderHeader and OrderDetail types, say OrderLine. The SalesOrderID would be the common key attribute.

To perform the Reduce Side join the Mapper would read in the data representing both sets. When processing set A data the emitted value would be (k_i, a_x, A_x, Ø, Ø), and for set B the value (k_i, Ø, Ø, b_y, B_y). In both cases the emitted key would be the common key attribute value.

The Reducer would receive the set values for each shared key attribute, namely a single (k_i, a_x, A_x, Ø, Ø) value and multiple (k_i, Ø, Ø, b_y, B_y) values. It would then emit the full set of (k_i, a_x, b_y, A_x, B_y) values for each key attribute.
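The flow just described can be sketched in memory, without Hadoop, using LINQ. This is only an illustration: the Joined type, the ReduceSideJoinSketch class and the sample values are hypothetical names invented for the sketch, and GroupBy stands in for the shuffle that delivers all values for a key to one Reducer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the (k, a, A, b, B) tuples; a null part plays the role of Ø.
public class Joined
{
    public int Key;
    public string A;   // set A payload, null when the record came from set B
    public string B;   // set B payload, null when the record came from set A
}

public static class ReduceSideJoinSketch
{
    // "Map" phase: tag each record with the join key and leave the other side empty.
    public static IEnumerable<Joined> Map(IEnumerable<Tuple<int, string>> setA, IEnumerable<Tuple<int, string>> setB)
    {
        foreach (var a in setA) yield return new Joined { Key = a.Item1, A = a.Item2, B = null };
        foreach (var b in setB) yield return new Joined { Key = b.Item1, A = null, B = b.Item2 };
    }

    // "Reduce" phase: per key, find the single A value and pair it with every B value.
    public static IEnumerable<Joined> Reduce(IEnumerable<Joined> mapped)
    {
        foreach (var group in mapped.GroupBy(m => m.Key))
        {
            var header = group.FirstOrDefault(m => m.A != null);
            if (header == null) continue; // no set A record for this key, so nothing to join
            foreach (var detail in group.Where(m => m.B != null))
                yield return new Joined { Key = group.Key, A = header.A, B = detail.B };
        }
    }

    public static void Main()
    {
        var setA = new[] { Tuple.Create(64, "a52") };
        var setB = new[] { Tuple.Create(64, "b106"), Tuple.Create(64, "b121"), Tuple.Create(99, "b7") };
        foreach (var row in Reduce(Map(setA, setB)))
            Console.WriteLine("{0}\t{1}\t{2}", row.Key, row.A, row.B);
    }
}
```

Key 64 yields one joined row per set B value, while key 99 has no set A record and is dropped, which matches inner join behaviour.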

As an example, when processing orders the input data would be parsed into either an OrderHeader or OrderDetail object. In both cases the Mapper would emit an OrderLine item, with a null value for the missing data. The Reducer would then read these values for a specific SalesOrderID key value and emit the set of complete OrderLine values.

So as an example, let's do a join between order header and detail information, from the SQL Server AdventureWorks sample database. The sample download also includes BCP scripts to extract sales data from the sample database. For completeness I have also written the sample in both C# (with heavy use of LINQ) and F#.

Defining the Types

To start we need to define types that represent the sales header and detail information. These type definitions, listed below, are OrderHeader and OrderDetail. As mentioned we also need an aggregate type, called OrderLine, that aggregates both the header and detail types.

C# Classes

/// Order Header Information
public class OrderHeader
{
    public int SalesOrderID { get; set; }
    public DateTime OrderDate { get; set; }
    public DateTime ShipDate { get; set; }
    public int CustomerID { get; set; }
    public string SalesOrderNumber { get; set; }
    public string PurchaseOrderNumber { get; set; }
    public string AccountNumber { get; set; }
}

/// Order Detail Information
public class OrderDetail
{
    public int SalesOrderID { get; set; }
    public int SalesOrderDetailID { get; set; }
    public int ProductID { get; set; }
    public decimal OrderQty { get; set; }
    public decimal UnitPrice { get; set; }
}

/// Order Line Information
public class OrderLine
{
    public OrderHeader OrderHeader { get; set; }
    public OrderDetail OrderDetail { get; set; }
}

F# Records

/// Order Header Information
type OrderHeader = { SalesOrderID:int; OrderDate:DateTime; ShipDate:DateTime; CustomerID:int; SalesOrderNumber:string; PurchaseOrderNumber:string; AccountNumber:string }
/// Order Detail Information
type OrderDetail = { SalesOrderID:int; SalesOrderDetailID:int; ProductID:int; OrderQty:decimal; UnitPrice:decimal }
/// Order Line Information
type OrderLine = { OrderHeader:OrderHeader; OrderDetail:OrderDetail }

The OrderLine aggregated type is used as the output from the Mapper and the Reducer.

Mapper Processing

The purpose of the mapper is to read in both the header and detail data. For each data record the mapper outputs a value of an OrderLine instance, along with a key value of the SalesOrderID value. When processing a sales header data item an OrderLine is output with a null value for the OrderDetail; and of course vice-versa when processing a detail data item.

The code for parsing the data items is as below. In this instance I have made the determination of the data item type by inspecting the second data value. In this case I am looking for an OrderDate; however one can adopt many different approaches based on the input data.

C# Mapper

// Processes the Order Header and Detail files (In Memory)
class OrderJoinMemoryMapper : MapperBaseText<OrderLine>
{
    public override IEnumerable<Tuple<string, OrderLine>> Map(string value)
    {
        var splits = value.Split('\t');
        int salesOrderID = Int32.Parse(splits[0]);
        if (splits[1].Contains("-"))
        {
            // A date in the second field marks a header record
            DateTime orderDate = DateTime.Parse(splits[1]);
            DateTime shipDate = DateTime.Parse(splits[2]);
            int customerID = Int32.Parse(splits[3]);
            string salesOrderNumber = splits[4];
            string purchaseOrderNumber = splits[5];
            string accountNumber = splits[6];
            var header = new OrderHeader()
            {
                SalesOrderID = salesOrderID,
                OrderDate = orderDate,
                ShipDate = shipDate,
                CustomerID = customerID,
                SalesOrderNumber = salesOrderNumber,
                PurchaseOrderNumber = purchaseOrderNumber,
                AccountNumber = accountNumber
            };
            yield return Tuple.Create(splits[0], new OrderLine() { OrderHeader = header, OrderDetail = null });
        }
        else
        {
            // Otherwise the line is a detail record
            int salesOrderDetailID = Int32.Parse(splits[1]);
            int productID = Int32.Parse(splits[2]);
            decimal orderQty = Decimal.Parse(splits[3]);
            decimal unitPrice = Decimal.Parse(splits[4]);
            var detail = new OrderDetail()
            {
                SalesOrderID = salesOrderID,
                SalesOrderDetailID = salesOrderDetailID,
                ProductID = productID,
                OrderQty = orderQty,
                UnitPrice = unitPrice
            };
            yield return Tuple.Create(splits[0], new OrderLine() { OrderHeader = null, OrderDetail = detail });
        }
    }
}

F# Mapper

// Processes the Order Header and Detail files (In Memory)
type OrderJoinMemoryMapper() =
    inherit MapperBaseText<OrderLine>()
    // Performs the split into the correct type
    let (|Header|Detail|Unknown|) (value:string) =
        try
            let splits = value.Split('\t')
            let salesOrderID = Int32.Parse(splits.[0])
            if splits.[1].Contains("-") then
                // Processing a Header record
                let orderDate = DateTime.Parse(splits.[1])
                let shipDate = DateTime.Parse(splits.[2])
                let customerID = Int32.Parse(splits.[3])
                let salesOrderNumber = splits.[4]
                let purchaseOrderNumber = splits.[5]
                let accountNumber = splits.[6]
                Header (splits.[0],
                        { OrderHeader.SalesOrderID = salesOrderID; OrderDate = orderDate; ShipDate = shipDate; CustomerID = customerID;
                          SalesOrderNumber = salesOrderNumber; PurchaseOrderNumber = purchaseOrderNumber; AccountNumber = accountNumber })
            else
                // Processing a detail record
                let salesOrderDetailID = Int32.Parse(splits.[1])
                let productID = Int32.Parse(splits.[2])
                let orderQty = Decimal.Parse(splits.[3])
                let unitPrice = Decimal.Parse(splits.[4])
                Detail (splits.[0],
                        { OrderDetail.SalesOrderID = salesOrderID; SalesOrderDetailID = salesOrderDetailID; ProductID = productID;
                          OrderQty = orderQty; UnitPrice = unitPrice })
        with
        // The Parse methods raise FormatException on malformed values; short lines raise IndexOutOfRangeException
        | :? System.ArgumentException
        | :? System.FormatException
        | :? System.IndexOutOfRangeException -> Unknown
    // Map the data from input name/value to output name/value
    override self.Map (value:string) =
        seq {
            match value with
            | Header (key, header) -> yield (key, { OrderLine.OrderHeader = header; OrderDetail = Unchecked.defaultof<OrderDetail> })
            | Detail (key, detail) -> yield (key, { OrderLine.OrderHeader = Unchecked.defaultof<OrderHeader>; OrderDetail = detail })
            | Unknown -> ()
        }

Literally that is it for the Mapper. It determines the type of input line and emits an OrderLine item.

As you can see the key is the SalesOrderID value. More on this later, when talking about a performance optimization.
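As mentioned, inspecting the second value for a date is only one way of telling the two record types apart. A minimal sketch of an alternative is shown here, which combines the field count with TryParse so that malformed lines are classified rather than raising an exception. The RecordKind and RecordClassifier names are hypothetical, and the field counts (at least 7 for a header, at least 5 for a detail) are an assumption taken from the fields the Mapper code reads.

```csharp
using System;
using System.Globalization;

public enum RecordKind { Header, Detail, Unknown }

// Hypothetical helper; not part of the sample framework.
public static class RecordClassifier
{
    public static RecordKind Classify(string line)
    {
        if (string.IsNullOrEmpty(line)) return RecordKind.Unknown;
        var splits = line.Split('\t');

        // Both record types start with the numeric SalesOrderID.
        int salesOrderID;
        if (!Int32.TryParse(splits[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out salesOrderID))
            return RecordKind.Unknown;

        // Assumed header layout: second field is the order date.
        DateTime orderDate;
        if (splits.Length >= 7 &&
            DateTime.TryParse(splits[1], CultureInfo.InvariantCulture, DateTimeStyles.None, out orderDate))
            return RecordKind.Header;

        // Assumed detail layout: second field is the numeric SalesOrderDetailID.
        int salesOrderDetailID;
        if (splits.Length >= 5 &&
            Int32.TryParse(splits[1], NumberStyles.Integer, CultureInfo.InvariantCulture, out salesOrderDetailID))
            return RecordKind.Detail;

        return RecordKind.Unknown;
    }

    public static void Main()
    {
        Console.WriteLine(Classify("43667\t2005-07-01 00:00:00\t2005-07-08 00:00:00\t29974\tSO43667\tPO15428132599\t10-4020-000646"));
        Console.WriteLine(Classify("43667\t77\t710\t3\t5.7000"));
        Console.WriteLine(Classify("not a record"));
    }
}
```

Parsing with the invariant culture keeps the classification independent of the regional settings of the machine running the job, which may matter when the data is extracted on one machine and processed on another.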

Reducer Processing

The purpose of the Reducer is to locate the OrderHeader value for each SalesOrderID key value, and for each corresponding OrderDetail value output a complete OrderLine, consisting of the located header value and the detail values.

This version of the code, again shown below, collects the OrderDetail items in a List. Once all the values have been read, the sequence of OrderLine values is returned.

C# Reducer

// Performs the combined data 1-many join (In Memory)
class OrderJoinMemoryReducer : ReducerBase<OrderLine, OrderLine>
{
    public override IEnumerable<Tuple<string, OrderLine>> Reduce(string key, IEnumerable<OrderLine> values)
    {
        OrderHeader orderHeader = null;
        var orderDetails = new List<OrderDetail>();
        // Cache every detail, as the header can arrive at any position
        foreach (var line in values)
        {
            if (line.OrderDetail != null)
            {
                orderDetails.Add(line.OrderDetail);
            }
            else
            {
                orderHeader = line.OrderHeader;
            }
        }
        if (orderHeader != null)
        {
            return orderDetails.Select(detail => Tuple.Create(key, new OrderLine() { OrderHeader = orderHeader, OrderDetail = detail }));
        }
        else
        {
            // No header was found for this key, so there is nothing to join
            return Enumerable.Empty<Tuple<string, OrderLine>>();
        }
    }
}

F# Reducer

// Performs the combined data 1-many join (In Memory)
type OrderJoinMemoryReducer() =
    inherit ReducerBase<OrderLine, OrderLine>()
    override self.Reduce (key:string) (values:seq<OrderLine>) =
        let orderHeader = ref Unchecked.defaultof<OrderHeader>
        let hasValue value = not (obj.ReferenceEquals (value, Unchecked.defaultof<_>))
        // Seq.choose is lazy, so force it with Seq.toList to capture the header before it is tested
        let orderDetails =
            values
            |> Seq.choose (fun item ->
                if (hasValue item.OrderDetail) then
                    Some(item.OrderDetail)
                else
                    orderHeader := item.OrderHeader
                    None)
            |> Seq.toList
        // Test the header held in the ref cell, not the ref cell itself
        if (hasValue (!orderHeader)) then
            orderDetails
            |> Seq.map (fun item ->
                (key, { OrderLine.OrderHeader = !orderHeader; OrderDetail = item }))
        else
            Seq.empty

As you can see this code has a memory limitation. For each SalesOrderID value all the corresponding detail values need to be cached in memory, because there is no way to know when the header value will arrive. This may not be an issue with a small number of details per header, but what about data where this number can be extremely large?

This limitation however is easily overcome through the use of a secondary sort. Basically two key values are used to ensure the header value always arrives first within the Reducer.

Secondary Sort Optimization

When processing the order data for a particular key within the reducer, if one knows that the first value will always be the header data then this can be saved for detail data processing. Thus when reading the subsequent detail lines the aggregated OrderLine instance values can be emitted directly without the intermediate List processing step. This is where a secondary sort comes into play.

If one uses two keys rather than one, it is possible to sort the data such that the first value, for each SalesOrderID, will be the header data. This can be achieved in our case by using a secondary sort key of the SalesOrderDetailID. For the header one can then just use a value of zero to ensure it is the first value, as all detail items have a positive value. Of course, different approaches can be taken depending on the data domain.

Although the data is sorted on two separate keys, the partitioning must be such that the data for each SalesOrderID is sent to a single Reducer; spanning multiple SalesOrderDetailID values.
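The interplay between sorting on two keys and partitioning on one can be sketched in memory as follows. This is a simulation under stated assumptions rather than how Hadoop or the framework implements it: SecondarySortSketch, Partition, SortForReducer and JoinSorted are hypothetical names, the modulo partitioner is a simple stand-in, and each record is a (SalesOrderID, SalesOrderDetailID, payload) tuple where a detail ID of zero marks the header.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SecondarySortSketch
{
    // Partition on the first key only, so every record for an order reaches the same reducer.
    public static int Partition(int salesOrderID, int reducerCount)
    {
        return (salesOrderID & int.MaxValue) % reducerCount;
    }

    // Sort on both keys: the header (secondary key 0) precedes all details of its order.
    public static List<Tuple<int, int, string>> SortForReducer(IEnumerable<Tuple<int, int, string>> records)
    {
        return records.OrderBy(r => r.Item1).ThenBy(r => r.Item2).ToList();
    }

    // Streams the sorted records: remember the header, then emit each detail as it arrives.
    public static IEnumerable<string> JoinSorted(IEnumerable<Tuple<int, int, string>> sorted)
    {
        int? currentOrder = null;
        string header = null;
        foreach (var r in sorted)
        {
            if (currentOrder != r.Item1)
            {
                // A new order starts; forget the previous order's header
                currentOrder = r.Item1;
                header = null;
            }
            if (r.Item2 == 0)
                header = r.Item3;
            else if (header != null)
                yield return r.Item1 + "\t" + header + "\t" + r.Item3;
        }
    }

    public static void Main()
    {
        var records = new[]
        {
            Tuple.Create(43667, 78, "detail-78"),
            Tuple.Create(43667, 0, "header"),
            Tuple.Create(43667, 77, "detail-77")
        };
        foreach (var line in JoinSorted(SortForReducer(records)))
            Console.WriteLine(line);
    }
}
```

Because the header is seen first, only a single header value is held at any time, regardless of how many details an order has.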

Using this approach here is the code for the modified Mapper logic:

C# Mapper

// Processes the Order Header and Detail files
class OrderJoinMapper : MapperBaseText<OrderLine>
{
    public override IEnumerable<Tuple<string, OrderLine>> Map(string value)
    {
        var splits = value.Split('\t');
        int salesOrderID = Int32.Parse(splits[0]);
        if (splits[1].Contains("-"))
        {
            DateTime orderDate = DateTime.Parse(splits[1]);
            DateTime shipDate = DateTime.Parse(splits[2]);
            int customerID = Int32.Parse(splits[3]);
            string salesOrderNumber = splits[4];
            string purchaseOrderNumber = splits[5];
            string accountNumber = splits[6];
            // Secondary key of zero sorts the header ahead of its details
            string key = Context.FormatKeys(splits[0], "0");
            var header = new OrderHeader()
            {
                SalesOrderID = salesOrderID,
                OrderDate = orderDate,
                ShipDate = shipDate,
                CustomerID = customerID,
                SalesOrderNumber = salesOrderNumber,
                PurchaseOrderNumber = purchaseOrderNumber,
                AccountNumber = accountNumber
            };
            yield return Tuple.Create(key, new OrderLine() { OrderHeader = header, OrderDetail = null });
        }
        else
        {
            int salesOrderDetailID = Int32.Parse(splits[1]);
            int productID = Int32.Parse(splits[2]);
            decimal orderQty = Decimal.Parse(splits[3]);
            decimal unitPrice = Decimal.Parse(splits[4]);
            // Secondary key is the SalesOrderDetailID
            string key = Context.FormatKeys(splits[0], splits[1]);
            var detail = new OrderDetail()
            {
                SalesOrderID = salesOrderID,
                SalesOrderDetailID = salesOrderDetailID,
                ProductID = productID,
                OrderQty = orderQty,
                UnitPrice = unitPrice
            };
            yield return Tuple.Create(key, new OrderLine() { OrderHeader = null, OrderDetail = detail });
        }
    }
}

F# Mapper

// Processes the Order Header and Detail files
type OrderJoinMapper() =
    inherit MapperBaseText<OrderLine>()
    // Performs the split into the correct type
    let (|Header|Detail|Unknown|) (value:string) =
        try
            let splits = value.Split('\t')
            let salesOrderID = Int32.Parse(splits.[0])
            if splits.[1].Contains("-") then
                // Processing a Header record
                let orderDate = DateTime.Parse(splits.[1])
                let shipDate = DateTime.Parse(splits.[2])
                let customerID = Int32.Parse(splits.[3])
                let salesOrderNumber = splits.[4]
                let purchaseOrderNumber = splits.[5]
                let accountNumber = splits.[6]
                // Secondary key of zero sorts the header ahead of its details
                let key = Context.FormatKeys(splits.[0], "0")
                Header (key,
                        { OrderHeader.SalesOrderID = salesOrderID; OrderDate = orderDate; ShipDate = shipDate; CustomerID = customerID;
                          SalesOrderNumber = salesOrderNumber; PurchaseOrderNumber = purchaseOrderNumber; AccountNumber = accountNumber })
            else
                // Processing a detail record
                let salesOrderDetailID = Int32.Parse(splits.[1])
                let productID = Int32.Parse(splits.[2])
                let orderQty = Decimal.Parse(splits.[3])
                let unitPrice = Decimal.Parse(splits.[4])
                // Secondary key is the SalesOrderDetailID
                let key = Context.FormatKeys(splits.[0], splits.[1])
                Detail (key,
                        { OrderDetail.SalesOrderID = salesOrderID; SalesOrderDetailID = salesOrderDetailID; ProductID = productID;
                          OrderQty = orderQty; UnitPrice = unitPrice })
        with
        // The Parse methods raise FormatException on malformed values; short lines raise IndexOutOfRangeException
        | :? System.ArgumentException
        | :? System.FormatException
        | :? System.IndexOutOfRangeException -> Unknown
    // Map the data from input name/value to output name/value
    override self.Map (value:string) =
        seq {
            match value with
            | Header (key, header) -> yield (key, { OrderLine.OrderHeader = header; OrderDetail = Unchecked.defaultof<OrderDetail> })
            | Detail (key, detail) -> yield (key, { OrderLine.OrderHeader = Unchecked.defaultof<OrderHeader>; OrderDetail = detail })
            | Unknown -> ()
        }

The multiple key values are delimited with a Tab character.
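The implementation of Context.FormatKeys and Context.GetKeys is not shown here, so the following is only a guess at their behaviour, based on the Tab delimiter just described. The KeySketch class and its two methods are hypothetical stand-ins and should not be read as the framework's actual code.

```csharp
using System;

// Hypothetical stand-ins for the framework's key helpers, assuming a Tab delimiter.
public static class KeySketch
{
    // Combines the individual key values into a single composite key.
    public static string FormatKeys(params string[] keys)
    {
        return string.Join("\t", keys);
    }

    // Splits a composite key back into its individual key values.
    public static string[] GetKeys(string key)
    {
        return key.Split('\t');
    }

    public static void Main()
    {
        string key = FormatKeys("43667", "77");
        string[] parts = GetKeys(key);
        Console.WriteLine("SalesOrderID={0}, SalesOrderDetailID={1}", parts[0], parts[1]);
    }
}
```

A consequence of this scheme is that the individual key values must not themselves contain a Tab character.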

The Reducer code then becomes a lot simpler as no List caching is involved, and LINQ keeps the C# version succinct.

C# Reducer

// Performs the combined data 1-many join
class OrderJoinReducer : ReducerBase<OrderLine, OrderLine>
{
    public override IEnumerable<Tuple<string, OrderLine>> Reduce(string key, IEnumerable<OrderLine> values)
    {
        string salesOrderID = Context.GetKeys(key)[0];
        // The secondary sort guarantees the header is the first value
        OrderHeader orderHeader = values.ElementAt(0).OrderHeader;
        return values.Skip(1).Select(detail => Tuple.Create(salesOrderID, new OrderLine() { OrderHeader = orderHeader, OrderDetail = detail.OrderDetail }));
    }
}

F# Reducer

// Performs the combined data 1-many join
type OrderJoinReducer() =
    inherit ReducerBase<OrderLine, OrderLine>()
    override self.Reduce (key:string) (values:seq<OrderLine>) =
        let salesOrderID = Context.GetKeys(key).[0]
        // The secondary sort guarantees the header is the first value
        let orderHeader = Seq.nth 0 values
        values
        |> Seq.skip 1
        |> Seq.map (fun item ->
            (salesOrderID, { OrderLine.OrderHeader = orderHeader.OrderHeader; OrderDetail = item.OrderDetail }))

As you can see the code is simpler, especially when it comes to the Reducer, and because detail values are no longer cached it also needs less memory than the In-Memory approach.

Submitting the Jobs

For the In-Memory code the job submission is quite simple:

%BASEPATH%\MSDN.Hadoop.MapReduce\Release\MSDN.Hadoop.Submission.Console.exe
-input "join/data"
-output "join/order"
-mapper "MSDN.Hadoop.MapReduceFSharp.OrderJoinMemoryMapper, MSDN.Hadoop.MapReduceFSharp"
-reducer "MSDN.Hadoop.MapReduceFSharp.OrderJoinMemoryReducer, MSDN.Hadoop.MapReduceFSharp"
-file "%BASEPATH%\MSDN.Hadoop.MapReduceFSharp\Release\MSDN.Hadoop.MapReduceFSharp.dll"

However, for the optimized version one needs to accommodate the fact that the sorting takes place on two keys, and the data partitioning only on one. This is achieved by setting these exact options:

%BASEPATH%\MSDN.Hadoop.MapReduce\Release\MSDN.Hadoop.Submission.Console.exe
-input "join/data"
-output "join/order"
-mapper "MSDN.Hadoop.MapReduceFSharp.OrderJoinMapper, MSDN.Hadoop.MapReduceFSharp"
-reducer "MSDN.Hadoop.MapReduceFSharp.OrderJoinReducer, MSDN.Hadoop.MapReduceFSharp"
-file "%BASEPATH%\MSDN.Hadoop.MapReduceFSharp\Release\MSDN.Hadoop.MapReduceFSharp.dll"
-numberKeys 2 -numberPartitionKeys 1

The reason these options were added to version 1.0.0 of the code was to accommodate this exact type of processing.

Conclusion

To conclude here is some sample output from the join.

43667    {"OrderDetail":{"OrderQty":3,"ProductID":710,"SalesOrderDetailID":77,"SalesOrderID":43667,"UnitPrice":5.7000},"OrderHeader":{"AccountNumber":"10-4020-000646","CustomerID":29974,"OrderDate":"\/Date(1120172400000+0100)\/","PurchaseOrderNumber":"PO15428132599","SalesOrderID":43667,"SalesOrderNumber":"SO43667","ShipDate":"\/Date(1120777200000+0100)\/"}}
43667    {"OrderDetail":{"OrderQty":1,"ProductID":773,"SalesOrderDetailID":78,"SalesOrderID":43667,"UnitPrice":2039.9940},"OrderHeader":{"AccountNumber":"10-4020-000646","CustomerID":29974,"OrderDate":"\/Date(1120172400000+0100)\/","PurchaseOrderNumber":"PO15428132599","SalesOrderID":43667,"SalesOrderNumber":"SO43667","ShipDate":"\/Date(1120777200000+0100)\/"}}
43667    {"OrderDetail":{"OrderQty":1,"ProductID":778,"SalesOrderDetailID":79,"SalesOrderID":43667,"UnitPrice":2024.9940},"OrderHeader":{"AccountNumber":"10-4020-000646","CustomerID":29974,"OrderDate":"\/Date(1120172400000+0100)\/","PurchaseOrderNumber":"PO15428132599","SalesOrderID":43667,"SalesOrderNumber":"SO43667","ShipDate":"\/Date(1120777200000+0100)\/"}}
43667    {"OrderDetail":{"OrderQty":1,"ProductID":775,"SalesOrderDetailID":80,"SalesOrderID":43667,"UnitPrice":2024.9940},"OrderHeader":{"AccountNumber":"10-4020-000646","CustomerID":29974,"OrderDate":"\/Date(1120172400000+0100)\/","PurchaseOrderNumber":"PO15428132599","SalesOrderID":43667,"SalesOrderNumber":"SO43667","ShipDate":"\/Date(1120777200000+0100)\/"}}

Utilizing the output is merely a case of deserializing the data back into the OrderLine type.
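As a minimal sketch of reading an output line back, the code below uses DataContractJsonSerializer. The \/Date(...)\/ values in the sample output suggest this serializer's date format, but that is an assumption on my part, as is the OutputReaderSketch name. Each line is taken to be the key, some whitespace, then the JSON value, and the sample line in Main is abbreviated from the output above.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;

public class OrderHeader
{
    public int SalesOrderID { get; set; }
    public DateTime OrderDate { get; set; }
    public DateTime ShipDate { get; set; }
    public int CustomerID { get; set; }
    public string SalesOrderNumber { get; set; }
    public string PurchaseOrderNumber { get; set; }
    public string AccountNumber { get; set; }
}

public class OrderDetail
{
    public int SalesOrderID { get; set; }
    public int SalesOrderDetailID { get; set; }
    public int ProductID { get; set; }
    public decimal OrderQty { get; set; }
    public decimal UnitPrice { get; set; }
}

public class OrderLine
{
    public OrderHeader OrderHeader { get; set; }
    public OrderDetail OrderDetail { get; set; }
}

// Hypothetical helper for reading the job output; not part of the sample framework.
public static class OutputReaderSketch
{
    public static Tuple<string, OrderLine> ParseLine(string line)
    {
        // The key is everything before the JSON object begins
        int brace = line.IndexOf('{');
        string key = line.Substring(0, brace).Trim();
        string json = line.Substring(brace);

        var serializer = new DataContractJsonSerializer(typeof(OrderLine));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return Tuple.Create(key, (OrderLine)serializer.ReadObject(stream));
        }
    }

    public static void Main()
    {
        string line = "43667\t{\"OrderDetail\":{\"OrderQty\":3,\"ProductID\":710,\"SalesOrderDetailID\":77,\"SalesOrderID\":43667,\"UnitPrice\":5.7000},"
            + "\"OrderHeader\":{\"AccountNumber\":\"10-4020-000646\",\"CustomerID\":29974,\"OrderDate\":\"\\/Date(1120172400000+0100)\\/\",\"SalesOrderID\":43667}}";
        var parsed = ParseLine(line);
        Console.WriteLine("{0}: product {1} at {2}", parsed.Item1, parsed.Item2.OrderDetail.ProductID, parsed.Item2.OrderDetail.UnitPrice);
    }
}
```

Members missing from the JSON, such as ShipDate in the abbreviated line, are simply left at their default values.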

Hopefully I have demonstrated that MapReduce joins are not that difficult to perform. With a little thought around the types, the code is not much more than data parsing.

A word of caution: compared to standard database joins, MapReduce joins are slow. If you can join the data before running MapReduce processing then this should be your preferred approach.
