Framework for .Net Hadoop MapReduce Job Submission Json Serialization

A while back, one of the changes made to the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code was to support Binary Serialization out of the Mapper, in and out of Combiners, and out of the Reducer. Although this change was needed to support the Generic interfaces, it had two downsides. First, the size of the output increased dramatically. Second, and to a lesser extent, the data could no longer be visually inspected and was not interoperable with non-.Net code. A further, subtler problem was that ToString() was being used to support textual output from the Reducer, which is not an ideal solution.

To fix these issues, I have changed the default Serializer to be based on the DataContractJsonSerializer. For the most part, this will require no changes to the running code. However, this approach does allow one to better control the serialization of the intermediate and final output.
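A minimal sketch of how a value might be turned into a line of text output with the DataContractJsonSerializer. This is not the framework's actual code: the QueryCount type, the helper names, and the tab between key and value (the Hadoop Streaming default) are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

// Hypothetical value type, used only for this illustration.
[DataContract]
public class QueryCount
{
    [DataMember] public string Platform { get; set; }
    [DataMember] public int Count { get; set; }
}

public static class JsonTextSketch
{
    // Serializes a data-contract-compatible value to a JSON string.
    public static string ToJson<T>(T value)
    {
        var serializer = new DataContractJsonSerializer(typeof(T));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, value);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    // Builds one line of output: key, tab, JSON value.
    // The tab separator is an assumption, not taken from the framework.
    public static string ToOutputLine<T>(string key, T value)
    {
        return key + "\t" + ToJson(value);
    }
}
```

Calling JsonTextSketch.ToOutputLine("Android", new QueryCount { Platform = "Android", Count = 3 }) should give the key followed by {"Count":3,"Platform":"Android"}, since data members without an explicit order are written alphabetically.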

As an example, consider the following F# type, which is used in one of the samples:

type MobilePhoneRange = { MinTime:TimeSpan; AverageTime:TimeSpan; MaxTime:TimeSpan }

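This now results in the following output from the Streaming job, where the key is a Device Platform string and the value is the query time range in JSON format. The trailing @ in each member name comes from the compiler-generated backing fields of the F# record, which are what the serializer writes out: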

Android {"AverageTime@":"PT12H54M39S","MaxTime@":"PT23H59M54S","MinTime@":"PT6S"}
RIM OS {"AverageTime@":"PT13H52M56S","MaxTime@":"PT23H59M58S","MinTime@":"PT1M7S"}
Unknown {"AverageTime@":"PT10H29M27S","MaxTime@":"PT23H52M36S","MinTime@":"PT36S"}
Windows Phone {"AverageTime@":"PT12H38M31S","MaxTime@":"PT23H55M17S","MinTime@":"PT32S"}
iPhone OS {"AverageTime@":"PT11H51M53S","MaxTime@":"PT23H59M50S","MinTime@":"PT1S"}

Not only is this output readable, but it is also much smaller than the corresponding binary output.

If one wants further control over the serialization, one can now use the DataContract and DataMember attributes, as in this C# sample class definition, again taken from the samples:

using System;
using System.Runtime.Serialization;

// Only members marked [DataMember] are written to the JSON output.
[DataContract]
public class MobilePhoneRange
{
    [DataMember] public TimeSpan MinTime { get; set; }
    [DataMember] public TimeSpan MaxTime { get; set; }

    public MobilePhoneRange(TimeSpan minTime, TimeSpan maxTime)
    {
        this.MinTime = minTime;
        this.MaxTime = maxTime;
    }
}

This will result in the Streaming job output:

Android {"MaxTime":"PT23H59M54S","MinTime":"PT6S"}
RIM OS {"MaxTime":"PT23H59M58S","MinTime":"PT1M7S"}
Unknown {"MaxTime":"PT23H52M36S","MinTime":"PT36S"}
Windows Phone {"MaxTime":"PT23H55M17S","MinTime":"PT32S"}
iPhone OS {"MaxTime":"PT23H59M50S","MinTime":"PT1S"}
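Because the output is now plain JSON, it can be read back by any consumer. The following is a rough sketch of parsing one output line into a key and a typed value. It assumes the key and value are separated by a tab (the Hadoop Streaming default); the OutputLineReader helper is hypothetical and not part of the framework, and the class is repeated here, without its constructor, only to keep the sketch self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

[DataContract]
public class MobilePhoneRange
{
    [DataMember] public TimeSpan MinTime { get; set; }
    [DataMember] public TimeSpan MaxTime { get; set; }
}

// Hypothetical helper for reading the job's text output.
public static class OutputLineReader
{
    // Splits "key<TAB>json" at the first tab and deserializes the JSON part.
    public static KeyValuePair<string, MobilePhoneRange> Parse(string line)
    {
        int tab = line.IndexOf('\t');
        if (tab < 0)
        {
            throw new FormatException("Expected a tab between key and value.");
        }

        string key = line.Substring(0, tab);
        string json = line.Substring(tab + 1);

        var serializer = new DataContractJsonSerializer(typeof(MobilePhoneRange));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            var value = (MobilePhoneRange)serializer.ReadObject(stream);
            return new KeyValuePair<string, MobilePhoneRange>(key, value);
        }
    }
}
```

Splitting on the first tab rather than on whitespace matters here, since keys such as "Windows Phone" contain spaces.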

Most types that support binary serialization can be used as Mapper and Reducer types. One case that warrants a quick mention is serializing Generic Lists in F#. To use a Generic List, one must define a simple List type that inherits from List<T>. With the following definition, one can use a List of ProductQuantity types as part of any output.

open System.Collections.Generic

type ProductQuantity = { ProductId:int; Quantity:float }

type ProductQuantityList() =
    inherit List<ProductQuantity>()

If one still wants to use Binary Serialization, one can specify the optional output format parameter:

-outputFormat Binary

This submission attribute is optional; if it is absent, the default value of Text is assumed, meaning the output will be in text format using the JSON serializer.

Hopefully this change will be transparent, while giving better performance due to the dramatically reduced file sizes, along with better data readability and interoperability.