Framework for .Net Hadoop MapReduce Job Submission Binary Output

To end the week I decided to make a minor change to the “Generics based Framework for .Net Hadoop MapReduce Job Submission”.

I have been doing some work on creating a co-occurrence matrix for item recommendations. I was going to map the process onto one or more MapReduce jobs, but then came across the issue of how I would output the vector data from the reducer. In the current framework the reducer outputs the key/value data in a string format. This works fine for simple data, but for a vector it quickly becomes problematic, as the vector has to be formatted into a string and then parsed back out again.

To resolve this I have added a parameter called “outputFormat”. The default output is the usual string format, which can also be specified explicitly with the parameter value “Text”. Additionally, a parameter value of “Binary” is supported:

MSDN.Hadoop.Submission.Console.exe
-input "mobile/data" -output "mobile/querytimes"
-mapper "MSDN.Hadoop.MapReduceFSharp.MobilePhoneQueryMapper, MSDN.Hadoop.MapReduceFSharp"
-reducer "MSDN.Hadoop.MapReduceFSharp.MobilePhoneQueryReducer, MSDN.Hadoop.MapReduceFSharp"
-outputFormat Binary
-file "C:\Projects\Release\MSDN.Hadoop.MapReduceFSharp.dll"

When the output format is specified as Binary, the reducer value is output as a binary serialized version of the data, represented as a Base64 string. When reading the reduced output, one can then easily deserialize this string back into a .Net type:

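    open System
    open System.IO
    open System.Runtime.Serialization.Formatters.Binary

    // Decode the Base64 reducer value and deserialize it back into a .Net object.
    // The result is typed as obj, so the caller casts it to the expected type.
    let Deserialize (value:string) =
        let bytes = Convert.FromBase64String(value)
        use stream = new MemoryStream(bytes)
        let formatter = new BinaryFormatter()
        formatter.Deserialize(stream)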
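
To illustrate the format described above, here is a minimal round-trip sketch. It is not the framework's actual code: the Serialize helper and the sample vector are hypothetical names made up for this illustration, and it assumes a .Net version where BinaryFormatter is available. It shows a value being binary serialized and Base64 encoded, then read back and cast to its original type.

```fsharp
open System
open System.IO
open System.Runtime.Serialization.Formatters.Binary

// Hypothetical helper (not part of the framework): binary serialize a value
// and represent the resulting bytes as a Base64 string.
let Serialize (value:obj) =
    use stream = new MemoryStream()
    let formatter = new BinaryFormatter()
    formatter.Serialize(stream, value)
    Convert.ToBase64String(stream.ToArray())

// Sample vector, assumed purely for illustration.
let vector = [| 1.0; 0.0; 3.5 |]
let encoded = Serialize (box vector)

// Reading side: decode the Base64 string, deserialize, and cast the
// resulting obj back to the type that was originally written.
let decoded =
    use stream = new MemoryStream(Convert.FromBase64String(encoded))
    (new BinaryFormatter()).Deserialize(stream) :?> float[]
```

Because deserialization returns an obj, the reader has to know which type the reducer wrote; the cast with :?> raises an InvalidCastException if the types do not match.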

Hopefully one will find this a lot simpler than performing string manipulations.