Hadoop XML Streaming and F# MapReduce

Article
01/21/2012

So, to round out the Hadoop Streaming samples I thought I would put together an XML Streaming sample. As always the code can be found here:

https://code.msdn.microsoft.com/Hadoop-Streaming-and-F-f2e76850

XML Streaming Reader

So how does one stream in XML? If you read the Hadoop Streaming documentation you will notice the following FAQ:

You can use the record reader StreamXmlRecordReader to process XML documents.

 hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING" ..... (rest of the command)

Anything found between BEGIN_STRING and END_STRING would be treated as one record for map tasks.

This should allow one to define a start and end tag using commands such as:

-inputreader "StreamXmlRecord,begin=<Row>,end=</Row>"

However, if one tries to run this from the command console one gets an error due to the “<” character; as this is used to redirect standard input. It seems that using the usual caret escape character ^ is not enough.

In this example, to parse the XML one could just specify the XML Element name, “Row” in this case, and the start and end tags could easily be derived. As such to assist in XML processing I have provided an XML reader to do exactly this. I was going to modify the base Hadoop stream reader but it seems that there is traction behind the usage of a Mahout library, called “org.apache.mahout.classifier.bayes.XmlInputFormat”.

I have changed this base library to allow a configuration value of “"xmlinput.element" that allows one to just define the XML Element to be processed. This allows one to support an XML Streaming job used with the following configuration:

"-D xmlinput.element=Store" -inputformat org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat

As a note one has to remember to place the job configuration setting in quotes.

The changes to support this are quite minimal. Firstly the Input Format is defined as:

public class XmlElementStreamingInputFormat

extends FileInputFormat <NullWritable, Text> {

public static final String XML_ELEMENT_KEY = "xmlinput.element";

@Override

public RecordReader<NullWritable, Text> getRecordReader(

InputSplit inputSplit, JobConf jobConf, Reporter reporter) throws IOException {

return new XmlElementStreamingRecordReader((FileSplit) inputSplit, jobConf);

}

Secondly the Record Reader is modified to define the start and end tags based on the new configuration value:

String elementTagName = jobConf.get(XML_ELEMENT_KEY);

startTag = ("<" + elementTagName + ">").getBytes("UTF8");

endTag = ("</" + elementTagName + ">").getBytes("UTF8");

In using this code one has the restriction that the element tag being searched for is well-formed and has no corresponding attributes. If this is not the case one can easily create a new XML Input Format class specific to your needs.

The complete original source can be found in various places; including github. My modifications are in the source code download.

With the new XML Input Format class in place we are good to go with our F# code.

Map and Reduce Classes

As always, lets start with Map and Reduce functions. The goal of the sample code is to support a Map and Reduce with the following prototypes:

 Map : XElement –> seq<(string * string) * obj>

 Reduce : string * string -> seq<string> -> obj option

The idea is the Mapper takes in an XElement and projects this into a sequence of keys and values. The Reducer, as in the text streaming case, takes in a key and value pair and returns an optional reduced value.

The main difference that you will notice in this sample is that the key is a tuple, which in turn is mapped into multiple map keys. This is supported in Streaming application through the configuration value “stream.num.map.output.key.fields”. The Map in the previous samples returns a single string key value.

For the sample I was going to process a series of XML nodes with the following structure (created by running a TSQL Select from the Adventure Works Store table):

<Store>

<Name>Professional Sales and Service</Name>

<BankName>International Bank</BankName>

<Specialty>Touring</Specialty>

</StoreSurvey>

</Demographics>

</Store>

The Map code is going to extract the Sales Amount for each Business Type and Bank:

let Map (element:XElement) =

let aw = "https://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey"

let demographics = element.Element(XName.Get("Demographics")).Element(XName.Get("StoreSurvey", aw))

if not(demographics = null) then

let business = demographics.Element(XName.Get("BusinessType", aw)).Value

let bank = demographics.Element(XName.Get("BankName", aw)).Value

let key = (business, bank)

let sales = Decimal.Parse(demographics.Element(XName.Get("AnnualSales", aw)).Value)

Seq.singleton (key, sales)

else

Seq.empty

The Reduce then sums the Sales across each of the Business Types and Banks:

let Reduce (key:(string*string)) (values:seq<string>) =

let totalRevenue =

values |>

Seq.fold (fun revenue value -> revenue + Int32.Parse(value)) 0

Some(totalRevenue)

The rationale for the Map returning a sequence rather than a single value as the Map calling code allows for Map function to return multiples values per XML node.

All simple processing. So how are the Map and Reduce functions called?

Mapper and Reducer Executable

The complexity in the Mapper executable, which calls the Map function, comes in processing the XML input stream. Whereas in text streaming one gets a line per record to be processed, in the case of XML this is not the case. For XML Streaming one will get a stream of XML which will more than likely be presented over multiple lines. As such, the Mapper executable will need to parse the input, in my case line by line, and extract the required nodes for the construction of the XElement type. The requirement here then is not that different from the Java XML Input Format class.

In the sample code the construction of the XElement sequence is achieved through the XMLElements sequence definition. The full code listing is as follows:

module Controller =

let Run (args:string array) =

// Define what arguments are expected

let defs = [

{ArgInfo.Command="input"; Description="Input XML File"; Required=false };

{ArgInfo.Command="output"; Description="Output File"; Required=false } ]

// Parse Arguments into a Dictionary

let parsedArgs = Arguments.ParseArgs args defs

// Ensure Standard Input/Output and allow for debug configuration

use reader =

if parsedArgs.ContainsKey("input") then

new StreamReader(Path.GetFullPath(parsedArgs.["input"]))

else

new StreamReader(Console.OpenStandardInput())

use writer =

if parsedArgs.ContainsKey("output") then

new StreamWriter(Path.GetFullPath(parsedArgs.["output"]))

else

new StreamWriter(Console.OpenStandardOutput(), AutoFlush = true)

// Combine the name/value output into a string

let outputCollector ((outputKey1, outputKey2) , outputValue) =

let output = sprintf "%s\t%s\t%O" outputKey1 outputKey2 outputValue

writer.WriteLine(output)

// Write an counter entry

let counterReporter docType =

stderr.WriteLine (sprintf "reporter:counter:Elements Processed,%s,1" docType)

let nodename = StoreXmlElementMapper.MapNode

let nodestart = "<" + nodename + ">"

let nodeend = "</" + nodename + ">"

// Read the next line of the input stream

let readLine() =

reader.ReadLine()

// Parse the input stream into a sequence of XElement types

let elementBuilder = new StringBuilder(1024)

let rec xmlElements inContent (someContent:string option) = seq {

let line =

match someContent with

| Some(content) -> content

| None -> readLine()

if not (box line = null) then

if (inContent) then

// Try to find the end element and yield accordingly

let offset = line.IndexOf(nodeend, 0, StringComparison.InvariantCultureIgnoreCase)

if (offset > -1) then

// Found the endnode so append and start new XElement

let content = line.Substring(0, offset + nodeend.Length)

elementBuilder.Append(content) |> ignore

let nextContent =

if (offset + nodeend.Length = line.Length) then

None

else

Some(line.Substring(offset + nodeend.Length))

yield XElement.Parse(elementBuilder.ToString())

elementBuilder.Clear() |> ignore

yield! xmlElements false nextContent

else

// Just a content line so append

elementBuilder.AppendLine(line) |> ignore

yield! xmlElements true None

else

// Find the start node element and start building

let offset = line.IndexOf(nodestart, 0, StringComparison.InvariantCultureIgnoreCase)

if (offset > -1) then

yield! xmlElements true (Some(line.Substring(offset)))

else

yield! xmlElements false None

}

// Process the XElement sequence and report on successes

let elementProcessed value =

outputCollector value

counterReporter "Successfully Processed"

try

xmlElements false None

|> Seq.map StoreXmlElementMapper.Map

|> Seq.iter (Seq.iter elementProcessed)

with

| :? System.Xml.XmlException ->

// Ignore invalid xml elements but report on count

counterReporter "Invalid Elements"

// Close the streams

reader.Close()

writer.Close()

The premise of the creation of the sequence of XElement types is that a StringBuilder is used to build up the text used to create the XElement. The opening tag is located after which any content is appended to the StringBuilder. Once the end tag is located and appended to the StringBuilder, its contents are yielded as an XElement, and the process repeated. One has to remember that after finding the end tag the remainder of the line needs to be fed into the locate for the next opening tag.

As this processing is dependant on knowing the element name to process I decided to have the module containing the Map function return the node name. Other methods could be used but this seemed to be the cleanest as the Mapper needs this understanding to process the XML.

In addition to handling the XML processing this sample code also demonstrates writing out the composite key from the tuple and generating counters showing the number of elements processed; both success and failures.

With the XML previously mentioned the output from the Mapper executable will be:

BM Guardian Bank 1100000

BM International Bank 3000000

BM International Security 2700000

BM Primary Bank & Reserve 2200000

BM Primary International 1100000

BM Reserve Security 1100000

BM United Security 2900000

BS International Security 1500000

BS Primary Bank & Reserve 1500000

BS Reserve Security 800000

OS Guardian Bank 6000000

OS International Bank 4500000

OS International Security 4500000

OS Primary Bank & Reserve 10500000

OS Primary International 6000000

OS Reserve Security 6000000

OS United Security 4500000

So onto the Reducer. The Reducer, as in the previous samples, just has to read in this text stream and call the Reduce function. The full code listing is as follows:

module Controller =

let Run (args:string array) =

// Define what arguments are expected

let defs = [

{ArgInfo.Command="input"; Description="Input File"; Required=false };

{ArgInfo.Command="output"; Description="Output File"; Required=false } ]

// Parse Arguments into a Dictionary

let parsedArgs = Arguments.ParseArgs args defs

// Ensure Standard Input/Output and allow for debug configuration

use reader =

if parsedArgs.ContainsKey("input") then

new StreamReader(Path.GetFullPath(parsedArgs.["input"]))

else

new StreamReader(Console.OpenStandardInput())

use writer =

if parsedArgs.ContainsKey("output") then

new StreamWriter(Path.GetFullPath(parsedArgs.["output"]))

else

new StreamWriter(Console.OpenStandardOutput(), AutoFlush = true)

// Combine the name/value output into a string

let outputCollector (outputKey1, outputKey2) outputValue =

let output = sprintf "%s\t%s\t%O" outputKey1 outputKey2 outputValue

writer.WriteLine(output)

// Read the next line of the input stream

let readLine() =

reader.ReadLine()

// Parse the input into the required key/value pair

let parseLine (input:string) =

let keyValue = input.Split('\t')

((keyValue.[0].Trim(), keyValue.[1].Trim()), keyValue.[2].Trim())

// Converts a input line into an option

let getInput() =

let input = readLine()

if not(String.IsNullOrWhiteSpace(input)) then

Some(parseLine input)

else

None

// Creates a sequence of the input based on the provided key

let lastInput = ref None

let continueDo = ref false

let inputsByKey key firstValue = seq {

// Yield any value from previous read

yield firstValue

continueDo := true

while !continueDo do

match getInput() with

| Some(input) when (fst input) = key ->

// Yield found value and remainder of sequence

yield (snd input)

| Some(input) ->

// Have a value but different key

lastInput := Some(input)

continueDo := false

| None ->

// Have no more entries

lastInput := None

continueDo := false

}

// Controls the calling of the reducer

let rec processInput (input:((string * string) * string) option) =

if input.IsSome then

let (key, value) = input.Value

let reduced = StoreXmlElementReducer.Reduce key (inputsByKey key value)

if reduced.IsSome then

outputCollector key reduced.Value

processInput lastInput.contents

processInput (getInput())

The main difference in this code is the support for a tuple as a key. As in previous samples the Reducer has to create a sequence of sequences, as lazy evaluated groupBy, of the input data based on the key values. In this instance to find a match one needs to ensure both key elements are the same. This is achieved in F# by the fact tuple types support structural equality (check out Equality and Comparison Constraints in F# by Don Syme for more details).

Of course one could generalize this code to support N key elements; but maybe more on this at another time.

Finally onto running the job. The command to run this MapReduce job is as follows:

hadoop.cmd jar lib/hadoop-streaming-ms.jar

"-D xmlinput.element=Store" "-D stream.num.map.output.key.fields=2"

-input "/stores/demographics" -output "/stores/banking/release"

-mapper "..\..\jars\FSharp.Hadoop.MapperXml.exe"

-combiner "..\..\jars\FSharp.Hadoop.ReducerXml.exe"

-reducer "..\..\jars\FSharp.Hadoop.ReducerXml.exe"

-file "C:\bin\Release\FSharp.Hadoop.MapperXml.exe"

-file "C:\bin\Release\FSharp.Hadoop.ReducerXml.exe"

-file "C:\bin\Release\FSharp.Hadoop.MapReduce.dll"

-file "C:\bin\Release\FSharp.Hadoop.Utilities.dll"

-inputformat org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat

The configuration values one has to set are the number of output key fields, “stream.num.map.output.key” with a value of 2, and the XML element name to be processed, “xmlinput.element” with a value of “Store”.

As in the Binary Document Streaming sample, I created a new Streaming JAR called “hadoop-streaming-ms.jar” which contains the new Java classes on top of the base Hadoop streaming classes. The downloadable code has the full listing of the Java classes along with a command file to compile the source code.

Tester Application

As always I have put together a tester application for calling the Mapper and Reducer executable:

module MapReduceConsole =

let Run args =

// Define what arguments are expected

let defs = [

{ArgInfo.Command="input"; Description="Input File"; Required=true };

{ArgInfo.Command="output"; Description="Output File"; Required=true };

{ArgInfo.Command="tempPath"; Description="Temp File Path"; Required=true };

{ArgInfo.Command="mapper"; Description="Mapper EXE"; Required=true };

{ArgInfo.Command="reducer"; Description="Reducer EXE"; Required=true };

{ArgInfo.Command="nodename"; Description="XML Node"; Required=true }; ]

// Parse Arguments into a Dictionary

let parsedArgs = Arguments.ParseArgs args defs

Arguments.DisplayArgs parsedArgs

// define the executables

let mapperExe = Path.GetFullPath(parsedArgs.["mapper"])

let reducerExe = Path.GetFullPath(parsedArgs.["reducer"])

let nodename = parsedArgs.["nodename"]

let nodestart = "<" + nodename + ">"

let nodeend = "</" + nodename + ">"

Console.WriteLine()

Console.WriteLine (sprintf "The Mapper file is:\t%O" mapperExe)

Console.WriteLine (sprintf "The Reducer file is:\t%O" reducerExe)

Console.WriteLine (sprintf "Processing Nodename:\t%O" nodename)

// Get the file names

let inputfile = Path.GetFullPath(parsedArgs.["input"])

let outputfile = Path.GetFullPath(parsedArgs.["output"])

let tempPath = Path.GetFullPath(parsedArgs.["tempPath"])

let tempFile = Path.Combine(tempPath, Path.GetFileName(outputfile))

let mappedfile = Path.ChangeExtension(tempFile, "mapped")

let reducefile = Path.ChangeExtension(tempFile, "reduced")

Console.WriteLine()

Console.WriteLine (sprintf "The input file is:\t\t%O" inputfile)

Console.WriteLine (sprintf "The mapped temp file is:\t%O" mappedfile)

Console.WriteLine (sprintf "The reduced temp file is:\t%O" reducefile)

Console.WriteLine (sprintf "The output file is:\t\t%O" outputfile)

// Give the user an option to continue

Console.WriteLine()

Console.WriteLine("Hit ENTER to continue...")

Console.ReadLine() |> ignore

// Parse the input stream into a sequence of XElement types

let mapperProcess() =

use mapper = new Process()

mapper.StartInfo.FileName <- mapperExe

mapper.StartInfo.UseShellExecute <- false

mapper.StartInfo.RedirectStandardInput <- true

mapper.StartInfo.RedirectStandardOutput <- true

mapper.Start() |> ignore

use mapperInput = mapper.StandardInput

use mapperOutput = mapper.StandardOutput

// Map the reader to a background thread so processing can happen in parallel

Console.WriteLine "Mapper Processing Starting..."

let taskMapperFunc() =

use mapperWriter = File.CreateText(mappedfile)

while not mapperOutput.EndOfStream do

mapperWriter.WriteLine(mapperOutput.ReadLine())

let taskMapperWriting = Task.Factory.StartNew(Action(taskMapperFunc))

// Pass the file into the mapper process and close input stream when done

use mapperReader = new StreamReader(File.OpenRead(inputfile))

let elementBuilder = new StringBuilder(1024)

let rec xmlElements inContent (someContent:string option) = seq {

let line =

match someContent with

| Some(content) -> content

| None -> mapperReader.ReadLine()

if not (box line = null) then

if (inContent) then

// Try to find the end element and yield accordingly

let offset = line.IndexOf(nodeend, 0, StringComparison.InvariantCultureIgnoreCase)

if (offset > -1) then

// Found the endnode so always add a new line

let content = line.Substring(0, offset + nodeend.Length)

elementBuilder.AppendLine(content) |> ignore

let nextContent =

if (offset + nodeend.Length = line.Length) then

None

else

Some(line.Substring(offset + nodeend.Length))

yield elementBuilder.ToString()

elementBuilder.Clear() |> ignore

yield! xmlElements false nextContent

else

// Just a content line so append

elementBuilder.AppendLine(line) |> ignore

yield! xmlElements true None

else

// Find the start node element and start building

let offset = line.IndexOf(nodestart, 0, StringComparison.InvariantCultureIgnoreCase)

if (offset > -1) then

yield! xmlElements true (Some(line.Substring(offset)))

else

yield! xmlElements false None

}

xmlElements false None

|> Seq.iter mapperInput.WriteLine

mapperInput.Close()

taskMapperWriting.Wait()

mapperOutput.Close()

mapper.WaitForExit()

let result = match mapper.ExitCode with | 0 -> true | _ -> false

mapper.Close()

result

// Sort the mapped file by the first field - mimic the role of Hadoop

let hadoopProcess() =

Console.WriteLine "Hadoop Processing Starting..."

let unsortedValues = seq {

use reader = new StreamReader(File.OpenRead(mappedfile))

while not reader.EndOfStream do

let input = reader.ReadLine()

let keyValue = input.Split('\t')

yield (keyValue.[0].Trim(), keyValue.[1].Trim(), keyValue.[2].Trim())

reader.Close()

}

use writer = File.CreateText(reducefile)

unsortedValues

|> Seq.sortBy (fun (key1, key2, value) -> (key1, key2))

|> Seq.iter (fun (key1, key2, value) -> writer.WriteLine (sprintf "%O\t%O\t%O" key1 key2 value))

writer.Close()

// Finally call the reducer process

let reducerProcess() =

use reducer = new Process()

reducer.StartInfo.FileName <- reducerExe

reducer.StartInfo.UseShellExecute <- false

reducer.StartInfo.RedirectStandardInput <- true

reducer.StartInfo.RedirectStandardOutput <- true

reducer.Start() |> ignore

use reducerInput = reducer.StandardInput

use reducerOutput = reducer.StandardOutput

// Map the reader to a background thread so processing can happen in parallel

Console.WriteLine "Reducer Processing Starting..."

let taskReducerFunc() =

use reducerWriter = File.CreateText(outputfile)

while not reducerOutput.EndOfStream do

reducerWriter.WriteLine(reducerOutput.ReadLine())

let taskReducerWriting = Task.Factory.StartNew(Action(taskReducerFunc))

// Pass the file into the mapper process and close input stream when done

use reducerReader = new StreamReader(File.OpenRead(reducefile))

while not reducerReader.EndOfStream do

reducerInput.WriteLine(reducerReader.ReadLine())

reducerInput.Close()

taskReducerWriting.Wait()

reducerOutput.Close()

reducer.WaitForExit()

let result = match reducer.ExitCode with | 0 -> true | _ -> false

reducer.Close()

result

// Finish test

if mapperProcess() then

Console.WriteLine "Mapper Processing Complete..."

hadoopProcess()

Console.WriteLine "Hadoop Processing Complete..."

if reducerProcess() then

Console.WriteLine "Reducer Processing Complete..."

Console.WriteLine "Processing Complete..."

Console.ReadLine() |> ignore

Again, the challenge in this code is the processing of the XML. This sample code processes a single XML input file, performing the parsing of the XML into the required elements, and then streaming this into the Mapper executable. If you review the code you will see a string similarity between the Mapper executable XElement sequence generation and how the XML is parsed and streamed into the Mapper.

Finally, a required argument into the tester application is the “nodename” value. This is analogous to the job configuration parameter when running the job within a Hadoop cluster.

Conclusion

So this wraps up Streaming examples for a while. I have enjoyed putting it all together, but I hope it is some useful code.

Hadoop XML Streaming and F# MapReduce

XML Streaming Reader

Map and Reduce Classes

Mapper and Reducer Executable

Tester Application

Conclusion

Additional resources