If, like me, you are a .NET developer and have written some Streaming jobs, it is not immediately obvious how one can do any reporting. However, if you dig through the Streaming documentation you will come across this in the FAQs:
- How do I update counters in streaming applications? A streaming process can use the stderr to emit counter information. `reporter:counter:<group>,<counter>,<amount>` should be sent to stderr to update the counter.
- How do I update status in streaming applications? A streaming process can use the stderr to emit status information. To set a status, `reporter:status:<message>` should be sent to stderr.
So this provides an easy mechanism for getting feedback from a running streaming job.
If you take the code from my last binary streaming post, when running it one has no idea how many Microsoft Word, PDF, or unknown documents have been processed.
Thus using the counter output format, one can define a simple counterReporter function:
One can then easily report on documents processed using the following slight code modification:
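The actual .NET mapper modification is not shown in this chunk; as a sketch of the idea, the counter call slots into the per-document processing path. Here `get_document_type` is a hypothetical helper that classifies a document by its leading magic bytes:

```python
import sys


def counter_reporter(group, counter, amount=1):
    # Hadoop Streaming counter update: reporter:counter:<group>,<counter>,<amount>
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))


def get_document_type(header):
    # Hypothetical detection by magic bytes: "%PDF" marks a PDF,
    # "PK" (a zip archive) marks a .docx Word document.
    if header.startswith(b"%PDF"):
        return "PDF Document"
    if header.startswith(b"PK"):
        return "Word Document"
    return "Unknown Document"


def process_document(data):
    doc_type = get_document_type(data[:4])
    # Increment the per-type counter each time a document is processed.
    counter_reporter("Documents Processed", doc_type)
    # ... normal mapper processing of the document would continue here ...
    return doc_type
```

Each processed document then shows up under its type in the "Documents Processed" counter group.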
Thus each time we process a document we increment a counter, named for the document type, in the group "Documents Processed". Looking at the Hadoop job log we can now see:
| Counter Group | Counter | Map | Reduce | Total |
|---|---|---|---|---|
| Documents Processed | PDF Document | 1 | 0 | 1 |
| File Input Format Counters | Bytes Read | 2,003,157 | 0 | 2,003,157 |
| | Launched reduce tasks | 0 | 0 | 1 |
| | Launched map tasks | 0 | 0 | 4 |
| | Data-local map tasks | 0 | 0 | 4 |
| Map-Reduce Framework | Map output materialized bytes | 98 | 0 | 98 |
| | Combine output records | 5 | 0 | 5 |
| | Map input records | 4 | 0 | 4 |
| | Map output bytes | 64 | 0 | 64 |
| | Map input bytes | 2,003,157 | 0 | 2,003,157 |
| | Map output records | 5 | 0 | 5 |
| | Combine input records | 5 | 0 | 5 |
All nice and easy.
If you want to do status or error reporting, the process is the same, just with the `reporter:status:<message>` string format instead.