Taking the post a little further: "Generate movie recommendations using Apache Mahout with HDInsight"

02/08/2015

The post "Generate movie recommendations using Mahout" is a good introduction to Mahout. I've seen similar posts on the web, and books about Mahout, that use the same GroupLens Research movie ratings data.

I won't regurgitate the same info in the tutorial linked above. Rather, let's look at the results a little more deeply.

The result file part-r-00000 contains the movie recommendation results, something like:

1 [234:5.0...

3 ...272:4.649266]

These results can be parsed as User ID, Movie ID and recommendation score. The first two fields are easy enough to understand. The recommendation scores can be used if you are trying to predict a user's rating for an item, but they aren't useful for ranking recommendations.
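To make the three fields concrete, here is a minimal Python sketch that splits one result line into a user ID and a list of (movie ID, score) pairs. It assumes each line is a user ID, whitespace, then a bracketed, comma-separated list of movieID:score pairs, as the excerpt above suggests. The function name and the sample line are made up for illustration and do not come from a real run.

```python
from typing import List, Tuple


def parse_recommendation_line(line: str) -> Tuple[int, List[Tuple[int, float]]]:
    """Split one result line into (user_id, [(movie_id, score), ...])."""
    # Assumed layout: "<userID><whitespace>[<movieID>:<score>,<movieID>:<score>,...]"
    user_part, items_part = line.strip().split(None, 1)
    recommendations = []
    # Drop the surrounding brackets, then walk the comma-separated pairs.
    for pair in items_part.strip("[]").split(","):
        if not pair:
            continue
        movie_id, _, score = pair.partition(":")
        recommendations.append((int(movie_id), float(score)))
    return int(user_part), recommendations


# Hypothetical sample line in the assumed format.
sample = "1\t[234:5.0,347:4.8,272:4.649266]"
user_id, recs = parse_recommendation_line(sample)
print(user_id, recs)
```

The list keeps the order in which the pairs appear on the line, so the first entry is whatever the job listed first for that user.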

To provide ranked recommendations, SIMILARITY_LOGLIKELIHOOD would work better. In summary, SIMILARITY_LOGLIKELIHOOD doesn't use the user's item ratings; rather, it considers the overlap, and the lack of overlap, among users and the items each user did and did not interact with. To me, log likelihood is kinda like gathering user feedback: focus on what users do, not what they say they do. Did the user interact with, purchase or consume an item? If yes, then that's what log likelihood uses.

More information on how log likelihood works can be found here.
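For a rough feel of the arithmetic, the sketch below computes a log-likelihood ratio from a 2x2 table of interaction counts for two items. It is an illustration of the general statistic under my own naming, not a copy of Mahout's code: k11 is the number of users who interacted with both items, k12 and k21 the users who interacted with only one of them, and k22 the users who interacted with neither. Note that no rating value appears anywhere.

```python
import math


def x_log_x(x: int) -> float:
    """x * ln(x), with the usual convention that 0 * ln(0) is 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: int) -> float:
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio for a 2x2 table of interaction counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


# Hypothetical counts: the two items are always consumed together or not at all.
strong = log_likelihood_ratio(10, 0, 0, 10)
# Hypothetical counts: the two items look independent of each other.
independent = log_likelihood_ratio(10, 10, 10, 10)
print(strong, independent)
```

A large value means the co-occurrence of the two items is unlikely to be chance; a value near zero means the counts look independent.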

To use SIMILARITY_LOGLIKELIHOOD you only need a file with User ID and ID of the item/movie/etc. that the user has interacted with. Below we'll use the same data file from the tutorial.
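As a sketch of what such a two-column input looks like, the Python snippet below keeps only the user ID and item ID from rating lines. It assumes the u.data layout of whitespace-separated user ID, item ID, rating and timestamp; the function name and sample rows are illustrative. This step is optional, since the commands in this post feed u.data to the job unchanged, and whether your Mahout version needs an extra option for rating-less input is something to check in its documentation.

```python
def to_interactions(lines):
    """Keep only the (user ID, item ID) columns of each rating line."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


# Illustrative rows in the assumed layout: user, item, rating, timestamp.
sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
print(to_interactions(sample))
```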

Once you're set up for the tutorial, there are two options to slightly change and re-run it.

A) If you used the PowerShell script in the tutorial's "Run the job" section, just change the similarity value in jobArguments from "SIMILARITY_COOCCURRENCE" to "SIMILARITY_LOGLIKELIHOOD".

View the output file "part-r-00000" per the tutorial.

B) Alternatively, I used the following command line with my local HDInsight Emulator:

Copy the u.data file from your local file system to HDFS:

hadoop fs -put u.data u.data

Run it:

hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_LOGLIKELIHOOD --input u.data --output udata_output

Note: you may need to change the Mahout version in the path above to match your local machine.

Get the output file from HDFS and put it back into your local file system:

hadoop fs -getmerge udata_output udata_output.txt

View the file udata_output.txt

You'll notice that the recommendations are different from the results you received previously with SIMILARITY_COOCCURRENCE. In udata_output.txt you can ignore the recommendation score and focus on the recommended item IDs.
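One way to compare the two runs without reading every line is to drop the scores and look at which item IDs both metrics recommend for the same user. The Python sketch below does that, assuming both outputs use the userID [itemID:score,...] layout shown earlier. The function names and the sample lines are hypothetical.

```python
def recommended_items(lines):
    """Map each user ID to its list of recommended item IDs, ignoring scores."""
    result = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        user, items = parts
        result[int(user)] = [
            int(pair.partition(":")[0])
            for pair in items.strip("[]").split(",")
            if pair
        ]
    return result


def overlap_by_user(lines_a, lines_b):
    """For users present in both outputs, list the item IDs recommended by both."""
    a, b = recommended_items(lines_a), recommended_items(lines_b)
    return {user: sorted(set(a[user]) & set(b[user])) for user in a.keys() & b.keys()}


# Hypothetical output lines from the two similarity metrics.
cooccurrence = ["1\t[10:5.0,20:4.5,30:4.0]"]
loglikelihood = ["1\t[20:1.0,40:1.0,10:1.0]"]
print(overlap_by_user(cooccurrence, loglikelihood))
```

To run it on real output, pass the open file objects for the two result files in place of the sample lists.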

There's a lot of data to look through but I'll let you determine whether SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD or another Mahout similarity metric will best meet your needs.

I hope this post provided additional insight.

