The post “Generate movie recommendations using Mahout” is a good introduction to Mahout. I’ve seen similar posts on the web and in books about Mahout that use the same GroupLens Research movie ratings data.
I won’t regurgitate the same info in the tutorial linked above. Rather, let’s look at the results a little more deeply.
The result file part-r-00000 contains the movie recommendation results.
These results can be parsed as User ID, Movie ID and recommendation score. The first two fields are easy enough to understand. The recommendation score is useful if you are trying to predict a user’s rating for an item, but it isn’t useful for ranking recommendations.
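To make the parsing concrete, here is a minimal Python sketch. It assumes each output line looks like a user ID, a tab, and then a bracketed list of itemID:score pairs, which is my understanding of what RecommenderJob writes. The sample line and the function name are made up for illustration and are not taken from the real output.

```python
import re

# Matches one "itemID:score" pair inside the bracketed list.
PAIR = re.compile(r"(\d+):([0-9.Ee+-]+)")

def parse_recommendations(line):
    """Split one output line into (user_id, item_id, score) tuples.

    Assumed layout: "<userID>\t[<itemID>:<score>,<itemID>:<score>,...]"
    """
    user_part, _, rest = line.strip().partition("\t")
    user_id = int(user_part)
    return [(user_id, int(item), float(score))
            for item, score in PAIR.findall(rest)]

# Hypothetical sample line, for illustration only.
sample_line = "1\t[1351:5.0,225:5.0,68:4.5]"
rows = parse_recommendations(sample_line)
for user_id, item_id, score in rows:
    print(user_id, item_id, score)
```

Each tuple is one recommendation, so the list can be loaded straight into a table or spreadsheet for a closer look.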
To provide ranked recommendations, SIMILARITY_LOGLIKELIHOOD would work better. In summary, SIMILARITY_LOGLIKELIHOOD doesn’t use the user’s item ratings. Rather, it looks at the overlap between users and items: which items each user did and did not interact with. To me, log likelihood is kinda like gathering user feedback: focus on what users do, not what they say they do. Did the user interact with, purchase or consume an item? If yes, then that’s what log likelihood uses.
More information on how log likelihood works can be found here.
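For a rough feel of the arithmetic, here is a minimal Python sketch of the log-likelihood ratio on a 2x2 table of counts. It follows the standard entropy-based formulation; it is not copied from Mahout, so details may differ. The names k11 (users who interacted with both items), k12 (only the first item), k21 (only the second item) and k22 (neither) are my own labels. The final squashing of the ratio into a 0 to 1 similarity is an assumption about how the score is typically normalized.

```python
import math

def x_log_x(x):
    # By convention 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 contingency table of interaction counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))

def llr_similarity(k11, k12, k21, k22):
    # Assumed normalization: map the ratio into the range [0, 1).
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))

# Items that are consumed independently of each other score near zero.
print(llr_similarity(10, 10, 10, 10))
# Items that are always consumed together score close to one.
print(llr_similarity(10, 0, 0, 10))
```

Notice that no rating value appears anywhere in the calculation. Only the counts of who did and did not interact with each item matter.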
To use SIMILARITY_LOGLIKELIHOOD you only need a file with User ID and ID of the item/movie/etc. that the user has interacted with. Below we’ll use the same data file from the tutorial.
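To illustrate how little input is needed, the sketch below strips a ratings file down to user/item pairs. It assumes the tab-separated u.data layout of user ID, item ID, rating and timestamp. The function name, the comma-separated output layout and the sample lines are my own choices for illustration. The commands in this post don’t need this step, since they feed u.data in unchanged.

```python
def to_user_item_pairs(lines):
    """Yield 'userID,itemID' strings from tab-separated rating lines."""
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        user_id, item_id = fields[0], fields[1]
        yield f"{user_id},{item_id}"

# Illustrative sample lines in the assumed u.data layout.
sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742", ""]
pairs = list(to_user_item_pairs(sample))
print(pairs)
```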
Once you’re set up for the tutorial, there are two options to slightly change and re-run it.
A) If you used the PowerShell script in the “Run the job” section of the tutorial, just change the similarity value in jobArguments to “SIMILARITY_LOGLIKELIHOOD”.
View the output file “part-r-00000” per the tutorial.
B) Alternatively, I used the following command line with my local HDInsight Emulator:
Copy the u.data file from your local file system to HDFS:
hadoop fs -put u.data u.data
Run the recommender job with the log-likelihood similarity:
hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_LOGLIKELIHOOD --input u.data --output udata_output
Note: you may need to change the Mahout version above to match your local machine.
Get the output file from HDFS and put it back into your local file system:
hadoop fs -getmerge udata_output udata_output.txt
View the file udata_output.txt.
You’ll notice that the recommendations differ from the results you received previously using SIMILARITY_COOCCURRENCE. In udata_output.txt you can ignore the recommendation score and focus on the recommended item IDs.
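If you want to compare the two runs without eyeballing every line, a small Python sketch like the one below can list the item IDs that both runs recommend for each user. It assumes the output layout of a user ID, a tab, and a bracketed list of itemID:score pairs. The function names and sample lines are hypothetical; in practice you would read the lines from your two merged output files.

```python
import re

# Captures the item ID in each "itemID:score" pair; scores are ignored.
ITEM_ID = re.compile(r"(\d+):")

def load_item_ids(lines):
    """Map each user ID to the list of recommended item IDs."""
    recs = {}
    for line in lines:
        user, _, rest = line.strip().partition("\t")
        if not rest:
            continue  # skip blank or malformed lines
        recs[user] = [int(i) for i in ITEM_ID.findall(rest)]
    return recs

def overlap_by_user(recs_a, recs_b):
    """Item IDs recommended by both runs, for users present in both."""
    return {user: sorted(set(recs_a[user]) & set(recs_b[user]))
            for user in recs_a.keys() & recs_b.keys()}

# Hypothetical sample lines standing in for the two output files.
cooccurrence = ["1\t[10:5.0,20:4.9,30:4.8]", "2\t[40:5.0,50:4.0]"]
loglikelihood = ["1\t[20:0.9,30:0.8,60:0.7]", "2\t[70:0.9,80:0.8]"]

overlap = overlap_by_user(load_item_ids(cooccurrence),
                          load_item_ids(loglikelihood))
for user in sorted(overlap):
    print(user, overlap[user])
```

A user with an empty list got completely different recommendations from the two similarity measures.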
There’s a lot of data to look through, but I’ll let you determine whether SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD or another Mahout similarity metric will best meet your needs.
I hope this post provided additional insight.