Machine Learning with Apache Mahout: The Lay of the Land

First, I need to add the CSV file to the project. Create a new data subdirectory within the root folder and add the previously saved dataset1.csv file to this new subdirectory. Then, add a new Java class named GenericUserBasedRecommender1 in the src/main/java folder and include it in the com.first package. The following lines show the code for the new class.

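package com.first;

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class GenericUserBasedRecommender1 {

  public static void main(String[] args) throws Exception {
    // Create a data source from the CSV file
    File userPreferencesFile = new File("data/dataset1.csv");
    DataModel dataModel = new FileDataModel(userPreferencesFile);
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
    UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(2, userSimilarity, dataModel);

    // Create a generic user based recommender with the dataModel, the userNeighborhood and the userSimilarity
    Recommender genericRecommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

    // Generate a list of 3 recommended items for user 1001
    List<RecommendedItem> itemRecommendations = genericRecommender.recommend(1001, 3);

    // Display the item recommendations generated by the recommendation engine
    for (RecommendedItem recommendedItem : itemRecommendations) {
      System.out.println(recommendedItem);
    }
  }
}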

Execute the mvn compile Maven command to rebuild the recently modified project.

Then, execute the mvn exec:java -Dexec.mainClass="com.first.GenericUserBasedRecommender1" Maven command to run the built project. Notice that the main class is now com.first.GenericUserBasedRecommender1.

The following lines show the last lines of the output generated by the execution.

RecommendedItem[item:9010, value:9.500863]
RecommendedItem[item:9011, value:9.499137]
RecommendedItem[item:9012, value:8.499137]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.883s
[INFO] Finished at: Wed Oct 16 23:25:14 PST 2013
[INFO] Final Memory: 14M/154M
[INFO] ------------------------------------------------------------------------

The code is easy to understand and uses many Mahout classes to recommend the following three items to user 1001 with different score values:

  • Item 9010 with a value of 9.500863.
  • Item 9011 with a value of 9.499137.
  • Item 9012 with a value of 8.499137.

Thus, the first item that the recommender engine would suggest to user 1001 based on the preferences of similar users (neighbors) is item 9010, with the highest value of 9.500863.

How the Calculation Works

The code in the GenericUserBasedRecommender1.main method creates a data source from the data/dataset1.csv CSV file. The FileDataModel constructor receives the File instance containing the preferences data.
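As a rough illustration of what one line of such a preferences file carries, the following sketch parses a comma-separated line into a user ID, an item ID, and a preference value. It assumes that dataset1.csv uses the userID,itemID,preference layout. The PreferenceLineSketch class and the sample line are hypothetical and are not part of Mahout.

```java
final class PreferenceLineSketch {

    final long userID;
    final long itemID;
    final float value;

    PreferenceLineSketch(long userID, long itemID, float value) {
        this.userID = userID;
        this.itemID = itemID;
        this.value = value;
    }

    // Parses one "userID,itemID,preference" line (assumed layout).
    static PreferenceLineSketch parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected userID,itemID,preference: " + line);
        }
        return new PreferenceLineSketch(
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()),
                Float.parseFloat(fields[2].trim()));
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from dataset1.csv
        PreferenceLineSketch preference = parse("1001,9001,10");
        System.out.println(preference.userID + " rated " + preference.itemID + " with " + preference.value);
    }
}
```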

Then, the code uses the FileDataModel instance to create an instance of the PearsonCorrelationSimilarity class. This class provides an implementation of the Pearson correlation. For example, for two users, named user1 and user2, PearsonCorrelationSimilarity calculates the following values:

  • sumSquareUser1: Sum of the square of all the preference values for user1.
  • sumSquareUser2: Sum of the square of all the preference values for user2.
  • sumUser1XUser2: Sum of the product of the preference values for user1 and user2, for all the items that include preferences from both users.

Then, PearsonCorrelationSimilarity calculates the correlation with the following formula: sumUser1XUser2 / sqrt(sumSquareUser1 * sumSquareUser2). The correlation centers the data: it shifts each user's preference values so that their mean equals 0. On centered data, the Pearson correlation is equivalent to the cosine similarity, so you can interpret it as the cosine of the angle between the two vectors of centered preference values.
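The following lines sketch this calculation in plain Java, without any Mahout dependency. This is a minimal illustration rather than Mahout's implementation: the PearsonSketch class is hypothetical, it assumes both arrays hold preferences for the same items in the same order, and returning NaN when a user's preferences show no variation is a choice made for this sketch.

```java
import java.util.Arrays;

final class PearsonSketch {

    // Pearson correlation of two preference arrays, where index i holds
    // both users' preference values for the same item.
    static double pearson(double[] user1, double[] user2) {
        if (user1.length != user2.length || user1.length == 0) {
            throw new IllegalArgumentException("need the same non-zero number of preferences");
        }
        double mean1 = Arrays.stream(user1).average().orElse(0.0);
        double mean2 = Arrays.stream(user2).average().orElse(0.0);

        double sumSquareUser1 = 0.0;
        double sumSquareUser2 = 0.0;
        double sumUser1XUser2 = 0.0;
        for (int i = 0; i < user1.length; i++) {
            // Center each value so that each user's mean becomes 0
            double x = user1[i] - mean1;
            double y = user2[i] - mean2;
            sumSquareUser1 += x * x;
            sumSquareUser2 += y * y;
            sumUser1XUser2 += x * y;
        }

        double denominator = Math.sqrt(sumSquareUser1 * sumSquareUser2);
        // The correlation is undefined when either user has no variation
        return denominator == 0.0 ? Double.NaN : sumUser1XUser2 / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical preference values for three items rated by both users
        double[] user1 = {8.0, 9.0, 10.0};
        double[] user2 = {4.0, 5.0, 6.0};
        System.out.println(pearson(user1, user2));
    }
}
```

Because the values are centered first, two users who rate on different scales but in the same relative order still get a high correlation.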

Next, the code uses the FileDataModel and the PearsonCorrelationSimilarity instances to create an instance of the NearestNUserNeighborhood class. This class computes a neighborhood consisting of the two nearest users to a given user because the n argument that defines the neighborhood size is set to 2. There are other constructors for this class that allow you to specify values for additional arguments.
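The idea of a nearest-n neighborhood can be sketched in plain Java as follows. This is only an illustration under assumptions, not Mahout's code: the NearestNSketch class, the user IDs, and the similarity values are hypothetical, and skipping undefined (NaN) similarities is a choice made for this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class NearestNSketch {

    // Returns the IDs of the n users with the highest similarity to a target user.
    // The map goes from another user's ID to that user's similarity with the target.
    static List<Long> nearestN(Map<Long, Double> similarityToTarget, int n) {
        return similarityToTarget.entrySet().stream()
                .filter(entry -> !Double.isNaN(entry.getValue()))                  // skip undefined similarities
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))      // highest similarity first
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical similarities between a target user and four other users
        Map<Long, Double> similarities = new HashMap<>();
        similarities.put(1002L, 0.9);
        similarities.put(1003L, -0.2);
        similarities.put(1004L, 0.5);
        similarities.put(1005L, Double.NaN);
        System.out.println(nearestN(similarities, 2));
    }
}
```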

The code creates a generic-user-based recommender (GenericUserBasedRecommender) instance with the FileDataModel, the NearestNUserNeighborhood, and the PearsonCorrelationSimilarity instances. Then, it is simply necessary to call the recommend method for the new GenericUserBasedRecommender instance with the user ID and the desired number of recommendations to generate. This method returns a List<RecommendedItem>. Each RecommendedItem instance encapsulates a recommended item and includes the item ID (recommendedItem.getItemID()) and a float value (recommendedItem.getValue()) that expresses the strength of the preference. A simple for loop displays each RecommendedItem in the console.
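To give an intuition for where a recommendation value can come from, the following sketch estimates a preference for one item as the similarity-weighted average of the neighbors' preferences for that item. This is a common approach in user-based recommenders and is shown here as an assumption, not as a description of Mahout's exact behavior. The EstimateSketch class and the sample numbers are hypothetical.

```java
final class EstimateSketch {

    // Estimates a preference for one item as the similarity-weighted average
    // of the neighbors' preferences for that item.
    // similarities[i] and neighborPreferences[i] belong to the same neighbor.
    static double estimate(double[] similarities, double[] neighborPreferences) {
        if (similarities.length != neighborPreferences.length) {
            throw new IllegalArgumentException("one similarity per neighbor preference is required");
        }
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            weightedSum += similarities[i] * neighborPreferences[i];
            totalSimilarity += similarities[i];
        }
        // Without any similarity weight there is nothing to base an estimate on
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        // Two hypothetical neighbors with equal similarity who rated the item 9 and 10
        double[] similarities = {0.5, 0.5};
        double[] neighborPreferences = {9.0, 10.0};
        System.out.println(estimate(similarities, neighborPreferences));
    }
}
```

With a neighborhood of size 2, an estimate of this kind always lies between the two neighbors' preference values when both similarities are positive, and it leans toward the more similar neighbor.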

This shows how you can use one of the Mahout recommender engines with just a few lines of code. In my example, the code uses a simple CSV file as the data source, but it is just as easy to work with larger and more complex data sources. In addition, several Mahout features run on top of Apache Hadoop and take advantage of its great scalability. In the next article, I'll discuss more-advanced machine learning algorithms included in Apache Mahout, which you can also use with just a few lines of code.

Gaston Hillar is a frequent contributor to Dr. Dobb's.
