Machine Learning with Apache Mahout: Refining the Recommender


You can easily change the user-similarity component by replacing the line that assigns a value to userSimilarity. For example, you can use the EuclideanDistanceSimilarity class instead of the PearsonCorrelationSimilarity. The EuclideanDistanceSimilarity class provides a similarity metric that considers users as points in a space of as many dimensions as there are items. The coordinates are then the preference values. The class computes the Euclidean distance between two user points.

UserSimilarity userSimilarity = new EuclideanDistanceSimilarity(dataModel);
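To make the "users as points" idea concrete, here is a minimal, library-free sketch. It is not Mahout's implementation: the class name EuclideanSimilaritySketch, the restriction to items that both users rated, and the 1 / (1 + distance) mapping from distance to similarity are assumptions made for illustration, and Mahout's exact normalization may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not Mahout's EuclideanDistanceSimilarity. */
final class EuclideanSimilaritySketch {

    /**
     * Treats each user as a point whose coordinates are preference values,
     * using only the items that both users have rated.
     * Returns 1 / (1 + distance), so identical preferences give 1.0.
     * Returns Double.NaN when the two users share no items.
     */
    static double similarity(Map<Long, Double> prefsA, Map<Long, Double> prefsB) {
        double sumOfSquares = 0.0;
        int sharedItems = 0;
        for (Map.Entry<Long, Double> entry : prefsA.entrySet()) {
            Double other = prefsB.get(entry.getKey());
            if (other != null) {
                double diff = entry.getValue() - other;
                sumOfSquares += diff * diff;
                sharedItems++;
            }
        }
        if (sharedItems == 0) {
            return Double.NaN;
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    public static void main(String[] args) {
        // Hypothetical preferences keyed by item ID
        Map<Long, Double> alice = new HashMap<>();
        alice.put(101L, 5.0);
        alice.put(102L, 3.0);
        Map<Long, Double> bob = new HashMap<>();
        bob.put(101L, 2.0);
        bob.put(102L, 7.0);
        bob.put(103L, 4.0); // not rated by alice, so it is ignored
        // The shared-item differences are 3 and -4, so the distance is 5.0
        // and the similarity is 1 / 6.
        System.out.println(similarity(alice, bob));
    }
}
```

The smaller the distance between two users, the closer the result gets to 1.0, which is why a distance can serve as a similarity metric.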

With the Euclidean distance between two user points as the new user similarity computation, you also need to replace the neighborhood calculation with the following line, which uses the 10 nearest users as the neighborhood, so that the recommender produces a reasonable number of recommendations:

UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(10, userSimilarity, dataModel);
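The idea behind a nearest-N neighborhood can be sketched without the library: rank the other users by their similarity to the target user and keep the top N. The class NearestNSketch and its nearest method are hypothetical names, and skipping users whose similarity is undefined (NaN) is an assumption of this sketch rather than a statement about Mahout's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not Mahout's NearestNUserNeighborhood. */
final class NearestNSketch {

    /**
     * Returns up to n user IDs with the highest similarity to the target user,
     * most similar first. Users with an undefined (NaN) similarity are skipped.
     */
    static List<Long> nearest(int n, Map<Long, Double> similarityToTarget) {
        List<Map.Entry<Long, Double>> candidates = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : similarityToTarget.entrySet()) {
            if (!Double.isNaN(entry.getValue())) {
                candidates.add(entry);
            }
        }
        // Sort by similarity, highest first
        candidates.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> neighborhood = new ArrayList<>();
        for (int i = 0; i < Math.min(n, candidates.size()); i++) {
            neighborhood.add(candidates.get(i).getKey());
        }
        return neighborhood;
    }
}
```

A larger N gives the recommender more neighbors to draw candidate items from, at the cost of including users who are less similar to the target.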

Mahout includes other classes with additional algorithms to compute user similarity, such as:

  • LogLikelihoodSimilarity: Ignores preference values and determines a value of how unlikely it is for two users to have so much overlap based on the total number of items and the number of items each user has a preference for.
  • SpearmanCorrelationSimilarity: Implements the Spearman correlation, which computes a correlation based on the relative rank of preference values instead of the original preference values. This correlation is a variant of the Pearson correlation, and it takes more time because it must compute and store the relative ranks.
  • TanimotoCoefficientSimilarity: Ignores preference values and just considers whether a user expresses a preference for an item. This implementation is based on the Tanimoto coefficient, also known as the Jaccard index or Jaccard similarity coefficient.
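Of the three, the Tanimoto coefficient is the simplest to illustrate: it is the number of items both users have expressed a preference for, divided by the number of items at least one of them has expressed a preference for. The following is a minimal sketch with a hypothetical class name, not the Mahout class itself.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; not Mahout's TanimotoCoefficientSimilarity. */
final class TanimotoSketch {

    /**
     * Size of the intersection divided by the size of the union of the two
     * item sets. Preference values play no role; only the presence of a
     * preference matters. Returns Double.NaN when both sets are empty.
     */
    static double similarity(Set<Long> itemsA, Set<Long> itemsB) {
        if (itemsA.isEmpty() && itemsB.isEmpty()) {
            return Double.NaN;
        }
        Set<Long> intersection = new HashSet<>(itemsA);
        intersection.retainAll(itemsB);
        int unionSize = itemsA.size() + itemsB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

For example, users with item sets {1, 2, 3} and {2, 3, 4} share two items out of four distinct items, so the coefficient is 0.5.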

As the data sets grow and the algorithms are more complex, it becomes convenient to use one of the caching wrapper implementations that Mahout provides. For example, the org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity class caches the results from an underlying UserSimilarity implementation. The CachingUserSimilarity class delegates the similarity computation to the UserSimilarity implementation, and caches the results internally. The caching mechanism increases the memory consumption but reduces the required computation, so it's useful when the user similarity computation costs are high.

The following line assigns userSimilarity an instance of CachingUserSimilarity in its basic usage, wrapping an underlying SpearmanCorrelationSimilarity. In this case, the cache determines its size according to properties of the data model.

UserSimilarity userSimilarity = new CachingUserSimilarity(new SpearmanCorrelationSimilarity(dataModel), dataModel);
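The delegate-and-cache pattern described above can be sketched in plain Java. Everything here is hypothetical: the class CachingSimilaritySketch, the string cache key, and the unbounded HashMap are simplifications for illustration (the text above notes that Mahout sizes its cache from the data model), and the sketch assumes the underlying similarity is symmetric.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only; not Mahout's CachingUserSimilarity. */
final class CachingSimilaritySketch {

    private final ToDoubleBiFunction<Long, Long> delegate;
    private final Map<String, Double> cache = new HashMap<>();
    private int delegateCalls = 0;

    CachingSimilaritySketch(ToDoubleBiFunction<Long, Long> delegate) {
        this.delegate = delegate;
    }

    /** Returns the cached value if present; otherwise computes and stores it. */
    double userSimilarity(long userA, long userB) {
        // Assumes similarity(a, b) == similarity(b, a), so the pair is
        // ordered to share a single cache entry.
        long low = Math.min(userA, userB);
        long high = Math.max(userA, userB);
        String key = low + ":" + high;
        Double cached = cache.get(key);
        if (cached == null) {
            delegateCalls++;
            cached = delegate.applyAsDouble(low, high);
            cache.put(key, cached);
        }
        return cached;
    }

    /** Number of times the expensive underlying computation actually ran. */
    int delegateCalls() {
        return delegateCalls;
    }
}
```

Asking for the similarity of users 1 and 3 and then of users 3 and 1 invokes the delegate only once; the second call is served from memory, which is the trade of memory for computation that the caching wrappers make.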

Evaluating the Accuracy of a User-Based Recommender

Mahout provides components that enable you to evaluate the quality of the estimated preference values of the recommendations generated by a recommender. You can evaluate how closely the generated preferences match the actual preferences. You can instruct Mahout to use a part of the real data set as test data and to remove this test data from the data set that the recommender uses as the input. Then, Mahout can calculate a score based on the recommendations, comparing them with the separated test data. You can use different recommender evaluators that produce scores with diverse meanings.

Listing One shows the use of the org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator to generate a score value for one of the previous examples of the GenericUserBasedRecommender. Notice that it is necessary to implement the org.apache.mahout.cf.taste.eval.RecommenderBuilder interface, here as an anonymous inner class, to create the Recommender that the AverageAbsoluteDifferenceRecommenderEvaluator evaluates.

Listing One

package com.first;

import java.io.*;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;

public class GenericUserBasedRecommender1 {

  public static void main(String[] args) throws Exception {
	  // Create a data source from the CSV file
	  File userPreferencesFile = new File("data/dataset1.csv");
	  DataModel dataModel = new FileDataModel(userPreferencesFile);
	  
	  RecommenderEvaluator recommenderEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();	  
	 	  
	  RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
		  @Override
		  public Recommender buildRecommender(DataModel dataModel) throws TasteException {
			  UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
			  UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(10, userSimilarity, dataModel);

			  // Return a new instance of a generic user based recommender with the dataModel, the userNeighborhood and the userSimilarity
			  return new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
		  }
	  };
	  	  
	  // Build a model with 80% training percentage 
	  double score = recommenderEvaluator.evaluate(recommenderBuilder, null, dataModel, 0.80, 1.0);
	  System.out.format("The recommender evaluation score is %f%n", score);
  }
}

AverageAbsoluteDifferenceRecommenderEvaluator computes the average absolute difference between predicted and actual ratings for users (the mean absolute error). Thus, the lower the score generated by AverageAbsoluteDifferenceRecommenderEvaluator, the better the recommendations. The lowest possible score is 0, which indicates a perfect match between estimated and actual preferences. The RecommenderBuilder that is passed as an argument to the evaluate method builds the recommender to test on top of the given data model. It is necessary to specify the percentage of the preferences supplied by the given data model to be used as training data. The code in Listing One specifies the use of 80% of the total preferences as training data. This way, the evaluator uses 80% of each user's preferences to produce recommendations, and the remaining 20% are compared to the estimated preference values to evaluate the recommender's performance. The final argument, 1.0, is the evaluation percentage and tells the evaluator to use 100% of the users in the evaluation.
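The score itself is simple arithmetic. The following sketch, with a hypothetical class name and hypothetical ratings, computes the average absolute difference for held-out preferences. Skipping preferences for which no estimate is available (represented here as NaN) is an assumption of the sketch.

```java
/** Illustrative sketch only; not Mahout's evaluator. */
final class MeanAbsoluteErrorSketch {

    /**
     * Averages |estimated - actual| over the held-out preferences.
     * Entries with no estimate (NaN) are skipped. Returns Double.NaN
     * if nothing could be compared.
     */
    static double meanAbsoluteError(double[] actual, double[] estimated) {
        if (actual.length != estimated.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double total = 0.0;
        int counted = 0;
        for (int i = 0; i < actual.length; i++) {
            if (Double.isNaN(estimated[i])) {
                continue; // the recommender produced no estimate for this item
            }
            total += Math.abs(estimated[i] - actual[i]);
            counted++;
        }
        return counted == 0 ? Double.NaN : total / counted;
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};    // hypothetical held-out ratings
        double[] estimated = {3.5, 3.0, 4.0}; // hypothetical estimates
        // Absolute errors are 0.5, 0.0 and 1.0, so this prints 0.5
        System.out.println(meanAbsoluteError(actual, estimated));
    }
}
```

A score of 0.5 means the estimates were off by half a rating point on average.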

Mahout also provides the following two recommender evaluators, which generate different scores:

  • RMSRecommenderEvaluator: Computes the root mean squared difference between predicted and actual ratings for users (the square root of the average of the squared differences).
  • Track1RecommenderEvaluator: Computes the root mean square error (RMSE) of the validation data set against the predicted ratings from the training data set. The Track1 prefix is included in the name because the algorithm attempts to run an evaluation like the one dictated for Yahoo's KDD Cup, Track 1.
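To show how a root-mean-squared score differs from an average absolute difference, here is a small sketch with a hypothetical class name and hypothetical ratings. Because each difference is squared before averaging, a single large miss weighs more heavily than several small ones.

```java
/** Illustrative sketch only; not Mahout's RMSRecommenderEvaluator. */
final class RootMeanSquaredErrorSketch {

    /** Square root of the average of the squared differences. */
    static double rootMeanSquaredError(double[] actual, double[] estimated) {
        if (actual.length != estimated.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = estimated[i] - actual[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {3.0, 3.0};    // hypothetical held-out ratings
        double[] estimated = {3.0, 5.0}; // one exact estimate, one off by 2
        // The average absolute difference here would be 1.0, but the
        // root-mean-squared score is sqrt((0 + 4) / 2), about 1.414.
        System.out.println(rootMeanSquaredError(actual, estimated));
    }
}
```

As with the average absolute difference, lower is better and 0 is a perfect match.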

The evaluators choose the test data at random, so results might vary on each execution. You can force the same random choices for each execution by calling the org.apache.mahout.common.RandomUtils.useTestSeed() method before you create the other components. When you call this method, you make all the randomness in the project predictable and repeatable, so you can compare scores as you make changes to the recommender components.
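The principle behind a fixed seed can be shown with java.util.Random alone. This sketch does not use Mahout's RandomUtils; the class SeededSplitSketch, its method, and the per-preference coin flip are assumptions made to illustrate why a fixed seed yields the same training and test split on every run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; it does not call Mahout's RandomUtils. */
final class SeededSplitSketch {

    /**
     * Randomly assigns each preference index to the training set with the
     * given probability. The indices not returned form the test set.
     * The same seed always produces the same split.
     */
    static List<Integer> trainingIndices(int preferenceCount, double trainingPercentage, long seed) {
        Random random = new Random(seed);
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < preferenceCount; i++) {
            if (random.nextDouble() < trainingPercentage) {
                training.add(i);
            }
        }
        return training;
    }

    public static void main(String[] args) {
        // Two runs with the same (hypothetical) seed select identical indices
        List<Integer> firstRun = trainingIndices(20, 0.80, 42L);
        List<Integer> secondRun = trainingIndices(20, 0.80, 42L);
        System.out.println(firstRun.equals(secondRun)); // prints true
    }
}
```

With an unseeded generator, the two runs would usually select different preferences, and the resulting scores would not be directly comparable.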

Evaluating a User-Based Recommender with Precision and Recall

Mahout also provides components that enable you to evaluate the following two aspects of a recommender that are well-known in the information-retrieval realm:

  • Precision, the fraction of the top recommendations that are good suggested items for the user. It is also known as "positive predictive value."
  • Recall, the fraction of good suggested items that appear in the top recommendations. It is also known as "sensitivity."

More detailed definitions of precision and recall are available, but remember that in this example, precision and recall are related to the evaluation of a recommender.
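In the recommender setting, both values reduce to counting how many of the top recommendations are among the items considered good for the user. The sketch below uses hypothetical names and item IDs, and it assumes the recommendation list contains no duplicates.

```java
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; not Mahout's IR statistics evaluator. */
final class PrecisionRecallSketch {

    /** Counts the recommended items that are also in the set of good items. */
    private static int hits(List<Long> topRecommendations, Set<Long> goodItems) {
        int hits = 0;
        for (Long itemId : topRecommendations) {
            if (goodItems.contains(itemId)) {
                hits++;
            }
        }
        return hits;
    }

    /** Fraction of the top recommendations that are good items. */
    static double precision(List<Long> topRecommendations, Set<Long> goodItems) {
        if (topRecommendations.isEmpty()) {
            return Double.NaN;
        }
        return (double) hits(topRecommendations, goodItems) / topRecommendations.size();
    }

    /** Fraction of the good items that appear in the top recommendations. */
    static double recall(List<Long> topRecommendations, Set<Long> goodItems) {
        if (goodItems.isEmpty()) {
            return Double.NaN;
        }
        return (double) hits(topRecommendations, goodItems) / goodItems.size();
    }
}
```

For example, if the top two recommendations are items 10 and 20 and the good items for the user are 20, 30, and 40, then one of the two recommendations is good (precision 0.5) and one of the three good items was recommended (recall of about 0.33).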

Listing Two shows the use of the org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator class to compute both the precision and recall values for one of the previous examples of the GenericUserBasedRecommender. Again, it is necessary to implement the org.apache.mahout.cf.taste.eval.RecommenderBuilder interface, here as an anonymous inner class, to create the recommender that the RecommenderIRStatsEvaluator evaluates. In this case, the call to the RandomUtils.useTestSeed() method forces the same random choices for each execution.

Listing Two

package com.first;

import java.io.*;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;
import org.apache.mahout.common.RandomUtils;

public class GenericUserBasedRecommender1 {

  public static void main(String[] args) throws Exception {
	  // Create a data source from the CSV file
	  File userPreferencesFile = new File("data/dataset1.csv");
	  RandomUtils.useTestSeed();
	  
	  DataModel dataModel = new FileDataModel(userPreferencesFile);
	  
	  RecommenderIRStatsEvaluator recommenderEvaluator = new GenericRecommenderIRStatsEvaluator();
	 	  
	  RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
		  @Override
		  public Recommender buildRecommender(DataModel dataModel) throws TasteException {
			  UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
			  UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(10, userSimilarity, dataModel);

			  // Return a new instance of a generic user based recommender with the dataModel, the userNeighborhood and the userSimilarity
			  return new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
		  }
	  };
	  
	  IRStatistics statistics = 
			  recommenderEvaluator.evaluate(
					  recommenderBuilder, null, dataModel, 
					  null, 2, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
	  System.out.format("The recommender precision is %f%n", statistics.getPrecision());
	  System.out.format("The recommender recall is %f%n", statistics.getRecall());
  }
}

GenericRecommenderIRStatsEvaluator computes both the precision and recall at 2 for the recommender; that is, the call to the evaluate method specifies 2 for the at parameter, so only the top 2 recommendations are considered. Passing GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD as the relevance threshold asks the evaluator to choose a threshold for deciding which items count as good for each user. The evaluation percentage is set to 100% (1.0). You can evaluate the precision and recall as you make changes to the recommender components and parameters in order to determine the best options for your problem domain and your data model.

Mahout also provides classes that supply item-based recommenders and cluster-based recommendation features. For example, the org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender class provides an item-based recommender that works with a data model and an implementation of the org.apache.mahout.cf.taste.similarity.ItemSimilarity interface. ItemSimilarity computes the degree of similarity for two items based on the preferences that users have expressed for these items. You can explore the different recommenders by making small changes to the previous examples.

Apache Mahout makes it easy to start exploring machine learning algorithms. You can work with existing data models and test different components to generate all kinds of recommendations. Mahout components provide implementations of popular algorithms that you can easily plug in and unplug based on your requirements. The examples introduced in this series should allow you to start working with recommenders and begin evaluating their accuracy. As your data sets increase in size, you might want to explore other recommenders, such as those built on top of Apache Hadoop.


Gaston Hillar is a frequent contributor to Dr. Dobb's.

Related Article

Machine Learning with Apache Mahout: The Lay of the Land

