Machine Learning with Apache Mahout: The Lay of the Land


Building intelligent applications that learn from user input and data they process is becoming a popular requirement, and these applications require machine learning techniques.

Apache Mahout is a highly scalable machine learning library that enables developers to use optimized algorithms, such as collaborative filtering and random forest decision-tree-based classifiers. As such, Apache Mahout is becoming one of the most popular libraries for machine-learning projects. In this first of a pair of articles, I'll explain how to create a Mahout recommender by taking advantage of one of its collaborative filtering algorithms.

Working with Collaborative Filtering Recommenders

If you have visited e-commerce or social network websites, you've probably seen a recommender engine in action. Recommender engines try to infer tastes and preferences for a user based on his or her past actions and similarities to other users. In addition, recommender engines try to identify unknown items that might be of interest to users.

People's likes and dislikes follow patterns. For example, people tend to like things that are similar to other things they like, and they tend to like things that similar people like. Recommendation algorithms use these patterns to predict likes and dislikes. It is possible to generate recommendations based on either users or items.

Apache Mahout implements a wide range of machine-learning and data-mining algorithms. However, Mahout has a specific focus on collaborative filtering (recommender engines), clustering, and classification. Here, I'll focus on one of the recommender engines that Mahout includes out of the box. For reference, you can peruse the official instructions to check out the code and build the latest Mahout version.

"Collaborative filtering" recommenders, such as the one I'm going to look at, require you to specify a relationship between the users and the items. The collaborative filtering recommender engine doesn't need to know details about the properties for each item to produce a recommendation. Mahout provides a collaborative filtering framework that enables you to use a simple input, and generate recommendations based on this input. In addition, you can build a domain-specific content-based recommender that considers the specific attributes of either the items or the users on top of the framework that Mahout provides.

A small database with relationships between users and items makes it easy to understand how collaborative filtering recommenders work in Mahout. Consider the following IDs for six users:

  • 1001
  • 1002
  • 1003
  • 1004
  • 1005
  • 1006

Each user has one or more scores that indicate their preference for each item ID. The score is a value from 1 to 10. The item IDs start with a 9 prefix to easily differentiate them from the user IDs. Figure 1 shows the six users (blue circles) with relationships to the different items (orange circles) and the score values represented by lines with different colors according to the following ranges:

  • Score value from 1 to 4: the user dislikes the item (red solid line).
  • Score value from 5 to 7: the user likes the item, but isn't excited about it and has some criticisms (red dashed line).
  • Score value from 8 to 10: the user really likes the item (green line).
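The three ranges above can be expressed as a small lookup. The following is a minimal sketch; the class name ScoreBands and the band names are hypothetical labels I made up for illustration and are not part of Mahout.

```java
// Hypothetical helper: maps a 1-10 score to the three ranges used in Figure 1.
class ScoreBands {
    enum Band { DISLIKES, LIKES_WITH_CRITICISMS, REALLY_LIKES }

    static Band bandFor(int score) {
        if (score < 1 || score > 10) {
            throw new IllegalArgumentException("score must be between 1 and 10: " + score);
        }
        if (score <= 4) {
            return Band.DISLIKES;              // 1 to 4
        }
        if (score <= 7) {
            return Band.LIKES_WITH_CRITICISMS; // 5 to 7
        }
        return Band.REALLY_LIKES;              // 8 to 10
    }

    public static void main(String[] args) {
        // Print the band for one score from each range.
        System.out.println(bandFor(3));
        System.out.println(bandFor(6));
        System.out.println(bandFor(9));
    }
}
```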

Figure 1: Six users (1001-1006) and how much they like items (9001-9015).

You can see that user 1001 very much likes items 9001 and 9003, but this user doesn't like item 9002. Based on the preferences of other users that have similar tastes to user 1001, I want to know the best items to recommend to user 1001.
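One common way to quantify "similar tastes" is the Pearson correlation between two users' scores for the items they have both rated. The sketch below is plain Java written for illustration, not Mahout code; the class PearsonSketch and its helper methods are hypothetical names. The scores are the ones that users 1001, 1002, and 1004 have in the data listing that follows.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: Pearson correlation over the items two users have both scored.
class PearsonSketch {
    /** Returns a value between -1 and 1, or NaN when the correlation is undefined. */
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // only items scored by both users count
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator > 0 ? covariance / denominator : Double.NaN;
    }

    /** Hypothetical convenience method: builds an itemId -> score map. */
    static Map<Long, Double> prefs(long[] itemIds, double[] scores) {
        Map<Long, Double> result = new HashMap<>();
        for (int i = 0; i < itemIds.length; i++) {
            result.put(itemIds[i], scores[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Double> u1001 = prefs(new long[] {9001, 9002, 9003}, new double[] {10, 1, 9});
        Map<Long, Double> u1002 = prefs(new long[] {9001, 9002, 9003, 9004}, new double[] {3, 5, 1, 10});
        Map<Long, Double> u1004 = prefs(new long[] {9001, 9002, 9003, 9004}, new double[] {9, 2, 8, 3});
        System.out.println("1001 vs 1002: " + similarity(u1001, u1002));
        System.out.println("1001 vs 1004: " + similarity(u1001, u1004));
    }
}
```

With these scores, user 1004 correlates strongly and positively with user 1001, while user 1002 correlates negatively, which matches what the lines in Figure 1 suggest.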

The following data listing shows the contents of a comma-separated values (CSV) file that defines the input data represented in Figure 1, with the user IDs, the item IDs, and the score values. You should create a text file named dataset1.csv because you will use it later.

#userId, itemId, score
1001,9001,10
1001,9002,1
1001,9003,9
1002,9001,3
1002,9002,5
1002,9003,1
1002,9004,10
1003,9001,2
1003,9002,6
1003,9003,2
1003,9004,9
1003,9005,10
1003,9006,8
1003,9007,9
1004,9001,9
1004,9002,2
1004,9003,8
1004,9004,3
1004,9010,10
1004,9011,9
1004,9012,8
1005,9001,8
1005,9002,3
1005,9003,7
1005,9004,1
1005,9010,9
1005,9011,10
1005,9012,9
1005,9013,8
1005,9014,1
1005,9015,1
1006,9001,7
1006,9002,4
1006,9003,8
1006,9004,1
1006,9010,7
1006,9011,6
1006,9012,9

It is possible to use a CSV file as the input for a Mahout recommender engine and generate a specific number of recommendations for one of the users with just a few lines of code.
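To make the file format concrete before bringing Mahout into the picture, here is a minimal plain-Java sketch that parses lines in this format into a nested map of user ID to item ID to score. The class name CsvPreferences is hypothetical, and treating lines that start with # as comments is a choice I made in this sketch so that the header line is skipped.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hand-written parser for the userId,itemId,score format.
class CsvPreferences {
    /** Parses "userId,itemId,score" lines into userId -> (itemId -> score). */
    static Map<Long, Map<Long, Double>> parse(List<String> lines) {
        Map<Long, Map<Long, Double>> prefs = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and the header comment
            }
            String[] fields = line.split(",");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + line);
            }
            long userId = Long.parseLong(fields[0].trim());
            long itemId = Long.parseLong(fields[1].trim());
            double score = Double.parseDouble(fields[2].trim());
            prefs.computeIfAbsent(userId, k -> new LinkedHashMap<>()).put(itemId, score);
        }
        return prefs;
    }

    public static void main(String[] args) {
        // The first rows of dataset1.csv, inlined so the sketch runs without a file.
        List<String> sample = Arrays.asList(
            "#userId, itemId, score",
            "1001,9001,10",
            "1001,9002,1",
            "1001,9003,9",
            "1002,9001,3");
        System.out.println(parse(sample));
    }
}
```

To read the real file, you could pass the result of java.nio.file.Files.readAllLines for dataset1.csv to parse instead of the inlined sample.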

Creating a New Mahout Project with Maven and Eclipse

Mahout requires both Maven and a Java JDK, and I assume you've already built Mahout and that you have Maven installed. Follow these steps to create a new Mahout project with Maven. I've also added the necessary steps to work with the Eclipse IDE. You can skip the steps related to Eclipse if you are using another IDE.

  • Open a command prompt or console in your operating system.
  • Go to your Mahout folder.
  • Run the Maven command to create an empty project named firstrecommender with the package namespace com.first: mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.first -DartifactId=firstrecommender
  • Go to the firstrecommender folder that Maven has created for you with the new project.
  • Execute the mvn compile Maven command to build the recently created project, which contains some code to display Hello World!
  • Now, execute the mvn exec:java -Dexec.mainClass="com.first.App" Maven command to run the built project. Notice that the main class is com.first.App. This class has a main method with a single line of code: System.out.println( "Hello World!" );.

Of course, a project that displays a Hello World! message isn't our goal. But you can use it as a template to start working with the different Mahout libraries. Import the Maven project into Eclipse or your favorite IDE to see the structure of the project (see Figure 2). The src/main/java folder includes the com.first package with the App.java file, which defines the com.first.App class.

Figure 2: The initial structure for the generated project in Eclipse.

Find the pom.xml file within the project's root folder (shown at the bottom of the file panel in Figure 2). The following lines show the initial content of this file, with just a junit dependency. In my case, the Mahout version is 0.8. If you are working with a different Mahout version, a different value will appear for version.

<?xml version="1.0"?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout</artifactId>
    <version>0.8</version>
  </parent>
  <groupId>com.first</groupId>
  <artifactId>firstrecommender</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>firstrecommender</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Because you want to use the Mahout libraries to create a recommender, you must add them as dependencies in the pom.xml file. The following lines show the edited pom.xml with three new dependency entries: mahout-core, mahout-math, and the mahout-math test JAR. In addition, there is a value specified for parent/relativePath to set the relative path to the parent project.

<?xml version="1.0"?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout</artifactId>
    <version>0.8</version>
    <relativePath>../pom.xml</relativePath>    
  </parent>
  <groupId>com.first</groupId>
  <artifactId>firstrecommender</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>firstrecommender</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.8</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.8</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.8</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Working with a Generic-User-Based Recommender

The org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender class implements a recommender that uses a DataModel and a UserNeighborhood to produce recommendations. The org.apache.mahout.cf.taste.model.DataModel implementations represent a repository of information about users and their associated preferences for items. I will use the CSV file created above as the DataModel. The org.apache.mahout.cf.taste.neighborhood.UserNeighborhood implementations compute a neighborhood of users similar to a given user, and the recommender engine can use this neighborhood to compute recommendations.
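To show how a data model, a neighborhood, and a recommender fit together, here is a simplified plain-Java sketch of the idea. It is not Mahout's implementation and does not use the Mahout API. The class UserBasedSketch, its method names, the Pearson similarity, and the similarity-weighted average are all assumptions I chose for illustration.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: a tiny user-based recommender over the dataset1.csv rows.
class UserBasedSketch {
    // {userId, itemId, score} rows copied from dataset1.csv.
    static final int[][] ROWS = {
        {1001,9001,10},{1001,9002,1},{1001,9003,9},
        {1002,9001,3},{1002,9002,5},{1002,9003,1},{1002,9004,10},
        {1003,9001,2},{1003,9002,6},{1003,9003,2},{1003,9004,9},
        {1003,9005,10},{1003,9006,8},{1003,9007,9},
        {1004,9001,9},{1004,9002,2},{1004,9003,8},{1004,9004,3},
        {1004,9010,10},{1004,9011,9},{1004,9012,8},
        {1005,9001,8},{1005,9002,3},{1005,9003,7},{1005,9004,1},{1005,9010,9},
        {1005,9011,10},{1005,9012,9},{1005,9013,8},{1005,9014,1},{1005,9015,1},
        {1006,9001,7},{1006,9002,4},{1006,9003,8},{1006,9004,1},
        {1006,9010,7},{1006,9011,6},{1006,9012,9}
    };

    /** Data model: userId -> (itemId -> score). */
    static Map<Integer, Map<Integer, Double>> model() {
        Map<Integer, Map<Integer, Double>> m = new HashMap<>();
        for (int[] r : ROWS) {
            m.computeIfAbsent(r[0], k -> new HashMap<>()).put(r[1], (double) r[2]);
        }
        return m;
    }

    /** Pearson correlation over co-rated items; NaN when undefined. */
    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        int n = 0;
        double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double y = b.get(e.getKey());
            if (y == null) continue;
            double x = e.getValue();
            n++; sa += x; sb += y; saa += x * x; sbb += y * y; sab += x * y;
        }
        if (n < 2) return Double.NaN;
        double denom = Math.sqrt((saa - sa * sa / n) * (sbb - sb * sb / n));
        return denom > 0 ? (sab - sa * sb / n) / denom : Double.NaN;
    }

    /** Neighborhood: the n users most similar to userId, positive similarity only. */
    static List<Integer> neighborhood(Map<Integer, Map<Integer, Double>> model, int userId, int n) {
        Map<Integer, Double> target = model.get(userId);
        return model.keySet().stream()
            .filter(other -> other != userId)
            .filter(other -> similarity(target, model.get(other)) > 0) // also drops NaN
            .sorted(Comparator.comparingDouble((Integer other) -> -similarity(target, model.get(other))))
            .limit(n)
            .collect(Collectors.toList());
    }

    /** Recommender: ranks unseen items by the neighbors' similarity-weighted average score. */
    static List<Integer> recommend(Map<Integer, Map<Integer, Double>> model, int userId, int n, int howMany) {
        Map<Integer, Double> target = model.get(userId);
        Map<Integer, Double> weighted = new HashMap<>();
        Map<Integer, Double> weights = new HashMap<>();
        for (int neighbor : neighborhood(model, userId, n)) {
            double sim = similarity(target, model.get(neighbor));
            for (Map.Entry<Integer, Double> e : model.get(neighbor).entrySet()) {
                if (target.containsKey(e.getKey())) continue; // the user already scored this item
                weighted.merge(e.getKey(), sim * e.getValue(), Double::sum);
                weights.merge(e.getKey(), sim, Double::sum);
            }
        }
        return weighted.keySet().stream()
            .sorted(Comparator.comparingDouble((Integer item) -> -weighted.get(item) / weights.get(item)))
            .limit(howMany)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Three recommendations for user 1001 from a neighborhood of two users.
        System.out.println(recommend(model(), 1001, 2, 3));
    }
}
```

In this sketch, the neighborhood of user 1001 with a size of two consists of users 1004 and 1005, so the recommended items come from what those two users scored highly and user 1001 has not scored yet. Mahout's own classes may weigh and filter candidates differently.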

