Tim is a software engineer with a background in the development of embedded networked systems and author of the upcoming AI Application Programming (Charles River Media, 2003). He can be contacted at firstname.lastname@example.org.
Recent interest in personalization is primarily due to e-commerce. Commercial web sites, for instance, permit near real-time personalization, making it possible for sites to recommend items likely to appeal to customers.
The classic use of personalization on the Web is Amazon.com, which makes recommendations while you shop. The site suggests items other visitors purchased similar to items you're currently viewing. From the buyer's perspective, this saves time since you're automatically connected to related items. From the seller's view, it's another opportunity to sell something.
Of course, personalization has applications outside of sales, since the content provided to web users can be chosen according to a personalization algorithm. This content might include articles, images, and even advertisements. The method by which ads are extended to web users could also be a function of personalization algorithms. Since ads are an annoyance to many users, fine-tuning how they are presented can increase their effectiveness.
In this article, I examine a clustering algorithm that groups customers by their preferences. From this grouping of data, the algorithm identifies interesting features of the group that can be used as recommendations to other customers within that group.
What Is Personalization?
Personalization involves taking some set of inputs and returning recommendations to users through some type of computation. The method by which the personalization is performed depends on the type of data presented as input, the required representation of the output, and (in some cases) the speed and accuracy by which the output must be computed.
Companies use a variety of methods to achieve personalization. Most algorithms (or methods for tuning existing algorithms) are secret since they represent strategic importance to the owner. Amazon uses a method called "Collaborative Filtering," which provides recommendations based on the similarity of your purchases to other purchases. A similarity metric is used to extract a subgroup of people from the stored population whose preferences are similar. Given this subpopulation, recommendations are made based upon differences identified between the members.
Adaptive Resonance Theory
Adaptive Resonance Theory (ART1) is a clustering algorithm under the neural network genre of unsupervised learning. The algorithm takes a number of objects and organizes them into groups based upon similarities. The ART1 algorithm works with objects called "feature vectors," which are collections of binary values representing some type of information. A typical feature vector might be the purchasing history of a customer (see Figure 1). Each element identifies whether the customer purchased an item (1) or not (0). From Figure 1, you see that the customer represented by this feature vector has purchased a hammer and wrench.
Feature vectors describe customers in terms of their purchase habits by identifying items they've purchased (that you have knowledge of). You collect and apply the ART1 algorithm to the set of feature vectors to separate them into clusters. The idea is that a collection of similar customers (a cluster) yields interesting information about the common attributes of these customers.
The ART1 Algorithm
When examining the ART1 algorithm, you begin with a set of feature vectors (E1..K) and a set of initialized prototype vectors (P1..N). The number of prototype vectors, N, is the maximum number of clusters you wish to support. You initialize a vigilance parameter (r) to a value between 0.0 and 1.0, and a beta parameter to a small positive integer. Table 1 lists the working parameters.
The ART1 algorithm provides for walking through the feature vectors (Ek) and checking them against the set of prototype vectors (Pn). Since no prototype vectors exist in the beginning, I'll set the first prototype vector to the first feature vector; see Example 1(a).
The vigilance parameter determines the quality of recommendations that can be made. If the clusters are too large, recommendations might not be suitable because of the large variance of the feature vectors contained with the cluster. If the clusters are too small, the recommendations again might not be relevant because not enough feature vectors are present to find interesting similarities.
Finally, if you pass the vigilance test, you incorporate the feature vector into the current prototype vector; see Example 1(d). If you don't pass the vigilance test (or the proximity test), test it against the next prototype vector. If you exhaust all of the prototype vectors without clustering the feature vector, create a new prototype vector from the feature vector (a new cluster).
You then walk through all of the feature vectors and compare them to the entire set of prototype vectors. If a feature vector is not found to be in the correct cluster (is closer to another prototype vector than the current), then it's moved. Once you've run through all of the feature vectors without making any changes, you've converged on a clustering solution and exit.
In rcmnd.c (available electronically; see "Resource Center," page 5), note that a maximum number of iterations are also monitored as, in some cases, a feature vector will cycle between two prototype clusters. Simply cap the number of iterations in this case to force convergence.
To make recommendations, analyze the feature vector (representing the customer to whom we're going to recommend) and the sum vector. The sum vector is not part of the original ART1 algorithm, but has been added to represent the sum of the columns of the feature vectors within the cluster (Figure 2).
Say the customer you're going to make a recommendation to is represented by feature vector u, which is a member of cluster A. Look at the sum vector to identify which items are represented by the cluster (those with nonzero values). Then find the item in the sum vector with the largest value that corresponds with an element in the customer's feature vector of value zero. This represents the item that the customer has not purchased but is also popular within the cluster. This intersection is the basis for the recommendation. The assumption (or statistical prediction) is that since 66 percent of the customers within this cluster have purchased this item (given item three within the example), it's likely that the current customer will purchase it as well.
The sample output in Figures 3 and 4 is emitted from the recommender utility (available electronically). This run was compiled with DEBUG disabled, so the inner workings of the ART1 algorithm are not shown.
The first element emitted in the output is the clustering. This shows each of the prototype vectors along with the member feature vectors (Figure 3). From this data set, four prototype vectors were created (though five were available). Numerically, this data looks uninteresting.
Figure 4 is a portion of the output of the recommendation function. This applies textual names to the items both represented by the feature vector as well as the resulting recommendation choice.
In each case, the ART1 algorithm correctly segmented the customers into groups based upon items that they've purchased. Customer 0 was placed into a "candy" cluster, Customer 1 into an "office supplies" cluster, and Customer 2 into a "tools" cluster. Recommendations follow the groupings that should be a good prediction of the customer's buying habits. No recommendation could be made for Customer 3 because its vector was the same as the prototype vector (no dissimilar elements), which meant that it had purchased all items that were grouped by this cluster.
While ART1 worked in this application, other algorithms exist that have their own individual characteristics. The simplest algorithm is k-means clusteringan algorithm I originally tested for this application. The problem with k-means clustering is it is sensitive to the order of the training data. This means that the clusters that result from a data set are dependent upon the order in which the data set is presented to the algorithm. Despite this issue, the k-means algorithm is conceptually simple, fast, and easy to implement.
ART1 was extended by Carpenter and Grossberg to the ARTMAP algorithm (see "A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine," by G. Carpenter and S. Grossberg, Computer Vision, Graphics, and Image Processing 37, 1987). In ARTMAP, the ART1 algorithm is still performed on the input vectors, but the vigilance parameter is temporarily adjusted to keep the clusters pure. ART2 was also created for the clustering of continuous data.
The original intent for this algorithm was for the clustering of library users for book recommendation. Each library user was represented by a feature vector where an element of the vector was defined as a small range of books within the spectrum of the Dewey Decimal Classification System (147.01-149.90). Upon clustering, clusters of feature vectors would be further refined to identify the percentage for which a particular book was checked out within a group. This data could be used to then recommend a book to library users who may not be familiar with it.
This application worked well with test data, but it became clear that getting access to actual library data represented a privacy issue. It turned out that libraries do not keep historical records of which texts users check out. The result was an algorithm without a data set.
There is also concern associated with personalization algorithms. When they work well, they have the ability to predict a user's actions. This can be an uneasy feeling to some as they realize their behavior is being predicted solely based upon external observations.