March, 2004: Bayesian Analysis For RSS Reading

Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@simon-cozens.org.


I am very good at distracting myself from getting things done. One of my best displacement activities used to be flicking through various web news sites: the BBC news, Slashdot, use.perl.org, linuxtoday.com, and that's before we get started on the comics.

Occasionally, I have spurts of enthusiasm and try to work out ways to curb my distractions in order to free up time for more interesting forms of procrastination. Instead of having to cycle through all these pages to see if there was any new news or a new edition of the comic, wouldn't it be a lot more efficient if these news sites told me when there was something new for me to read? Then I could spend the time I saved playing chess or something...

Thankfully, other people have obviously had the same question, as one of the big trends in online publishing over the past two or three years has been the surge in syndication of web news sites via Really Simple Syndication (RSS) and RDF. RSS is an XML format that gives the headlines and some of the content of stories on a site in a machine-readable way. This can be read by an RSS aggregator like NetNewsWire (http://ranchero.com/netnewswire/), which periodically grabs the RSS feeds of all the sites you visit often. By parsing the feeds, it can work out if there are any articles you haven't seen yet and give you a summary of the new news.
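
To give a flavor of the format, here's what a single story looks like in a feed; the element names are the standard RSS ones, but the story itself is made up:

<item>
  <title>Perl 5.8.3 Released</title>
  <link>http://example.org/news/perl-583</link>
  <description>The latest maintenance release of Perl is now available...</description>
</item>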

This was a good start, but it wasn't enough. So, we'll first have a look at how we can manipulate RSS in Perl, and then we can find out how I managed to cut down my procrastination time even more!

XML::RSS

The first handy tool to deal with RSS is the obviously named XML::RSS module. It can be used both to create new RSS files and to parse existing feeds and turn them into Perl data structures. It's this latter use that I'll discuss here.
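
For completeness, the creation side is just a few method calls; here's a minimal sketch, with made-up channel details:

use XML::RSS;

# Build a feed from scratch (hypothetical channel details)
my $rss = XML::RSS->new(version => '1.0');
$rss->channel(
    title       => "My News",
    link        => "http://example.org/",
    description => "A made-up feed",
);
$rss->add_item(
    title => "First story",
    link  => "http://example.org/story1",
);
print $rss->as_string;    # emit the feed as XML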

For instance, to get a list of the current news stories on the Perl news site use.perl.org, we start by downloading its RSS feed:

use LWP::Simple;
my $xml = get("http://use.perl.org/perl-news-short.rdf");

Then we turn this into an XML::RSS object:

use XML::RSS;
my $rss = XML::RSS->new;
$rss->parse($xml);

And now we have a load of information about the feed. For instance, the title and link, which XML::RSS files under the channel key:

print $rss->{channel}{title}, ": ", $rss->{channel}{link}, "\n";

An RSS feed contains a list of items, which we can retrieve; each item is a hash reference:

for my $item (@{$rss->{items}}) {
    print $item->{title}, ": ", $item->{link}, "\n";
}

And sometimes, the content providers put a short description or the first few paragraphs of the news item in the description tag of an item:

for my $item (@{$rss->{items}}) {
    print $item->{title}, ": ", $item->{link}, "\n";
    print $item->{description}, "\n\n";
}

However, in order to create a nice news aggregator, we still have to do a lot of work ourselves: downloading the feeds, parsing them, working out which articles have already been seen, and so on. This isn't the Perl way. Hasn't someone else done this for us?

Timesink

Regular readers will know that I find it hard to get through a column without mentioning at least one of the two most interesting modules in the Perl world for me: Class::DBI and the Template Toolkit. Well, this time, I get to mention both.

Richard Clamp also got the RSS bug around the same time as me, and rather than be dependent on an application running on one particular machine, he created a web-based aggregator that could keep track of what he had and hadn't read. His project, Timesink, uses XML::RSS to download and parse the feeds, Class::DBI to store the stories in a database, and the Template Toolkit to display the rendered stories to the user. It supports multiple users subscribed to a variety of different feeds, and is remarkably small and well-engineered.

On the other hand, Timesink is still very much in development and isn't very well documented yet, but I managed to get it set up and running on one of my computers. It provided me with the platform I needed to get the next stage of my procrastination optimization under way.

You see, the problem is that with the advent of RSS, it's very easy for me to subscribe to a large number of news sources, many of which will only be of marginal interest to me but occasionally throw up something worthwhile. Instead of having to trudge through a small number of web pages in search of new news, I now have to trudge through a large number of news feeds in search of interesting news.

Getting the computer to decide for me what was new was the first step; the obvious way to proceed was to get it to decide what's interesting. My dream aggregator would read my news for me, and only present me with the news I want to read.

Based on Bayes

How can a computer tell me what's interesting? By watching what I tell it is interesting, working out the features of an "interesting" article, and then weighing up whether a given news article would be best categorized as "interesting" or "boring." This turns the question into a machine learning problem, and machine learning has been worked on by far greater minds than mine.

There are a number of text categorization modules on CPAN, many of them better than the one I ended up using. A technique called Bayesian analysis has been in vogue for text categorization in the past year or two, particularly for spam filtering, but to be honest, it's not really that good. There's a module called AI::Categorizer that implements a whole bunch of better approaches.

On the other hand, one of the advantages of Bayesian analysis being in vogue is that it has been implemented a lot, and this means that it's likely that one of its implementations has a very simple interface. Algorithm::NaiveBayes was the module I picked in the end, and it does indeed have an easy-to-use interface.
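
In case you're wondering what the analysis actually involves: a naive Bayes classifier applies Bayes' theorem under the simplifying ("naive") assumption that words occur independently of one another given the category. Roughly:

$$ P(\mathrm{label} \mid \mathrm{words}) \;\propto\; P(\mathrm{label}) \prod_i P(w_i \mid \mathrm{label}) $$

The analyzer estimates each per-word probability from the documents it has been trained on, and the label with the highest resulting probability wins.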

The analyzer in Algorithm::NaiveBayes takes a number of "attributes" and corresponding "weights," and associates them with a set of "labels." Typically, for a text document, the attributes are the words of the document, and the weight is the number of times each word occurs. The labels are the categories under which we want to file the document.

For instance, suppose we have the news article:

The YAPC::Taipei website is now open and calls for registration.
Welcome to Taipei and join the party with us!

We would split that up into words, count the occurrence of the relevant words, and produce a hash like this:

my $news = { 
  now => 1, taipei => 2, join => 1, party => 1, us => 1,
  registration => 1, website => 1, welcome => 1, open => 1,
  calls => 1, yapc => 1 
};

Then, we can add this article to our categorizer's set of known documents:

$categoriser->add_instance( attributes => $news, 
                            label      => [qw( perl conference 
                                               interesting )] );

When we've added a load of documents, the categorizer has a good idea of what kind of words make up Perl stories, what kind of words suggest that the document is about current affairs, what kind of stories I would categorize as "interesting," and so on. We then ask it to do some sums, and can find out what it thinks about a new document:

$categoriser->train;
my $probs = $categoriser->predict( attributes => $new_news );

This returns a hash mapping each category to the probability that the document fits into that category. So, once we have our categorizer trained with a few hand-categorized documents, we can say:

if ( $categoriser->predict( attributes => $new_news )->{interesting} > 0.5 ) {
    # Probably interesting 
}

Hooray!

Of course, you may be asking how we turned those documents into hashes of words anyway, and what I meant when I said we only look at the "relevant" words.

Well, turning a string into a hash isn't too difficult:

my %hash;
$hash{$_}++ for split /\s+/, $string;

This doesn't do too badly, but it has some problems: it copes badly with punctuation, for starters; it passes through things that are not words, like numbers; and it treats differently cased words ("the" versus "The") as distinct. We can try refining it a little:

$hash{$_}++ for map lc, split /\W+/, $string;

Or, we could try a fundamentally better approach; the Lingua::EN::Splitter module was designed for precisely this kind of task and knows to normalize case, be careful of punctuation, and so on. We can ask its words function to return just the words in a string and leave the implementation details to someone else:

$hash{$_}++ for @{words($string)};

That's neater; however, it still has two problems. First, there are a lot of words in English text that are pretty much redundant. They don't carry any semantic content, they're just there to move the story along, as it were—words like "as," "it," and "were." These words, called "stopwords," don't help us know whether or not something is about Perl or about Java, whether a text is interesting or not, so we should get rid of them.

Of course, we don't want to collect a list of such stopwords ourselves. That's a job for the author of the Lingua::EN::StopWords module, which exports the %StopWords hash. This allows us to grep out the stopwords from our list, like so:

$hash{$_}++ for grep { !$StopWords{$_} } @{words($string)};

The second problem is that, for an XML::RSS item, we should give more weight to words in the title of a news item than in the description because the title is supposed to be a concise summary of what the piece is about. Hence, we come up with two functions like this:

sub invert_item {
    my $item = shift;
    my %hash;
    invert_string($item->{title}, 2, \%hash);
    invert_string($item->{description}, 1, \%hash);
    return \%hash;
}

sub invert_string {
    my ($string, $weight, $hash) = @_;
    $hash->{$_} += $weight for
        grep { !$StopWords{$_} }
        @{words($string)};
}

This gives title words twice as much weight as the words in the body and returns a hash we can feed to the categorizer. We're only interested in two categories, "interesting" and "boring," so that makes it easier for us.
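
To make that concrete, here's a quick sketch of the training loop in miniature; $item and $new_item stand for any hashes with title and description keys, such as XML::RSS items:

use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;

# Train on a hand-categorized item
$nb->add_instance( attributes => invert_item($item),
                   label      => "interesting" );
$nb->train;

# Score a fresh item: the probability that it's interesting
my $p = $nb->predict( attributes => invert_item($new_item) )->{interesting};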

Now we have enough parts of the equation to put together our automatic newsreading tool: a way to mark a document as interesting or boring, and a way to predict whether a new document is likely to be interesting.

Automatic Newsreading

How do we fit this into Timesink? The first modification is to provide the user with an interface for categorizing a story manually. User-interface stuff lives in the templates, so we'll begin by editing those.

After the description of the story, we add two links that the user can use to categorize the story, and we add a display showing the percentage of "interest," using a method we'll fill in soon:

     [% IF item.published %]
        <small>Published: [% date.format(item.published) %]</small>
     [% END %]
 +    [% IF user %]
 +    <p align="right">
 +        [ <a href="?feed=[% feed.id %]&item=[% item.id %]&class=interesting">
 +          Interesting? </a> ]
 +        [ <a href="?feed=[% feed.id %]&item=[% item.id %]&class=boring">
 +          Boring? </a> ]
 +
 +        ( [% user.interest(item) * 100 | format("%2.1f") %]%)
 +    </p>
 +    [% END %]

Second, if these links are clicked, we need to do something—hand off the ID of the story and the ID of the user to a routine that adds a new document to an Algorithm::NaiveBayes categorizer. Thankfully, the Template::Plugin::CGI helper makes it really easy for us to get the data back out of the query string:

     IF user;
         feeds = user.feeds;
 +       IF cgi.param('class');
 +          SET dumm = user.mark_item(cgi.param('item'), cgi.param('class'));
 +       END;
     ELSE;

Remember that the class is going to be "interesting" or "boring," depending on which link the user clicked. Finally, from the template point of view, we need to change the order in which the stories are displayed so that the most interesting stories float to the top:

      <h2><a href="./">Timesink:</a> <a href="[% feed.link %]" target="_new">[% feed.name %]</a></h2>
 -[% FOREACH item = feed.items.nsort('seq') %]
 +    [% 
 +    IF user && feed;
 +        SET items = user.sort_interest(feed.items);
 +    ELSE;
 +        SET items = feed.items.nsort('seq');
 +    END;
 +    FOREACH item = items %]

Now we have a few methods that we need to write: the mark_item method, which calls add_instance given an item ID and a classification; interest, which trains the categorizer and computes the probability that an article is interesting; and the sort_interest function, which sorts the articles (unseen articles by interest, then seen articles by interest).

These methods have to be on the user object (Timesink::DBI::Subscriber) because what's considered interesting depends on who's reading the page. However, we immediately have a problem.

Where do we acquire the Algorithm::NaiveBayes categorizer that we're going to use? This somehow has to be persistent because we want the user to be able to go away and come back to the web site tomorrow and still have the system know what makes an interesting article, and it needs to work on a per-user basis.

We also need to save away the categorizer after we add a new document to the corpus, and restore it again when we need to make a decision about a new piece of news. The Algorithm::NaiveBayes object is very simple, and we can use the core Storable module to serialize and reload it:

use Storable qw(store retrieve);
use Algorithm::NaiveBayes;

my %bayes_cache;
sub _bayes_file { $config->bayes_toks."/".shift->id.".nb"; }

sub get_bayes {
    my ($self) = @_;
    my $user = $self->id;
    return $bayes_cache{$user} if $bayes_cache{$user};

    # Thaw the user's saved analyzer if there is one; otherwise start afresh
    my $nb;
    eval { $nb = retrieve($self->_bayes_file) } if -e $self->_bayes_file;
    $nb ||= Algorithm::NaiveBayes->new(purge => 0);
    return $bayes_cache{$user} = $nb;
}

_bayes_file turns the user's ID into a file name and get_bayes uses Storable::retrieve to get the object from that file, unless we already have an up-to-date analyzer hanging around in the %bayes_cache hash.

Similarly, when we're finished with the object, we use Storable::store to put the analyzer back to file:

sub done_with_bayes {
    my $self = shift;
    my $nb = shift;
    store($nb, $self->_bayes_file);
}

Now, filling in the missing methods becomes nice and simple. For instance, marking an item as interesting or uninteresting is:

sub mark_item {
    my ($self, $item_id, $class) = @_;
    return unless my $item = Timesink::DBI::Item->retrieve($item_id);
    my $nb = $self->get_bayes;
    $nb->add_instance( attributes => $self->invert_item($item),
                       label      => $class );
    $self->done_with_bayes($nb);
}

While interest is a thin wrapper around the routine we saw earlier:

sub interest {
    my ($self, $item) = @_;
    eval {
        $self->get_bayes->predict(
            attributes => $self->invert_item($item)
        )->{interesting};
    };
}

Finally, to sort the items efficiently, we use the "Orcish Maneuver" to cache the things we're looking up:

sub sort_interest {
    my ($self, $items) = @_;
    my @items = @$items;
    my $nb = $self->get_bayes;
    $nb->train;
    my (%seen, %interest);
    my @sorted = sort {
        ($seen{$a} ||= $self->is_seen($a)) <=> 
            ($seen{$b} ||= $self->is_seen($b))
                           ||
        ($interest{$b} ||= $self->interest($b)) <=> 
            ($interest{$a} ||= $self->interest($a))
    } @items;
    $self->done_with_bayes($nb);
    return \@sorted;
}

(Note: The "Orcish Maneuver" is a term coined by Joseph Hall and Randal Schwartz in their book Effective Perl Programming [Addison-Wesley, 1997, ISBN 0201419750] and is so named as a pun on the phrase "or cache.")

The complicated sort block puts articles that haven't been seen before ahead of those that have, and then orders each group by interest, from highest to lowest.

As you click to signify that you find a piece interesting, it floats to the top while less interesting articles sink to the bottom. Over time, my automatic newsreader learns what I like to read and gets better at suggesting stories for me, saving me even more time and allowing me to find new and interesting ways to avoid getting on with writing The Perl Journal articles...

[Ed: We'd like to point out that since Simon's procrastination has now become highly efficient, we'll be asking him for his column one week early next month.]

TPJ

