
Naive Bayesian Text Classification


Implementation

The Perl implementation (Listing One) uses the hash (associative array) %words to store, for each category, the count of every word seen in that category.

Listing One


use strict;
use DB_File;

# Hash with flat keys of the form "category-word": $words{"$category-$word"}
# gives the count of 'word' in 'category'.  Tied to a DB_File to keep it
# persistent.

my %words;
tie %words, 'DB_File', 'words.db';

# Read a file and return a hash of the word counts in that file

sub parse_file
{
    my ( $file ) = @_;
    my %word_counts;

    # Grab all the words with between 3 and 44 letters

    open FILE, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <FILE> ) {
        while ( $line =~ s/([[:alpha:]]{3,44})[ \t\n\r]// ) {
            $word_counts{lc($1)}++;
        }
    }
    close FILE;
    return %word_counts;
}

# Add words from a hash to the word counts for a category
sub add_words
{
    my ( $category, %words_in_file ) = @_;

    foreach my $word (keys %words_in_file) {
        $words{"$category-$word"} += $words_in_file{$word};
    }
}

# Get the classification of a file from word counts
sub classify
{
    my ( %words_in_file ) = @_;

    # Calculate the total number of words in each category and
    # the total number of words overall

    my %count;
    my $total = 0;
    foreach my $entry (keys %words) {
        $entry =~ /^(.+)-(.+)$/;
        $count{$1} += $words{$entry};
        $total += $words{$entry};
    }

    # Run through words and calculate the probability for each category

    my %score;
    foreach my $word (keys %words_in_file) {
        foreach my $category (keys %count) {
            if ( defined( $words{"$category-$word"} ) ) {
                $score{$category} += log( $words{"$category-$word"} /
                                          $count{$category} );
            } else {
                $score{$category} += log( 0.01 /
                                          $count{$category} );
            }
        }
    }
    # Add in the probability that the text is of a specific category

    foreach my $category (keys %count) {
        $score{$category} += log( $count{$category} / $total );
    }
    foreach my $category (sort { $score{$b} <=> $score{$a} } keys %count) {
        print "$category $score{$category}\n";
    }
}

# Supported commands are 'add' to add words to a category and
# 'classify' to get the classification of a file

if ( ( $ARGV[0] eq 'add' ) && ( $#ARGV == 2 ) ) {
    add_words( $ARGV[1], parse_file( $ARGV[2] ) );
} elsif ( ( $ARGV[0] eq 'classify' ) && ( $#ARGV == 1 ) ) {
    classify( parse_file( $ARGV[1] ) );
} else {
    print <<EOUSAGE;
Usage: add <category> <file> - Adds words from <file> to category <category>
       classify <file>       - Outputs classification of <file>
EOUSAGE
}

untie %words;

The hash is stored to disk using a Perl construct called a "tie" that, when used with the DB_File module, results in the hash being stored automatically in a file called "words.db" so that its contents persist between invocations.

use DB_File;
my %words;
tie %words, 'DB_File', 'words.db';

The hash keys are strings of the form category-word: For example, if the word "potato" appears in the category "veggies" with a count of three, there will be a hash entry with key "veggies-potato" and value "3." This data structure contains enough information to compute the probability of a document and do a naïve Bayesian classification.
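As a minimal sketch of this flat key scheme (in Python rather than Perl, with illustrative names), the same store can be modeled with an ordinary dictionary:

```python
# Sketch of the flat "category-word" key scheme (names are illustrative,
# not from the article's code).
words = {}

def add_count(category, word, n=1):
    # One flat key per (category, word) pair, mirroring the Perl
    # $words{"$category-$word"} entries.
    key = f"{category}-{word}"
    words[key] = words.get(key, 0) + n

add_count("veggies", "potato", 3)
# The store now holds the key "veggies-potato" with value 3.
```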

The subroutine parse_file reads the document to be classified or trained on and fills in a hash called %words_in_file that maps words to the count of the number of times that word appeared in the document. It uses a simple regular expression to extract every 3- to 44-letter word that is followed by whitespace; in a real classifier, this word splitting could be made more complex by accounting for punctuation, digits, and hyphenated words.

sub parse_file
{
   my ( $file ) = @_;
   my %word_counts;
   open FILE, '<', $file or die "Cannot open $file: $!";
   while ( my $line = <FILE> ) {
      while ( $line =~ s/([[:alpha:]]{3,44})[ \t\n\r]// ) {
        $word_counts{lc($1)}++;
      }
   }
   close FILE;
   return %word_counts;
}
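The same 3-to-44-letter extraction can be sketched in Python (names are illustrative); like the Perl code, it lowercases each match before counting:

```python
import re
from collections import Counter

# Matches a run of 3 to 44 letters followed by one whitespace character,
# as in the Perl pattern ([[:alpha:]]{3,44})[ \t\n\r].
WORD_RE = re.compile(r"([A-Za-z]{3,44})[ \t\n\r]")

def parse_text(text):
    # Return a word -> count map, lowercasing as the Perl code does.
    return Counter(m.group(1).lower() for m in WORD_RE.finditer(text))

counts = parse_text("Potato potato leek soup\n")
# counts: {"potato": 2, "leek": 1, "soup": 1}
```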

The output of parse_file can be used in two ways: It can be used to train the classifier by learning the word counts for a particular category and updating the %words hash, or it can be used to determine the classification of a particular document.

To train the classifier, call the add_words subroutine with the output of parse_file and a category. In the Perl code, a category is any string and the classifier is trained by passing sample documents into parse_file and then into add_words: add_words( <category>, parse_file( <sample document>));

sub add_words
{
   my ( $category, %words_in_file ) = @_;
   foreach my $word (keys %words_in_file) {
      $words{"$category-$word"} +=
         $words_in_file{$word};
   }
}

Once document training has been done, the classify subroutine can be called with the output of parse_file on a document. classify will print out the possible categories for the document in order of most likely to least likely:

classify ( parse_file( <document to classify> ) );

sub classify
{
   my ( %words_in_file ) = @_;
   my %count;
   my $total = 0;
   foreach my $entry (keys %words) {
      $entry =~ /^(.+)-(.+)$/;
      $count{$1} += $words{$entry};
      $total += $words{$entry};
   }
   my %score;
   foreach my $word (keys %words_in_file) {
      foreach my $category (keys %count) {
         if (defined($words{"$category-$word"})) {
            $score{$category} +=
               log( $words{"$category-$word"} /
                  $count{$category} );
         } else {
            $score{$category} +=
               log( 0.01 /
                  $count{$category} );
         }
      }
   }
   foreach my $category (keys %count) {
      $score{$category} +=
         log( $count{$category} / $total );
   }
   foreach my $category (sort { $score{$b} <=> $score{$a} } keys %count) {
      print "$category $score{$category}\n";
   }
}

classify first calculates the total word count ($total) across all categories, which it needs to calculate P(Ci), and the word count for each category (%count, indexed by category name), which it needs to calculate P(Wj|Ci). Then classify calculates the score for each category. The score stands in for P(Ci|D), and it is preferable to call it a score for two reasons: ignoring P(D) means that, strictly speaking, the value is not a true probability; and classify works with logarithms, which avoids underflow on long documents and replaces multiplication with addition for speed. The score is in fact log P(Ci|D) (up to the constant log P(D)), which is:

log P(W0|Ci) + log P(W1|Ci) + ... + log P(Wm-1|Ci) + log P(Ci)

(Recall the identity log(A*B) = log A + log B.) In log form, the scores are still suitable for comparison. After summing the per-word terms, classify adds log P(Ci) for each category and then sorts the scores in descending order to output the classifier's ranking of the document. For a word that doesn't appear at all in a particular category, classify estimates a very small, nonzero probability for that word based on the word count for the category:

$score{$category} += log( 0.01 / $count{$category} );
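The whole scoring step can be sketched in Python (names are illustrative; the small smoothing constant follows the 0.01 used in Listing One). Like the Perl code, it scores each distinct word in the document once:

```python
from math import log

# words maps "category-word" keys to counts, as in the Perl %words hash.
words = {"fruits-apple": 4, "fruits-pear": 2,
         "veggies-apple": 1, "veggies-kale": 5}

def classify(words_in_file):
    # Per-category totals and the overall total, for P(Ci) and P(Wj|Ci).
    count, total = {}, 0
    for key, n in words.items():
        # Split at the last hyphen, as the greedy Perl regex (.+)-(.+) does.
        category = key.rsplit("-", 1)[0]
        count[category] = count.get(category, 0) + n
        total += n
    score = {}
    for word in words_in_file:
        for category in count:
            n = words.get(f"{category}-{word}")
            # Unseen words get a small nonzero probability: 0.01 / count.
            p = (n if n is not None else 0.01) / count[category]
            score[category] = score.get(category, 0.0) + log(p)
    # Add log P(Ci) for each category.
    for category in count:
        score[category] += log(count[category] / total)
    # Categories ranked from most to least likely.
    return sorted(score, key=score.get, reverse=True)

ranking = classify({"apple": 1})
# "apple" is four times as frequent in fruits, so fruits ranks first.
```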

A small amount of Perl code wraps these three subroutines into a usable classifier that accepts commands to add a document to the word list for a category (and hence, train the classifier), and to classify a document.
if ( ( $ARGV[0] eq 'add' ) && ( $#ARGV == 2 ) ) {
    add_words( $ARGV[1], parse_file( $ARGV[2] ) );
} elsif ( ( $ARGV[0] eq 'classify' ) && ( $#ARGV == 1 ) ) {
    classify( parse_file( $ARGV[1] ) );
} else {
    print <<EOUSAGE;
Usage: add <category> <file> - Adds words from <file> to category <category>
       classify <file>       - Outputs classification of <file>
EOUSAGE
}
untie %words;

If the Perl code is stored in file bayes.pl, then the classifier is trained like this:

perl bayes.pl add veggies article-about-vegetables
perl bayes.pl add fruits article-about-fruits
perl bayes.pl add nuts article-about-nuts

to create three categories (veggies, fruits, and nuts). Asking bayes.pl to classify a document will output the likelihood that the document is about vegetables, fruits, or nuts:
% perl bayes.pl classify article-I-just-wrote
fruits -4.11700258611469
nuts -6.60190923590268
veggies -11.9002266024507

Here, bayes.pl shows that the new article is most likely about fruits.

E-Mail Classification

If you are interested in classifying e-mail, there are a couple of tweaks that improve accuracy in practice: Don't fold case on values from headers and count words differently if they appear in the subject or body.

In the aforementioned Perl implementation, there is no difference between the words From, FROM, and fRoM: They are all considered to be instances of from. The parse_file subroutine lowercases the word before counting it. In practical e-mail classifiers, the names of e-mail headers turn out to be a better indicator of the type of an e-mail if case is preserved. For example, the header MIME-Version was written MiME-Version by one piece of common spamming software.

Distinguishing words found in the subject versus the body also increases the accuracy of a naïve Bayesian text classifier on e-mail. The simplest way to do this is to store a word like forward as subject:forward when it comes from the subject line, and simply forward when it is seen in the body.
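That subject-prefix idea can be sketched in Python (the function and its names are illustrative, not from the article):

```python
# Prefix tokens from the Subject: line so "forward" in a subject is
# counted separately from "forward" in the body (illustrative sketch).
def tokenize_email(subject_words, body_words):
    tokens = {}
    for w in subject_words:
        key = "subject:" + w
        tokens[key] = tokens.get(key, 0) + 1
    for w in body_words:
        tokens[w] = tokens.get(w, 0) + 1
    return tokens

t = tokenize_email(["forward"], ["forward", "meeting"])
# t holds separate entries for "subject:forward" and "forward".
```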

Performance

The Perl code presented here isn't optimized at all: each time classify is called, it recalculates the total word count for each category, and the log values could easily be cached between invocations. The single flat Perl hash also won't scale well in terms of memory usage.
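One easy optimization mentioned above is to maintain the per-category totals incrementally at training time instead of rescanning every key on each call to classify. A Python sketch (hypothetical names, not the article's code):

```python
# Keep per-category totals and the grand total up to date as words are
# added, so classification never has to walk every "category-word" key.
words, count, total = {}, {}, 0

def add_count(category, word, n=1):
    global total
    key = f"{category}-{word}"
    words[key] = words.get(key, 0) + n
    count[category] = count.get(category, 0) + n
    total += n

add_count("fruits", "apple", 4)
add_count("veggies", "kale", 2)
# count: {"fruits": 4, "veggies": 2}; total: 6
```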

However, the algorithm is simple and can be implemented in any language. A highly optimized version of this code is used in the POPFile e-mail classifier to do automatic classification. It uses a combination of Perl and SQL queries. The Bow toolkit from CMU has a fast C implementation of naïve Bayesian classification.

Uses of Text Classification

Although spam filtering is the best-known use of naïve Bayesian text classification, there are a number of other interesting uses on the horizon. IBM researcher Martin Overton has published a paper concerning the use of naïve Bayesian e-mail classification to detect e-mail-borne malware (http://arachnid.homeip.net/papers/VB2004-Canning-more-than-SPAM-1.02.pdf). In Overton's paper, presented at the Virus Bulletin 2004 conference, he demonstrated that a text classifier could accurately identify worms and viruses, such as W32.Bagle, and that it was able to spot even mutated versions of the worms. All this was done without giving the classifier any special knowledge of viruses.

The POPFile Project is a general e-mail classifier that can classify incoming e-mail into any number of categories. Users of POPFile have reported using its naïve Bayesian engine to classify mail into up to 50 different categories with good accuracy, and one journalist uses it to sort "interesting" from "uninteresting" press releases.

At LISA 2004, four Norwegian researchers presented a paper on a system called DIGIMIMIR, which was capable of automatically classifying requests arriving at a typical IT help desk and, in some cases, responding without human intervention (http://www.digimimir.org/). They use a document clustering approach that, while not naïve Bayesian, is similar in implementation complexity and allowed "similar" e-mails to be clustered together without knowing the initial set of possible topics.


John is chief scientist at Electric Cloud, which focuses on reducing software build times. He is also the creator of POPFile. John can be contacted at [email protected].

