Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼


Naive Bayesian Text Classification

Source Code Accompanies This Article. Download It Now.

Paul Graham popularized the term "Bayesian Classification" (or more accurately "Naïve Bayesian Classification") after his "A Plan for Spam" article was published (http://www.paulgraham.com/spam.html). In fact, text classifiers based on naïve Bayesian and other techniques have been around for many years. Companies such as Autonomy and Interwoven incorporate machine-learning techniques to automatically classify documents of all kinds; one such machine-learning technique is naïve Bayesian text classification.

Naïve Bayesian text classifiers are fast, accurate, simple, and easy to implement. In this article, I present a complete naïve Bayesian text classifier written in 100 lines of commented, nonobfuscated Perl.

A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests.

The classifier I present here determines which of a set of possible categories a document is most likely to fall into and can be used in any of the ways mentioned with appropriate training. Feed it samples of spam and nonspam e-mail and it learns the difference; feed it documents on various medical fields and it distinguishes an article on, say, "heart disease" from one on "influenza." Show it samples of different types of help desk requests and it should be able to sort them so that when 50 e-mails come in informing you that the laser printer is down, you'll quickly know that they are all the same.

The Math

You don't need to know any of the underlying mathematics to use the sample classifier presented here, but it helps.

The underlying theorem for naïve Bayesian text classification is the Bayes Rule:

P(A|B) = ( P(B|A) * P(A) ) / P(B)

The probability of A happening given B is determined from the probability of B given A, the probability of A occurring and the probability of B. The Bayes Rule enables the calculation of the likelihood of event A given that B has happened. This is used in text classification to determine the probability that a document B is of type A just by looking at the frequencies of words in the document. You can think of the Bayes Rule as showing how to update the probability of event A happening given that you've observed B.

A far more extensive discussion of the Bayes Rule and its general implications can be found in the Wikipedia (http://en.wikipedia.org/wiki/Bayes%27_Theorem). For the purposes of text classification, the Bayes Rule is used to determine the category a document falls into by determining the most probable category. That is, given this document with these words in it, which category does it fall into?

A category is represented by a collection of words and their frequencies; the frequency is the number of times that each word has been seen in the documents used to train the classifier.

Suppose there are n categories C0 to Cn-1. Determining which category a document D is most associated with means calculating the probability that document D is in category Ci, written P(Ci|D), for each category Ci.

Using the Bayes Rule, you can calculate P(Ci|D) by computing:

P(Ci|D) = ( P(D|Ci ) * P(Ci) ) / P(D)

P(Ci|D) is the probability that document D is in category Ci; that is, the probability that given the set of words in D, they appear in category Ci. P(D|Ci) is the probability that for a given category Ci, the words in D appear in that category.

P(Ci) is the probability of a given category; that is, the probability of a document being in category Ci without considering its contents. P(D) is the probability of that specific document occurring.

To calculate which category D should go in, you need to calculate P(Ci|D) for each of the categories and find the largest probability. Because each of those calculations involves the unknown but fixed value P(D), you just ignore it and calculate:

P(Ci |D) = P(D|Ci ) * P(Ci)

P(D) can also be safely ignored because you are interested in the relative—not absolute—values of P(Ci|D), and P(D) simply acts as a scaling factor on P(Ci|D).

D is split into the set of words in the document, called W0 through Wm-1. To calculate P(D|Ci), calculate the product of the probabilities for each word; that is, the likelihood that each word appears in Ci. Here's the "naïve" step: Assume that words appear independently from other words (which is clearly not true for most languages) and P(D|Ci) is the simple product of the probabilities for each word:

P(D|Ci) = P(W0|Ci) * P(W1|Ci) * ... * P(Wm-1|Ci)

For any category, P(Wj|Ci) is calculated as the number of times Wj appears in Ci divided by the total number of words in Ci. P(Ci) is calculated as the total number of words in Ci divided by the total number of words in all the categories put together. Hence, P(Ci|D) is:

P(W0|Ci) * P(W1|Ci) * ... * P(W m-1|Ci) * P(Ci)

for each category, and picking the largest determines the category for document D.

A common criticism of naïve Bayesian text classifiers is that they make the naïve assumption that words are independent of each other and are, therefore, less accurate than a more complex model. There are many more complex text classification techniques, such as Support Vector Machines, k-nearest neighbor, and so on. In practice, naïve Bayesian classifiers often perform well, and the current state of spam filtering indicates that they work very well for e-mail classification.

A useful toolkit that implements different algorithms is the freely available Bow toolkit from CMU (http://www-2.cs.cmu.edu/~mccallum/bow/). It makes a useful testbed for comparing the accuracy of different techniques. A good starting point for reading more about naïve Bayesian text classification is the Wikipedia article on the subject (http://en.wikipedia.org/wiki/Naïve_Bayesian_classification).

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.