*Lynn is a software architect with Novell. He can be reached at [email protected]*

E-mail programs that offer automatic mail sorting almost always fall prey to the same problem -- it's easier to manually sort your mail than figure out how to program the mail client to handle it for you. Similar problems afflict many user-interface designs. But what if the mail program simply watched you manually sort your mail, and learned by example how to do it for you? The algorithm Lynn Monson describes this month may be one way to accomplish exactly that.

-- Tim Kientzle

With the rise of the Internet, the ability to effectively search for information is becoming increasingly important. Web users, for example, routinely use search engines, catalogs, and Internet directories to find data of interest. However, the information on the Internet is too large, too diverse, and changes too rapidly for these methods to be very effective. One technology that may help involves software agents that search for data on a user's behalf. These agents can search through a set of data and determine which particular items are likely to be interesting. The agent uses criteria it has learned to make that judgment.

In this article, I'll describe a variation of the ID3 and C4.5 algorithms that can be used to classify textual data. With this algorithm as a basis, you'll be able to understand the related literature and begin building your own information-gathering agents.

### ID3

ID3 is a supervised learning algorithm, examined by Andrew Colin in "Building Decision Tress with the ID3 Algorithm," (*DDJ*, June 1996). It is explicitly taught from a series of training examples from several classes. The algorithm builds a theory that allows it to predict the class of an item.

ID3 attempts to identify properties (or features) that differentiate one class of examples from others. ID3 requires that all features be known in advance and that each feature is "well behaved" (that is, all possible values are known in advance). This means a given property must be either a continuous number or drawn from a set of options. Age, height, temperature, and country of citizenship are all well-behaved features.

ID3 uses entropy to determine which features of the training cases are important. Entropy is used to construct a decision tree, which is then used for testing future cases. In addition, the decision tree is usually optimized using one of several techniques. The techniques discussed in this article are taken from the C4.5 algorithm.

The information, or entropy, of a set of examples is defined as:

From a set of examples, construct a probability distribution P = {p1,p2,...pn} using a classification scheme. Given that distribution, the information required to identify a given case in the distribution is shown in Figure 1(a).

This metric is the number of bits required to identify a given class from the probability distribution. For example, if I have 12 marbles, 3 of which are blue, 7 are red, and 2 are green; the distribution is {3/12,7/12,2/12} and requires 1.38 bits to represent.

### Conditional Entropy

If you partition data in some meaningful way, the total entropy of the parts will be lower than the entropy of the set you started with. ID3 uses a calculation called "conditional entropy" to determine which of several different partitions is most effective.

Start with a distribution of examples -- P, as before -- but divide the examples into groups named Xi. The conditional entropy is then defined in Figure 1(b).

To determine which features are most important, you use that feature to partition your data and compute the conditional entropy. The feature that gives the lowest conditional entropy is the most important.

In Table 1, you have three features -- lot size, income level, and age -- you can use to predict the type of lawn mower used by a homeowner. Since there are three riding lawn mowers and four push mowers, the probability distribution for the examples is P={3/7,4/7}. This gives a total entropy value of-(3/7)*log_{2}(3/7)-(4/7)*log_{2}(4/7)=0.98522.

Now suppose you divide up the examples based on lot size. This gives three new probability distributions, one for each possible lot size. The distributions are P1={0/3,3/3} for Small lots, P2={1/2,1/2} for Medium lots, and P3={2/2,0/2} for Large lots. The conditional entropy value is E(P*|Lot Size*)*=*(3/7)*E(P1)*+*(2/7)*E(P2) *+*(2/7)*E(P3)*=*0.286.

The conditional entropy for age would require separate tests -- one for each range of values in the training data: E(P*|Age<=*27)*,*E(P*|Age<=*30)*,* and so on.

ID3 tests every possible feature using conditional entropy. The feature with the lowest entropy value is taken to be the best test. ID3 then builds a node in a decision tree that identifies the given feature. The branches of the tree, coming from the node, are associated with possible outcomes of the test. If the node is labeled "Lot Size," the branches from the node would be "Small," "Medium," and "Large."

Having identified a test on a feature, ID3 then invokes itself recursively -- the list of examples is partitioned based on the identified test, the feature already tested is removed from consideration, and the algorithm is invoked for each branch of the tree. This process continues until either the remaining examples are all of one class or there are too few remaining examples. At that point, a leaf is added to the tree identifying the class of examples.

### C4.5

One limitation of ID3 is that it is overly sensitive to features with large numbers of values. This must be overcome if you are going to use ID3 as an Internet search agent. I address this difficulty by borrowing from the C4.5 algorithm, an ID3 extension.

ID3's sensitivity to features with large numbers of values is illustrated by Social Security numbers. Since Social Security numbers are unique for every individual, testing on its value will always yield low conditional entropy values. However, this is not a useful test. (Social Security numbers do not help you predict whether a future medical patient needs surgical intervention.)

To overcome this problem, C4.5 uses a metric called "information gain," which is defined by subtracting conditional entropy from the base entropy; that is, *Gain(*P|X*)=*E*(*P*)*-E*(*P|X*)*. This computation does not, in itself, produce anything new. However, it allows you to measure a gain ratio. Gain ratio, defined as *GainRatio(*P*|*X*)=Gain(*P*|*X*)/*E*(*X*), *where E*(*X*)* is the entropy of the examples relative only to the attribute X, measures the information gain of feature X relative to the "raw" information of the X distribution.

By using the gain ratio instead of the plain conditional entropy, C4.5 reduces problems from artificially low entropy values such as Social Security numbers.

### Extending ID3 and C4.5

While simple and expressive, ID3 and C4.5 aren't enough for searching the Web. ID3 assumes all properties of a test case are known in advance and that each property has a known range of values. That definition doesn't fit text. Text is open ended and can contain an infinite number of values.

Fortunately, developments in the relevant literature illustrate how ID3 (and hence C4.5) can be extended for searching text. The ideas are loosely based on concepts from information retrieval.

In many information-retrieval algorithms, a text document is compressed into a form known as a "bag of words" -- a bag contains every word in the document, but information about word ordering and sentence structure is not preserved. Sometimes each word also has a count. The assumption is that the relative frequency of words in two given word bags can be compared to determine if the documents are similar.

To extend ID3 to support text comparison, I adjust the test criteria. First, the test cases presented to ID3 are allowed to have features that are textual. To test such a feature, it is interpreted as a "document" and put into the form of a bag of words. You can then use ID3 to classify documents based on true/false tests of the form "Does bag X contain word Y?"

This test -- whether a word is present in the feature -- fits into the notion of conditional entropy tests. The ID3 algorithm can then proceed in the usual manner.

It sometimes makes sense to reverse the state of the test when testing for an outcome. For example, if all examples are drawn from two classes, you may want to construct the decision tree such that all left branches correspond to one class, while right branches correspond to the other.

In the case of searching web data, ID3 can be simplified to learning examples that have a single feature. Since that feature can now be text, you can encode many attributes as words within the text. For example, you can distinguish the word X in the title of a document by including the word "TITLE_X" within the text of the example. All that is required is some preprocessing of the examples to compress features into text. This compression simplifies the ID3 algorithm considerably.

### Implementation

Listing One illustrates the base algorithm and distinguishes a set of examples between two classes. (The complete source code and related files is available electronically; see "Availability," page 3.) The source code produces a decision tree, performs optimizations on the tree, and prints the resulting, optimized tree. The tree is readable and can be used for further processing. In this implementation, I have assumed a binary decision; the code can only distinguish between two possible predictions. This is not a serious restriction, since the algorithm can be repetitively applied -- you first separate the least common outcome from all other outcomes (a binary decision), then separate the next common outcome, and so on.

The algorithm is driven from a Java class named *ExampleSet,* which is constructed. Then, its *init()* method is invoked. The sole argument to *init()* is a *File* object from which the examples will be read. *ExampleSet.init()* assumes a particular formatting within the file; see Example 1.

A *Java.IO.StreamTokenizer* is used to parse the file. As each example is parsed, a *BagOWords* object is created. This compressed form of each example is preserved while the ID3 algorithm is run. To improve performance, the algorithm makes use of another class named *TotalWordMap*, which enumerates the word list used across all examples and can indicate how many examples used a given word. The base algorithm is driven primarily from a *TotalWordMap*. The *BagOWords* objects are used for partitioning the examples after a test, which requires that a given *TotalWordMap* object also be split.

Having parsed the input file, the *ExampleSet.Init()* method then invokes the *ExampleSet.ID3()* method, which iterates over all possible tests, finds the test with best gain ratio, and builds a node in the tree. *ExampleSet.ID3* is recursively invoked for the left and right branches of the tree. This continues until all of the training cases are covered.

### Pruning

The decision tree produced by ID3 is a good description of the training data. However, the tree is almost always too specific for predicting future cases. Therefore, it is often useful to optimize the tree by pruning back leaves or entire branches. There are numerous strategies for pruning. Generally, pruning involves either additional testing data or the use of statistical techniques for generalizing the tests.

The pruning algorithm used in the example code is a version of that used in C4.5. The idea is to use a heuristic test for "predicting" the future error rates of a given branch in the tree. If a given branch has a higher error rate than a simple leaf would, the branch is replaced with a leaf. By applying this heuristic from the bottom to the top of the tree, you prune back the tree for better future prediction.

The error prediction is based on a binomial distribution. You pretend that the number of examples covered at each node in the tree is a statistical sample, and that the errors at the node are observed events. Normally, a statistician would use the probability of observing such an individual event to calculate the probability that this particular sample was observed. In this case, however, I am going to turn that equation around and assume some probability for the sample and back calculate the probability of a single event. That calculated probability is presumed to be the error rate of the node.

Calculating the binomial probability is easy when the number of errors is zero. The error rate turns out to be a simple exponential calculation. However, when the number of errors is not zero, computing the error rate becomes much more difficult. Instead of trying to solve the error rate based on the binomial formula, I use a normal distribution for approximating the binomial error rate. This normal approximation turns out to be calculable, although for very small probabilities or very low number of cases it is not a good representation of the binomial equation. As a heuristic, though, empirical tests indicate that it is an adequate approximation.

The prediction code in the sample can be invoked via the *TreeNode.getPredictedErrs()* method, which will use either a binomial or normal approximation for the error rate. This method returns a predicted number of errors, not a probability.

The sample code prunes the decision tree by calling *ExampleSet.PruneTree,* which recurses to the bottom of the decision tree and begins the pruning process. At each nonleaf node in the tree, consideration is given to replacing the node with a single leaf. To make the determination, the error rate of the node is compared against the error rate of each possible leaf -- whichever error rate is lower determines the fate of the node. The tree is then pruned from the bottom up, and the resulting tree is returned.

To run the sample code, invoke the *rules.class* object. This will raise a window containing a simple menu and a text box. From the file menu, simply select Exit or Open. Open will let you choose a file of test cases, run the algorithm against those cases, and output both an unpruned and a pruned decision tree. All output is to the text box in the primary window.

### Example Usage

Included with the sample is a file named "Winsock.txt" (available electronically), which contains a selection of messages from the Winsock-2 mailing list. Some of the messages in this file are identified as being "service provider" related. Running the sample code against this data will result in the decision tree in Example 2.

All in all, Example 2 is an interesting decision tree. It misclassified only four of 92 cases on the training data, and has managed to capture perfectly my intention that all SSL-related messages be classified as service-provider related. Other empirical tests on this implementation show that it performs quite well. While this example does not use the full C4.5 algorithm, it does demonstrate the efficacy of the base ID3 algorithm. I have successfully used this example to classify e-mail messages and documents. My future plans are to extend this algorithm with additional optimizations and heuristics for wide-area searching of the web. Additionally, I am investigating the encapsulation of this algorithm in applet form.

### References

Cohen, William W. "Learning Trees and Rules with Set-valued Features," AT&T Laboratories. Paper from AAAI '96. Available at http://portal.research.bell-labs.com/orgs/ssr/ people/wcohen/postscript/aaai-96.ps.

Quinlan, J. Ross. *C4.5 Programs for Machine Learning*, Morgan Kaufmann, 1993.

#### Listing One

/* Calculate maximum gain test (C4.5) from list of examples, * TotalWordMap, and two class names. */ static TreeNode MaxGainTest( Vector vExamples, TotalWordMap twm, String sClassA, String sClassB ) { int iTotalMain = 0; int iTotalNotMain = 0; boolean bPosTest = true; </p> Enumeration e = vExamples.elements(); while ( e.hasMoreElements() ) { BagOWords bow = (BagOWords)e.nextElement(); if ( bow.sClass.equalsIgnoreCase( sClassA ) ) iTotalMain = iTotalMain +1; else iTotalNotMain = iTotalNotMain +1; } if ( iTotalMain<iSmallestNodeSize | iTotalNotMain<iSmallestNodeSize | (iTotalMain + iTotalNotMain < iMinExamples ) ) { int iTotal = iTotalMain + iTotalNotMain; if ( iTotalMain < iTotalNotMain ) return new TreeNode( sClassB, iTotal, iTotalMain ); else return new TreeNode( sClassA, iTotal, iTotalNotMain ); } double dbBestGain = -1.0; double dbBaseEntropy = Entropy((double)iTotalMain/(iTotalMain+iTotalNotMain), (double)iTotalNotMain/(iTotalMain+iTotalNotMain) ); e = twm.enumWords(); String sBestWord = ""; int pBest=0, nBest=0; while ( e.hasMoreElements() ) { String sWord = (String)e.nextElement(); int p=0, n=0; // Number of true positives p = twm.getClassCnt( sWord, sClassA ); // Number of false positives n = twm.getNotClassCnt( sWord, sClassA ); double dbEntropy = CondEntropy( p, n, iTotalMain-p, iTotalNotMain-n ); double dbGain = dbBaseEntropy - dbEntropy; if ( dbGain > dbBestGain ) { dbBestGain = dbGain; sBestWord = sWord; pBest = p; nBest = n; bPosTest = true; if ( p<n ) { pBest = iTotalMain-p; nBest = iTotalNotMain-n; bPosTest = false; } } } int iTotal = iTotalMain + iTotalNotMain; boolean bIsLeaf = false; if ( dbBestGain <= 0 ) bIsLeaf = true; if (pBest==0 && nBest==iTotal ) bIsLeaf = true; if ( nBest==0 && pBest==iTotal ) bIsLeaf = true; if ( bIsLeaf ) { if ( iTotalMain < iTotalNotMain ) return new TreeNode( sClassB, iTotal, iTotalMain ); else return new TreeNode( sClassA, iTotal, iTotalNotMain ); } return new TreeNode( sClassA, sBestWord, bPosTest ); } // ID3/(subset C4.5) algorithm TreeNode ID3( Vector vExamples, TotalWordMap twm, String sMainClass, String sNotMainClass ) { Vector vLeft = new Vector(); TotalWordMap twmLeft = new TotalWordMap(); String sRuleLeft, sRuleRight; TreeNode tn = MaxGainTest( vExamples, twm, sMainClass, sNotMainClass ); </p> if ( tn.bIsLeaf ) return tn; // Build the left branch data structures Enumeration e = vExamples.elements(); while ( e.hasMoreElements() ) { BagOWords b = (BagOWords)e.nextElement(); if ( b.hasWord( tn.sTest ) == tn.bPosTest ) { twmLeft.addBOW( b ); vLeft.addElement( b ); } } // Delete the word from all examples e = vExamples.elements(); while( e.hasMoreElements() ) { BagOWords b = (BagOWords)e.nextElement(); if ( b.hasWord( tn.sTest ) ) b.removeWord( tn.sTest ); } twm.deleteWord( tn.sTest ); twmLeft.deleteWord( tn.sTest ); // Remove examples from the right branch data e = vLeft.elements(); while( e.hasMoreElements() ) { BagOWords b = (BagOWords)e.nextElement(); vExamples.removeElement( b ); twm.fixupRemove( b ); } // Recursively build the left and right branches. tn.Left = ID3( vLeft, twmLeft, sMainClass, sNotMainClass ); tn.Right = ID3( vExamples, twm, sMainClass, sNotMainClass ); return tn; } // Prune a decision tree TreeNode PruneTree( TreeNode tn, String sClassA, String sClassB ) { if ( tn.bIsLeaf ) return tn; // Prune lower levels of tree first if ( !tn.Left.bIsLeaf ) tn.Left = PruneTree( tn.Left, sClassA, sClassB ); if ( !tn.Right.bIsLeaf ) tn.Right = PruneTree( tn.Right, sClassA, sClassB ); int iActualA = tn.getActualCnt( sClassA ); int iActualB = tn.getActualCnt( sClassB ); int iTotalOverall = iActualA + iActualB; double dbNumErrLeafA = tn.calcPredictedErrs( iTotalOverall, iActualB ); double dbNumErrLeafB = tn.calcPredictedErrs( iTotalOverall, iActualA ); double dbTreeRate = tn.getPredictedErrs(); // Replace with a leaf of type "A"? if ( dbNumErrLeafA <= dbNumErrLeafB && dbNumErrLeafA <= dbTreeRate ) return new TreeNode( sClassA, iTotalOverall, iActualB ); // Replace with a leaf of type "B"? if ( dbNumErrLeafB <= dbTreeRate ) return new TreeNode( sClassB, iTotalOverall, iActualA ); return tn; }

**DDJ**

*Copyright © 1997, Dr. Dobb's Journal*