# Data Aggregation and Bayes Classifiers

### Naive Bayes Classifiers

Bayes classifiers have recently been used for filtering and classifying data and have been leveraged by a variety of software [7,8,9,10]. In particular, naive Bayes classifiers have demonstrated surprising efficacy on real-world problems [12]. A naive Bayes classifier assumes that the elements of an observation are statistically independent of one another given the class [11]. In effect, each element of a collection contributes its evidence independently of the others. With this assumption it can be shown that:

P[hk|G] = (1/Z) · P[hk] · P[g1|hk] · P[g2|hk] · ⋯ · P[gn|hk]   (Equation 3)

where Z is a scaling factor and G = {g1, …, gn} is the collection of elements. This provides an explicit formula for the probability that a collection of elements corresponds to a security level hk for k = 1, …, m. To classify, we consider the ratio of posterior probabilities r = P[hj|G]/P[hk|G]. In our scenario, hj may correspond to "Unclassified" data while hk corresponds to "Classified" data. This leads to the following decision rule:

• If r < 1, the collection G is in security level hk.
• If r = 1, no conclusion can be drawn.
• If r > 1, the collection is in security level hj.
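This decision rule can be sketched in a few lines of Python. The metadata elements and their per-level counts below are hypothetical, chosen only to illustrate Equation 3; the computation works in log space so that long products of small probabilities do not underflow, which turns the comparison of r against 1 into a comparison of log r against 0.

```python
import math

# Hypothetical training counts: how often each metadata element appeared
# in sample aggregates of each security level (illustrative data only).
counts = {
    "unclassified": {"location": 40, "timestamp": 35, "callsign": 2},
    "classified":   {"location": 10, "timestamp": 12, "callsign": 30},
}
totals = {"unclassified": 50, "classified": 50}  # sample aggregates per level
prior = {"unclassified": 0.5, "classified": 0.5}

def log_posterior(level, elements):
    """log P[h|G] up to the shared scaling factor Z (Equation 3)."""
    lp = math.log(prior[level])
    for g in elements:
        # Add-one (Laplace) smoothing so an unseen element cannot
        # zero out the whole product.
        p = (counts[level].get(g, 0) + 1) / (totals[level] + 2)
        lp += math.log(p)
    return lp

def classify(elements):
    """Apply the ratio rule: log r = log P[h_uncl|G] - log P[h_cl|G]."""
    log_r = (log_posterior("unclassified", elements)
             - log_posterior("classified", elements))
    if log_r > 0:
        return "unclassified"
    if log_r < 0:
        return "classified"
    return "no conclusion"
```

With these counts, a collection containing a callsign element tips toward "classified" because callsigns are far more frequent in the classified samples, while location plus timestamp alone stays "unclassified".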

The classifier can be trained in a number of ways. One typical approach is Maximum Likelihood Estimation (MLE), sometimes referred to as Fisher's method. In this approach the free parameters of the model are estimated from the data, for example using the sample mean and sample variance [7].
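As a concrete illustration of MLE training, the sketch below fits a Gaussian to a single numeric metadata feature using the sample mean and sample variance; the feature values are hypothetical, and a full classifier would fit one such pair of parameters per feature per security level.

```python
import math

def fit_gaussian(samples):
    """MLE for a Gaussian feature: sample mean and (biased) sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def gaussian_pdf(x, mean, var):
    """Likelihood P[g|h] of a feature value under the fitted Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical feature: number of sensitive fields per record in an aggregate.
mean, var = fit_gaussian([1.0, 2.0, 3.0])
```

The fitted `gaussian_pdf` values then play the role of the P[gi|hk] terms in Equation 3 for continuous features.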

Coded examples of naive Bayes classifiers can be found in texts and on the Internet [7,8,9,10]. In discussions with our CEP vendor, we learned that a Bayesian classifier is under development. We will therefore leverage the CEP vendor's implementation of Bayes classifiers or else create a plug-in [6,7,13].

### Classification Process

Probabilistic techniques are imperfect and cannot replace a human, especially in a venue such as security. However, the intent here is not to replace human evaluation of potential security threats but rather to reduce the problem to one where a human is involved in evaluating borderline scenarios. To achieve this, a combination of deterministic and probabilistic approaches is used.

Following the initial training of the classifier on sample data, the process is as follows:

1. As data aggregates are required (e.g., for reports), the CEP software checks whether any rules exist for the aggregate.

   1. If the rules indicate that the user cannot view the aggregate, the request is rejected and the reviewer is informed. The reviewer may allow or disallow on a case-by-case basis.
   2. Otherwise, the aggregate is generated and the report is created.

2. If there are no existing rejection rules, the classifier is run against the metadata to see whether the report corresponds to the expected security level.

   1. If the report is not in the expected security level, the requester is denied the report and it is sent to a reviewer to address on a case-by-case basis.
   2. Otherwise, the report is created and delivered to its requester.
   3. In either case, the classifier adds the data aggregate to a lookup of known aggregates.

3. Allow/deny decisions from the reviewer are used to update the classifier.
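The steps above can be sketched as follows. The rule store, classifier callback, and reviewer queue are hypothetical stand-ins for illustration, not any particular CEP vendor's API.

```python
# Minimal sketch of the review workflow (hypothetical data structures).
known_aggregates = {}     # lookup of aggregates the classifier has seen
rejection_rules = set()   # (user, aggregate) pairs a rule forbids
review_queue = []         # borderline requests awaiting a human reviewer

def request_aggregate(user, aggregate, expected_level, classify):
    """Return the report, or None if the request needs human review."""
    # Step 1: deterministic rules are checked first.
    if (user, aggregate) in rejection_rules:
        review_queue.append((user, aggregate))  # reviewer decides case by case
        return None
    # Step 2: no rejection rule, so run the classifier on the metadata.
    level = classify(aggregate)
    known_aggregates[aggregate] = level         # step 2.3: record the aggregate
    if level != expected_level:
        review_queue.append((user, aggregate))  # step 2.1: deny, send to reviewer
        return None
    return f"report({aggregate})"               # step 2.2: deliver the report

def reviewer_decision(user, aggregate, allow, update):
    """Step 3: feed the reviewer's allow/deny decision back into the classifier."""
    update(aggregate, allow)
    if not allow:
        rejection_rules.add((user, aggregate))
```

The key design point is that the classifier never grants access on its own: it either lets a deterministic decision stand or escalates to the reviewer, whose decisions in turn retrain it.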

Figure 1 is an activity diagram for the process.

Figure 1: Process for incorporating Classifier

### Conclusion

In this article, I've described a mechanism for using Bayes classifiers to reduce the security inferencing problem. The mechanism employs a combination of deterministic and nondeterministic elements -- the most interesting of these being the naive Bayes classifier. Note that this approach is fundamentally no different from document classification; given the success of Bayesian techniques in that domain, they are worth considering as a mechanism for security classification. This implementation is in its nascent stages, and we have yet to consider issues such as temporal constraints on metadata or staleness of the classifier's lookup. In a subsequent article, I hope to address these issues and report on the outcomes.

### References

1. Garvey, T. D. "The Inference Problem for Computer Security." Proceedings Computer Security Foundations Workshop V, Jun 16-18, 1992, pp.78-81.

2. Leino, K. Rustan M. and Joshi, Rajeev. "A Semantic Approach to Secure Information Flow." SRC Tech Note, 1997 -- 032, Digital Equipment Corporation, 1997.

3. Chang, Liwu and Moskowitz, Ira. "A Study of Inference Problems in Distributed Databases." Naval Research Lab-5540, Washington DC, USA.

4. Chang, Liwu and Moskowitz, Ira. "A Bayesian Network Schema for Lessening Database Inference." Proceedings of CIMCA-01, 2001.

5. McKay, David J. C. "Information Theory, Inference and Learning Algorithms." Cambridge University Press, 2003.

6. Witten, Ian H. and Frank, Eibe. "Data Mining: Practical Machine Learning Tools and Techniques." Elsevier, Inc., 2005. The BIPS Project: http://www.astro.cornell.edu/staff/loredo/bayes.

7. Segaran, Toby. "Programming Collective Intelligence: Building Smart Web 2.0 Applications." O'Reilly Media, Inc. 2007.

8. The BUGS Project. http://www.mrc-bsu.cam.ac.uk/bugs.

9. Bayesian Filtering Library. http://www.orocos.org/bfl.

10. Bayes++. http://bayesclasses.sourceforge.net/Bayes++.html.

11. Larsen, Kim. "Generalized Naïve Bayes Classifiers." SIGKDD Explorations, Volume 7, Issue 1, pp. 76-81.

12. Zhang, Harry. "The Optimality of Naïve Bayes." http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf, 2004.

13. SamIam official site: http://reasoning.cs.ucla.edu/samiam/
