CS116 -- Artificial Intelligence
Fall, 2006
Bayesian Spam Filter
(last updated 10/31/2006)

Overview
Design and implement a Bayesian spam filter.  The goal of your program is to take an email message and correctly classify it as either "spam" or "not spam".  Because you will be using a Bayesian approach to make the classification, you will need to train the classifier on a collection of labeled email messages.

NOTE: make sure your code will compile and run on my system.  If you have any doubts, check with me in advance.

Solution method
I want you to implement a simple Bayesian classifier along the lines we discussed in class.  The components of your assignment include: email pre-processing, converting email to feature vector representation, Bayesian update, and Bayesian prediction.

You'll need to implement a "naive Bayes" algorithm.  I suggest you write it for boolean features.  (That is, an email message will be converted into a boolean feature vector in which the feature representing the word "foo" is true if the message contains one or more occurrences of the word "foo".  More on that below.  Normally, we might use numeric counts of the number of times each word occurs, but for our purposes a boolean feature vector should suffice.)  The core of the classifier is the table of counts from which we estimate conditional probabilities.  For a message marked as spam, we increment the counts of the features found in that message, the overall count of spam messages, and the overall count of all messages.  Similarly, for a message marked "not spam", we increment the counts of the relevant features, the count of "not spam" messages, and the overall count of all messages.
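The count table described above might be sketched as follows.  This is only an illustration, not a required design: the class name CountTable, its attribute names, and the label strings are assumptions made for the example.

```python
from collections import Counter

class CountTable:
    """Counts needed to estimate conditional probabilities (hypothetical layout)."""

    def __init__(self):
        # One feature counter per class label.
        self.feature_counts = {"spam": Counter(), "not spam": Counter()}
        # Number of messages seen per class, and overall.
        self.class_counts = {"spam": 0, "not spam": 0}
        self.total = 0

    def update(self, features, label):
        """Record one labeled message given the features it contains."""
        # Boolean features: a word counts once per message, however often it occurs.
        for feature in set(features):
            self.feature_counts[label][feature] += 1
        self.class_counts[label] += 1
        self.total += 1

table = CountTable()
table.update(["cheap", "pills", "cheap"], "spam")
table.update(["meeting", "notes"], "not spam")
```

Note that the repeated word "cheap" is counted only once for its message, which is what the boolean representation calls for.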

A note on priors.  Because zero-probabilities give us headaches in our computations, we initialize all probabilities with a prior.  For our purposes, it should be sufficient to seed each class, spam and non-spam, with two imaginary messages, one that has all features present and one that has no features present.  In other words, when we start off we assume that every feature is equally likely to appear in a spam message as in a non-spam message.  These priors are attenuated (become insignificant) as we gather training data.
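To see what the two imaginary messages do to the arithmetic, here is a small sketch.  The function name and its arguments are made up for the example.  Seeding a class with one message that has every feature and one that has none adds 1 to each feature count and 2 to the class count.

```python
def feature_probability(feature_count, class_count):
    """Estimate P(feature present | class) with the two-message prior.

    feature_count: real messages of this class that contain the feature.
    class_count:   real messages of this class seen so far.
    """
    # The imaginary "all features" message adds 1 to the numerator;
    # both imaginary messages add 2 to the denominator.
    return (feature_count + 1) / (class_count + 2)

print(feature_probability(0, 0))    # no training data yet: 0.5
print(feature_probability(8, 10))   # 9 / 12 = 0.75
```

With no training data every feature starts at 0.5 in both classes, and the estimate can never reach exactly 0 or 1, so the zero-probability problem does not arise.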

When it comes time to classify an unknown message, we convert it to a feature vector, compute the probability of it being spam conditioned on the particular features this message has, and compute the corresponding probability of it being not spam.  Whichever probability is greater determines the label we predict for the message.
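The prediction step might look like the sketch below.  Two choices here are assumptions rather than requirements: the scores are summed as logarithms (multiplying thousands of small probabilities directly can underflow to zero), and the class prior includes the imaginary seed messages (2 per class, 4 overall).  All function and parameter names are hypothetical.

```python
import math

def log_score(message_features, vocabulary, feature_counts, class_count, total_count):
    """Log of P(class) times the product of P(feature value | class)."""
    # Class prior, including the imaginary seed messages.
    score = math.log((class_count + 2) / (total_count + 4))
    for word in vocabulary:
        p = (feature_counts.get(word, 0) + 1) / (class_count + 2)
        # A feature that is absent contributes evidence too, via (1 - p).
        score += math.log(p if word in message_features else 1.0 - p)
    return score

def classify(message_features, vocabulary, counts):
    """counts maps each label to (feature count dict, number of messages)."""
    total = sum(n for _, n in counts.values())
    scores = {
        label: log_score(message_features, vocabulary, fc, n, total)
        for label, (fc, n) in counts.items()
    }
    # The label with the larger score is the prediction.
    return max(scores, key=scores.get)

vocabulary = ["viagra", "meeting"]
counts = {
    "spam": ({"viagra": 8, "meeting": 1}, 10),
    "not spam": ({"viagra": 0, "meeting": 7}, 10),
}
print(classify({"viagra"}, vocabulary, counts))
print(classify({"meeting"}, vocabulary, counts))
```

Because both classes are divided by the same normalizing constant, comparing these unnormalized scores gives the same answer as comparing the true posterior probabilities.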

You will need to determine a feature representation for your messages.  You want to ignore words, such as "the", that are probably equally likely to appear in spam as in not spam.  You may also want to ignore the capitalization of words (although words in ALL CAPS are often a good predictor of some spam, spammers are getting smarter).  I suggest you pick a set on the order of 3,000 words to use as your feature vector.  To determine which words to use, you might pick them randomly, but you will probably get better predictive results if you process all of the email messages and gather statistics on which words have different frequencies in spam messages and not-spam messages.
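One way to pick the words, sketched under several assumptions: the stop-word list is a tiny illustrative sample, words are lowercased and split on non-letters, and a word's usefulness is measured by the gap between the fraction of spam messages and the fraction of not-spam messages that contain it.  Other selection statistics would work as well, and the names here are hypothetical.

```python
import re
from collections import Counter

# Illustrative only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def tokenize(text):
    """Return the set of lowercased words in a message, minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def select_vocabulary(spam_texts, ham_texts, size=3000):
    """Keep the words whose message frequency differs most between classes."""
    # Count messages containing each word, not total occurrences.
    spam_df = Counter(w for t in spam_texts for w in tokenize(t))
    ham_df = Counter(w for t in ham_texts for w in tokenize(t))
    n_spam = max(len(spam_texts), 1)
    n_ham = max(len(ham_texts), 1)

    def gap(word):
        return abs(spam_df[word] / n_spam - ham_df[word] / n_ham)

    words = set(spam_df) | set(ham_df)
    # Largest gap first; ties are broken alphabetically for repeatability.
    return sorted(words, key=lambda w: (-gap(w), w))[:size]

spam = ["Buy cheap pills now", "cheap pills for the win"]
ham = ["Meeting at noon", "the meeting notes"]
vocab = select_vocabulary(spam, ham, size=3)
```

In this toy example the three words kept are "cheap", "meeting", and "pills", each of which appears in every message of one class and in none of the other.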

You should provide functions or programs that will (a) take a collection of training email messages (e.g., spam) and a label for that collection ("spam") and train the classifier accordingly, and (b) take as input a collection of email messages and return a corresponding collection of classifications.  In this scenario, we are doing a kind of batch training.
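The two batch entry points, (a) and (b), could be packaged along these lines.  This is a compact sketch, not a reference solution: the class name SpamFilter, its method names, the in-memory lists of message strings, and the use of log scores are all assumptions made for the example.

```python
import math
import re
from collections import Counter

class SpamFilter:
    """Hypothetical batch interface over a boolean naive Bayes classifier."""

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)
        self.feature_counts = {"spam": Counter(), "not spam": Counter()}
        self.class_counts = {"spam": 0, "not spam": 0}

    def _features(self, text):
        # Boolean features: the vocabulary words present in the message.
        return set(re.findall(r"[a-z]+", text.lower())) & self.vocabulary

    def train(self, messages, label):
        """(a) Train on a collection of messages that share one label."""
        for text in messages:
            self.feature_counts[label].update(self._features(text))
            self.class_counts[label] += 1

    def _log_score(self, features, label):
        n = self.class_counts[label]
        total = sum(self.class_counts.values())
        # Two imaginary seed messages per class, four overall.
        score = math.log((n + 2) / (total + 4))
        for word in self.vocabulary:
            p = (self.feature_counts[label][word] + 1) / (n + 2)
            score += math.log(p if word in features else 1.0 - p)
        return score

    def classify(self, messages):
        """(b) Return one predicted label per input message, in order."""
        labels = []
        for text in messages:
            features = self._features(text)
            labels.append(max(self.class_counts,
                              key=lambda label: self._log_score(features, label)))
        return labels

spam_filter = SpamFilter(["cheap", "pills", "meeting"])
spam_filter.train(["cheap pills", "cheap cheap pills now"], "spam")
spam_filter.train(["meeting at noon", "notes from the meeting"], "not spam")
print(spam_filter.classify(["cheap pills today", "meeting tomorrow"]))
```

Calling train once per labeled collection and classify once per test collection matches the batch scenario described above.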

You should train and test your classifier on your own email -- both spam and not.  When you submit your assignment, include instructions for training and testing.  I will train and test on my own collections.

The goals of this assignment are:
  1. gain experience implementing a Bayesian classifier
  2. gain appreciation for the potential (and limitations) of naive Bayes
  3. encounter some of the issues associated with text-processing
  4. have fun and beat the pants off the Barracuda spam filter

Submission and grading
Submit your completed spam filter to Eureka in the usual way (zip or tar file).  Include a readme file that explains how to train and test your classifier, etc.  I will grade this project based on (a) code completeness and correctness, (b) coding style including documentation, and (c) classification accuracy.