Overview
Design and implement a Bayesian spam filter. The goal of
your program is to take an email message and correctly
classify it as either "spam" or "not spam". Because you will be
using a Bayesian approach to make the classification, you will need to
train the classifier on a collection of labeled email messages.
NOTE: make sure your code will compile and run on my system. If
you have any doubts, check with me in advance.
I want you to implement a simple Bayesian classifier along the lines we
discussed in class. The components of your assignment include:
email pre-processing, converting email to feature vector
representation, Bayesian update, and Bayesian prediction.
You'll need to implement a "naive Bayes" algorithm. I suggest you
write it for boolean features. (That is, an email message will be
converted into a boolean feature vector where a feature representing
the word "foo" will be true for a message if the message contains one
or more occurrences of the word "foo". More on that below.
Normally, we might use numeric counts for the number of times the word
occurs, but for our purposes a boolean feature vector should
suffice.) The core of the classifier is the table of counts from
which we estimate conditional probabilities. For a message marked
as spam, we increment the counts of the features found in this spam
message; we increment the overall count on spam messages, and we
increment the overall count of all messages. Similarly, for a
message marked "not spam", we increment the counts on the relevant
features, the count for the "not spam" messages, and the overall count
of all messages.
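As a rough sketch of the counting step just described (not a required
design), the table might look like this in Python. The class name
CountTable and its attribute names are made up for illustration, and
priors are left out here to keep the update itself visible.

```python
from collections import Counter


class CountTable:
    """Hypothetical container for the counts the classifier is built on."""

    def __init__(self):
        self.total_messages = 0                 # count of all messages seen
        self.class_messages = Counter()         # label -> messages with that label
        self.feature_counts = {"spam": Counter(), "not spam": Counter()}

    def update(self, present_features, label):
        """Record one labeled message, given the features that are true for it."""
        # Boolean features: each feature counts at most once per message.
        self.feature_counts[label].update(set(present_features))
        self.class_messages[label] += 1
        self.total_messages += 1


table = CountTable()
table.update({"viagra", "free"}, "spam")
table.update({"free", "meeting"}, "not spam")
table.update({"free"}, "spam")

print(table.feature_counts["spam"]["free"])   # 2
print(table.class_messages["spam"])           # 2
print(table.total_messages)                   # 3
```

Conditional probabilities are then ratios of these counts, for example
the count of "free" in spam divided by the count of spam messages.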
A note on priors.
Because zero-probabilities give us headaches in our computations, we
initialize all probabilities with a prior.
For our purposes, it should be sufficient to seed each class, spam and
non-spam, with two imaginary messages, one that has all features
present and one that has no features present. In other words,
when we start off we assume that every feature is as likely to
appear in a spam message as in a non-spam message. These priors
are attenuated (become insignificant) as we gather training data.
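Read literally, this seeding starts each class with a message count of 2
and each feature count at 1, so the estimate for a feature given a class
is (feature count + 1) / (message count + 2). A minimal sketch, with a
hypothetical function name:

```python
def seeded_estimate(feature_count, class_count):
    """Estimate P(feature present | class) from real counts, after seeding
    the class with two imaginary messages: one containing every feature
    and one containing none."""
    return (feature_count + 1) / (class_count + 2)


# No training data yet: every feature is present with probability 1/2.
print(seeded_estimate(0, 0))   # 0.5

# A word never seen in 10 real messages: small, but not zero.
print(seeded_estimate(0, 10))

# With plenty of data the seed barely matters: close to 900/1000.
print(seeded_estimate(900, 1000))
```

Because the seeded estimate can never be exactly 0 or 1, neither a
present nor an absent feature can zero out the whole product at
prediction time.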
When it comes time to classify an unknown message, we convert it to a
feature vector, compute the probability of it being spam conditioned
on the particular features this message has, and compute the
corresponding probability for it being non-spam. Whichever
probability is greater is the label we predict for the message.
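A sketch of this comparison follows. It compares P(class) times the
product of the per-feature probabilities for each class; the two
posteriors share the same denominator, so it can be skipped. Summing
logs instead of multiplying is an added assumption, meant to avoid
underflow with thousands of features. The function names and the layout
of the model dictionary are hypothetical, and the counts are assumed to
already include the seed messages from the priors.

```python
import math


def log_score(vector, feature_counts, class_count, total_count):
    """log of P(class) * product over features of P(feature value | class)."""
    score = math.log(class_count / total_count)
    for feature, present in vector.items():
        # Seeded counts: a feature never seen in training still has count 1.
        p = feature_counts.get(feature, 1) / class_count
        score += math.log(p if present else 1.0 - p)
    return score


def classify(vector, model):
    """Return the label whose score is greatest."""
    total = sum(entry["messages"] for entry in model.values())
    scores = {
        label: log_score(vector, entry["features"], entry["messages"], total)
        for label, entry in model.items()
    }
    return max(scores, key=scores.get)


# Toy model: 3 real messages per class plus the 2 seed messages each.
model = {
    "spam":     {"messages": 5, "features": {"free": 4, "meeting": 1}},
    "not spam": {"messages": 5, "features": {"free": 2, "meeting": 3}},
}

print(classify({"free": True, "meeting": False}, model))   # spam
print(classify({"free": False, "meeting": True}, model))   # not spam
```

For the first message the unnormalized scores are 0.5 * 0.8 * 0.8 = 0.32
for spam and 0.5 * 0.4 * 0.4 = 0.08 for not spam, so spam wins.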
You will need to determine a feature representation for your
messages. You want to ignore words, such as "the", that are
probably equally likely to appear in spam as in not spam. You may
also want to ignore the capitalization of words (although words in ALL
CAPS are often a good predictor of some spam, the spammers are
getting smarter). I suggest you pick a set on the order of 3,000
words that you will use as your feature vector. To determine
which words to use, you might pick them randomly, but you will probably
get better predictive results if you process all of the email messages
and gather statistics on which words have different frequencies for
spam messages and not-spam messages.
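One possible way to gather those statistics, as a sketch only: count the
fraction of spam and of not-spam messages that contain each word, and
keep the words where the two fractions differ most. The scoring rule
(absolute difference of fractions), the tokenizing regular expression,
and all function names are assumptions, not part of the assignment. Both
collections are assumed to be non-empty.

```python
import re
from collections import Counter


def words_in(message):
    """Distinct lower-cased words in a message (capitalization ignored)."""
    return set(re.findall(r"[a-z']+", message.lower()))


def select_vocabulary(spam_messages, ham_messages, size=3000):
    """Pick up to `size` words whose per-class message frequencies differ most."""
    spam_df = Counter(w for m in spam_messages for w in words_in(m))
    ham_df = Counter(w for m in ham_messages for w in words_in(m))

    def gap(word):
        # Fraction of spam containing the word vs. fraction of not-spam.
        return abs(spam_df[word] / len(spam_messages)
                   - ham_df[word] / len(ham_messages))

    candidates = set(spam_df) | set(ham_df)
    # Largest gap first; ties broken alphabetically so the result is stable.
    return sorted(candidates, key=lambda w: (-gap(w), w))[:size]


def to_feature_vector(message, vocabulary):
    """Boolean feature vector: True if the word occurs one or more times."""
    present = words_in(message)
    return {word: word in present for word in vocabulary}


spam = ["FREE money now", "the free offer"]
ham = ["the meeting is at noon", "the notes from the meeting"]

vocabulary = select_vocabulary(spam, ham, size=2)
print(vocabulary)                                   # ['free', 'meeting']
print(to_feature_vector("Free lunch", vocabulary))  # {'free': True, 'meeting': False}
```

A word that shows up in nearly every message of both classes gets a gap
near zero and drops out once there is enough data, which is the
behavior wanted for words such as "the".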
You should provide functions or programs that will (a) take a
collection of training email messages (e.g., spam) and a label for
that collection ("spam") and train the classifier accordingly, and
(b) take as input a collection of email messages, and return a
corresponding collection of classifications. In this scenario, we
are doing a kind of batch training.
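How a "collection" is stored is up to you. As one illustration, the
sketch below assumes a directory with one raw email message per file
and uses Python's standard email package to pull out the plain-text
body. The function name load_collection and the one-file-per-message
layout are assumptions.

```python
from email import message_from_bytes, policy
from pathlib import Path


def load_collection(directory):
    """Return the plain-text body of every file in `directory`,
    treating each file as one raw email message."""
    bodies = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        message = message_from_bytes(path.read_bytes(), policy=policy.default)
        # Prefer the text/plain part; messages without one yield "".
        part = message.get_body(preferencelist=("plain",))
        bodies.append(part.get_content() if part is not None else "")
    return bodies
```

The training entry point would then take such a list together with its
label, and the classification entry point would take a list and return
one label per message, in the same order.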
You should train and test your classifier on your own email -- both
spam and not. When you submit your assignment, include
instructions for training and testing. I will train and test on
my own collections.
The goals of this assignment are:
gain experience implementing a Bayesian classifier
gain appreciation for the potential (and limitations) of naive Bayes
encounter some of the issues associated with
have fun and beat the pants off the Barracuda spam
Submission and grading
Submit your completed spam filter to Eureka in the usual way (zip or
tar file). Include a readme file that explains how to train and
test your classifier, etc. I will grade this project based on
(a) code completeness and correctness, (b) coding style including
documentation, and (c) classification accuracy.