- [3/24/2015] Revision of Requirement 2.
Allow me to clarify what “in a general way” means.
As you design and implement your programs,
you should be thinking toward other data sets or domains
on which you might use your naive-Bayes classifier.
This is not intended to be a strictly formal requirement.
When I look at your code,
if I find that you have unnecessarily hard-coded assumptions
then I may penalize you.
However, I do not expect that your code
should work with a random data set that I might throw at it
without modifications or configuration.
Naive Bayes Classifier
In this project, you will, among other things:
- implement a naive Bayes classifier
- improve your programming skills
- have fun
For this project, you will write a program that
forms a naive Bayes classifier from training data
and classifies unseen data using the learned model.
For our problem domain,
we will use the Titanic data set.
You may implement the naive Bayes classifier in whatever programming language you would like.
Your implementation will take two command-line arguments
for the training and test data, respectively.
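As a rough illustration of the two-argument interface, here is a minimal Python sketch. The function name `parse_args` and the argument names `train` and `test` are made up for this example; use whatever fits your language and design.

```python
import argparse

def parse_args(argv=None):
    """Parse two positional arguments: the training file, then the test file."""
    parser = argparse.ArgumentParser(description="Naive Bayes classifier")
    parser.add_argument("train", help="path to the training-set file")
    parser.add_argument("test", help="path to the test-set file")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print("training file:", args.train)
    print("test file:", args.test)
```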
- Download the data.
Create an account with Kaggle (if you have not previously done so)
and download the Titanic training data.
(You may want to download the test data also,
but we will not be using it within our class.)
Instead, you will split the 891 training instances that are provided
into training and test sets that you can use.
Review Chapter 5, Evaluating Hypotheses, from Mitchell's text if you need clarification.
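One possible way to split the data is a seeded random shuffle followed by a cut, sketched below in Python. The function name `split_rows`, the 30% test fraction, and the seed are assumptions chosen for illustration; nothing here prescribes a particular ratio.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=0):
    """Return (train_rows, test_rows): two disjoint random subsets of rows.

    test_fraction and seed are illustrative defaults, not required values.
    """
    shuffled = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

The rows could come from `csv.DictReader` applied to the downloaded file. Fixing the seed lets you reproduce a split when you report results.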
- Build the naive Bayes model.
Write a program, train-model, that reads a training-set filename on the command line,
opens and reads the file,
and computes the marginal and conditional probabilities needed by naive Bayes.
Note: in class and in the text, we discussed only discrete attributes;
for numeric/continuous attributes,
you should assume the values are distributed according to a Gaussian distribution.
Thus, each such attribute will be represented by a mean and a variance for each class.
You should write your program in a general way
that will work with other data sets with minimal changes
or needed configuration.
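The training step might be sketched as follows in Python. This is only one possible design under stated assumptions: instances are dictionaries of strings (as `csv.DictReader` would produce), the caller names the class attribute and lists which attributes are discrete or numeric, empty strings are treated as missing and skipped, no smoothing is applied, and every class has at least two values for each numeric attribute. The function name `train` and the layout of the returned model are hypothetical.

```python
import statistics
from collections import Counter, defaultdict

def train(rows, class_attr, discrete_attrs, numeric_attrs):
    """Estimate class priors, P(value | class), and per-class Gaussian parameters."""
    class_counts = Counter(row[class_attr] for row in rows)
    priors = {c: n / len(rows) for c, n in class_counts.items()}

    # value_counts[attr][cls][value] = number of cls rows having that value
    value_counts = {a: defaultdict(Counter) for a in discrete_attrs}
    # numeric_values[attr][cls] = list of observed values among cls rows
    numeric_values = {a: defaultdict(list) for a in numeric_attrs}
    for row in rows:
        c = row[class_attr]
        for a in discrete_attrs:
            if row[a] != "":                      # assumption: "" means missing
                value_counts[a][c][row[a]] += 1
        for a in numeric_attrs:
            if row[a] != "":
                numeric_values[a][c].append(float(row[a]))

    conditionals = {
        a: {c: {v: n / sum(counts.values()) for v, n in counts.items()}
            for c, counts in by_class.items()}
        for a, by_class in value_counts.items()
    }
    gaussians = {
        a: {c: (statistics.mean(xs), statistics.variance(xs))  # sample variance
            for c, xs in by_class.items()}
        for a, by_class in numeric_values.items()
    }
    return {"priors": priors, "conditionals": conditionals, "gaussians": gaussians}
```

Because the attribute names are passed in rather than hard-coded, the same function can be pointed at a different data set by changing only the arguments.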
- Use the learned model to classify test instances.
Write a program, test-model, that reads a test-set filename
and attempts to classify each test instance without looking at the class value.
Then, an error is computed by comparing the model's classification with the actual class value.
Again, the probability that a continuous attribute will have a value x
can be computed as described in Table 5.4 of Mitchell's text
and using the sample mean and variance from the data.
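A minimal Python sketch of the classification step follows. It assumes a hypothetical model layout: `model["priors"][cls]`, `model["conditionals"][attr][cls][value]`, and `model["gaussians"][attr][cls] = (mean, variance)` with variance greater than zero. It also assumes empty strings mark missing values, which are skipped. Scores are summed in log space to avoid underflow, a common choice rather than a requirement.

```python
import math

def log_gaussian(x, mean, var):
    """Natural log of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def classify(instance, model):
    """Return the class with the highest log-posterior score for one instance."""
    best_class, best_score = None, -math.inf
    for c, prior in model["priors"].items():
        score = math.log(prior)
        for attr, by_class in model["conditionals"].items():
            value = instance.get(attr, "")
            if value == "":                       # assumption: "" means missing
                continue
            p = by_class[c].get(value, 0.0)       # unseen value: probability 0
            score += math.log(p) if p > 0 else -math.inf
        for attr, by_class in model["gaussians"].items():
            value = instance.get(attr, "")
            if value == "":
                continue
            mean, var = by_class[c]
            score += log_gaussian(float(value), mean, var)
        if best_class is None or score > best_score:
            best_class, best_score = c, score
    return best_class
```

Note that `classify` never reads the class attribute of the instance, so the actual class value stays hidden until the error is computed.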
- Do both training and testing.
Write a program, evaluate-naiveBayes,
that reads two filenames from the command line (for the training and test sets, respectively),
and puts the previous two steps together.
Your program should display:
(a) the total number of training instances processed
and the number of training instances for each respective class;
(b) the number of test instances of each class;
and (c) the overall prediction accuracy and the accuracy on test cases of each respective class.
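The quantities in (a) through (c) could be computed along these lines. This Python sketch assumes you already have the training labels, the actual test labels, and the predicted test labels as lists; the function names `accuracy_report` and `print_report` and the output wording are illustrative only.

```python
from collections import Counter

def accuracy_report(actual, predicted):
    """Return (per-class test counts, overall accuracy, per-class accuracy)."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    per_class = {c: correct[c] / n for c, n in totals.items()}
    overall = sum(correct.values()) / len(actual)
    return totals, overall, per_class

def print_report(train_labels, actual, predicted):
    """Print training counts, test counts, and accuracies in a plain format."""
    train_counts = Counter(train_labels)
    print(f"training instances: {len(train_labels)}")
    for c in sorted(train_counts):
        print(f"  class {c}: {train_counts[c]}")
    totals, overall, per_class = accuracy_report(actual, predicted)
    for c in sorted(totals):
        print(f"test instances of class {c}: {totals[c]}")
    print(f"overall accuracy: {overall:.3f}")
    for c in sorted(per_class):
        print(f"  accuracy on class {c}: {per_class[c]:.3f}")
```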
- Write a report.
Write a one-page (maximum) technical report
using this provided template.
Your report should describe the problem and what you did
at a level of detail that would allow a reader to replicate what you did.
You should also include the results of your program
either as tabular data or as learning curves.
Include your report as a pdf file only.
This is an individual assignment.
Appropriate expectations on working together apply.
The goal of the exercise is to learn the naive Bayes classifier
by implementing it as a program;
do not short-change your own learning experience by grabbing an implementation from the web
and then making a few changes.
Submit a README file together with your program file(s) and report (pdf) in either .tgz or .zip format.
Your README should describe how to use or run your program.
Document your code appropriately.
Describe any significant discoveries or learning experiences you encountered during the process.