• [3/24/2015] Revision of Requirement 2. Allow me to clarify what “in a general way” means. As you design and implement your programs, you should be thinking toward other data sets or domains on which you might use your naive Bayes classifier. This is not intended to be a strictly formal requirement. When I look at your code, if I find that you have unnecessarily hard-coded assumptions, then I may penalize you. However, I do not expect your code to work, without modification or configuration, on an arbitrary data set that I might throw at it.

Naive Bayes Classifier

In this project, you will, among other things:

• implement a naive Bayes classifier
• have fun

For this project, you will write a program that forms a naive Bayes classifier from training data and classifies unseen data using the learned model. For our problem domain, we will use the Titanic data set from Kaggle. You may implement the naive Bayes classifier in whatever programming language you like. Your implementation will take two command-line arguments: the training data file and the test data file, respectively.
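As an illustration only, the two-argument command-line interface described above might be sketched in Python as follows. The function names `parse_args` and `read_rows` are hypothetical, and the sketch assumes the data files are CSV with a header row; neither is prescribed by the assignment.

```python
import argparse
import csv


def parse_args(argv=None):
    """Parse two positional arguments: the training file, then the test file."""
    parser = argparse.ArgumentParser(
        description="Train and evaluate a naive Bayes classifier.")
    parser.add_argument("train_file", help="path to the training data")
    parser.add_argument("test_file", help="path to the test data")
    # argv=None makes argparse fall back to the real command line (sys.argv[1:]).
    return parser.parse_args(argv)


def read_rows(path):
    """Read a CSV file with a header row into a list of dicts (assumed format)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

A script built on this sketch would call `parse_args()` with no arguments at startup and pass `args.train_file` and `args.test_file` to `read_rows`. Keeping the column names in the file header rather than in the code is one way to avoid hard-coded assumptions about the data set.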

Requirements

1. Download the data. Create an account with Kaggle (if you have not previously done so) and download the Titanic training data. You may also want to download the Kaggle test data, but we will not be using it in this class. Instead, you will split the 891 provided training instances into training and test sets of your own. Review Chapter 5, Evaluating Hypotheses, from Mitchell's text if you need clarification.
2. Build the naive Bayes model. Write a program, train-model, that reads a training-set filename on the command line, opens and reads the file, and computes the marginal and conditional probabilities needed by naive Bayes. Note: in class and in the text, we discussed only discrete attributes; for numeric/continuous attributes, you should assume the values are distributed according to a Gaussian distribution. Thus, such attributes will be represented as a mean and variance. You should write your program in a general way that will work with other data sets with minimal changes or needed configuration.
3. Use the learned model to classify test instances. Write a program, test-model, that reads a test-set filename and attempts to classify each test instance without looking at the class value. The program then computes the error by comparing the model's classification with the actual class value. Again, the probability that a continuous attribute will have a value x can be computed as described in Table 5.4 of Mitchell's text, using the sample mean and variance from the data.
4. Do both training and testing. Write a program, evaluate-naiveBayes, that reads two filenames from the command line (for the training and test sets, respectively) and puts the previous two steps together. Your program should display: (a) the total number of training instances processed and the number of training instances for each respective class; (b) the number of test instances of each class; and (c) the overall prediction accuracy and the accuracy on test cases of each respective class.
5. Write a report. Write a one-page (maximum) technical report using the provided template. Your report should describe the problem and what you did at a level of detail that would allow a reader to replicate your work. You should also include the results of your program, either as tabular data or as learning curves. Submit your report as a PDF file only.
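To make the Gaussian treatment of continuous attributes in Requirements 2 and 3 concrete, here is a minimal Python sketch, not a reference solution. It handles numeric attributes only (discrete attributes would need frequency counts instead), uses the n - 1 sample variance, and sums log probabilities to avoid numerical underflow. The function names and the toy data are made up for illustration and have nothing to do with the Titanic data set.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Estimate class priors and per-class (mean, variance) for each numeric attribute."""
    counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in counts.items()}
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    stats = {}
    for c, members in by_class.items():
        stats[c] = []
        for column in zip(*members):  # one column per attribute
            mean = sum(column) / len(column)
            # Sample variance (n - 1 denominator): assumes at least two rows
            # per class and a nonzero spread in every attribute.
            var = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
            stats[c].append((mean, var))
    return priors, stats


def gaussian_log_density(x, mean, var):
    """Log of the normal density with the given mean and variance, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def classify(row, priors, stats):
    """Return the class with the highest log prior plus summed log likelihoods."""
    def score(c):
        return math.log(priors[c]) + sum(
            gaussian_log_density(x, m, v) for x, (m, v) in zip(row, stats[c]))
    return max(priors, key=score)


# Made-up toy data: two numeric attributes, two classes.
rows = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
labels = [0, 0, 0, 1, 1, 1]
priors, stats = train(rows, labels)
print(classify((1.1, 2.1), priors, stats))
print(classify((5.1, 5.9), priors, stats))
```

On this toy data the two calls print 0 and 1. A real implementation would also need to decide how to handle missing values and attributes with zero variance within a class, both of which this sketch ignores.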

Submission

This is an individual assignment. Appropriate expectations on working together apply. The goal of the exercise is to learn the naive Bayes classifier by implementing it as a program; do not short-change your own learning experience by grabbing an implementation from the web and then making a few changes.

Submit a README file together with your program file(s) and report (pdf) in either .tgz or .zip format. Your README should describe how to use or run your program. Document your code appropriately. Describe any significant discoveries or learning experiences you encountered during the process.