Over the past several years, a number of peer-reviewed journals and conferences have been embarrassed to discover that they had accepted randomly generated research papers. You can read about one example at this site (with link to actual paper). While such papers are impressive in an amusing sort of way, it is instructive to discover just how far we can get with extremely naive methods. In this exercise, you will implement your own ‘random’ text generator, and we will try to get a sense of how much additional work would be necessary to get a randomly generated paper accepted. In the process, I am hoping that, among other things, you will:
Reading Chapters 22 and 23 from our textbook will provide a good foundation for this project. After digesting those, you may choose to delve deeper in one or more directions to improve the quality of the language generator that you will implement. You might read this chapter on N-grams from Speech and Language Processing by Daniel Jurafsky and James Martin.
As in other exercises, you may use the language of your choice, provided that it runs on my laptop, etc. As always, your code must be well documented.
You should write a one-page (max) report of your activities and findings associated with this assignment. Your report should serve as a stand-alone document; thus, it should describe the problem or focus and the approach that you employed, and indicate how well that approach performed. Specifically, your report should describe your method of pre-processing the corpus, your model building technique, your text generation technique, and the improvements you implemented. As a standard measure of comparison, you should contrast text generated by your trigram model with and without the improvements you implemented in the final requirement. You are certainly welcome to include other comparisons and other methods for evaluating your work but, at a minimum, I want to see the trigram comparison.
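To make the pre-processing, model building, and text generation steps named above concrete, here is a minimal sketch of a baseline trigram generator in Python. It is only an illustration under simplifying assumptions, not the required design: the function names (`tokenize`, `build_trigram_model`, `generate`) and the toy corpus are hypothetical, tokenization is a plain lowercase-and-split, and there is no smoothing, back-off, or sentence-boundary handling.

```python
import random
from collections import defaultdict


def tokenize(text):
    """Naive pre-processing (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_trigram_model(tokens):
    """Map each pair of consecutive words to the list of words seen after it.

    Repeated followers are kept in the list, so sampling uniformly from the
    list is equivalent to sampling in proportion to the observed counts.
    """
    model = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)].append(w3)
    return model


def generate(model, length, rng):
    """Generate text by repeatedly sampling a follower of the last two words.

    Starts from a randomly chosen word pair (so the output has at least two
    words) and stops at `length` words or when the current pair has no
    recorded follower.
    """
    w1, w2 = rng.choice(sorted(model))  # sorted for reproducibility with a seed
    words = [w1, w2]
    while len(words) < length:
        followers = model.get((w1, w2))
        if not followers:
            break
        w3 = rng.choice(followers)
        words.append(w3)
        w1, w2 = w2, w3
    return " ".join(words)


# Toy corpus, purely illustrative; a real run would read a large text file.
corpus = "the cat sat on the mat and the cat ate the fish on the mat"
model = build_trigram_model(tokenize(corpus))
sample = generate(model, 12, random.Random(0))
print(sample)
```

A baseline of this kind gives you the "without improvements" side of the trigram comparison; any improvements you implement can then be contrasted against its output on the same corpus.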
I am providing a modified template file that you should use to format your one-page report. Note that this template works with LibreOffice and seems to mostly work with Word. If you would rather use LaTeX, you may use the style file from the ACM that was linked previously. If using the LaTeX style file, you should make the following modifications:
Include a README file with your final submission. It should serve as an index to the files that you are submitting, and include instructions for running your program.
You should bundle your files (code, README, and report PDF) in a gzipped tar file. Name your gzipped tar file with your Westmont emailname and “P5” (no spaces); for example, someone named Eva Bailey might create a folder called “evabaileyP5” or “ebaileyP5”. When I open your submission, your files should be contained within a sub-directory easily identifiable by your name.
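If you would like to script the bundling step, one possible sketch using Python's standard `tarfile` module follows. The helper name `bundle` and the `.tar.gz` extension are assumptions made for illustration; the folder name comes from the example above. Passing `arcname` keeps every file inside a single named sub-directory when the archive is unpacked.

```python
import tarfile
from pathlib import Path


def bundle(folder, archive_name=None):
    """Create a gzipped tar file that holds `folder` as its top-level directory.

    `archive_name` defaults to "<folder name>.tar.gz" in the current directory
    (the extension is an assumption, not something the assignment specifies).
    """
    folder = Path(folder)
    archive = Path(archive_name or f"{folder.name}.tar.gz")
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname strips any leading path so the archive unpacks to "<folder name>/..."
        tar.add(str(folder), arcname=folder.name)
    return archive


# Example usage (assumes a folder named "evabaileyP5" exists in the current directory):
#   bundle("evabaileyP5")   # writes evabaileyP5.tar.gz
```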