Exercise 4: Stochastic Language Generation

“You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model

Over the past few years, several peer-reviewed journals and conferences have been embarrassed to find that they had accepted randomly generated research papers. You can read about one example at this site (with link to actual paper). Such papers are impressive in an amusing sort of way, and it is instructive to discover just how far we can get with extremely naive methods. In this exercise, you will implement your own ‘random’ text generator, and we will try to get a sense of how much additional work would be necessary to get a randomly generated paper accepted. In the process, I am hoping that, among other things, you will:

Reading Chapters 22 and 23 from our textbook will provide a good foundation for this project. After digesting those, you may choose to delve deeper in one or more directions to improve the quality of the language generator that you will implement. You might read this chapter on N-grams from Speech and Language Processing by Daniel Jurafsky and James Martin.

Requirements

As in other exercises, you may use the language of your choice, subject to the usual provisos (it must run on my laptop, etc.). As always, your code must be well documented.

  1. Process the document corpus. I am providing a corpus of documents including: the King James version of the Bible (I have removed the verse annotations), Leo Tolstoy's War and Peace, Fyodor Dostoevsky's Crime and Punishment, and the complete works of Jane Austen. An important part of your assignment is to preprocess these documents. How you do so is up to you, but it will certainly influence the quality of your language generator (and your grade). Some questions to consider when deciding how to preprocess the corpus: Should you unify the case of all characters? Should you remove or include punctuation? If you include punctuation, should each mark be a separate token or part of the word it follows?
  2. Build n-gram models. Implement functionality that will process a given corpus of documents and generate an n-gram model. As a result of the previous step, you may combine the corpus into a single document if you wish. Your model builder should parameterize the value of n either as an input or as a (documented) configuration value. Typically, we'll look at results for values of n between 1 and 3.
  3. Generate text. Using your model(s) from the processed corpus, implement functionality that will stochastically generate a sequence of words. The length of the sequence should again be either an argument or a configuration parameter. You should support sequences of up to (at least) 50 words. Such a sequence need not be a single sentence (and it would be better if it were not).
  4. Improve the linguistic quality. Using one or more techniques suggested in the text or that you find on your own, implement code that either filters the text randomly generated in the previous step or improves its readability.
  5. Write about your system and its output.
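The first four requirements can be sketched end to end in a few dozen lines of Python. What follows is only a minimal illustration, not a reference solution. The function names (tokenize, build_model, generate, detokenize), the regular expression, and the decisions to lowercase everything and to treat each punctuation mark as its own token are all assumptions made for the example. Restarting from a random context at a dead end is likewise just one simple option, and detokenize stands in for the much richer improvements that step 4 invites.

```python
import random
import re
from collections import Counter, defaultdict

# Assumed tokenization: lowercase words (with optional apostrophe) and
# a handful of punctuation marks, each kept as a separate token.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?|[.,;:!?]")
PUNCTUATION = set(".,;:!?")
SENTENCE_END = set(".!?")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def build_model(tokens, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # empty tuple when n == 1
        model[context][tokens[i + n - 1]] += 1
    return model


def generate(model, n, length, rng=random):
    """Sample `length` tokens, each drawn in proportion to its n-gram count."""
    output = list(rng.choice(list(model)))  # start from a random context
    while len(output) < length:
        context = tuple(output[-(n - 1):]) if n > 1 else ()
        followers = model.get(context)
        if not followers:
            # Dead end (context only seen at the end of the corpus):
            # continue from a fresh random context.
            output.extend(rng.choice(list(model)))
            continue
        words = list(followers)
        weights = list(followers.values())
        output.append(rng.choices(words, weights=weights)[0])
    return output[:length]


def detokenize(tokens):
    """Attach punctuation to the preceding word and capitalize sentence starts."""
    text = ""
    capitalize = True
    for tok in tokens:
        if tok in PUNCTUATION:
            text += tok
            capitalize = capitalize or tok in SENTENCE_END
        else:
            word = tok.capitalize() if capitalize else tok
            text += (" " if text else "") + word
            capitalize = False
    return text
```

With these definitions, something like detokenize(generate(build_model(tokenize(corpus_text), 3), 3, 50)) would produce a 50-token trigram sample, where corpus_text is a placeholder for the contents of whatever corpus files you load.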

The Report

You should write a one-page (max) report of your activities and findings associated with this assignment. Your report should serve as a stand-alone document; thus, it should describe the problem or focus and the approach that you employed, and it should indicate how well that approach performed. Specifically, your report should describe your method of preprocessing the corpus, your model building technique, your text generation technique, and the improvements you implemented. As a standard measure of comparison, you should contrast text generated by your trigram model with and without the improvements you implemented in the final coding requirement. You are certainly welcome to include other comparisons and other methods for evaluating your work, but at a minimum I want to see the trigram comparison.

I am providing a modified template file that you should use to format your one-page report. Note that this template works with LibreOffice and seems to mostly work with Word. If you would rather use LaTeX, you may use the style file from the ACM that was linked previously. If using the LaTeX style file, you should make the following modifications:

When you have completed your report and its formatting, generate a PDF document to be included in your final submission.

Submission

Include a README file with your final submission. It should serve as an index to the files that you are submitting, and include instructions for running your program.

You should bundle your files (code, README, and report PDF) in a gzipped tar file. Name your gzipped tar file with your Westmont email name and “P5” (no spaces); for example, someone named Eva Bailey might use “evabaileyP5” or “ebaileyP5”. When I open your submission, your files should be contained within a sub-directory easily identifiable by your name.
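If you would like to script the bundling step, one possible approach uses Python's standard tarfile module. This is a sketch under assumptions: the helper name bundle, the out_dir parameter, and the .tar.gz extension are choices made for the example rather than requirements stated above. The arcname argument is what places every file inside a single top-level sub-directory named after you.

```python
import tarfile
from pathlib import Path


def bundle(source_dir, archive_name, out_dir="."):
    """Pack source_dir into <out_dir>/<archive_name>.tar.gz.

    Every file is stored under a top-level folder called archive_name,
    regardless of what source_dir itself is named.
    """
    archive_path = Path(out_dir) / f"{archive_name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # add() recurses into directories by default.
        tar.add(source_dir, arcname=archive_name)
    return archive_path
```

For instance, bundle("submission", "evabaileyP5") would pack a (hypothetical) local directory named submission into evabaileyP5.tar.gz, which unpacks to a folder named evabaileyP5.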