Exercise 5: Stochastic Language Generation

Over the past several years, a number of peer-reviewed journals and conferences have been embarrassed to discover that they had accepted randomly generated research papers. You can read about one example at this site (with link to actual paper). Such papers are impressive in an amusing sort of way, and it is instructive to discover just how far we can get with extremely naive methods. In this exercise, you will implement your own ‘random’ text generator and we will try to get a sense of how much additional work would be necessary to get a randomly generated paper accepted. In the process, I am hoping that, among other things, you will:

Requirements

As in other exercises, you may use the language of your choice, provided that it runs on my laptop, etc. As always, your code must be well documented.

  1. Process the document. We will use the King James version of the Bible to study probabilistic language models and generation. I have removed the verse annotations, but you may (at your discretion) choose to unify the case of all words and/or eliminate punctuation.
  2. Build n-gram models. Implement functionality that will generate n-gram models of the Bible for n = 1, 2, and 3. Your code should parameterize the size of the n-gram model either as an argument to a function or as a (documented) configuration value.
  3. Generate text. Using your model(s) of the Bible, implement functionality that will stochastically generate a sequence of words. The length of the sequence should again be either an argument or a configuration parameter. You should support (at least) sequences of up to 50 words.
  4. Improve the linguistic quality. Using one or more techniques suggested in the text or that you find on your own, implement code that either filters or improves the text randomly generated in the previous step.
  5. Write about your system and its output. Write a one-page report describing what you did to improve the linguistic quality and how well that worked.
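
To make steps 1 through 3 concrete, here is a minimal sketch in Python; it is one possible shape for a solution, not a required design. The function names (tokenize, build_ngram_model, generate) and the choice to lowercase and strip punctuation are assumptions made for illustration. The model maps each (n-1)-word context to counts of the words that followed it, and generation samples the next word in proportion to those counts.

```python
import random
import re
from collections import Counter, defaultdict


def tokenize(text, lowercase=True, strip_punctuation=True):
    """Split raw text into word tokens (step 1). Both options are assumptions."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # Keep letters and apostrophes; everything else becomes a space.
        text = re.sub(r"[^A-Za-z']+", " ", text)
    return text.split()


def build_ngram_model(tokens, n):
    """Map each (n-1)-word context to a Counter of following words (step 2)."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # empty tuple when n == 1
        model[context][tokens[i + n - 1]] += 1
    return model


def generate(model, n, length, rng=None):
    """Sample `length` words, each drawn in proportion to its count (step 3)."""
    rng = rng or random.Random()
    # Start from a randomly chosen context that was seen in the text.
    output = list(rng.choice(sorted(model)))
    while len(output) < length:
        context = tuple(output[len(output) - (n - 1):]) if n > 1 else ()
        followers = model.get(context)
        if not followers:
            # Dead end (context seen only at the very end of the text):
            # fall back to the followers of another randomly chosen context.
            followers = model[rng.choice(sorted(model))]
        words = list(followers)
        weights = list(followers.values())
        output.append(rng.choices(words, weights=weights, k=1)[0])
    return output[:length]
```

With the text of the Bible read into a string, a call such as generate(build_ngram_model(tokenize(text), 3), 3, 50) would produce a 50-word tri-gram sequence. Passing a seeded random.Random as rng makes a run repeatable, which may help when comparing output before and after the improvements of step 4.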

The Report

You should write a one-page report of your activities and findings associated with this assignment. Your report should serve as a stand-alone document; thus, it should describe the problem or focus, the approach that you employed, and how well that approach performed. However, you should weight the description toward the fourth requirement above. You may include short generated sequences for unigram, bi-gram and tri-gram models, but I am more interested in seeing 30 or so words generated by a tri-gram model set against the improved text (of equal length) produced by your efforts in step 4. You are welcome to include figures if you think they contribute to the report; however, make sure your picture really is worth a thousand words.

I am providing a modified template file that you should use to format your one-page report. (If you would rather use LaTeX, you may use the style file from the ACM that was linked previously.) Your affiliation should be “Westmont”. Whether you use LaTeX or Word, you should use the template with only the following modifications:

When you have completed your report and its formatting, generate a pdf document to be included in your final submission.

Submission

Include a README file with your final submission. It should serve as an index to the files that you are submitting, and include instructions for running your program.

You should bundle your files (code, README, and report pdf) in a gzipped tar file. Name your gzipped tar file with your Westmont email name and “P5” (no spaces); for example, someone named Eva Bailey might use “evabaileyP5” or “ebaileyP5”. When I open your submission, your files should be contained within an easily identifiable sub-directory.
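
If you prefer to script the bundling, the following is a minimal sketch using Python's standard tarfile module. The helper name bundle_submission is hypothetical, and the sketch assumes that your files already sit in a folder named as described above and that the archive should carry the same name with a .tar.gz extension.

```python
import tarfile
from pathlib import Path


def bundle_submission(folder, archive_path=None):
    """Create a gzipped tar file whose contents sit under the folder's own name."""
    folder = Path(folder)
    if archive_path is None:
        # Assumption: the archive lives next to the folder, e.g. ebaileyP5.tar.gz.
        archive_path = folder.with_name(folder.name + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the folder name, so extracting
        # the archive yields a single, easily identifiable sub-directory.
        archive.add(folder, arcname=folder.name)
    return Path(archive_path)
```

Calling bundle_submission("ebaileyP5") would then write ebaileyP5.tar.gz beside the folder.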