Project 2.A: Personal Search Engine

This is the first installment of your personal search engine project. Do not wait until the last minute to work on the project deliverables; you may suffer in many ways if you do. This is especially true because the project will be completed in pairs: the two of you must coordinate and cooperate, and your grade will be based on how well both partners understand the entire program. As always, I encourage you to talk to your peers outside your pair or to me and to ask questions. But you must acknowledge any assistance you receive, and the code you submit must be your own.

For this project, you will implement a small search engine that takes queries from a user and retrieves relevant documents from the Westmont site, from which the user may select. In this first installment, you will write a web crawler to gather information on the web pages found within the westmont.edu domain.

Spider a Website (part 1)

  1. Create a web crawler. Start with the starter class, BaileyHTML.java, that I provide on Eureka. This code is slightly modified from the HTML class mentioned in Lab 10.6 of the text. Your web crawler class, WebCrawler, should build on this code and implement the WCrawlerI interface.
  2. Search to a depth. You should only visit links having the same prefix as the root of your crawler. You should also only visit documents with extensions .htm or .html. As your crawler spans the site from the root, it should follow links only down to a maximum depth. See the WCrawlerI documentation for more details.
  3. Gather the links. As you visit each document, your crawler must check to prevent visiting pages that have already been processed. The result of the crawler should be a Collection of links. Other parts of your program will process these.
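
The crawling steps above can be sketched as a breadth-first traversal with a visited set. This is only an illustration under stated assumptions, not the required design: the class name CrawlSketch, the crawl method and its parameters, and the linksOf function (which stands in for fetching a page and extracting its links) are all hypothetical, and the sketch does not implement WCrawlerI. It also assumes that the root itself is always collected and that "depth" means the number of links followed from the root.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; it does NOT implement the WCrawlerI interface.
final class CrawlSketch {

    /** True if the link shares the prefix and ends in .htm or .html. */
    static boolean isEligible(String link, String prefix) {
        String lower = link.toLowerCase(Locale.ROOT);
        return link.startsWith(prefix)
                && (lower.endsWith(".htm") || lower.endsWith(".html"));
    }

    /**
     * Collects the root plus every eligible page reachable by following
     * at most maxDepth links from it.
     *
     * @param linksOf assumed helper: maps a page URL to the links on that page
     */
    static Collection<String> crawl(String root, String prefix, int maxDepth,
                                    Function<String, List<String>> linksOf) {
        // LinkedHashSet rejects duplicates and remembers discovery order.
        Set<String> visited = new LinkedHashSet<>();
        visited.add(root); // assumption: the root is always collected

        List<String> frontier = new ArrayList<>();
        frontier.add(root);

        // Process one level of the site per iteration.
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String page : frontier) {
                for (String link : linksOf.apply(page)) {
                    // Set.add returns false when the link was already seen,
                    // so each page is queued at most once.
                    if (isEligible(link, prefix) && visited.add(link)) {
                        next.add(link);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}
```

In a real crawler, linksOf would read the page over the network and pull out its links, for example with whatever facilities the starter class provides. Passing it in as a function also makes it easy to test the traversal against a small in-memory map of pages before touching the live site.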

Gather the Corpus (part 2)

Your corpus should live within an instance of a new class, Corpus, which implements the CorpusI interface. For this portion of the project, you will want to understand and use Maps.

  1. Process each page. For each link returned by your crawler, extract the text from the corresponding page, removing HTML tags. Remove stop words found in the list we distributed earlier (stopwords2.txt). For remaining words, count their frequency within the document. Store the link and the Map of WFAssoc in a new class, WebDoc, which implements the WDocI interface.
  2. Assemble the corpus. Create a Collection of the processed pages (i.e., WebDocs) in your Corpus. Also, create a comprehensive collection of all the words found in any of the documents of your corpus.
  3. Determine inverse-document frequencies. For each word in your comprehensive collection of words, compute the inverse-document frequency. Count the number of documents in which a word appears at least once. Add one (1) to this value and divide by the total number of documents in the corpus. Take the natural log of the reciprocal. (Of course, you can simply take the log of the quotient of the corpus size and the incremented count, instead of dividing and then taking the log of the reciprocal.) This value is the inverse-document frequency; store it in a map where the word is the key and the inverse-document frequency is the value.
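
The word counting and inverse-document frequency steps above might look roughly like the following. This is a minimal sketch under stated assumptions: the class IdfSketch and its method names are hypothetical, each document is represented as a plain Map from word to count rather than the WFAssoc and WebDoc classes the project requires, the text is assumed to have had its HTML tags removed already, and the tokenization rule (lowercase, split on anything that is not a letter or digit) is a guess rather than a requirement.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; it does NOT implement CorpusI or WDocI.
final class IdfSketch {

    /** Counts how often each non-stop word occurs in tag-free text. */
    static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenization: lowercase, split on non-alphanumeric runs.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum); // insert 1 or add 1
        }
        return counts;
    }

    /** Computes ln(N / (df + 1)) for every word in the corpus. */
    static Map<String, Double> inverseDocumentFrequencies(
            Collection<Map<String, Integer>> docs) {
        // df: number of documents in which each word appears at least once.
        Map<String, Integer> docCounts = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String word : doc.keySet()) {
                docCounts.merge(word, 1, Integer::sum);
            }
        }

        int corpusSize = docs.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            // Cast to double so the quotient is not truncated to an int.
            double quotient = (double) corpusSize / (e.getValue() + 1);
            idf.put(e.getKey(), Math.log(quotient)); // Math.log is natural log
        }
        return idf;
    }
}
```

Note that, with this formula, a word that appears in every document gets a slightly negative value, because the incremented count exceeds the corpus size. The cast to double matters as well: integer division would turn most quotients into 0 or 1 before the logarithm is taken.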

Submission Instructions:

Important: Only one member of each pair should submit this assignment! However, both names should appear in the name of the folder you create, and both names should appear in the comments inside each Java file. Do not forget to use javadoc constructs for your documentation, and include the HTML files produced by the javadoc tool in your final submission. Create a folder named with your two Westmont email names followed by "P2A". For example, a pair of Alice Smith and Nancy Jones with email addresses "asmith" and "njones" would create a folder called "asmithnjonesP2A". Make sure that the folder contains your Java source files and the HTML documentation files generated by javadoc. Finally, either tar or zip the folder so that when we extract it, the folder "<emailnames>P2A" will be created. Submit the packaged file for project 2.A on Eureka.