Project 2.B: Personal Search Engine

This is the second installment of your personal search engine project. Do not wait until the last minute to work on the project deliverables; you may suffer in many ways if you do. This is especially true because the project will be completed in pairs: the two of you must coordinate and cooperate, and your grade will be based on how well both partners understand the entire program. As always, I encourage you to talk to your peers outside your pair, or to me, and to ask questions. But you must acknowledge the assistance you received, and the code you submit must be your own.

Surprise!

For this deliverable, you will continue to work in the same pairs as for the first deliverable. However, you will not continue working on the code you have written to date. Instead, you will be given the code from one of the other pairs, and your second deliverable must build on their work. One of the objectives of this portion of the assignment is for you to experience the importance of clear and elegant code, to recognize the value of appropriate documentation, and to discover the difficulty of working from someone else's code. My hope is that all of the code you write, from this deliverable onward, will be more elegant and better documented. Note that you are permitted (and expected) to approach the original authors of the code you have been assigned in order to obtain explanations.

General Instructions and Reminders

Make sure you are working from the up-to-date interfaces for CorpusI, WCrawlerI, and WDocI, and that the code you are given (and that you develop) reflects any updates to those. You are welcome to make changes to the BaileyHTML starter code that was provided. As always, you are not limited to methods prescribed by the interfaces you are required to implement. You can and should introduce other public or non-public methods, and even classes, that support the modular implementation of the required functionality.

Requirements: Do No Evil

From the beginning, we should have included in the Crawler class attention to the "robots.txt" file. Each web site may, at its discretion, specify folders that should not be visited by web crawlers such as the one you are writing. This file lives at the top level of the server, not necessarily at your root. For example, if you are using http://www.westmont.edu as your root, your crawler should look for the document http://www.westmont.edu/robots.txt; but even if you are using http://www.westmont.edu/~iba as your root, you should still look in http://www.westmont.edu/robots.txt. In other words, your crawler must process the given root to find the server name and then look for the file robots.txt at the top level of that server.
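One way to locate the file, assuming your crawler holds the root as a string (the class and method names here are hypothetical, not part of the required interfaces):

    import java.net.URI;
    import java.net.URISyntaxException;

    public class RobotsLocator {
        // Derive the URL of the server's robots.txt from any crawl root.
        // For example, "http://www.westmont.edu/~iba" yields
        // "http://www.westmont.edu/robots.txt".
        public static String robotsUrl(String root) throws URISyntaxException {
            URI uri = new URI(root);
            // Keep only the scheme and host; discard any path such as "/~iba".
            return uri.getScheme() + "://" + uri.getHost() + "/robots.txt";
        }
    }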

If such a file exists, your crawler should create a map of any "Disallow" folders, and during the crawl, any links into disallowed folders must be ignored. You should respond appropriately to the prescribed format of a robots.txt file. More specifically, you must respond to a "record" in a robots.txt file starting with "User-agent: *" or starting with "User-agent: CS030-spider" (that's us). For such a record, your crawler must take note of any "Disallow: ..." fields and must not crawl any link whose URL begins with the server root followed by a path given in one of those Disallow fields.
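Here is a minimal sketch of the parsing, assuming the simple line-oriented format described above (the class and method names are hypothetical; a blank line is taken to end a record, and a record may name several user agents in a row):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    public class RobotsParser {
        // Collect the Disallow paths from the records that apply to us:
        // "User-agent: *" or "User-agent: CS030-spider".
        public static Set<String> disallowedPaths(String robotsUrl) {
            Set<String> disallowed = new HashSet<String>();
            boolean applies = false;   // does the current record apply to us?
            boolean inAgents = false;  // still reading User-agent lines?
            try {
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(robotsUrl).openStream()));
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.length() == 0) {
                        applies = false;          // a blank line ends a record
                        inAgents = false;
                    } else if (line.startsWith("User-agent:")) {
                        String agent = line.substring(11).trim();
                        boolean match = agent.equals("*")
                                     || agent.equals("CS030-spider");
                        applies = inAgents ? (applies || match) : match;
                        inAgents = true;
                    } else if (line.startsWith("Disallow:")) {
                        inAgents = false;
                        String path = line.substring(9).trim();
                        if (applies && path.length() > 0) {
                            disallowed.add(path);
                        }
                    }
                }
                in.close();
            } catch (Exception e) {
                // No robots.txt, or an unreadable one: nothing is disallowed.
            }
            return disallowed;
        }
    }

During the crawl you would then skip any link whose URL starts with the server root followed by one of these paths.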

In addition to respecting the wishes of your "host", you should also not clobber the server with your requests. You should include a delay of at least 200 milliseconds between requests. This is much less than the recommended minimum of one second; however, as long as you are not crawling to depths beyond 10 and you are using the Westmont server, 200 milliseconds should be adequate. We will revisit this issue in the final deliverable. In the meantime, you are welcome (this is optional) to extend your program to save the information returned by the Crawler as WebDocs to a file, so that you can read it directly instead of re-crawling the site.
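One simple way to insert the delay, assuming your crawl loop issues one request per iteration:

    // Pause between successive requests so we do not clobber the server.
    try {
        Thread.sleep(200);   // at least 200 milliseconds, per the assignment
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();   // restore the interrupt status
    }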

Requirements: Process a Query

Having implemented the basis for a search engine in deliverable 1, we now want to be able to enter a query and get recommendations of documents from our corpus that are relevant to that query. For this second deliverable, the input of the query and the output of the recommended documents should take place in the interactions context through standard input and output (System.in and System.out).

  1. Read a query from the user. Provide an input mechanism (perhaps in a main method of your Corpus class) by which users may enter search queries. This should prompt the user to enter a query as a single line of text.
  2. Find the ten best documents. Using the terms found in the user's query, your program should select the ten best documents. A document d is scored by summing, over each term t in the query, the product of t's term frequency in d and t's inverse document frequency across the corpus; that is, score(d) = sum over t in the query of tf(t, d) * idf(t). The higher this score, the more likely the document is relevant to the user's query. Compute this score for all your documents and, preserving the association between documents and scores, store the document-score pairs in an instance of Java's PriorityQueue (see the sketch after this list). Use the Java API to navigate the minor differences from the version presented in our text.
  3. Print the best links. Display, in decreasing order of relevance, the links to the ten best documents that you found in the previous step. You should print one link per line, together with its score.
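The sketch below shows one way to combine steps 2 and 3, assuming hypothetical accessors getDocs, termFrequency, inverseDocFrequency, and getURL; substitute whatever the code you were assigned actually provides:

    import java.util.List;
    import java.util.PriorityQueue;

    public class Ranker {
        // Pairs a document with its score. Java's PriorityQueue is a
        // min-heap, so compareTo is reversed to bring the HIGHEST score
        // to the head of the queue.
        private static class ScoredDoc implements Comparable<ScoredDoc> {
            final WDocI doc;
            final double score;

            ScoredDoc(WDocI doc, double score) {
                this.doc = doc;
                this.score = score;
            }

            public int compareTo(ScoredDoc other) {
                return Double.compare(other.score, this.score);
            }
        }

        // score(d) = sum over query terms t of tf(t, d) * idf(t)
        public static void printBest(CorpusI corpus, List<String> queryTerms) {
            PriorityQueue<ScoredDoc> ranked = new PriorityQueue<ScoredDoc>();
            for (WDocI doc : corpus.getDocs()) {    // hypothetical accessor
                double score = 0.0;
                for (String t : queryTerms) {
                    // termFrequency and inverseDocFrequency are hypothetical
                    // names; use whatever your assigned code provides.
                    score += doc.termFrequency(t)
                           * corpus.inverseDocFrequency(t);
                }
                ranked.add(new ScoredDoc(doc, score));
            }
            // poll() removes and returns the head: the highest score first.
            for (int i = 0; i < 10 && !ranked.isEmpty(); i++) {
                ScoredDoc sd = ranked.poll();
                System.out.println(sd.doc.getURL() + "  " + sd.score);
            }
        }
    }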
Because of the considerable time it takes to populate the repository with your Crawler, we do not want to rerun the program for each query we may want to check. You should embed the three steps above (reading a query, scoring the documents, and printing the links) within a loop. Your program should continue reading queries and printing the ten best links for each query; that is, you should use an infinite loop.
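A minimal sketch of that loop, assuming a corpus already populated by your Crawler and a helper like the printBest method above:

    import java.util.Arrays;
    import java.util.Scanner;

    public class QueryLoop {
        public static void main(String[] args) {
            CorpusI corpus = /* build or load your corpus here */ null;
            Scanner input = new Scanner(System.in);
            while (true) {               // infinite loop, per the assignment
                System.out.print("query> ");
                String query = input.nextLine();
                // Split the line into terms; adjust to your tokenization.
                Ranker.printBest(corpus,
                    Arrays.asList(query.trim().split("\\s+")));
            }
        }
    }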

Submission Instructions:

Important: Only one member of each pair should submit for this assignment! However, both names should appear in the folder name that is created, and both names should appear in the comments inside each Java file. The names of the original authors should remain inside the files; add your group's names alongside theirs. Do not forget to use the javadoc constructs for your documentation, and include the HTML files produced by the javadoc tool in your final submission. Make sure that you create a folder named with your two Westmont email names followed by "P2B". For example, a pair of Alice Smith and Nancy Jones with email addresses "asmith" and "njones" would create a folder called "asmithnjonesP2B". Make sure that inside that folder you have your Java source files and the HTML documentation files resulting from javadoc. Finally, either tar or zip the folder so that when we extract it, the folder "<emailnames>P2B" will be created. Submit the packaged file for project 2.B on Eureka.