This is the second installment of your personal search engine project. Do not wait until the last minute to work on the project deliverables; you may suffer in many ways if you do. This is especially true because the project will be completed in pairs: the two of you must coordinate and cooperate, and your grade will be based on how well both partners understand the entire program. As always, I encourage you to talk to your peers outside your pair, or to me, and to ask questions. But you must acknowledge any assistance you receive, and the code you submit must be your own.
For this deliverable, you will continue to work in the same pairs as for the first deliverable. However, you will not continue working on the code you have written to date. Instead, you will be given the code from one of the other pairs, and your second deliverable must build on their work. One of the objectives of this portion of the assignment is for you to experience the importance of clear and elegant code, to recognize the value of appropriate documentation, and to discover the difficulty of working from someone else's code. My hope is that all of the code you write, from this deliverable onward, will be more elegant and better documented. Note that you are permitted (and expected) to approach the original authors of the code you have been assigned in order to obtain explanations.
Make sure you are working from the up-to-date interfaces for CorpusI, WCrawlerI, and WDocI, and that the code you are given (and that you develop) reflects any updates to those. You are welcome to make changes to the BaileyHTML starter code that was provided. As always, you are not limited to methods prescribed by the interfaces you are required to implement. You can and should introduce other public or non-public methods, and even classes, that support the modular implementation of the required functionality.
From the beginning, the Crawler class should have paid attention to the "robots.txt" file. Each web site may, at its discretion, specify folders that should not be visited by web crawlers such as the one you are writing. You can look for such a file at your root; for example, if you are using http://www.westmont.edu as your root, your crawler should look for the document http://www.westmont.edu/robots.txt. Technically, this is not quite correct. Suppose you were using http://www.westmont.edu/~iba as your root. In this case, you should still look in http://www.westmont.edu/robots.txt. In other words, your crawler must process the given root to find the server name and then look for the file robots.txt at the top level of that server.
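The step of reducing a root to its server and then appending robots.txt can be sketched as follows. This is a minimal illustration, not a required design: the class name RobotsLocator and the method robotsUrl are hypothetical, and the sketch assumes the root is a well-formed absolute http URL.

```java
import java.net.URI;

// Hypothetical helper (not part of CorpusI, WCrawlerI, or WDocI).
class RobotsLocator {
    /**
     * Returns the address of robots.txt for the server named in the given root.
     * Resolving the absolute path "/robots.txt" against the root keeps the
     * scheme and server name but discards the root's own path.
     */
    static String robotsUrl(String root) {
        URI base = URI.create(root);            // throws IllegalArgumentException if malformed
        return base.resolve("/robots.txt").toString();
    }
}
```

With this sketch, both http://www.westmont.edu and http://www.westmont.edu/~iba map to http://www.westmont.edu/robots.txt.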
If such a file exists, your crawler should create a map of any "Disallow" folders. During the crawl, any links into disallowed folders must be ignored. You should respond appropriately to the prescribed format of a robots.txt file. More specifically, you must respond to a "record" in a robots.txt file starting with "User-agent: *" or starting with "User-agent: CS030-spider" (that's us). For such a record, your crawler must take note of any "Disallow: ..." fields and not crawl any links starting with the root and the path given in the Disallow fields.
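One possible way to collect the Disallow entries is sketched below. Everything named here is an assumption made for illustration: the class RobotsRules, its methods, the choice of a simple list rather than a map, and the decision to merge the rules from the "*" record and the "CS030-spider" record. The sketch treats a blank line as the end of a record, ignores text after "#", and checks only the path portion of a link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: holds the Disallow prefixes that apply to our crawler.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Parses the full text of a robots.txt file. */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean applies = false;       // does the current record address us?
        boolean inAgentLines = false;  // still reading the record's User-agent lines?
        for (String raw : robotsTxt.split("\\R")) {
            if (raw.trim().isEmpty()) {            // a blank line ends the record
                applies = false;
                inAgentLines = false;
                continue;
            }
            int hash = raw.indexOf('#');           // drop trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;               // comment-only or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) {               // first User-agent line of a new record
                    applies = false;
                    inAgentLines = true;
                }
                if (value.equals("*") || value.equalsIgnoreCase("CS030-spider")) {
                    applies = true;
                }
            } else {
                inAgentLines = false;
                // An empty Disallow value means nothing is disallowed.
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false if the path (e.g. "/cgi-bin/x.html") starts with a disallowed prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

During the crawl, a link would be followed only if isAllowed returns true for its path on the server.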
In addition to respecting the wishes of your "host", you should also not clobber the server with your requests. You should include a delay of at least 200 milliseconds between requests. This is much less than the recommended minimum of one second, but as long as you are not crawling to depths beyond 10 and you are using the Westmont server, 200 milliseconds should be adequate. We will revisit this issue in the final deliverable. In the meantime, you are welcome [optional] to extend your program on your own so that it saves the information returned by the Crawler as WebDocs to a file; you can then read that file directly instead of re-crawling the site.
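The delay between requests could be enforced with a small helper like the one sketched here. The class name PoliteDelay and the method awaitTurn are hypothetical, and the sketch assumes a single-threaded crawler that calls awaitTurn immediately before each request.

```java
// Hypothetical helper: enforces a minimum gap between successive requests.
class PoliteDelay {
    private final long minGapMillis;
    private long lastRequestNanos;
    private boolean first = true;

    PoliteDelay(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /** Blocks until at least minGapMillis have passed since the previous call. */
    void awaitTurn() {
        if (!first) {
            long elapsedMillis = (System.nanoTime() - lastRequestNanos) / 1_000_000;
            long remaining = minGapMillis - elapsedMillis;
            if (remaining > 0) {
                try {
                    Thread.sleep(remaining);
                } catch (InterruptedException e) {
                    // Preserve the interrupt so the caller can notice it.
                    Thread.currentThread().interrupt();
                }
            }
        }
        first = false;
        lastRequestNanos = System.nanoTime();
    }
}
```

A crawler might create one instance with new PoliteDelay(200) and call awaitTurn() before fetching each page, including robots.txt itself.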
Having implemented the basis for a search engine in deliverable 1, we now want to be able to enter a query and get recommendations of documents from our corpus that are relevant to the given query. For this second deliverable, the input of the query and the output of the recommended documents should take place in the interactions context through standard input and output (System.in and System.out).
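The overall shape of such a query loop is sketched below. It is only an illustration under loud assumptions: the class QueryLoop is hypothetical, the corpus is stood in for by a plain Map from document name to text rather than your CorpusI implementation, and the scoring (counting how often the query words occur) is a placeholder rather than the relevance measure you are expected to use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a read-query, print-recommendations loop.
class QueryLoop {
    /** Returns document names ordered by a placeholder score, best first. */
    static List<String> rank(Map<String, String> docs, String query) {
        List<String> result = new ArrayList<>();
        if (query.trim().isEmpty()) return result;
        String[] terms = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            int score = 0;
            for (String word : e.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                for (String t : terms) {
                    if (word.equals(t)) score++;   // placeholder: raw occurrence count
                }
            }
            if (score > 0) scores.put(e.getKey(), score);
        }
        result.addAll(scores.keySet());
        // Highest score first; ties broken alphabetically by name.
        result.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return result;
    }

    /** Reads queries line by line until end of input or an empty line. */
    static void run(Map<String, String> docs, BufferedReader in, PrintStream out)
            throws IOException {
        out.print("query> ");
        String line;
        while ((line = in.readLine()) != null && !line.trim().isEmpty()) {
            List<String> hits = rank(docs, line);
            if (hits.isEmpty()) out.println("No matching documents.");
            for (String name : hits) out.println(name);
            out.print("query> ");
        }
    }
}
```

To use standard input and output, run could be called with new BufferedReader(new InputStreamReader(System.in)) and System.out. Passing the streams in as parameters also makes the loop easy to exercise from a test.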
Important: Only one member of each pair should submit for this assignment! However, both names should appear in the name of the folder that is created, and both names should appear in the comments inside each Java file. The names of the original authors should remain inside the files, but you should add your group's names to those. Do not forget to use the javadoc constructs for your documentation and to include the HTML files produced by the javadoc tool in your final submission. Make sure that you create a folder named with your two Westmont email names followed by "P2B". For example, a pair of Alice Smith and Nancy Jones with email addresses "asmith" and "njones" would create a folder called "asmithnjonesP2B". Make sure that inside that folder you have the Java source files and the HTML documentation files resulting from javadoc. Finally, either tar or zip the folder so that when we extract it, the folder "<emailnames>P2B" will be created. Submit the packaged file for project 2.B on Eureka.