Project 2.C: Personal Search Engine

This is the third and final installment of your personal search engine project. Do not wait until the last minute to work on the project deliverables; you may suffer in many ways if you do. This final deliverable of the project will be completed as a single group consisting of all of you. You must figure out how to coordinate and cooperate, and you must figure out how to utilize everyone effectively. Your grade will be based on how well each member of the group understands the entire program. As always, I encourage you to talk to the TA and/or the Instructor, but you must acknowledge and understand the assistance you received, and the code you submit must be your own code.

General Instructions and Reminders

As a group, you should discuss the different tasks of this final deliverable and how you want to divide them among yourselves. You will also want to put your heads together and select one code-base on which you will build. More than in the past, the requirements rely on your ability to understand the problem and to search for tools that will help you get the job done.

Make sure you are working from the up-to-date interfaces for CorpusI, WCrawlerI, and WDocI, and that the code you are given (and that you develop) reflects any updates to those. You are welcome to make changes to the BaileyHTML starter code that was provided. As always, you are not limited to methods prescribed by the interfaces you are required to implement. You can and should introduce other public or non-public methods that support the modular implementation of the required functionality.

Using a Browser Interface

Write a Java Applet that gives users access to your program through a web browser. Users enter a query into a text box on a web page, and the applet then displays the ten best links, giving the user the opportunity to go directly to one of the recommended documents.

  1. Query text box. Provide a text box into which users type their search query. This screen should prompt the user and provide a button labeled "Search" (a sketch of one possible interface appears after this list). When the button is clicked, your program should extract the user's input and perform the search within your corpus.
  2. Display the ten best links. After the button is pressed on the search query screen, have the program display the best ten (10) links. To the side of each of the links, display a button that, when pressed, will open a new browser window showing the selected page. In addition to the links and their corresponding buttons, you should also provide a button that will display the next best ten links. This button should not be displayed when there are no more results to consider. You do not need a button to go back to the previous ten links.
  3. Extract fragment containing query terms. In addition to the URLs of the ten links currently displayed, you should display a snippet of text from the actual document. You may do this by either visiting the documents as needed (i.e., when you are displaying the ten on the page), or by capturing a fragment at the time you create each WebDoc. You should think about the pros and cons of each and include in your documentation an explanation for your choice.
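
The sketch below shows one way the three pieces above could fit together in a Swing-based applet. The names SearchApplet, SearchResult, and runSearch are illustrative placeholders rather than part of the required interfaces, and the ranking call is stubbed out where your own CorpusI-based search code would go.

    import java.awt.BorderLayout;
    import java.awt.FlowLayout;
    import java.net.URL;
    import java.util.Collections;
    import java.util.List;
    import javax.swing.*;

    public class SearchApplet extends JApplet {

        /** Hypothetical holder pairing a ranked URL with a stored text snippet. */
        public static class SearchResult {
            final URL url;
            final String snippet;
            SearchResult(URL url, String snippet) { this.url = url; this.snippet = snippet; }
        }

        private JTextField queryField;      // where the user types the query
        private JPanel resultsPanel;        // holds the ten links currently shown
        private JButton nextButton;         // reveals the next ten results
        private List<SearchResult> results = Collections.emptyList();
        private int offset;                 // index of the first result displayed

        @Override
        public void init() {
            queryField = new JTextField(40);
            JButton searchButton = new JButton("Search");
            searchButton.addActionListener(e -> runQuery(queryField.getText()));

            nextButton = new JButton("Next 10");
            nextButton.setVisible(false);
            nextButton.addActionListener(e -> { offset += 10; showPage(); });

            resultsPanel = new JPanel();
            resultsPanel.setLayout(new BoxLayout(resultsPanel, BoxLayout.Y_AXIS));

            JPanel top = new JPanel();
            top.add(new JLabel("Enter your search query:"));
            top.add(queryField);
            top.add(searchButton);

            setLayout(new BorderLayout());
            add(top, BorderLayout.NORTH);
            add(new JScrollPane(resultsPanel), BorderLayout.CENTER);
            add(nextButton, BorderLayout.SOUTH);
        }

        // Replace the body of this method with a call into your own ranking code
        // (for example, something built on your CorpusI implementation).
        private List<SearchResult> runSearch(String query) {
            return Collections.emptyList();   // stub so the sketch compiles
        }

        private void runQuery(String query) {
            results = runSearch(query);       // assumed to be ranked best-first
            offset = 0;
            showPage();
        }

        private void showPage() {
            resultsPanel.removeAll();
            for (int i = offset; i < Math.min(offset + 10, results.size()); i++) {
                SearchResult r = results.get(i);
                JButton open = new JButton("Open");
                // showDocument with target "_blank" asks the browser for a new window.
                open.addActionListener(e -> getAppletContext().showDocument(r.url, "_blank"));

                JPanel row = new JPanel(new FlowLayout(FlowLayout.LEFT));
                row.add(open);
                row.add(new JLabel("<html>" + r.url + "<br><i>" + r.snippet + "</i></html>"));
                resultsPanel.add(row);
            }
            nextButton.setVisible(offset + 10 < results.size());   // hide when nothing is left
            resultsPanel.revalidate();
            resultsPanel.repaint();
        }
    }

The call to getAppletContext().showDocument with target "_blank" is what asks the browser to open the selected page in a new window, and hiding the "Next 10" button once offset + 10 reaches the end of the result list satisfies the requirement that it disappear when there are no more results to consider.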

Improve the Document Model

There are a number of things we can do to improve our models of documents.

  1. Remove HTML text. [Some of you already did this.] If you inspect your model of a given document, you may find many terms resulting from HTML tags and their attributes. These probably do not contribute to (and could conceivably degrade) the accuracy of our document retrieval. Thus, modify your program (probably WebDoc) so that HTML tags are ignored.
  2. Consider keywords. The one exception to the above would be meta tags containing keywords. Alter your code so that terms appearing within the content of a meta-keyword tag are included in the model. Additionally, these terms should be weighted by a constant factor defined in your program. The weighting should have the effect of altering the effective number of times the terms appear in the document. For example, a weight of zero will ignore keywords altogether, while a weight of five will treat keywords in the meta tag as having occurred five times in addition to the number of times that the term actually appears in the document. A sketch of one way to implement both of these changes appears after this list.
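
A minimal sketch of both ideas, assuming a regular-expression approach, follows. The class name TermExtractor and the constant KEYWORD_WEIGHT are illustrative, and the pattern assumes the name attribute precedes content in the meta tag, which real pages do not guarantee; adapt it (or use a proper HTML parser) as you see fit.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative helper only; fold these ideas into your own WebDoc implementation.
    public class TermExtractor {

        // How many occurrences each meta keyword contributes, in addition to body counts.
        public static final int KEYWORD_WEIGHT = 5;

        // Matches <meta name="keywords" content="..."> (case-insensitive; assumes
        // the name attribute appears before content).
        private static final Pattern META_KEYWORDS = Pattern.compile(
                "<meta\\s+name=[\"']keywords[\"']\\s+content=[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE);

        public static Map<String, Integer> termCounts(String html) {
            Map<String, Integer> counts = new HashMap<>();

            // 1. Pull out the meta keywords before the tags are discarded.
            Matcher m = META_KEYWORDS.matcher(html);
            while (m.find()) {
                for (String kw : m.group(1).toLowerCase().split("[,\\s]+")) {
                    if (!kw.isEmpty()) {
                        counts.merge(kw, KEYWORD_WEIGHT, Integer::sum);
                    }
                }
            }

            // 2. Strip every remaining tag so tag text is not counted as terms.
            String visibleText = html.replaceAll("<[^>]*>", " ");

            // 3. Count the terms that actually appear in the visible text.
            for (String term : visibleText.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
            return counts;
        }
    }

Folding this logic into your WebDoc means the tag text never enters the term counts, while the meta keywords are boosted by the constant factor described above (a weight of zero ignores them entirely).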

Improve the Politeness of the Crawler

  1. Visit each page only once. Instead of visiting a page once for links and again later for content, consolidate these visits. Add a public method to your Crawler, crawlNice(), that returns a Set of WebDocs instead of a set of links. You should modify your crawl() method so that it does not actually crawl a site but extracts the set of links from the WebDocs most recently collected via crawlNice().
  2. Store the results of the crawl. Instead of having to re-crawl the site every time you run your search engine, save the corpus so that you can re-load it later and still answer search queries. [Hint] I suggest you look into the Java Serializable interface, but if you wish, you may come up with your own approach for writing the corpus to a file (and reading it back again). Sketches of both ideas appear after this list.
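
One way to structure the single-visit crawl is sketched below. The names NiceCrawler and getLinks() are hypothetical, the method signatures should of course match your WCrawlerI interface, and the actual fetching and parsing are left as comments.

    import java.util.HashSet;
    import java.util.Set;

    // Structural sketch only: WebDoc stands in for your own class, and getLinks()
    // is a hypothetical accessor; fetching and parsing are left to your code.
    public class NiceCrawler /* implements WCrawlerI */ {

        /** Stand-in for your own WebDoc; only the accessor this sketch needs is shown. */
        public interface WebDoc {
            Set<String> getLinks();
        }

        private Set<WebDoc> lastCrawl = new HashSet<>();   // docs from the latest crawlNice()

        // Visits each reachable page exactly once, building a WebDoc (content plus
        // outgoing links) for every page it fetches.
        public Set<WebDoc> crawlNice(String startUrl) {
            lastCrawl = new HashSet<>();
            // ... fetch pages starting from startUrl, visiting each URL only once,
            //     and add one WebDoc per page to lastCrawl ...
            return lastCrawl;
        }

        // crawl() no longer touches the network: it simply collects the links found
        // in the WebDocs gathered by the most recent crawlNice().
        public Set<String> crawl() {
            Set<String> links = new HashSet<>();
            for (WebDoc doc : lastCrawl) {
                links.addAll(doc.getLinks());
            }
            return links;
        }
    }

For saving and re-loading the corpus, the Serializable hint can look roughly like the following. CorpusStore is a hypothetical helper; your corpus class (and everything it references, including your WebDocs) must implement java.io.Serializable for writeObject to succeed.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    // Hypothetical helper for writing the corpus to a file and reading it back.
    public class CorpusStore {

        public static void save(Object corpus, String filename) throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(filename))) {
                out.writeObject(corpus);   // the corpus and everything it references is written
            }
        }

        public static Object load(String filename)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(filename))) {
                return in.readObject();    // cast the result back to your corpus type
            }
        }
    }

A corpus written this way can be loaded at start-up, so your search engine can answer queries without re-crawling the site.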

Submission Instructions:

Important: Only one member of the class should submit for this assignment! You should use the name "fullclass" for the folder name that is created when I extract your submission; however, the names of all participants should appear in the comments inside each Java file. Do not forget to use the javadoc constructs for your documentation and to include the HTML files produced by the javadoc tool in your final submission. Make sure that you create a folder named "fullclassP2C" and that inside that folder you have your Java source files and the HTML documentation files resulting from javadoc. Finally, either tar or zip the folder so that when we extract it, the folder "fullclassP2C" will be created. Submit the packaged file for project 2.C on Eureka.