Focused Web Crawling

This research is supported by Innovative Productivity, Inc., a not-for-profit Kentucky company that runs the National Surface Treatment Center for the U.S. Navy.
Current Web search engines have very low precision: alongside a few (somewhat) useful results, they return a large number of irrelevant pages. Our research aims to develop a focused crawler (a Web crawler that returns only pages that deal with a certain topic) with very high precision.

The challenges for this project are many: the notion of a topic is informal and vague (making it hard to determine when a page is about a topic); the Web is too large and unruly (making it impossible to apply in-depth analysis to all pages or to store all results locally); and most Web pages contain a mix of information and noise, such as advertisements and menus (making it challenging to analyze the page contents). Most crawlers use Information Retrieval techniques, which are too crude to determine the true content of a Web page.
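
To make the idea of a focused crawler concrete, here is a minimal sketch in Python. It is not this project's method: it uses plain keyword matching, which is exactly the kind of crude Information Retrieval scoring criticized above, and it runs over a tiny in-memory "web" rather than real pages. All names (TOY_WEB, TOPIC_TERMS, relevance, focused_crawl) and the threshold value are assumptions made for illustration only.

```python
import heapq

# Hypothetical toy "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("surface treatment coating for navy ships", ["b", "c"]),
    "b": ("buy cheap watches online today", ["d"]),
    "c": ("corrosion coating and surface preparation", ["d", "e"]),
    "d": ("celebrity gossip and sports news", []),
    "e": ("paint coating surface treatment standards", []),
}

# Hypothetical topic description: a bag of keywords (a crude stand-in
# for the informal, vague notion of "topic").
TOPIC_TERMS = {"surface", "treatment", "coating", "corrosion"}


def relevance(text):
    """Fraction of words on the page that are topic keywords."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in TOPIC_TERMS)
    return hits / len(words)


def focused_crawl(seed, threshold=0.3, max_pages=10):
    """Best-first crawl: follow links from the most relevant pages first,
    and return only the pages whose score reaches the threshold."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, seed)]
    seen = {seed}
    accepted = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]  # stands in for downloading the page
        fetched += 1
        score = relevance(text)
        if score >= threshold:
            accepted.append(url)
        # Links inherit the score of the page they were found on.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return accepted


if __name__ == "__main__":
    print(focused_crawl("a"))
```

Even in this toy setting the weaknesses are visible: the page budget (max_pages) reflects that not every page can be analyzed, and a page padded with menus or advertisements would have its keyword fraction diluted by noise.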

RESEARCH TOPICS

Results and papers will be posted as they are produced. Keep checking this page!