Focused Web Crawling
This research is supported by Innovative
Productivity, Inc., a not-for-profit Kentucky company that runs the
National Surface Treatment Center for the U.S. Navy.
Current Web search engines have very low precision: along with a few (somewhat) useful results, they return a large number of irrelevant pages. Our
research aims to develop a focused crawler (a Web crawler that
returns only pages that deal with a certain topic) with very high precision.
The challenges for this project are many: the notion of topic is informal
and vague (making it complicated to determine when a page is about a topic);
the Web is too large and unruly (making it impossible to apply in-depth analysis
to all pages, or to store all results locally); most Web pages contain a mix of
information and noise, like advertisements and menus (making it challenging to
analyze the page contents). Most crawlers use Information Retrieval techniques,
which are too crude to determine the true content of a Web page.
RESEARCH TOPICS
- Designing and populating a thesaurus: we use a thesaurus to
determine the scope of our topic and which concepts and relations are
important in it. Regular thesauri are unsuited for the task since they express
this semantic knowledge in a very superficial (hypernymy and meronymy) and
imprecise (related-to) form.
- Scoring pages against the thesaurus: once a high-precision thesaurus is
available, one must still determine how to score a given Web page against the
thesaurus. In this context, having many words of overlap (even if each word
appears only once or a few times) may be better than having a few words of
overlap repeated many times. Thus, the traditional concepts of term frequency
(tf) and inverse document frequency (idf) do not apply well here.
- Dealing with noise in pages: most Web pages contain advertisements, menus,
links to other pages, and other material which is not properly part of their
content. Cleaning such pages to obtain a reliable score is a very difficult task.
- Extended keyword languages: because traditional keyword search is based
on enumerating several keywords, it is difficult for users to express their
informational needs. We are looking into ways to extend keyword search
by allowing users to enter more complex expressions than a simple list of
keywords.
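To illustrate the scoring topic above, here is a minimal sketch that contrasts a
coverage-style score (how many distinct thesaurus terms appear at least once)
with a raw frequency count. This is not the project's actual scoring method;
the function names, the tokenizer, the thesaurus terms, and the sample pages
are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def coverage_score(page_text, thesaurus_terms):
    """Fraction of distinct thesaurus terms that occur at least once in the page."""
    terms = set(thesaurus_terms)
    if not terms:
        return 0.0
    present = terms & set(tokenize(page_text))
    return len(present) / len(terms)


def frequency_score(page_text, thesaurus_terms):
    """Total number of thesaurus term occurrences (tf-style), shown for contrast."""
    terms = set(thesaurus_terms)
    counts = Counter(tokenize(page_text))
    return sum(counts[t] for t in terms)


# Hypothetical single-word thesaurus terms and two made-up pages.
thesaurus = ["coating", "corrosion", "primer", "abrasive", "substrate"]
broad = "A primer improves coating adhesion and limits corrosion of the substrate."
narrow = "Coating coating coating: buy our coating, best coating prices."

for name, page in [("broad", broad), ("narrow", narrow)]:
    print(name, coverage_score(page, thesaurus), frequency_score(page, thesaurus))
```

In this toy example the raw frequency count ranks the repetitive page higher,
while the coverage score ranks the page that touches more distinct concepts
higher, which is the behavior the scoring topic argues for.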
Results and papers will be posted as they are produced. Keep on checking this page!