Documents are usually called unstructured data because their
structure is implicit and very flexible. As a result, database systems have trouble handling
documents. And yet most of the information in an enterprise is in documents
(emails, technical reports, Web pages, memoranda, etc.). Current DBMSs try to handle documents using Information Retrieval techniques; however, this is not enough for many tasks.
RESEARCH TOPICS
Storing Document Information: We parse natural language sentences and
break them down into their main components, which are then stored either in
relations or in XML. From there on, the information in the document is available
for integration with other information, querying or mining.
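As a rough illustration of the relational option, the sketch below stores sentence components in a table and queries them. It assumes a parser has already reduced each sentence to a subject, verb and object; the rows in PARSED are hand-written stand-ins for that output, and the table layout and function names are hypothetical, not the project's actual schema.

```python
import sqlite3

# Hypothetical parser output: (doc_id, sentence_no, subject, verb, object).
# A real system would get these from a natural language parser;
# here they are written by hand for illustration only.
PARSED = [
    ("memo-1", 1, "Acme", "acquired", "Widget Corp"),
    ("memo-1", 2, "the board", "approved", "the merger"),
    ("email-7", 1, "Acme", "hired", "Jane Doe"),
]


def store_components(conn, rows):
    """Create the (assumed) sentence table and load the parsed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence ("
        "doc_id TEXT, sentence_no INTEGER, "
        "subject TEXT, verb TEXT, object TEXT, "
        "PRIMARY KEY (doc_id, sentence_no))"
    )
    conn.executemany("INSERT INTO sentence VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def actions_of(conn, subject):
    """Return (verb, object) pairs for every sentence with this subject."""
    cur = conn.execute(
        "SELECT verb, object FROM sentence "
        "WHERE subject = ? ORDER BY doc_id, sentence_no",
        (subject,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
store_components(conn, PARSED)
print(actions_of(conn, "Acme"))
```

Once the components sit in a relation, they can be joined with ordinary database tables, which is what makes the document content available for integration, querying or mining.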
Integrating Information: Even if one manages to store all information in a
document in a more or less regular structure, semantic integration must
still take place. This task is made difficult by several challenges: the ambiguity
of natural language, problems of co-reference (the same entity in the real
world being referred to through different expressions), and the fact that databases
tend to store data, while documents freely mix data, information and knowledge. Our approach is to apply the idea of unification to the structures that
result from our analysis of documents and from the extraction of schemas from databases.
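The text does not say which form of unification is used, so the following is only a minimal sketch of one common variant: unification of feature structures represented as nested dictionaries, without variables. Two structures unify when every feature they share has compatible values; the result merges both. The function name, the example structures and the use of None to signal failure are all assumptions made for illustration.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None if the two conflict.
    Assumes None never appears as a legitimate feature value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are not modified
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # shared feature with incompatible values
                result[key] = merged
            else:
                result[key] = b_val
        return result
    # Atomic values (or a dict against an atom) unify only if equal.
    return a if a == b else None


# Hypothetical structure obtained from analysing a document sentence.
from_document = {"type": "company", "name": "Acme", "ceo": {"name": "Jane Doe"}}
# Hypothetical structure derived from a database schema and row.
from_database = {"type": "company", "name": "Acme",
                 "ceo": {"title": "CEO"}, "city": "Boston"}

print(unify(from_document, from_database))
print(unify(from_document, {"type": "person"}))  # conflict on "type"
```

In this sketch a successful unification suggests that both structures describe the same entity, while a failure suggests they do not; resolving ambiguity and co-reference would need considerably more than this equality test.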
Querying Information: Traditional query languages work over
structured data (SQL) or semistructured data (XQuery). For unstructured data
(documents), keyword search is the preferred means of seeking
information. However, keyword search is limited and does not allow full expression of the user's information needs, while query languages are not
useful over unstructured data. Natural language (English) interfaces have been
tried but are difficult to build and scale. We are investigating ways to query
all information with the same language, regardless of format. This entails
dealing with some tough issues: data in documents tends to differ in nature from data in databases. Roughly speaking, documents mix data,
information and knowledge freely; they say not only what is happening
but may also say why or how, whereas data in databases tends to be much
more factual.
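The limitation of keyword search mentioned above can be seen in a small, made-up example. Both the sentences and their structured counterparts in FACTS are hypothetical, as are the function names; the point is only that keywords cannot express who did what to whom, while a query over structured data can.

```python
import sqlite3

# Two made-up document sentences.
SENTENCES = [
    "Acme acquired Widget Corp",
    "Widget Corp acquired Acme Labs",
]

# Hypothetical structured form of the same sentences: (subject, verb, object).
FACTS = [
    ("Acme", "acquired", "Widget Corp"),
    ("Widget Corp", "acquired", "Acme Labs"),
]


def keyword_search(sentences, keywords):
    """Return the sentences that contain every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [s for s in sentences if all(k in s.lower() for k in wanted)]


def acquired_by(facts, buyer):
    """Return the companies that `buyer` acquired, using a SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact (subject TEXT, verb TEXT, object TEXT)")
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?)", facts)
    rows = conn.execute(
        "SELECT object FROM fact WHERE subject = ? AND verb = 'acquired'",
        (buyer,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Keyword search matches both sentences: it cannot tell buyer from target.
print(keyword_search(SENTENCES, ["Acme", "acquired"]))
# The structured query returns only what Acme itself acquired.
print(acquired_by(FACTS, "Acme"))
```

The structured query is more precise, but it only works once the sentences have been turned into rows, which is exactly the gap between the two kinds of data described above.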