Dissertation Defense

Building and Querying Structured Knowledge Sources from Natural Language Text

Nikita Bhutani

We are awash in data today, particularly on the web. Much of this data is in natural language text. Users want their complex information needs, typically expressed as natural language questions, to be satisfied without having to go through all of this data. One approach is to manually or collaboratively extract key facts from the textual data into a structured knowledge base (KB) that can support complex queries efficiently. However, the diversity and complexity of the queries such a KB can answer are limited by its inherent incompleteness. This incompleteness arises because every new data source first has to be curated, and also because the information retained in the KB is only what the KB is capable of representing. To address these limitations, knowledge-based question-answering (KB-QA) systems are now exploring automatically extracted KBs that attempt to retain all the information in the textual data source. This dissertation studies the design of KB-QA systems that can support complex questions over large-scale extracted and curated KBs.

To build such a system, we must address challenges in both knowledge acquisition and querying. First, the system needs to acquire and represent facts in a format that can be queried to answer complex questions. We describe an open information extraction method that uses a nested format to represent facts from the text, thereby retaining the contextual information critical to answering complex questions. Second, an extracted KB typically has a massive, loosely defined schema, making it harder to query using pattern-matching methods. We describe schemaless querying techniques that can obtain answers from heterogeneous knowledge representations in an extracted KB.

Many large curated KBs exist today, and more are being created. Where a curated KB contains information on some topic, we should generally prefer it to "best effort" information automatically extracted from text. Traditional KB-QA systems use either a curated or an extracted KB exclusively, and so miss much of the potential benefit of combining high-quality curated knowledge with broad-coverage extracted facts. We describe a KB-QA system capable of collective inference over diverse relation forms across the two types of knowledge sources.

Faculty Host

HV Jagadish