Dissertation Defense

Text and Network Mining for Literature-Based Scientific Discovery in Biomedicine

Arzucan Ozgur

Most new and important findings in biomedicine are available only in the text of published scientific articles. The first goal of this thesis is to design
methods based on natural language processing and machine learning to extract information about genes, proteins, and their interactions from text. We introduce a
relation extraction method based on dependency tree kernels that identifies interacting protein pairs in a sentence. We propose two kernel functions based on cosine
similarity and edit distance between the dependency tree paths connecting the protein names. Using these kernel functions with supervised and semi-supervised
machine learning methods, we report a significant improvement (an F-measure of 59.96% on the AIMED data set) over previous results in the literature.
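
To make the two kernel ideas concrete, here is a minimal sketch, not the thesis implementation. It assumes each dependency path has already been extracted and is given as a list of tokens; the function names, the bag-of-tokens representation for the cosine kernel, and the exponential form with a `gamma` parameter for the edit-distance kernel are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine_kernel(path_a, path_b):
    """Cosine similarity between bag-of-token vectors of two paths (assumed representation)."""
    counts_a, counts_b = Counter(path_a), Counter(path_b)
    dot = sum(counts_a[tok] * counts_b[tok] for tok in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(path_a, path_b):
    """Token-level Levenshtein distance between two paths."""
    prev = list(range(len(path_b) + 1))
    for i, tok_a in enumerate(path_a, 1):
        curr = [i]
        for j, tok_b in enumerate(path_b, 1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_kernel(path_a, path_b, gamma=1.0):
    """Similarity that decays with length-normalized edit distance (assumed form)."""
    longest = max(len(path_a), len(path_b))
    if longest == 0:
        return 1.0
    return math.exp(-gamma * edit_distance(path_a, path_b) / longest)


# Hypothetical paths between two protein mentions, with protein names masked.
path_1 = ["PROT1", "interacts", "with", "PROT2"]
path_2 = ["PROT1", "binds", "to", "PROT2"]
print(cosine_kernel(path_1, path_2), edit_kernel(path_1, path_2))
```

Either function could then be supplied as a precomputed similarity matrix to a kernel-based classifier. Note that an edit-distance similarity of this form is not guaranteed to be positive semidefinite, which is worth checking before using it with a solver that requires a valid kernel.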
We also address the problem of distinguishing factual information from speculative information. Unlike previous methods that formulate the problem as a sentence classification
task, we propose a two-step method to identify the speculative fragments of sentences. First, we use supervised classification to identify the speculation keywords using a
diverse set of linguistic features that represent their contexts. Next, we use the syntactic structures of the sentences to resolve their linguistic scopes. Our results show
that the method is effective in identifying speculative portions of sentences. The speculation keyword identification results are close to the upper bound of human
inter-annotator agreement.
The second goal of this thesis is to generate new scientific hypotheses using the literature-mined protein/gene interactions. We propose a literature-based discovery
approach, where we start with a set of genes known to be related to a given concept and integrate text mining with network centrality analysis to predict novel concept-related
genes. We present the application of the proposed approach to two different problems, namely predicting gene-disease associations and predicting genes that are
important for vaccine development. Our results provide new insights and hypotheses worth further investigation in these domains and show the effectiveness of the
proposed approach for literature-based discovery.
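
The idea of combining literature-mined interactions with network centrality can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used in the thesis: the interaction list is a made-up toy network, the gene names are placeholders, and closeness centrality is only one of several centrality measures that could be used for ranking.

```python
from collections import defaultdict, deque


def build_network(interactions):
    """Build an undirected interaction network from (gene, gene) pairs."""
    graph = defaultdict(set)
    for gene_a, gene_b in interactions:
        graph[gene_a].add(gene_b)
        graph[gene_b].add(gene_a)
    return graph


def closeness_centrality(graph, node):
    """Number of reachable nodes divided by the sum of shortest-path distances to them."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over unweighted edges
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def rank_candidates(graph, seed_genes):
    """Rank genes that are not seeds by their centrality in the network."""
    scores = {gene: closeness_centrality(graph, gene)
              for gene in list(graph) if gene not in seed_genes}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy network: in practice the edges would come from text-mined interactions
# involving the seed genes and their neighbors.
toy_interactions = [
    ("SEED_A", "GENE_X"),
    ("SEED_B", "GENE_X"),
    ("GENE_X", "GENE_Y"),
    ("GENE_Y", "GENE_Z"),
]
network = build_network(toy_interactions)
for gene, score in rank_candidates(network, {"SEED_A", "SEED_B"}):
    print(gene, round(score, 3))
```

In this toy network, the gene that interacts directly with both seed genes ranks first, which mirrors the intuition that genes central to a concept-specific network are plausible candidates for association with that concept.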

Sponsored by

Dragomir Radkov Radev