Dissertation Defense
Using Graphs and Random Walks For Discovering Latent Semantic
Add to Google Calendar
We propose a graph-based representation of text collections where the
nodes are textual units such as sentences or documents, and the edges
represent the pairwise similarity function between these units. We
show how random walks on such a graph can give us better
approximations for the latent similarities between two natural
language strings. We also derive algorithms based on random walk
models to rank the nodes in a text similarity graph to address the
text summarization problem in information retrieval. The similarity
functions used in the graphs are intentionally chosen to be very
simple and language-independent to make our methods as generic as
possible, and to show that significant improvements can be achieved
even by starting with such similarity functions. We put special
emphasis on language modeling-based similarity functions since we use
them for the first time on problems such as document clustering and
classification, and get improved results compared to the classical
similarity functions such as cosine. Our graph-based methods are
applicable to a diverse set of problems including generic and focused
summarization, document clustering, and text classication. The text
summarization system we have developed has ranked as one of the top
systems in Document Understanding Conferences over the past few years.
In document clustering and classification, using language modeling
functions performs consistently better than using the classical cosine
measure reaching
as high as 25% improvement in accuracy. Random walks on the similarity
graph achieve additional significant improvements on top of this. We
also revisit the nearest neighbor text classification methods and
derive semi-supervised versions by using random walks that rival the
state-of-the-art classification algorithms such as Suppor Vector
Machines.