Dissertation Defense

Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features

Pradeep Muthukrishnan

Relational data are data that contain explicit relations among objects. Such data are now ubiquitous and arise in many different application domains. Estimating similarity between objects is a core requirement for many standard Machine Learning (ML), Natural Language Processing (NLP) and Information Retrieval (IR) problems, such as clustering, classification and word sense disambiguation.
Traditional machine learning approaches represent the data using simple, concise representations such as feature vectors. While this works very well for homogeneous data, i.e., data with a single feature type such as text, it does not fully exploit the availability of different feature types. For example, scientific publications have text, citations, authorship information and venue information, and each of these features can be used for estimating similarity. Representing such objects has been a key issue in efficient mining. In this thesis, we propose natural representations for relational data using multiple, connected layers of graphs, one for each feature type. We also propose novel algorithms for estimating similarity using multiple heterogeneous features, and we present novel algorithms for tasks like topic detection and music recommendation that use the estimated similarity measure. On large, standard data sets, we demonstrate superior performance of the proposed algorithms (root mean squared error of 24.81 on the Yahoo! KDD Music recommendation data set and classification accuracy of 88% on the ACL Anthology Network data set) over baselines and many state-of-the-art algorithms, such as Latent Semantic Analysis (LSA), Multiple Kernel Learning (MKL) and spectral clustering.
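
To make the idea of multiple, connected layers of graphs concrete, the following is a minimal Python sketch, not the representation used in the thesis. The class name LayeredGraph, its methods, and the toy papers and authors are all hypothetical and chosen only for illustration. Each feature type gets its own graph layer, and cross-layer links connect objects to their feature values.

```python
from collections import defaultdict


class LayeredGraph:
    """Toy multi-layer graph: one layer per feature type, plus cross-layer links.

    Hypothetical illustration only; not the model defined in the thesis.
    """

    def __init__(self):
        # layer name -> node -> {neighbor: edge weight}
        self.layers = defaultdict(lambda: defaultdict(dict))
        # (layer_a, layer_b) -> node in layer_a -> set of linked nodes in layer_b
        self.cross = defaultdict(lambda: defaultdict(set))

    def add_edge(self, layer, u, v, weight=1.0):
        """Add an undirected weighted edge between two nodes of the same layer."""
        self.layers[layer][u][v] = weight
        self.layers[layer][v][u] = weight

    def link(self, layer_a, u, layer_b, v):
        """Connect node u in layer_a to node v in layer_b (stored in both directions)."""
        self.cross[(layer_a, layer_b)][u].add(v)
        self.cross[(layer_b, layer_a)][v].add(u)

    def shared(self, layer_a, u1, u2, layer_b):
        """Return the nodes of layer_b that both u1 and u2 (in layer_a) link to."""
        links = self.cross[(layer_a, layer_b)]
        return links[u1] & links[u2]


# Toy data (assumed): two papers, two authors.
g = LayeredGraph()
g.add_edge("paper", "P1", "P2")            # e.g. a citation between papers
g.add_edge("author", "A1", "A2", 0.5)      # e.g. an author-author similarity
g.link("paper", "P1", "author", "A1")
g.link("paper", "P2", "author", "A1")
g.link("paper", "P2", "author", "A2")

shared_authors = g.shared("paper", "P1", "P2", "author")
```

In this sketch, evidence for the similarity of P1 and P2 is available both within the paper layer (the citation edge) and through the author layer (the shared author), which is the kind of cross-feature information a flat feature vector does not expose directly.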

The key contributions of the thesis are:
1) A representation model for objects with heterogeneous feature types that supports efficient similarity estimation. The proposed model has the following advantages over feature vectors:
a) The model is generic and is capable of representing different types of features, including nominal, discrete, real-valued and link-based features.
b) The model is capable of representing a wide variety of dependencies between different features. For example, if information is available about each feature type's contribution to the overall similarity between objects, it can be easily incorporated.
c) The model allows learning across feature types. For example, it can be used to learn similarity between publications using similarity measures between authors, keywords and venues, and vice versa.
2) A regularization framework for unifying different similarity measures and learning feature weights.
3) Completely unsupervised algorithms in the proposed framework to efficiently estimate feature weights and compute similarity between objects with many heterogeneous feature types.
4) Novel algorithms for tasks like music recommendation and topic detection using the proposed similarity measures.
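
As a rough illustration of unifying several similarity measures through feature weights, the sketch below combines per-feature cosine similarities into one score. It is an assumption-laden toy, not the regularization framework or the unsupervised weight-learning algorithm of the thesis: the weights here are fixed by hand, and the function names, feature names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def combined_similarity(x, y, weights):
    """Weighted average of per-feature cosine similarities.

    x, y:     dict mapping feature type -> vector for each object
    weights:  dict mapping feature type -> non-negative weight
              (hand-set here; the thesis learns them without supervision)
    """
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("at least one feature weight must be positive")
    score = sum(w * cosine(x[f], y[f]) for f, w in weights.items())
    return score / total


# Toy publications (assumed): identical text vectors, disjoint author sets.
paper_a = {"text": [1, 0, 1], "authors": [1, 1, 0]}
paper_b = {"text": [1, 0, 1], "authors": [0, 0, 1]}
weights = {"text": 0.75, "authors": 0.25}

score = combined_similarity(paper_a, paper_b, weights)
```

With these toy inputs the text features agree completely and the author features not at all, so the combined score is determined by the weight given to text. Changing the weights changes which feature type dominates the overall similarity.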

Sponsored by

D. Radev