Dissertation Defense

It's Data All the Way Down: Exploring the Relationship Between Machine Learning and Data Management

Michael Anderson

Data is central to machine learning: models are trained with data, trained models infer their predictions over input data, and the resulting inferences are themselves data. This being the case, there should be a natural relationship between machine learning and data management techniques. Yet much of machine learning research, perhaps understandably, focuses strictly on algorithmic improvements, chasing ever-higher state-of-the-art accuracy on a chosen task. Likewise, data management research has been slow to apply recent machine learning breakthroughs, like deep learning, to classic data management tasks. In this dissertation, we demonstrate the relationship between machine learning and data management with a series of projects that either improve aspects of machine learning through data management or improve data management through machine learning.

Specifically, we detail two systems that use database-style methods to address runtime issues traditionally associated with machine learning, and a third project that uses recent machine learning methods to solve data quality issues. Our system Zombie shows that novel data indexing methods can greatly reduce the time needed to evaluate the effectiveness of feature engineering, thereby reducing the time needed to train accurate machine learning models. With our system Tahoma, we show that by using particular physical representations of the images used as input to convolutional neural network classifier cascades, content can be quickly extracted to support binary predicates used in a video analytics database. Finally, our system Grover demonstrates that universal embeddings, like those used in computer vision or natural language processing, can be created for relational data, with both column and table embeddings used to improve the performance of data integration tasks.

Our work shows that machine learning and data management go hand in hand, and that taking a holistic view of both can lead to improvements in each field.


Faculty Host

Michael Cafarella