Faculty Candidate Seminar
Efficient and Accurate Systems for Querying Unstructured Data
This event is free and open to the public.
Zoom link, passcode: 352012
Abstract: Over the past 60 years, relational databases have been a runaway success: they are deployed at every major organization and have produced hundreds of billions of dollars in market capitalization. However, the rise of ML capabilities has created a growing demand for analytics over unstructured data (e.g., videos, audio, text). Previously, unstructured data did not fit cleanly into the relational database model (e.g., selecting pixels versus selecting semantic content about objects in an image). Unfortunately, ML can be prohibitively expensive to deploy (e.g., 10 orders of magnitude more expensive than standard relational analytics) and can produce incorrect results. These problems are exacerbated by the scale of data. For example, the Tesla fleet of vehicles produces exabytes of data per day.
In this talk, I’ll describe my work on new ML-based query systems to tackle the cost and reliability of unstructured data analytics. My first line of work accelerates large classes of queries by orders of magnitude while providing strong guarantees on query accuracy. I accomplish this by developing novel query processing algorithms, indexing methods, and execution engines for unstructured data queries. I’ll also describe how to find errors in human labels and ML model outputs using novel data management systems. My systems can be used to automatically improve ML models and, perhaps surprisingly, have discovered a large number of errors in a popular autonomous vehicle dataset. My research has been deployed at an autonomous vehicle company and has enabled new forms of video analytics for ecologists at the Jasper Ridge Biological Preserve.
Bio: Daniel Kang is a sixth-year PhD student in the Stanford DAWN lab, co-advised by Professors Peter Bailis and Matei Zaharia. His research focuses on systems to query unstructured data. In particular, he focuses on using cheap approximations to accelerate query processing algorithms and on new programming models for ML data management. Daniel is collaborating with autonomous vehicle companies and ecologists to deploy his research. His work is supported in part by the NSF GRFP and the Google PhD Fellowship.