CSE Seminar
Toward a Healthier ML Development Ecosystem, for Just Four Pizzas a Day
When training a machine learning model becomes fast, and model selection and hyper-parameter
tuning become automatic, will non-CS experts finally have the tools they need to build ML applications
all by themselves? In this talk, I will focus on those users who are still struggling, not because of
the speed or the lack of automation of an ML system, but because such a system is so powerful that it is
easily misused as an "overfitting machine." For many of these users, the quality of their ML applications
might actually decrease with these powerful tools in the absence of proper guidelines and feedback.
In particular, I will introduce two systems, ease.ml/ci and ease.ml/meter, which we built as an early
attempt at ML systems that try to enforce the right user behavior during the development process
of ML applications. The first, ease.ml/ci, is a "continuous integration engine" for ML that gives
developers a pass/fail signal for each developed ML model, depending on whether it satisfies certain
predefined properties over the "true distribution". The second, ease.ml/meter, is a system that
continuously returns some notion of the "degree of overfitting" to the developer. From the technical
perspective, both systems build upon the classic theory of answering adaptive statistical queries. I will
also discuss a set of simple but novel optimizations specific to the application scenarios of each system
that bring down the cost by up to one order of magnitude compared with off-the-shelf results.
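To make the flavor of such a pass/fail test concrete, here is a minimal, purely illustrative sketch; it is not the ease.ml/ci interface, and every name and number in it is an assumption. It checks a condition of the form "the new model beats the old one by at least a given margin on the true distribution," using a Hoeffding-style confidence bound to estimate how many freshly labeled examples one accuracy estimate needs and whether the comparison is conclusive.

```python
# Illustrative sketch only -- not the ease.ml/ci API.
# It shows the kind of pass/fail condition such a CI engine evaluates.
import math

def required_labels(eps: float, delta: float) -> int:
    """Number of i.i.d. labeled examples so that one empirical accuracy
    estimate deviates from the true accuracy by at most eps with
    probability >= 1 - delta (two-sided Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def ci_signal(acc_new: float, acc_old: float, margin: float, eps: float) -> str:
    """Conservative pass/fail signal for the condition
    'new accuracy - old accuracy > margin' over the true distribution,
    given that each empirical accuracy is within eps of the truth."""
    diff = acc_new - acc_old
    if diff - 2 * eps > margin:
        return "pass"          # condition holds even in the worst case
    if diff + 2 * eps < margin:
        return "fail"          # condition fails even in the best case
    return "inconclusive"      # within tolerance; more labels needed

# Example with made-up numbers: eps = 0.01, delta = 0.05
# -> about 18,445 labels per accuracy estimate.
print(required_labels(eps=0.01, delta=0.05))
print(ci_signal(acc_new=0.87, acc_old=0.82, margin=0.02, eps=0.01))
```

Roughly speaking, the adaptive-statistical-queries machinery mentioned above is what keeps tests of this kind valid when the developer reacts to each signal and submits a new model, and the system-specific optimizations in the talk are what bring the label counts down from the naive bounds.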
For many real-world use cases of both systems, providing one adaptive signal per day for a month
requires at most 96K labeled examples, a labeling cost that, in some applications, is as low as
four 35cm Domino's pizzas per day.
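As a rough back-of-the-envelope reading of that figure (taking a month as 30 days is an assumption for illustration, not a number from the talk):

\[
\frac{96{,}000 \text{ labeled examples}}{30 \text{ days}} = 3{,}200 \text{ labeled examples per day.}
\]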
Ce is an Assistant Professor in Computer Science at ETH Zurich. He believes that by making data, along with the processing of data, easily accessible to non-CS users, we have the potential to make the world a better place. His current research focuses on building data systems to support machine learning and help facilitate other sciences. Before joining ETH, Ce finished his PhD, advised by Christopher Ré, round-tripping between the University of Wisconsin-Madison and Stanford University, and spent another year as a postdoctoral researcher at Stanford. His PhD work produced DeepDive, a trained data system for automatic knowledge-base construction. He participated in the research efforts that won the SIGMOD Best Paper Award (2014) and the SIGMOD Research Highlight Award (2015), and his work was featured in special issues of Science (2017), Communications of the ACM (2017), "Best of VLDB" (2015), and Nature (2015).