
Off-policy Estimation in Reinforcement Learning

Bo Dai, Research Scientist, Google Brain
3725 Beyster Building


In many real-world reinforcement learning applications, access to the underlying dynamic environment is limited to a fixed set of data that has already been collected; no information about the collection procedure is available, and no additional interaction with the environment is possible. This is usually called the ‘behavior-agnostic off-policy’ setting. In this talk, we show that consistent and effective off-policy estimation remains possible in this scenario. Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary state-action distribution of the target policy and the empirical distribution of the data, derived from fundamental properties of the stationary distribution. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that it significantly improves accuracy compared to existing techniques.
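The correction-ratio idea can be illustrated with a small tabular sketch. All MDP numbers, policies, and variable names below are invented for illustration; the talk's algorithm estimates the ratio directly from data without knowing the model, whereas this toy computes it exactly from a known transition matrix purely to show what reweighting by the ratio accomplishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP; all numbers are illustrative assumptions,
# not taken from the talk. P[s, a, s'] is the transition probability.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])      # deterministic reward for (s, a)
gamma = 0.9
mu0 = np.array([1.0, 0.0])                  # initial state distribution

pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])   # behavior policy (collected the data)
pi_t = np.array([[0.1, 0.9], [0.8, 0.2]])   # target policy to evaluate

# A fixed dataset of (s, a, r, s') tuples gathered by the behavior policy.
def rollout(n_steps=50_000):
    s = rng.choice(2, p=mu0)
    data = []
    for _ in range(n_steps):
        a = rng.choice(2, p=pi_b[s])
        s2 = rng.choice(2, p=P[s, a])
        data.append((s, a, R[s, a], s2))
        s = s2
    return data

data = rollout()

# Empirical state-action distribution d^D of the dataset.
d_D = np.zeros((2, 2))
for s, a, _, _ in data:
    d_D[s, a] += 1.0
d_D /= d_D.sum()

# Discounted stationary distribution d^pi of the target policy, from the
# balance equation
#   d(s',a') = (1-gamma) mu0(s') pi(a'|s')
#              + gamma * sum_{s,a} d(s,a) P(s'|s,a) pi(a'|s').
n = 4  # flattened (s, a) index: s * 2 + a
T = np.zeros((n, n))
for s in range(2):
    for a in range(2):
        for s2 in range(2):
            for a2 in range(2):
                T[s2 * 2 + a2, s * 2 + a] = P[s, a, s2] * pi_t[s2, a2]
b = (1 - gamma) * (mu0[:, None] * pi_t).ravel()
d_pi = np.linalg.solve(np.eye(n) - gamma * T, b).reshape(2, 2)

# Correction ratio w = d^pi / d^D; reweighting the observed rewards by w
# turns an average over the dataset into an estimate of the (normalized)
# value of the target policy.
w = d_pi / d_D
est = np.mean([w[s, a] * r for s, a, r, _ in data])

# Ground truth via exact policy evaluation, for comparison.
r_pi = (pi_t * R).sum(axis=1)
P_pi = np.einsum('sa,sap->sp', pi_t, P)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
rho_true = (1 - gamma) * mu0 @ v

print(f"corrected estimate: {est:.4f}   true value: {rho_true:.4f}")
```

The point of the sketch is the cancellation: averaging w(s, a) · r over the dataset replaces the data's empirical distribution with the target policy's stationary distribution, so the estimate matches the target policy's value even though every transition was generated by a different policy.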


Bo Dai is a research scientist at Google Brain. He is the recipient of best paper awards at AISTATS 2016 and at the NIPS 2017 Workshop on Machine Learning for Molecules and Materials. His research interests lie in developing principled (deep) machine learning methods using tools from optimization, especially for reinforcement learning and representation learning for structured data, as well as their applications.


Stephen Reger, (734) 764-2132

Faculty Host

Honglak Lee, Associate Professor