AI Seminar

Deep Learning with UCT-Data for Real-Time Atari Game Play and Advances in Multimodal Deep Representation Learning

Satinder Baveja and Honglak Lee, University of Michigan

The combination of modern reinforcement learning and deep learning holds the promise of significant progress on challenging applications that require both rich perception and policy selection. The Arcade Learning Environment (ALE) provides a set of Atari games that serve as a useful benchmark for such applications. A recent breakthrough in combining model-free reinforcement learning with deep learning, called DQN, achieves the best real-time agents thus far. Planning-based approaches achieve far higher scores than the best model-free approaches, but they exploit information that is not available to human players, and they are orders of magnitude slower than needed for real-time play. Our main goal in this work is to build a better real-time Atari game-playing agent than DQN. The central idea is to use the slow planning-based agents to provide training data for a deep-learning architecture capable of real-time play. We propose new agents based on this idea and show that they outperform DQN.
(Based on a NIPS paper by Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard Lewis, & Xiaoshi Wang)
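
The central idea of the first talk, a slow planner labeling states so that a fast policy can be trained to imitate it, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: a one-dimensional corridor stands in for an Atari game, exhaustive depth-limited lookahead stands in for the UCT planner, and a one-weight logistic model stands in for the deep network. All function names are hypothetical.

```python
import math
import random

N = 10               # corridor length; a hypothetical toy stand-in for a game
ACTIONS = (-1, +1)   # index 0 = move left, index 1 = move right


def step(pos, action):
    """Move one cell, clamped to the ends of the corridor."""
    return min(N - 1, max(0, pos + action))


def slow_planner(pos, goal, depth=6):
    """Exhaustive lookahead (stand-in for UCT): index of the cheapest action."""
    def cost(p, d):
        if p == goal:
            return 0
        if d == 0:
            return abs(p - goal)  # remaining distance as a leaf estimate
        return 1 + min(cost(step(p, a), d - 1) for a in ACTIONS)

    costs = [cost(step(pos, a), depth - 1) for a in ACTIONS]
    return costs.index(min(costs))


def collect_dataset():
    """Label every non-terminal (pos, goal) pair with the planner's action."""
    data = []
    for pos in range(N):
        for goal in range(N):
            if pos != goal:
                feature = (goal - pos) / N
                data.append((feature, slow_planner(pos, goal)))
    return data


def train_fast_policy(data, epochs=50, lr=1.0, seed=0):
    """Fit a one-weight logistic classifier to the planner's labels with SGD."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-w * x))  # predicted P(action = right)
            w -= lr * (p - y) * x               # gradient of the log loss
    return w


def fast_policy(w, pos, goal):
    """Constant-time action choice; no lookahead at decision time."""
    return 1 if w * (goal - pos) / N > 0 else 0


dataset = collect_dataset()
weight = train_fast_policy(dataset)
agreement = sum(
    fast_policy(weight, p, g) == slow_planner(p, g)
    for p in range(N) for g in range(N) if p != g
) / len(dataset)
print(f"fast policy agrees with planner on {agreement:.0%} of states")
```

The point of the design is the split in cost: the planner is only ever called offline to produce labels, while the trained policy makes each decision with a single multiplication.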

In multimodal representation learning, it is important to capture high-level associations between multiple data modalities with a compact set of latent variables. Deep learning has been successfully applied to this problem. A common strategy is to learn several layers of modality-specific features in the lower layers and then learn joint representations that are shared between modalities at the higher layers. Nonetheless, an important open question remains: how to learn a good association between multiple data modalities and, in particular, how to reason about or predict missing data modalities effectively at test time. In this talk, I will give my perspectives on advances in multimodal deep learning, with applications to challenging problems in audio-visual recognition, robotic perception, and visual-textual recognition. In particular, I will talk about my recent work on a new multimodal deep learning method with a learning objective that explicitly encourages cross-modal associations. Specifically, instead of maximum likelihood learning, we train the networks to minimize the variation of information, an information-theoretic measure that computes the information distance between data modalities. We provide a theoretical analysis of the proposed approach, showing that the proposed training scheme leads to a maximum likelihood solution for a full generative model under some conditions. We describe our method based on restricted Boltzmann machines and propose learning algorithms based on contrastive divergence and multi-prediction training. Furthermore, we propose an extension to deep recursive networks. In experiments, we demonstrate state-of-the-art visual-textual and visual recognition performance on the MIR-Flickr and PASCAL VOC 2007 databases.
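
For readers unfamiliar with the variation of information mentioned above, its standard definition is given below. The symbols are assumptions introduced here: X and Y denote two data modalities, H is (conditional) entropy, and I is mutual information. The second line is one natural way to turn the measure into a training loss for a model with parameters θ on data distribution P_D. It is offered as a sketch and may differ in detail from the formulation presented in the talk.

```latex
\begin{align}
\mathrm{VI}(X, Y) &= H(X \mid Y) + H(Y \mid X) = H(X) + H(Y) - 2\, I(X; Y) \\
\mathcal{L}(\theta) &= -\,\mathbb{E}_{(x, y) \sim P_D}
  \bigl[ \log P_\theta(x \mid y) + \log P_\theta(y \mid x) \bigr]
\end{align}
```

Minimizing the second expression rewards the model for predicting each modality from the other, which is the cross-modal behavior the abstract contrasts with maximizing the joint likelihood.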

Sponsored by