Dissertation Defense

Deep Neural Networks for Visual Reasoning, Program Induction, and Text-to-Image Synthesis

Scott Reed

Deep neural networks have enabled transformative breakthroughs in speech and image recognition over the past several years, fueled by increases in training data and computational power as well as algorithmic improvements. While deep networks excel at pattern recognition, sometimes matching human performance, the research frontier has shifted to their current weaknesses: reasoning, planning and creativity. In this thesis I propose several approaches that advance along this frontier.

To investigate the reasoning capability of neural networks, I develop a model for visual analogy making: given an image analogy problem A : B :: C : ?, the network predicts the pixels of the image D that completes the analogy. For example, the analogy could involve rotating 3D shapes or animating a video game sprite. In contrast to previous work on analogy-making, ours is the first end-to-end differentiable architecture for pixel-level analogy completion. We also show that by learning to disentangle the latent visual factors of variation (e.g. pose and shape), our model can more effectively relate images and perform image transformations.
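
The sketch below illustrates the general idea of pixel-level analogy completion in a latent space: encode A, B and C, apply the A-to-B transformation to C's embedding, and decode the result to pixels. It is a minimal, purely additive variant with assumed layer sizes and 64x64 inputs, not the exact thesis architecture.

```python
# Minimal sketch of an additive visual-analogy model (illustrative only;
# the layer sizes, 64x64 image assumption, and purely additive latent
# transformation are assumptions, not the exact thesis architecture).
import torch
import torch.nn as nn

class AnalogyNet(nn.Module):
    def __init__(self, image_channels=3, embed_dim=256):
        super().__init__()
        # Encoder f: image -> latent embedding
        self.encoder = nn.Sequential(
            nn.Conv2d(image_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim),
        )
        # Decoder g: latent embedding -> predicted image D (64x64)
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 128 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (128, 16, 16)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, image_channels, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, a, b, c):
        # Additive analogy: apply the A -> B transformation to C in latent
        # space, then decode the result to pixels.
        fa, fb, fc = self.encoder(a), self.encoder(b), self.encoder(c)
        return self.decoder(fc + (fb - fa))

# Training would minimize a pixel reconstruction loss against the true D:
# loss = nn.functional.mse_loss(model(A, B, C), D)
```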

For better neural-network-based planning, I design a recurrent neural network augmented with a memory of program embeddings; the network learns to execute these programs from example execution traces. By exploiting compositionality, we demonstrate improved data efficiency and strong generalization compared to previous recurrent networks for program induction. We apply our model, the Neural Programmer-Interpreter (NPI), to generating execution trajectories for multi-digit addition, sorting arrays of numbers, and canonicalizing the pose of 3D car models from image renderings. Notably, a single NPI can learn and execute these programs and associated subprograms across very different environments with different affordances.
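
The following sketch conveys the shape of an NPI-style interpreter step: a recurrent core consumes an encoded environment observation, the current program's embedding (retrieved from a program memory), and call arguments, then predicts whether to return, which subprogram to call next, and with what arguments. The dimensions and heads here are illustrative assumptions; the real model pairs this core with task-specific environment encoders.

```python
# Minimal sketch of an NPI-style core (hypothetical sizes; not the exact
# thesis implementation).
import torch
import torch.nn as nn

class NPICore(nn.Module):
    def __init__(self, num_programs=32, prog_dim=64, state_dim=128,
                 arg_dim=16, hidden_dim=256):
        super().__init__()
        # Learnable memory of program embeddings, addressed by program id,
        # plus a key memory used to select the next program to call.
        self.program_memory = nn.Embedding(num_programs, prog_dim)
        self.program_keys = nn.Embedding(num_programs, prog_dim)
        self.lstm = nn.LSTM(state_dim + prog_dim + arg_dim, hidden_dim,
                            batch_first=True)
        self.end_head = nn.Linear(hidden_dim, 1)         # probability of returning
        self.key_head = nn.Linear(hidden_dim, prog_dim)  # key of next subprogram
        self.arg_head = nn.Linear(hidden_dim, arg_dim)   # arguments for next call

    def forward(self, state_enc, prog_id, args, hidden=None):
        # One interpreter step: fuse the environment state, the current
        # program embedding, and the arguments, then predict what to do next.
        prog_emb = self.program_memory(prog_id)
        x = torch.cat([state_enc, prog_emb, args], dim=-1).unsqueeze(1)
        out, hidden = self.lstm(x, hidden)
        h = out.squeeze(1)
        end_prob = torch.sigmoid(self.end_head(h))
        # Next subprogram = closest match in the program-key memory.
        next_prog_logits = self.key_head(h) @ self.program_keys.weight.t()
        return end_prob, next_prog_logits, self.arg_head(h), hidden
```

Training supervises each step against the ground-truth execution trace, which is what lets the model reuse subprograms compositionally across tasks.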

To improve the usefulness of neural nets for creative tasks, I develop several variants of Generative Adversarial Networks capable of text-to-image synthesis, i.e. generating plausible images from incomplete, informal specifications such as "a bright yellow bird with a black head and beak". Our system generates convincing depictions of birds, flowers and many other objects from textual descriptions alone. By learning to invert the generator network, we also show how to synthesize images by transferring the style of an unseen photograph (e.g. background appearance) onto the content of a text description.
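
As a rough illustration of the text-conditioning idea, the sketch below feeds a text embedding into both the generator (concatenated with the noise vector) and the discriminator (concatenated with image features), so that mismatched image-text pairs can be scored as fake. Layer sizes, the 32x32 output resolution, and the assumption of a precomputed text embedding are illustrative choices, not the thesis architecture.

```python
# Minimal sketch of a text-conditional GAN (hypothetical sizes; the thesis
# conditions on a learned text encoder rather than a fixed embedding).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            # Project concatenated [noise, text embedding] to a 4x4 feature map.
            nn.ConvTranspose2d(noise_dim + text_dim, 256, 4), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),  # 32x32 RGB image
        )

    def forward(self, z, text_emb):
        x = torch.cat([z, text_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)

class Discriminator(nn.Module):
    def __init__(self, text_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Score real/fake jointly from image features and the text embedding,
        # so images that do not match their descriptions can be penalized.
        self.score = nn.Linear(256 * 4 * 4 + text_dim, 1)

    def forward(self, image, text_emb):
        feats = self.conv(image).flatten(1)
        return self.score(torch.cat([feats, text_emb], dim=1))
```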

Sponsored by

Professor Honglak Lee