AI Seminar

The Vision-Language Interface in Machines and Humans

Jeffrey Mark Siskind
Associate Professor
School of Electrical and Computer Engineering, Purdue University

In the first part of the talk, I will discuss an approach to the vision-language
interface in machines: a unified cost function that integrates object
detection, tracking, event recognition, and natural language semantics. The
roles played by participants (nouns), their characteristics (adjectives), the
actions performed (verbs), the manner of such actions (adverbs), and changing
spatial relations between participants (prepositions), in the form of
whole-sentence descriptions, can guide activity recognition, allowing more
robust object detection and tracking than is possible without sentential
guidance. A general framework scores a video-sentence pair and produces
object tracks that delineate the participants in the video that correspond to
the sentential roles. This framework supports searching large video corpora
for clips that depict a sentential query by scoring each clip and returning a
ranked list of clips. Compositional semantics can encode subtle meaning
distinctions between two sentences that have the same words but different
meanings: `The person rode the horse' vs. `The horse rode the person.' We
demonstrate this approach by searching for 141 sentential queries involving
people and horses interacting with each other in 10 full-length Hollywood
movies.

In the second part of the talk, I will discuss an investigation of the
vision-language interface in humans: the question of how the human brain
represents simple compositions of constituents (actors, verbs, objects,
directions, and locations). Subjects viewed videos during neuroimaging (fMRI)
sessions from which sentential descriptions of those videos were identified by
decoding the brain representations based only on their fMRI activation
patterns. Constituents (e.g., `fold' and `shirt') were independently decoded
from a single presentation. Independent constituent classification was then
compared to joint classification of aggregate concepts (e.g., `fold-shirt');
results were similar as measured by accuracy and correlation. The brain
regions used for independent constituent classification are largely disjoint
and largely cover those used for joint classification. This allows recovery
of sentential descriptions of stimulus videos by composing the results of the
independent constituent classifiers. Furthermore, classifiers trained on the
words one set of subjects thinks of when watching a video can recognize
sentences a different subject thinks of when watching a different video.

Joint work with Andrei Barbu, Daniel P. Barrett, Wei Chen, N. Siddharth,
Caiming Xiong, Haonan Yu, Jason J. Corso, Christiane D. Fellbaum, Catherine
Hanson, Stephen Jose Hanson, Sebastien Helie, Evguenia Malaia, Barak
A. Pearlmutter, Thomas Michael Talavage, and Ronnie B. Wilbur.

Jeffrey M. Siskind received the B.A. degree in computer science from the
Technion, Israel Institute of Technology, Haifa, in 1979, the S.M. degree in
computer science from the Massachusetts Institute of Technology (M.I.T.),
Cambridge, in 1989, and the Ph.D. degree in computer science from M.I.T. in
1992. He did a postdoctoral fellowship at the University of Pennsylvania
Institute for Research in Cognitive Science from 1992 to 1993. He was an
assistant professor at the University of Toronto Department of Computer
Science from 1993 to 1995, a senior lecturer at the Technion Department of
Electrical Engineering in 1996, a visiting assistant professor at the
University of Vermont Department of Computer Science and Electrical
Engineering from 1996 to 1997, and a research scientist at NEC Research
Institute, Inc. from 1997 to 2001. He joined the Purdue University School of
Electrical and Computer Engineering in 2002 where he is currently an associate
professor. His research interests include machine vision, artificial
intelligence, cognitive science, computational linguistics, child language
acquisition, and programming languages and compilers.
