Dissertation Defense

Scale-Adaptive Video Understanding

Chenliang Xu

To reach the next level in capability, computer systems relying on visual perception
need to understand not only what action is happening in a video, but also who is doing the action
and where the action is happening. This capability is increasingly critical to extracting semantics from videos and,
ultimately, to interacting with humans in our complex world. However, achieving this goal is nontrivial
because context in video varies in both spatial and temporal scale. How to choose the
right scale for efficient video understanding remains an open question. In this talk, I will introduce
a comprehensive set of methods for adapting the scale during video understanding. I will start by
introducing a streaming video segmentation framework that generates a hierarchy of multi-scale
decompositions for videos of arbitrary length. Then I will talk about two methods that address
the scale selection problem in this hierarchical representation. The first method flattens the entire
hierarchy into a single segmentation using quadratic integer programming that balances the
relative level of information in the field. We show that it is possible to adaptively select the scales of
video content based on various post hoc feature criteria, such as motion-ness and object-ness. The
second method combines the segmentation hierarchy with a local CRF for the task of localizing and
recognizing actors and actions in video. It defines a dynamic and continuous process of information
exchange: the local CRF influences what scales are active in the hierarchy, and these active scales, in
turn, influence the connectivity in the CRF. Experiments on a large-scale video dataset demonstrate
the effectiveness of the explicit consideration of scale selection in video understanding.

Sponsored by