Dissertation Defense

Towards Generalist Vision-Language Models for Videos in Embodied AI

Peter Yu
Ph.D. Candidate
WHERE:
3941 Beyster Building

Hybrid Event: 3941 BBB / Zoom

Abstract: Recent advances in vision-language models (VLMs), the multimodal descendants of large language models (LLMs), have shown tremendous promise in addressing the challenges of embodied AI, particularly in open-world settings. However, existing VLMs are primarily designed for static, turn-based settings and struggle to operate in dynamic, real-time environments where inputs are continuous. In this thesis, we address these limitations by investigating three core challenges in adapting VLMs to open-world, real-time embodied AI: (1) addressing the long-tail problem of the multimodal open world, (2) processing long videos spanning minutes to hours, and (3) enabling real-time interaction in continuously changing environments.

To address the first challenge, we identify in-context learning as key to tackling the long-tail problem of the multimodal open world, and propose Emergent In-context Learning on Videos (EILeV), a novel training paradigm that induces in-context learning capabilities in VLMs over video and text. EILeV achieves this by curating a training dataset with specific distributional properties and training a VLM with architectural modifications that enable it to process interleaved video and text inputs.
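
To make the interleaved input format concrete, the sketch below shows one way such an in-context prompt could be assembled: video clips alternate with their text annotations, ending with a query clip for the model to describe. This is an illustrative assumption, not the EILeV data pipeline; the segment structure and function names are hypothetical.

```python
# Illustrative sketch (not the authors' code): assembling an interleaved
# video-text in-context prompt. VideoSegment and build_icl_prompt are
# hypothetical names used for illustration only.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class VideoSegment:
    path: str          # path to a video clip
    num_frames: int    # frames sampled from the clip

# An in-context sequence alternates video clips with their text annotations,
# ending with a query clip whose annotation the model must generate.
Interleaved = List[Union[VideoSegment, str]]

def build_icl_prompt(examples: List[Tuple[VideoSegment, str]],
                     query: VideoSegment) -> Interleaved:
    """Interleave (video, caption) in-context examples with a query video."""
    prompt: Interleaved = []
    for clip, caption in examples:
        prompt.append(clip)      # video tokens are inserted here at encoding time
        prompt.append(caption)   # paired text annotation
    prompt.append(query)         # the model continues with text for this clip
    return prompt

prompt = build_icl_prompt(
    examples=[(VideoSegment("clip1.mp4", 8), "The person slices a tomato."),
              (VideoSegment("clip2.mp4", 8), "The person stirs the pot.")],
    query=VideoSegment("clip3.mp4", 8),
)
```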

To address the second challenge, we introduce Espresso, a novel projector architecture that encodes long videos using a fixed number of tokens without sacrificing the VLM’s temporal understanding. Espresso achieves this by separately compressing spatial and temporal features, making efficient use of the fixed token budget when encoding video inputs.
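
As a minimal sketch of the general idea of compressing the spatial and temporal axes separately into a fixed token budget, the PyTorch snippet below pools patch features and frame features independently and concatenates the results. The module name, pooling choices, and dimensions are assumptions made for illustration; this is not the Espresso implementation.

```python
# A minimal sketch, assuming standard PyTorch: separate spatial and temporal
# compression of video features into a fixed number of tokens.
import torch
import torch.nn as nn

class FixedBudgetVideoProjector(nn.Module):
    def __init__(self, dim: int, spatial_tokens: int, temporal_tokens: int):
        super().__init__()
        # Pool the spatial axis (patches) and the temporal axis (frames) separately.
        self.spatial_pool = nn.AdaptiveAvgPool1d(spatial_tokens)
        self.temporal_pool = nn.AdaptiveAvgPool1d(temporal_tokens)
        self.proj = nn.Linear(dim, dim)  # map into the LLM embedding space

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, patches, dim) from a frozen vision encoder
        b, t, p, d = feats.shape
        # Spatial compression: pool patches within each frame, then average over frames.
        spatial = self.spatial_pool(feats.reshape(b * t, p, d).transpose(1, 2))
        spatial = spatial.transpose(1, 2).reshape(b, t, -1, d).mean(dim=1)
        # Temporal compression: average patches per frame, then pool over frames.
        temporal = self.temporal_pool(feats.mean(dim=2).transpose(1, 2)).transpose(1, 2)
        # Concatenate: total token count is fixed regardless of video length.
        return self.proj(torch.cat([spatial, temporal], dim=1))

tokens = FixedBudgetVideoProjector(dim=768, spatial_tokens=32, temporal_tokens=32)(
    torch.randn(2, 100, 256, 768))   # batch of 2, 100 frames, 256 patches, 768-dim
print(tokens.shape)                  # torch.Size([2, 64, 768]) for any frame count
```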

Finally, to address the third challenge, we introduce Temporally-Grounded Language Generation (TGLG), a benchmark task that evaluates two critical capabilities for real-time VLMs: perceptual updating (the ability to account for environmental changes while generating a response) and contingency awareness (the ability to adjust responses based on how previous outputs affect the environment). We curate a video-text dataset for this task and propose Temporal Responsiveness and Alignment Coherence Evaluation (TRACE), a new metric for quantifying these capabilities. As a strong baseline, we present Vision-Language Models with Time-Synchronized Interleaving (VLM-TSI), which tightly interleaves vision and text tokens to model real-time interactions with high temporal fidelity.
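
The rough sketch below illustrates the time-synchronized interleaving idea under assumed data structures: frame tokens and text tokens are merged into a single sequence ordered by timestamp, so generated text is conditioned only on frames observed up to that moment. The class and function names are hypothetical; this is not VLM-TSI's code.

```python
# A rough sketch of time-synchronized interleaving of vision and text tokens.
from dataclasses import dataclass
from typing import List

@dataclass
class TimedToken:
    t: float       # timestamp in seconds
    kind: str      # "vision" or "text"
    value: str     # frame id or text token (placeholder for embeddings)

def interleave_by_time(vision: List[TimedToken],
                       text: List[TimedToken]) -> List[TimedToken]:
    """Merge the two streams so every token appears in temporal order,
    with vision tokens placed before text tokens at equal timestamps."""
    return sorted(vision + text, key=lambda tok: (tok.t, tok.kind != "vision"))

stream = interleave_by_time(
    vision=[TimedToken(0.0, "vision", "frame_0"), TimedToken(0.5, "vision", "frame_1")],
    text=[TimedToken(0.4, "text", "Turn"), TimedToken(0.6, "text", "left")],
)
print([tok.value for tok in stream])   # ['frame_0', 'Turn', 'frame_1', 'left']
```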

By addressing these three challenges, this thesis advances the development of VLMs that can reason and respond fluidly in open-world, real-time environments.

 

Organizer

CSE Graduate Programs Office

Faculty Host

Prof. Joyce Chai