Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization.
VideoPASTA trains models to distinguish accurate video representations from carefully generated adversarial examples that deliberately violate spatial, temporal, or cross-frame relations. By applying Direct Preference Optimization to just 7,020 preference pairs, VideoPASTA learns robust representations that capture fine-grained spatial relationships and long-range temporal dynamics.
Experiments on standard video benchmarks show significant relative performance gains over the baseline Qwen2.5-VL model of 3.05% on VideoMME, 1.97% on NeXTQA, and 1.31% on LongVideoBench. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without human annotation or captioning, relying on only 32-frame sampling compared to the 96-frame, multi-GPU setups of prior work. This efficiency makes our approach a scalable, plug-and-play solution.
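As a rough illustration of the 32-frame input budget, the sketch below samples 32 frames from a video with OpenCV. Uniform temporal sampling and the helper name `sample_frames` are assumptions made for illustration; the summary above only states the frame count, not the sampling strategy.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=32):
    """Uniformly sample `num_frames` frames from a video (assumed strategy)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```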
VideoPASTA is a DPO-based framework that aligns video-language models using structured preference optimization. It leverages a dataset D = {(V, q, r⁺, r⁻)}, where V is the video, q is the query, r⁺ is the preferred aligned response, and r⁻ is a targeted adversarial response designed to introduce misalignment.
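To make the dataset format concrete, a single tuple from D could be stored as follows. The field names, the example query, and both responses are hypothetical, chosen only to illustrate an adversarial r⁻ that violates temporal ordering; the paper's actual storage format is not specified in this summary.

```python
# One hypothetical entry (V, q, r+, r-) from the preference dataset D.
# Field names, query, and responses are illustrative, not taken from the paper's data.
preference_pair = {
    "video": "videos/clip_0001.mp4",      # V: the input video
    "query": "What does the person do after opening the box?",  # q
    # r+: response aligned with the actual event order in the video
    "chosen": "After opening the box, the person takes out a camera and places it on the table.",
    # r-: adversarial response that violates the temporal ordering of events
    "rejected": "The person places the camera on the table and then opens the box.",
}
```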
The core process involves generating preference pairs that target three key failure modes: spatial misalignment, where the adversarial response misstates spatial relationships within a frame; temporal incoherence, where it violates the ordering of events; and cross-frame disconnection, where it breaks continuity across distant frames.
This targeted approach enables robust alignment across multiple dimensions of video understanding.
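For readers unfamiliar with DPO, a minimal sketch of the objective applied to a batch of such pairs is shown below, in PyTorch. The function name, the β default of 0.1, and the use of summed token log-probabilities are assumptions for illustration, since the summary does not give training details.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO objective over a batch of (r+, r-) preference pairs.

    Each argument is a tensor of summed token log-probabilities for the
    preferred (pos) or adversarial (neg) response under the trainable
    policy or the frozen reference model. beta=0.1 is a common default,
    not a value reported in the summary above.
    """
    pos_logratio = policy_logp_pos - ref_logp_pos
    neg_logratio = policy_logp_neg - ref_logp_neg
    # Encourage a positive margin between aligned and adversarial responses.
    return -F.logsigmoid(beta * (pos_logratio - neg_logratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
```

In a standard DPO setup, training then reduces to computing response log-probabilities under the tuned policy and a frozen copy of the base model (here Qwen2.5-VL) and minimizing this loss over the preference pairs.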
VideoPASTA demonstrates significant improvements over its foundation model (Qwen2.5-VL) and outperforms state-of-the-art methods on several benchmarks. Key relative gains include +3.05% on VideoMME, +1.97% on NeXTQA, +1.69% on MVBench, and +1.31% on LongVideoBench. It achieves these results with only 7k preference pairs, in contrast to methods such as Hound-DPO (17k pairs) and TPO (10k pairs). VideoPASTA outperforms LLaVA-Hound-DPO and i-SRT on all eight benchmarks and LLaVA-Video-TPO on seven.
Further analysis validates the effectiveness of each of VideoPASTA's components.
Qualitative examples demonstrate VideoPASTA's improved understanding relative to the baseline Qwen2.5-VL. It correctly identifies spatial relationships (e.g., a person clinging to an aircraft rather than leaving it), accurately captures temporal sequences within actions (e.g., the steps of decorating a cake), and successfully connects narrative elements across frames (e.g., relating a disclaimer to illustrations of heat stress).
@article{kulkarni2025videopasta,
  title={VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  year={2025}
}