Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization.
VideoPASTA trains models to distinguish accurate video representations from carefully generated adversarial examples that deliberately violate spatial, temporal, or cross-frame relations. By applying Direct Preference Optimization to just 7,020 preference pairs, VideoPASTA learns robust representations that capture fine-grained spatial relationships and long-range temporal dynamics.
Experiments on standard video benchmarks show significant relative performance gains of 3.05% on VideoMME, 1.97% on NeXTQA, and 1.31% on LongVideoBench over the baseline Qwen2.5-VL model. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without human annotation or captioning, relying on just 32-frame sampling, compared to the 96-frame, multi-GPU setups of prior work. This efficiency makes our approach a scalable, plug-and-play solution.
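For reference, a minimal sketch of what uniform 32-frame sampling might look like; the function name and the decord dependency are assumptions for illustration, not the authors' code:

```python
import numpy as np
from decord import VideoReader  # assumed video decoding backend

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    """Uniformly sample `num_frames` frames across the full video."""
    vr = VideoReader(video_path)
    # Evenly spaced frame indices from the first to the last frame
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)
```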
VideoPASTA is a DPO-based framework that aligns video-language models using structured preference optimization. It leverages a dataset D = {(V, q, r⁺, r⁻)}, where V is the video, q is the query, r⁺ is the preferred aligned response, and r⁻ is a targeted adversarial response designed to introduce misalignment.
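For clarity, the standard DPO objective applied to these pairs can be sketched as follows. This is a minimal PyTorch sketch assuming per-response log-probabilities under the policy and a frozen reference model are already computed; it is not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos: torch.Tensor, policy_logp_neg: torch.Tensor,
             ref_logp_pos: torch.Tensor, ref_logp_neg: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: push the policy to prefer the aligned response r+
    over the adversarial response r-, relative to a frozen reference model."""
    # Log-ratios of policy vs. reference for preferred and adversarial responses
    pos_logratio = policy_logp_pos - ref_logp_pos
    neg_logratio = policy_logp_neg - ref_logp_neg
    # -log sigmoid(beta * (preferred margin - adversarial margin)), averaged over the batch
    return -F.logsigmoid(beta * (pos_logratio - neg_logratio)).mean()
```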
The core process involves generating preference pairs that target three key failure modes: spatial misalignment, temporal incoherence, and broken cross-frame continuity. This targeted approach enables robust alignment across multiple dimensions of video understanding.
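To make the data format concrete, here is an illustrative preference pair for each failure mode. The queries, responses, and file names are invented for illustration and are not drawn from the released dataset:

```python
# Hypothetical preference pairs, one per targeted failure mode
preference_pairs = [
    {   # Spatial: the adversarial response violates a spatial relation
        "video": "clip_001.mp4",
        "query": "Where is the cup relative to the laptop?",
        "preferred": "The cup sits to the left of the laptop.",
        "adversarial": "The cup sits on top of the laptop.",
    },
    {   # Temporal: the adversarial response reverses the order of events
        "video": "clip_002.mp4",
        "query": "What does the person do after opening the box?",
        "preferred": "They take out the cable and plug it in.",
        "adversarial": "They seal the box shut and put it away.",
    },
    {   # Cross-frame: the adversarial response breaks continuity across distant frames
        "video": "clip_003.mp4",
        "query": "How does the ending relate to the opening scene?",
        "preferred": "The closing shot returns to the same kitchen shown at the start.",
        "adversarial": "The closing shot takes place in an unrelated outdoor setting.",
    },
]
```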
VideoPASTA demonstrates significant improvements over its foundation model (Qwen2.5-VL) and outperforms state-of-the-art methods on several benchmarks. Key relative gains include +3.05% on VideoMME, +1.97% on NeXTQA, +1.69% on MVBench, and +1.31% on LongVideoBench. It achieves these results with only 7k preference pairs, in contrast with methods like Hound-DPO (17k pairs) and TPO (10k pairs). VideoPASTA outperforms LLaVA-Hound-DPO and i-SRT on all eight benchmarks and LLaVA-Video-TPO on seven.
Further analysis validates VideoPASTA's design, covering training dynamics, the necessity of targeting all three failure modes, and robustness against adversarial attacks.
DPO training effectively learns preference boundaries, with reward accuracy stabilizing as training progresses.
Combining all three targeted modes yields the best overall performance improvement.
VideoPASTA significantly outperforms baselines in rejecting adversarial options across all categories.
Qualitative examples demonstrate VideoPASTA's improved understanding compared to the baseline Qwen2.5-VL. It correctly identifies spatial relationships (e.g., person clinging to aircraft vs. leaving), accurately captures temporal sequences in actions (e.g., cake decoration steps), and successfully connects narrative elements across frames (e.g., relating a disclaimer to illustrations of heat stress).
@inproceedings{kulkarni-fazli-2025-videopasta,
  title = "VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment",
  author = "Kulkarni, Yogesh and Fazli, Pooyan",
  booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
  year = "2025",
  pages = "32342--32367"
}