Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization.
VideoPASTA trains models to distinguish accurate video representations from carefully generated adversarial examples that deliberately violate spatial, temporal, or cross-frame relations. By applying Direct Preference Optimization to just 7,020 preference pairs, VideoPASTA learns robust representations that capture fine-grained spatial relationships and long-range temporal dynamics.
Experiments on standard video benchmarks show significant relative performance gains over the baseline Qwen2.5-VL model of 3.05% on VideoMME, 1.97% on NeXTQA, and 1.31% on LongVideoBench. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without human annotation or captioning, relying on only 32-frame sampling compared to the 96-frame, multi-GPU setups of prior work. This efficiency makes our approach a scalable, plug-and-play solution.
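As a rough illustration of the 32-frame input budget, the sketch below samples 32 frames from a video with OpenCV. Uniform temporal sampling and the helper name `sample_frames` are assumptions made for illustration; the summary above only states the frame count, not the sampling strategy.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=32):
    """Uniformly sample `num_frames` frames from a video (assumed strategy)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```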
VideoPASTA is a DPO-based framework that aligns video-language models using structured preference optimization. It leverages a dataset D = {(V, q, r⁺, r⁻)}, where V is the video, q is the query, r⁺ is the preferred aligned response, and r⁻ is a targeted adversarial response designed to introduce misalignment.
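To make the dataset format concrete, a single tuple from D could be stored as follows. The field names, the example query, and both responses are hypothetical, chosen only to illustrate an adversarial r⁻ that violates temporal ordering; the paper's actual storage format is not specified in this summary.

```python
# One hypothetical entry (V, q, r+, r-) from the preference dataset D.
# Field names, query, and responses are illustrative, not taken from the paper's data.
preference_pair = {
    "video": "videos/clip_0001.mp4",      # V: the input video
    "query": "What does the person do after opening the box?",  # q
    # r+: response aligned with the actual event order in the video
    "chosen": "After opening the box, the person takes out a camera and places it on the table.",
    # r-: adversarial response that violates the temporal ordering of events
    "rejected": "The person places the camera on the table and then opens the box.",
}
```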
The core process involves generating preference pairs that target three key failure modes: spatial misalignment, where the adversarial response misstates spatial relationships within a frame; temporal incoherence, where it violates the ordering of events; and cross-frame disconnection, where it breaks continuity across distant frames.
This targeted approach enables robust alignment across multiple dimensions of video understanding.
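For readers unfamiliar with DPO, a minimal sketch of the objective applied to a batch of such pairs is shown below, in PyTorch. The function name, the β default of 0.1, and the use of summed token log-probabilities are assumptions for illustration, since the summary does not give training details.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO objective over a batch of (r+, r-) preference pairs.

    Each argument is a tensor of summed token log-probabilities for the
    preferred (pos) or adversarial (neg) response under the trainable
    policy or the frozen reference model. beta=0.1 is a common default,
    not a value reported in the summary above.
    """
    pos_logratio = policy_logp_pos - ref_logp_pos
    neg_logratio = policy_logp_neg - ref_logp_neg
    # Encourage a positive margin between aligned and adversarial responses.
    return -F.logsigmoid(beta * (pos_logratio - neg_logratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
```

In a standard DPO setup, training then reduces to computing response log-probabilities under the tuned policy and a frozen copy of the base model (here Qwen2.5-VL) and minimizing this loss over the preference pairs.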
VideoPASTA demonstrates significant improvements over its foundation model (Qwen2.5-VL) and outperforms state-of-the-art methods on several benchmarks. Key relative gains include +3.05% on VideoMME, +1.97% on NeXTQA, +1.69% on MVBench, and +1.31% on LongVideoBench. It achieves these results with only 7k preference pairs, in contrast to methods such as Hound-DPO (17k pairs) and TPO (10k pairs). VideoPASTA outperforms LLaVA-Hound-DPO and i-SRT on all eight benchmarks and LLaVA-Video-TPO on seven.
Further analysis validates the effectiveness of each of VideoPASTA's components.
Qualitative examples demonstrate VideoPASTA's improved understanding relative to the baseline Qwen2.5-VL. It correctly identifies spatial relationships (e.g., a person clinging to an aircraft rather than leaving it), accurately captures temporal sequences within actions (e.g., the steps of decorating a cake), and successfully connects narrative elements across frames (e.g., relating a disclaimer to illustrations of heat stress).
@article{kulkarni2025videopasta,
  title={VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  year={2025}
}