Recent advances in video large language models (Video-LLMs) have markedly improved video understanding. However, current preference optimization methods often rely on costly proprietary APIs or ground-truth captions to generate preference data, which hinders scalability. To address this, we introduce VideoSAVi (Self-Aligned Video Language Model), a self-training pipeline that enables Video-LLMs to improve their reasoning over video content without external supervision.
Our approach features a self-critiquing mechanism where the model identifies reasoning errors in its initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi applies Direct Preference Optimization (DPO) using this data to iteratively refine the model, enhancing both temporal and spatial reasoning.
Experiments demonstrate that VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements across other benchmarks, including a 3.9% gain on PerceptionTest and a 6.8% improvement on EgoSchema, compared to baseline models. Our model-agnostic and computationally efficient approach (requiring only 32 frames per video) offers a promising direction for self-aligned video understanding.
VideoSAVi employs an iterative self-alignment process consisting of four key stages, operating entirely without external supervision beyond the initial model and unlabeled videos:

1. Initial response generation: the model answers questions about unlabeled videos.
2. Self-critique: the model reviews each initial response and identifies its reasoning errors.
3. Revision: the model generates an improved alternative, pairing it with the flawed original to form a preference pair grounded directly in the video content.
4. Preference optimization: the model is refined on this self-generated preference data via Direct Preference Optimization (DPO).
This cycle is repeated, allowing the model to progressively improve its video understanding capabilities by learning from its own mistakes.
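To make the cycle concrete, the sketch below shows one self-alignment iteration in Python. The model interface (`generate`, `critique`, `revise`) and all names are illustrative assumptions rather than the paper's actual API; the loss is the standard DPO objective of Rafailov et al. (2023), applied here to self-generated preference pairs in which the revised answer is preferred over the original flawed one.

```python
import torch
import torch.nn.functional as F

# Standard DPO objective (Rafailov et al., 2023). Here the "chosen"
# response is the model's own revised answer and the "rejected"
# response is its original flawed answer.
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are summed log-probabilities of whole responses."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the revised answer more strongly
    # than the frozen reference model does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def self_alignment_iteration(generate, critique, revise, videos, questions):
    """One cycle of the four stages. `generate`, `critique`, and
    `revise` are hypothetical stand-ins for the Video-LLM's calls."""
    pairs = []
    for video, question in zip(videos, questions):
        initial = generate(video, question)                  # stage 1: answer
        errors = critique(video, question, initial)          # stage 2: self-critique
        revised = revise(video, question, initial, errors)   # stage 3: improve
        pairs.append((video, question, revised, initial))    # (chosen, rejected)
    return pairs  # stage 4: minimize dpo_loss over these pairs

if __name__ == "__main__":
    # Sanity check of the loss on random per-response log-probabilities.
    lp = lambda: torch.randn(8)
    print(dpo_loss(lp(), lp(), lp(), lp()).item())
```

Because both sides of each pair come from the model itself, the loss requires no external labels; only the frozen reference copy of the pre-iteration model anchors the update.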
We evaluated VideoSAVi extensively against baseline models (built on InternVL2.5) and state-of-the-art Video-LLMs across multiple benchmarks, as shown in Table 1. VideoSAVi achieves consistent gains over the InternVL2.5 base model on every benchmark: +0.8% on TempCompass, +3.9% on PerceptionTest, +3.6% on NeXTQA, +4.2% on MVBench, +6.8% on EgoSchema, and +2.0% on LongVideoBench.
While standard fine-tuning (SFT and SFT+) shows incremental gains, VideoSAVi's self-alignment approach yields significantly stronger results. Notably, it overcomes the limitations of prior preference optimization methods such as Hound-DPO, which relies on text-based ranking and degrades performance on benchmarks like MVBench (-5.6%) and EgoSchema (-3.5%), highlighting the inadequacy of text-only preferences for video understanding. Our method also outperforms TPO, which targets only temporal aspects yet still fails to deliver significant temporal gains; VideoSAVi, by contrast, improves consistently across both spatial and temporal dimensions.
VideoSAVi sets a new state of the art on MVBench (74.0%), surpassing models like Qwen2-VL by a large margin (+9.1%). On NeXTQA, it outperforms the previous best, LLaVA-OneVision, by +1.3%. While LLaVA-Video leads on PerceptionTest and LLaVA-OneVision on EgoSchema, VideoSAVi shows the most consistent generalization across the full benchmark suite, underscoring the robustness that self-alignment brings to video understanding.
Our ablation studies further confirm the effectiveness of the VideoSAVi framework.
VideoSAVi effectively corrects various reasoning errors made by the baseline model, such as temporal hallucinations, spatial misplacements, and object hallucinations.
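To illustrate how a self-critique targeting these error types might be elicited, a hypothetical prompt template is shown below; the wording is our assumption for illustration, not the paper's actual prompt.

```python
# Hypothetical self-critique prompt template (illustrative only; the
# paper's exact wording may differ). It directs the model to check
# the three error types discussed above before revising its answer.
CRITIQUE_PROMPT = """You answered a question about this video.
Question: {question}
Your answer: {initial_answer}

Re-examine the video and check your answer for:
- Temporal errors: does the claimed order of events match the video?
- Spatial errors: are objects located where they actually appear?
- Object hallucinations: does every object you mention actually appear?

List each error you find, or state that the answer is correct."""
```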
@article{kulkarni2025videosavi,
  title={VideoSAVi: Self-Aligned Video Language Models without Human Supervision},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  journal={arXiv preprint arXiv:2412.00624v2},
  year={2025}
}