Arizona State University
Recent advances in vision-language models (VLMs) have significantly enhanced video understanding tasks. Instruction tuning (i.e., fine-tuning models on datasets of instructions paired with desired outputs) has been key to improving model performance. However, creating diverse instruction-tuning datasets is challenging due to high annotation costs and the complexity of capturing temporal information in videos. Existing approaches often rely on large language models to generate instruction-output pairs, which can limit diversity and lead to responses that lack grounding in the video content.
To address this, we propose VideoSAVi (Self-Aligned Video Language Model), a novel self-training pipeline that enables VLMs to generate their own training data without extensive manual annotation. The process involves three stages: (1) generating diverse video-specific questions, (2) producing multiple candidate answers, and (3) evaluating these responses for alignment with the video content. This self-generated data is then used for direct preference optimization (DPO), allowing the model to learn from its own highest-quality outputs and improve alignment with video content.
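The three-stage data-generation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper callables (`generate_question`, `sample_answer`, `score_answer`) are hypothetical stand-ins for the model's own generation and self-evaluation steps, and the temperature set is assumed.

```python
def build_preference_pairs(captions, generate_question, sample_answer, score_answer,
                           temperatures=(0.2, 0.7, 1.0)):
    """Sketch of the self-training data loop:
    (1) generate a video-specific question from each caption,
    (2) sample candidate answers at several temperatures,
    (3) self-evaluate candidates and keep the best/worst as a DPO pair.
    All three helpers are hypothetical stand-ins for model calls."""
    pairs = []
    for caption in captions:
        question = generate_question(caption)
        # Varying temperature balances focused and creative responses.
        candidates = [sample_answer(question, t) for t in temperatures]
        ranked = sorted(candidates, key=score_answer, reverse=True)
        # Highest-scoring answer becomes "chosen", lowest "rejected".
        pairs.append({"question": question,
                      "chosen": ranked[0],
                      "rejected": ranked[-1]})
    return pairs
```

In the actual pipeline each helper would be a call to the VLM itself (with a structured evaluation template for scoring); the pair format matches what standard DPO training code expects.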
Our experiments demonstrate that even smaller models (0.5B and 7B parameters) can effectively use this self-training approach, outperforming previous methods and achieving results comparable to those trained on proprietary preference data. VideoSAVi shows significant improvements across multiple benchmarks: up to 28% on multiple-choice QA, 8% on zero-shot open-ended QA, and 12% on temporal reasoning benchmarks. These results demonstrate the effectiveness of our self-training approach in enhancing video understanding while reducing dependence on proprietary models.
Developing high-quality video instruction datasets is expensive and often relies on proprietary models like GPT-4V. There is also a pressing need to improve temporal understanding in video-language models while maintaining visual grounding in their responses. This raises key research questions: How can synthetic data be leveraged to enhance Video-LLM performance without expensive human annotations or proprietary APIs? Moreover, how can we ensure that synthetic data aligns with video content to maintain the accuracy and relevance of model responses?
VideoSAVi operates through a five-stage pipeline designed to enhance video-language models. First, it generates diverse video-specific questions (“What,” “Why,” and “How”) from captions, targeting aspects like visual recognition and temporal reasoning. Next, it produces multiple candidate answers at varying temperatures, balancing focused and creative responses. The model then evaluates these answers using a structured template to score relevance, accuracy, temporal grounding, and clarity. This is followed by filtering responses with CLIP similarity scores to ensure alignment with video content, adding a crucial visual grounding check. Finally, the model is fine-tuned using a CLIP-adjusted DPO loss, optimizing for both visual and linguistic quality.
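The final stage pairs the standard DPO objective with a CLIP-based visual grounding signal. The sketch below assumes a simple additive adjustment, where the DPO margin is shifted by the gap in CLIP similarity between the chosen and rejected responses; the paper's exact formulation may differ, and `alpha` is an assumed weighting hyperparameter.

```python
import math

def clip_adjusted_dpo_loss(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected,
                           clip_sim_chosen, clip_sim_rejected,
                           beta=0.1, alpha=1.0):
    """Per-example DPO loss with an assumed additive CLIP adjustment.

    Standard DPO margin: beta * [(log pi(y_w) - log pi_ref(y_w))
                                 - (log pi(y_l) - log pi_ref(y_l))],
    here shifted by alpha * (CLIP sim of chosen - CLIP sim of rejected)
    so responses better grounded in the video are preferred more strongly.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Hypothetical visual-grounding term: reward CLIP-similarity advantage.
    margin += alpha * (clip_sim_chosen - clip_sim_rejected)
    # -log(sigmoid(margin)); always positive, shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In practice the log-probabilities come from the policy and frozen reference model, and the CLIP similarities from scoring each response against sampled video frames; averaging this loss over the self-generated preference pairs gives the training objective.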
| Method | LLM | Action | Direction | Speed | Event | Attribute | Average |
|---|---|---|---|---|---|---|---|
| VideoSAVi-Vicuna | 7B | 67.7 | 41.4 | 39.4 | 40.3 | 49.9 | 47.7 |
| VideoSAVi-Qwen | 0.5B | 64.2 | 42.7 | 44.7 | 44.5 | 47.9 | 48.8 |
| VideoSAVi-Qwen | 7B | 83.2 | 42.3 | 48.4 | 46.9 | 49.6 | 54.1 |
@article{kulkarni2024videosavi,
title={VideoSAVi: Self-Aligned Video Language Models without Human Supervision},
author={Kulkarni, Yogesh and Fazli, Pooyan},
journal={arXiv preprint arXiv:2412.00624},
year={2024}
}