VideoSAVi

Self-Aligned Video Language Models without Human Supervision

Yogesh Kulkarni

Arizona State University

Pooyan Fazli

Arizona State University

Abstract

Recent advances in vision-language models (VLMs) have significantly enhanced video understanding tasks. Instruction tuning (i.e., fine-tuning models on datasets of instructions paired with desired outputs) has been key to improving model performance. However, creating diverse instruction-tuning datasets is challenging due to high annotation costs and the complexity of capturing temporal information in videos. Existing approaches often rely on large language models to generate instruction-output pairs, which can limit diversity and lead to responses that lack grounding in the video content.

To address this, we propose VideoSAVi (Self-Aligned Video Language Model), a novel self-training pipeline that enables VLMs to generate their own training data without extensive manual annotation. The process involves three stages: (1) generating diverse video-specific questions, (2) producing multiple candidate answers, and (3) evaluating these responses for alignment with the video content. This self-generated data is then used for direct preference optimization (DPO), allowing the model to learn from its own high-quality outputs and improve its alignment with video content.

Our experiments demonstrate that even smaller models (0.5B and 7B parameters) can effectively use this self-training approach, outperforming previous methods and achieving results comparable to those trained on proprietary preference data. VideoSAVi shows significant improvements across multiple benchmarks: up to 28% on multi-choice QA, 8% on zero-shot open-ended QA, and 12% on temporal reasoning benchmarks. These results demonstrate the effectiveness of our self-training approach in enhancing video understanding while reducing dependence on proprietary models.

 

🎯 Motivation & Challenges

Developing high-quality video instruction datasets is expensive and often relies on proprietary models like GPT-4V. There is a pressing need to improve temporal understanding in video-language models and to keep their responses visually grounded. This raises key research questions: How can synthetic data be leveraged to enhance Video-LLM performance without expensive human annotations or proprietary APIs? Moreover, how can we ensure that synthetic data aligns with video content to maintain the accuracy and relevance of model responses?

 

🔬 Method: VideoSAVi Pipeline

VideoSAVi Pipeline Diagram
Overview of VideoSAVi's self-training pipeline with five key components.

 

VideoSAVi operates through a five-stage pipeline designed to enhance video-language models. First, it generates diverse video-specific questions (“What,” “Why,” and “How”) from captions, targeting aspects like visual recognition and temporal reasoning. Next, it produces multiple candidate answers at varying temperatures, balancing focused and creative responses. The model then evaluates these answers using a structured template to score relevance, accuracy, temporal grounding, and clarity. This is followed by filtering responses with CLIP similarity scores to ensure alignment with video content, adding a crucial visual grounding check. Finally, the model is fine-tuned using a CLIP-adjusted DPO loss, optimizing for both visual and linguistic quality.
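
As a rough illustration, the sketch below shows how such a self-training loop could be organized. This is a minimal sketch, not the released implementation: `video_llm`, `clip_sim`, the rubric wording, the sampling temperatures, and the grounding threshold are hypothetical stand-ins.

```python
# Minimal sketch of VideoSAVi-style preference-data generation.
# `video_llm` (a video LLM exposing .generate()/.score()) and `clip_sim`
# (a frame-text CLIP similarity function) are assumed interfaces.

QUESTION_PROMPT = (
    "Given the following video caption, write one 'What', one 'Why', and one "
    "'How' question that require watching the video to answer.\n"
    "Caption: {caption}"
)

def generate_candidates(video_llm, video, question, temperatures=(0.2, 0.7, 1.0)):
    """Sample one answer per temperature to mix focused and creative responses."""
    return [
        video_llm.generate(video, question, temperature=t, max_new_tokens=256)
        for t in temperatures
    ]

def self_evaluate(video_llm, video, question, answer):
    """Have the model score its own answer on relevance, accuracy,
    temporal grounding, and clarity; returns a scalar in [0, 1]."""
    rubric = (
        "Rate the answer from 1-5 on relevance, accuracy, temporal grounding, "
        "and clarity. Reply with the four scores only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    scores = video_llm.score(video, rubric)           # e.g. [4, 5, 3, 4]
    return sum(scores) / (5.0 * len(scores))

def build_preference_pair(video_llm, clip_sim, video, frames, caption,
                          grounding_threshold=0.2):   # assumed threshold
    """Return (question, chosen, rejected) or None if grounding is too weak."""
    question = video_llm.generate(video, QUESTION_PROMPT.format(caption=caption))
    candidates = generate_candidates(video_llm, video, question)

    ranked = []
    for answer in candidates:
        quality = self_evaluate(video_llm, video, question, answer)
        grounding = clip_sim(frames, answer)          # mean frame-text similarity
        ranked.append((quality, grounding, answer))

    ranked.sort(key=lambda r: (r[0], r[1]), reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best[1] < grounding_threshold:                 # CLIP filtering step
        return None
    return question, best[2], worst[2]
```

Pairs that survive the CLIP filter would then feed the preference-optimization stage sketched next.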

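For the final stage, the exact form of the CLIP-adjusted DPO loss is not reproduced here; the snippet below is one plausible way to fold a CLIP-similarity margin into the standard DPO objective, with `clip_weight` as an assumed hyperparameter rather than a value from the paper.

```python
import torch.nn.functional as F

def clip_adjusted_dpo_loss(
    policy_chosen_logps,    # log pi_theta(y_w | x, video), shape (B,)
    policy_rejected_logps,  # log pi_theta(y_l | x, video), shape (B,)
    ref_chosen_logps,       # log pi_ref(y_w | x, video), shape (B,)
    ref_rejected_logps,     # log pi_ref(y_l | x, video), shape (B,)
    clip_chosen,            # CLIP similarity of the chosen answer to the frames, shape (B,)
    clip_rejected,          # CLIP similarity of the rejected answer, shape (B,)
    beta=0.1,
    clip_weight=1.0,        # assumed weight on the visual-grounding term
):
    """Standard DPO margin plus a CLIP-similarity margin (illustrative variant)."""
    # Log-ratio margin between chosen and rejected responses (standard DPO).
    dpo_margin = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps
    )
    # Visual-grounding margin: favor answers that CLIP rates as better aligned.
    clip_margin = clip_chosen - clip_rejected
    logits = beta * dpo_margin + clip_weight * clip_margin
    return -F.logsigmoid(logits).mean()
```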
 

📊 Quantitative Results

Temporal Reasoning Benchmark (TempCompass)

| Method | LLM | Action | Direction | Speed | Event | Attribute | Average |
|---|---|---|---|---|---|---|---|
| VideoSAVi-Vicuna | 7B | 67.7 | 41.4 | 39.4 | 40.3 | 49.9 | 47.7 |
| VideoSAVi-Qwen | 0.5B | 64.2 | 42.7 | 44.7 | 44.5 | 47.9 | 48.8 |
| VideoSAVi-Qwen | 7B | 83.2 | 42.3 | 48.4 | 46.9 | 49.6 | 54.1 |

Multi-Choice QA Benchmarks

Performance comparison across multiple benchmarks shows VideoSAVi-Qwen-7B outperforming baselines including Video-LLaVA, LLaVA-NeXT-DPO, LLaVA-HOUND-DPO, i-SRT, and SF-LLaVA on the NExTQA, EgoQA, and IntentQA tasks.
VideoSAVi achieves significant improvements over existing methods across multiple benchmarks: up to 28% on multi-choice QA, 8% on zero-shot open-ended QA, and 12% on temporal reasoning.

 

🧪 Experiments

Temporal Reasoning Results

👁️ Qualitative Results

Comparison of model responses across different scenarios. Left: While baseline models add unnecessary details (“laps gently”) or make incorrect observations (“left hands”), VideoSAVi provides precise and factual descriptions of the drinking patterns. Right: VideoSAVi correctly identifies the natural demonstration method (opening his own mouth), while other models incorrectly assume the use of tools or hands.

 

Bibtex

@article{kulkarni2024videosavi,
  title={VideoSAVi: Self-Aligned Video Language Models without Human Supervision},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  journal={arXiv preprint arXiv:2412.00624},
  year={2024}
}