Arizona State University
Recent advances in vision-language models (VLMs) have significantly enhanced video understanding tasks. Instruction tuning (i.e., fine-tuning models on datasets of instructions paired with desired outputs) has been key to improving model performance. However, creating diverse instruction-tuning datasets is challenging due to high annotation costs and the complexity of capturing temporal information in videos. Existing approaches often rely on large language models to generate instruction-output pairs, which can limit diversity and lead to responses that lack grounding in the video content.
To address this, we propose VideoSAVi (Self-Aligned Video Language Model), a novel self-training pipeline that enables VLMs to generate their own training data without extensive manual annotation. The process involves three stages: (1) generating diverse video-specific questions, (2) producing multiple candidate answers, and (3) evaluating these responses for alignment with the video content. This self-generated data is then used for direct preference optimization (DPO), allowing the model to learn from its own highest-quality outputs and improve alignment with video content.
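The three-stage data-generation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper callables (`generate_question`, `sample_answer`, `score_answer`) are hypothetical stand-ins for the model's own generation and self-evaluation steps, and the temperature set is assumed.

```python
def build_preference_pairs(captions, generate_question, sample_answer, score_answer,
                           temperatures=(0.2, 0.7, 1.0)):
    """Sketch of the self-training data loop:
    (1) generate a video-specific question from each caption,
    (2) sample candidate answers at several temperatures,
    (3) self-evaluate candidates and keep the best/worst as a DPO pair.
    All three helpers are hypothetical stand-ins for model calls."""
    pairs = []
    for caption in captions:
        question = generate_question(caption)
        # Varying temperature balances focused and creative responses.
        candidates = [sample_answer(question, t) for t in temperatures]
        ranked = sorted(candidates, key=score_answer, reverse=True)
        # Highest-scoring answer becomes "chosen", lowest "rejected".
        pairs.append({"question": question,
                      "chosen": ranked[0],
                      "rejected": ranked[-1]})
    return pairs
```

In the actual pipeline each helper would be a call to the VLM itself (with a structured evaluation template for scoring); the pair format matches what standard DPO training code expects.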
Our experiments demonstrate that even smaller models (0.5B and 7B parameters) can effectively use this self-training approach, outperforming previous methods and achieving results comparable to those trained on proprietary preference data. VideoSAVi shows significant improvements across multiple benchmarks: up to 28% on multiple-choice QA, 8% on zero-shot open-ended QA, and 12% on temporal reasoning benchmarks. These results demonstrate the effectiveness of our self-training approach in enhancing video understanding while reducing dependence on proprietary models.
Developing high-quality video instruction datasets is expensive and often relies on proprietary models like GPT-4V. There is also a pressing need to improve temporal understanding in video-language models while maintaining visual grounding in their responses. This raises key research questions: How can synthetic data be leveraged to enhance Video-LLM performance without expensive human annotations or proprietary APIs? Moreover, how can we ensure that synthetic data aligns with video content to maintain the accuracy and relevance of model responses?
VideoSAVi operates through a five-stage pipeline designed to enhance video-language models. First, it generates diverse video-specific questions (“What,” “Why,” and “How”) from captions, targeting aspects like visual recognition and temporal reasoning. Next, it produces multiple candidate answers at varying temperatures, balancing focused and creative responses. The model then evaluates these answers using a structured template to score relevance, accuracy, temporal grounding, and clarity. This is followed by filtering responses with CLIP similarity scores to ensure alignment with video content, adding a crucial visual grounding check. Finally, the model is fine-tuned using a CLIP-adjusted DPO loss, optimizing for both visual and linguistic quality.
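The final stage pairs the standard DPO objective with a CLIP-based visual grounding signal. The sketch below assumes a simple additive adjustment, where the DPO margin is shifted by the gap in CLIP similarity between the chosen and rejected responses; the paper's exact formulation may differ, and `alpha` is an assumed weighting hyperparameter.

```python
import math

def clip_adjusted_dpo_loss(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected,
                           clip_sim_chosen, clip_sim_rejected,
                           beta=0.1, alpha=1.0):
    """Per-example DPO loss with an assumed additive CLIP adjustment.

    Standard DPO margin: beta * [(log pi(y_w) - log pi_ref(y_w))
                                 - (log pi(y_l) - log pi_ref(y_l))],
    here shifted by alpha * (CLIP sim of chosen - CLIP sim of rejected)
    so responses better grounded in the video are preferred more strongly.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Hypothetical visual-grounding term: reward CLIP-similarity advantage.
    margin += alpha * (clip_sim_chosen - clip_sim_rejected)
    # -log(sigmoid(margin)); always positive, shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In practice the log-probabilities come from the policy and frozen reference model, and the CLIP similarities from scoring each response against sampled video frames; averaging this loss over the self-generated preference pairs gives the training objective.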
| Method | LLM | Action | Direction | Speed | Event | Attribute | Average |
|---|---|---|---|---|---|---|---|
| VideoSAVi-Vicuna | 7B | 67.7 | 41.4 | 39.4 | 40.3 | 49.9 | 47.7 |
| VideoSAVi-Qwen | 0.5B | 64.2 | 42.7 | 44.7 | 44.5 | 47.9 | 48.8 |
| VideoSAVi-Qwen | 7B | 83.2 | 42.3 | 48.4 | 46.9 | 49.6 | 54.1 |
@article{kulkarni2024videosavi,
title={VideoSAVi: Self-Aligned Video Language Models without Human Supervision},
author={Kulkarni, Yogesh and Fazli, Pooyan},
journal={arXiv preprint arXiv:2412.00624},
year={2024}
}