AVATAR

Reinforcement Learning to See, Hear, and Reason Over Video

Arizona State University
AVATAR Framework Overview

Overview of the AVATAR framework. AVATAR enhances standard GRPO with two key components: (1) an off-policy architecture using a stratified replay buffer to improve data efficiency, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that focuses learning on critical reasoning steps.

Abstract

Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps.

We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning.

AVATAR delivers strong performance across diverse benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while being over 35% more sample-efficient. These results show that targeted RL improvements, rather than large-scale architectural changes, effectively address core multimodal reasoning challenges.

Method

AVATAR is an off-policy reinforcement learning framework that enhances Group Relative Policy Optimization (GRPO) for multimodal video understanding. The framework addresses three core limitations of standard GRPO through two main innovations.

Off-Policy Architecture with Stratified Replay Buffer: AVATAR employs a stratified replay buffer divided into three dynamic tiers: Easy (25%), Medium (35%), and Hard (40%). Experiences are assigned based on the policy's moving average reward for each prompt, forming a progressive learning curriculum. This addresses data inefficiency by reusing past experiences and mitigates the vanishing advantage problem by maintaining reward diversity.
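As a concrete illustration, the sketch below shows one way such a three-tier buffer could be implemented. The tier names and the 25%/35%/40% sampling mix follow the description above; the capacities, difficulty thresholds, and moving-average decay are illustrative assumptions rather than the paper's exact settings.

```python
import random
from collections import defaultdict, deque

class StratifiedReplayBuffer:
    """Minimal sketch of a three-tier replay buffer keyed by per-prompt difficulty.
    Tier names and the 25/35/40 sampling split follow the text; capacity,
    thresholds, and EMA decay are illustrative assumptions."""

    def __init__(self, capacity_per_tier=2048, easy_thresh=0.75, hard_thresh=0.35, ema_decay=0.9):
        self.tiers = {t: deque(maxlen=capacity_per_tier) for t in ("easy", "medium", "hard")}
        self.sampling_mix = {"easy": 0.25, "medium": 0.35, "hard": 0.40}
        self.easy_thresh, self.hard_thresh = easy_thresh, hard_thresh
        self.ema_decay = ema_decay
        self.avg_reward = defaultdict(lambda: 0.5)  # moving-average reward per prompt

    def add(self, prompt_id, rollout, reward):
        # Update the policy's moving-average reward for this prompt, then file
        # the rollout into a tier so harder prompts stay over-represented.
        ema = self.ema_decay * self.avg_reward[prompt_id] + (1 - self.ema_decay) * reward
        self.avg_reward[prompt_id] = ema
        if ema >= self.easy_thresh:
            tier = "easy"
        elif ema <= self.hard_thresh:
            tier = "hard"
        else:
            tier = "medium"
        self.tiers[tier].append((prompt_id, rollout, reward))

    def sample(self, batch_size):
        # Draw from each tier in the fixed 25/35/40 proportion, falling back to
        # whatever experiences exist early in training when a tier is empty.
        batch = []
        for tier, frac in self.sampling_mix.items():
            pool = self.tiers[tier] or [x for d in self.tiers.values() for x in d]
            if not pool:
                continue
            k = max(1, int(round(frac * batch_size)))
            batch.extend(random.choices(pool, k=k))
        return batch[:batch_size]
```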

Temporal Advantage Shaping (TAS): TAS addresses uniform credit assignment by applying a U-shaped weighting curve that amplifies advantages during crucial planning (beginning) and synthesis (end) stages. For a reasoning sequence of length L, each token's position t is normalized to [0,1], and weights are computed as: w_t = 1.0 + λ_TAS · (2t̃ - 1)², where t̃ = t/(L-1).

Temporal Advantage Shaping U-shaped Curve
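The sketch below computes these U-shaped weights and applies them to a sequence of token-level advantages. The value of λ_TAS used here is an arbitrary illustrative choice, not the paper's setting.

```python
import numpy as np

def tas_weights(seq_len: int, lambda_tas: float = 0.5) -> np.ndarray:
    """U-shaped weights w_t = 1 + lambda_TAS * (2*t_tilde - 1)^2 with
    t_tilde = t / (L - 1). lambda_tas = 0.5 is an illustrative value."""
    t_tilde = np.arange(seq_len) / max(seq_len - 1, 1)
    return 1.0 + lambda_tas * (2.0 * t_tilde - 1.0) ** 2

def shape_advantages(token_advantages, lambda_tas: float = 0.5) -> np.ndarray:
    """Scale token-level advantages so early (planning) and late (synthesis)
    tokens carry more weight, while middle tokens keep weight ~1."""
    adv = np.asarray(token_advantages, dtype=np.float32)
    return adv * tas_weights(len(adv), lambda_tas)

# Example: for a 5-token sequence with lambda_tas = 0.5, the weights are
# [1.5, 1.125, 1.0, 1.125, 1.5], i.e. a U-shape over normalized position.
```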

Three-Stage Training Pipeline: AVATAR is trained through a progressive curriculum: Stage 1 develops general visual reasoning, Stage 2 introduces audio-visual alignment, and Stage 3 addresses fine-grained audio-based object localization. Each stage uses dedicated reward functions, including format rewards, accuracy rewards, self-rewarding mechanisms, and stepwise reasoning judges (sketched below).

Three-Stage RL Training Pipeline
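To make the stage-wise reward composition concrete, here is a hedged sketch of how the listed reward terms might be combined. Only the set of reward types per stage follows the description above; the tag convention, per-term weights, and helper functions are assumptions introduced for illustration.

```python
import re

def check_format(response: str) -> bool:
    """Format reward: reasoning and answer wrapped in tags.
    The exact tag convention is an assumption, not taken from the paper."""
    return bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S))

def check_answer(response: str, reference: str) -> bool:
    """Accuracy reward: exact match on the extracted answer (illustrative)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return bool(m) and m.group(1).strip().lower() == reference.strip().lower()

# Which reward terms are active at each curriculum stage (weights are assumptions).
STAGE_REWARDS = {
    1: {"format": 0.1, "accuracy": 1.0},                        # general visual reasoning
    2: {"format": 0.1, "accuracy": 1.0, "self_reward": 0.5},    # audio-visual alignment
    3: {"format": 0.1, "accuracy": 1.0, "stepwise_judge": 0.5}, # audio-based localization
}

def compute_reward(stage: int, response: str, reference: str, judge_scores: dict = None) -> float:
    """Sum the active reward terms for the given stage. `judge_scores` carries
    externally computed scores for the self-rewarding / stepwise-judge terms."""
    terms = STAGE_REWARDS[stage]
    judge_scores = judge_scores or {}
    reward = terms.get("format", 0.0) * float(check_format(response))
    reward += terms.get("accuracy", 0.0) * float(check_answer(response, reference))
    for key in ("self_reward", "stepwise_judge"):
        if key in terms:
            reward += terms[key] * judge_scores.get(key, 0.0)
    return reward
```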

Video-Context Reference Score (VCRS): To ensure stable learning signals, AVATAR introduces VCRS, which acts as a multiplicative factor in advantage calculation using a moving average of rewards over the last 20 processed instances, preventing zero-valued advantages that would stall learning.
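Since VCRS is described only at a high level here, the sketch below shows the bookkeeping it implies: a moving average over the last 20 rewards used as a factor when shaping advantages. The clipping floor and the fallback rule in `shaped_advantage` are assumptions for illustration, not the paper's exact formulation.

```python
from collections import deque

class VCRS:
    """Sketch of the Video-Context Reference Score bookkeeping: a moving average
    of the rewards from the last 20 processed instances."""

    def __init__(self, window: int = 20, floor: float = 0.1):
        self.recent = deque(maxlen=window)
        self.floor = floor  # assumed lower bound so the factor never reaches zero

    def update(self, reward: float) -> None:
        self.recent.append(reward)

    def factor(self) -> float:
        # Reference score = moving average of recent rewards, clipped away from 0.
        if not self.recent:
            return 1.0
        return max(self.floor, sum(self.recent) / len(self.recent))

def shaped_advantage(raw_advantage: float, reward: float, vcrs: VCRS) -> float:
    # Illustrative use (assumed): multiply the factor into the per-sample advantage,
    # falling back to (reward - reference) when the within-group advantage is zero
    # so that identical group rewards do not stall the update.
    if abs(raw_advantage) > 1e-6:
        return raw_advantage * vcrs.factor()
    return reward - vcrs.factor()
```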

Results

Audio-Visual Reasoning Performance

AVATAR significantly outperforms the Qwen2.5-Omni baseline on audio-visual reasoning tasks, with absolute gains of +4.9 on OmniBench, +3.0 on DailyOmni, and +2.3 on AV-Odyssey. The framework also outperforms other GRPO-based methods, surpassing AV-Reasoner by 0.8 and Omni-R1 by 2.2 on OmniBench alignment tasks.

Audio-Visual Benchmark Results (Table 1)

General Video Understanding Results

AVATAR achieves state-of-the-art performance on general video understanding benchmarks, including MVBench (66.4) and LVBench (38.4), and excels in complex reasoning tasks. Notable improvements include +4.5 on Video-Holmes and +5.4 on MMVU, demonstrating the effectiveness of the stratified replay buffer and TAS for multi-step causal inference.

General Video Understanding Results (Table 2)

Sample Efficiency & Ablation Studies

AVATAR demonstrates superior sample efficiency, reaching an accuracy reward of 0.75 within 2,500 iterations, whereas baseline GRPO plateaus around 0.4, a more than 35% gain in sample efficiency. Component-wise analysis shows that the replay buffer and TAS contribute complementary benefits.

Sample Efficiency Comparison (Figure 4)
Component-wise Ablation Study (Table 3)

The ablation studies validate the effectiveness of each component. The curriculum design shows clear progression with Stage 1 RL yielding the largest improvements over SFT, while the U-shaped TAS weighting consistently outperforms linear alternatives and uniform baselines.

Curriculum Design and TAS Ablation (Table 4)

Qualitative Examples

Qualitative examples illustrate AVATAR's advantages in cross-modal integration, temporal reasoning, and contextual understanding over baseline GRPO; a representative comparison is shown below.

Qualitative Comparison Examples (Figure 9)

In the examples above, AVATAR demonstrates better cross-modal integration by linking visual cues ("tense expression, his eyes darting around") with audio analysis ("hurried and tense tone when he speaks"), while baseline GRPO makes disconnected observations. AVATAR also shows improved temporal reasoning by tracking emotional progression ("tone shifting from calm to anxious") and enhanced contextual understanding through precise dialogue interpretation ("Sorry, I have a train to catch" indicating abrupt departure).

BibTeX

@misc{kulkarni2025avatarreinforcementlearningsee,
      title={AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video}, 
      author={Yogesh Kulkarni and Pooyan Fazli},
      year={2025},
      eprint={2508.03100},
      archivePrefix={arXiv},
}