AVATAR

Reinforcement Learning to See, Hear, and Reason Over Video

Arizona State University
CVPR 2026
AVATAR Teaser

AVATAR addresses three core GRPO limitations: data inefficiency via a stratified off-policy replay buffer, the vanishing advantage problem via reward-diverse group sampling, and uniform credit assignment via Temporal Advantage Shaping (TAS).

Abstract

Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps.

We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a credit assignment strategy that emphasizes early (planning) and late (synthesis) reasoning phases.

AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes. It also surpasses standard GRPO by +3.7 on OmniBench and +1.9 on Video-Holmes, while demonstrating 5× greater sample efficiency: it needs 80% fewer generated completions to reach target performance.

Method

Off-Policy Architecture with Stratified Replay Buffer

AVATAR employs a stratified replay buffer B (size 10k) divided into three tiers: Easy (25%), Medium (35%), and Hard (40%). Tier assignment is driven by each prompt's moving average reward R̄(q) — the bottom 40% by score go to the Hard tier, ensuring the model repeatedly engages with its hardest failure modes. A balanced 4 on-policy / 4 off-policy split per group empirically maximizes performance.
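For concreteness, here is a minimal Python sketch of the mechanism. The 10k capacity, the 25/35/40 tier split, the bottom-40% Hard cutoff, and the balanced 4/4 group split follow the description above; the class and function names, the exponential moving average for R̄(q), the eviction policy, and the use of the tier fractions as sampling quotas are our assumptions, not the paper's implementation.

import random
from collections import deque

class StratifiedReplayBuffer:
    """Illustrative sketch of a stratified off-policy replay buffer.

    Prompts are ranked by moving-average reward R_bar(q); the bottom 40%
    form the Hard tier, the next 35% Medium, and the top 25% Easy.
    """

    TIER_FRACTIONS = {"hard": 0.40, "medium": 0.35, "easy": 0.25}

    def __init__(self, capacity=10_000, ema_decay=0.9):
        self.completions = deque(maxlen=capacity)   # (prompt_id, completion, reward)
        self.avg_reward = {}                        # prompt_id -> moving-average reward
        self.ema_decay = ema_decay                  # assumed EMA decay, not from the paper

    def add(self, prompt_id, completion, reward):
        # Update the prompt's moving-average reward and store the completion.
        prev = self.avg_reward.get(prompt_id, reward)
        self.avg_reward[prompt_id] = self.ema_decay * prev + (1 - self.ema_decay) * reward
        self.completions.append((prompt_id, completion, reward))

    def tier_of(self, prompt_id):
        # Rank prompts by moving-average reward (worst first):
        # bottom 40% -> Hard, next 35% -> Medium, top 25% -> Easy.
        ranked = sorted(self.avg_reward, key=self.avg_reward.get)
        rank = ranked.index(prompt_id) / max(len(ranked) - 1, 1)
        if rank < 0.40:
            return "hard"
        return "medium" if rank < 0.75 else "easy"

    def sample(self, k=4):
        """Draw k replayed completions, stratified across the three tiers."""
        batch = []
        for tier, frac in self.TIER_FRACTIONS.items():
            pool = [c for c in self.completions if self.tier_of(c[0]) == tier]
            n = min(round(k * frac), len(pool))
            batch.extend(random.sample(pool, n))
        return batch[:k]


def build_group(on_policy_completions, buffer, group_size=8):
    """Balanced 4 on-policy / 4 off-policy group, as described above."""
    half = group_size // 2
    return on_policy_completions[:half] + buffer.sample(half)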

Hinting Mechanism

When a prompt remains in the Hard tier and its KL divergence from the behavior policy drops (i.e., the policy has stopped exploring), a pre-computed hint is injected: a short strategic suggestion (e.g., "first locate the object making the sound, then count") generated by a teacher model from the full problem context. This unlocks the hardest 20% of the distribution, which would otherwise yield zero reward.
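As a rough sketch of the trigger logic only (the KL threshold value, the function name, and the prompt formatting are our assumptions; the two conditions, persistent Hard-tier membership and a collapsing KL to the behavior policy, come from the description above):

KL_EXPLORATION_FLOOR = 0.02   # hypothetical threshold; the paper's value is not given here

def maybe_inject_hint(prompt_id, prompt_text, tier, kl_to_behavior, hint_cache):
    """Prepend a pre-computed teacher hint when a Hard-tier prompt stops exploring.

    hint_cache maps prompt_id -> a short strategic suggestion generated offline
    by a teacher model from the full problem context.
    """
    if tier == "hard" and kl_to_behavior < KL_EXPLORATION_FLOOR:
        hint = hint_cache.get(prompt_id)
        if hint:
            # e.g. "first locate the object making the sound, then count"
            return f"Hint: {hint}\n\n{prompt_text}"
    return prompt_text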

Temporal Advantage Shaping (TAS)

TAS replaces GRPO's uniform credit assignment with a U-shaped parabolic weight. For a sequence of length L, each token at normalized position t̃ = t/(L−1) receives:

w_t = 1.0 + λ_TAS · (2t̃ − 1)²

The weight is minimal (1.0) at the midpoint and maximal (1.0 + λ_TAS) at the beginning (planning) and end (synthesis). No learned critic is required.
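Read as code, the weighting is a short function over token positions (a minimal sketch; the function name and the example value λ_TAS = 0.5 are ours, not the paper's):

def tas_weights(seq_len, lam=0.5):
    """U-shaped TAS weights: w_t = 1.0 + lam * (2 * t_norm - 1) ** 2.

    t_norm = t / (seq_len - 1), so the first and last tokens (planning and
    synthesis) get weight 1 + lam and the midpoint gets weight 1.0.
    """
    if seq_len == 1:
        return [1.0 + lam]
    return [1.0 + lam * (2 * t / (seq_len - 1) - 1) ** 2 for t in range(seq_len)]

# Example: tas_weights(5) -> [1.5, 1.125, 1.0, 1.125, 1.5]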

TAS U-shaped weighting

Figure 2. TAS U-shaped parabolic weight amplifies the planning (start) and synthesis (end) of each reasoning chain.

Three-Stage Training Pipeline

Stage 0 is a cold-start SFT phase, followed by three RL stages: Stage 1 targets general visual reasoning; Stage 2 targets audio-visual alignment, using audio captions from Kimi-Audio; Stage 3 targets fine-grained audio-based localization, with a stepwise judge reward from InternVL3. TAS is applied throughout all RL stages.
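Purely as an illustration of the schedule, the curriculum can be written down as a small config; the field names are hypothetical, and only the stage goals, data sources, and judge models come from the description above:

CURRICULUM = [
    {"stage": 0, "kind": "sft", "goal": "cold start"},
    {"stage": 1, "kind": "rl",  "goal": "general visual reasoning"},
    {"stage": 2, "kind": "rl",  "goal": "audio-visual alignment",
     "data": "audio captions from Kimi-Audio"},
    {"stage": 3, "kind": "rl",  "goal": "fine-grained audio-based localization",
     "reward": "stepwise judge reward from InternVL3"},
]
# TAS is applied in every RL stage (stages 1-3).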

Three-Stage Training Pipeline

Figure 3. AVATAR's three-stage RL curriculum with progressively harder tasks and richer reward functions.

Results

Table 1 — Audio-Visual Reasoning

AVATAR vs. state-of-the-art audio-video understanding models. Gains over each base model are shown in parentheses; significance is assessed with 95% confidence intervals. † marks gains that are not statistically significant.

Model | OmniBench | DailyOmni | AV-Counting | AV-Odyssey | WorldSense | IntentBench
Ola-7B (Baseline) | 45.3 | 52.3 | 17.4 | 25.6 | 44.2 | 59.1
Ola-7B + GRPO | 46.8 (+1.5) | 54.1 (+1.8) | 18.2 (+0.8) | 27.0 (+1.4) | 44.7 (+0.5) | 60.3 (+1.2)
Ola-7B + AVATAR | 47.2 (+1.9) | 55.7 (+3.4) | 19.5 (+2.1) | 28.8 (+3.2) | 45.0 (+0.8) | 61.9 (+2.8)
Qwen2.5-Omni (Baseline) | 44.2 | 44.0 | 22.3 | 29.8 | 44.2 | 63.7
Qwen2.5-Omni + GRPO | 45.4 (+1.2) | 44.8 (+0.8) | 22.8 (+0.5) | 31.3 (+1.5) | 45.1 (+0.9) | 63.8 (+0.1)†
Qwen2.5-Omni + AVATAR | 49.1 (+4.9) | 47.0 (+3.0) | 23.1 (+0.8) | 32.1 (+2.3) | 46.0 (+1.8) | 63.9 (+0.2)

Table 2 — General Video Understanding & Reasoning

AVATAR vs. state-of-the-art video models. † not statistically significant.

Model | MVBench | Video-MME | LVBench | Video-Holmes | MMVU | TOMATO
(General video understanding: MVBench, Video-MME, LVBench. Video reasoning: Video-Holmes, MMVU, TOMATO.)
Ola-7B (Baseline) | 40.1 | 59.1 | 35.5 | 40.1 | 56.6 | 25.3
Ola-7B + GRPO | 42.5 (+2.4) | 60.2 (+1.1) | 36.0 (+0.5) | 41.3 (+1.2) | 57.0 (+0.4) | 25.9 (+0.6)
Ola-7B + AVATAR | 45.4 (+5.3) | 61.4 (+2.3) | 36.6 (+1.1) | 42.4 (+2.3) | 57.3 (+0.7) | 26.6 (+1.3)
Qwen2.5-Omni (Baseline) | 66.1 | 58.3 | 37.2 | 40.6 | 60.2 | 29.0
Qwen2.5-Omni + GRPO | 66.3 (+0.2) | 60.5 (+2.2) | 37.8 (+0.6) | 43.2 (+2.6) | 64.0 (+3.8) | 29.2 (+0.2)†
Qwen2.5-Omni + AVATAR | 66.4 (+0.3)† | 62.8 (+4.5) | 38.4 (+1.2) | 45.1 (+4.5) | 65.6 (+5.4) | 30.8 (+1.8)

AVATAR Resolves the Vanishing Advantage Problem

GRPO's advantage distribution collapses to zero when all responses in a group receive similar rewards. AVATAR's stratified replay buffer mixes historically hard and easy samples, shifting the distribution from a zero-centered spike to a bimodal shape.
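A quick numerical illustration, using the standard group-relative advantage (reward minus group mean, scaled by the group standard deviation); the reward values here are invented:

import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used by GRPO-style objectives."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# On-policy group where every completion gets the same reward: no learning signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))   # [0.0, 0.0, 0.0, 0.0]

# Mixing replayed hard/easy samples restores reward diversity and nonzero advantages.
print(group_advantages([0.0, 0.0, 1.0, 1.0]))   # approximately [-1, -1, +1, +1]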

Advantage distribution comparison

Figure 4. Advantage distribution under GRPO (left) collapses to zero. AVATAR (right) maintains a bimodal distribution enabling effective gradient flow.


5× Sample Efficiency

AVATAR reaches 0.80 training accuracy in 400 iterations (1,600 unique completions), whereas GRPO fails to reach it even after 1,000 iterations (8,000 completions). This corresponds to 5× greater sample efficiency, or 80% fewer generated completions.

Sample efficiency comparison

Figure 5. Training accuracy (left) and reward (right) over completions. AVATAR reaches target performance in 5× fewer samples than GRPO.


Table 3 — Component-Wise Ablation

Each component addresses a specific GRPO limitation: the replay buffer resolves data inefficiency and vanishing advantages, TAS improves credit assignment, and hinting helps escape local optima.

Qwen2.5-Omni | OmniBench | DailyOmni | AV-Odyssey | Video-MMMU | VSI-Bench | Video-TT
(Audio-visual: OmniBench, DailyOmni, AV-Odyssey. Video reasoning: Video-MMMU, VSI-Bench, Video-TT.)
Baseline | 44.2 | 44.0 | 29.8 | 46.8 | 25.4 | 41.8
+ GRPO | 45.4 (+1.2) | 44.8 (+0.8) | 31.3 (+1.5) | 48.1 (+1.3) | 25.9 (+0.5) | 43.0 (+1.2)
+ TAS Only (w/ On-Policy GRPO) | 45.1 (+0.9) | 45.4 (+1.4) | 31.4 (+1.6) | 49.0 (+2.2) | 26.5 (+1.1) | 43.8 (+2.0)
+ Replay Buffer (w/ Uniform Credit) | 47.8 (+3.6) | 45.9 (+1.9) | 31.6 (+1.8) | 48.5 (+1.7) | 26.1 (+0.7) | 43.3 (+1.5)
+ AVATAR (Full) | 49.1 (+4.9) | 47.0 (+3.0) | 32.1 (+2.3) | 49.4 (+2.6) | 26.8 (+1.4) | 44.2 (+2.4)

Table 4 — Training Curriculum & Advantage Shaping Ablation

The staged curriculum yields consistent improvements at each stage. TAS's parabolic weighting outperforms all alternative weighting schemes, including uniform, linear, and inverse parabolic.

Setting | OmniBench | DailyOmni | AV-Odyssey | Video-Holmes | MMVU | TOMATO
Training Curriculum (Qwen2.5-Omni baseline: 44.2 / 44.0 / 29.8 / 40.6 / 60.2 / 29.0)
SFT Only | 45.8 (+1.6) | 45.2 (+1.2) | 30.1 (+0.3) | 41.8 (+1.2) | 62.1 (+1.9) | 28.9 (-0.1)
SFT + Stage 1 RL | 47.2 (+3.0) | 45.8 (+1.8) | 30.6 (+0.8) | 43.5 (+2.9) | 64.2 (+4.0) | 29.4 (+0.4)
SFT + Stages 1–2 RL | 48.6 (+4.4) | 46.7 (+2.7) | 31.8 (+2.0) | 44.2 (+3.6) | 65.1 (+4.9) | 30.1 (+1.1)
SFT + Stages 1–3 RL (AVATAR) | 49.1 (+4.9) | 47.0 (+3.0) | 32.1 (+2.3) | 45.1 (+4.5) | 65.6 (+5.4) | 30.8 (+1.8)
Advantage Shaping Strategy
Uniform (GRPO) | 47.8 (+3.6) | 45.9 (+1.9) | 31.6 (+1.8) | 43.7 (+3.1) | 64.8 (+4.6) | 29.3 (+0.3)
Inverse Parabolic | 46.5 (+2.3) | 45.1 (+1.1) | 30.8 (+1.0) | 42.8 (+2.2) | 63.5 (+3.3) | 29.1 (+0.1)
TAS — Parabolic (Ours) | 49.1 (+4.9) | 47.0 (+3.0) | 32.1 (+2.3) | 45.1 (+4.5) | 65.6 (+5.4) | 30.8 (+1.8)

TAS Gains Scale with Reasoning Length

TAS consistently improves performance across reasoning lengths, with gains amplifying as sequences grow longer. The advantage is most pronounced for extended sequences (400+ tokens), particularly on challenging benchmarks like MMVU and Video-Holmes, where uniform credit would dilute the reward signal most severely.

TAS gain vs reasoning length

Figure 6. TAS improvement over GRPO vs. reasoning sequence length on audio-visual (left) and video reasoning (right) benchmarks.

Qualitative Examples

AVATAR demonstrates superior cross-modal integration — linking visual cues ("tense expression, eyes darting around") with audio analysis ("hurried and tense tone when he speaks") — while baseline GRPO makes disconnected observations. AVATAR tracks emotional progression ("tone shifting from calm to anxious") and interprets precise dialogue ("Sorry, I have a train to catch" indicating abrupt departure), rather than falling back on surface-level genre classification.

Qualitative comparison: AVATAR vs GRPO

Qualitative comparison of AVATAR vs. baseline GRPO on an audio-visual reasoning example from Video-Holmes.

BibTeX

@inproceedings{kulkarni2026avatar,
  title={AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}