Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos, which capture a scene from an external observer's vantage point, egocentric videos reflect the actor's continuously changing viewpoint, introducing partial observability, a limited field of view, and self-referenced motion.
We introduce EgoVITA, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an egocentric planning phase, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an exocentric verification phase, where it switches to a third-person perspective to check the visual and logical consistency of that plan.
Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by +7.7 on EgoBlind and +4.4 on EgoOrient, while maintaining strong generalization on exocentric video tasks.
EgoVITA is a reinforcement learning framework built on Group Relative Policy Optimization (GRPO) that addresses egocentric video reasoning through structured planning and verification. The framework separates reasoning into two complementary components that operate from different viewpoints.
Stage I: Supervised Fine-Tuning (SFT): The policy model is first fine-tuned to produce structured outputs with three components: egocentric planning, exocentric verification, and final answer generation. This stage establishes a stable base policy that reliably emits the structured format; a hypothetical layout is sketched below.
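As a concrete illustration, an SFT training target might be assembled as follows. This is a minimal sketch: the tag names (`<plan>`, `<verify>`, `<answer>`) and the example text are placeholders, not the paper's actual format markers.

```python
# Hypothetical SFT target layout; tags and wording are illustrative only.
SFT_TARGET_TEMPLATE = (
    "<plan>{egocentric_plan}</plan>\n"
    "<verify>{exocentric_verification}</verify>\n"
    "<answer>{final_answer}</answer>"
)

example_target = SFT_TARGET_TEMPLATE.format(
    egocentric_plan="1. Reach toward the counter. 2. Pick up the knife.",
    exocentric_verification=(
        "Viewed from outside, the hand trajectory and knife position "
        "are consistent with steps 1-2."
    ),
    final_answer="The person is preparing to slice the vegetables.",
)
print(example_target)
```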
Stage II: GRPO Optimization: The model generates multiple reasoning trajectories for each video-query pair and scores them using a composite reward function. The policy is then refined based on relative performance within each group, enabling exploration of diverse reasoning paths while maintaining stability.
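The group-relative scoring at the core of GRPO can be sketched in a few lines: rewards for the rollouts of one video-query pair are normalized against each other, so no learned value critic is required. This is the standard GRPO advantage formulation; the group size and any clipping details used in the paper may differ.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each rollout's reward is normalized
    against the other rollouts sampled for the same video-query pair."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: composite-reward scores for 4 rollouts of one video-query pair.
advantages = grpo_advantages(torch.tensor([0.9, 0.4, 0.7, 0.2]))
# Positive advantages up-weight trajectories that beat their group's mean.
```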
ACMG is a novel dense reward mechanism that ensures generated plans are temporally predictive and visually grounded. Each plan clause is projected into visual space via a trainable Anticipation Head and compared to future frames using cosine similarity. The reward measures how well the predicted visual embedding matches any of the next N=16 frames, encouraging the model to anticipate what will happen next rather than just describing the current scene.
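A minimal sketch of this reward is given below, assuming a max over the next N=16 frames (the "matches any" reading) and a mean over clauses; the per-clause aggregation, the single-linear-layer head, and all embedding dimensions are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnticipationHead(nn.Module):
    """Projects a plan-clause text embedding into the visual embedding
    space. One linear layer and these dimensions are placeholders."""
    def __init__(self, text_dim: int = 1024, vis_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(text_dim, vis_dim)

    def forward(self, clause_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(clause_emb)

def acmg_reward(clause_embs: torch.Tensor,
                future_frame_embs: torch.Tensor,
                head: AnticipationHead,
                n_future: int = 16) -> torch.Tensor:
    """clause_embs:       (C, text_dim) text embeddings of the plan clauses.
    future_frame_embs:    (T, vis_dim)  embeddings of upcoming frames, T >= n_future.
    Each clause takes its best cosine similarity against any of the next
    n_future frames; clause scores are averaged into one scalar reward."""
    pred = F.normalize(head(clause_embs), dim=-1)               # (C, vis_dim)
    frames = F.normalize(future_frame_embs[:n_future], dim=-1)  # (N, vis_dim)
    sims = pred @ frames.T                                      # (C, N)
    return sims.max(dim=1).values.mean()
```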
The temporal grounding heatmap (right) shows how different plan clauses align with future frames. Earlier clauses ground to near-future frames while later clauses align with more distant events, demonstrating that the model learns temporally structured anticipation.
EgoVITA uses a weighted combination of four reward components: (1) Format Reward, (2) Answer Reward, (3) ACMG Reward, and (4) Confidence Reward. To prevent catastrophic forgetting, we periodically interleave GRPO updates with lightweight exocentric regularization on held-out VideoQA data.
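The combination itself is a plain weighted sum. In this sketch the weights are placeholders; the paper's actual coefficients are not reproduced here.

```python
def composite_reward(r_format: float, r_answer: float,
                     r_acmg: float, r_conf: float,
                     w=(0.25, 1.0, 0.5, 0.25)) -> float:
    """Weighted sum of the four reward components; weights are illustrative."""
    return w[0] * r_format + w[1] * r_answer + w[2] * r_acmg + w[3] * r_conf
```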
EgoVITA consistently improves egocentric video understanding across multiple foundation models. On Qwen2.5-VL-7B, it achieves substantial gains of +7.7 on EgoBlind, +3.7 on EgoThink, and +4.4 on EgoOrient. Importantly, EgoVITA not only improves egocentric reasoning but also maintains or enhances performance on exocentric benchmarks.
EgoVITA outperforms recent egocentric reasoning models including EgoThinker and EgoVLM. Despite using far fewer training samples (47k vs. 5M for EgoThinker), EgoVITA achieves higher accuracy, demonstrating the effectiveness of dense multimodal rewards.
t-SNE visualization of the ACMG embedding space reveals semantically structured clusters. Strong alignment between text clauses, predicted visual embeddings, and actual matched frames confirms that EgoVITA learns meaningful task semantics.
Comprehensive ablation studies validate the importance of each component and the stability of the framework.
The ACMG and Confidence rewards are complementary, with the full system achieving greater improvements than either alone.
Removing either the egocentric planning stage or the exocentric verification stage degrades performance on the corresponding task types.
Comparison with present-frame grounding confirms that anticipatory grounding substantially outperforms describing only the current scene.
Exocentric regularization (solid red line) reverses the catastrophic forgetting observed with standard SFT (dotted red line), recovering the lost performance.
Qualitative comparisons demonstrate EgoVITA's superior reasoning capabilities. For a blind person crossing an intersection, EgoVITA generates sequential, safety-critical actions grounded in specific visual frames.
@article{kulkarni2025egovita,
  title={EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  journal={arXiv preprint arXiv:2511.18242},
  year={2025}
}