FrameOracle

Learning What to See and How Much to See in Videos

ICML 2026

Chaoyu Li1,2, Tianzhi Li1,3, Fei Tao1, Zhenyu Zhao1, Ziqian Wu1, Maozheng Zhao1, Juntong Song1, Cheng Niu1, Pooyan Fazli2
1NewsBreak   2Arizona State University   3Carnegie Mellon University
FrameOracle teaser figure

FrameOracle selects the frames most relevant to a video question and dynamically predicts the number of frames needed by the downstream VLM.

Abstract

Vision-language models are increasingly capable of video understanding, but they still operate under tight computational budgets. Their performance often depends on whether a small set of input frames contains the evidence needed to answer a question. Existing uniform or fixed-budget sampling strategies do not adapt to content density or task complexity. FrameOracle addresses this limitation with a lightweight, plug-and-play frame selector that predicts both which frames are relevant and how many frames should be retained for each video-query pair. It is trained through a curriculum that progresses from weak proxy signals to stronger supervision from FrameOracle-41K, a large-scale VideoQA dataset with validated keyframe annotations specifying minimal sufficient frames. Across five VLMs and six benchmarks, FrameOracle reduces 16-frame inputs to 10.4 frames on average without accuracy loss, and reduces 64-frame candidates to 13.9 frames while improving accuracy by 1.5%.

Why This Matters

Video VLMs often cannot process every frame of a long video within their computational budget. Sampling a fixed number of frames is simple, but it ignores the fact that some questions hinge on a few key moments while others need broader temporal coverage.

FrameOracle turns frame selection into a learned, query-conditioned decision. Instead of asking the downstream VLM to reason over redundant or distracting frames, it provides a compact subset that is more likely to contain the necessary visual evidence.

Core Idea

FrameOracle has two complementary outputs:

  • Rank Head: predicts frame relevance for the given question.
  • K Head: predicts how many frames should be kept.

Together, these heads let the selector adapt both the content and the size of the visual input before the frames are passed into a downstream VLM.
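
As a minimal sketch of how the two outputs could be combined, assume the Rank Head yields one score per candidate frame and the K Head yields an integer budget. The function name `select_frames` and the choice to return indices in temporal order are illustrative assumptions, not details taken from the paper.

```python
def select_frames(scores, k):
    """Return indices of the k highest-scoring frames, in temporal order.

    scores: one relevance score per candidate frame (Rank Head output).
    k: number of frames to keep (K Head output), clamped to [1, len(scores)].
    """
    if not scores:
        return []
    k = max(1, min(int(k), len(scores)))
    # Rank frame indices by score, highest first; ties keep the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore temporal order so the downstream VLM sees frames chronologically.
    return sorted(ranked[:k])


scores = [0.10, 0.80, 0.05, 0.60, 0.90, 0.20]
print(select_frames(scores, 3))  # [1, 3, 4]
```

Clamping K keeps the selector well defined even if the predicted budget falls outside the candidate range.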

FrameOracle-41K Dataset

FrameOracle data figure

FrameOracle-41K is generated through agent-based keyframe mining followed by cross-model verification with three independent VLMs.

FrameOracle-41K contains 40,992 video-question pairs with keyframe annotations. Unlike standard VideoQA datasets that provide only final answers, FrameOracle-41K records the minimal sufficient frames needed to answer each question.

The dataset is built in two stages. First, an agent explores each video and mines candidate keyframes with relevance scores. Then, candidate keyframes are filtered and verified by three independent VLMs; an example is retained only when all three models answer correctly using the selected frames. A human check over 4,000 random examples reports 94% inter-annotator agreement and 93.3% verified accuracy.
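
The retention rule can be sketched as follows. The record layout, the case-insensitive exact-match comparison, and the name `keep_example` are hypothetical; the source states only that all three verifier VLMs must answer correctly from the selected frames.

```python
def keep_example(gold_answer, verifier_answers):
    """Keep an example only if every verifier VLM answered correctly.

    gold_answer: reference answer for the question.
    verifier_answers: answers produced by the verifier VLMs when shown
        only the candidate keyframes (three models in the source).
    """
    def normalize(ans):
        return ans.strip().lower()

    gold = normalize(gold_answer)
    return len(verifier_answers) > 0 and all(
        normalize(a) == gold for a in verifier_answers
    )


candidates = [
    {"id": "q1", "gold": "B", "answers": ["B", "b", " B "]},
    {"id": "q2", "gold": "A", "answers": ["A", "C", "A"]},
]
kept = [c["id"] for c in candidates if keep_example(c["gold"], c["answers"])]
print(kept)  # ['q1']
```

Requiring unanimous agreement trades dataset size for annotation reliability: a single wrong verifier answer discards the example.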

Dataset Statistics

FrameOracle data stat figure

FrameOracle-41K covers 16 question types and shows large variation in the number of frames required across question categories, motivating adaptive frame-count prediction.

Method Overview

FrameOracle method figure

FrameOracle is a lightweight pre-processing module. It receives candidate frames and a textual question, then outputs a compact frame subset for the downstream VLM.

FrameOracle first uniformly samples a candidate frame set from the input video. It then encodes the frames and query, projects them into a shared latent space, and uses a cross-modal Transformer encoder to model query-frame interactions.
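
Uniform candidate sampling can be sketched as below. Taking the midpoint of each of M equal segments is one common convention and is an assumption here; the source does not specify the exact indexing rule, and `uniform_sample_indices` is an illustrative name.

```python
def uniform_sample_indices(num_frames, m):
    """Pick m frame indices spread evenly over a video of num_frames frames.

    Splits the video into m equal segments and takes each segment's midpoint.
    If the video has no more than m frames, every frame is returned.
    """
    if num_frames <= 0 or m <= 0:
        return []
    if num_frames <= m:
        return list(range(num_frames))
    segment = num_frames / m
    return [int(segment * (i + 0.5)) for i in range(m)]


print(uniform_sample_indices(100, 4))  # [12, 37, 62, 87]
```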

The Rank Head scores each candidate frame by relevance, while the K Head predicts how many frames should be retained. The selected subset consists of the K highest-scoring frames, where K is the K Head's prediction, and can be passed to any downstream VLM without retraining the VLM backbone.

Training Curriculum

Stage 1

Text-Visual Alignment

Use SigLIP similarity as weak supervision to initialize query-frame alignment.
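
One way to turn similarity scores into a weak ranking target is a softmax over per-frame cosine similarities. This is a sketch under stated assumptions: the embeddings are taken as precomputed inputs (for example, from a SigLIP text encoder and image encoder), and the softmax form and the `temperature` value are illustrative choices that the source does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def weak_targets(query_emb, frame_embs, temperature=0.1):
    """Softmax over query-frame similarities, used as a soft relevance target."""
    if not frame_embs:
        return []
    sims = [cosine(query_emb, f) / temperature for f in frame_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


query = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
targets = weak_targets(query, frames)
print([round(t, 3) for t in targets])
```

The frame whose embedding points in the same direction as the query receives the largest target weight, and the orthogonal frame the smallest.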

Stage 2

Rank Head Optimization

Train frame ranking with leave-one-out downstream VLM loss signals.
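
A leave-one-out signal can be sketched as the increase in downstream loss when a single frame is removed. The callable `vlm_loss` is a hypothetical stand-in for running the downstream VLM on a frame subset and returning its answer loss; the toy loss below exists only to make the example runnable, and the exact formulation used in the paper may differ.

```python
def leave_one_out_gains(frames, vlm_loss):
    """Score each frame by how much the loss rises when it is dropped.

    A large positive gain means the frame carried evidence the VLM needed.
    """
    base = vlm_loss(frames)
    gains = []
    for i in range(len(frames)):
        without_i = frames[:i] + frames[i + 1:]
        gains.append(vlm_loss(without_i) - base)
    return gains


# Toy stand-in: the loss is low only when the evidence frame "c" is present.
def toy_loss(frames):
    return 0.2 if "c" in frames else 1.5


gains = leave_one_out_gains(["a", "b", "c", "d"], toy_loss)
print(gains)
```

For M candidate frames this requires M + 1 loss evaluations, so in this sketch the cost grows linearly with the candidate set size.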

Stage 3

K Head Optimization

Learn the accuracy-efficiency trade-off by selecting target frame counts.
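
One plausible way to derive a target frame count is to pick the smallest K whose downstream loss stays within a tolerance of the best achievable loss. Both this rule and the `tolerance` parameter are assumptions made for illustration; the source states only that the K Head learns an accuracy-efficiency trade-off.

```python
def target_frame_count(loss_by_k, tolerance=0.05):
    """Return the smallest K whose loss is within `tolerance` of the minimum.

    loss_by_k: mapping from a candidate frame count K to the downstream loss
        measured when the top-K ranked frames are kept.
    """
    if not loss_by_k:
        raise ValueError("loss_by_k must not be empty")
    best = min(loss_by_k.values())
    for k in sorted(loss_by_k):
        if loss_by_k[k] <= best + tolerance:
            return k


losses = {4: 0.90, 8: 0.42, 12: 0.40, 16: 0.39}
print(target_frame_count(losses))  # 8
```

A larger tolerance favors efficiency (fewer frames), while a tolerance of zero always selects the frame count with the lowest loss.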

Stage 4

Keyframe SFT

Fine-tune with FrameOracle-41K annotations for both frame indices and K.

Main Highlights

  • 10.4: Average frames kept from 16-frame inputs without accuracy loss
  • 13.9: Average frames kept from 64-frame candidates
  • +1.5%: Average accuracy gain when using 64-frame candidates
  • 5 × 6: Evaluated across five VLMs and six benchmarks

Main Results

Model Frames NExTQA-OE_val NExTQA-OE_test NExTQA-MC Perception LVB Video-MME EgoSchema MLVU Avg.
(1) State-of-the-Art Models
VideoChat2-7B 16 - - - - 39.3 39.5 63.6 44.5 -
VideoLLaMA2-7B 16 - - 45.4 54.9 53.1 47.9 53.1 - -
Video-XL-7B 256 - - - - 49.5 64.0 - 64.9 -
Video-XL-2-8B ∼10k - - - - 61.0 66.6 - 74.8 -
(2) FrameOracle on Different Baselines
Qwen2.5-VL-3B 32 25.1 29.6 75.4 65.9 54.1 58.4 53.4 59.4 52.7
+ FrameOracle 32→20.9 25.6 30.5 74.8 66.7 54.3 58.5 53.8 58.4 52.8
+ FrameOracle 128→27.8 26.0 31.7 76.1 67.8 54.8 59.7 54.5 61.6 54.0
LLaVA-OneVision-7B 16 14.6 16.7 78.2 56.4 55.0 56.1 60.8 60.9 49.8
+ FrameOracle 16→10.4 16.1 17.8 77.6 56.5 55.5 56.0 62.4 60.2 50.3
+ FrameOracle 64→13.9 16.5 19.0 78.5§ 56.9 56.5 58.1 63.4 63.7 51.6
LLaVA-Video-7B 16 27.3 32.4 81.0 64.3 55.8 59.8 54.2 61.7 54.6
+ FrameOracle 16→10.4 27.8 33.0 80.4 64.7 56.3 59.6 54.6 60.8 54.7
+ FrameOracle 64→13.9 28.8 33.9 81.6 65.1 57.8 61.6 55.2 64.3 56.0
VideoLLaMA3-7B 16 27.8 32.3 82.3 72.3 56.1 61.2 61.4 50.9 55.5
+ FrameOracle 16→10.4 28.3 32.9 81.2 72.0 56.0 61.4 61.8 52.8 55.8
+ FrameOracle 64→13.9 28.9 33.6 82.0§ 72.8 56.9 61.8 62.4 54.1 56.6
Qwen3-VL-8B 32 26.0 31.1 76.6 67.5 63.3 66.9 70.8 63.6 58.2
+ FrameOracle 32→20.9 26.6 32.3 76.1 68.2 64.0 67.3 71.4 62.9 58.6
+ FrameOracle 128→27.8 28.1 33.8 77.3 69.0 65.2 69.1 72.3 66.3 60.1

FrameOracle vs. SOTA VLMs. “Frames” shows M→K̄: FrameOracle starts from M uniformly sampled frames and reduces to an average of K̄ frames. LVB = LongVideoBench validation set. § denotes non-significant changes under paired bootstrap testing.
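
A paired bootstrap test on per-example correctness can be sketched as follows. The resample count, the two-sided sign-based criterion, and the function name are assumptions; the source does not describe its exact testing procedure.

```python
import random


def paired_bootstrap_pvalue(base_correct, new_correct, n_resamples=2000, seed=0):
    """Two-sided paired bootstrap p-value for an accuracy difference.

    base_correct, new_correct: 0/1 correctness per example for the baseline
        and the new system, aligned on the same examples.
    Returns (observed accuracy difference, p-value).
    """
    if len(base_correct) != len(new_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(base_correct)
    diffs = [b2 - b1 for b1, b2 in zip(base_correct, new_correct)]
    observed = sum(diffs) / n
    # Resample examples with replacement, keeping each pair together.
    below = above = 0
    for _ in range(n_resamples):
        mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if mean <= 0:
            below += 1
        if mean >= 0:
            above += 1
    # p-value: how often the resampled difference fails to keep its sign.
    p = 2 * min(below, above) / n_resamples
    return observed, min(1.0, p)


base = [1, 0, 1, 1, 0, 1, 0, 1] * 25
new = [1, 1, 1, 1, 0, 1, 0, 1] * 25
diff, p = paired_bootstrap_pvalue(base, new)
print(round(diff, 3), p < 0.05)
```

Because both systems are scored on the same resampled examples, per-example difficulty cancels out and only the difference between the systems is tested.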

Comparison with Keyframe Selection Methods

Model Frames NExTQA LVB Video-MME EgoSchema MLVU
(1) Jointly Trained Keyframe Selection Methods
SeViLA 8 63.6 - - 25.7 -
VideoAgent 8.4 71.3 - - 60.2 -
FFS 8.6 66.7 - - - -
AKS 64 - 62.7 65.3 - -
(2) Plug-and-Play Keyframe Selection Methods
LLaVA-OneVision-7B 8 77.4 54.3 53.8 62.0 58.4
+ Frame-Voyager 128→8 73.9 - 57.5 - 65.6
+ BOLT 1fps→8 77.4 55.6 56.1 62.2 63.4
+ KFC 1fps→8 - 55.6 55.4 - 65.0
+ FrameOracle 64→8 77.8 56.0 57.5 62.8 62.9
LLaVA-Video-7B 8 75.6 54.2 55.9 51.8 60.5
+ BOLT 1fps→8 - - 58.6 - -
+ KFC 1fps→8 - 56.5 57.6 - 66.9
+ FrameOracle 64→8 76.5 56.9 58.9 53.0 63.4
Qwen2.5-VL-7B 8 69.2 53.6 54.1 56.8 54.5
+ ViaRL 128→8 - - 57.3 - 58.2
+ K-frames 256→8 - 57.7 57.4 - 60.4
+ FrameOracle 128→8 71.7 59.1 57.4 59.5 59.6

FrameOracle vs. SOTA keyframe selection methods. NExTQA reports MCQ. Methods using more frames or larger LLMs are shown in gray. LVB = LongVideoBench validation set. For fair comparison under a fixed budget, FrameOracle’s K Head is disabled and only the Rank Head is used to select the top-8 frames.

Efficiency Comparison

Model            Frames    Total TFLOPs ↓   Latency (s) ↓   Visual Tokens ↓   Avg. Acc. ↑
LLaVA-Video-7B   16        184.38           0.615           11,644.0          54.6
LLaVA-Video-7B   32        405.64           1.140           23,290.0          56.2
LLaVA-Video-7B   64        792.83           2.622           46,584.0          56.6
+ FrameOracle    16→10.4   110.98           0.363           7,581.6           54.7
+ FrameOracle    64→13.9   167.67           0.556           10,133.1          56.0

FrameOracle reduces the effective visual input while preserving or improving downstream accuracy.
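
The relative savings implied by the efficiency table can be computed directly. The numbers below are copied from the LLaVA-Video-7B rows above; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# (TFLOPs, latency in seconds, visual tokens) from the efficiency table.
rows = {
    "16": (184.38, 0.615, 11644.0),
    "16→10.4": (110.98, 0.363, 7581.6),
    "64": (792.83, 2.622, 46584.0),
    "64→13.9": (167.67, 0.556, 10133.1),
}

savings = {}
for base, reduced in [("16", "16→10.4"), ("64", "64→13.9")]:
    savings[reduced] = [
        round(percent_reduction(b, r), 1)
        for b, r in zip(rows[base], rows[reduced])
    ]
    print(reduced, savings[reduced])
# 16→10.4 [39.8, 41.0, 34.9]
# 64→13.9 [78.9, 78.8, 78.2]
```

Under these figures, the 64→13.9 setting uses roughly a fifth of the compute of the 64-frame baseline at 0.6 points lower average accuracy (56.0 vs. 56.6), while exceeding the 16-frame baseline by 1.4 points.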

BibTeX

@inproceedings{li2026frameoracle,
  title     = {FrameOracle: Learning What to See and How Much to See in Videos},
  author    = {Li, Chaoyu and Li, Tianzhi and Tao, Fei and Zhao, Zhenyu and Wu, Ziqian and Zhao, Maozheng and Song, Juntong and Niu, Cheng and Fazli, Pooyan},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}