FrameOracle: Learning What to See and How Much to See in Videos

FrameOracle selects the frames most relevant to a video question and dynamically predicts the number of frames needed by the downstream VLM.

Abstract

Vision-language models are increasingly capable of video understanding, but they still operate under tight computational budgets. Their performance often depends on whether a small set of input frames contains the evidence needed to answer a question. Existing uniform or fixed-budget sampling strategies do not adapt to content density or task complexity. FrameOracle addresses this limitation with a lightweight, plug-and-play frame selector that predicts both which frames are relevant and how many frames should be retained for each video-query pair. It is trained through a curriculum that progresses from weak proxy signals to stronger supervision from FrameOracle-41K, a large-scale VideoQA dataset with validated keyframe annotations specifying minimal sufficient frames. Across five VLMs and six benchmarks, FrameOracle reduces 16-frame inputs to 10.4 frames on average without accuracy loss, and reduces 64-frame candidates to 13.9 frames while improving accuracy by 1.5%.

Why This Matters

Video VLMs cannot always process all frames in a long video. A fixed number of frames is easy to use, but it ignores the fact that some questions require only a few key moments while others require broader temporal coverage.

FrameOracle turns frame selection into a learned, query-conditioned decision. Instead of asking the downstream VLM to reason over redundant or distracting frames, it provides a compact subset that is more likely to contain the necessary visual evidence.

Core Idea

FrameOracle has two complementary outputs:

Rank Head: predicts frame relevance for the given question.
K Head: predicts how many frames should be kept.

Together, these heads let the selector adapt both the content and the size of the visual input before the frames are passed into a downstream VLM.

FrameOracle-41K Dataset

FrameOracle-41K is generated through agent-based keyframe mining followed by cross-model verification with three independent VLMs.

FrameOracle-41K contains 40,992 video-question pairs with keyframe annotations. Unlike standard VideoQA datasets that provide only final answers, FrameOracle-41K records the minimal sufficient frames needed to answer each question.

The dataset is built in two stages. First, an agent explores each video and mines candidate keyframes with relevance scores. Then, candidate keyframes are filtered and verified by three independent VLMs; an example is retained only when all three models answer correctly using the selected frames. A human check over 4,000 random examples reports 94% inter-annotator agreement and 93.3% verified accuracy.
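The retention rule in the second stage can be sketched as a simple unanimity filter. This is a minimal illustration rather than the authors' pipeline: the `Verifier` callable signature and the normalized string comparison used to judge correctness are assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical interface: a verifier takes a question and the candidate
# keyframe indices, and returns its answer as a string.
Verifier = Callable[[str, Sequence[int]], str]


def passes_cross_model_check(
    question: str,
    keyframes: Sequence[int],
    gold_answer: str,
    verifiers: Sequence[Verifier],
) -> bool:
    """Keep an example only if every verifier answers correctly.

    Correctness is judged here by case-insensitive string equality,
    which is an assumption; the source does not specify the matching rule.
    """
    gold = gold_answer.strip().lower()
    return all(v(question, keyframes).strip().lower() == gold for v in verifiers)


# Stub verifiers standing in for the three independent VLMs.
always_b = lambda question, frames: "B"
always_a = lambda question, frames: "A"

kept = passes_cross_model_check("q", [3, 7], "b", [always_b, always_b, always_b])
dropped = passes_cross_model_check("q", [3, 7], "b", [always_b, always_a, always_b])
```

A single disagreeing verifier is enough to discard the example, which is what makes the surviving annotations conservative.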

Dataset Statistics

FrameOracle-41K covers 16 question types and shows large variation in the number of frames required across question categories, motivating adaptive frame-count prediction.

Method Overview

FrameOracle is a lightweight pre-processing module. It receives candidate frames and a textual question, then outputs a compact frame subset for the downstream VLM.

FrameOracle first uniformly samples a candidate frame set from the input video. It then encodes the frames and query, projects them into a shared latent space, and uses a cross-modal Transformer encoder to model query-frame interactions.

The Rank Head scores each candidate frame by relevance, while the K Head predicts how many frames should be retained. The final selected subset contains the top-ranked frames according to the predicted K and can be passed to any downstream VLM without retraining the VLM backbone.
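The final selection step can be illustrated with a short sketch. It assumes the Rank Head output is a list of per-frame scores and the K Head output is a single integer; the function name `select_frames`, the optional `fixed_budget` override, and the choice to return indices in temporal order are illustrative assumptions, not details confirmed by the source.

```python
from typing import List, Optional, Sequence


def select_frames(
    scores: Sequence[float],
    predicted_k: int,
    fixed_budget: Optional[int] = None,
) -> List[int]:
    """Return the indices of the top-ranked frames, sorted by frame position.

    If fixed_budget is given, it overrides predicted_k (useful when a
    comparison requires the same number of frames for every video).
    """
    budget = predicted_k if fixed_budget is None else fixed_budget
    # Clamp to a valid range: at least one frame, at most all candidates.
    budget = max(1, min(int(budget), len(scores)))
    # Rank by descending score; Python's sort is stable, so ties keep
    # the earlier frame first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order before handing frames to the downstream VLM.
    return sorted(ranked[:budget])


scores = [0.1, 0.9, 0.3, 0.8, 0.2]
adaptive = select_frames(scores, predicted_k=3)
fixed = select_frames(scores, predicted_k=3, fixed_budget=1)
```

Here `adaptive` keeps the three highest-scoring frames and `fixed` ignores the predicted count in favor of a budget of one.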

Training Curriculum

Stage 1

Text-Visual Alignment

Use SigLIP similarity as weak supervision to initialize query-frame alignment.
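One way to picture this weak signal is as a soft target distribution over frames derived from query-frame similarity. The sketch below assumes the query and frame embeddings have already been computed (for instance by a SigLIP encoder) and are given as plain lists; the softmax with a `temperature` parameter is an assumption for illustration, since the source does not state how similarities are turned into training targets.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weak_targets(
    query_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    temperature: float = 0.1,  # hypothetical value, not from the source
) -> List[float]:
    """Softmax over query-frame similarities, used as a soft relevance target."""
    sims = [cosine(query_emb, f) for f in frame_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D embeddings: frame 0 is aligned with the query, frame 1 is orthogonal.
targets = weak_targets([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The resulting targets sum to one and give the most weight to the frame whose embedding is closest to the query.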

Stage 2

Rank Head Optimization

Train frame ranking with leave-one-out downstream VLM loss signals.
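A leave-one-out signal can be sketched as follows: remove one frame at a time and measure how much the downstream loss rises. The `loss_fn` callable is a hypothetical stand-in for the downstream VLM's answer loss, and how the resulting importances are converted into a ranking objective is not specified in the source, so this shows only the signal itself.

```python
from typing import Callable, List, Sequence


def leave_one_out_importance(
    frames: Sequence[str],
    loss_fn: Callable[[List[str]], float],
) -> List[float]:
    """importance[i] = loss without frame i minus loss with all frames.

    A large positive value means the VLM does worse without that frame,
    so the frame is likely to carry evidence for the answer.
    """
    frame_list = list(frames)
    full_loss = loss_fn(frame_list)
    return [
        loss_fn(frame_list[:i] + frame_list[i + 1:]) - full_loss
        for i in range(len(frame_list))
    ]


def toy_loss(frames: List[str]) -> float:
    """Toy stand-in for a VLM loss: low only when the key frame is present."""
    return 0.5 if "key" in frames else 2.0


importance = leave_one_out_importance(["a", "key", "b"], toy_loss)
```

In this toy case only the frame named `key` receives a non-zero importance. Note that the procedure needs one extra forward pass of the downstream model per candidate frame.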

Stage 3

K Head Optimization

Learn the accuracy-efficiency trade-off by selecting target frame counts.
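One plausible way to choose a target frame count is to take the smallest budget whose downstream loss is close to the best achievable. The rule below, including the `tolerance` parameter and the `loss_by_k` mapping, is a hypothetical illustration of an accuracy-efficiency trade-off and is not taken from the source.

```python
from typing import Dict


def target_frame_count(loss_by_k: Dict[int, float], tolerance: float = 0.05) -> int:
    """Smallest K whose loss is within `tolerance` of the best observed loss."""
    best = min(loss_by_k.values())
    return min(k for k, loss in loss_by_k.items() if loss <= best + tolerance)


# Hypothetical losses measured with the top-K frames for several budgets.
losses = {4: 0.90, 8: 0.42, 12: 0.40, 16: 0.41}
relaxed = target_frame_count(losses, tolerance=0.05)
strict = target_frame_count(losses, tolerance=0.0)
```

A larger tolerance favors efficiency (fewer frames), while a tolerance of zero always picks the budget with the lowest loss.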

Stage 4

Keyframe SFT

Fine-tune with FrameOracle-41K annotations for both frame indices and K.

Main Highlights

10.4: average frames kept from 16-frame inputs without accuracy loss
13.9: average frames kept from 64-frame candidates
+1.5%: average accuracy gain when using 64-frame candidates
5 × 6: evaluated across five VLMs and six benchmarks

Main Results

Model	Frames	NExTQA OE_val	NExTQA OE_test	NExTQA MC	Perception	LVB	Video-MME	EgoSchema	MLVU	Avg.
(1) State-of-the-Art Models
VideoChat2-7B	16	-	-	-	-	39.3	39.5	63.6	44.5	-
VideoLLaMA2-7B	16	-	-	45.4	54.9	53.1	47.9	53.1	-	-
Video-XL-7B	256	-	-	-	-	49.5	64.0	-	64.9	-
Video-XL-2-8B	∼10k	-	-	-	-	61.0	66.6	-	74.8	-
(2) FrameOracle on Different Baselines
Qwen2.5-VL-3B	32	25.1	29.6	75.4	65.9	54.1	58.4	53.4	59.4	52.7
+ FrameOracle	32→20.9	25.6	30.5	74.8	66.7	54.3	58.5	53.8	58.4	52.8
+ FrameOracle	128→27.8	26.0	31.7	76.1	67.8	54.8	59.7	54.5	61.6	54.0
LLaVA-OneVision-7B	16	14.6	16.7	78.2	56.4	55.0	56.1	60.8	60.9	49.8
+ FrameOracle	16→10.4	16.1	17.8	77.6	56.5	55.5	56.0	62.4	60.2	50.3
+ FrameOracle	64→13.9	16.5	19.0	78.5^§	56.9	56.5	58.1	63.4	63.7	51.6
LLaVA-Video-7B	16	27.3	32.4	81.0	64.3	55.8	59.8	54.2	61.7	54.6
+ FrameOracle	16→10.4	27.8	33.0	80.4	64.7	56.3	59.6	54.6	60.8	54.7
+ FrameOracle	64→13.9	28.8	33.9	81.6	65.1	57.8	61.6	55.2	64.3	56.0
VideoLLaMA3-7B	16	27.8	32.3	82.3	72.3	56.1	61.2	61.4	50.9	55.5
+ FrameOracle	16→10.4	28.3	32.9	81.2	72.0	56.0	61.4	61.8	52.8	55.8
+ FrameOracle	64→13.9	28.9	33.6	82.0^§	72.8	56.9	61.8	62.4	54.1	56.6
Qwen3-VL-8B	32	26.0	31.1	76.6	67.5	63.3	66.9	70.8	63.6	58.2
+ FrameOracle	32→20.9	26.6	32.3	76.1	68.2	64.0	67.3	71.4	62.9	58.6
+ FrameOracle	128→27.8	28.1	33.8	77.3	69.0	65.2	69.1	72.3	66.3	60.1

FrameOracle vs. SOTA VLMs. “Frames” shows M→K̄: FrameOracle starts from M uniformly sampled frames and reduces to an average of K̄ frames. LVB = LongVideoBench validation set. § denotes non-significant changes under paired bootstrap testing.
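The headline accuracy gain can be cross-checked against the Avg. column of the table above. The sketch below compares each baseline with its larger-candidate FrameOracle row (64 or 128 candidates, depending on the model). Averaging over all five models is our reading of how the 1.5 figure arises; the source does not spell out the formula.

```python
# (baseline Avg., FrameOracle Avg. with the larger candidate pool),
# copied from the Avg. column of the main results table.
avg_acc = {
    "Qwen2.5-VL-3B": (52.7, 54.0),
    "LLaVA-OneVision-7B": (49.8, 51.6),
    "LLaVA-Video-7B": (54.6, 56.0),
    "VideoLLaMA3-7B": (55.5, 56.6),
    "Qwen3-VL-8B": (58.2, 60.1),
}

# Per-model gain in accuracy points, rounded to the table's precision.
gains = {
    name: round(after - before, 1) for name, (before, after) in avg_acc.items()
}

# Mean gain across the five models.
mean_gain = round(sum(gains.values()) / len(gains), 1)
```

Under this reading the gain is an absolute difference in accuracy points rather than a relative improvement.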

Comparison with Keyframe Selection Methods

Model	Frames	NExTQA	LVB	Video-MME	EgoSchema	MLVU
(1) Jointly Trained Keyframe Selection Methods
SeViLA	8	63.6	-	-	25.7	-
VideoAgent	8.4	71.3	-	-	60.2	-
FFS	8.6	66.7	-	-	-	-
AKS	64	-	62.7	65.3	-	-
(2) Plug-and-Play Keyframe Selection Methods
LLaVA-OneVision-7B	8	77.4	54.3	53.8	62.0	58.4
+ Frame-Voyager	128→8	73.9	-	57.5	-	65.6
+ BOLT	1fps→8	77.4	55.6	56.1	62.2	63.4
+ KFC	1fps→8	-	55.6	55.4	-	65.0
+ FrameOracle	64→8	77.8	56.0	57.5	62.8	62.9
LLaVA-Video-7B	8	75.6	54.2	55.9	51.8	60.5
+ BOLT	1fps→8	-	-	58.6	-	-
+ KFC	1fps→8	-	56.5	57.6	-	66.9
+ FrameOracle	64→8	76.5	56.9	58.9	53.0	63.4
Qwen2.5-VL-7B	8	69.2	53.6	54.1	56.8	54.5
+ ViaRL	128→8	-	-	57.3	-	58.2
+ K-frames	256→8	-	57.7	57.4	-	60.4
+ FrameOracle	128→8	71.7	59.1	57.4	59.5	59.6

FrameOracle vs. SOTA keyframe selection methods. NExTQA reports MCQ. Methods using more frames or larger LLMs are shown in gray. LVB = LongVideoBench validation set. For fair comparison under a fixed budget, FrameOracle’s K Head is disabled and only the Rank Head is used to select the top-8 frames.

Efficiency Comparison

Model	Frames	Total TFLOPs ↓	Latency (s) ↓	Visual Tokens ↓	Avg. Acc. ↑
LLaVA-Video-7B	16	184.38	0.615	11,644.0	54.6
LLaVA-Video-7B	32	405.64	1.140	23,290.0	56.2
LLaVA-Video-7B	64	792.83	2.622	46,584.0	56.6
+ FrameOracle	16→10.4	110.98	0.363	7,581.6	54.7
+ FrameOracle	64→13.9	167.67	0.556	10,133.1	56.0

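At 16→10.4 frames, FrameOracle lowers total compute from 184.38 to 110.98 TFLOPs and latency from 0.615 s to 0.363 s while matching the 16-frame baseline's accuracy (54.7 vs. 54.6). At 64→13.9 frames, it reaches 56.0 average accuracy, 0.6 points below the 64-frame baseline, at a lower cost than the 16-frame baseline (167.67 TFLOPs, 0.556 s).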
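The relative savings implied by the table can be worked out directly. The sketch below compares the 16-frame LLaVA-Video-7B baseline with the 16→10.4 FrameOracle row, using the reported values; the helper name `relative_reduction` is ours.

```python
# Values copied from the efficiency table (LLaVA-Video-7B).
baseline_16 = {"tflops": 184.38, "latency_s": 0.615, "visual_tokens": 11644.0}
frameoracle_16 = {"tflops": 110.98, "latency_s": 0.363, "visual_tokens": 7581.6}


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


savings = {
    name: round(relative_reduction(baseline_16[name], frameoracle_16[name]), 1)
    for name in baseline_16
}
```

With these inputs, `savings` comes out to roughly 39.8% fewer TFLOPs, 41.0% lower latency, and 34.9% fewer visual tokens, at an average accuracy of 54.7 versus 54.6.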

Qualitative Examples

In the example cases, FrameOracle selects compact, question-relevant evidence frames.

BibTeX

@inproceedings{li2026frameoracle,
  title     = {FrameOracle: Learning What to See and How Much to See in Videos},
  author    = {Li, Chaoyu and Li, Tianzhi and Tao, Fei and Zhao, Zhenyu and Wu, Ziqian and Zhao, Maozheng and Song, Juntong and Niu, Cheng and Fazli, Pooyan},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}