ReGATE

Learning Faster and Better with Fewer Tokens in MLLMs

ACL 2026

Chaoyu Li1, Yogesh Kulkarni1, Pooyan Fazli1
1Arizona State University

In training MLLMs, ReGATE identifies important textual tokens and selectively propagates them, while skipping unimportant ones to reduce training cost.

Abstract

The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods focus mainly on inference, leaving training cost largely unaddressed. ReGATE (Reference-Guided Adaptive Token Elision) is an adaptive token pruning framework for accelerating MLLM training. It uses a teacher-student design in which a frozen text-only teacher provides per-token reference losses and the student contributes an exponential moving average of per-token difficulty. These two signals are fused to decide which tokens are most informative and should be kept in the forward and backward passes. Across VideoChat2, VideoLLaMA2, and InternVL3.5, ReGATE matches the peak accuracy of standard training on MVBench up to 2× faster while using only 38% of the tokens on average. With longer training, it surpasses the baseline across multiple multimodal benchmarks while reducing total token usage by more than 41%.

Why This Matters

Training MLLMs is expensive because self-attention cost grows quadratically with sequence length, and multimodal inputs, especially video, produce very long token sequences. Most prior token compression methods reduce inference cost only, so the full token sequence is still processed during training.

ReGATE targets the training bottleneck directly. Instead of using fixed heuristics, it adaptively prioritizes tokens that are still hard for the student or genuinely require visual grounding.

Core Idea

For each target token, ReGATE combines two signals:

  • Reference loss: can a frozen text-only teacher predict this token from text alone?
  • Student difficulty: has this token remained hard for the student across training?

Tokens with higher combined difficulty are kept active. Easier and less informative tokens are skipped, saving compute while preserving performance.
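The fusion of the two signals can be sketched as a simple weighted sum. The function name and the `alpha` weight below are illustrative assumptions, not the paper's exact formulation:

```python
def fused_difficulty(ref_loss, ema_difficulty, alpha=0.5):
    """Combine the teacher's per-token reference loss with the
    student's EMA difficulty into one importance score per token.
    `alpha` (assumed hyperparameter) balances the two signals."""
    return [alpha * r + (1.0 - alpha) * e
            for r, e in zip(ref_loss, ema_difficulty)]

# Token 1 is hard for the text-only teacher (high ref_loss) and has
# stayed hard for the student (high EMA), so it scores highest.
scores = fused_difficulty([0.2, 2.5, 0.1], [0.4, 1.8, 0.3])
```

Tokens whose fused score falls below the keep threshold are skipped in the sparse phases.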

Method Overview


Overview of ReGATE. A frozen text-only teacher provides per-token reference loss, which is combined with the student’s historical difficulty to gate token computation during training.

Step 1. Build a frozen text-only teacher from the same LLM backbone by removing the visual encoder and projector.

Step 2. Compute per-token reference loss by masking visual tokens and asking the teacher to predict each output token.
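Conceptually, the reference loss is the negative log-likelihood the teacher assigns to each target token. A toy sketch, assuming the teacher's per-token probabilities are already available:

```python
import math

def reference_loss(token_probs):
    """Per-token reference loss: negative log-likelihood the frozen
    text-only teacher assigns to each target token. Since visual
    tokens are masked, the teacher predicts from text alone, so
    tokens that genuinely need visual grounding get high loss."""
    return [-math.log(p) for p in token_probs]

# A token the teacher predicts with certainty gets zero loss;
# one it assigns probability 0.5 gets log(2) ≈ 0.693.
losses = reference_loss([1.0, 0.5])
```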

Step 3. Maintain an EMA-based student difficulty score for each token during training.
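A minimal sketch of the EMA update, with `beta` as an assumed decay hyperparameter:

```python
def ema_update(prev, current_loss, beta=0.9):
    """Exponential moving average of a token's training loss.
    `beta` (assumed decay rate) controls how much history is kept:
    a token only counts as easy once it has been easy repeatedly."""
    if prev is None:  # first time this token contributes a loss
        return current_loss
    return beta * prev + (1.0 - beta) * current_loss
```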

Step 4. Fuse the two scores, rank tokens by importance, and keep only the top subset during sparse phases.

Step 5. Apply the resulting binary mask inside the transformer decoder so self-attention and MLP computation are performed only on active tokens.
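Steps 4 and 5 amount to a top-k selection over the fused scores followed by a binary mask. A minimal sketch (the keep ratio and names are illustrative):

```python
def token_keep_mask(scores, keep_ratio=0.4):
    """Rank tokens by fused importance score and keep only the top
    fraction. Returns a binary mask; masked-out tokens are skipped
    in self-attention and MLP computation during sparse phases."""
    k = max(1, int(len(scores) * keep_ratio))
    top = sorted(range(len(scores)),
                 key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask
```

Inside the decoder, this mask would gate which token positions participate in each layer's attention and MLP, which is where the compute savings come from.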

Main Highlights

  • 2×: faster to reach baseline peak accuracy on MVBench
  • 38%: average token usage when matching baseline peak accuracy
  • 41%+: total token reduction while surpassing baseline performance

Results

ReGATE improves or matches performance across image, short-video, and long-video benchmarks while using substantially fewer tokens. It works across different backbones and fine-tuning settings, including full fine-tuning and LoRA-based training.

| Model | Tokens | Video-MME | MLVU | MVBench | Perception |
|---|---|---|---|---|---|
| VideoChat2 | 3.93B | 26.0 | 36.0 | 55.7 | 48.4 |
| VideoChat2-ReGATE | 2.22B | 32.7 | 40.5 | 56.6 | 50.0 |
| VideoLLaMA2 | 83.82M | 53.7 | 53.2 | 52.0 | 53.0 |
| VideoLLaMA2-ReGATE | 49.27M | 54.5 | 54.5 | 53.6 | 54.1 |
| InternVL3.5 | 3.96B | 62.4 | 63.7 | 68.3 | 65.3 |
| InternVL3.5-ReGATE | 2.32B | 63.0 | 64.2 | 69.6 | 66.7 |

Training efficiency curves for ReGATE versus standard fine-tuning. ReGATE reaches competitive accuracy with substantially fewer processed tokens and less training time.

Efficiency Comparison

| Model | Tokens ↓ | Teacher Cost (GPU-h) ↓ | Train Time (GPU-h) ↓ | Avg. Mem/GPU (GB) ↓ | Avg. Acc. ↑ |
|---|---|---|---|---|---|
| VideoLLaMA2 | 83.82M | - | 129.6 | 69.1 | 48.2 |
| VideoLLaMA2-ReGATE | 49.27M | 2.1 | 107.6 | 61.3 | 48.9 |
| VideoChat2 | 3.93B | - | 148.8 | 70.8 | 46.1 |
| VideoChat2-ReGATE | 2.22B | 10.0 | 130.0 | 63.7 | 47.8 |
| InternVL3.5 | 3.96B | - | 435.2 | 58.3 | 61.8 |
| InternVL3.5-ReGATE | 2.32B | 11.3 | 374.4 | 51.9 | 62.2 |

BibTeX

@article{li2025ReGATE,
  title   = {ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs},
  author  = {Li, Chaoyu and Kulkarni, Yogesh and Fazli, Pooyan},
  journal = {arXiv preprint arXiv:2507.21420},
  year    = {2025}
}