ReGATE

Learning Faster and Better with Fewer Tokens in MLLMs

ACL 2026

Chaoyu Li1, Yogesh Kulkarni1, Pooyan Fazli1
1Arizona State University

In training MLLMs, ReGATE identifies important textual tokens and selectively propagates them, while skipping unimportant ones to reduce training cost.

Abstract

The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods focus mainly on inference, leaving training cost largely unaddressed. ReGATE (Reference-Guided Adaptive Token Elision) is an adaptive token pruning framework for accelerating MLLM training. It uses a teacher-student design in which a frozen text-only teacher provides per-token reference losses and the student contributes an exponential moving average of per-token difficulty. These two signals are fused to decide which tokens are most informative and should be kept in the forward and backward passes. Across VideoChat2, VideoLLaMA2, and InternVL3.5, ReGATE matches the peak accuracy of standard training on MVBench up to 2× faster while using only 38% of the tokens on average. With longer training, it surpasses the baseline across multiple multimodal benchmarks while reducing total token usage by more than 41%.

Why This Matters

Training MLLMs is expensive because self-attention cost grows quadratically with sequence length, and multimodal inputs, especially video, produce very long token sequences. Most prior token compression methods reduce inference cost only, so the full token sequence is still processed during training.

ReGATE targets the training bottleneck directly. Instead of using fixed heuristics, it adaptively prioritizes tokens that are still hard for the student or genuinely require visual grounding.

Core Idea

For each target token, ReGATE combines two signals:

  • Reference loss: can a frozen text-only teacher predict this token from text alone?
  • Student difficulty: has this token remained hard for the student across training?

Tokens with higher combined difficulty are kept active. Easier and less informative tokens are skipped, saving compute while preserving performance.
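The fusion of the two signals can be sketched as a simple weighted sum. The function name and the `alpha` weight below are illustrative assumptions, not the paper's exact formulation:

```python
def fused_difficulty(ref_loss, ema_difficulty, alpha=0.5):
    """Combine the teacher's per-token reference loss with the
    student's EMA difficulty into one importance score per token.
    `alpha` (assumed hyperparameter) balances the two signals."""
    return [alpha * r + (1.0 - alpha) * e
            for r, e in zip(ref_loss, ema_difficulty)]

# Token 1 is hard for the text-only teacher (high ref_loss) and has
# stayed hard for the student (high EMA), so it scores highest.
scores = fused_difficulty([0.2, 2.5, 0.1], [0.4, 1.8, 0.3])
```

Tokens whose fused score falls below the keep threshold are skipped in the sparse phases.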

Method Overview


Overview of ReGATE. A frozen text-only teacher provides per-token reference loss, which is combined with the student’s historical difficulty to gate token computation during training.

Step 1. Build a frozen text-only teacher from the same LLM backbone by removing the visual encoder and projector.

Step 2. Compute per-token reference loss by masking visual tokens and asking the teacher to predict each output token.
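Conceptually, the reference loss is the negative log-likelihood the teacher assigns to each target token. A toy sketch, assuming the teacher's per-token probabilities are already available:

```python
import math

def reference_loss(token_probs):
    """Per-token reference loss: negative log-likelihood the frozen
    text-only teacher assigns to each target token. Since visual
    tokens are masked, the teacher predicts from text alone, so
    tokens that genuinely need visual grounding get high loss."""
    return [-math.log(p) for p in token_probs]

# A token the teacher predicts with certainty gets zero loss;
# one it assigns probability 0.5 gets log(2) ≈ 0.693.
losses = reference_loss([1.0, 0.5])
```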

Step 3. Maintain an EMA-based student difficulty score for each token during training.
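A minimal sketch of the EMA update, with `beta` as an assumed decay hyperparameter:

```python
def ema_update(prev, current_loss, beta=0.9):
    """Exponential moving average of a token's training loss.
    `beta` (assumed decay rate) controls how much history is kept:
    a token only counts as easy once it has been easy repeatedly."""
    if prev is None:  # first time this token contributes a loss
        return current_loss
    return beta * prev + (1.0 - beta) * current_loss
```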

Step 4. Fuse the two scores, rank tokens by importance, and keep only the top subset during sparse phases.

Step 5. Apply the resulting binary mask inside the transformer decoder so self-attention and MLP computation are performed only on active tokens.
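Steps 4 and 5 amount to a top-k selection over the fused scores followed by a binary mask. A minimal sketch (the keep ratio and names are illustrative):

```python
def token_keep_mask(scores, keep_ratio=0.4):
    """Rank tokens by fused importance score and keep only the top
    fraction. Returns a binary mask; masked-out tokens are skipped
    in self-attention and MLP computation during sparse phases."""
    k = max(1, int(len(scores) * keep_ratio))
    top = sorted(range(len(scores)),
                 key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask
```

Inside the decoder, this mask would gate which token positions participate in each layer's attention and MLP, which is where the compute savings come from.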

Main Highlights

  • 2×: faster to reach baseline peak accuracy on MVBench
  • 38%: average token usage when matching baseline peak accuracy
  • 41%+: total token reduction while surpassing baseline performance

Results

ReGATE improves or matches performance across image, short-video, and long-video benchmarks while using substantially fewer tokens. It works across different backbones and fine-tuning settings, including full fine-tuning and LoRA-based training.

| Model | Tokens | Video-MME | MLVU | MVBench | Perception |
|---|---|---|---|---|---|
| VideoChat2 | 3.93B | 26.0 | 36.0 | 55.7 | 48.4 |
| VideoChat2-ReGATE | 2.22B | 32.7 | 40.5 | 56.6 | 50.0 |
| VideoLLaMA2 | 83.82M | 53.7 | 53.2 | 52.0 | 53.0 |
| VideoLLaMA2-ReGATE | 49.27M | 54.5 | 54.5 | 53.6 | 54.1 |
| InternVL3.5 | 3.96B | 62.4 | 63.7 | 68.3 | 65.3 |
| InternVL3.5-ReGATE | 2.32B | 63.0 | 64.2 | 69.6 | 66.7 |

Training efficiency curves for ReGATE versus standard fine-tuning. ReGATE reaches competitive accuracy with substantially fewer processed tokens and less training time.

Efficiency Comparison

| Model | Tokens ↓ | Teacher Cost (GPU-h) ↓ | Train Time (GPU-h) ↓ | Avg. Mem/GPU (GB) ↓ | Avg. Acc. ↑ |
|---|---|---|---|---|---|
| VideoLLaMA2 | 83.82M | - | 129.6 | 69.1 | 48.2 |
| VideoLLaMA2-ReGATE | 49.27M | 2.1 | 107.6 | 61.3 | 48.9 |
| VideoChat2 | 3.93B | - | 148.8 | 70.8 | 46.1 |
| VideoChat2-ReGATE | 2.22B | 10.0 | 130.0 | 63.7 | 47.8 |
| InternVL3.5 | 3.96B | - | 435.2 | 58.3 | 61.8 |
| InternVL3.5-ReGATE | 2.32B | 11.3 | 374.4 | 51.9 | 62.2 |

BibTeX

@article{li2025ReGATE,
  title   = {ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs},
  author  = {Li, Chaoyu and Kulkarni, Yogesh and Fazli, Pooyan},
  journal = {arXiv preprint arXiv:2507.21420},
  year    = {2025}
}