Training MLLMs is expensive because self-attention scales quadratically with sequence length, and multimodal inputs, especially video, produce very long token sequences. Most prior token compression methods reduce inference cost only; the full token sequence is still processed during training.
ReGATE targets the training bottleneck directly. Instead of relying on fixed heuristics, it adaptively prioritizes tokens that the student model still finds hard or that genuinely require visual grounding.
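To make the idea concrete, here is a minimal sketch of reference-guided token selection, assuming per-token losses from the student and from a frozen reference model are available. The function name `select_training_tokens`, the `keep_ratio` parameter, and the scoring rule (student loss minus reference loss) are illustrative assumptions, not the paper's exact implementation; for simplicity the sketch only masks the loss, whereas the real compute savings come from dropping unselected tokens before the expensive forward pass.

```python
import torch

def select_training_tokens(student_token_loss: torch.Tensor,
                           reference_token_loss: torch.Tensor,
                           keep_ratio: float = 0.5) -> torch.Tensor:
    """Return a boolean mask over tokens to keep for this training step.

    Tokens are scored by how much harder they remain for the student than
    for a frozen reference model; the hardest `keep_ratio` fraction per
    sequence is kept. Both inputs have shape (batch, seq_len).
    """
    # Difficulty score: large when the student still lags the reference.
    difficulty = student_token_loss - reference_token_loss           # (B, T)

    num_keep = max(1, int(keep_ratio * difficulty.size(1)))
    # Indices of the hardest tokens in each sequence.
    top_idx = difficulty.topk(num_keep, dim=1).indices               # (B, num_keep)

    keep_mask = torch.zeros_like(difficulty, dtype=torch.bool)
    keep_mask.scatter_(1, top_idx, True)
    return keep_mask


# Usage: compute per-token cross-entropy for student and reference models,
# then backpropagate only through the selected (hard) tokens.
if __name__ == "__main__":
    B, T, V = 2, 16, 100
    logits_student = torch.randn(B, T, V, requires_grad=True)
    logits_reference = torch.randn(B, T, V)   # stand-in for a frozen reference model
    targets = torch.randint(0, V, (B, T))

    ce = torch.nn.functional.cross_entropy
    loss_s = ce(logits_student.transpose(1, 2), targets, reduction="none")  # (B, T)
    loss_r = ce(logits_reference.transpose(1, 2), targets, reduction="none")

    mask = select_training_tokens(loss_s.detach(), loss_r, keep_ratio=0.5)
    loss = (loss_s * mask).sum() / mask.sum()
    loss.backward()
```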