Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91x speedup in prefilling and a 10x reduction in FLOPs, while retaining 95.4% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.
To address visual redundancy in token representations, recent studies have proposed text-agnostic approaches that retain visual tokens with high [CLS] attention at the output layer of the ViT-based visual encoder. While effective to some extent, this strategy raises an important question: Is relying solely on output [CLS] attention truly sufficient to capture all task-relevant visual information?
Our observations are as follows: (1) In the shallow layers, the [CLS] attention maps capture fine-grained local details across the image. In contrast, in the deeper layers, the attention becomes increasingly concentrated on the main entities, reflecting their global semantic relevance; (2) The self-attention maps for representative visual tokens reveal a similar local-to-global trend: in the shallow layers, these tokens primarily attend to nearby regions with similar semantic meaning, while in the deeper layers, their attention becomes more dispersed, integrating context from the entire image. These findings highlight a gradual transition in the visual encoder from capturing low-level local details to modeling high-level, globally relevant semantics, suggesting that relying solely on the output layer may overlook the rich local information encoded in the shallow layers.
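The contrast between shallow-layer and output-layer [CLS] attention can be made concrete with a small sketch. The snippet below is an illustration only, not the VScan algorithm: the function names `top_k_tokens` and `combine_scans`, the toy attention values, and the rule of topping up output-layer picks with shallow-layer picks are all assumptions chosen for this example. It uses plain Python lists so that it runs without dependencies.

```python
def top_k_tokens(cls_attention, k):
    """Return the indices of the k patch tokens with the highest [CLS] attention."""
    ranked = sorted(range(len(cls_attention)),
                    key=lambda i: cls_attention[i], reverse=True)
    return sorted(ranked[:k])


def combine_scans(deep_cls_attention, shallow_cls_attention, k_global, k_local):
    """Hypothetical rule: keep output-layer picks, then add shallow-layer picks."""
    kept = top_k_tokens(deep_cls_attention, k_global)
    kept_set = set(kept)
    # Rank the tokens that were not kept by their shallow-layer [CLS] attention.
    remaining = [i for i in range(len(shallow_cls_attention)) if i not in kept_set]
    remaining.sort(key=lambda i: shallow_cls_attention[i], reverse=True)
    return sorted(kept + remaining[:k_local])


# Toy [CLS] attention over 8 patch tokens (made-up values).
deep = [0.02, 0.03, 0.40, 0.35, 0.08, 0.06, 0.03, 0.03]     # concentrated on one entity
shallow = [0.20, 0.10, 0.10, 0.10, 0.05, 0.05, 0.25, 0.15]  # spread over local detail

output_only = top_k_tokens(deep, 4)                          # [2, 3, 4, 5]
combined = combine_scans(deep, shallow, k_global=2, k_local=2)  # [0, 2, 3, 6]
print(output_only, combined)
```

With the same budget of four tokens, the output-only selection stays clustered around the main entity, while the combined selection also retains tokens 0 and 6, which only the shallow layer considers salient.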
While prior studies have proposed effective text-aware approaches for pruning visual tokens at early layers during LLM decoding, a critical question remains: Are early layers the optimal stage for pruning visual tokens to minimize their impact on the model’s final response? To investigate this, we conduct three empirical studies on POPE and GQA, analyzing how the model’s knowledge and predictions evolve during the decoding process:
As shown in Figure 3 (Left), early layers (e.g., layers 2 and 8) tend to select tokens at the bottom of the image. This reflects an inherent LLM position bias: the last instruction token primarily attends to nearby tokens and focuses on local context, and the flattened visual tokens from the bottom of the image sit closest to the instruction tokens in the sequence. As the LLM layers deepen, this undesirable position bias diminishes and the focus shifts toward the center of the image, which is more intuitive because the center typically carries the most informative and task-relevant features.
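The geometry behind this position bias can be checked with a little arithmetic. The sketch below assumes row-major flattening of the patch grid and instruction tokens that immediately follow the visual tokens; the helper names and the 4x4 grid are hypothetical and chosen only for illustration.

```python
def grid_position(token_index, grid_width):
    """Map a flattened (row-major) visual token index to (row, col) in the patch grid."""
    return divmod(token_index, grid_width)


def distance_to_instruction(token_index, num_visual_tokens):
    """Sequence distance from a visual token to the first token after the visual block."""
    return num_visual_tokens - token_index


GRID_WIDTH = 4                       # hypothetical 4x4 patch grid
NUM_VISUAL = GRID_WIDTH * GRID_WIDTH

for index in (0, 15):
    row, col = grid_position(index, GRID_WIDTH)
    gap = distance_to_instruction(index, NUM_VISUAL)
    print(f"token {index}: row={row}, col={col}, distance to instruction={gap}")
```

Under these assumptions the top-left token is 16 positions away from the instruction tokens, while the bottom-right token is only 1 position away, so an attention pattern that favors nearby positions will favor the bottom rows of the image.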
We visualize the sum of attention received by all visual tokens from the last instruction token across different LLM layers using LLaVA-1.5-7B and Qwen-2.5-VL-7B in Figure 3 (Right). The red curve in each plot highlights the layer-wise attention patterns directed towards visual information. We observe that the middle LLM layers are primarily responsible for interacting with the visual tokens, whereas the early and deep layers focus predominantly on processing textual information.
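The per-layer quantity described above can be sketched as follows. This is a minimal illustration rather than the authors' measurement code: the input layout (`last_token_attention[layer][head]` holding the attention row of the last instruction token), the averaging over heads, and the toy numbers are all assumptions.

```python
def visual_attention_per_layer(last_token_attention, visual_start, visual_end):
    """Per layer, sum the attention that the last instruction token places on
    visual positions [visual_start, visual_end), averaged over heads (an assumption)."""
    totals = []
    for heads in last_token_attention:
        per_head = [sum(row[visual_start:visual_end]) for row in heads]
        totals.append(sum(per_head) / len(per_head))
    return totals


# Toy example: 3 layers, 2 heads, 6 positions (0 = system, 1-3 = visual, 4-5 = text).
toy_attention = [
    [[0.6, 0.05, 0.05, 0.1, 0.1, 0.1], [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]],   # early layer
    [[0.2, 0.2, 0.2, 0.2, 0.1, 0.1], [0.1, 0.3, 0.2, 0.2, 0.1, 0.1]],     # middle layer
    [[0.7, 0.0, 0.05, 0.05, 0.1, 0.1], [0.6, 0.05, 0.05, 0.1, 0.1, 0.1]], # deep layer
]

mass = visual_attention_per_layer(toy_attention, visual_start=1, visual_end=4)
peak_layer = max(range(len(mass)), key=lambda i: mass[i])
print(mass, peak_layer)
```

In this toy input the middle layer carries the largest attention mass on visual tokens, mirroring the qualitative pattern reported above; real attention rows would come from the model itself.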
We observe that in more challenging open-ended tasks like GQA, the next-token predictions stabilize around LLM layer 20, whereas in simpler yes/no tasks such as POPE, the predictions converge earlier, around LLM layer 16. Our findings indicate that early layers are still forming core cross-modal semantics, so pruning visual tokens at that stage risks disrupting essential grounding. In contrast, by the middle layers, next-token predictions have largely stabilized, meaning that these layers contribute diminishing semantic change. This directly motivates pruning in the middle-to-late layers rather than the early layers.
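One simple way to quantify when predictions stabilize is to find the earliest layer from which the top-1 next-token prediction no longer changes. The sketch below assumes that a per-layer prediction is already available (for example, by decoding each layer's hidden state, which is an assumption about the setup and not a detail given here); the function name and the toy predictions are hypothetical.

```python
def stabilization_layer(per_layer_predictions):
    """Return the earliest layer index from which every later layer
    predicts the same token as the final layer."""
    final = per_layer_predictions[-1]
    layer = len(per_layer_predictions) - 1
    # Walk backwards while the preceding layer already agrees with the final answer.
    while layer > 0 and per_layer_predictions[layer - 1] == final:
        layer -= 1
    return layer


# Toy top-1 predictions for 7 layers (made-up values).
toy_predictions = ["the", "a", "yes", "no", "yes", "yes", "yes"]
stable_from = stabilization_layer(toy_predictions)
print(stable_from)  # 4
```

Note that layer 2 also predicts "yes" in the toy input, but the prediction flips afterwards, so it does not count as stable; only an unbroken run up to the final layer does.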
@article{zhang2026vscan,
title={VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models},
author={Ce Zhang and Kaixin Ma and Tianqing Fang and Wenhao Yu and Hongming Zhang and Zhisong Zhang and Haitao Mi and Dong Yu},
journal={Transactions on Machine Learning Research},
year={2026},
url={https://openreview.net/forum?id=KZYhyilFnt},
}