Journal of our experiments on VLM token pruning


This content originally appeared on DEV Community and was authored by Mushahid Intesum

@oldpilluwu and I have been keenly interested in how vision language models (VLMs) work and wanted to know whether these very large models can be made more efficient. Let's face it: the current AI landscape is dominated by massive models that are mostly usable only through API calls, and API usage can get expensive quickly. So we started poking into Llava-7B [1] to observe its characteristics and look for avenues for efficiency gains and redundancy removal. This post covers what we have found so far in this independent research endeavor.

Background on Llava Model:

Llava [1] is a family of vision language models introduced in 2023 in several sizes. For our experiments, we use the Llava-7B variant. Llava has two parts: a vision tower and an LLM. The vision tower processes the image. It breaks the input image into patch tokens (577 including the CLS token), adds positional information, and passes the tokens through the transformer layers of the vision tower. Llava uses CLIP [3] as its vision tower, which has 24 transformer layers, each with multiple attention heads.
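The 577-token figure can be checked with a little arithmetic. The sketch below assumes the CLIP ViT-L/14 configuration with a 336x336 input and 14x14 patches; neither value is stated in this post, so treat both as assumptions.

```python
# Assumed CLIP ViT-L/14 settings (not stated in the post): 336x336 input, 14x14 patches.
IMAGE_SIZE = 336
PATCH_SIZE = 14


def vision_token_count(image_size: int, patch_size: int) -> int:
    """Number of vision-tower tokens: one per image patch plus one CLS token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# 24 patches per side -> 576 patch tokens, plus the CLS token.
print(vision_token_count(IMAGE_SIZE, PATCH_SIZE))
```

Under these assumptions the patch grid is 24x24, which is also the shape a CLS-token attention map takes when reshaped into an image.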

The text is tokenized separately, and both the image and text tokens are passed to the LLM for cross-modal fusion, which produces the final output. A Llama-based 7B model (Vicuna-7B) serves as the LLM for Llava-7B.

Test Bed

For experimentation, we used the Llava-7B implementation from transformers [2]. Since Llava-7B requires at least 24 GB of VRAM to run, I used the GPU resources available on Kaggle. Limited availability made this a bottleneck. We experimented on different images with various prompts to observe the model's behavior. We selected 4 random images from the COCO validation and test sets and tried 4 different prompts. These are the images used for testing:

Figure 1: Images used for testing. Image ordering goes from left to right and top to bottom

Finding 1: Significant drop-off in attention entropy in Vision Tower

First, we analyzed the attention entropy at each layer of the vision tower to quantify how widely attention is dispersed across image tokens. Entropy is defined as:

$$H = -\sum_{i} p_i \log p_i$$

where $p_i$ is the normalized attention weight on token $i$. Entropy serves as a proxy for the breadth of the receptive field: high entropy implies attention spread diffusely over many tokens, whereas low entropy indicates concentrated focus on a few salient regions. Across a diverse set of natural and synthetic images, we observed a pronounced and reproducible drop in entropy between layers 10 and 11 on one side and layer 12 on the other, as illustrated in Figure 1. This inflection point suggests that layers 1–11 progressively refine their contextual aggregation, while layer 12 and those beyond begin to collapse onto a narrower subset of tokens. Notably, this pattern held regardless of image content or complexity, implying an architectural or learned bias in the middle-to-late vision layers. We hypothesize that low-entropy layers may prematurely discard secondary but potentially relevant features, thereby propagating suboptimal token representations into deeper layers. In the next section, we systematically probe individual token attention scores to determine whether these low-entropy dynamics indeed lead to persistent information loss during forward propagation.
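As a rough illustration of the entropy measure, here is a minimal NumPy sketch. The function names are hypothetical, and averaging over batch, heads, and query tokens to get one value per layer is an assumption; the post does not say how the per-layer values were aggregated.

```python
import numpy as np


def attention_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy (in nats) along the last axis.

    attn: array of shape [..., T] whose last axis is a probability
    distribution over tokens (a row of a softmax attention matrix).
    """
    p = np.clip(attn, eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)


def mean_entropy_per_layer(attn: np.ndarray) -> np.ndarray:
    """attn: [L, B, H, T, T] -> [L].

    Assumption: average the row entropies over batch, heads, and query tokens.
    """
    return attention_entropy(attn).mean(axis=(1, 2, 3))


# Toy rows over 577 tokens: uniform attention vs. all weight on one token.
T = 577
uniform_row = np.full(T, 1.0 / T)
peaked_row = np.zeros(T)
peaked_row[0] = 1.0

print(attention_entropy(uniform_row))  # maximum possible entropy, log(T)
print(attention_entropy(peaked_row))   # close to zero
```

A uniform row gives the maximum entropy of log(T), and a row that puts all its weight on a single token gives an entropy near zero, which is the "collapse" behaviour described above.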

Figure 1: Entropy graph of vision tower attention heads. The trend shows a drop in entropy from the halfway mark.

This was reflected in the CLS-token attention map of each layer. The first half of the layers show a wide distribution of attention, whereas the later ones show a sparse distribution.

Figure 2: CLS-token attention maps of the vision tower layers. Across different images, attention is clearly sparse in the latter half of the layers.
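A CLS-token attention map like the ones in Figure 2 can be built from one layer's attention tensor roughly as follows. This is a sketch under assumptions: the CLS token sits at index 0, heads are averaged, and the function name is hypothetical. Random softmax values stand in for real model outputs.

```python
import numpy as np


def cls_attention_map(layer_attn: np.ndarray) -> np.ndarray:
    """Turn one layer's attention [H, T, T] into a 2-D CLS-to-patch map.

    Assumes the CLS token is at index 0 and the T - 1 patches form a square grid.
    """
    cls_to_patch = layer_attn[:, 0, 1:]          # [H, P]: CLS row, patch columns
    mean_over_heads = cls_to_patch.mean(axis=0)  # [P]
    num_patches = mean_over_heads.shape[0]
    grid = int(round(np.sqrt(num_patches)))
    if grid * grid != num_patches:
        raise ValueError("patch count is not a perfect square")
    return mean_over_heads.reshape(grid, grid)


# Stand-in data: random softmax attention for 4 heads over 577 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 577, 577))
demo_attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

attention_map = cls_attention_map(demo_attn)
print(attention_map.shape)
```

The resulting 2-D array can be passed to any image plotting routine to visualize where the CLS token attends.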

Finding 2: Llava can tolerate visual token pruning before cross-modal fusion, especially in the latter attention heads of the vision tower

Based on the previous finding, we ran experiments to see how many attention scores in the vision tower fall below a given threshold and whether pruning image tokens with such low scores affects the model's inference performance.

Let n be the number of attention heads we wish to scan and t be the maximum attention score threshold. We want to find what percentage of attention scores in the last n heads are below threshold t.
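The measurement above can be sketched in NumPy as follows. The function name, the [L, B, H, T, T] layout, and reporting one percentage per layer are assumptions for illustration; the toy sizes and thresholds are chosen only so the example runs quickly and are not the values used in our experiments.

```python
import numpy as np


def percent_below(attn: np.ndarray, threshold: float, last_n_heads: int) -> np.ndarray:
    """Percentage of attention scores below `threshold`, per layer.

    attn: [L, B, H, T, T] attention probabilities.
    Only the last `last_n_heads` heads along the H axis are scanned.
    Returns an array of shape [L].
    """
    if not 1 <= last_n_heads <= attn.shape[2]:
        raise ValueError("last_n_heads must be between 1 and H")
    scanned = attn[:, :, -last_n_heads:, :, :]  # [L, B, n, T, T]
    below = scanned < threshold
    return 100.0 * below.reshape(attn.shape[0], -1).mean(axis=1)


# Stand-in data: random softmax attention with toy dimensions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 1, 4, 10, 10))
demo_attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

for t in (0.02, 0.05, 0.1):  # illustrative thresholds for the toy data
    print(t, percent_below(demo_attn, t, last_n_heads=2).round(2))
```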

We tested the percentage of attention scores below certain thresholds and present the results in the table below:

Percentage of Attention Scores Below Various Thresholds

| Layer | < 0.00005 | < 0.0003 | < 0.0004 | < 0.0005 | < 0.005 | < 0.05 |
|---|---|---|---|---|---|---|
| 1 | 0.82% | 13.96% | 19.76% | 25.38% | 99.65% | 100.00% |
| 2 | 8.08% | 30.77% | 35.95% | 40.95% | 97.32% | 100.00% |
| 3 | 6.46% | 32.52% | 40.81% | 48.30% | 98.82% | 100.00% |
| 4 | 5.84% | 28.39% | 37.76% | 47.07% | 99.31% | 100.00% |
| 5 | 4.76% | 30.78% | 38.34% | 45.55% | 97.53% | 100.00% |
| 6 | 2.20% | 27.84% | 36.09% | 43.13% | 97.56% | 100.00% |
| 7 | 0.00% | 9.57% | 17.20% | 25.62% | 98.51% | 100.00% |
| 8 | 0.03% | 6.16% | 10.36% | 15.82% | 98.62% | 100.00% |
| 9 | 0.04% | 7.73% | 12.86% | 18.47% | 97.81% | 100.00% |
| 10 | 2.08% | 15.49% | 21.95% | 28.01% | 95.00% | 100.00% |
| 11 | 0.04% | 10.74% | 16.95% | 23.33% | 95.74% | 99.98% |
| 12 | 0.17% | 14.88% | 22.18% | 28.58% | 94.89% | 99.98% |
| 13 | 0.66% | 21.53% | 31.55% | 40.64% | 96.86% | 99.72% |
| 14 | 5.89% | 47.40% | 57.06% | 63.80% | 97.89% | 99.36% |
| 15 | 3.75% | 40.18% | 51.04% | 60.34% | 98.30% | 99.35% |
| 16 | 11.05% | 44.39% | 53.24% | 61.02% | 98.13% | 99.35% |
| 17 | 9.72% | 46.95% | 56.74% | 64.59% | 97.84% | 99.31% |
| 18 | 15.25% | 40.60% | 48.07% | 54.99% | 97.60% | 99.36% |
| 19 | 20.20% | 46.34% | 53.83% | 59.77% | 97.49% | 99.38% |
| 20 | 18.83% | 43.90% | 51.63% | 58.17% | 97.02% | 99.41% |
| 21 | 23.20% | 47.12% | 54.46% | 60.34% | 97.01% | 99.45% |
| 22 | 4.73% | 45.62% | 52.39% | 57.87% | 96.78% | 99.50% |
| 23 | 24.97% | 51.91% | 57.64% | 62.88% | 95.82% | 99.67% |
| 24 | 19.57% | 52.44% | 58.51% | 63.48% | 95.58% | 99.53% |

Table 1: Percentage of attention scores below each threshold for all layers in the vision tower. Most values lie in the range (0.00005, 0.005).

As the preceding table shows, most of the attention scores lie between 0.00005 and 0.005 (exclusive). Based on this finding, we experimented with 0.0003, 0.0004 and 0.0005 as pruning thresholds.

Patch-wise Pruning:

In this technique, we compute each patch’s importance by averaging the attention it receives from the CLS token over all layers and over a selected subset of the final N heads. From the full attention tensor of shape [L, B, H, T, T], we extract CLS-to-patch scores (yielding [L, B, H, P]), retain only the last N heads ([L, B, H′, P]), and collapse layers and heads to produce a [B, P] matrix of mean attention values. Patches whose scores fall below a fixed threshold t are deemed uninformative; their corresponding feature vectors in the [B, P, hidden_dim] image tensor are zeroed out, while all others remain intact.
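The steps above can be sketched in NumPy as follows. This is a minimal illustration, not our exact implementation: the function name is hypothetical, the CLS token is assumed to be at index 0, and tiny random tensors stand in for real model outputs.

```python
import numpy as np


def patchwise_prune(attn, image_features, threshold, last_n_heads):
    """Zero out patch features whose mean CLS-to-patch attention is below threshold.

    attn:           [L, B, H, T, T] attention probabilities (CLS assumed at index 0)
    image_features: [B, P, hidden_dim] with P = T - 1
    Returns (pruned_features, keep_mask) where keep_mask has shape [B, P].
    """
    cls_to_patch = attn[:, :, :, 0, 1:]                 # [L, B, H, P]
    kept_heads = cls_to_patch[:, :, -last_n_heads:, :]  # [L, B, N, P]
    scores = kept_heads.mean(axis=(0, 2))               # [B, P], mean over layers and heads
    keep_mask = scores >= threshold                     # True = patch survives
    pruned = image_features * keep_mask[:, :, None]     # broadcast over hidden_dim
    return pruned, keep_mask


# Stand-in data with toy dimensions: 3 layers, 4 heads, 1 CLS + 16 patches.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 1, 4, 17, 17))
demo_attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
demo_features = rng.normal(size=(1, 16, 8))

pruned, keep = patchwise_prune(demo_attn, demo_features, threshold=0.06, last_n_heads=2)
print(f"kept {int(keep.sum())} of {keep.size} patches")
```

Note that zeroing keeps the tensor shape unchanged, so the number of image tokens passed on to the LLM stays the same; only their content is masked.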

The results of pruning out tokens using the previously mentioned thresholds are given in the following table:

| Image | Prompt | t = 0.0003 | t = 0.0004 | t = 0.0005 |
|---|---|---|---|---|
| Image 1 | Are there any sort of cutlery in this image? | No, there are no cutlery items like forks or knives in the image. The woman is holding two donuts on a paper plate | No, there are no cutlery items like forks or knives in the image. The woman is holding two donuts on a paper plate | Yes, there is a knife in the image. |
| Image 1 | Describe this image | The image features a woman wearing a hat and glasses, standing in front of a doughnut shop. She is holding two plates, | The image features a woman standing in front of a doughnut shop, holding two plates with doughnuts on them. She is sm | The image features a woman wearing a hat and glasses, standing in front of a doughnut shop. She is holding two plates, |
| Image 1 | Is it night or day in this image? | It is night in the image. | It is daytime in the image. | It is daytime in the image. |
| Image 1 | List all the objects here | 1. Donuts 2. Ice cream cones 3. A woman 4. A doughnut 5. A | The objects in the image include a woman holding two donuts, a doughnut on a plate, a doughnut on a napkin, | 1. Donuts 2. Ice cream 3. Woman 4. Donuts 5. Ice cream 6. Woman |
| Image 2 | Are there any sort of cutlery in this image? | Yes, there is a pair of scissors in the image. | No, there are no knives or any other sort of cutlery visible in the image. The focus is on the woman holding a kite | Yes, there is a woman in the image who is holding a pair of scissors. |
| Image 2 | Describe this image | The image features a woman standing in a grassy field, holding a kite in her hand. She is wearing a red shirt and blue | The image features a woman standing in a grassy field, holding a kite in her hand. She is wearing a red shirt and appears | The image features a woman standing in a grassy field, holding a kite in her hand. She is wearing a red shirt and blue |
| Image 2 | Is it night or day in this image? | Day | It is daytime in the image. | It is daytime in the image. |
| Image 2 | List all the objects here | [Empty response with newlines] | In the image, there are several objects, including a woman holding a kite, a car, a truck, a bus, a boat, | [Empty response with newlines] |
| Image 3 | Are there any sort of cutlery in this image? | Yes, there is a knife in the image. | No, there are no cutlery items in this image. The image features a red and white ambulance and a red bus parked next | Yes, there is a cutlery in the image. |
| Image 3 | Describe this image | The image depicts a city street at night, with a red and white ambulance parked next to a red and black bus. The | The image depicts a city street at night, with a red and white bus driving down the road. A white ambulance is parked | The image shows a city street at night, with a red and white ambulance parked next to a red and black bus. The bus is |
| Image 3 | Is it night or day in this image? | It is night in this image. | It is night in this image. | It is night in this image. |
| Image 3 | List all the objects here | 1. Ambulance 2. Bus 3. Tree 4. Firemen 5. Fire truck 6. Bus | In the image, there are two buses, a truck, and a fire hydrant. The buses are parked next to each other | 1. Ambulance 2. Bus 3. tree 4. firemen 5. fire truck 6. |
| Image 4 | Are there any sort of cutlery in this image? | Yes, there are two cats in the image. | Yes, there is a knife in the image. | Yes, there are two cats in the image. |
| Image 4 | Describe this image | The image features two cats lying on a pink couch. One cat is located on the left side of the couch, while the other | The image features two cats lying on a pink couch. One cat is positioned on the left side of the couch, while the | The image features two cats lying on a pink couch. One cat is on the left side of the couch, while the other cat |
| Image 4 | Is it night or day in this image? | It is daytime in the image. | It is night in this image, as the two cats are sleeping on the couch. | It is daytime in the image. |
| Image 4 | List all the objects here | 1. Cat 2. Remote 3. Couch | The objects in the image include a cat, a remote control, and a couch. | The objects in the image include a cat, a remote control, a couch, and a blanket. |

Table 2: Patch-wise pruning on different thresholds

Experimentally, we found that pruning the last 8 heads yielded the best results. Pruning more than 8 heads caused significant performance degradation, and pruning fewer gave no performance improvement. Which threshold worked best depended on the prompt. For example, for the cutlery prompt ("Are there any sort of cutlery in this image?") on Image 2, only threshold 0.0004 answered correctly, whereas the others failed. For that same prompt, every threshold caused the model to hallucinate on at least one image. There were also instances, such as the object-listing prompt on Image 2, where only threshold 0.0004 generated a response at all.

Layer-wise Pruning:

In this layer‐wise pruning scheme, for each layer l we extract the CLS‐to‐patch attention scores from the full tensor of shape [L, B, H, T, T], yielding [L, B, H, P], and retain only the final N heads to form [L, B, H′, P]. We then average over those N heads within each layer to produce a [B, P] matrix of mean attention values per layer, apply a fixed threshold to identify uninformative patch–head pairs, and expand the resulting binary mask into a full [B, P, hidden_dim] feature mask by assigning each head to its contiguous hidden‐dimension segment. Finally, this mask is applied element‐wise to the image features [B, P, hidden_dim], zeroing out only those head‐specific features that fall below the threshold while preserving all others.
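One possible reading of this scheme is sketched below in NumPy. Several details are assumptions made for illustration: scores are averaged over layers to get one value per patch-head pair, hidden_dim is split into H equal contiguous segments (one per head), and segments belonging to heads outside the last N are left untouched. The function name is hypothetical.

```python
import numpy as np


def headwise_feature_mask(attn, image_features, threshold, last_n_heads):
    """Zero out head-specific feature segments for low-attention patch-head pairs.

    attn:           [L, B, H, T, T] attention probabilities (CLS assumed at index 0)
    image_features: [B, P, hidden_dim], hidden_dim divisible by H (assumption)
    Returns (masked_features, mask) where mask has shape [B, P, hidden_dim].
    """
    _, batch, num_heads, _, _ = attn.shape
    _, num_patches, hidden_dim = image_features.shape
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by the number of heads")
    head_dim = hidden_dim // num_heads

    cls_to_patch = attn[:, :, -last_n_heads:, 0, 1:]  # [L, B, N, P]
    scores = cls_to_patch.mean(axis=0)                # [B, N, P], mean over layers
    keep_pairs = scores >= threshold                  # True = keep this patch-head pair

    mask = np.ones((batch, num_patches, hidden_dim), dtype=bool)
    first_head = num_heads - last_n_heads
    for i in range(last_n_heads):
        start = (first_head + i) * head_dim           # contiguous segment of head i
        mask[:, :, start:start + head_dim] = keep_pairs[:, i, :, None]
    return image_features * mask, mask


# Stand-in data with toy dimensions: 3 layers, 4 heads, 1 CLS + 16 patches.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 1, 4, 17, 17))
demo_attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
demo_features = rng.normal(size=(1, 16, 8))

masked, mask = headwise_feature_mask(demo_attn, demo_features, threshold=0.06, last_n_heads=2)
print(f"zeroed {int((~mask).sum())} of {mask.size} feature values")
```

Compared with patch-wise pruning, this masks only part of a patch's feature vector, which may explain why responses stayed more consistent across thresholds.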

The table below shows the generation results for different threshold values:

| Image | Prompt | t = 0.0003 | t = 0.0004 | t = 0.0005 |
|---|---|---|---|---|
| Image 1 | Are there any sort of cutlery in this image? | No, there are no cutlery items in the image. The focus is on the two donuts being held by the person. | No, there are no cutlery items in the image. The focus is on the two donuts being held by the person. | No, there are no cutlery items in the image. The focus is on the two donuts being held by the person. |
| Image 1 | Describe this image | The image features a woman standing in front of a doughnut stand, holding up two doughnuts for display. One doughnut is | The image features a woman standing in front of a doughnut stand, holding up two doughnuts for display. One doughnut is | The image features a woman standing in front of a doughnut stand, holding up two doughnuts for display. One doughnut is |
| Image 1 | Is it night or day in this image? | It is daytime in the image. | It is daytime in the image. | It is daytime in the image. |
| Image 1 | List all the objects here | 1. Donuts 2. Doughnuts 3. Doughnuts 4. Doughnuts 5. | 1. Donuts 2. Doughnuts 3. Doughnuts 4. Doughnuts 5. | 1. Donuts 2. Doughnuts 3. Doughnuts 4. Doughnuts 5. |
| Image 2 | Are there any sort of cutlery in this image? | Yes, there is a knife in the image. | Yes, there is a knife in the image. | Yes, there is a pair of scissors in the image. |
| Image 2 | Describe this image | The image features a woman standing in a grassy field, holding a kite that is flying high in the sky. She is wearing a red | The image features a woman standing in a grassy field, holding a kite that is flying high in the sky. She is wearing a red | The image features a woman wearing a red and blue dress, standing on a grassy field. She is holding a kite, which is flying |
| Image 2 | Is it night or day in this image? | It is daytime in the image. | It is daytime in the image. | It is daytime in the image. |
| Image 2 | List all the objects here | 1. Kite 2. Woman 3. Pants 4. Shirt 5. Sunglasses 6. | 1. Kite 2. Woman 3. Pants 4. Shirt 5. Sunglasses 6. | The objects present in the image include a woman wearing a red shirt and holding a kite, a kite, a field, a par |
| Image 3 | Are there any sort of cutlery in this image? | Yes, there is a knife in the image, which is located near the bus. | Yes, there is a knife in the image, which is located near the bus. | Yes, there is a knife in the image, which is located on the side of the road. |
| Image 3 | Describe this image | The image depicts a city street at night, with a large red bus parked on the side of the road. The bus is positioned | The image depicts a city street at night, with a large red bus parked on the side of the road. The bus is positioned | The image depicts a city street at night, with a large red bus parked on the side of the road. The bus is positioned |
| Image 3 | Is it night or day in this image? | It is night in this image. | It is night in this image. | It is night in this image. |
| Image 3 | List all the objects here | 1. Bus 2. Tree 3. Fire hydrant 4. Person 5. Bus stop 6. Fire truck | 1. Bus 2. Tree 3. Fire hydrant 4. Person 5. Bus stop 6. Fire truck | 1. Bus 2. Tree 3. Fire hydrant 4. Bus stop 5. Fire truck 6. Bus |
| Image 4 | Are there any sort of cutlery in this image? | Yes, there is a knife in the image, which is placed on the couch along with the two cats. | Yes, there is a knife in the image, which is placed on the couch along with the two cats. | No, there are no cutlery items in the image. The image features a cat and a remote control, with the cat lying on a p |
| Image 4 | Describe this image | The image features a couch with two cats lying on it. One cat is positioned towards the left side of the couch, while the | The image features a couch with two cats lying on it. One cat is positioned towards the left side of the couch, while the | The image features a couch with two cats lying on it. One cat is positioned towards the left side of the couch, while the |
| Image 4 | Is it night or day in this image? | It is daytime in the image. | It is daytime in the image. | It is daytime in the image. |
| Image 4 | List all the objects here | 1. Cat 2. Cat 3. Remote control 4. Remote control 5. Cat 6. Cat | 1. Cat 2. Cat 3. Remote control 4. Remote control 5. Cat 6. Cat | 1. Cat 2. Cat 3. Remote control 4. Remote control 5. Cat 6. Cat |

Table 3: Layer-wise pruning on different thresholds

For layer-wise pruning, we found that pruning the last 9 attention heads yielded the best results. Unlike patch-wise pruning, all thresholds produced responses, whether correct or incorrect, and the responses were largely consistent across thresholds. However, just as with patch-wise pruning, the model still hallucinated: on the cutlery prompt, every threshold produced a hallucinated answer for at least one image.

In the following table, we compare generation time for the unmodified model and our two pruning strategies:

Image 1

| Prompt | No Change (s) | Layer-wise Pruning (s) | Patch-wise Pruning (s) |
|---|---|---|---|
| Are there any sort of cutlery in this image? | 19.50 | 17.20 | 7.54 |
| Describe this image | 18.95 | 18.19 | 18.62 |
| Is it night or day in this image? | 5.70 | 5.44 | 5.48 |
| List all the objects here | 19.64 | 18.30 | 18.58 |

Image 2

| Prompt | No Change (s) | Layer-wise Pruning (s) | Patch-wise Pruning (s) |
|---|---|---|---|
| Are there any sort of cutlery in this image? | 18.45 | 9.12 | 12.53 |
| Describe this image | 19.21 | 18.28 | 18.62 |
| Is it night or day in this image? | 5.64 | 5.46 | 5.56 |
| List all the objects here | 19.13 | 18.44 | 18.56 |

Image 3

| Prompt | No Change (s) | Layer-wise Pruning (s) | Patch-wise Pruning (s) |
|---|---|---|---|
| Are there any sort of cutlery in this image? | 18.65 | 13.31 | 7.83 |
| Describe this image | 19.51 | 18.38 | 19.11 |
| Is it night or day in this image? | 5.44 | 4.92 | 4.92 |
| List all the objects here | 20.37 | 18.62 | 18.76 |

Image 4

| Prompt | No Change (s) | Layer-wise Pruning (s) | Patch-wise Pruning (s) |
|---|---|---|---|
| Are there any sort of cutlery in this image? | 32.89 | 32.13 | 13.68 |
| Describe this image | 17.90 | 17.73 | 26.79 |
| Is it night or day in this image? | 12.05 | 5.28 | 5.31 |
| List all the objects here | 11.63 | 18.07 | 13.70 |

Table 4: Comparison of generation time (in seconds) for the unmodified model and both pruning strategies, for each prompt and image

Summing the times in the table, patch-wise pruning reduced total generation time by about 18.5% compared to the baseline (264.66 s to 215.59 s), with the best individual query running about 2.6x faster (19.50 s to 7.54 s for the cutlery prompt on Image 1). The more conservative layer-wise pruning approach reduced total generation time by about 9.7% (264.66 s to 238.87 s).

Not all tasks benefit equally from masking:

- Object Detection Queries: Patch-wise pruning dominated, reducing "cutlery detection" times from 22.37 s (baseline) to just 10.40 s on average
- Simple Binary Questions: "Is it night or day?" queries were fast across all approaches (about 5-6 seconds), apart from one 12.05 s baseline run on Image 4
- Complex Description Tasks: Layer-wise pruning was fastest on every image, while patch-wise pruning was mixed and much slower on Image 4 (26.79 s)
- Object Listing: Layer-wise pruning provided the most consistent performance

While patch-wise pruning won on raw speed, layer-wise pruning proved the most reliable: it was the fastest option in 11 of the 16 individual comparisons (68.75%) and tied for fastest in one more. It produced the slowest time only once and maintained steady performance across different image types and question complexity.
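Aggregate figures like these depend on how Table 4 is summarized. As a sanity check, the sketch below sums the per-query times for each strategy and counts which strategy was fastest per query. Summing totals is only one possible aggregation (an assumption here); averaging per-query percentages would give different numbers.

```python
from collections import Counter

# Times in seconds copied from Table 4, as (baseline, layer-wise, patch-wise).
# Row order per image: cutlery, describe, night-or-day, list objects.
TIMES = {
    "Image 1": [(19.50, 17.20, 7.54), (18.95, 18.19, 18.62), (5.70, 5.44, 5.48), (19.64, 18.30, 18.58)],
    "Image 2": [(18.45, 9.12, 12.53), (19.21, 18.28, 18.62), (5.64, 5.46, 5.56), (19.13, 18.44, 18.56)],
    "Image 3": [(18.65, 13.31, 7.83), (19.51, 18.38, 19.11), (5.44, 4.92, 4.92), (20.37, 18.62, 18.76)],
    "Image 4": [(32.89, 32.13, 13.68), (17.90, 17.73, 26.79), (12.05, 5.28, 5.31), (11.63, 18.07, 13.70)],
}
STRATEGIES = ("baseline", "layer-wise", "patch-wise")


def percent_reduction(baseline: float, pruned: float) -> float:
    """Relative time saved versus baseline, in percent."""
    return 100.0 * (baseline - pruned) / baseline


rows = [row for image_rows in TIMES.values() for row in image_rows]
totals = [sum(row[i] for row in rows) for i in range(3)]

# Count the fastest strategy per query; exact ties are counted separately.
fastest = Counter()
for row in rows:
    winners = [STRATEGIES[i] for i, value in enumerate(row) if value == min(row)]
    fastest[winners[0] if len(winners) == 1 else "tie"] += 1

for name, total in zip(STRATEGIES[1:], totals[1:]):
    print(f"{name}: {total:.2f} s total, {percent_reduction(totals[0], total):.1f}% less than baseline")
print(dict(fastest))
```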

Future Directions

This is an ongoing experiment. This blog serves as a journal of our progress so far and will be updated as new findings come in. A lot still needs to be added; a more comprehensive reference section and more thorough experimentation are the two things that come to mind. Hopefully we'll be able to get a paper out of this independent endeavor of ours.

References

[1] H. Liu, C. Li, Q. Wu, Y. J. Lee, “Visual Instruction Tuning,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), vol. 36, pp. 34892–34916 (2023).

[2] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, “Transformers: State-of-the-Art Natural Language Processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (Oct. 2020).

[3] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision” (2021). URL https://arxiv.org/abs/2103.00020.


Mushahid Intesum | Sciencx (2025-08-02T04:14:39+00:00) Journal of our experiments on VLM token pruning. Retrieved from https://www.scien.cx/2025/08/02/journal-of-our-experiments-on-vlm-token-pruning/
