Evaluating Visual Adapters: MIVPG Performance on Single and Multi-Image Inputs

This section details the MIVPG experiments across single- and multi-image scenarios. The model keeps the LLM and visual encoder frozen and updates only the MIVPG for efficiency.


This content originally appeared on HackerNoon and was authored by Instancing

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

4. Experiments

To assess the effectiveness of our proposed approach, we conduct evaluations across various scenarios:

  1. where each sample comprises a single image, and patches are naturally considered as instances;

  2. where each sample includes multiple images, but we use a general embedding for each image;

  3. where each sample contains multiple images, with each image containing multiple patches.
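As a rough illustration of how the three scenarios differ, the sketch below counts the instances that would form the bag passed to the visual prompt generator in each case. The `BagShape` class, the `bag_shape` function, and the figure of 256 patches per image are hypothetical choices made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BagShape:
    """Hypothetical description of the instance bag for one sample."""
    num_images: int
    instances_per_image: int

    @property
    def total_instances(self) -> int:
        return self.num_images * self.instances_per_image


def bag_shape(scenario: int, num_images: int = 1, patches_per_image: int = 1) -> BagShape:
    """Return the bag shape for one of the three evaluation scenarios."""
    if scenario == 1:
        # Single image: its patches are the instances.
        return BagShape(1, patches_per_image)
    if scenario == 2:
        # Multiple images: one general embedding per image.
        return BagShape(num_images, 1)
    if scenario == 3:
        # Multiple images, each contributing multiple patches.
        return BagShape(num_images, patches_per_image)
    raise ValueError(f"unknown scenario: {scenario}")


# Illustrative numbers only (assumed, not reported in the text).
single = bag_shape(1, patches_per_image=256)
general = bag_shape(2, num_images=5)
nested = bag_shape(3, num_images=5, patches_per_image=256)
print(single.total_instances, general.total_instances, nested.total_instances)
```

The third scenario is the only one in which the bag grows with both the number of images and the number of patches per image.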

4.1. General Setup

We initialize our model from BLIP2 [22] with FLAN-T5-XL, and MIVPG is initialized with weights from QFormer. The model consists of a frozen language model and a frozen visual model; during training, we update only the MIVPG. The visual encoder, ViT-G, encodes each image into patch embeddings, and images are resized to 224 × 224. In our experiments, we observed that unfreezing the visual encoder does not yield additional improvements on small datasets. Further details can be found in Supplementary C.1.
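The training setup described above, where only the MIVPG is updated, can be sketched without any framework as a selection over parameter names. This is a minimal sketch under stated assumptions: the prefixes `visual_encoder.`, `language_model.`, and `mivpg.` and the example parameter names are hypothetical and may not match the actual BLIP2 module names.

```python
# Hypothetical parameter-name prefixes; real module names may differ.
FROZEN_PREFIXES = ("visual_encoder.", "language_model.")
TRAINABLE_PREFIXES = ("mivpg.",)


def trainable_flags(param_names):
    """Map each parameter name to True (updated) or False (frozen)."""
    flags = {}
    for name in param_names:
        if name.startswith(TRAINABLE_PREFIXES):
            flags[name] = True
        elif name.startswith(FROZEN_PREFIXES):
            flags[name] = False
        else:
            # Fail loudly rather than silently training an unknown group.
            raise ValueError(f"unrecognized parameter group: {name}")
    return flags


# Made-up parameter names, for illustration only.
example_names = [
    "visual_encoder.blocks.0.attn.qkv.weight",
    "mivpg.query_tokens",
    "mivpg.layers.0.crossattention.query.weight",
    "language_model.encoder.block.0.layer.0.weight",
]
flags = trainable_flags(example_names)
trainable = [name for name, is_trainable in flags.items() if is_trainable]
print(trainable)
```

In a PyTorch training script, flags like these would typically be applied by setting each parameter's `requires_grad` attribute and passing only the trainable parameters to the optimizer.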

:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
