Table of Links
Related Work
Methodology
3.1. Preliminaries and Notations
3.2. Relations between Attention-based VPG and MIL
3.3. MIVPG for Multiple Visual Inputs
3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios
Experiments and 4.1. General Setup
4.2. Scenario 1: Samples with Single Image
4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
Supplementary Material
A. Detailed Architecture of QFormer
4. Experiments
To assess the effectiveness of our proposed approach, we conduct evaluations across three scenarios (the corresponding input shapes are sketched in code after this list):

1. where each sample comprises a single image, and its patches are naturally treated as instances;

2. where each sample includes multiple images, but we use a single general embedding for each image;

3. where each sample contains multiple images, with each image contributing multiple patches.
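As a rough sketch, the three scenarios differ only in how visual instances are arranged before being passed to MIVPG. The shapes below are illustrative assumptions: the 257-token, 1408-dimensional figure assumes BLIP2's ViT-G/14 backbone on a 224 × 224 input, and the tensor names are hypothetical.

```python
import torch

# Hypothetical shapes for the three scenarios: B is the batch size, N the
# images per sample, P the patch tokens per image, D the embedding dimension
# (ViT-G/14 on a 224x224 image yields 257 tokens of dimension 1408).
B, N, P, D = 2, 4, 257, 1408

# Scenario 1: one image per sample; its patches are the instances.
single_image = torch.randn(B, P, D)

# Scenario 2: multiple images per sample, each collapsed to one general
# embedding, so the images themselves act as the instances.
multi_image_pooled = torch.randn(B, N, D)

# Scenario 3: multiple images per sample, each keeping its patch tokens:
# a two-level hierarchy of instances (images) and sub-instances (patches).
multi_image_patches = torch.randn(B, N, P, D)
```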
4.1. General Setup
We initialize our model from BLIP2 [22] with FLAN-T5-XL, and MIVPG is initialized with the weights of QFormer. Both the language model and the visual encoder remain frozen; during training, we update only the MIVPG. The visual encoder, ViT-G, encodes images into patch embeddings, with images resized to 224 × 224. In our experiments, we observed that unfreezing the visual encoder does not yield additional improvements on small datasets. Further details can be found in the supplementary material, Section C.1.
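A minimal sketch of this training setup, with toy stand-in modules (`ToyVLM`, `freeze_all_but_adapter`, and the attribute name `mivpg` are hypothetical; the actual BLIP2/LAVIS code layout differs): every parameter is frozen except the QFormer-initialized adapter, which alone receives gradients.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in the real setup the vision tower is ViT-G and the
# language model is FLAN-T5-XL; both stay frozen and only MIVPG trains.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(1408, 1408)  # stands in for ViT-G
        self.language_model = nn.Linear(2048, 2048)  # stands in for FLAN-T5-XL
        self.mivpg = nn.Linear(1408, 2048)           # adapter init'd from QFormer

def freeze_all_but_adapter(model: nn.Module, adapter_attr: str = "mivpg"):
    """Freeze all parameters, then re-enable gradients for the adapter only."""
    for p in model.parameters():
        p.requires_grad = False
    for p in getattr(model, adapter_attr).parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

model = ToyVLM()
trainable = freeze_all_but_adapter(model)
optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # learning rate is illustrative
```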
:::info Authors:
(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);
(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);
(3) Qi Li, Amazon (qlimz@amazon.com);
(4) Rob Barton, Amazon (rab@amazon.com);
(5) Boxin Du, Amazon (boxin@amazon.com);
(6) Shioulin Sam, Amazon (shioulin@amazon.com);
(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);
(8) Ismail Tutar, Amazon (ismailt@amazon.com);
(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).
:::
:::info This paper is available on arXiv under the CC BY 4.0 DEED (Attribution 4.0 International) license.
:::