Gigapixel Pathology: MIVPG Outperforms Baselines in Medical Captioning

MIVPG significantly outperforms baselines by modeling correlations among instances and shows strong domain adaptation as training epochs increase.


This content originally appeared on HackerNoon and was authored by Instancing

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

Next, we evaluate our method in scenarios involving multiple images, where each image contributes a single embedding as its representation. Specifically, we use the PatchGastricADC22 [36] dataset, a Whole Slide Image (WSI) dataset comprising 991 WSIs of H&E-stained gastric adenocarcinoma specimens, each accompanied by a diagnostic caption extracted directly from existing medical reports. The dataset contains 262,777 medical patches in total, with individual WSIs containing up to 1,860 patches. Each patch is 300 × 300 pixels and is resized before being encoded by the visual encoder. Following the methodology outlined in [36], the dataset is partitioned into training, validation, and test subsets with a 0.7/0.1/0.2 split. We compare the proposed method against the baselines in [36], each a combination of a visual model (DenseNet121 [15] or EfficientNetB3 [35]) and an LSTM [12] as the language model. To ensure a fair comparison, we run three experiments with different random seeds and apply the same data augmentation as [36]. In a medical patch, the focus is typically on global information rather than local details; moreover, since a WSI can comprise a large number of patches, we aim to reduce computational overhead. We therefore use only the [CLS] token output by the ViT as the representation of an entire medical patch, so that P = 1.
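To make the setup concrete, the sketch below shows how a WSI can be turned into a bag of instance embeddings by keeping only the [CLS] token per patch (P = 1). The `encode_patch` function is a hypothetical stand-in for a pretrained ViT forward pass, not the paper's actual encoder; shapes and the 768-dimensional embedding size are illustrative assumptions.

```python
import numpy as np

def encode_patch(patch, dim=768, rng=None):
    """Stand-in for a ViT forward pass: returns token embeddings of
    shape (num_tokens, dim), with index 0 being the [CLS] token.
    A real pipeline would call a pretrained ViT here (assumption)."""
    rng = rng or np.random.default_rng(0)
    num_tokens = 1 + (patch.shape[0] // 16) * (patch.shape[1] // 16)
    return rng.standard_normal((num_tokens, dim))

def wsi_to_bag(patches, dim=768):
    """Represent a WSI as a bag of instance embeddings: one [CLS]
    vector per patch, i.e. P = 1 in the paper's notation."""
    rng = np.random.default_rng(0)
    cls_tokens = [encode_patch(p, dim, rng)[0] for p in patches]
    return np.stack(cls_tokens)  # shape: (num_patches, dim)

# A toy WSI with 5 patches, resized to 224x224 before encoding.
patches = [np.zeros((224, 224, 3)) for _ in range(5)]
bag = wsi_to_bag(patches)
print(bag.shape)  # (5, 768)
```

Keeping only one vector per patch is what keeps the per-WSI bag small enough to process, even when a slide contributes hundreds of patches.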

As demonstrated in Table 1, our method outperforms the baselines significantly, highlighting the effectiveness of employing large-scale models in downstream tasks. Moreover, the experiments show that the model performs even better when correlations among instances are considered, underscoring the effectiveness of our CSA module. We are also interested in how the captions generated by the LLM evolve as the number of training epochs increases. Given the substantial domain gap between medical and natural images, existing MLLMs have rarely been trained on medical images and are therefore less domain-specific for medical analysis. As depicted in Figure 5, under the zero-shot setting BLIP2 struggles to generate detailed captions for the provided WSIs. With more training epochs, however, the model acquires domain-specific knowledge and produces more relevant captions. Much like human learning, a discernible trend emerges: the model initially generates very general captions and gradually incorporates more detail as the number of epochs increases.
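The instance-correlation idea behind the CSA module can be illustrated with a minimal self-attention pass over the bag, so each instance embedding is refined by its similarity to the others before pooling. This is a hypothetical sketch of scaled dot-product self-attention, not the authors' exact CSA implementation; the random weight matrices are placeholder assumptions for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instance_self_attention(bag, rng=None):
    """Scaled dot-product self-attention over a bag of instance
    embeddings (num_instances, dim). Each row of the attention
    matrix encodes one instance's correlation with all others.
    Illustrative only; the paper's CSA module may differ."""
    n, d = bag.shape
    rng = rng or np.random.default_rng(0)
    # Placeholder random projections standing in for learned weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = bag @ Wq, bag @ Wk, bag @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n, n) instance correlations
    return attn @ V                       # correlation-aware instance features

bag = np.random.default_rng(1).standard_normal((5, 768))
out = instance_self_attention(bag)
print(out.shape)  # (5, 768)
```

The output keeps one vector per instance, so the refined bag can be fed to the same downstream aggregation as the original one.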

Figure 5. Visualization of inference results on PatchGastricADC22. We highlight details in the reference captions that the model should focus on. Zero-shot inference is performed with the pretrained BLIP2 [22]. As the number of epochs increases, the model acquires more domain knowledge.


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::

[1] For consistency, we opted for metrics implemented in https://github.com/salaniz/pycocoevalcap.
