Visual Prompt Generation: Cross-Attention in Q-Former

This section details the Q-Former architecture: a 12-layer BERT-based model that takes 32 learnable query embeddings as input. These queries extract visual information via cross-attention and, after projection, serve as visual prompts for the MLLM.


This content originally appeared on HackerNoon and was authored by Instancing

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, the QFormer is initialized from a BERT-based model [8] comprising L = 12 layers. In contrast to typical BERT models that process textual inputs, the QFormer takes R = 32 learnable query embeddings as input. These embeddings extract visual information from the input visual data during Stage-1 pretraining in BLIP-2 [22]. After projection, they serve as visual prompt embeddings for the LLM input.
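To make the data flow concrete, here is a minimal PyTorch sketch (not the authors' released code) of the learnable queries and the projection into the LLM embedding space. The values R = 32 and the BERT-base hidden size of 768 come from the text; the class name, the LLM dimension of 4096, and the `nn.Identity()` stand-in for the 12-layer QFormer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualPromptGenerator(nn.Module):
    """Sketch of QFormer-style visual prompt generation (names are hypothetical)."""

    def __init__(self, num_queries=32, hidden_dim=768, llm_dim=4096):
        super().__init__()
        # R = 32 learnable query embeddings, shared across all samples
        self.query_embeds = nn.Parameter(torch.zeros(1, num_queries, hidden_dim))
        nn.init.trunc_normal_(self.query_embeds, std=0.02)
        # Placeholder for the 12-layer BERT-based QFormer (see the layer sketch below)
        self.qformer = nn.Identity()
        # Linear projection from QFormer outputs to LLM input embeddings
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, visual_embeds):
        # visual_embeds: (batch, num_patches, hidden_dim) from a frozen image encoder
        batch = visual_embeds.size(0)
        queries = self.query_embeds.expand(batch, -1, -1)
        # In the real model, queries attend to visual_embeds via cross-attention
        # inside the QFormer; nn.Identity() stands in for that stack here.
        queries = self.qformer(queries)
        # (batch, 32, llm_dim) visual prompt embeddings fed to the LLM
        return self.llm_proj(queries)
```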

Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of a Linear layer, LayerNorm, and a Residual connection). A cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and forward modules into self-(cross-)attention modules, and we illustrated only the modifications made to the cross-attention module in MIVPG, since the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
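The layer structure described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the released implementation: the choice G = 2 and the 4x feed-forward expansion are assumptions, and the class names are hypothetical.

```python
import torch
import torch.nn as nn

class QFormerLayer(nn.Module):
    """One QFormer layer: self-attention, optional cross-attention, feed-forward,
    each followed by a residual connection and LayerNorm (sketch only)."""

    def __init__(self, hidden_dim=768, num_heads=12, has_cross_attention=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.self_norm = nn.LayerNorm(hidden_dim)
        self.has_cross_attention = has_cross_attention
        if has_cross_attention:
            # Randomly initialized cross-attention: queries attend to visual embeddings
            self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.cross_norm = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        self.ffn_norm = nn.LayerNorm(hidden_dim)

    def forward(self, queries, visual_embeds=None):
        # Self-attention over the query embeddings + residual + LayerNorm
        attn_out, _ = self.self_attn(queries, queries, queries)
        queries = self.self_norm(queries + attn_out)
        # Cross-attention (only in layers where it is inserted)
        if self.has_cross_attention and visual_embeds is not None:
            cross_out, _ = self.cross_attn(queries, visual_embeds, visual_embeds)
            queries = self.cross_norm(queries + cross_out)
        # Feed-forward + residual + LayerNorm
        return self.ffn_norm(queries + self.ffn(queries))

# Stack L = 12 layers, inserting cross-attention every G layers
# (G = 2 is an illustrative choice; the text leaves G as a hyperparameter).
G = 2
layers = nn.ModuleList([QFormerLayer(has_cross_attention=(i % G == 0)) for i in range(12)])
```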

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::
