what is the mathematical realization of attention maps from multiple heads?



This content originally appeared on DEV Community and was authored by Henri Wang

The mathematical realization of attention maps from multiple heads in a Vision Transformer (ViT) like ViT-S/8 trained with DINO involves computing the self-attention scores for each head and then visualizing them, often for a specific query (e.g., the [CLS] token). Here's a step-by-step breakdown:

1. Self-Attention in Multi-Head Attention (MHA)

In a transformer, the input embeddings are split into \( H \) heads (e.g., \( H = 6 \) for ViT-S). For each head \( h \), the self-attention is computed as:

\[
\text{Attention}_h(Q_h, K_h, V_h) = \text{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d_k}}\right) V_h
\]

where:

  • \( Q_h = X W_h^Q \) (queries for head \( h \)),
  • \( K_h = X W_h^K \) (keys for head \( h \)),
  • \( V_h = X W_h^V \) (values for head \( h \)),
  • \( X \) is the input embedding (including positional encoding),
  • \( W_h^Q, W_h^K, W_h^V \) are learned projection matrices for head \( h \),
  • \( d_k \) is the dimension of the key vectors (typically \( d_k = d_{\text{model}} / H \)).
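The per-head projection and softmax above can be sketched directly in PyTorch. This is a minimal illustration with random placeholder weights and a tiny token count, not trained DINO parameters; only the shapes (ViT-S-style \( d_{\text{model}} = 384 \), \( H = 6 \)) follow the text:

```python
import torch

torch.manual_seed(0)
B, N_tokens, d_model, H = 1, 5, 384, 6   # N_tokens = patches + [CLS]
d_k = d_model // H                       # 64 dims per head

X = torch.randn(B, N_tokens, d_model)    # embeddings + positional encoding
W_Q = torch.randn(d_model, d_model) / d_model ** 0.5
W_K = torch.randn(d_model, d_model) / d_model ** 0.5
W_V = torch.randn(d_model, d_model) / d_model ** 0.5

def split_heads(t):
    # [B, N, D] -> [B, H, N, d_k]: split channels into H heads
    return t.view(B, N_tokens, H, d_k).transpose(1, 2)

Q, K, V = (split_heads(X @ W) for W in (W_Q, W_K, W_V))

# Attention_h = softmax(Q_h K_h^T / sqrt(d_k)) V_h, batched over heads
A = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # [B, H, N, N]
out = A @ V                                                      # [B, H, N, d_k]

print(A.shape)  # torch.Size([1, 6, 5, 5])
```

Each of the \( H \) slices `A[:, h]` is one head's full token-to-token attention map; every row sums to 1 because the softmax is taken over the key axis.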

2. Attention Scores for [CLS] Query

When visualizing attention for the [CLS] token (often used as a global image representation), we focus on the attention scores where the query is the [CLS] token. For head \( h \), the attention scores are:

\[
A_h = \text{softmax}\left(\frac{q_{\text{[CLS]},h} K_h^T}{\sqrt{d_k}}\right)
\]

where \( q_{\text{[CLS]},h} \) is the query vector for the [CLS] token in head \( h \).
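Concretely, this amounts to selecting one row of each head's attention matrix: the row whose query is token 0. A small sketch with random stand-in tensors (shapes assume 16 patches plus [CLS]):

```python
import torch

torch.manual_seed(0)
B, H, N_plus_1, d_k = 1, 6, 17, 64        # 16 patches + [CLS]
Q = torch.randn(B, H, N_plus_1, d_k)
K = torch.randn(B, H, N_plus_1, d_k)

q_cls = Q[:, :, 0:1, :]                   # [CLS] query row, [B, H, 1, d_k]
scores = q_cls @ K.transpose(-2, -1)      # [B, H, 1, N+1]
A_cls = torch.softmax(scores / d_k ** 0.5, dim=-1).squeeze(2)  # [B, H, N+1]

print(A_cls.shape)  # torch.Size([1, 6, 17])
```

`A_cls[:, h]` is head \( h \)'s distribution over all \( N+1 \) tokens, i.e., exactly \( A_h \) above.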

3. Reshaping Attention Maps for Visualization

The attention scores \( A_h \) have shape \( (1, N+1) \), where \( N \) is the number of patches and the extra entry is the [CLS] token attending to itself. To visualize how the [CLS] token attends to image patches:

  • Discard the attention score for [CLS] itself (since it’s trivial).
  • Reshape the remaining \( N \) scores into a 2D grid corresponding to the spatial layout of patches (e.g., \( \sqrt{N} \times \sqrt{N} \) for a square image).
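These two steps can be sketched as follows. The numbers assume a ViT-S/8 on a 224×224 input, giving a 28×28 patch grid (\( N = 784 \)); the attention weights here are random placeholders:

```python
import torch

torch.manual_seed(0)
B, H = 1, 6
grid = 28                        # 224 / 8 patches per side for ViT-S/8
N = grid * grid                  # 784 patches
A_cls = torch.softmax(torch.randn(B, H, N + 1), dim=-1)  # [B, H, N+1]

patch_attn = A_cls[:, :, 1:]                 # discard the [CLS] self-score
maps = patch_attn.reshape(B, H, grid, grid)  # [B, H, 28, 28] spatial maps

print(maps.shape)  # torch.Size([1, 6, 28, 28])
```

After dropping the self-score, each map sums to slightly less than 1, which is why visualizations often renormalize or just display relative magnitudes.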

4. Multi-Head Attention Visualization

Each head \( h \) produces a distinct attention map \( A_h \), highlighting different regions of the image. These maps are often:

  • Color-coded: Each head is assigned a unique color (e.g., red, blue, green).
  • Overlaid: The maps are combined (e.g., averaged or max-pooled) to show collective attention.
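The two combination strategies mentioned above (averaging and max-pooling across the head axis) reduce the \( H \) maps to a single one. A minimal sketch on placeholder maps:

```python
import torch

torch.manual_seed(0)
maps = torch.rand(1, 6, 28, 28)      # H = 6 per-head [CLS] attention maps

mean_map = maps.mean(dim=1)          # averaged over heads: [1, 28, 28]
max_map = maps.amax(dim=1)           # max-pooled over heads: [1, 28, 28]
```

Averaging shows consensus attention; max-pooling preserves a region as long as any single head attends to it.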

5. Example in ViT-S/8 Trained with DINO

In DINO (self-distillation with no labels), the [CLS] token’s attention maps often correspond to semantically meaningful regions (e.g., object boundaries). The last layer’s heads may capture:

  • Head 1: Focus on object centers.
  • Head 2: Focus on edges.
  • Head 3: Attend to background context.

6. Mathematical Summary

For \( H \) heads, the attention maps for [CLS] are:

\[
\{A_h\}_{h=1}^{H}, \quad A_h \in \mathbb{R}^{1 \times N}
\]

These are reshaped into \( H \) 2D maps, each showing where the [CLS] token "looks" in head \( h \).

Pseudocode (PyTorch-like)

import torch

# X: input embeddings, shape [B, N+1, D] ([CLS] token at index 0)
# W_Q, W_K: learned projection matrices, shape [D, D]
# H: number of heads (e.g., H = 6 for ViT-S)

B, N_plus_1, D = X.shape
d_k = D // H

# Project, then split the channel dimension into H heads
q = (X @ W_Q).view(B, N_plus_1, H, d_k).transpose(1, 2)  # [B, H, N+1, d_k]
k = (X @ W_K).view(B, N_plus_1, H, d_k).transpose(1, 2)  # [B, H, N+1, d_k]

# Attention scores for the [CLS] query (token index 0); slicing with 0:1
# keeps the query dim so the batched matmul broadcasts correctly
attn_scores = (q[:, :, 0:1, :] @ k.transpose(-2, -1)).squeeze(2) / d_k ** 0.5  # [B, H, N+1]
attn_weights = torch.softmax(attn_scores, dim=-1)  # [B, H, N+1]

# Drop the [CLS]->[CLS] score and reshape to the spatial patch grid
cls_attn_maps = attn_weights[:, :, 1:]  # [B, H, N]
h_patches = w_patches = int((N_plus_1 - 1) ** 0.5)  # square image assumed
cls_attn_maps = cls_attn_maps.reshape(B, H, h_patches, w_patches)

Visualization

  • Each head’s map is upsampled to the image size and overlaid (e.g., as a heatmap).
  • Colors represent different heads, showing diverse focus regions.
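The upsampling step can be done with bilinear interpolation. A sketch assuming 28×28 maps and a 224×224 input image (placeholder values, not real attention weights):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
maps = torch.rand(1, 6, 28, 28)            # [B, H, 28, 28] per-head maps

# Upsample each head's map to the input resolution for overlaying
heat = F.interpolate(maps, size=(224, 224), mode="bilinear",
                     align_corners=False)  # [1, 6, 224, 224]

print(heat.shape)  # torch.Size([1, 6, 224, 224])
```

Each `heat[0, h]` can then be normalized to [0, 1] and blended over the image in head \( h \)'s assigned color.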

This is how "different heads, materialized by different colors" are realized mathematically and visually.




Henri Wang | Sciencx (2025-06-27T03:06:04+00:00) what is the mathematical realization of attention maps from multiple heads?. Retrieved from https://www.scien.cx/2025/06/27/what-is-the-mathematical-realization-of-attention-maps-from-multiple-heads/
