Multi-Head Attention

Transformers in Deep Learning - Part 6
Sibling notes: Self Attention in Transformers - The Deep Dive · Scaled Dot-Product Attention - The Deep Dive

Self-attention lets a token build a new representation by looking at other tokens in the same sentence.

Multi-head attention asks a stronger question:

Why look at the sentence from only one perspective?

A single attention head can learn one pattern of relationships. But language often contains many relationships at the same time:

  • subject and verb
  • object and location
  • possession
  • time
  • cause and effect
  • long-range dependency
  • ambiguity between multiple meanings

Multi-head attention gives the model several attention heads in parallel, so different heads can learn different relationship patterns.


1. Why One Attention Head Is Not Enough

Self-attention solves an important problem:

The same word can mean different things in different contexts.

For example:

| Sentence | Meaning of "Apple" |
| --- | --- |
| I love Apple phones. | Apple as a technology company |
| I love apple juice. | apple as a fruit |

A single self-attention head can shift the representation of Apple toward the correct meaning by looking at nearby context.

But language is more complex than one contextual shift.

A sentence can contain several relationships at once.


2. Ambiguity Requires Multiple Attention Patterns

Consider:

She saw the man with the telescope.

This sentence has at least two possible interpretations.

Interpretation 1

The man has the telescope.

She saw [the man with the telescope].

Here, the important dependency is:

$$\text{man} \leftrightarrow \text{telescope}$$

The phrase "with the telescope" describes the man.

Interpretation 2

She used the telescope to see the man.

She saw the man [using the telescope].

Here, the important dependency is:

$$\text{she} \leftrightarrow \text{telescope}$$

The telescope is the instrument used by the subject.

flowchart TD
    S["She saw the man with the telescope"] --> I1["Interpretation 1"]
    S --> I2["Interpretation 2"]

    I1 --> R1["man - telescope"]
    I2 --> R2["she - telescope"]

One sentence can support more than one valid attention pattern.

If the model has only one attention head, it has only one attention map at that layer. That may not be enough to represent several useful interpretations at the same time.


3. Multiple Relationships in One Sentence

Consider:

The keys on the table belong to Sara.

To understand keys, the model should capture more than one relationship:

| Relationship | Meaning |
| --- | --- |
| keys -> table | spatial relationship: where the keys are |
| keys -> Sara | possession: who owns the keys |

A single attention head can assign attention weights across all tokens, but each token gets only one weight distribution per head, so a single head may struggle to represent several different types of relationships strongly at once.

One head might focus on location. Another might focus on ownership. Another might focus on grammar.

That is the central idea:

Multiple heads allow multiple attention patterns to exist in parallel.
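The shared budget inside one head can be made concrete. The following is a minimal sketch, assuming made-up raw scores for the query token keys (the numbers are hypothetical and do not come from a trained model). Softmax forces one row of attention weights to sum to 1, so weight spent on table is weight that cannot be spent on Sara.

```python
import numpy as np

tokens = ["The", "keys", "on", "the", "table", "belong", "to", "Sara"]

# Hypothetical raw attention scores for the query token "keys" (illustrative only).
scores = np.array([0.1, 0.5, 0.2, 0.1, 2.0, 0.3, 0.2, 2.0])

# Softmax turns the scores into a single row of attention weights.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

for token, weight in zip(tokens, weights):
    print(f"{token:>7}: {weight:.3f}")

table_weight = weights[tokens.index("table")]
sara_weight = weights[tokens.index("Sara")]
print("total weight:", weights.sum())
```

In this sketch table and Sara tie for the highest score, so they split the budget and neither can exceed one half. A second head would have its own row of weights and therefore its own budget.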


4. The Filter Analogy

In a convolutional neural network, one image is processed by many filters.

Different filters may detect different visual features:

| Filter | Possible feature |
| --- | --- |
| filter 1 | vertical edges |
| filter 2 | horizontal edges |
| filter 3 | curves |
| filter 4 | color patterns |

Each filter gives a different view of the same image.

Multi-head attention works similarly for language:

| Attention head | Possible language feature |
| --- | --- |
| head 1 | subject-verb relationship |
| head 2 | object-location relationship |
| head 3 | possession |
| head 4 | long-range reference |
| head 5 | alternative interpretation |

The input sentence is the same, but each head learns a different way to inspect it.

flowchart LR
    X["same input sentence"] --> H1["head 1: grammar"]
    X --> H2["head 2: location"]
    X --> H3["head 3: ownership"]
    X --> H4["head 4: long-range link"]

    H1 --> C["combined representation"]
    H2 --> C
    H3 --> C
    H4 --> C

The heads are not manually assigned these jobs. They learn useful patterns during training.


5. Single-Head Attention Recap

For one attention head:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Then:

$$\text{head} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This produces a contextual representation for every token.

For the sentence:

love Apple phones

the output contains:

  • a new representation for love
  • a new representation for Apple
  • a new representation for phones

One head gives one contextual view of the sentence.
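The recap above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the toy sizes (`d_model = 8`, `d_k = 4`), the random weights standing in for trained matrices, and the helper names `softmax` and `attention_head` are all assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (tokens, tokens) attention map
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["love", "Apple", "phones"]
d_model, d_k = 8, 4  # toy sizes chosen for illustration

X = rng.normal(size=(len(tokens), d_model))  # stand-in token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

head, weights = attention_head(X, W_Q, W_K, W_V)
print(head.shape)     # (3, 4): one new vector per token
print(weights.shape)  # (3, 3): a single attention map
```

The single `weights` matrix is the "one contextual view": every token has exactly one row of attention weights in this head.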


6. Multi-Head Attention: The Main Idea

Multi-head attention does not change the attention mechanism itself.

It repeats the same attention operation several times in parallel, each time with its own learned matrices:

$$W_1^Q, W_1^K, W_1^V$$

$$W_2^Q, W_2^K, W_2^V$$

$$\dots$$

$$W_h^Q, W_h^K, W_h^V$$

Each head gets the same input $X$, but because each head has different learned matrices, each head produces different $Q$, $K$, and $V$.

For head $i$:

$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

So with $h$ heads:

$$\text{head}_1, \text{head}_2, \dots, \text{head}_h$$

Each head gives a different contextual version of the same tokens.
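A short sketch of this idea, assuming toy sizes and randomly initialised matrices in place of trained ones (the names `head_params` and `attention` are illustrative): the same `X` goes into every head, and only the projection matrices differ.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, h = 3, 8, 4, 2  # toy sizes for illustration
X = rng.normal(size=(n_tokens, d_model))

# One (W_Q, W_K, W_V) triple per head; random values stand in for learned weights.
head_params = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
    for _ in range(h)
]

# Every head reads the same X but projects it with its own matrices.
heads = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in head_params]

for i, head in enumerate(heads, start=1):
    print(f"head_{i} shape: {head.shape}")
print("heads differ:", not np.allclose(heads[0], heads[1]))
```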


7. What Each Head Produces

Suppose the model has 8 attention heads.

For the word Apple, the model does not produce one intermediate contextual vector. It produces 8 intermediate contextual vectors:

$$\text{Apple}_1', \text{Apple}_2', \dots, \text{Apple}_8'$$

Each one may emphasize a different relationship.

For a longer sentence, possible head behaviors might look like this:

| Head | Possible focus |
| --- | --- |
| head 1 | Apple and iPhone form one product phrase |
| head 2 | the iPhone is on a shelf |
| head 3 | Jake is holding the iPhone |
| head 4 | the iPhone replaces an old phone |

These examples are intuitive, not guaranteed assignments. In real trained models, attention heads can specialize in messy, overlapping, and sometimes surprising ways.


8. Concatenating the Heads

After all heads run, their outputs are concatenated.

If there are 8 heads, each token now has 8 head outputs side by side:

$$[\text{head}_1; \text{head}_2; \dots; \text{head}_8]$$

For Apple, this means:

$$[\text{Apple}_1'; \text{Apple}_2'; \dots; \text{Apple}_8']$$

This long vector contains multiple contextual views of Apple.

flowchart LR
    H1["head 1 output"] --> CAT["concatenate"]
    H2["head 2 output"] --> CAT
    H3["head 3 output"] --> CAT
    H8["head 8 output"] --> CAT
    CAT --> Z["combined multi-head vector"]

Concatenation collects the perspectives. It does not yet decide how to blend them into one final representation.
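Concatenation along the feature axis might look like the sketch below. The head outputs are placeholders (head `i` is filled with the value `i`) so that each head's slice is easy to spot. The sizes are toy values chosen for the example.

```python
import numpy as np

n_tokens, d_v, h = 3, 4, 8  # toy sizes: 3 tokens, 8 heads of width 4

# Placeholder head outputs: head i is filled with the value i.
heads = [np.full((n_tokens, d_v), float(i)) for i in range(1, h + 1)]

# Join the heads side by side along the last (feature) axis.
concat = np.concatenate(heads, axis=-1)

print(concat.shape)  # (3, 32)
print(concat[1])     # row for the second token: four 1s, then four 2s, ..., four 8s
```

The number of rows (tokens) is unchanged. Only the feature dimension grows, from `d_v` to `h * d_v`.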


9. The Output Projection

After concatenation, the Transformer applies one more learned linear transformation:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$

$W^O$ is the output projection matrix.

Its job is to combine the information from all heads into one unified representation.

Why is this needed?

Because the concatenated vector is like a pile of features. Some head outputs may be very useful for the current token. Some may be less useful. Some may need to be mixed together.

The output projection learns how to combine them.

flowchart LR
    CAT["concatenated head outputs"] --> WO["output projection W^O"]
    WO --> OUT["final contextual representation"]

The projection can:

  • keep useful features
  • reduce irrelevant features
  • mix signals from different heads
  • return the representation to the model's expected dimension
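The roles listed above can be illustrated with a small sketch. The matrices are random stand-ins for learned weights, and muting a head by zeroing rows of `W_O` is a hand-made illustration of what a projection could do, not something a trained model is guaranteed to do.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, h, d_v = 3, 4, 2
d_model = h * d_v  # 8

concat = rng.normal(size=(n_tokens, h * d_v))  # stand-in for concatenated head outputs
W_O = rng.normal(size=(h * d_v, d_model))      # random stand-in for the learned W^O

# Each output feature is a weighted mix of features from every head.
output = concat @ W_O
print(output.shape)  # (3, 8)

# The first d_v rows of W_O are the only path from head 1 into the output.
# Zeroing them shows how a projection could suppress one head entirely.
W_O_muted = W_O.copy()
W_O_muted[:d_v, :] = 0.0

concat_changed = concat.copy()
concat_changed[:, :d_v] += 10.0  # perturb only head 1's slice

same = np.allclose(concat @ W_O_muted, concat_changed @ W_O_muted)
print(same)  # True: head 1 no longer influences the output
```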

10. Dimension Walkthrough

Use the original Transformer-style dimensions:

$$d_{\text{model}} = 512, \quad h = 8, \quad d_k = d_v = 64$$

Suppose the input sentence has 3 tokens:

$$X \in \mathbb{R}^{3 \times 512}$$

For each head:

$$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{512 \times 64}$$

So:

$$Q_i, K_i, V_i \in \mathbb{R}^{3 \times 64}$$

Each head outputs:

$$\text{head}_i \in \mathbb{R}^{3 \times 64}$$

With 8 heads, concatenation gives:

$$\text{Concat}(\text{head}_1, \dots, \text{head}_8) \in \mathbb{R}^{3 \times 512}$$

because:

$$8 \times 64 = 512$$

Then the output projection uses:

$$W^O \in \mathbb{R}^{512 \times 512}$$

So the final output is:

$$\mathbb{R}^{3 \times 512}$$

The model starts with:

$$3 \times 512$$

and ends with:

$$3 \times 512$$

But the output is no longer a static embedding. It is a contextual representation built from multiple attention perspectives.
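The walkthrough can be checked end to end with a shape-tracing sketch. It uses the dimensions above with random matrices in place of trained weights. The `1 / sqrt(d_model)` scaling of the random weights is an assumption made only to keep the numbers in a reasonable range.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

d_model, h = 512, 8
d_k = d_v = d_model // h  # 64
n_tokens = 3

rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))

heads = []
for _ in range(h):
    # Random stand-ins for W_i^Q, W_i^K, W_i^V, each 512 x 64.
    W_Q, W_K, W_V = (
        rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3)
    )
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    assert Q.shape == K.shape == V.shape == (n_tokens, d_k)  # 3 x 64
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

concat = np.concatenate(heads, axis=-1)                       # 3 x 512
W_O = rng.normal(size=(h * d_v, d_model)) / np.sqrt(d_model)  # 512 x 512
output = concat @ W_O                                         # 3 x 512

print(X.shape, heads[0].shape, concat.shape, output.shape)
# (3, 512) (3, 64) (3, 512) (3, 512)
```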


11. Shape Summary

| Object | Shape |
| --- | --- |
| Input $X$ | $3 \times 512$ |
| One head's $W_i^Q$ | $512 \times 64$ |
| One head's $Q_i$ | $3 \times 64$ |
| One head's $K_i$ | $3 \times 64$ |
| One head's $V_i$ | $3 \times 64$ |
| One head output | $3 \times 64$ |
| 8 concatenated heads | $3 \times 512$ |
| Output projection $W^O$ | $512 \times 512$ |
| Final multi-head output | $3 \times 512$ |

12. Why Not Just Make One Huge Head?

A natural question:

Why use many smaller heads instead of one large head?

One large head still produces one attention pattern.

Multiple heads produce multiple attention patterns.

That distinction matters.

| Design | Attention patterns |
| --- | --- |
| one large head | one score matrix |
| many heads | many score matrices |

Each head has its own:

$$Q_iK_i^T$$

So each head can choose a different set of token-to-token relationships.

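The benefit is not extra parameters: when $d_k = d_v = d_{\text{model}} / h$, the heads together hold as many projection weights as one full-width head would. The benefit is parallel relational views.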
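A quick count supports this. It is a sketch that assumes $d_k = d_v = d_{\text{model}} / h$ and ignores bias terms. The helper name `attention_params` is illustrative.

```python
d_model = 512

def attention_params(num_heads):
    """Weight count for Q, K, V projections plus W^O, ignoring biases."""
    d_head = d_model // num_heads
    qkv = num_heads * 3 * d_model * d_head  # W_i^Q, W_i^K, W_i^V for every head
    out = (num_heads * d_head) * d_model    # output projection W^O
    return qkv + out

one_big = attention_params(1)
eight_small = attention_params(8)

print(one_big, eight_small)  # 1048576 1048576
print("score matrices per layer:", 1, "vs", 8)
```

Under these assumptions the two designs have the same number of weights. What changes is how many separate score matrices the layer computes.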


13. Common Confusions

Note

Heads do not receive different inputs. Each head receives the same input, but uses its own learned projection matrices.

Note

Heads are not assigned fixed jobs such as grammar or location. These are intuitive examples. The heads learn their own patterns during training.

Note

Concatenation is not the final output. The concatenated heads are passed through the output projection $W^O$ to create the final representation.

Note

Multi-head attention does not change the number of tokens. If the input has 3 tokens, the output still has 3 token representations. The feature dimension is managed through projection.

Note

More heads are not automatically better. If $d_{\text{model}}$ stays fixed, more heads mean a smaller dimension per head, and every extra head adds another attention map to compute and store. The number of heads is a design choice.
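The head-count trade-off can be tabulated. This small sketch assumes the common convention $d_k = d_{\text{model}} / h$, the same one behind the $512 / 8 = 64$ figures in the dimension walkthrough.

```python
d_model = 512

# Per-head dimension for several head counts, assuming d_k = d_model / h.
per_head_dim = {h: d_model // h for h in (1, 2, 4, 8, 16, 32, 64)}

for h, d_k in per_head_dim.items():
    assert h * d_k == d_model  # the heads must tile the model dimension exactly
    print(f"h = {h:>2}  ->  d_k = {d_k}")
```

At 64 heads each head would work with only 8 dimensions, which leaves little room for its queries and keys to encode a relationship.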


14. The Complete Formula

The full multi-head attention formula is:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$

where:

$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

For self-attention, the query, key, and value projections all come from the same input sequence $X$.

Using the scaled dot-product formula:

$$\text{head}_i = \text{softmax}\left(\frac{(XW_i^Q)(XW_i^K)^T}{\sqrt{d_k}}\right)(XW_i^V)$$

Then:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$
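The complete formula can be written as one function. The version below is a sketch under stated assumptions: it stacks the per-head matrices along a leading head axis and computes all heads at once with `np.einsum`, which is one possible layout rather than the only one. It then checks the result against a plain per-head loop. The function name, toy sizes, and random weights are all illustrative.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d_model); W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v);
    W_O: (h * d_v, d_model). Returns an (n, d_model) array."""
    Q = np.einsum("nd,hdk->hnk", X, W_Q)  # (h, n, d_k)
    K = np.einsum("nd,hdk->hnk", X, W_K)  # (h, n, d_k)
    V = np.einsum("nd,hdv->hnv", X, W_V)  # (h, n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n): one map per head
    heads = softmax(scores) @ V                       # (h, n, d_v)
    # Concat(head_1, ..., head_h): place the heads side by side per token.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)  # (n, h * d_v)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 3, 16, 4  # toy sizes for illustration
d_k = d_v = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_v, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)

# Reference: apply the per-head formula one head at a time, then concatenate.
loop_heads = [
    softmax((X @ W_Q[i]) @ (X @ W_K[i]).T / np.sqrt(d_k)) @ (X @ W_V[i])
    for i in range(h)
]
reference = np.concatenate(loop_heads, axis=-1) @ W_O

print(out.shape, np.allclose(out, reference))  # (3, 16) True
```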

15. One-Line Interview Answer

Multi-head attention runs several self-attention heads in parallel, each with its own learned query, key, and value projections. Each head can learn a different attention pattern over the same input sequence. The head outputs are concatenated and passed through an output projection to produce one unified contextual representation.


16. Final Intuition

A single attention head asks:

Which tokens matter to this token from one learned perspective?

Multi-head attention asks:

Which tokens matter from several learned perspectives at once?

One head may capture grammar. Another may capture location. Another may capture possession. Another may capture a long-range reference.

The final projection then combines these perspectives into one representation that the rest of the Transformer can use.

That is why multi-head attention is more powerful than a single attention head:

It lets the model read the same sentence through multiple relational lenses at the same time.