Multi-Head Attention

Transformers in Deep Learning - Part 6
Sibling notes: Self Attention in Transformers - The Deep Dive · Scaled Dot-Product Attention - The Deep Dive

Self-attention lets a token build a new representation by looking at other tokens in the same sentence.

Multi-head attention asks a stronger question:

Why look at the sentence from only one perspective?

A single attention head can learn one pattern of relationships. But language often contains many relationships at the same time:

subject and verb
object and location
possession
time
cause and effect
long-range dependency
ambiguity between multiple meanings

Multi-head attention gives the model several attention heads in parallel, so different heads can learn different relationship patterns.

1. Why One Attention Head Is Not Enough

Self-attention solves an important problem:

The same word can mean different things in different contexts.

For example:

| Sentence | Meaning of "Apple" |
| --- | --- |
| I love Apple phones. | Apple as a technology company |
| I love apple juice. | apple as a fruit |

A single self-attention head can shift the representation of Apple toward the correct meaning by attending to the surrounding context (phones versus juice).

But language is more complex than one contextual shift.

A sentence can contain several relationships at once.

2. Ambiguity Requires Multiple Attention Patterns

Consider:

She saw the man with the telescope.

This sentence has at least two possible interpretations.

Interpretation 1

The man has the telescope.

She saw [the man with the telescope].

Here, the important dependency is:

$$\text{man} \leftrightarrow \text{telescope}$$

The phrase "with the telescope" describes the man.

Interpretation 2

She used the telescope to see the man.

She saw the man [using the telescope].

Here, the important dependency is:

$$\text{she} \leftrightarrow \text{telescope}$$

The telescope is the instrument used by the subject.

flowchart TD
    S["She saw the man with the telescope"] --> I1["Interpretation 1"]
    S --> I2["Interpretation 2"]

    I1 --> R1["man - telescope"]
    I2 --> R2["she - telescope"]

One sentence can support more than one valid attention pattern.

If the model has only one attention head, it has only one attention map at that layer. That may not be enough to represent several useful interpretations at the same time.

3. Multiple Relationships in One Sentence

Consider:

The keys on the table belong to Sara.

To understand keys, the model should capture more than one relationship:

| Relationship | Meaning |
| --- | --- |
| keys -> table | spatial relationship: where the keys are |
| keys -> Sara | possession: who owns the keys |

A single attention head can assign attention weights across all tokens, but it may struggle to strongly represent many different types of relationships at once.

One head might focus on location. Another might focus on ownership. Another might focus on grammar.

That is the central idea:

Multiple heads allow multiple attention patterns to exist in parallel.

4. The Filter Analogy

In a convolutional neural network, one image is processed by many filters.

Different filters may detect different visual features:

| Filter | Possible feature |
| --- | --- |
| filter 1 | vertical edges |
| filter 2 | horizontal edges |
| filter 3 | curves |
| filter 4 | color patterns |

Each filter gives a different view of the same image.

Multi-head attention works similarly for language:

| Attention head | Possible language feature |
| --- | --- |
| head 1 | subject-verb relationship |
| head 2 | object-location relationship |
| head 3 | possession |
| head 4 | long-range reference |
| head 5 | alternative interpretation |

The input sentence is the same, but each head learns a different way to inspect it.

flowchart LR
    X["same input sentence"] --> H1["head 1: grammar"]
    X --> H2["head 2: location"]
    X --> H3["head 3: ownership"]
    X --> H4["head 4: long-range link"]

    H1 --> C["combined representation"]
    H2 --> C
    H3 --> C
    H4 --> C

The heads are not manually assigned these jobs. They learn useful patterns during training.

5. Single-Head Attention Recap

For one attention head:

$$Q = XW^Q$$

$$K = XW^K$$

$$V = XW^V$$

Then:

$$\text{head} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This produces a contextual representation for every token.

For the sentence:

love Apple phones

the output contains:

a new representation for love
a new representation for Apple
a new representation for phones

One head gives one contextual view of the sentence.
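The recap above can be traced numerically. What follows is a minimal sketch in plain Python, using lists instead of an array library so that it runs without extra dependencies. The toy dimensions, the random weights, and the helper names (`matmul`, `softmax_rows`, `attention_head`, `random_matrix`) are illustrative assumptions, not values from a trained model.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(scores):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(X, W_q, W_k, W_v):
    """Return (weights, output) for one scaled dot-product attention head."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    weights = softmax_rows(scores)
    return weights, matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Random stand-in for embeddings or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
tokens = ["love", "Apple", "phones"]
d_model, d_k = 4, 2                      # toy sizes, chosen for readability
X = random_matrix(len(tokens), d_model, rng)
W_q = random_matrix(d_model, d_k, rng)
W_k = random_matrix(d_model, d_k, rng)
W_v = random_matrix(d_model, d_k, rng)

weights, head = attention_head(X, W_q, W_k, W_v)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])   # one row of attention weights per token
```

Each row of `weights` sums to 1 and says how much that token draws from every token in the sentence; `head` holds one new vector per token.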

6. Multi-Head Attention: The Main Idea

Multi-head attention does not change the attention mechanism itself.

It repeats the same attention operation several times in parallel, each time with its own learned matrices:

$$W_1^Q, W_1^K, W_1^V$$

$$W_2^Q, W_2^K, W_2^V$$

$$\dots$$

$$W_h^Q, W_h^K, W_h^V$$

Each head gets the same input $X$, but because each head has different learned matrices, each head produces different $Q$, $K$, and $V$.

For head $i$:

$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

So with $h$ heads:

$$\text{head}_1, \text{head}_2, \dots, \text{head}_h$$

Each head gives a different contextual version of the same tokens.
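To make "same input, different matrices" concrete, the sketch below feeds one shared `X` to two heads and prints each head's attention map. It is an illustration under assumptions: the weights are random stand-ins for learned matrices, the sizes are toy values, and the helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(X, W_q, W_k):
    """Attention weights softmax(Q K^T / sqrt(d_k)) for one head."""
    Q, K = matmul(X, W_q), matmul(X, W_k)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    return [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_t)]

def random_matrix(rows, cols, rng):
    """Random stand-in for embeddings or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(1)
n_tokens, d_model, d_k, h = 3, 4, 2, 2     # toy sizes, not the walkthrough's 512 and 8
X = random_matrix(n_tokens, d_model, rng)  # the single shared input

attention_maps = []
for i in range(h):
    W_q = random_matrix(d_model, d_k, rng)  # stands in for W_i^Q
    W_k = random_matrix(d_model, d_k, rng)  # stands in for W_i^K
    attention_maps.append(attention_map(X, W_q, W_k))

for i, amap in enumerate(attention_maps, start=1):
    print(f"head {i}:", [[round(w, 2) for w in row] for row in amap])
```

The two printed maps differ even though both heads read the same `X`, because the scores depend on each head's own query and key projections.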

7. What Each Head Produces

Suppose the model has 8 attention heads.

For the word Apple, the model does not produce one intermediate contextual vector. It produces 8 intermediate contextual vectors:

$$\text{Apple}_1'$$

$$\text{Apple}_2'$$

$$\dots$$

$$\text{Apple}_8'$$

Each one may emphasize a different relationship.

For a longer sentence about Jake, an Apple iPhone, a shelf, and an old phone, possible head behaviors might look like this:

| Head | Possible focus |
| --- | --- |
| head 1 | Apple and iPhone form one product phrase |
| head 2 | the iPhone is on a shelf |
| head 3 | Jake is holding the iPhone |
| head 4 | the iPhone replaces an old phone |

These examples are intuitive, not guaranteed assignments. In real trained models, attention heads can specialize in messy, overlapping, and sometimes surprising ways.

8. Concatenating the Heads

After all heads run, their outputs are concatenated.

If there are 8 heads, each token now has 8 head outputs side by side:

$$[\text{head}_1;\text{head}_2;\dots;\text{head}_8]$$

For Apple, this means:

$$[\text{Apple}_1'; \text{Apple}_2'; \dots; \text{Apple}_8']$$

This long vector contains multiple contextual views of Apple.

flowchart LR
    H1["head 1 output"] --> CAT["concatenate"]
    H2["head 2 output"] --> CAT
    H3["head 3 output"] --> CAT
    H8["head 8 output"] --> CAT
    CAT --> Z["combined multi-head vector"]

Concatenation collects the perspectives. It does not yet decide how to blend them into one final representation.
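Concatenation is only bookkeeping: for each token, the head outputs are laid side by side along the feature axis. A small sketch, assuming head outputs stored as lists of rows; the function name `concat_heads` and the numbers are made up for illustration.

```python
def concat_heads(heads):
    """Join per-head outputs token by token along the feature axis.

    heads: list of h matrices, each n_tokens x d_v (lists of rows).
    Returns one n_tokens x (h * d_v) matrix.
    """
    return [[value for head_row in token_rows for value in head_row]
            for token_rows in zip(*heads)]

# Hypothetical outputs of 2 heads for 3 tokens, with d_v = 2.
head_1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
head_2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

concatenated = concat_heads([head_1, head_2])
print(concatenated[1])   # second token: head 1 values first, then head 2 values
```

The number of rows (tokens) is unchanged; only the width grows from `d_v` to `h * d_v`.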

9. The Output Projection

After concatenation, the Transformer applies one more learned linear transformation:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O$$

$W^O$ is the output projection matrix.

Its job is to combine the information from all heads into one unified representation.

Why is this needed?

Because the concatenated vector is like a pile of features. Some head outputs may be very useful for the current token. Some may be less useful. Some may need to be mixed together.

The output projection learns how to combine them.

flowchart LR
    CAT["concatenated head outputs"] --> WO["output projection W^O"]
    WO --> OUT["final contextual representation"]

The projection can:

keep useful features
reduce irrelevant features
mix signals from different heads
return the representation to the model's expected dimension
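The roles listed above can be seen in a toy projection. In the sketch below, `W_o` is written by hand purely for illustration (in a real model it is learned), and each of its columns plays one role: keep, mix, damp, or contrast.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# One token's concatenated vector: head 1 contributed [1, 2], head 2 contributed [3, 4].
concatenated = [[1.0, 2.0, 3.0, 4.0]]

# Hand-written W^O (rows: concatenated features, columns: output features).
W_o = [
    [1.0, 0.0, 0.00, -1.0],   # head 1, feature 0
    [0.0, 0.5, 0.00,  0.0],   # head 1, feature 1
    [0.0, 0.5, 0.00,  1.0],   # head 2, feature 0
    [0.0, 0.0, 0.25,  0.0],   # head 2, feature 1
]

output = matmul(concatenated, W_o)
print(output)   # [[1.0, 2.5, 1.0, 2.0]]
```

Output column 0 copies a single head-1 feature, column 1 averages one feature from each head, column 2 scales a head-2 feature down, and column 3 takes a difference across heads. A learned $W^O$ settles on combinations like these, though rarely such tidy ones.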

10. Dimension Walkthrough

Use the original Transformer-style dimensions:

$$d_{\text{model}} = 512$$

$$h = 8$$

$$d_k = d_v = 64$$

Suppose the input sentence has 3 tokens:

$$X \in \mathbb{R}^{3 \times 512}$$

For each head:

$$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{512 \times 64}$$

So:

$$Q_i, K_i, V_i \in \mathbb{R}^{3 \times 64}$$

Each head outputs:

$$\text{head}_i \in \mathbb{R}^{3 \times 64}$$

With 8 heads, concatenation gives:

$$\text{Concat}(\text{head}_1,\dots,\text{head}_8) \in \mathbb{R}^{3 \times 512}$$

because:

$$8 \times 64 = 512$$

Then the output projection uses:

$$W^O \in \mathbb{R}^{512 \times 512}$$

So the final output is:

$$\text{MultiHead}(X) \in \mathbb{R}^{3 \times 512}$$

The model starts with a $3 \times 512$ input and ends with a $3 \times 512$ output.

But the output is no longer a static embedding. It is a contextual representation built from multiple attention perspectives.

11. Shape Summary

| Object | Shape |
| --- | --- |
| Input $X$ | $3 \times 512$ |
| One head's $W_i^Q$ | $512 \times 64$ |
| One head's $Q_i$ | $3 \times 64$ |
| One head's $K_i$ | $3 \times 64$ |
| One head's $V_i$ | $3 \times 64$ |
| One head output | $3 \times 64$ |
| 8 concatenated heads | $3 \times 512$ |
| Output projection $W^O$ | $512 \times 512$ |
| Final multi-head output | $3 \times 512$ |
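The shapes in this table can be re-derived mechanically from the rule that an $n \times m$ matrix times an $m \times p$ matrix gives an $n \times p$ matrix. The sketch below tracks shapes only, without allocating any matrices; `matmul_shape` is a hypothetical helper written for this note.

```python
def matmul_shape(a, b):
    """Shape of A @ B, raising if the inner dimensions disagree."""
    (rows, inner_a), (inner_b, cols) = a, b
    if inner_a != inner_b:
        raise ValueError(f"cannot multiply {a} by {b}")
    return (rows, cols)

n_tokens, d_model, h = 3, 512, 8
d_k = d_v = d_model // h                          # 64

X = (n_tokens, d_model)
Q_i = matmul_shape(X, (d_model, d_k))             # X @ W_i^Q
K_i = matmul_shape(X, (d_model, d_k))             # X @ W_i^K
V_i = matmul_shape(X, (d_model, d_v))             # X @ W_i^V
scores = matmul_shape(Q_i, K_i[::-1])             # Q_i @ K_i^T, one score per token pair
head_i = matmul_shape(scores, V_i)                # softmax leaves the score shape unchanged
concat = (n_tokens, h * d_v)                      # 8 heads side by side
final = matmul_shape(concat, (h * d_v, d_model))  # Concat @ W^O

for name, shape in [("Q_i", Q_i), ("scores", scores), ("head_i", head_i),
                    ("concat", concat), ("final", final)]:
    print(f"{name:7s} {shape}")
```

The one shape the table does not list is the $3 \times 3$ score matrix inside each head: it has one entry per (query token, key token) pair, independent of $d_k$.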

12. Why Not Just Make One Huge Head?

A natural question:

Why use many smaller heads instead of one large head?

One large head still produces one attention pattern.

Multiple heads produce multiple attention patterns.

That distinction matters.

| Design | Attention patterns |
| --- | --- |
| one large head | one score matrix |
| many heads | many score matrices |

Each head has its own:

$$Q_iK_i^T$$

So each head can choose a different set of token-to-token relationships.

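The benefit does not come from extra parameters: with $d_k = d_{\text{model}}/h$, the heads together hold about as many projection weights as one full-width head. The benefit is parallel relational views.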
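A quick count backs up the point about parameters. The sketch below assumes the walkthrough's dimensions ($d_{\text{model}} = 512$, 8 heads of width 64) and compares them with a hypothetical single head of width 512; biases and $W^O$ are left out, and `qkv_parameter_count` is an illustrative helper.

```python
def qkv_parameter_count(d_model, num_heads, d_head):
    """Weights in the Q, K, and V projections of all heads (biases ignored)."""
    return num_heads * 3 * d_model * d_head

d_model = 512

one_large_head = qkv_parameter_count(d_model, num_heads=1, d_head=512)
eight_small_heads = qkv_parameter_count(d_model, num_heads=8, d_head=64)

print(one_large_head, eight_small_heads)   # 786432 786432
print("score matrices per layer:", 1, "vs", 8)
```

Both designs spend the same number of projection weights; what changes is that the eight-head version computes eight separate score matrices instead of one.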

13. Common Confusions

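Note

Heads are not identical copies, and they do not split the input. Each head receives the same input $X$, but uses its own learned projection matrices.

Note

Heads are not assigned fixed roles such as grammar or location. Those are intuitive examples. The heads learn their own patterns during training.

Note

Concatenation is not the final output. The concatenated heads are passed through the output projection $W^O$ to create the final representation.

Note

Multi-head attention does not change the number of tokens. If the input has 3 tokens, the output still has 3 token representations. Concatenation widens the feature dimension to $h \cdot d_v$, and $W^O$ maps it back to $d_{\text{model}}$.

Note

More heads are not automatically better. If $d_{\text{model}}$ stays fixed, each added head shrinks the per-head dimension $d_k = d_{\text{model}}/h$ and adds another attention map to compute and store. The number of heads is a design choice.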
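The trade-off in the last note can be tabulated. This sketch assumes the common convention $d_k = d_{\text{model}}/h$ with $d_{\text{model}} = 512$ and the 3-token example sentence; the head counts other than 8 are hypothetical choices for comparison.

```python
d_model, n_tokens = 512, 3   # d_model from the walkthrough, 3 tokens as in the example

rows = []
for h in (1, 2, 4, 8, 16):
    d_k = d_model // h                     # per-head width when d_model is fixed
    attention_weights = h * n_tokens ** 2  # entries across all h attention maps
    rows.append((h, d_k, attention_weights))
    print(f"h={h:2d}  d_k={d_k:3d}  attention weights={attention_weights}")
```

Doubling the head count halves the width each head has for comparing tokens and doubles the number of attention weights the layer produces.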

14. The Complete Formula

The full multi-head attention formula is:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O$$

where:

$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

For self-attention, the query, key, and value projections all come from the same input sequence $X$.

Using the scaled dot-product formula:

$$\text{head}_i = \text{softmax}\left(\frac{(XW_i^Q)(XW_i^K)^T}{\sqrt{d_k}}\right)(XW_i^V)$$

Then:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O$$
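Put together, the formula fits in a few lines. The following is a sketch under stated assumptions rather than a reference implementation: plain Python lists, random weights in place of learned ones, toy sizes, and no masking, batching, or biases. The names `attention`, `multi_head`, and `random_matrix` are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    scale = math.sqrt(len(K[0]))
    K_t = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_t)]
    return matmul(weights, V)

def multi_head(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the head outputs token by token, then apply W^O.
    concat = [[v for head_row in rows for v in head_row] for rows in zip(*outputs)]
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    """Random stand-in for embeddings or learned weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(42)
n_tokens, d_model, h = 3, 8, 2          # toy sizes; the walkthrough uses 512 and 8
d_k = d_v = d_model // h

X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(h * d_v, d_model, rng)

out = multi_head(X, heads, W_o)
print(len(out), len(out[0]))   # one d_model-wide row per input token
```

Looping over heads keeps the code close to the formula. Practical implementations usually compute all heads in one batched matrix multiplication, which gives the same result.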

15. One-Line Interview Answer

Multi-head attention runs several self-attention heads in parallel, each with its own learned query, key, and value projections. Each head can learn a different attention pattern over the same input sequence. The head outputs are concatenated and passed through an output projection to produce one unified contextual representation.

16. Final Intuition

A single attention head asks:

Which tokens matter to this token from one learned perspective?

Multi-head attention asks:

Which tokens matter from several learned perspectives at once?

One head may capture grammar. Another may capture location. Another may capture possession. Another may capture a long-range reference.

The final projection then combines these perspectives into one representation that the rest of the Transformer can use.

That is why multi-head attention is more powerful than a single attention head:

It lets the model read the same sentence through multiple relational lenses at the same time.