Multi-Head Attention
Transformers in Deep Learning - Part 6
Sibling notes: Self Attention in Transformers - The Deep Dive · Scaled Dot-Product Attention - The Deep Dive
Self-attention lets a token build a new representation by looking at other tokens in the same sentence.
Multi-head attention asks a stronger question:
Why look at the sentence from only one perspective?
A single attention head can learn one pattern of relationships. But language often contains many relationships at the same time:
- subject and verb
- object and location
- possession
- time
- cause and effect
- long-range dependency
- ambiguity between multiple meanings
Multi-head attention gives the model several attention heads in parallel, so different heads can learn different relationship patterns.
1. Why One Attention Head Is Not Enough
Self-attention solves an important problem:
The same word can mean different things in different contexts.
For example:
| Sentence | Meaning of "Apple" |
|---|---|
| I love Apple phones. | Apple as a technology company |
| I love apple juice. | apple as a fruit |
A single self-attention head can shift the representation of Apple toward the correct meaning by looking at nearby context.
But language is more complex than one contextual shift.
A sentence can contain several relationships at once.
2. Ambiguity Requires Multiple Attention Patterns
Consider:
She saw the man with the telescope.
This sentence has at least two possible interpretations.
Interpretation 1
The man has the telescope.
She saw [the man with the telescope].
Here, the important dependency is:
The phrase "with the telescope" describes the man.
Interpretation 2
She used the telescope to see the man.
She saw the man [using the telescope].
Here, the important dependency is:
The telescope is the instrument used by the subject.
```mermaid
flowchart TD
    S["She saw the man with the telescope"] --> I1["Interpretation 1"]
    S --> I2["Interpretation 2"]
    I1 --> R1["man - telescope"]
    I2 --> R2["she - telescope"]
```
One sentence can support more than one valid attention pattern.
If the model has only one attention head, it has only one attention map at that layer. That may not be enough to represent several useful interpretations at the same time.
3. Multiple Relationships in One Sentence
Consider:
The keys on the table belong to Sara.
To understand keys, the model should capture more than one relationship:
| Relationship | Meaning |
|---|---|
| keys -> table | spatial relationship: where the keys are |
| keys -> Sara | possession: who owns the keys |
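A single attention head can assign attention weights across all tokens, but it gives each token only one softmax distribution, and those weights must sum to 1. It may therefore struggle to represent several different types of relationships strongly at the same time.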
One head might focus on location. Another might focus on ownership. Another might focus on grammar.
That is the central idea:
Multiple heads allow multiple attention patterns to exist in parallel.
4. The Filter Analogy
In a convolutional neural network, one image is processed by many filters.
Different filters may detect different visual features:
| Filter | Possible feature |
|---|---|
| filter 1 | vertical edges |
| filter 2 | horizontal edges |
| filter 3 | curves |
| filter 4 | color patterns |
Each filter gives a different view of the same image.
Multi-head attention works similarly for language:
| Attention head | Possible language feature |
|---|---|
| head 1 | subject-verb relationship |
| head 2 | object-location relationship |
| head 3 | possession |
| head 4 | long-range reference |
| head 5 | alternative interpretation |
The input sentence is the same, but each head learns a different way to inspect it.
```mermaid
flowchart LR
    X["same input sentence"] --> H1["head 1: grammar"]
    X --> H2["head 2: location"]
    X --> H3["head 3: ownership"]
    X --> H4["head 4: long-range link"]
    H1 --> C["combined representation"]
    H2 --> C
    H3 --> C
    H4 --> C
```
The heads are not manually assigned these jobs. They learn useful patterns during training.
5. Single-Head Attention Recap
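For one attention head, the input $X$ is projected into queries, keys, and values:
$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$
Then:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$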
This produces a contextual representation for every token.
For the sentence:
love Apple phones
the output contains:
- a new representation for love
- a new representation for Apple
- a new representation for phones
One head gives one contextual view of the sentence.
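As a minimal sketch of this recap, the NumPy snippet below runs one attention head over three token vectors. The random embeddings, the toy sizes (`d_model = 8`, `d_k = 4`), and the helper names `softmax` and `single_head_attention` are illustrative assumptions, not values from a trained model.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    # Project the same input into queries, keys, and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # One (tokens x tokens) attention map for this head.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["love", "Apple", "phones"]
d_model, d_k = 8, 4  # toy sizes chosen for readability (assumption)

X = rng.normal(size=(len(tokens), d_model))  # stand-in embeddings
# Random stand-ins for learned matrices, scaled so the scores stay moderate.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

output, weights = single_head_attention(X, W_q, W_k, W_v)
print(weights.shape)  # (3, 3)
print(output.shape)   # (3, 4)
```

Each row of `weights` sums to 1, so a single head gives every token exactly one distribution over the sentence.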
6. Multi-Head Attention: The Main Idea
Multi-head attention does not change the attention mechanism itself.
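It repeats the same attention operation several times in parallel, each time with its own learned matrices:
$$
W_i^Q, \quad W_i^K, \quad W_i^V
$$
Each head gets the same input $X$, but because each head has different learned matrices, each head produces different $Q_i$, $K_i$, and $V_i$.
For head $i$:
$$
Q_i = XW_i^Q, \quad K_i = XW_i^K, \quad V_i = XW_i^V
$$
$$
\text{head}_i = \text{Attention}(Q_i, K_i, V_i)
$$
So with $h$ heads:
$$
\text{head}_1, \text{head}_2, \dots, \text{head}_h
$$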
Each head gives a different contextual version of the same tokens.
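The idea can be sketched by running the same attention routine twice on the same input, with a separate set of projection matrices per head. The matrices here are random stand-ins for learned weights, and the sizes and the helper name `attention_head` are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(1)
n_tokens, d_model, d_k, n_heads = 3, 8, 4, 2  # toy sizes (assumption)
X = rng.normal(size=(n_tokens, d_model))      # every head sees this same input

head_outputs, head_weights = [], []
for _ in range(n_heads):
    # Each head draws its own projection matrices.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
    out, w = attention_head(X, W_q, W_k, W_v)
    head_outputs.append(out)
    head_weights.append(w)

# Same tokens, different attention maps.
maps_differ = not np.allclose(head_weights[0], head_weights[1])
print(maps_differ)
```

Nothing about the attention computation changes between heads; only the projection matrices do.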
7. What Each Head Produces
Suppose the model has 8 attention heads.
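For the word Apple, the model does not produce one intermediate contextual vector. It produces 8 intermediate contextual vectors, one per head:
$$
z_{\text{Apple}}^{(1)}, \; z_{\text{Apple}}^{(2)}, \; \dots, \; z_{\text{Apple}}^{(8)}
$$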
Each one may emphasize a different relationship.
For a longer sentence, possible head behaviors might look like this:
| Head | Possible focus |
|---|---|
| head 1 | Apple and iPhone form one product phrase |
| head 2 | the iPhone is on a shelf |
| head 3 | Jake is holding the iPhone |
| head 4 | the iPhone replaces an old phone |
These examples are intuitive, not guaranteed assignments. In real trained models, attention heads can specialize in messy, overlapping, and sometimes surprising ways.
8. Concatenating the Heads
After all heads run, their outputs are concatenated.
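If there are 8 heads, each token now has 8 head outputs side by side:
$$
\text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_8)
$$
For Apple, this means:
$$
z_{\text{Apple}} = \left[\, z_{\text{Apple}}^{(1)} \;;\; z_{\text{Apple}}^{(2)} \;;\; \dots \;;\; z_{\text{Apple}}^{(8)} \,\right]
$$
where $z_{\text{Apple}}^{(i)}$ is the output of head $i$ for Apple.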
This long vector contains multiple contextual views of Apple.
```mermaid
flowchart LR
    H1["head 1 output"] --> CAT["concatenate"]
    H2["head 2 output"] --> CAT
    H3["head 3 output"] --> CAT
    H8["head 8 output"] --> CAT
    CAT --> Z["combined multi-head vector"]
```
Concatenation collects the perspectives. It does not yet decide how to blend them into one final representation.
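A small sketch of the concatenation step is shown below. To make the layout visible, each stand-in head output is filled with its own head number rather than with real attention results; the sizes are toy assumptions.

```python
import numpy as np

n_tokens, d_v, n_heads = 3, 4, 8  # toy sizes (assumption)

# Stand-in head outputs: head i is filled with the value i so its slice is easy to spot.
head_outputs = [np.full((n_tokens, d_v), float(i)) for i in range(1, n_heads + 1)]

# Join the heads along the feature axis; the token axis is untouched.
concatenated = np.concatenate(head_outputs, axis=-1)
print(concatenated.shape)  # (3, 32)

apple = concatenated[1]  # row for the second token
print(apple[:d_v])       # [1. 1. 1. 1.]  (slice owned by head 1)
print(apple[-d_v:])      # [8. 8. 8. 8.]  (slice owned by head 8)
```

Each head still owns a separate slice of the long vector, which is why a further mixing step is needed.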
9. The Output Projection
After concatenation, the Transformer applies one more learned linear transformation:
$$
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O
$$
$W^O$ is the output projection matrix.
Its job is to combine the information from all heads into one unified representation.
Why is this needed?
Because the concatenated vector is like a pile of features. Some head outputs may be very useful for the current token. Some may be less useful. Some may need to be mixed together.
The output projection learns how to combine them.
```mermaid
flowchart LR
    CAT["concatenated head outputs"] --> WO["output projection W^O"]
    WO --> OUT["final contextual representation"]
```
The projection can:
- keep useful features
- reduce irrelevant features
- mix signals from different heads
- return the representation to the model's expected dimension
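One way to see how the projection mixes heads is to note that multiplying the concatenated vector by the output matrix equals summing each head's output times its own block of rows of that matrix. The sketch below checks this numerically with random stand-in values; the sizes and the name `W_o` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, n_heads, d_v = 3, 4, 2  # toy sizes (assumption)
d_model = n_heads * d_v

head_outputs = [rng.normal(size=(n_tokens, d_v)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_v, d_model))  # stand-in output projection

# Standard form: concatenate, then project.
projected = np.concatenate(head_outputs, axis=-1) @ W_o

# Equivalent form: each head multiplies its own block of rows of W_o,
# and the per-head contributions are added together.
blockwise = sum(
    head @ W_o[i * d_v:(i + 1) * d_v]
    for i, head in enumerate(head_outputs)
)

print(np.allclose(projected, blockwise))  # True
print(projected.shape)                    # (3, 8)
```

Viewed this way, the final representation is a sum of per-head contributions, so the learned rows of the projection decide how strongly each head's features appear in the result.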
10. Dimension Walkthrough
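Use the original Transformer-style dimensions:
$$
d_{\text{model}} = 512, \quad h = 8, \quad d_k = d_v = \frac{d_{\text{model}}}{h} = 64
$$
Suppose the input sentence has 3 tokens:
$$
X \in \mathbb{R}^{3 \times 512}
$$
For each head:
$$
W_i^Q, \; W_i^K, \; W_i^V \in \mathbb{R}^{512 \times 64}
$$
So:
$$
Q_i, \; K_i, \; V_i \in \mathbb{R}^{3 \times 64}
$$
Each head outputs:
$$
\text{head}_i \in \mathbb{R}^{3 \times 64}
$$
With 8 heads, concatenation gives:
$$
\text{Concat}(\text{head}_1, \dots, \text{head}_8) \in \mathbb{R}^{3 \times 512}
$$
because:
$$
8 \times 64 = 512
$$
Then the output projection uses:
$$
W^O \in \mathbb{R}^{512 \times 512}
$$
So the final output is:
$$
\text{MultiHead}(X) \in \mathbb{R}^{3 \times 512}
$$
The model starts with:
$$
X \in \mathbb{R}^{3 \times 512}
$$
and ends with:
$$
\text{MultiHead}(X) \in \mathbb{R}^{3 \times 512}
$$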
But the output is no longer a static embedding. It is a contextual representation built from multiple attention perspectives.
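The walkthrough can be traced in code by following the array shapes. This sketch assumes the dimensions of the original Transformer (model width 512, 8 heads, 64 dimensions per head) and uses random matrices in place of learned weights, so only the shapes are meaningful.

```python
import numpy as np

d_model, n_heads, n_tokens = 512, 8, 3
d_k = d_model // n_heads  # 64 dimensions per head

rng = np.random.default_rng(3)
X = rng.normal(size=(n_tokens, d_model))  # stand-in input embeddings

heads = []
for _ in range(n_heads):
    # Random stand-ins for this head's projection matrices, each (512, 64).
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each (3, 64)
    scores = Q @ K.T / np.sqrt(d_k)           # (3, 3)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads.append(weights @ V)                 # (3, 64)

concat = np.concatenate(heads, axis=-1)       # (3, 8 * 64) = (3, 512)
W_o = rng.normal(size=(n_heads * d_k, d_model)) / np.sqrt(d_model)
output = concat @ W_o                         # (3, 512)

print(X.shape, heads[0].shape, concat.shape, output.shape)
# (3, 512) (3, 64) (3, 512) (3, 512)
```

The input and output shapes match, which is what lets attention layers be stacked one after another.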
11. Shape Summary
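| Object | Shape |
|---|---|
| Input $X$ | $3 \times 512$ |
| One head's $W_i^Q$, $W_i^K$, $W_i^V$ | $512 \times 64$ each |
| One head's $Q_i$ | $3 \times 64$ |
| One head's $K_i$ | $3 \times 64$ |
| One head's $V_i$ | $3 \times 64$ |
| One head output $\text{head}_i$ | $3 \times 64$ |
| 8 concatenated heads | $3 \times 512$ |
| Output projection $W^O$ | $512 \times 512$ |
| Final multi-head output | $3 \times 512$ |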
12. Why Not Just Make One Huge Head?
A natural question:
Why use many smaller heads instead of one large head?
One large head still produces one attention pattern.
Multiple heads produce multiple attention patterns.
That distinction matters.
| Design | Attention patterns |
|---|---|
| one large head | one score matrix |
| many heads | many score matrices |
Each head has its own:
$$
W_i^Q, \quad W_i^K, \quad W_i^V
$$
So each head can choose a different set of token-to-token relationships.
The benefit is not just more parameters. The benefit is parallel relational views.
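A quick count supports the point about parameters. The sketch below assumes the common convention that each head works in `d_model / n_heads` dimensions and ignores bias terms; the function name `qkv_parameter_count` is hypothetical.

```python
def qkv_parameter_count(d_model, n_heads):
    # Assumption: heads split the model width evenly, and biases are ignored.
    d_head = d_model // n_heads
    per_head = 3 * d_model * d_head  # one query, one key, one value matrix
    return n_heads * per_head

d_model = 512
one_large_head = qkv_parameter_count(d_model, n_heads=1)
eight_heads = qkv_parameter_count(d_model, n_heads=8)

print(one_large_head, eight_heads)  # 786432 786432
```

Under these assumptions both designs use the same number of query, key, and value parameters, yet the 8-head layout produces 8 score matrices instead of 1.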
13. Common Confusions
Does each head receive a different part of the input?
No. Each head receives the same input, but uses different learned projection matrices.
Are heads manually assigned to grammar, location, or ownership?
No. These are intuitive examples. The heads learn their own patterns during training.
Is the concatenated vector the final output?
Not quite. The concatenated heads are passed through the output projection to create the final representation.
Does multi-head attention change the number of tokens?
No. If the input has 3 tokens, the output still has 3 token representations. The feature dimension is managed through projection.
Are more heads always better?
Not necessarily. More heads increase compute and may reduce the dimension per head if $d_{\text{model}}$ stays fixed. The number of heads is a design choice.
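The trade-off between head count and per-head dimension can be tabulated directly. This sketch assumes the model width stays at 512 and that heads split it evenly; `head_dim` is a hypothetical helper.

```python
def head_dim(d_model, n_heads):
    # Assumption: the model width is divided evenly among the heads.
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads

d_model = 512
dims = {n_heads: head_dim(d_model, n_heads) for n_heads in (1, 2, 4, 8, 16, 32)}

for n_heads, d_head in dims.items():
    print(f"{n_heads:>2} heads -> {d_head:>3} dimensions per head")
```

Doubling the number of heads halves the space each head has for its queries, keys, and values.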
14. The Complete Formula
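The full multi-head attention formula is:
$$
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O
$$
where:
$$
\text{head}_i = \text{Attention}(XW_i^Q, \; XW_i^K, \; XW_i^V)
$$
For self-attention, the query, key, and value projections all come from the same input sequence $X$.
Using the scaled dot-product formula:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$
Then:
$$
\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right)V_i, \quad Q_i = XW_i^Q, \; K_i = XW_i^K, \; V_i = XW_i^V
$$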
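Putting the pieces together, here is a compact NumPy sketch of multi-head self-attention. It is one possible layout, not a reference implementation: it assumes all per-head projection matrices are stored side by side as columns of one square matrix, omits masking, biases, and batching, and uses random stand-ins for learned weights. The function name `multi_head_self_attention` is hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n_tokens, d_model). W_q, W_k, W_v, W_o: (d_model, d_model).

    Columns i*d_head:(i+1)*d_head of W_q, W_k, W_v act as head i's matrices.
    """
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # One score matrix per head: (n_heads, n_tokens, n_tokens).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ V  # (n_heads, n_tokens, d_head)

    # Concatenate the heads back to (n_tokens, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(4)
n_tokens, d_model, n_heads = 3, 16, 4  # toy sizes (assumption)
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4))

output, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(output.shape)   # (3, 16)
print(weights.shape)  # (4, 3, 3)
```

Storing the heads in one matrix and reshaping lets all heads run in a single batched matrix multiplication instead of a Python loop.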
15. One-Line Interview Answer
Multi-head attention runs several self-attention heads in parallel, each with its own learned query, key, and value projections. Each head can learn a different attention pattern over the same input sequence. The head outputs are concatenated and passed through an output projection to produce one unified contextual representation.
16. Final Intuition
A single attention head asks:
Which tokens matter to this token from one learned perspective?
Multi-head attention asks:
Which tokens matter from several learned perspectives at once?
One head may capture grammar. Another may capture location. Another may capture possession. Another may capture a long-range reference.
The final projection then combines these perspectives into one representation that the rest of the Transformer can use.
That is why multi-head attention is more powerful than a single attention head:
It lets the model read the same sentence through multiple relational lenses at the same time.