Self Attention in Transformers - The Deep Dive
Q: What Is Self-Attention, Really?
Manish: Create a beautiful, intuitive book from the transcript about self-attention in Transformers.
Response: Let us build self-attention from the ground up. Not as a formula first. Not as a mysterious Transformer trick. We will begin with a simple problem:
The same word can mean different things in different contexts.
Self-attention is the mechanism that lets a model look at the other words in a sentence and reshape each word's representation according to its context.
Book Map
flowchart TD
A["Problem: static word meanings"] --> B["Need: contextual word representation"]
B --> C["Idea: mix each word with other words"]
C --> D["Question: how much should each word matter?"]
D --> E["Similarity: dot product"]
E --> F["Probability: softmax"]
F --> G["Learnable attention: Q, K, V"]
G --> H["Matrix equation"]
H --> I["Transformer self-attention"]
By the end, the formula

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
will feel less like a spell and more like a compact sentence.
1. The Real Problem: One Word, Many Meanings
Imagine the word light.
It can mean visible radiation:
The light entered the room.
It can mean low weight:
This bag is light.
It can describe color:
She wore a light blue shirt.
It can even act like a verb:
Light the candle.
The spelling is the same, but the meaning changes.
The same thing happens with Apple:
| Sentence | Meaning of Apple |
|---|---|
| I love Apple phones. | Apple as a technology company |
| I love apple juice. | Apple as a fruit |
If a model gives the word apple the same fixed vector in both sentences, then something important is missing. The model needs a way to say:
"Apple" means one thing near "phones" and another thing near "juice."
That is the need self-attention answers.
2. Static Word Embeddings Are Not Enough
A word embedding is a vector: a list of numbers that represents a word.
For example, the word apple might contain features like:
| Feature direction | Rough meaning |
|---|---|
| technology | company, phone, device, software |
| fruit | food, juice, tree, sweet |
| object | thing, noun, concrete entity |
The problem is not that embeddings are useless. They are very useful. The problem is that a normal word embedding is static.
It does not automatically change from sentence to sentence.
flowchart LR
A["apple embedding"] --> B["I love Apple phones"]
A --> C["I love apple juice"]
B --> D["Same vector used"]
C --> D
D --> E["Context is lost"]
If the original embedding of apple is 50 percent technology and 50 percent fruit, then it is too vague for both sentences.
For Apple phones, we want the representation to move toward technology.
For apple juice, we want the representation to move toward fruit.
This new, sentence-aware vector is called a contextual word representation.
A static embedding says: "What does this word usually mean?"
A contextual representation says: "What does this word mean here?"
3. The First Leap: Let Every Word Borrow Meaning From Other Words
How do we make the representation of apple aware of the sentence?
The other words provide the context.
In:
I love Apple phones.
the word phones tells us that Apple probably refers to the technology company.
In:
I love apple juice.
the word juice tells us that apple probably refers to the fruit.
So the first idea is:
Create a new representation of a word by mixing it with the other words in the same sentence.
For the sentence:
love Apple phones
we might say:

$$
\text{Apple}' = 0.1\,\text{love} + 0.5\,\text{Apple} + 0.4\,\text{phones}
$$
This does not mean mixing the written words. It means mixing their vectors.
flowchart LR
L["love vector"] -- "0.1" --> A2["new Apple vector"]
A["Apple vector"] -- "0.5" --> A2
P["phones vector"] -- "0.4" --> A2
This new Apple vector keeps much of the original Apple meaning, but it also borrows meaning from phones. So it moves toward the technology side.
For:
love apple juice
we could write:

$$
\text{apple}' = 0.1\,\text{love} + 0.5\,\text{apple} + 0.4\,\text{juice}
$$
Now the vector borrows meaning from juice, so it moves toward fruit.
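The mixing idea can be sketched in a few lines of Python. This is only an illustration: the two-dimensional vectors (a technology axis and a fruit axis), the `embeddings` table, and the `mix` helper are assumptions made up for this sketch, and the weights 0.1, 0.5, and 0.4 are reused for both sentences purely for simplicity.

```python
# Hypothetical 2-D embeddings: [technology, fruit]. Values are illustrative only.
embeddings = {
    "love":   [0.0, 0.0],
    "apple":  [0.5, 0.5],   # ambiguous: half technology, half fruit
    "phones": [1.0, 0.0],
    "juice":  [0.0, 1.0],
}

def mix(words, weights):
    """Return the weighted sum of the vectors for the given words."""
    dims = len(embeddings[words[0]])
    return [
        sum(w * embeddings[word][d] for word, w in zip(words, weights))
        for d in range(dims)
    ]

weights = [0.1, 0.5, 0.4]  # assumed mixing weights from the example above
apple_near_phones = mix(["love", "apple", "phones"], weights)
apple_near_juice = mix(["love", "apple", "juice"], weights)

print("apple near phones:", apple_near_phones)
print("apple near juice: ", apple_near_juice)
```

With these made-up numbers, the same starting apple vector ends up leaning toward the technology axis next to phones and toward the fruit axis next to juice.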
4. The Numbers Are Attention Weights
The numbers 0.1, 0.5, and 0.4 are not random decoration.
They answer this question:
When creating the new representation of this word, how much should I pay attention to each word in the sentence?
For Apple in "love Apple phones":
| Word being considered | Question | Possible weight |
|---|---|---|
| love | How related is Apple to love? | 0.1 |
| Apple | How related is Apple to itself? | 0.5 |
| phones | How related is Apple to phones? | 0.4 |
These are called attention weights.
They must behave like probabilities:

$$
\alpha_i \ge 0, \qquad \sum_i \alpha_i = 1
$$
So self-attention needs two things:
- A way to score how related two words are.
- A way to convert those scores into probabilities.
That gives us the next two ingredients: dot product and softmax.
5. Similarity: The Dot Product
In embedding space, related words tend to live closer together or point in similar directions.
So to decide how much one word should attend to another, we need a similarity score.
One common similarity score is the dot product.
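If two vectors point in similar directions, their dot product is large and positive. If they are roughly perpendicular (unrelated), it is near zero, and if they point in opposite directions, it is negative.
For two vectors:

$$
a = (a_1, a_2, \dots, a_n), \qquad b = (b_1, b_2, \dots, b_n)
$$

their dot product is:

$$
a \cdot b = \sum_{i=1}^{n} a_i b_i
$$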
Tiny Example
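Suppose that, in a simplified embedding space, the vectors give these dot products:

$$
\text{Apple} \cdot \text{phones} = 8, \qquad \text{Apple} \cdot \text{love} = -2
$$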
So Apple is more similar to phones than to love in this simplified space.
flowchart TD
A["Apple vector"] --> D1["dot product with love = -2"]
A --> D2["dot product with Apple = 8"]
A --> D3["dot product with phones = 8"]
D1 --> S["softmax"]
D2 --> S
D3 --> S
S --> W["attention weights"]
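The dot product itself is a one-line computation. The sketch below uses three hypothetical vectors, chosen here only because they reproduce the scores shown in the diagram (-2, 8, and 8); they are not the document's original example vectors.

```python
def dot(a, b):
    """Dot product: multiply matching components and add the results."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors, picked only to reproduce the diagram's scores.
apple = [2, 2, 0]
phones = [3, 1, 1]
love = [-1, 0, 2]

scores = {
    "love": dot(apple, love),
    "Apple": dot(apple, apple),
    "phones": dot(apple, phones),
}
print(scores)  # {'love': -2, 'Apple': 8, 'phones': 8}
```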
6. Softmax: Turning Scores Into Percentages
Dot products can produce any number:
- large positive numbers
- small positive numbers
- zero
- negative numbers
But attention weights should be probabilities. They should be non-negative and sum to 1.
That is exactly what softmax does.
First Principle: Turning Preference Into Share
Imagine you have a fixed attention budget of 1.
For the word Apple, the model may give raw relevance scores to the surrounding words:
| Word | Raw score |
|---|---|
| love | 1 |
| Apple | 3 |
| phones | 6 |
These scores already tell us the ordering:
phones matters most, Apple matters next, love matters least.
But they are not yet usable as mixing weights. They are just raw preference numbers. We cannot directly say:

$$
\text{Apple}' = 1\,\text{love} + 3\,\text{Apple} + 6\,\text{phones}
$$
because this would make the size of the new vector depend on the size of the raw scores. We need percentages instead.
Softmax does this in two moves:
- It makes every score positive by applying an exponential.
- It divides each positive value by the total.
So every word receives a share of the attention budget:

$$
\text{softmax}(s_i) = \frac{e^{s_i}}{\sum_j e^{s_j}}
$$
flowchart LR
S["raw scores: 1, 3, 6"] --> E["exponentiate: positive strengths"]
E --> N["normalize by total"]
N --> P["probabilities that sum to 1"]
The biggest score gets the biggest share, but smaller scores are not completely deleted.
That is why it is called soft max:
It behaves like choosing the maximum, but softly. Instead of selecting only one winner, it distributes attention across all options.
Given scores:

$$
(-2,\ 8,\ 8)
$$

softmax turns them into something close to:

$$
(0,\ 0.5,\ 0.5)
$$
The exact first value is not literally zero, but it is so small that the intuition is correct.
Dot product answers: "How strongly are these two words related?"
Softmax answers: "Compared with all other words, what percentage of attention should this word receive?"
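The two softmax moves (exponentiate, then divide by the total) can be written as a short function. This is a minimal sketch: the name `softmax` and the max-subtraction step are choices made here. Subtracting the largest score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    top = max(scores)                          # stability trick, result unchanged
    exps = [math.exp(s - top) for s in scores] # move 1: make everything positive
    total = sum(exps)
    return [e / total for e in exps]           # move 2: divide by the total

weights = softmax([1, 3, 6])
print([round(w, 3) for w in weights])  # [0.006, 0.047, 0.946]

near_tie = softmax([-2, 8, 8])
print([round(w, 3) for w in near_tie])  # [0.0, 0.5, 0.5]
```

The raw scores 1, 3, 6 keep their ordering, but the largest score takes most of the budget while the smaller ones still receive a non-zero share.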
Now we can build a first version of self-attention.
7. First Version of Self-Attention: Weighted Mixing
For one target word, self-attention does this:
- Compare the target word with every word in the sentence.
- Convert similarity scores into attention weights.
- Multiply each word vector by its attention weight.
- Add the weighted vectors together.
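For Apple:

$$
\text{Apple}' = \alpha_1\,\text{love} + \alpha_2\,\text{Apple} + \alpha_3\,\text{phones}
$$

where:

$$
(\alpha_1, \alpha_2, \alpha_3) = \text{softmax}(\text{Apple} \cdot \text{love},\ \text{Apple} \cdot \text{Apple},\ \text{Apple} \cdot \text{phones})
$$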
This same process happens for every word.
flowchart TD
subgraph Sentence [Sentence vectors]
L["love"]
A["Apple"]
P["phones"]
end
A --> C1["compare Apple with love"]
A --> C2["compare Apple with Apple"]
A --> C3["compare Apple with phones"]
C1 --> SM["softmax"]
C2 --> SM
C3 --> SM
SM --> M["weighted sum of word vectors"]
L --> M
A --> M
P --> M
M --> OUT["Apple'"]
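The four steps above (compare, softmax, weight, add) can be sketched without any learned parameters. The function name `simple_self_attention` and the 2-D vectors are assumptions for illustration only; the first axis can be read as "technology" and the second as "fruit".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(vectors):
    """Parameter-free self-attention: every vector becomes a weighted mix of all vectors."""
    outputs = []
    for target in vectors:
        scores = [dot(target, other) for other in vectors]  # compare
        weights = softmax(scores)                           # scores -> weights
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vectors)) # weight and add
            for d in range(len(target))
        ]
        outputs.append(mixed)
    return outputs

# Hypothetical vectors for: love, Apple, phones
sentence = [[0.0, 0.1], [1.0, 1.0], [2.0, 0.0]]
new_vectors = simple_self_attention(sentence)
new_apple = new_vectors[1]
print([round(x, 3) for x in new_apple])
```

In this toy setup the original Apple vector is balanced between the two axes, while the new one has a larger first component because phones pulled it in that direction.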
The Gravity Analogy
Think of words as objects in meaning-space.
Attention lets nearby or relevant words pull each other.
If Apple is strongly related to phones, then phones pulls the Apple representation toward the technology region.
If Apple is strongly related to juice, then juice pulls the Apple representation toward the fruit region.
This is the heart of self-attention:
Every word becomes a context-shaped mixture of the words around it.
8. But This First Version Has Two Problems
The simple version above is intuitive, but it is not enough for a trainable Transformer.
Problem 1: No Learnable Parameters
So far, we only used existing word embeddings and fixed math operations.
That means the model cannot learn what kind of similarity matters for a task.
Consider:
I love Apple watch.
The word watch can mean:
- to observe
- a timekeeping device
- a smartwatch product
The general embedding of watch might sit between "observe" and "clock." The general embedding of Apple might sit between "fruit" and "technology company."
Without trainable parameters, the model may fail to learn that Apple watch refers to a product category.
We want the model to learn:
In this task and this data, these relationships matter more than those relationships.
Problem 2: The Same Vector Is Used For Everything
In the simple version, the same embedding plays three roles:
- It asks what to look for.
- It advertises what it contains.
- It contributes information to the final mixture.
That creates symmetry.
The similarity from Apple to phones becomes the same as the similarity from phones to Apple.
But language is often directional.
Maybe phones should strongly shape Apple, because it clarifies Apple as a company. But Apple should not necessarily reshape phones just as strongly.
So we need different learned versions of each word vector.
That is why Transformers create:
- Query vectors
- Key vectors
- Value vectors
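The symmetry problem is easy to see numerically. In the sketch below, the vectors and the matrices `W_Q` and `W_K` are hypothetical values picked for illustration. Raw dot products are always symmetric, but once queries and keys come from different projections, the score from Apple to phones no longer has to equal the score from phones to Apple.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(x, W):
    """Multiply a row vector x by a matrix W (given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical embeddings and projection matrices.
apple = [1, 1]
phones = [0, 2]
W_Q = [[1, 0], [0, 1]]
W_K = [[1, 1], [0, 1]]

# Without projections: the score is the same in both directions.
raw_apple_to_phones = dot(apple, phones)
raw_phones_to_apple = dot(phones, apple)

# With separate query and key projections: the two directions can differ.
apple_to_phones = dot(project(apple, W_Q), project(phones, W_K))
phones_to_apple = dot(project(phones, W_Q), project(apple, W_K))

print(raw_apple_to_phones, raw_phones_to_apple)  # 2 2
print(apple_to_phones, phones_to_apple)          # 2 4
```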
9. Query, Key, Value: The Three Roles
From every word embedding, self-attention creates three new vectors:

$$
Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V
$$

where:
- $X$ is the matrix of word embeddings.
- $W_Q$, $W_K$, and $W_V$ are learnable weight matrices.
The same original word now has three learned projections.
| Vector | Intuitive role | Question it helps answer |
|---|---|---|
| Query | what this word is looking for | "What information do I need?" |
| Key | what this word can match against | "Should others pay attention to me?" |
| Value | what this word contributes | "If selected, what information do I pass on?" |
flowchart LR
E["word embedding"] --> WQ["multiply by W_Q"]
E --> WK["multiply by W_K"]
E --> WV["multiply by W_V"]
WQ --> Q["Query"]
WK --> K["Key"]
WV --> V["Value"]
The weight matrices are learned during training.
This solves both problems:
- The model now has trainable parameters.
- The model can use different vectors for matching and information mixing.
10. Attention For One Word Using Q, K, V
Suppose the sentence is:
love Apple phones
To create the new representation of Apple:
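- Take the query of Apple: $q_{\text{Apple}}$.
- Compare it with every key: $k_{\text{love}}$, $k_{\text{Apple}}$, $k_{\text{phones}}$.
- Convert the scores into attention weights using softmax.
- Use those weights to mix the value vectors: $v_{\text{love}}$, $v_{\text{Apple}}$, $v_{\text{phones}}$.
The score calculation is:

$$
s_1 = q_{\text{Apple}} \cdot k_{\text{love}}, \qquad s_2 = q_{\text{Apple}} \cdot k_{\text{Apple}}, \qquad s_3 = q_{\text{Apple}} \cdot k_{\text{phones}}
$$

Then:

$$
(\alpha_1, \alpha_2, \alpha_3) = \text{softmax}(s_1, s_2, s_3)
$$

Finally:

$$
\text{Apple}' = \alpha_1 v_{\text{love}} + \alpha_2 v_{\text{Apple}} + \alpha_3 v_{\text{phones}}
$$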
flowchart TD
QA["Query: Apple"] --> S1["score with Key: love"]
QA --> S2["score with Key: Apple"]
QA --> S3["score with Key: phones"]
S1 --> SM["softmax"]
S2 --> SM
S3 --> SM
SM --> A1["weight alpha_1"]
SM --> A2["weight alpha_2"]
SM --> A3["weight alpha_3"]
V1["Value: love"] --> SUM["weighted sum"]
V2["Value: Apple"] --> SUM
V3["Value: phones"] --> SUM
A1 --> SUM
A2 --> SUM
A3 --> SUM
SUM --> OUT["new Apple representation"]
Notice the separation:
- Queries and keys decide attention weights.
- Values provide the content that gets mixed.
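The single-word procedure can be sketched as one function. The name `attend` and all query, key, and value numbers below are hypothetical; in a real model they would come from the learned projections. Scaling by the key dimension is left out here to match the unscaled description in this section.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention for one target word: the query and keys set the weights, the values get mixed."""
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores)
    output = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return weights, output

# Hypothetical vectors for the sentence: love, Apple, phones
q_apple = [1, 1]
keys = [[1, 1], [1, 2], [0, 2]]
values = [[1, 0], [1, 1], [0, 2]]

alphas, new_apple = attend(q_apple, keys, values)
print([round(a, 3) for a in alphas])
print([round(x, 3) for x in new_apple])
```

Changing only `values` changes the output but leaves `alphas` untouched, which is the separation described above.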
11. The Same Process Happens For Every Word
To create the new representation of love, use the query of love.
To create the new representation of Apple, use the query of Apple.
To create the new representation of phones, use the query of phones.
The keys and values for all words are available to all target words.
flowchart TD
LQ["q_love"] --> O1["love'"]
AQ["q_Apple"] --> O2["Apple'"]
PQ["q_phones"] --> O3["phones'"]
K["all keys: love, Apple, phones"] --> O1
K --> O2
K --> O3
V["all values: love, Apple, phones"] --> O1
V --> O2
V --> O3
Each word asks its own question, attends to the sentence, and receives a custom mixture of value vectors.
That is why self-attention creates one new contextual vector per input token.
12. Matrix Form: How Computers Actually Do It
The word-by-word explanation is useful for intuition.
But computers prefer matrices.
Suppose the sentence has 3 tokens:
love Apple phones
And suppose each word embedding has 512 dimensions.
Then the input matrix is:

$$
X \in \mathbb{R}^{3 \times 512}
$$
Each row is one word vector:
| Row | Token |
|---|---|
| 1 | love |
| 2 | Apple |
| 3 | phones |
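Now multiply by learned matrices:

$$
Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V
$$

So, with each weight matrix of shape $512 \times 64$:

$$
Q,\ K,\ V \in \mathbb{R}^{3 \times 64}
$$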
flowchart LR
X["X: 3 x 512"] --> WQ["W_Q: 512 x 64"]
X --> WK["W_K: 512 x 64"]
X --> WV["W_V: 512 x 64"]
WQ --> Q["Q: 3 x 64"]
WK --> K["K: 3 x 64"]
WV --> V["V: 3 x 64"]
Similarity Matrix
To compare every query with every key:

$$
QK^T
$$

Shape:

$$
(3 \times 64)(64 \times 3) = 3 \times 3
$$
This produces a score matrix:
| | key: love | key: Apple | key: phones |
|---|---|---|---|
| query: love | score | score | score |
| query: Apple | score | score | score |
| query: phones | score | score | score |
Each row says:
For this target word, how much does it match each word in the sentence?
Then we apply softmax across each row, turning scores into attention weights.
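Finally, multiply the attention matrix $A$ (the row-wise softmax output) by $V$:

$$
\text{Output} = AV
$$

Shape:

$$
(3 \times 3)(3 \times 64) = 3 \times 64
$$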
The result is a matrix of new contextual representations:
| Row | New contextual vector |
|---|---|
| 1 | love' |
| 2 | Apple' |
| 3 | phones' |
flowchart TD
Q["Q: queries"] --> DOT["Q times K^T"]
K["K: keys"] --> DOT
DOT --> SCORES["score matrix"]
SCORES --> SCALE["divide by sqrt(d_k)"]
SCALE --> SM["row-wise softmax"]
SM --> ATT["attention matrix"]
V["V: values"] --> MIX["attention matrix times V"]
ATT --> MIX
MIX --> OUT["contextual word vectors"]
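The matrix pipeline in the diagram can be sketched in plain Python to check the shapes. Everything here is an assumption for illustration: the helper names, the random inputs, and the use of nested lists instead of a tensor library. A real implementation would use optimized matrix routines.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(row) for row in zip(*M)]

def softmax(row):
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over all tokens at once."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                          # n x n
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    attention = [softmax(row) for row in scaled]              # row-wise softmax
    return attention, matmul(attention, V)                    # n x d_k output

def shape(M):
    return (len(M), len(M[0]))

def random_matrix(rows, cols, scale):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
X = random_matrix(3, 512, 1.0)       # 3 tokens, 512-dimensional embeddings
W_Q = random_matrix(512, 64, 0.05)   # random stand-ins for learned weights
W_K = random_matrix(512, 64, 0.05)
W_V = random_matrix(512, 64, 0.05)

attention, output = self_attention(X, W_Q, W_K, W_V)
print(shape(attention))  # (3, 3)
print(shape(output))     # (3, 64)
```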
Tiny Matrix Example With Real Numbers
Let us use very small vectors so we can see the whole calculation.
Sentence:
love Apple phones
Suppose each word embedding has only 2 dimensions:
So:
| Row | Token | Embedding |
|---|---|---|
| 1 | love | |
| 2 | Apple | |
| 3 | phones |
Now choose tiny learned matrices:
These numbers are just a toy example.
They were not derived from the sentence by hand. They were chosen here only so the arithmetic stays small enough to see.
In a real Transformer, $W_Q$, $W_K$, and $W_V$ are learnable parameters.
At the beginning of training, they usually start as small random numbers. Then the model slowly changes them to reduce prediction error.
flowchart TD
A["start with random W_Q, W_K, W_V"] --> B["calculate Q, K, V"]
B --> C["run attention"]
C --> D["make prediction"]
D --> E["compare with correct answer"]
E --> F["calculate loss"]
F --> G["backpropagation finds useful changes"]
G --> H["optimizer updates W_Q, W_K, W_V"]
H --> B
So these matrices are not manually designed. They are discovered through training.
Their roles are:
| Matrix | What it learns |
|---|---|
| $W_Q$ | how to create a query: what this token should look for |
| $W_K$ | how to create a key: what this token offers for matching |
| $W_V$ | how to create a value: what information this token should pass forward |
In this toy example, $W_Q$ is the identity matrix, so it keeps each query vector the same as the original embedding. That is only for simplicity.
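Now calculate:

$$
Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V
$$

Now compare every query with every key:

$$
QK^T = \begin{bmatrix} 1 & 1 & 0 \\ 2 & 3 & 2 \\ 2 & 4 & 4 \end{bmatrix}
$$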
This is the raw attention score matrix.
| | key: love | key: Apple | key: phones |
|---|---|---|---|
| query: love | 1 | 1 | 0 |
| query: Apple | 2 | 3 | 2 |
| query: phones | 2 | 4 | 4 |
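Because each query/key vector has 2 dimensions:

$$
d_k = 2
$$

So we scale by:

$$
\sqrt{d_k} = \sqrt{2} \approx 1.41
$$

The scaled score matrix is:

$$
\frac{QK^T}{\sqrt{2}} \approx \begin{bmatrix} 0.71 & 0.71 & 0.00 \\ 1.41 & 2.12 & 1.41 \\ 1.41 & 2.83 & 2.83 \end{bmatrix}
$$

After applying softmax to each row, we get an approximate attention matrix:

$$
A \approx \begin{bmatrix} 0.40 & 0.40 & 0.20 \\ 0.25 & 0.50 & 0.25 \\ 0.11 & 0.45 & 0.45 \end{bmatrix}
$$

Each row sums to 1, up to rounding.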
The second row is the attention pattern for Apple:

$$
(0.25,\ 0.50,\ 0.25)
$$
Meaning:
| Value vector | Weight |
|---|---|
| love | 0.25 |
| Apple | 0.50 |
| phones | 0.25 |
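Now mix the value vectors using these weights:

$$
\text{Apple}' = 0.25\,v_{\text{love}} + 0.50\,v_{\text{Apple}} + 0.25\,v_{\text{phones}}
$$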
This is the new contextual representation of Apple.
If we multiply the whole attention matrix by $V$, we get the contextual representation for every token:
| Row | Token | New contextual vector |
|---|---|---|
| 1 | love | |
| 2 | Apple | |
| 3 | phones |
The important idea:
$Q$ and $K$ decide how much attention to give. $V$ provides the information that gets mixed.
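The whole toy calculation can be replayed in code. The embeddings `X` and the matrices `W_K` and `W_V` below are assumptions: they are one set of small integers that happens to reproduce the raw score table above with an identity `W_Q`, and they should not be read as the original example's exact values.

```python
import math

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(row) for row in zip(*M)]

def softmax(row):
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values (rows of X: love, Apple, phones).
X = [[1, 0], [1, 1], [0, 2]]
W_Q = [[1, 0], [0, 1]]   # identity, as stated in the text
W_K = [[1, 1], [0, 1]]   # assumed
W_V = [[1, 0], [1, 1]]   # assumed

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)

scores = matmul(Q, transpose(K))
print(scores)  # [[1, 1, 0], [2, 3, 2], [2, 4, 4]]

d_k = len(K[0])
scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
attention = [softmax(row) for row in scaled]
print([round(w, 2) for w in attention[1]])  # [0.25, 0.5, 0.25]

# Each output row is a weighted mix of the rows of V.
contextual = matmul(attention, V)
for token, vector in zip(["love", "Apple", "phones"], contextual):
    print(token, [round(x, 2) for x in vector])
```

The score matrix and Apple's attention row match the tables above. The final contextual vectors depend on the assumed `W_V`.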
13. The Full Equation
The final scaled dot-product attention equation is:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Read it slowly:
| Part | Meaning |
|---|---|
| $QK^T$ | compare every query with every key |
| $\sqrt{d_k}$ | scale scores so they do not become too large |
| softmax | convert scores into attention probabilities |
| multiply by $V$ | mix information using those probabilities |
The transcript builds the equation first without the division by $\sqrt{d_k}$, then mentions that the original paper includes it. The scaling term keeps dot-product scores from growing with the key/query dimension $d_k$, so the softmax does not saturate when that dimension is large.
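A small numeric sketch shows why the scaling matters. The raw scores below are hypothetical, chosen as the kind of magnitudes that dot products can reach when $d_k = 64$. Without scaling, softmax puts almost everything on one token. With scaling, the distribution stays soft.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 16.0, 24.0]  # hypothetical unscaled dot products

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 3) for w in unscaled])  # [0.0, 0.0, 1.0]
print([round(w, 3) for w in scaled])    # [0.09, 0.245, 0.665]
```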
14. Self-Attention As A Story
For each word, self-attention performs this inner dialogue:
- "I am this word. What do I need to understand myself in this sentence?"
- "Let me compare my query with every word's key."
- "The stronger the match, the more attention I give."
- "Now I collect information from the value vectors."
- "My new meaning is the weighted mixture of what mattered."
sequenceDiagram
participant T as Target word
participant Q as Query
participant K as All keys
participant S as Softmax
participant V as All values
participant O as New vector
T->>Q: create query
Q->>K: compare with every key
K->>S: send similarity scores
S->>V: produce attention weights
V->>O: weighted sum of values
This is why the name is self-attention:
The sentence attends to itself. Each token looks at other tokens from the same input.
15. Why Learnable Q, K, V Matter
Without $W_Q$, $W_K$, and $W_V$, attention is only a fixed similarity operation.
With them, the model can learn many kinds of relationship:
| Relationship type | Example |
|---|---|
| product relationship | Apple -> watch |
| grammar relationship | adjective -> noun |
| reference relationship | pronoun -> earlier noun |
| semantic relationship | juice -> fruit |
| task-specific relationship | question word -> answer span |
The model learns which features should be emphasized in queries, keys, and values.
For one task, it may learn that product names matter.
For another task, it may learn that grammar roles matter.
For another, it may learn that long-distance references matter.
That is the power of trainable attention.
16. The Two Big Benefits
Benefit 1: Parallel Processing
RNNs process tokens step by step:
flowchart LR
W1["word 1"] --> W2["word 2"] --> W3["word 3"] --> W4["word 4"] --> W5["word 5"]
Self-attention compares tokens using matrix multiplication:
flowchart TD
W1["word 1"] --> A["attention layer"]
W2["word 2"] --> A
W3["word 3"] --> A
W4["word 4"] --> A
W5["word 5"] --> A
A --> O1["word 1'"]
A --> O2["word 2'"]
A --> O3["word 3'"]
A --> O4["word 4'"]
A --> O5["word 5'"]
The output for each token does not need the previous output to be computed first. That makes attention highly parallelizable on GPUs.
Important nuance:
Self-attention removes sequential dependency, but it does not mean total computation is constant. For $n$ tokens, the attention score matrix has $n^2$ pairwise comparisons. The advantage is that these comparisons can be computed in parallel.
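The quadratic growth is easy to tabulate. The helper name `score_matrix_entries` is made up for this sketch. It simply counts one score per query-key pair.

```python
def score_matrix_entries(n_tokens):
    """Number of query-key scores in full self-attention over n_tokens tokens."""
    return n_tokens * n_tokens

for n in (10, 100, 1000):
    print(n, score_matrix_entries(n))
# 10 100
# 100 10000
# 1000 1000000
```

Making the sequence 10 times longer makes the score matrix 100 times larger, even though all entries can still be computed in parallel.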
Benefit 2: Long-Range Dependencies
In an RNN, information from an early word must travel step by step through the sequence.
In self-attention, any word can directly compare with any other word in one attention layer.
flowchart LR
A["word at start"] --- Z["word at end"]
A --- M1["middle word 1"]
A --- M2["middle word 2"]
Z --- M1
Z --- M2
This helps with sentences where meaning depends on words far apart.
Example:
The Apple device that I bought last year, after months of comparing music players, voicemail features, and battery life, was a phone.
The word Apple can directly attend to phone, even if many words separate them.
17. The Whole Mechanism In One Page
flowchart TD
X["Input tokens"] --> E["Embeddings"]
E --> Q["Q = X W_Q"]
E --> K["K = X W_K"]
E --> V["V = X W_V"]
Q --> S["Scores = Q K^T"]
K --> S
S --> SC["Scale by sqrt(d_k)"]
SC --> P["Softmax -> attention weights"]
P --> O["Weighted sum: weights times V"]
V --> O
O --> C["Contextual token representations"]
Self-attention is just this loop:
- Create query, key, and value vectors.
- Use queries and keys to compute relevance.
- Use softmax to turn relevance into attention.
- Use attention to mix values.
- Return context-aware vectors.
18. Example: Apple Phones vs Apple Juice
Let the original embedding of apple contain both technology and fruit meaning.
flowchart TD
A["apple: mixed meaning"] --> T["technology direction"]
A --> F["fruit direction"]
Sentence 1: I love Apple phones
The word phones receives a high attention weight when Apple is being updated.
flowchart LR
L["love"] -- "low weight" --> A2["Apple'"]
A["Apple"] -- "medium weight" --> A2
P["phones"] -- "high weight" --> A2
A2 --> TECH["more technology-like"]
Sentence 2: I love apple juice
The word juice receives a high attention weight when apple is being updated.
flowchart LR
L["love"] -- "low weight" --> A2["apple'"]
A["apple"] -- "medium weight" --> A2
J["juice"] -- "high weight" --> A2
A2 --> FRUIT["more fruit-like"]
Same starting word. Different neighboring words. Different final representation.
That is contextual meaning.
19. Common Mistakes
Mistake 1: "Attention weights are the final meaning."
No. Attention weights only say how strongly to mix value vectors. The final meaning is the weighted sum of values.
Mistake 2: "Dot product alone is self-attention."
No. Dot product gives scores. Self-attention also needs softmax, values, and learned projections.
Mistake 3: "Self-attention is always cheap."
No. It is parallelizable, but the attention matrix is $n \times n$ for $n$ tokens. Long sequences can be expensive.
Mistake 4: "Q, K, and V are three different words."
No. For each token, Q, K, and V are three learned projections of the same input representation.
Mistake 5: "The word only attends to nearby words."
No. In full self-attention, every token can attend to every other token, unless a mask restricts it.
20. Glossary
| Term | Meaning |
|---|---|
| Token | A piece of text processed by the model, often a word or word fragment |
| Embedding | A vector representation of a token |
| Static embedding | Same vector for a word regardless of context |
| Contextual representation | A vector that changes depending on surrounding tokens |
| Attention score | Raw similarity between a query and a key |
| Attention weight | Softmax-normalized score used for mixing values |
| Query | Learned vector used to ask what the token needs |
| Key | Learned vector used to match against queries |
| Value | Learned vector whose information gets mixed into the output |
| $W_Q$, $W_K$, $W_V$ | Learnable matrices that create Q, K, and V |
| $QK^T$ | Matrix of pairwise query-key scores |
| Softmax | Function that converts scores into probabilities |
| $d_k$ | Dimension of key/query vectors |
21. Memory Hooks
The Restaurant Analogy
Imagine every word is a customer at a restaurant.
- Query: what the customer wants to eat.
- Key: what each dish says it can satisfy.
- Value: the actual dish served.
The customer compares their query with all keys, chooses a weighted mixture, and receives a plate made from values.
The Search Analogy
Imagine searching in a library.
- Query: your search phrase.
- Keys: labels on books.
- Values: the actual content inside the books.
You match your query against labels, then read more from the books with stronger matches.
The Meaning Gravity Analogy
Words pull each other in meaning-space. Strongly related words pull more. Weakly related words pull less.
Self-attention learns which pulls should matter.
22. Active Recall
Answer these without looking above.
- Why is a static word embedding not enough for the word "apple"?
- What does it mean to create a contextual word representation?
- Why do we need attention weights to sum to 1?
- What does the dot product measure in attention?
- Why do we apply softmax after dot products?
- What problem is solved by learnable matrices $W_Q$, $W_K$, and $W_V$?
- What is the difference between a key and a value?
- What is the shape of $QK^T$ if a sentence has 100 tokens?
- Why is self-attention easier to parallelize than an RNN?
- Why is self-attention useful for long-range dependencies?
23. Answers
- Because the same word can mean different things in different contexts. "Apple phones" and "apple juice" should not use exactly the same final representation for Apple.
- It means creating a new vector for a word based on the other words in the same input.
- Because the weights are used like probabilities for mixing word information.
- It measures similarity or compatibility between two vectors.
- Because raw dot products can be negative or very large. Softmax converts them into non-negative probabilities that sum to 1.
- They let the model learn task-specific ways to compare and mix words.
- A key is used for matching against a query. A value is the information that gets mixed into the output.
- $100 \times 100$.
- Because each token's output can be computed from the same input matrices without waiting for previous token outputs.
- Because any token can directly attend to any other token, even if they are far apart in the sequence.
24. Final Summary
Self-attention begins with a simple need:
A word should mean what it means in this sentence, not only what it usually means.
To solve that, each word creates a new representation by mixing information from the other words.
The model decides how much to mix using attention weights.
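Those weights come from query-key similarity:

$$
\text{scores} = QK^T
$$

The scores become probabilities through softmax:

$$
A = \text{softmax}(QK^T)
$$

The probabilities mix value vectors:

$$
\text{output} = AV
$$

And the full scaled equation is:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$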
That is self-attention:
A trainable way for every token to look at every other token and rebuild itself using context.