Self Attention in Transformers - The Deep Dive

Transformers in Deep Learning - Part 2
Sibling notes: Linear Transformation in Self-Attention - The Deep Dive · Query Key Value in Self-Attention - The Deep Dive

What Is Self-Attention, Really?

Let us build self-attention from the ground up. Not as a formula first. Not as a mysterious Transformer trick. We will begin with a simple problem:

The same word can mean different things in different contexts.

Self-attention is the mechanism that lets a model look at the other words in a sentence and reshape each word's representation according to its context.

Book Map

flowchart TD
    A["Problem: static word meanings"] --> B["Need: contextual word representation"]
    B --> C["Idea: mix each word with other words"]
    C --> D["Question: how much should each word matter?"]
    D --> E["Similarity: dot product"]
    E --> F["Probability: softmax"]
    F --> G["Learnable attention: Q, K, V"]
    G --> H["Matrix equation"]
    H --> I["Transformer self-attention"]

By the end, the formula

\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

will feel less like a spell and more like a compact sentence.

1. The Real Problem: One Word, Many Meanings

Imagine the word light.

It can mean visible radiation:

The light entered the room.

It can mean low weight:

This bag is light.

It can describe color:

She wore a light blue shirt.

It can even act like a verb:

Light the candle.

The spelling is the same, but the meaning changes.

The same thing happens with Apple:

Sentence	Meaning of Apple
I love Apple phones.	Apple as a technology company
I love apple juice.	Apple as a fruit

If a model gives the word apple the same fixed vector in both sentences, then something important is missing. The model needs a way to say:

"Apple" means one thing near "phones" and another thing near "juice."

That is the need self-attention answers.

2. Static Word Embeddings Are Not Enough

A word embedding is a vector: a list of numbers that represents a word.

For example, the word apple might contain features like:

Feature direction	Rough meaning
technology	company, phone, device, software
fruit	food, juice, tree, sweet
object	thing, noun, concrete entity

The problem is not that embeddings are useless. They are very useful. The problem is that a normal word embedding is static.

It does not automatically change from sentence to sentence.

flowchart LR
    A["apple embedding"] --> B["I love Apple phones"]
    A --> C["I love apple juice"]
    B --> D["Same vector used"]
    C --> D
    D --> E["Context is lost"]

If the original embedding of apple is 50 percent technology and 50 percent fruit, then it is too vague for both sentences.

For Apple phones, we want the representation to move toward technology.

For apple juice, we want the representation to move toward fruit.

This new, sentence-aware vector is called a contextual word representation.

Note

A static embedding says: "What does this word usually mean?"

A contextual representation says: "What does this word mean here?"

3. The First Leap: Let Every Word Borrow Meaning From Other Words

How do we make the representation of apple aware of the sentence?

The other words provide the context.

In:

I love Apple phones.

the word phones tells us that Apple probably refers to the technology company.

In:

I love apple juice.

the word juice tells us that apple probably refers to the fruit.

So the first idea is:

Create a new representation of a word by mixing it with the other words in the same sentence.

For the sentence:

love Apple phones

we might say:

\text{Apple'} = 0.1 \cdot \text{love} + 0.5 \cdot \text{Apple} + 0.4 \cdot \text{phones}

This does not mean mixing the written words. It means mixing their vectors.

flowchart LR
    L["love vector"] -- "0.1" --> A2["new Apple vector"]
    A["Apple vector"] -- "0.5" --> A2
    P["phones vector"] -- "0.4" --> A2

This new Apple vector keeps much of the original Apple meaning, but it also borrows meaning from phones. So it moves toward the technology side.

For:

love apple juice

we could write:

\text{Apple'} = 0.1 \cdot \text{love} + 0.5 \cdot \text{apple} + 0.4 \cdot \text{juice}

Now the vector borrows meaning from juice, so it moves toward fruit.
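The mixing step can be sketched in a few lines of plain Python. The two-dimensional vectors below are invented for illustration (first coordinate "technology", second coordinate "fruit"), and `mix` is a hypothetical helper name; only the weights 0.1, 0.5, and 0.4 come from the text.

```python
# Hypothetical 2-D embeddings: [technology, fruit]. Values are illustrative only.
love = [0.1, 0.1]
apple = [0.5, 0.5]   # ambiguous: half technology, half fruit
phones = [1.0, 0.0]
juice = [0.0, 1.0]

def mix(weights, vectors):
    """Weighted sum of vectors, one weight per vector."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

weights = [0.1, 0.5, 0.4]  # the weights used in the text

apple_in_phones = mix(weights, [love, apple, phones])
apple_in_juice = mix(weights, [love, apple, juice])

print(apple_in_phones)  # technology component is now larger than fruit
print(apple_in_juice)   # fruit component is now larger than technology
```

The same starting vector ends up in two different places, depending only on which neighbor it was mixed with.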

4. The Numbers Are Attention Weights

The numbers 0.1, 0.5, and 0.4 are not random decoration.

They answer this question:

When creating the new representation of this word, how much should I pay attention to each word in the sentence?

For Apple in "love Apple phones":

Word being considered	Question	Possible weight
love	How related is Apple to love?	0.1
Apple	How related is Apple to itself?	0.5
phones	How related is Apple to phones?	0.4

These are called attention weights.

They must behave like probabilities:

0.1 + 0.5 + 0.4 = 1

So self-attention needs two things:

A way to score how related two words are.
A way to convert those scores into probabilities.

That gives us the next two ingredients: dot product and softmax.

5. Similarity: The Dot Product

In embedding space, related words tend to live closer together or point in similar directions.

So to decide how much one word should attend to another, we need a similarity score.

One common similarity score is the dot product.

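If two vectors point in similar directions, their dot product is large and positive. If they are perpendicular (unrelated), it is zero, and if they point in opposite directions, it is negative. The dot product also grows with the lengths of the vectors, so it measures alignment and magnitude together.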

For two vectors:

a = [a_1, a_2]

b = [b_1, b_2]

their dot product is:

a \cdot b = a_1b_1 + a_2b_2

Tiny Example

Suppose:

\text{Apple} = [2,2]

\text{phones} = [0,4]

Then:

\text{Apple} \cdot \text{phones} = 2 \cdot 0 + 2 \cdot 4 = 8

If:

\text{love} = [-4,3]

then:

\text{Apple} \cdot \text{love} = 2 \cdot (-4) + 2 \cdot 3 = -2

So Apple is more similar to phones than to love in this simplified space.
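The arithmetic above can be checked with a small sketch. The helper name `dot` is just a convenient choice; the vectors are the ones from the example.

```python
def dot(a, b):
    """Dot product: multiply matching coordinates and add the results."""
    return sum(x * y for x, y in zip(a, b))

apple = [2, 2]
phones = [0, 4]
love = [-4, 3]

print(dot(apple, phones))  # 8
print(dot(apple, love))    # -2
print(dot(apple, apple))   # 8
```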

flowchart TD
    A["Apple vector"] --> D1["dot product with love = -2"]
    A --> D2["dot product with Apple = 8"]
    A --> D3["dot product with phones = 8"]
    D1 --> S["softmax"]
    D2 --> S
    D3 --> S
    S --> W["attention weights"]

6. Softmax: Turning Scores Into Percentages

Dot products can produce any number:

large positive numbers
small positive numbers
zero
negative numbers

But attention weights should be probabilities. They should be non-negative and sum to 1.

That is exactly what softmax does.

First Principle: Turning Preference Into Share

Imagine you have a fixed attention budget of 1.

For the word Apple, the model may give raw relevance scores to the surrounding words:

Word	Raw score
love	1
Apple	3
phones	6

These scores already tell us the ordering:

phones matters most, Apple matters next, love matters least.

But they are not yet usable as mixing weights. They are just raw preference numbers. We cannot directly say:

\text{Apple'} = 1 \cdot \text{love} + 3 \cdot \text{Apple} + 6 \cdot \text{phones}

because this would make the size of the new vector depend on the size of the raw scores. We need percentages instead.

Softmax does this in two moves:

It makes every score positive by applying an exponential.
It divides each positive value by the total.

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

So every word receives a share of the attention budget:

flowchart LR
    S["raw scores: 1, 3, 6"] --> E["exponentiate: positive strengths"]
    E --> N["normalize by total"]
    N --> P["probabilities that sum to 1"]

The biggest score gets the biggest share, but smaller scores are not completely deleted.

That is why it is called soft max:

It behaves like choosing the maximum, but softly. Instead of selecting only one winner, it distributes attention across all options.

Given scores:

[-2, 8, 8]

softmax turns them into something close to:

[0.0, 0.5, 0.5]

The first value is not literally zero (it is about 0.00002), but it is small enough that the intuition holds.
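Here is a minimal sketch of softmax in plain Python, applied to the two score lists used in this section. Subtracting the maximum before exponentiating is a common numerical-stability habit that the text does not mention; it leaves the result unchanged.

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the total."""
    # Subtracting the maximum first does not change the result,
    # but it keeps exp() from overflowing on large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

budget = softmax([1, 3, 6])  # raw scores for love, Apple, phones
print([round(p, 3) for p in budget])  # roughly [0.006, 0.047, 0.946]

sharp = softmax([-2, 8, 8])
print(sharp)  # first entry is tiny, the other two are about 0.5 each
```

Note how strongly the exponential favors the largest score: a raw score of 6 against 3 and 1 takes almost the entire budget.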

Note

Dot product answers: "How strongly are these two words related?"

Softmax answers: "Compared with all other words, what percentage of attention should this word receive?"

Now we can build a first version of self-attention.

7. First Version of Self-Attention: Weighted Mixing

For one target word, self-attention does this:

Compare the target word with every word in the sentence.
Convert similarity scores into attention weights.
Multiply each word vector by its attention weight.
Add the weighted vectors together.

For Apple:

\text{Apple'} = x_1 \cdot \text{love} + x_2 \cdot \text{Apple} + x_3 \cdot \text{phones}

where:

[x_1, x_2, x_3] = \text{softmax}( [\text{Apple} \cdot \text{love}, \text{Apple} \cdot \text{Apple}, \text{Apple} \cdot \text{phones}] )

This same process happens for every word.
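The four steps can be sketched as one small function. This is only the parameter-free first version described here, not Transformer attention; the function name `simple_self_attention` is made up for this sketch, and the vectors are the toy ones from the dot product example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(vectors):
    """Replace each vector by a softmax-weighted mix of all vectors,
    where the weights come from its dot product with each of them."""
    outputs = []
    for target in vectors:
        weights = softmax([dot(target, other) for other in vectors])
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(target))]
        outputs.append(mixed)
    return outputs

# Toy vectors from the dot product example.
love, apple, phones = [-4, 3], [2, 2], [0, 4]
new_love, new_apple, new_phones = simple_self_attention([love, apple, phones])
print(new_apple)  # close to [1.0, 3.0]: an almost even mix of Apple and phones
```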

flowchart TD
    subgraph Sentence [Sentence vectors]
        L["love"]
        A["Apple"]
        P["phones"]
    end

    A --> C1["compare Apple with love"]
    A --> C2["compare Apple with Apple"]
    A --> C3["compare Apple with phones"]

    C1 --> SM["softmax"]
    C2 --> SM
    C3 --> SM

    SM --> M["weighted sum of word vectors"]
    L --> M
    A --> M
    P --> M

    M --> OUT["Apple'"]

The Gravity Analogy

Think of words as objects in meaning-space.

Attention lets nearby or relevant words pull each other.

If Apple is strongly related to phones, then phones pulls the Apple representation toward the technology region.

If Apple is strongly related to juice, then juice pulls the Apple representation toward the fruit region.

This is the heart of self-attention:

Every word becomes a context-shaped mixture of the words around it.

8. But This First Version Has Two Problems

The simple version above is intuitive, but it is not enough for a trainable Transformer.

Problem 1: No Learnable Parameters

So far, we only used existing word embeddings and fixed math operations.

That means the model cannot learn what kind of similarity matters for a task.

Consider:

I love Apple watch.

The word watch can mean:

to observe
a timekeeping device
a smartwatch product

The general embedding of watch might sit between "observe" and "clock." The general embedding of Apple might sit between "fruit" and "technology company."

Without trainable parameters, the model may fail to learn that Apple watch refers to a product category.

We want the model to learn:

In this task and this data, these relationships matter more than those relationships.

Problem 2: The Same Vector Is Used For Everything

In the simple version, the same embedding plays three roles:

It asks what to look for.
It advertises what it contains.
It contributes information to the final mixture.

That creates symmetry.

The raw score from Apple to phones is exactly the same as the raw score from phones to Apple, because the dot product does not care about order: $a \cdot b = b \cdot a$.

But language is often directional.

Maybe phones should strongly shape Apple, because it clarifies Apple as a company. But Apple should not necessarily reshape phones just as strongly.

So we need different learned versions of each word vector.
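A small sketch makes the symmetry visible. The embeddings and the two matrices `W_ask` and `W_match` are hypothetical values chosen by hand, not learned ones; they only show that using different projections for the "asking" and "matching" roles breaks the symmetry.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Hypothetical 2-D embeddings for: love, Apple, phones.
X = [[1, 0], [1, 1], [0, 2]]

# Same vector in both roles: the score matrix X X^T is symmetric.
plain = matmul(X, transpose(X))
print(plain)  # [[1, 1, 0], [1, 2, 2], [0, 2, 4]]

# Separate (hypothetical) projections for the asking and matching roles.
W_ask = [[1, 0], [0, 1]]
W_match = [[1, 1], [0, 1]]
asking = matmul(X, W_ask)
matching = matmul(X, W_match)
directed = matmul(asking, transpose(matching))
print(directed)  # [[1, 1, 0], [2, 3, 2], [2, 4, 4]]

# Apple -> phones is now scored 2, while phones -> Apple is scored 4.
```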

That is why Transformers create:

Query vectors
Key vectors
Value vectors

9. Query, Key, Value: The Three Roles

From every word embedding, self-attention creates three new vectors:

Q = XW_Q

K = XW_K

V = XW_V

where:

$X$ is the matrix of word embeddings.
$W_Q$, $W_K$, and $W_V$ are learnable weight matrices.

The same original word now has three learned projections.

Vector	Intuitive role	Question it helps answer
Query	what this word is looking for	"What information do I need?"
Key	what this word can match against	"Should others pay attention to me?"
Value	what this word contributes	"If selected, what information do I pass on?"

flowchart LR
    E["word embedding"] --> WQ["multiply by W_Q"]
    E --> WK["multiply by W_K"]
    E --> WV["multiply by W_V"]

    WQ --> Q["Query"]
    WK --> K["Key"]
    WV --> V["Value"]

The weight matrices are learned during training.

This solves both problems:

The model now has trainable parameters.
The model can use different vectors for matching and information mixing.
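As a rough sketch, here is one embedding turned into its three projections. The embedding and the weight matrices are hypothetical hand-picked numbers (real ones are learned), and `project` is an illustrative helper name.

```python
def project(x, W):
    """Row vector times matrix: x W."""
    return [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]

# Hypothetical 2-D embedding and weight matrices (real ones are learned).
x_apple = [1, 1]
W_Q = [[1, 0], [0, 1]]
W_K = [[1, 1], [0, 1]]
W_V = [[1, 0], [1, 1]]

q = project(x_apple, W_Q)  # what Apple is looking for
k = project(x_apple, W_K)  # what Apple offers for matching
v = project(x_apple, W_V)  # what Apple contributes when attended to

print(q, k, v)  # [1, 1] [1, 2] [2, 1]
```

One input vector, three different outputs: that is all "three roles" means at the level of arithmetic.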

10. Attention For One Word Using Q, K, V

Suppose the sentence is:

love Apple phones

To create the new representation of Apple:

Take the query of Apple: $q_{\text{Apple}}$.
Compare it with every key: $k_{\text{love}}$, $k_{\text{Apple}}$, $k_{\text{phones}}$.
Convert the scores into attention weights using softmax.
Use those weights to mix the value vectors: $v_{\text{love}}$, $v_{\text{Apple}}$, $v_{\text{phones}}$.

The score calculation is:

q_{\text{Apple}} \cdot k_{\text{love}}

q_{\text{Apple}} \cdot k_{\text{Apple}}

q_{\text{Apple}} \cdot k_{\text{phones}}

Then:

[\alpha_1, \alpha_2, \alpha_3] = \text{softmax}([ q_{\text{Apple}}k_{\text{love}}^T, q_{\text{Apple}}k_{\text{Apple}}^T, q_{\text{Apple}}k_{\text{phones}}^T ])

Finally:

\text{Apple'} = \alpha_1 v_{\text{love}} + \alpha_2 v_{\text{Apple}} + \alpha_3 v_{\text{phones}}

flowchart TD
    QA["Query: Apple"] --> S1["score with Key: love"]
    QA --> S2["score with Key: Apple"]
    QA --> S3["score with Key: phones"]

    S1 --> SM["softmax"]
    S2 --> SM
    S3 --> SM

    SM --> A1["weight alpha_1"]
    SM --> A2["weight alpha_2"]
    SM --> A3["weight alpha_3"]

    V1["Value: love"] --> SUM["weighted sum"]
    V2["Value: Apple"] --> SUM
    V3["Value: phones"] --> SUM

    A1 --> SUM
    A2 --> SUM
    A3 --> SUM

    SUM --> OUT["new Apple representation"]

Notice the separation:

Queries and keys decide attention weights.
Values provide the content that gets mixed.
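The three steps for a single target word can be sketched as follows. The query, key, and value vectors are hypothetical numbers picked to keep the arithmetic small, and the scores are left unscaled, matching the formulas in this section.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query, key, and value vectors (in a real model these come
# from learned projections of the embeddings).
q_apple = [1, 0]
keys = {"love": [0, 1], "Apple": [1, 1], "phones": [2, 0]}
values = {"love": [1, 0], "Apple": [0, 1], "phones": [1, 1]}

# 1. Score the Apple query against every key.
scores = [dot(q_apple, k) for k in keys.values()]  # [0, 1, 2]
# 2. Turn the scores into attention weights.
alphas = softmax(scores)
# 3. Mix the value vectors with those weights.
new_apple = [sum(a * v[i] for a, v in zip(alphas, values.values()))
             for i in range(2)]

print(scores)
print([round(a, 2) for a in alphas])
print([round(c, 2) for c in new_apple])
```

The keys never appear in the output directly. They only influence how much of each value gets through.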

11. The Same Process Happens For Every Word

To create the new representation of love, use the query of love.

To create the new representation of Apple, use the query of Apple.

To create the new representation of phones, use the query of phones.

The keys and values for all words are available to all target words.

flowchart TD
    LQ["q_love"] --> O1["love'"]
    AQ["q_Apple"] --> O2["Apple'"]
    PQ["q_phones"] --> O3["phones'"]

    K["all keys: love, Apple, phones"] --> O1
    K --> O2
    K --> O3

    V["all values: love, Apple, phones"] --> O1
    V --> O2
    V --> O3

Each word asks its own question, attends to the sentence, and receives a custom mixture of value vectors.

That is why self-attention creates one new contextual vector per input token.

12. Matrix Form: How Computers Actually Do It

The word-by-word explanation is useful for intuition.

But computers prefer matrices.

Suppose the sentence has 3 tokens:

love Apple phones

And suppose each word embedding has 512 dimensions.

Then the input matrix is:

X \in \mathbb{R}^{3 \times 512}

Each row is one word vector:

Row	Token
1	love
2	Apple
3	phones

Now multiply by learned matrices:

W_Q \in \mathbb{R}^{512 \times 64}

W_K \in \mathbb{R}^{512 \times 64}

W_V \in \mathbb{R}^{512 \times 64}

So:

Q = XW_Q \in \mathbb{R}^{3 \times 64}

K = XW_K \in \mathbb{R}^{3 \times 64}

V = XW_V \in \mathbb{R}^{3 \times 64}

flowchart LR
    X["X: 3 x 512"] --> WQ["W_Q: 512 x 64"]
    X --> WK["W_K: 512 x 64"]
    X --> WV["W_V: 512 x 64"]

    WQ --> Q["Q: 3 x 64"]
    WK --> K["K: 3 x 64"]
    WV --> V["V: 3 x 64"]

Similarity Matrix

To compare every query with every key:

QK^T

Shape:

(3 \times 64)(64 \times 3) = 3 \times 3

This produces a score matrix:

	key: love	key: Apple	key: phones
query: love	score	score	score
query: Apple	score	score	score
query: phones	score	score	score

Each row says:

For this target word, how much does it match each word in the sentence?

Then we apply softmax across each row, turning scores into attention weights.

Finally:

\text{softmax}(QK^T)V

Shape:

(3 \times 3)(3 \times 64) = 3 \times 64

The result is a matrix of new contextual representations:

Row	New contextual vector
1	love'
2	Apple'
3	phones'
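The shape bookkeeping can be verified with a short sketch. Random numbers stand in for real embeddings and learned weights, so only the shapes are meaningful here; the names `d_model` and `random_matrix` are assumptions of this sketch, not terms from the text.

```python
import math
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def shape(m):
    return (len(m), len(m[0]))

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    # Small random values stand in for embeddings and learned weights.
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, d_k = 3, 512, 64

X = random_matrix(n_tokens, d_model, rng)
W_Q = random_matrix(d_model, d_k, rng)
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
scores = matmul(Q, transpose(K))
weights = [softmax(row) for row in scores]  # row-wise softmax
output = matmul(weights, V)

print(shape(Q), shape(K), shape(V))  # (3, 64) (3, 64) (3, 64)
print(shape(scores))                 # (3, 3)
print(shape(output))                 # (3, 64)
```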

flowchart TD
    Q["Q: queries"] --> DOT["Q times K^T"]
    K["K: keys"] --> DOT
    DOT --> SCORES["score matrix"]
    SCORES --> SCALE["divide by sqrt(d_k)"]
    SCALE --> SM["row-wise softmax"]
    SM --> ATT["attention matrix"]
    V["V: values"] --> MIX["attention matrix times V"]
    ATT --> MIX
    MIX --> OUT["contextual word vectors"]

Tiny Matrix Example With Real Numbers

Let us use very small vectors so we can see the whole calculation.

Sentence:

love Apple phones

Suppose each word embedding has only 2 dimensions:

X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 2 \end{bmatrix}

So:

Row	Token	Embedding
1	love	$[1,0]$
2	Apple	$[1,1]$
3	phones	$[0,2]$

Now choose tiny learned matrices:

W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} ,\quad W_K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} ,\quad W_V = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}

These numbers are just a toy example.

They were not derived from the sentence by hand. They were chosen here only so the arithmetic stays small enough to see.

In a real Transformer, $W_Q$, $W_K$, and $W_V$ are learnable parameters.

At the beginning of training, they usually start as small random numbers. Then the model slowly changes them to reduce prediction error.

flowchart TD
    A["start with random W_Q, W_K, W_V"] --> B["calculate Q, K, V"]
    B --> C["run attention"]
    C --> D["make prediction"]
    D --> E["compare with correct answer"]
    E --> F["calculate loss"]
    F --> G["backpropagation finds useful changes"]
    G --> H["optimizer updates W_Q, W_K, W_V"]
    H --> B

So these matrices are not manually designed. They are discovered through training.

Their roles are:

Matrix	What it learns
$W_Q$	how to create a query: what this token should look for
$W_K$	how to create a key: what this token offers for matching
$W_V$	how to create a value: what information this token should pass forward

In this toy example:

W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

is the identity matrix, so it keeps the query vector the same as the original embedding. That is only for simplicity.

Now calculate:

Q = XW_Q = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 2 \end{bmatrix}

K = XW_K = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 0 & 2 \end{bmatrix}

V = XW_V = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 2 & 2 \end{bmatrix}

Now compare every query with every key:

QK^T = \begin{bmatrix} 1 & 1 & 0 \\ 2 & 3 & 2 \\ 2 & 4 & 4 \end{bmatrix}

This is the raw attention score matrix.

	key: love	key: Apple	key: phones
query: love	1	1	0
query: Apple	2	3	2
query: phones	2	4	4

Because each query/key vector has 2 dimensions:

d_k = 2

So we scale by:

\sqrt{d_k} = \sqrt{2}

The scaled score matrix is:

\frac{QK^T}{\sqrt{2}}

After applying softmax to each row, we get an approximate attention matrix:

\text{softmax} \left( \frac{QK^T}{\sqrt{2}} \right) \approx \begin{bmatrix} 0.40 & 0.40 & 0.20 \\ 0.25 & 0.50 & 0.25 \\ 0.11 & 0.45 & 0.45 \end{bmatrix}

Each row sums to 1 (the rounded values shown here can be off by 0.01).

The second row is the attention pattern for Apple:

[0.25, 0.50, 0.25]

Meaning:

Value vector	Weight
love	0.25
Apple	0.50
phones	0.25

Now mix the value vectors:

\text{Apple'} = 0.25 \begin{bmatrix} 1 & 0 \end{bmatrix} + 0.50 \begin{bmatrix} 2 & 1 \end{bmatrix} + 0.25 \begin{bmatrix} 2 & 2 \end{bmatrix}

So:

\text{Apple'} = \begin{bmatrix} 1.75 & 1.00 \end{bmatrix}

This is the new contextual representation of Apple.

If we multiply the whole attention matrix by $V$, we get the contextual representation for every token:

\text{softmax} \left( \frac{QK^T}{\sqrt{2}} \right) V \approx \begin{bmatrix} 1.60 & 0.80 \\ 1.75 & 1.00 \\ 1.89 & 1.34 \end{bmatrix}

Row	Token	New contextual vector
1	love	$[1.60,0.80]$
2	Apple	$[1.75,1.00]$
3	phones	$[1.89,1.34]$

The important idea:

$Q$ and $K$ decide how much attention to give. $V$ provides the information that gets mixed.
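The whole toy calculation can be reproduced with a short sketch in plain Python. The matrices are the ones from this example; the function name `self_attention` and the list-of-rows representation are choices made for this sketch, and a real implementation would use a tensor library instead.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention on a small matrix of embeddings."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # row-wise softmax
    return weights, matmul(weights, V)

X = [[1, 0], [1, 1], [0, 2]]  # love, Apple, phones
W_Q = [[1, 0], [0, 1]]
W_K = [[1, 1], [0, 1]]
W_V = [[1, 0], [1, 1]]

weights, output = self_attention(X, W_Q, W_K, W_V)
for row in weights:
    print([round(w, 2) for w in row])
for row in output:
    print([round(o, 2) for o in row])
```

The printed rows should agree with the attention matrix and the contextual vectors shown above, up to rounding.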

13. The Full Equation

The final scaled dot-product attention equation is:

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V

Read it slowly:

Part	Meaning
$QK^T$	compare every query with every key
$\sqrt{d_k}$	scale scores so they do not become too large
softmax	convert scores into attention probabilities
multiply by $V$	mix information using those probabilities

Note

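The derivation above first builds the equation without the division by $\sqrt{d_k}$; the original paper includes it. Without scaling, dot products tend to grow in magnitude as the key/query dimension grows, which pushes softmax toward near one-hot outputs with very small gradients. Dividing by $\sqrt{d_k}$ keeps the scores in a moderate range.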
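A short derivation shows where the $\sqrt{d_k}$ comes from. It rests on an assumption that this note does not state: the components $q_i$ and $k_i$ of a query and a key are independent, with mean 0 and variance 1.

```latex
q \cdot k = \sum_{i=1}^{d_k} q_i k_i,
\qquad
\mathbb{E}[q \cdot k] = 0,
\qquad
\operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k

\operatorname{Var}\left( \frac{q \cdot k}{\sqrt{d_k}} \right) = \frac{d_k}{d_k} = 1
```

Under that assumption, unscaled scores have a typical size of about $\sqrt{d_k}$ (about 8 when $d_k = 64$), while scaled scores stay around 1 regardless of the dimension.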

14. Self-Attention As A Story

For each word, self-attention performs this inner dialogue:

"I am this word. What do I need to understand myself in this sentence?"
"Let me compare my query with every word's key."
"The stronger the match, the more attention I give."
"Now I collect information from the value vectors."
"My new meaning is the weighted mixture of what mattered."

sequenceDiagram
    participant T as Target word
    participant Q as Query
    participant K as All keys
    participant S as Softmax
    participant V as All values
    participant O as New vector

    T->>Q: create query
    Q->>K: compare with every key
    K->>S: send similarity scores
    S->>V: produce attention weights
    V->>O: weighted sum of values

This is why the name is self-attention:

The sentence attends to itself. Each token looks at other tokens from the same input.

15. Why Learnable Q, K, V Matter

Without $W_Q$, $W_K$, and $W_V$, attention is only a fixed similarity operation.

With them, the model can learn many kinds of relationship:

Relationship type	Example
product relationship	Apple -> watch
grammar relationship	adjective -> noun
reference relationship	pronoun -> earlier noun
semantic relationship	juice -> fruit
task-specific relationship	question word -> answer span

The model learns which features should be emphasized in queries, keys, and values.

For one task, it may learn that product names matter.

For another task, it may learn that grammar roles matter.

For another, it may learn that long-distance references matter.

That is the power of trainable attention.

16. The Two Big Benefits

Benefit 1: Parallel Processing

RNNs process tokens step by step:

flowchart LR
    W1["word 1"] --> W2["word 2"] --> W3["word 3"] --> W4["word 4"] --> W5["word 5"]

Self-attention compares tokens using matrix multiplication:

flowchart TD
    W1["word 1"] --> A["attention layer"]
    W2["word 2"] --> A
    W3["word 3"] --> A
    W4["word 4"] --> A
    W5["word 5"] --> A
    A --> O1["word 1'"]
    A --> O2["word 2'"]
    A --> O3["word 3'"]
    A --> O4["word 4'"]
    A --> O5["word 5'"]

The output for each token does not need the previous output to be computed first. That makes attention highly parallelizable on GPUs.

Important nuance:

Self-attention removes the sequential dependency between positions, but that does not make the total computation small. For $n$ tokens, the attention score matrix has $n^2$ pairwise comparisons. The advantage is that these comparisons can be computed in parallel.
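A small sketch illustrates both points. The Q, K, V values are hypothetical, and `attend_one` is a made-up helper name. Because each output row needs only its own query plus all keys and values, the rows can be computed in any order (here, forward and reverse) with identical results, while the number of scores still grows as $n^2$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_one(i, Q, K, V):
    """Output row i depends on query i and on all keys and values,
    but not on any other output row."""
    weights = softmax([dot(Q[i], k) / math.sqrt(len(K[0])) for k in K])
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

# Hypothetical Q, K, V for a 3-token sentence.
Q = [[1, 0], [1, 1], [0, 2]]
K = [[1, 1], [1, 2], [0, 2]]
V = [[1, 0], [2, 1], [2, 2]]

n = len(Q)
in_order = [attend_one(i, Q, K, V) for i in range(n)]
# Compute the rows in reverse order instead: the results are identical,
# which is why the rows can be handed to parallel hardware.
reverse_order = {i: attend_one(i, Q, K, V) for i in reversed(range(n))}

print(all(in_order[i] == reverse_order[i] for i in range(n)))  # True
print(n * n)  # number of query-key scores: 9 for 3 tokens
```

An RNN cannot be reordered this way, because step $t$ needs the hidden state produced by step $t-1$.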

Benefit 2: Long-Range Dependencies

In an RNN, information from an early word must travel step by step through the sequence.

In self-attention, any word can directly compare with any other word in one attention layer.

flowchart LR
    A["word at start"] --- Z["word at end"]
    A --- M1["middle word 1"]
    A --- M2["middle word 2"]
    Z --- M1
    Z --- M2

This helps with sentences where meaning depends on words far apart.

Example:

The Apple device that I bought last year, after months of comparing music players, voicemail features, and battery life, was a phone.

The word Apple can directly attend to phone, even if many words separate them.

17. The Whole Mechanism In One Page

flowchart TD
    X["Input tokens"] --> E["Embeddings"]
    E --> Q["Q = X W_Q"]
    E --> K["K = X W_K"]
    E --> V["V = X W_V"]

    Q --> S["Scores = Q K^T"]
    K --> S
    S --> SC["Scale by sqrt(d_k)"]
    SC --> P["Softmax -> attention weights"]
    P --> O["Weighted sum: weights times V"]
    V --> O
    O --> C["Contextual token representations"]

Self-attention is just this sequence of steps:

Create query, key, and value vectors.
Use queries and keys to compute relevance.
Use softmax to turn relevance into attention.
Use attention to mix values.
Return context-aware vectors.

18. Example: Apple Phones vs Apple Juice

Let the original embedding of apple contain both technology and fruit meaning.

flowchart TD
    A["apple: mixed meaning"] --> T["technology direction"]
    A --> F["fruit direction"]

Sentence 1: I love Apple phones

The word phones receives a high attention weight when Apple is being updated.

flowchart LR
    L["love"] -- "low weight" --> A2["Apple'"]
    A["Apple"] -- "medium weight" --> A2
    P["phones"] -- "high weight" --> A2
    A2 --> TECH["more technology-like"]

Sentence 2: I love apple juice

The word juice receives a high attention weight when apple is being updated.

flowchart LR
    L["love"] -- "low weight" --> A2["apple'"]
    A["apple"] -- "medium weight" --> A2
    J["juice"] -- "high weight" --> A2
    A2 --> FRUIT["more fruit-like"]

Same starting word. Different neighboring words. Different final representation.

That is contextual meaning.

19. Common Mistakes

Mistake 1: "Attention weights are the final meaning."

No. Attention weights only say how strongly to mix value vectors. The final meaning is the weighted sum of values.

Mistake 2: "Dot product alone is self-attention."

No. Dot product gives scores. Self-attention also needs softmax, values, and learned projections.

Mistake 3: "Self-attention is always cheap."

No. It is parallelizable, but the attention matrix is $n \times n$, so compute and memory grow quadratically with sequence length. Long sequences can be expensive.

Mistake 4: "Q, K, and V are three different words."

No. For each token, Q, K, and V are three learned projections of the same input representation.

Mistake 5: "The word only attends to nearby words."

No. In full self-attention, every token can attend to every other token, unless a mask restricts it.

20. Glossary

Term	Meaning
Token	A piece of text processed by the model, often a word or word fragment
Embedding	A vector representation of a token
Static embedding	Same vector for a word regardless of context
Contextual representation	A vector that changes depending on surrounding tokens
Attention score	Raw similarity between a query and a key
Attention weight	Softmax-normalized score used for mixing values
Query	Learned vector used to ask what the token needs
Key	Learned vector used to match against queries
Value	Learned vector whose information gets mixed into the output
$W_Q, W_K, W_V$	Learnable matrices that create Q, K, and V
$QK^T$	Matrix of pairwise query-key scores
Softmax	Function that converts scores into probabilities
$d_k$	Dimension of key/query vectors

21. Memory Hooks

The Restaurant Analogy

Imagine every word is a customer at a restaurant.

Query: what the customer wants to eat.
Key: what each dish says it can satisfy.
Value: the actual dish served.

The customer compares their query with all keys, chooses a weighted mixture, and receives a plate made from values.

The Search Analogy

Imagine searching in a library.

Query: your search phrase.
Keys: labels on books.
Values: the actual content inside the books.

You match your query against labels, then read more from the books with stronger matches.

The Meaning Gravity Analogy

Words pull each other in meaning-space. Strongly related words pull more. Weakly related words pull less.

Self-attention learns which pulls should matter.

22. Active Recall

Answer these without looking above.

Why is a static word embedding not enough for the word "apple"?
What does it mean to create a contextual word representation?
Why do we need attention weights to sum to 1?
What does the dot product measure in attention?
Why do we apply softmax after dot products?
What problem is solved by learnable matrices $W_Q$, $W_K$, and $W_V$?
What is the difference between a key and a value?
What is the shape of $QK^T$ if a sentence has 100 tokens?
Why is self-attention easier to parallelize than an RNN?
Why is self-attention useful for long-range dependencies?

23. Answers

Because the same word can mean different things in different contexts. "Apple phones" and "apple juice" should not use exactly the same final representation for Apple.
It means creating a new vector for a word based on the other words in the same input.
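Because the weights are used like probabilities for mixing word information. When they sum to 1, the new vector is a weighted average of the input vectors, so its size does not depend on how large the raw scores happen to be.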
It measures similarity or compatibility between two vectors.
Because raw dot products can be negative or very large. Softmax converts them into non-negative probabilities that sum to 1.
They let the model learn task-specific ways to compare and mix words.
A key is used for matching against a query. A value is the information that gets mixed into the output.
$100 \times 100$.
Because each token's output can be computed from the same input matrices without waiting for previous token outputs.
Because any token can directly attend to any other token, even if they are far apart in the sequence.

24. Final Summary

Self-attention begins with a simple need:

A word should mean what it means in this sentence, not only what it usually means.

To solve that, each word creates a new representation by mixing information from the other words.

The model decides how much to mix using attention weights.

Those weights come from query-key similarity:

QK^T

The scores become probabilities through softmax:

\text{softmax}(QK^T)

The probabilities mix value vectors:

\text{softmax}(QK^T)V

And the full scaled equation is:

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V

That is self-attention:

A trainable way for every token to look at every other token and rebuild itself using context.
