Linear Transformation in Self-Attention - The Deep Dive

Transformers in Deep Learning · Part 3 · Sibling note: Self Attention in Transformers - The Deep Dive

Q: Why is multiplying by a matrix the secret ingredient of self-attention?

In Part 2 we discovered that every word embedding gets multiplied by three learnable matrices, $W_Q$, $W_K$, and $W_V$. We accepted that on faith. This note pays off the debt.

We will answer one question, slowly and completely:

Multiplying a vector by a matrix is called a linear transformation. Why is this one small operation a game-changer for self-attention?

By the end you will be able to prove, with tiny hand-checkable numbers, why adding these matrices makes the captured similarities more relevant and the new word representations more accurate.


Book Map

flowchart TD
    A["Recap: mix each word with its neighbours"] --> B["The numbers are similarities (dot product)"]
    B --> C["Pure math breaks on 'Apple watch'"]
    C --> D["Two problems with no learnable parameters"]
    D --> E["Fix: multiply by W_Q, W_K, W_V"]
    E --> F["What IS a linear transformation?"]
    F --> G["Power 1: change the dimension"]
    F --> H["Power 2: extract & magnify features"]
    G --> I["Worked example: Apple, Orange, Watch"]
    H --> I
    I --> J["Re-test the similarity: 136 vs 64"]
    J --> K["Where do the weights come from? Training"]
    K --> L["The Value vector finishes the job"]
    L --> M["The staggering scale: ~3 million params per word"]

The destination is a feeling, not a formula:

A weight matrix is a lens. It throws away the features a task does not care about, magnifies the ones it does, and in doing so lets words that should be related finally recognise each other.


1. Recap: A Word Is the Company It Keeps

Quick reminder of where Part 2 left us. The whole reason self-attention exists is that the same word means different things in different sentences.

| Sentence | Meaning of Apple |
| --- | --- |
| I love Apple phones. | Apple the technology company |
| I love apple juice. | Apple the fruit |

A single static embedding of apple — say 50% technology, 50% fruit — is wrong for both sentences. It is too vague near "phones" and too vague near "juice".

The fix was to rebuild each word as a blend of the other words in its sentence:

$$\text{Apple}' = 0.1 \cdot \text{love} + 0.5 \cdot \text{apple} + 0.4 \cdot \text{phones}$$

Because we pour in a little phones (a technology word), the blend drifts toward the technology side. In the other sentence:

$$\text{Apple}' = 0.1 \cdot \text{love} + 0.4 \cdot \text{apple} + 0.5 \cdot \text{juice}$$

the juice drags it toward fruit. Same starting word, different neighbours, different meaning.

flowchart LR
    L["love"] -- "0.1" --> AP["Apple' (phones context)"]
    A["apple"] -- "0.5" --> AP
    P["phones"] -- "0.4" --> AP
    AP --> T["drifts toward TECHNOLOGY"]

    L2["love"] -- "0.1" --> AJ["Apple' (juice context)"]
    A2["apple"] -- "0.4" --> AJ
    J["juice"] -- "0.5" --> AJ
    AJ --> F["drifts toward FRUIT"]
Note

Each word becomes a weighted mixture of its sentence-mates. The weights are how much attention to pay.
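
The blend above can be sketched in a few lines of Python. This is only an illustration: the 2-D embeddings (`[technology, fruit]`) and the helper name `blend` are assumptions made up for this sketch, not values from Part 2.

```python
def blend(weights, vectors):
    """Return the weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

# Hypothetical 2-D embeddings: [technology, fruit]. Illustrative values only.
love = [0.0, 0.0]
apple = [0.5, 0.5]    # the vague static embedding: half tech, half fruit
phones = [1.0, 0.0]
juice = [0.0, 1.0]

# Same mixing weights as the two formulas in this section.
apple_near_phones = blend([0.1, 0.5, 0.4], [love, apple, phones])
apple_near_juice = blend([0.1, 0.4, 0.5], [love, apple, juice])

print("Apple' next to phones:", apple_near_phones)  # technology component dominates
print("Apple' next to juice: ", apple_near_juice)   # fruit component dominates
```

With these made-up vectors the first blend leans toward technology and the second toward fruit, which is the whole point of mixing in the neighbours.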


2. The Numbers Were Never Magic — They Are Similarities

Those weights ($0.1$, $0.4$, $0.5$) are not decoration. Each one is the relationship between a pair of words.

  • $0.4$ = how related apple is to phones
  • $0.5$ = how related apple is to itself
  • $0.1$ = how related apple is to love

And "relationship" has a precise meaning in vector space:

Two words close together in embedding space are similar. Two words far apart are unrelated — sometimes even opposite.

Mathematics already has a similarity score for two vectors: the dot product. Loosely speaking, each weight comes from one dot product:

$$0.4 = e_{\text{apple}} \cdot e_{\text{phones}}, \quad 0.5 = e_{\text{apple}} \cdot e_{\text{apple}}, \quad 0.1 = e_{\text{apple}} \cdot e_{\text{love}}$$

But a dot product can be any number — large, small, or negative. To use the scores as mixing weights we need them to be non-negative and to sum to 1. That is the job of softmax (built from first principles in Part 2).

flowchart LR
    E["embeddings"] --> D["dot products = raw similarities"]
    D --> S["softmax"]
    S --> W["weights that sum to 1"]
    W --> M["weighted sum of embeddings"]
    M --> N["new contextual word"]

So the pure-math recipe is: dot-product similarities → softmax → weighted sum. Clean, elegant... and, as we are about to see, not quite good enough.
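
As a rough sketch, the parameter-free recipe fits in one small function. The 2-D embeddings below are hypothetical, and the function name `parameter_free_attention` is invented for this illustration; the three steps (dot products, softmax, weighted sum) are the ones named above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def parameter_free_attention(query, embeddings):
    """dot-product similarities -> softmax -> weighted sum. No trainable weights."""
    scores = [dot(query, e) for e in embeddings]
    weights = softmax(scores)
    dim = len(query)
    mixed = [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]
    return weights, mixed

# Hypothetical 2-D embeddings: [technology, fruit].
love = [0.1, 0.1]
apple = [1.0, 1.0]
phones = [2.0, 0.0]

weights, new_apple = parameter_free_attention(apple, [love, apple, phones])
print("weights:", [round(w, 3) for w in weights])
print("new apple:", [round(v, 3) for v in new_apple])
```

Note that nothing in this function can be tuned: the output is fully determined by the input embeddings.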


3. The Crack in the Plan: "I love Apple watch"

Everything above uses only the raw embeddings and fixed arithmetic. No trainable parameters appear anywhere. Let us stress-test that with a new sentence:

I love Apple watch.

Ideally, the new representation of watch should come out meaning Apple Smartwatch — it should carry Apple's technology-company properties plus smartwatch properties. Does the pure-math recipe deliver that? Let us look at where these words actually sit.

quadrantChart
    title Raw embedding space (no learnable parameters)
    x-axis "Fruit" --> "Technology"
    y-axis "Physical object" --> "Time / Watching"
    quadrant-1 "Smart wearables"
    quadrant-2 "Time / actions"
    quadrant-3 "Fruits & food"
    quadrant-4 "Tech devices"
    "apple": [0.45, 0.18]
    "orange": [0.12, 0.12]
    "phone": [0.9, 0.22]
    "watch": [0.4, 0.9]
    "clock": [0.22, 0.95]
    "observe": [0.32, 0.82]

(Plain-text fallback, in case the quadrant chart does not render: apple sits mid-way between fruit and technology, low down among physical objects. watch sits high up near "clock / observe / time". The two are far apart — especially on the vertical axis.)

Because watch lives up near clock / observe / time, and apple lives down between fruit and tech company, the raw dot product between them is tiny. Meanwhile watch is most similar to... watch itself. So the recipe produces something like:

$$\text{watch}' = 0.1 \cdot \text{love} + 0.1 \cdot \text{apple} + 0.8 \cdot \text{watch}$$

The new watch is almost entirely the old watch. Two things have gone wrong.

Note

Problem 1: Too little Apple. Only a $0.1$ sliver of apple gets mixed in. The product meaning "Apple Smartwatch" never forms.

Problem 2: The wrong kind of Apple. Whatever sliver does get added carries both Apple-the-company and Apple-the-fruit properties. Ideally the new watch should contain no apple-fruit at all.

flowchart TD
    S["Sentence: love Apple watch"] --> R["Pure-math attention"]
    R --> P1["Problem 1: similarity(apple, watch) too low"]
    R --> P2["Problem 2: the Apple it mixes in is half-fruit"]
    P1 --> NEED["We need LEARNABLE parameters"]
    P2 --> NEED

Both problems have the same root: fixed arithmetic cannot learn what this task considers similar. We need knobs the model can tune, injected in two places:

  1. While measuring similarity (so apple and watch can learn to attract).
  2. While choosing what to mix in (so we add Apple-the-company, not Apple-the-fruit).

The knobs are the matrices $W_Q$, $W_K$, $W_V$. Multiplying a vector by such a matrix is what the "Attention Is All You Need" paper describes as a learned linear projection, that is, a linear transformation.


4. What Is a Linear Transformation?

Despite the intimidating name, you have met it before. Recall a single layer of a neural network:

$$\text{layer} = f\!\left(W x + b\right)$$

You multiply the input $x$ by a weight matrix $W$, add a bias $b$, then squash it through a nonlinear activation $f$ (like ReLU or $\tanh$).

A linear transformation is that exact operation with the two extra ingredients removed:

$$\boxed{\,y = W x\,}$$
Note

Drop the bias $b$. Drop the nonlinear activation $f$. What remains (vector times matrix) is a linear transformation. The crux of the operation is unchanged: the weights still reshape the vector.
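
To make the comparison concrete, here is a small sketch of both operations side by side. The matrix, bias, and input are arbitrary illustrative numbers, and ReLU stands in for the activation $f$; none of these values come from the video.

```python
def matvec(W, x):
    """Column convention: each row of W is dotted with the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def dense_layer(W, x, b):
    """Neural-network layer: f(Wx + b), with f = ReLU."""
    return relu([a + bi for a, bi in zip(matvec(W, x), b)])

def linear_transformation(W, x):
    """Same operation with the bias and the activation removed: y = Wx."""
    return matvec(W, x)

# Arbitrary illustrative values.
W = [[1.0, -2.0],
     [0.5, 1.0]]
x = [1.0, 1.0]
b = [0.5, 0.5]

print(dense_layer(W, x, b))              # [0.0, 2.0]
print(linear_transformation(W, x))       # [-1.0, 1.5]

# Linearity check: scaling the input scales the output by the same factor.
print(linear_transformation(W, [2.0, 2.0]))  # [-2.0, 3.0]
```

The last line shows the property that gives the operation its name: doubling the input doubles the output, which is not true once a bias or ReLU is involved.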

flowchart LR
    subgraph NN["Neural-network layer"]
        X1["x"] --> WB["W x + b"] --> ACT["activation f"]
    end
    subgraph LT["Linear transformation"]
        X2["x"] --> WO["W x"]
    end
    NN -. "drop b and f" .-> LT

This humble operation buys us two superpowers:

  1. It can change the dimensionality of a vector.
  2. It can extract and magnify the features that matter for a task.

Let us take them one at a time.


5. Superpower 1 — Changing the Dimensionality

Matrix multiplication can resize a vector. In the real Transformer, a $512$-dimensional embedding is squeezed down to a $64$-dimensional one:

$$\underbrace{x}_{1 \times 512} \; \times \; \underbrace{W}_{512 \times 64} \;=\; \underbrace{y}_{1 \times 64}$$

The inner dimensions ($512$) cancel; the outer dimensions ($1$ and $64$) survive. A $512$-vector walks in, a $64$-vector walks out.

Note

The video phrases it as a column vector: a $64\times512$ matrix times a $512\times1$ vector gives $64\times1$. This note uses the row convention ($1\times512$ times $512\times64$) to match the worked examples and the sibling note. They are transposes of each other, so the dimension change is identical. (The boxed $y = Wx$ in Section 4 is written in the column convention.)
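
A quick way to convince yourself that the two conventions agree is to run both on the same numbers. The sketch below uses random weights (a real model would use trained ones), and the helper names are made up for this illustration.

```python
import random

def row_times_matrix(x, W):
    """Row convention: x has length n, W is n x m. Returns a length-m vector."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def matrix_times_column(W_t, x):
    """Column convention: W_t is m x n, x has length n. Returns a length-m vector."""
    return [sum(row[i] * x[i] for i in range(len(x))) for row in W_t]

random.seed(0)
x = [random.gauss(0, 1) for _ in range(512)]                       # 1 x 512
W = [[random.gauss(0, 1) for _ in range(64)] for _ in range(512)]  # 512 x 64
W_t = [list(col) for col in zip(*W)]                               # 64 x 512 (transpose)

y_row = row_times_matrix(x, W)
y_col = matrix_times_column(W_t, x)

print(len(x), "->", len(y_row))                                    # 512 -> 64
print(all(abs(a - b) < 1e-9 for a, b in zip(y_row, y_col)))        # True
```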

We cannot picture $64$ dimensions, so shrink the idea to something we can see. Take a 3-D vector and collapse it to 2-D:

$$\begin{bmatrix} 3 & 2 & 2 \end{bmatrix} \;\xrightarrow{\;\text{linear transformation}\;}\; \begin{bmatrix} \cdot & \cdot \end{bmatrix}$$
flowchart LR
    V3["a point floating in 3-D space (x, y, z)"] -->|"multiply by a 3x2 matrix"| V2["the same point, now living on a 2-D plane (u, v)"]

The vector that lived in a roomy 3-D space now lives on a flat 2-D plane. That is dimensionality reduction — and it is only half the story. The interesting half is which information survives the squeeze.


6. Superpower 2 — Extracting and Magnifying Features

A $512$-D embedding is a grab-bag of features: technology, fruit, food, and ~500 other arbitrary directions. Most are irrelevant to any single task.

Imagine building a chatbot for an electronics store. You only care about technology features. Who cares whether "apple" is also a fruit?

A well-trained weight matrix acts like a filter that keeps the technology-flavoured features and discards the rest. The weights govern what comes out:

Note

If the weights attached to technology features are large, the output is dominated by technology. The other 500 features get tiny weights and are effectively discarded. The result is a vector that is a clean measure of technology-ness.

flowchart LR
    subgraph IN["512 raw features"]
        F1["technology"]
        F2["fruit"]
        F3["food"]
        F4["...500 more..."]
    end
    IN -->|"W: big weights on tech, tiny on the rest"| OUT

    subgraph OUT["64 richer features"]
        G1["is_smart"]
        G2["is_callable"]
        G3["...tech-flavoured..."]
    end

The input had hundreds of random features. The output has a handful of richer, task-relevant features. The matrix did not just shrink the vector — it refined it.

Time to prove all of this with numbers small enough to check by hand.


7. Worked Example — Apple, Orange, Watch

To keep it visible, pretend every embedding has just 3 dimensions, measuring three features:

$$\text{features} = [\,\text{is\_tech},\; \text{is\_watch},\; \text{is\_fruit}\,]$$

| Word | is_tech | is_watch | is_fruit | Embedding |
| --- | --- | --- | --- | --- |
| Apple | 2 | 0 | 2 | $[2,0,2]$ |
| Orange | 0 | 0 | 3 | $[0,0,3]$ |
| Watch | 1 | 3 | 0 | $[1,3,0]$ |

Apple is both tech and fruit. Orange is purely fruit. Watch is tech-ish and very watch-ish.

Now pick a weight matrix. This one is meant to extract technology, so we deliberately load the is_tech row with big numbers and starve the is_fruit row:

$$W = \begin{bmatrix} \textbf{4} & \textbf{3} \\ 2 & 1 \\ 0 & 1 \end{bmatrix} \begin{array}{l} \;\leftarrow \text{is\_tech (big)} \\ \;\leftarrow \text{is\_watch} \\ \;\leftarrow \text{is\_fruit (small)} \end{array}$$

Each output value is the dot product of the word with a column of $W$.

Apple $[2,0,2]$:

$$\begin{aligned} \text{col 1:} &\quad 2(4) + 0(2) + 2(0) = 8 \\ \text{col 2:} &\quad 2(3) + 0(1) + 2(1) = 8 \end{aligned} \quad\Rightarrow\quad \text{Apple} \to [\,8,\,8\,]$$

Orange $[0,0,3]$:

$$\begin{aligned} \text{col 1:} &\quad 0(4) + 0(2) + 3(0) = 0 \\ \text{col 2:} &\quad 0(3) + 0(1) + 3(1) = 3 \end{aligned} \quad\Rightarrow\quad \text{Orange} \to [\,0,\,3\,]$$

Watch $[1,3,0]$:

$$\begin{aligned} \text{col 1:} &\quad 1(4) + 3(2) + 0(0) = 10 \\ \text{col 2:} &\quad 1(3) + 3(1) + 0(1) = 6 \end{aligned} \quad\Rightarrow\quad \text{Watch} \to [\,10,\,6\,]$$

Collect the results:

| Word | Original | After linear transformation | Reading |
| --- | --- | --- | --- |
| Apple | $[2,0,2]$ | $[8,8]$ | high → strong tech |
| Orange | $[0,0,3]$ | $[0,3]$ | low → barely any tech |
| Watch | $[1,3,0]$ | $[10,6]$ | high → strong tech |

Note

Orange collapses to $[0,3]$, almost no technology, because its only strong feature (fruit) was starved by the matrix. Apple keeps a big $[8,8]$ because its technology feature was magnified. For an electronics-store model, the vector $[8,8]$ tells you far more about Apple the company than the raw $[2,0,2]$ ever did.

The new features need not be human-nameable. They might be is_smart or is_callable — whatever the training process found useful. The point stands: random features in, refined task-relevant features out.
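
The hand calculation for Apple, Orange, and Watch can be double-checked with a few lines of Python. The embeddings and the matrix are exactly the toy values from this section; only the helper name `transform` is invented here.

```python
def transform(e, W):
    """Row convention: dot the embedding e with each column of W."""
    return [sum(e[i] * W[i][j] for i in range(len(e))) for j in range(len(W[0]))]

# Rows of W: is_tech (big), is_watch, is_fruit (small).
W = [[4, 3],
     [2, 1],
     [0, 1]]

embeddings = {
    "Apple":  [2, 0, 2],
    "Orange": [0, 0, 3],
    "Watch":  [1, 3, 0],
}

transformed = {word: transform(e, W) for word, e in embeddings.items()}
for word, vec in transformed.items():
    print(word, embeddings[word], "->", vec)
# Apple [2, 0, 2] -> [8, 8]
# Orange [0, 0, 3] -> [0, 3]
# Watch [1, 3, 0] -> [10, 6]
```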

flowchart TD
    AppleRaw["Apple [2,0,2]: tech + fruit"] -->|"W extracts tech"| AppleNew["[8,8]: strongly tech"]
    OrangeRaw["Orange [0,0,3]: pure fruit"] -->|"W extracts tech"| OrangeNew["[0,3]: almost no tech"]
    WatchRaw["Watch [1,3,0]: tech + watch"] -->|"W extracts tech"| WatchNew["[10,6]: strongly tech"]

8. Re-testing the Similarity: Does Apple Finally See Watch?

Back to the wound from Section 3: in raw space, similarity(apple, watch) was pathetically low. Let us redo the measurement, but through the lens of learned matrices.

Treat the matrix $W$ above as $W_Q$. Then the query of watch is:

$$q_{\text{watch}} = e_{\text{watch}} \, W_Q = [1,3,0]\,W_Q = [\,10,\,6\,]$$

Now we need keys. Pick a $W_K$ (these values are assumed, chosen to illustrate, exactly as the video does):

$$W_K = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 4 & 2 \end{bmatrix}$$

Compute the keys for Apple and watch:

$$k_{\text{Apple}} = e_{\text{Apple}}\,W_K = [2,0,2]\,W_K = [\,10,\,6\,] \qquad \big(2(1)+0(1)+2(4)=10,\quad 2(1)+0(1)+2(2)=6\big)$$

$$k_{\text{watch}} = e_{\text{watch}}\,W_K = [1,3,0]\,W_K = [\,4,\,4\,] \qquad \big(1(1)+3(1)+0(4)=4,\quad 1(1)+3(1)+0(2)=4\big)$$

Now the moment of truth, two dot products:

$$\underbrace{q_{\text{watch}} \cdot k_{\text{Apple}}}_{\text{similarity(watch, apple)}} = [10,6]\cdot[10,6] = 100 + 36 = \mathbf{136}$$

$$\underbrace{q_{\text{watch}} \cdot k_{\text{watch}}}_{\text{similarity(watch, watch)}} = [10,6]\cdot[4,4] = 40 + 24 = \mathbf{64}$$
Note

In raw space, watch–watch was the strongest similarity and watch–apple was nearly invisible. Through the learned lens, watch–apple scores 136 while watch–watch scores only 64. The model can now pour a large amount of Apple into the new watch — exactly what "Apple Smartwatch" requires.

flowchart LR
    subgraph RAW["Raw embeddings (Section 3)"]
        R1["sim(watch, watch): HIGH"]
        R2["sim(watch, apple): tiny"]
    end
    subgraph LEARN["Through W_Q and W_K"]
        L1["sim(watch, apple) = 136"]
        L2["sim(watch, watch) = 64"]
    end
    RAW -->|"add learnable parameters"| LEARN

This is the proof the video promised: the same words, run through trainable matrices, produce a similarity that is far more relevant to the sentence's true meaning.
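
For readers who prefer to verify by running code, the sketch below recomputes both similarities. It also computes the raw dot products of the toy 3-D embeddings from Section 7; that raw comparison is an addition for illustration, not a number quoted in the video.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(e, W):
    """Row convention: e (length n) times W (n x m)."""
    return [sum(e[i] * W[i][j] for i in range(len(e))) for j in range(len(W[0]))]

e_apple = [2, 0, 2]
e_watch = [1, 3, 0]

W_Q = [[4, 3], [2, 1], [0, 1]]   # the matrix W from Section 7
W_K = [[1, 1], [1, 1], [4, 2]]   # the assumed key matrix from this section

# Raw space: no learnable parameters involved.
raw_watch_apple = dot(e_watch, e_apple)   # 2
raw_watch_watch = dot(e_watch, e_watch)   # 10

# Learned space: compare the query of watch against each key.
q_watch = project(e_watch, W_Q)           # [10, 6]
k_apple = project(e_apple, W_K)           # [10, 6]
k_watch = project(e_watch, W_K)           # [4, 4]
learned_watch_apple = dot(q_watch, k_apple)   # 136
learned_watch_watch = dot(q_watch, k_watch)   # 64

print("raw:    ", raw_watch_apple, "vs", raw_watch_watch)
print("learned:", learned_watch_apple, "vs", learned_watch_watch)
```

In raw space watch prefers itself (10 against 2); after the two projections the ordering flips (136 against 64).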


9. Where Do These Lucky Weights Come From? Training.

It looks like cheating — we hand-picked matrices that make watch attend to apple. But that hand-picking is precisely what training does, automatically, across millions of sentences.

Note

Did we cheat by picking the weights to get the answer we wanted? Yes, and so does gradient descent. We feed the model the sentence love Apple watch. During backpropagation the error signal says, in effect: you should have captured the dependency between watch and apple. The optimizer nudges $W_Q$, $W_K$, $W_V$ so that next time the similarity comes out high. Repeat over a corpus and the weights settle into matrices that capture genuinely useful relationships.

flowchart TD
    Start["Random W_Q, W_K, W_V"] --> Fwd["Run attention on 'love Apple watch'"]
    Fwd --> Pred["Make a prediction"]
    Pred --> Loss["Compare with the correct answer -> loss"]
    Loss --> Back["Backpropagation: 'capture watch <-> apple'"]
    Back --> Update["Optimizer updates W_Q, W_K, W_V"]
    Update --> Fwd

The matrices are discovered, not designed. We never tell the model how to make apple and watch attend — only that the prediction was wrong. It finds the lens itself.
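
The loop in the diagram can be imitated at toy scale. The sketch below is a heavy simplification and makes several assumptions that are not in the video: the "prediction" is just the attention weight that watch puts on apple, the loss is the negative log of that weight, only two tokens (apple and watch) compete, and gradients are estimated by finite differences instead of real backpropagation. It is meant to show the shape of the idea, not how a Transformer is actually trained.

```python
import math

E_APPLE = [2.0, 0.0, 2.0]
E_WATCH = [1.0, 3.0, 0.0]

def project(e, W):
    return [sum(e[i] * W[i][j] for i in range(len(e))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_on_apple(params):
    """Softmax weight that the query of 'watch' gives to the key of 'apple'."""
    W_Q = [params[0:2], params[2:4], params[4:6]]      # 3 x 2
    W_K = [params[6:8], params[8:10], params[10:12]]   # 3 x 2
    q = project(E_WATCH, W_Q)
    s_apple = dot(q, project(E_APPLE, W_K))
    s_watch = dot(q, project(E_WATCH, W_K))
    # A two-way softmax is a sigmoid of the score gap.
    return 1.0 / (1.0 + math.exp(s_watch - s_apple))

def loss(params):
    # Toy objective (an assumption): low loss when watch attends to apple.
    return -math.log(attention_on_apple(params))

def numeric_gradient(f, params, eps=1e-5):
    """Central finite differences, standing in for backpropagation."""
    grad = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

params = [0.1] * 12          # small, uninformative starting weights
learning_rate = 0.05
before = attention_on_apple(params)

for _ in range(100):
    grad = numeric_gradient(loss, params)
    params = [p - learning_rate * g for p, g in zip(params, grad)]

after = attention_on_apple(params)
print("attention on apple:", round(before, 3), "->", round(after, 3))
```

Nobody tells the loop which entries of the matrices to change; it only sees the loss, and the weights drift until watch attends to apple.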


10. The Value Vector Finishes the Job

High similarity gets a big helping of Apple mixed into watch. But remember Problem 2: we wanted to add Apple-the-company, not Apple-the-fruit.

That is the Value vector's role. It is produced by yet another linear transformation:

$$v_{\text{Apple}} = e_{\text{Apple}} \, W_V$$

Just as the matrix $W$ (our $W_Q$) extracted technology features in Section 7, a trained $W_V$ makes $v_{\text{Apple}}$ carry mostly Apple-technology-company features and leave the fruit features behind. So when the big attention weight scoops up $v_{\text{Apple}}$, it scoops up the right flavour of Apple.

flowchart LR
    EA["e(apple): tech + fruit"] -->|"W_V keeps tech, drops fruit"| VA["v(apple): mostly tech-company"]
    VA -->|"mixed in with high weight"| WOUT["new watch = Apple Smartwatch"]
Note
  • $W_Q$ and $W_K$ decide how much attention flows (they fix Problem 1).
  • $W_V$ decides what content flows (it fixes Problem 2).

Together, the three linear transformations turn a watch that barely noticed a half-fruit apple into a confident Apple Smartwatch.

This is the full payoff. The new representation of watch:

$$\text{watch}' \;=\; \alpha_{\text{love}}\,v_{\text{love}} + \alpha_{\text{apple}}\,v_{\text{apple}} + \alpha_{\text{watch}}\,v_{\text{watch}}$$

now carries a large $\alpha_{\text{apple}}$ (thanks to $W_Q, W_K$) multiplying a tech-flavoured $v_{\text{apple}}$ (thanks to $W_V$). Apple Smartwatch, achieved.
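
Here is a minimal sketch of that final mixing step. The note never gives numbers for $W_V$, so the matrix below is purely hypothetical: its is_fruit row is set to zero so that fruit content cannot pass through. For brevity the sketch also leaves out love and mixes only apple and watch, using the scores 136 and 64 from Section 8.

```python
import math

def project(e, W):
    return [sum(e[i] * W[i][j] for i in range(len(e))) for j in range(len(W[0]))]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical W_V (not from the video). Rows: is_tech, is_watch, is_fruit.
W_V = [[3, 1],
       [1, 2],
       [0, 0]]   # fruit row zeroed: fruit features are dropped

e_apple = [2, 0, 2]
e_watch = [1, 3, 0]
e_orange = [0, 0, 3]

v_apple = project(e_apple, W_V)    # [6, 2]: only the tech part of apple survives
v_watch = project(e_watch, W_V)    # [6, 7]
v_orange = project(e_orange, W_V)  # [0, 0]: a pure fruit contributes nothing

# Attention weights from the Section 8 scores (watch->apple 136, watch->watch 64).
alpha_apple, alpha_watch = softmax([136, 64])

new_watch = [alpha_apple * a + alpha_watch * w for a, w in zip(v_apple, v_watch)]
print("alpha_apple:", alpha_apple)
print("new watch:  ", new_watch)
```

Because the toy scores are so far apart, softmax hands almost all of the weight to apple, so the new watch ends up very close to `v_apple`. Real models produce far less extreme weights.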


11. The Staggering Scale: ~3 Million Parameters for One Word

We did all of this with a toy $3 \times 2$ matrix. The real Transformer is built on the same operation, just enormously wider and stacked.

Note

Honest answer from the video: we do not know what every weight does. We simply give the network enough knobs and let training find the relationships. How many knobs? Let us count.

| Layer of the onion | Multiplier | Why |
| --- | --- | --- |
| One weight matrix | $512 \times 64$ | a single $W$, full size |
| Attention heads | $\times\, 8$ | each head has its own $W_Q, W_K, W_V$ family (multi-head) |
| Encoder blocks | $\times\, 6$ | the encoder stacks 6 such blocks |
| Encoder and decoder | $\times\, 2$ | the decoder carries a similar (or larger) parameter load |

Multiply it through:

$$512 \times 64 \times 8 \times 6 \times 2 \;=\; 3{,}145{,}728 \;\approx\; \textbf{3.1 million parameters}$$
flowchart TD
    M["one matrix: 512 x 64"] --> H["x8 heads"]
    H --> B["x6 encoder blocks"]
    B --> ED["x2 (encoder + decoder)"]
    ED --> R["~3.1 million parameters to process ONE word"]
Note

This is the video's back-of-the-envelope estimate, not an exact parameter count. It counts a single $512 \times 64$ matrix per head (counting $W_Q$, $W_K$, and $W_V$ separately would triple the figure), and real models also have feed-forward layers, embeddings, layer norms, etc. The lesson is the scale: millions of tuned knobs stand behind every single token. We cannot say what each one does, yet together they reliably surface the relationships that matter.
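
The estimate is easy to reproduce. The variable names below are chosen for this sketch; the second figure, which counts all three of $W_Q$, $W_K$, $W_V$ per head, is an extra calculation and not part of the video's estimate.

```python
d_model = 512   # embedding size
d_head = 64     # size after projection
heads = 8       # heads per attention layer
blocks = 6      # stacked blocks
stacks = 2      # encoder + decoder

# The video's estimate: one 512 x 64 matrix per head.
video_estimate = d_model * d_head * heads * blocks * stacks
print(f"{video_estimate:,}")        # 3,145,728

# Counting W_Q, W_K and W_V separately triples it (still a rough estimate).
with_q_k_v = 3 * video_estimate
print(f"{with_q_k_v:,}")            # 9,437,184
```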

The mechanics of multi-head attention, the encoder block, and the decoder are the subject of upcoming notes in this series.


12. The Whole Idea in One Picture

flowchart TD
    X["Word embedding e (512-D, random features)"] --> Q["q = e W_Q"]
    X --> K["k = e W_K"]
    X --> V["v = e W_V"]

    Q --> SIM["similarity = q . k (now relevant!)"]
    K --> SIM
    SIM --> SM["softmax -> attention weights"]
    SM --> MIX["weighted sum of value vectors"]
    V --> MIX
    MIX --> OUT["new contextual word (e.g. Apple Smartwatch)"]

    subgraph WHY["Each W is a linear transformation that..."]
        P1["changes 512-D -> 64-D"]
        P2["extracts & magnifies task-relevant features"]
    end
    WHY -. "powers all three projections" .-> Q

Linear transformation is the engine under every arrow that touches a $W$:

  1. It resizes the vector ($512 \to 64$).
  2. It refines the vector (keep tech, drop fruit).
  3. Done three times ($W_Q, W_K, W_V$), it lets words measure relevant similarity and exchange the right content.

13. Common Mistakes

Note

"Linear transformation is some exotic operation." No. It is just $y = Wx$: a neural-network layer with the bias and activation removed.

"The matrices are designed by hand." No. $W_Q, W_K, W_V$ start random and are learned by backpropagation. The hand-picked numbers in this note only mimic what training discovers.

"Reducing dimensions just throws information away." It throws away the irrelevant part and magnifies the relevant part. Compression and refinement happen together.

"$W_Q$ and $W_V$ do the same thing." No. $W_Q$/$W_K$ control how much to attend; $W_V$ controls what gets mixed in. Problem 1 vs Problem 2.

"A high attention weight means apple-the-fruit gets added." Only if $W_V$ lets it. The value vector decides the flavour of what flows.


14. Glossary

| Term | Meaning |
| --- | --- |
| Linear transformation | Multiplying a vector by a matrix, $y = Wx$: a neural-network layer minus bias and activation |
| Weight matrix ($W$) | The learnable matrix that performs the transformation |
| $W_Q, W_K, W_V$ | The query, key, and value weight matrices in self-attention |
| Query ($q$) | $e\,W_Q$: what a word is looking for |
| Key ($k$) | $e\,W_K$: what a word offers for matching |
| Value ($v$) | $e\,W_V$: the (refined) content a word contributes |
| Dot product | Similarity score between two vectors; high = similar |
| Dimensionality reduction | Shrinking a vector's size (e.g. $512 \to 64$) via a matrix |
| Feature extraction | Keeping & magnifying task-relevant features, discarding the rest |
| Multi-head attention | Running several independent $W_Q, W_K, W_V$ families in parallel (8 in the base model) |
| Encoder block | One stacked layer of the encoder (6 in the base model) |

15. Active Recall

Cover the next section and try these.

  1. In one line, what is a linear transformation, and how does it differ from a neural-network layer?
  2. Name the two superpowers a weight matrix gives self-attention.
  3. In "I love Apple watch", what were the two problems with parameter-free attention?
  4. For embedding $[2,0,2]$ and matrix columns $[4,2,0]^T$ and $[3,1,1]^T$, compute the output. (It should be the Apple result.)
  5. After applying $W_Q$ and $W_K$, what were the similarities for watch–apple and watch–watch? Which is bigger, and why does that matter?
  6. Where do the "lucky" weights actually come from?
  7. Which matrix fixes "too little Apple", and which fixes "the wrong kind of Apple"?
  8. Roughly how many parameters process a single word, and what factors multiply together to get there?

16. Answers

  1. A linear transformation is $y = Wx$: multiply a vector by a matrix. A neural-network layer is $f(Wx + b)$; a linear transformation drops the bias $b$ and the nonlinear activation $f$.
  2. (a) Changing dimensionality (e.g. $512 \to 64$); (b) extracting and magnifying task-relevant features while discarding the rest.
  3. (a) Far too little of apple was mixed into watch (low similarity); (b) the apple that was mixed in carried both fruit and company properties, when only company properties were wanted.
  4. Col 1: $2(4)+0(2)+2(0)=8$. Col 2: $2(3)+0(1)+2(1)=8$. Output $[8,8]$. ✓
  5. $q_{\text{watch}} \cdot k_{\text{apple}} = [10,6]\cdot[10,6] = \mathbf{136}$; $q_{\text{watch}} \cdot k_{\text{watch}} = [10,6]\cdot[4,4] = \mathbf{64}$. Watch–apple (136) is far bigger than watch–watch (64), the reverse of raw space, so the model now attends strongly to apple when rebuilding watch.
  6. From training: the matrices begin random, and backpropagation tunes them across millions of sentences to capture the relationships that reduce prediction error.
  7. $W_Q$ and $W_K$ fix "too little Apple" (they raise the attention weight). $W_V$ fixes "the wrong kind of Apple" (it refines the value vector to be tech-flavoured).
  8. About 3.1 million: $512 \times 64$ (one matrix) $\times\, 8$ (heads) $\times\, 6$ (encoder blocks) $\times\, 2$ (encoder + decoder).

17. Memory Hooks

Note

A weight matrix is a camera lens. The world (the raw embedding) is full of detail you do not need. The lens focuses on what matters, blurs the rest, and projects it onto a smaller sensor (fewer dimensions). $W_Q, W_K, W_V$ are three different lenses on the same scene.

Note

$W_Q$ asks the question in a shared language. $W_K$ answers in that same language. Without the translators, apple and watch speak past each other; with them, they finally understand they belong together.

Note

The raw embedding is a rough block of marble holding every possible statue. The linear transformation chisels away the fruit, the food, the 500 irrelevant features — and reveals the technology figure that was waiting inside.


18. Final Summary

We started with a working idea from Part 2: rebuild each word as a weighted blend of its neighbours, with weights given by dot-product similarity → softmax.

Then we broke it. The sentence love Apple watch exposed two failures of pure, parameter-free math:

watch barely noticed apple, and the trickle of apple it did absorb was half-fruit.

The cure was to insert learnable matrices. Multiplying a vector by a matrix is a linear transformation, the same operation as a neural-network layer minus its bias and activation. That one operation does two things:

$$\underbrace{x_{1\times 512} \, W_{512 \times 64}}_{\text{linear transformation}} = y_{1 \times 64} \qquad\Rightarrow\qquad \text{(1) resize} \;+\; \text{(2) refine}$$

Applied as $W_Q, W_K, W_V$, it let watch and apple recognise each other (we checked it by hand: $136$ vs $64$) and let the value vector deliver Apple-the-company rather than Apple-the-fruit. Training, not human design, discovers these matrices; and at full scale, roughly 3 million of these tuned knobs stand behind every single word.

The big idea, in one breath: a linear transformation is a learnable lens. It is what turns fixed, generic word vectors into sharp, context-aware, task-relevant representations — and that is why it is the game-changer at the heart of self-attention.


Series: Part 3 of Transformers in Deep Learning. ← Previous: Self Attention in Transformers - The Deep Dive · Next: Multi-Head Attention (upcoming).