Scaled Dot-Product Attention
Transformers in Deep Learning - Part 5
Sibling notes: Self Attention in Transformers - The Deep Dive · Query Key Value in Self-Attention - The Deep Dive
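The self-attention formula is often written as:
$$\text{Attention}(Q, K, V) = \text{softmax}(QK^T)\,V$$
But the actual Transformer formula from Attention Is All You Need is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The extra term is:
$$\frac{1}{\sqrt{d_k}}$$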
This note explains why that scaling is needed.
Short answer:
As the dimension of query and key vectors grows, their dot products naturally become larger and more spread out. Large dot products make softmax too sharp, which can make gradients tiny. Dividing by $\sqrt{d_k}$ keeps the scores at a stable scale.
1. Quick Recap: Where the Scaling Appears
Suppose the sentence is:
love Apple phones
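Each word begins as an embedding. If each embedding has 512 dimensions, the input matrix is:
$$X \in \mathbb{R}^{3 \times 512}$$
There are 3 tokens and each token has a 512-dimensional embedding.
Self-attention creates query, key, and value matrices:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
If:
$$W_Q, W_K, W_V \in \mathbb{R}^{512 \times 64}$$
then:
$$Q, K, V \in \mathbb{R}^{3 \times 64}$$
Here:
$$d_k = 64$$
is the dimension of each query/key vector.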
```mermaid
flowchart LR
    X["X: 3 x 512"] --> WQ["W_Q: 512 x 64"]
    X --> WK["W_K: 512 x 64"]
    X --> WV["W_V: 512 x 64"]
    WQ --> Q["Q: 3 x 64"]
    WK --> K["K: 3 x 64"]
    WV --> V["V: 3 x 64"]
```
Next, attention compares every query with every key:
$$QK^T$$
The shape is:
$$(3 \times 64)(64 \times 3) = 3 \times 3$$
The output is a matrix of attention scores. Each score is a dot product between one query vector and one key vector.
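The shape bookkeeping above can be checked with a small sketch. It uses only the Python standard library, and the random matrices are placeholders rather than trained weights; the helper names (`matmul`, `transpose`, `random_matrix`, `shape`) are hypothetical and introduced only for this illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    # Standard-normal entries; illustrative stand-ins, not learned parameters.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def shape(m):
    return (len(m), len(m[0]))

rng = random.Random(0)
X = random_matrix(3, 512, rng)      # 3 tokens, 512-dimensional embeddings
W_Q = random_matrix(512, 64, rng)   # projection to d_k = 64
W_K = random_matrix(512, 64, rng)

Q = matmul(X, W_Q)                  # 3 x 64
K = matmul(X, W_K)                  # 3 x 64
scores = matmul(Q, transpose(K))    # 3 x 3: one score per (query, key) pair

print(shape(Q), shape(K), shape(scores))
```

The score matrix is 3 x 3 because there are 3 tokens; the 64 only determines how many products are summed inside each entry.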
2. What $QK^T$ Really Contains
For the sentence:
love Apple phones
the score matrix contains pairwise relationships:
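| Score | Meaning |
|---|---|
| $q_{\text{love}} \cdot k_{\text{love}}$ | love compared with love |
| $q_{\text{Apple}} \cdot k_{\text{love}}$ | Apple compared with love |
| $q_{\text{Apple}} \cdot k_{\text{phones}}$ | Apple compared with phones |
| $q_{\text{phones}} \cdot k_{\text{Apple}}$ | phones compared with Apple |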
Each score is produced by multiplying and adding 64 pairs of numbers:
$$q \cdot k = q_1 k_1 + q_2 k_2 + \dots + q_{64} k_{64}$$
That repeated addition is the key.
The larger $d_k$ becomes, the more terms are added into each dot product.
3. The Core Problem: Dot Products Grow With Dimension
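Imagine two random vectors:
$$q = (q_1, \dots, q_d), \quad k = (k_1, \dots, k_d)$$
Their dot product is:
$$q \cdot k = \sum_{i=1}^{d} q_i k_i$$
If the components are roughly independent, centered around zero, and similarly scaled, each term contributes some variance.
When terms are added, the variances add.
So:
$$\mathrm{Var}(q \cdot k) \propto d$$
For attention:
$$\mathrm{Var}(q \cdot k) \propto d_k$$
This means the typical magnitude of the attention scores naturally grows as the query/key dimension grows.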
```mermaid
flowchart TD
    D1["small d_k"] --> T1["few terms in dot product"]
    T1 --> V1["lower score variance"]
    D2["large d_k"] --> T2["many terms in dot product"]
    T2 --> V2["higher score variance"]
```
The shape of $QK^T$ may stay the same, but the numbers inside it become more spread out.
For 3 tokens:
$$QK^T \in \mathbb{R}^{3 \times 3}$$
whether $d_k = 8$, $d_k = 64$, or $d_k = 512$.
But the variance of the values inside that matrix grows with $d_k$.
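A rough Monte Carlo sketch can make this growth visible. It assumes query and key entries are independent standard-normal draws, which is an idealization rather than a property of trained models; `score_variance` and the trial count are hypothetical choices for illustration.

```python
import random
import statistics

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

def score_variance(d_k, trials=2000, seed=0):
    """Estimate Var(q . k) when q and k have i.i.d. standard-normal entries."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(dot(q, k))
    return statistics.pvariance(samples)

# Estimated variance of a single raw score for a few dimensions.
estimates = {d_k: score_variance(d_k) for d_k in (8, 64, 512)}
for d_k, var in estimates.items():
    print(f"d_k={d_k:4d}  estimated variance={var:8.1f}")
```

Under these assumptions each estimate should land close to its own $d_k$, up to sampling noise.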
4. Tiny Intuition: Why Larger Dimensions Create Larger Scores
A 2-dimensional dot product has only two terms:
$$q \cdot k = q_1 k_1 + q_2 k_2$$
A 5-dimensional dot product has five terms:
$$q \cdot k = q_1 k_1 + q_2 k_2 + q_3 k_3 + q_4 k_4 + q_5 k_5$$
Same kind of numbers, more dimensions, larger possible dot product.
This is the basic reason score variance grows with dimension:
More dimensions mean more multiplied terms being added together.
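As a concrete instance, here is a tiny sketch with made-up vectors whose entries are the same kind of numbers in both cases; only the length changes. The vectors are hypothetical and chosen purely for easy arithmetic.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

two_dim = dot([1, 2], [3, 4])                      # 1*3 + 2*4
five_dim = dot([1, 2, 1, 2, 1], [3, 4, 3, 4, 3])   # 3 + 8 + 3 + 8 + 3

print(two_dim, five_dim)  # 11 25
```

With all-positive entries the sum can only grow as terms are added. With zero-mean entries the terms partly cancel, so it is the spread of the sum, not its average, that grows.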
5. The Variance Calculation
Assume:
- The entries of $q$ and $k$ are independent random variables.
- They have mean 0.
- They have variance 1.
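For one dot product:
$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$
Because the terms are independent:
$$\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i)$$
If $q_i$ and $k_i$ each have variance 1 and mean 0, then:
$$\mathrm{Var}(q_i k_i) = 1$$
So:
$$\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} 1 = d_k$$
Therefore:
$$\mathrm{Var}(q \cdot k) = d_k, \qquad \mathrm{std}(q \cdot k) = \sqrt{d_k}$$
This is the mathematical reason the raw attention scores become unstable as $d_k$ grows.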
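The claim that a single product term has variance 1 deserves one extra step. A short derivation, assuming $q_i$ and $k_i$ are independent of each other with mean 0 and variance 1 as listed above:

$$
\begin{aligned}
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\mathrm{Var}(q_i k_i) &= \mathbb{E}[q_i^2 k_i^2] - \left(\mathbb{E}[q_i k_i]\right)^2 \\
&= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0 \\
&= \left(\mathrm{Var}(q_i) + \mathbb{E}[q_i]^2\right)\left(\mathrm{Var}(k_i) + \mathbb{E}[k_i]^2\right) \\
&= 1 \cdot 1 = 1
\end{aligned}
$$

In a trained model these assumptions hold only approximately, so the result is best read as a scale estimate rather than an exact identity.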
6. Why High Variance Is Bad Before Softmax
Attention scores are not the final output. They are passed into softmax:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Softmax turns scores into probabilities.
The problem is that softmax is very sensitive to large gaps between numbers.
For example:
$$\text{softmax}([1, 5]) \approx [0.018, 0.982]$$
A difference of 4 becomes an almost one-sided probability distribution.
If the scores are even more spread out:
$$\text{softmax}([1, 5, 10]) \approx [0.0001, 0.0067, 0.9932]$$
The largest score dominates. The others become almost zero.
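The sensitivity described above is easy to reproduce. The following is a minimal softmax sketch using the standard max-subtraction trick for numerical stability; the score lists are hypothetical values chosen only to show the effect of widening gaps.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

close = softmax([1.0, 2.0, 3.0])     # small gaps: weight is shared
gap4 = softmax([1.0, 5.0])           # gap of 4: already very lopsided
spread = softmax([1.0, 5.0, 10.0])   # larger gaps: close to one-hot

for name, probs in (("close", close), ("gap4", gap4), ("spread", spread)):
    print(name, [round(p, 4) for p in probs])
```

Only the differences between scores matter to softmax, so multiplying all scores by a constant greater than 1 widens every gap and sharpens the output.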
```mermaid
flowchart LR
    S["high-variance scores"] --> SM["softmax"]
    SM --> P["one value near 1, others near 0"]
    P --> G["tiny gradients for ignored positions"]
```
This is the real training issue:
Large unscaled dot products can push softmax into a saturated region, where attention becomes almost one-hot and gradients through most positions become very small.
This is often described as a vanishing-gradient problem. More precisely, it is softmax saturation causing weak learning signals for many attention links.
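To see the weak learning signal directly, one can look at the softmax Jacobian, whose entries are $\partial p_i / \partial z_j = p_i(\delta_{ij} - p_j)$. The sketch below compares the largest Jacobian entry for a moderate and a widely spread score vector; both vectors are hypothetical examples.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(scores):
    """Matrix of partial derivatives d p_i / d z_j = p_i * (delta_ij - p_j)."""
    p = softmax(scores)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def largest_entry(matrix):
    return max(abs(v) for row in matrix for v in row)

moderate = largest_entry(softmax_jacobian([0.5, 1.0, 1.5]))
saturated = largest_entry(softmax_jacobian([5.0, 10.0, 30.0]))

print(f"moderate scores:  largest derivative = {moderate:.3f}")
print(f"saturated scores: largest derivative = {saturated:.2e}")
```

When one probability is close to 1, every entry $p_i(\delta_{ij} - p_j)$ is close to 0, so almost no gradient flows back through the scores.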
7. What Scaling Should Accomplish
The goal is not to make all attention scores small.
The goal is to keep their variance stable as $d_k$ changes.
Without scaling:
| $d_k$ | Approximate score variance |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 8 | 8 |
| 64 | 64 |
| 512 | 512 |
That means larger attention dimensions produce larger and sharper softmax inputs.
The desired behavior is:
| $d_k$ | Desired score variance after scaling |
|---|---|
| 1 | about 1 |
| 2 | about 1 |
| 8 | about 1 |
| 64 | about 1 |
| 512 | about 1 |
So the scaling factor must cancel the variance growth caused by $d_k$.
8. Why Divide by $\sqrt{d_k}$?
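A key variance property is:
$$\mathrm{Var}(cX) = c^2 \, \mathrm{Var}(X)$$
The raw dot product has variance:
$$\mathrm{Var}(q \cdot k) = d_k$$
Now scale it by:
$$c = \frac{1}{\sqrt{d_k}}$$
The scaled score is:
$$\frac{q \cdot k}{\sqrt{d_k}}$$
Its variance becomes:
$$\mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{1}{d_k} \, \mathrm{Var}(q \cdot k)$$
Substitute:
$$\mathrm{Var}(q \cdot k) = d_k$$
Then:
$$\mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$$
That is why the denominator is $\sqrt{d_k}$, not $d_k$.
Dividing by $d_k$ would shrink the variance too much:
$$\mathrm{Var}\left(\frac{q \cdot k}{d_k}\right) = \frac{1}{d_k^2} \cdot d_k = \frac{1}{d_k}$$
The goal is to normalize the variance, not crush it.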
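The three cases (no scaling, dividing by the square root of the dimension, dividing by the dimension itself) can be compared empirically. This is a sketch under the same idealized assumption of independent standard-normal entries; the dimension of 64 and the trial count are arbitrary illustrative choices.

```python
import math
import random
import statistics

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

d_k = 64
trials = 3000
rng = random.Random(0)

# Raw dot products of random query/key pairs.
raw = []
for _ in range(trials):
    q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    raw.append(dot(q, k))

var_raw = statistics.pvariance(raw)
var_sqrt = statistics.pvariance([s / math.sqrt(d_k) for s in raw])
var_dk = statistics.pvariance([s / d_k for s in raw])

print(f"no scaling:         {var_raw:.2f}   (theory: {d_k})")
print(f"divide by sqrt(dk): {var_sqrt:.3f}  (theory: 1)")
print(f"divide by dk:       {var_dk:.4f} (theory: {1 / d_k:.4f})")
```

Dividing by the dimension itself leaves scores with almost no spread, which would push softmax toward a nearly uniform distribution instead of a saturated one.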
9. The Final Formula
The complete scaled dot-product attention formula is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Step by step:
| Step | Operation | Meaning |
|---|---|---|
| 1 | $QK^T$ | compute query-key similarity scores |
| 2 | divide by $\sqrt{d_k}$ | stabilize score variance |
| 3 | softmax | convert scores into attention weights |
| 4 | multiply by $V$ | mix value vectors into contextual representations |
```mermaid
flowchart LR
    Q["Q"] --> S["QK^T"]
    K["K^T"] --> S
    S --> SCALE["divide by sqrt(d_k)"]
    SCALE --> SM["softmax"]
    SM --> MIX["multiply by V"]
    V["V"] --> MIX
    MIX --> OUT["contextual representations"]
```
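The four steps can be sketched end to end in plain Python. This is a minimal single-head illustration with no masking, batching, or learned projections; the function name `scaled_dot_product_attention` and the tiny hand-picked matrices are assumptions made for the example, not a reference implementation.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])                                   # query/key dimension
    scores = matmul(Q, transpose(K))                  # step 1: QK^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 2
    weights = [softmax(row) for row in scaled]        # step 3: row-wise softmax
    return matmul(weights, V), weights                # step 4: mix the values

# Hypothetical 3-token example with d_k = 2 and 2-dimensional values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each output row is a weighted average of the rows of `V`.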
10. What Happens Without Scaling?
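Without the scaling term:
$$\text{Attention}(Q, K, V) = \text{softmax}(QK^T)\,V$$
large $d_k$ can make the dot products too spread out.
Then softmax may produce attention weights like:
$$[0.998, \; 0.001, \; 0.001]$$
Instead of a useful distribution like:
$$[0.6, \; 0.3, \; 0.1]$$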
When attention becomes too sharp too early:
- most tokens receive almost no attention
- many value vectors barely contribute
- gradients through ignored positions become tiny
- training becomes slower or less stable
Scaling keeps the scores in a range where softmax can still distribute attention meaningfully.
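A small simulation can illustrate the difference. It draws random queries and keys with independent standard-normal entries (an idealization), computes attention weights over three keys with and without scaling, and reports the average size of the largest weight. The helper `mean_peak_weight` and all parameter values are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_peak_weight(d_k, scale, n_keys=3, trials=200, seed=0):
    """Average of the largest attention weight over random query/key draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        total += max(softmax(scores))
    return total / trials

d_k = 512
unscaled = mean_peak_weight(d_k, scale=1.0)
scaled = mean_peak_weight(d_k, scale=math.sqrt(d_k))

print(f"average largest weight without scaling: {unscaled:.3f}")
print(f"average largest weight with scaling:    {scaled:.3f}")
```

Under these assumptions the unscaled weights are close to one-hot on most draws, while the scaled weights leave a meaningful share for the other keys.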
11. Common Confusions
"$d_k$ is the number of words." No. $d_k$ is the dimension of the query/key vectors. In the example, there are 3 words but $d_k = 64$.
"$\sqrt{d_k}$ changes the shape of the attention matrix." No. It only rescales the numbers. If $QK^T$ is $3 \times 3$, the scaled matrix is still $3 \times 3$.
No. Scaling is needed because the variance of dot-product scores grows with vector dimension.
Softmax converts scores into probabilities, but large score gaps make it too sharp. Scaling helps before softmax is applied.
Dividing by $d_k$ would make the variance roughly $1/d_k$. Dividing by $\sqrt{d_k}$ makes the variance roughly stable.
12. One-Line Interview Answer
Scaled dot-product attention divides $QK^T$ by $\sqrt{d_k}$ because the variance of query-key dot products grows with the key dimension $d_k$. Without scaling, the scores can become large, softmax can saturate, and gradients can become small. The factor $1/\sqrt{d_k}$ keeps the score variance approximately constant and makes training more stable.
13. Final Intuition
The dot product is an adding machine.
As the vector dimension grows, it adds more terms. More terms create larger and more variable scores. Large score gaps make softmax behave almost like a hard maximum.
Scaling by $1/\sqrt{d_k}$ keeps the comparison fair:
Big enough for meaningful differences, but not so big that softmax becomes saturated.
That one denominator is what turns raw dot-product attention into scaled dot-product attention.