Scaled Dot-Product Attention

Transformers in Deep Learning - Part 5
Sibling notes: Self Attention in Transformers - The Deep Dive · Query Key Value in Self-Attention - The Deep Dive

The self-attention formula is often written as:

\text{Attention}(Q,K,V) = \text{softmax}(QK^T)V

But the formula given in "Attention Is All You Need" (Vaswani et al., 2017) is:

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V

The extra term is:

\frac{1}{\sqrt{d_k}}

This note explains why that scaling is needed.

Short answer:

As the dimension of the query and key vectors grows, their dot products tend to become larger in magnitude and more spread out. Widely spread scores make softmax too sharp, which can make gradients tiny. Dividing by $\sqrt{d_k}$ keeps the scores at a stable scale.

1. Quick Recap: Where the Scaling Appears

Suppose the sentence is:

love Apple phones

Each word begins as an embedding. If each embedding has 512 dimensions, the input matrix is:

X \in \mathbb{R}^{3 \times 512}

There are 3 tokens and each token has a 512-dimensional embedding.

Self-attention creates query, key, and value matrices:

Q = XW_Q

K = XW_K

V = XW_V

If:

W_Q, W_K, W_V \in \mathbb{R}^{512 \times 64}

then:

Q,K,V \in \mathbb{R}^{3 \times 64}

Here:

d_k = 64

$d_k$ is the dimension of each query/key vector.

flowchart LR
    X["X: 3 x 512"] --> WQ["W_Q: 512 x 64"]
    X --> WK["W_K: 512 x 64"]
    X --> WV["W_V: 512 x 64"]

    WQ --> Q["Q: 3 x 64"]
    WK --> K["K: 3 x 64"]
    WV --> V["V: 3 x 64"]

Next, attention compares every query with every key:

QK^T

The shape is:

(3 \times 64)(64 \times 3) = 3 \times 3

The output is a matrix of attention scores. Each score is a dot product between one query vector and one key vector.
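
The shape bookkeeping above can be checked with a small sketch. It uses plain Python lists and random Gaussian values as stand-ins for real embeddings and learned weights; the helper names (`matmul`, `random_matrix`, `shape`) and the weight scale of 0.05 are assumptions made for illustration, not part of the original note.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def random_matrix(rows, cols, rng, scale=1.0):
    """Matrix of independent Gaussian entries (illustrative values only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def shape(m):
    return (len(m), len(m[0]))

rng = random.Random(0)
n_tokens, d_model, d_k = 3, 512, 64

X = random_matrix(n_tokens, d_model, rng)            # 3 x 512 embeddings
W_Q = random_matrix(d_model, d_k, rng, scale=0.05)   # 512 x 64
W_K = random_matrix(d_model, d_k, rng, scale=0.05)   # 512 x 64

Q = matmul(X, W_Q)                                   # 3 x 64
K = matmul(X, W_K)                                   # 3 x 64
K_T = [list(col) for col in zip(*K)]                 # 64 x 3
scores = matmul(Q, K_T)                              # 3 x 3, one score per (query, key) pair

print(shape(Q), shape(K_T), shape(scores))  # (3, 64) (64, 3) (3, 3)
```

The score matrix has one row per query token and one column per key token, regardless of the value chosen for `d_k`.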

2. What $QK^T$ Really Contains

For the sentence:

love Apple phones

the score matrix contains pairwise relationships:

Score	Meaning
$q_{\text{love}} \cdot k_{\text{love}}$	love compared with love
$q_{\text{Apple}} \cdot k_{\text{love}}$	Apple compared with love
$q_{\text{Apple}} \cdot k_{\text{phones}}$	Apple compared with phones
$q_{\text{phones}} \cdot k_{\text{Apple}}$	phones compared with Apple

Each score is produced by multiplying and adding 64 pairs of numbers:

q \cdot k = q_1k_1 + q_2k_2 + \cdots + q_{64}k_{64}

That repeated addition is the key.

The larger $d_k$ becomes, the more terms are added into each dot product.

3. The Core Problem: Dot Products Grow With Dimension

Imagine two random vectors:

q = [q_1, q_2, \dots, q_d]

k = [k_1, k_2, \dots, k_d]

Their dot product is:

q \cdot k = \sum_{i=1}^{d} q_i k_i

If the components are roughly independent, centered around zero, and similarly scaled, each term $q_i k_i$ contributes some variance.

When $d$ terms are added, the variances add.

So:

\text{Var}(q \cdot k) \propto d

For attention:

\text{Var}(q \cdot k) \propto d_k

This means the size of the attention scores naturally grows as the query/key dimension grows.

flowchart TD
    D1["small d_k"] --> T1["few terms in dot product"]
    T1 --> V1["lower score variance"]

    D2["large d_k"] --> T2["many terms in dot product"]
    T2 --> V2["higher score variance"]

The shape of $QK^T$ may stay the same, but the numbers inside it become more spread out.

For 3 tokens:

QK^T \in \mathbb{R}^{3 \times 3}

whether $d_k = 2$, $64$, or $512$.

But the variance of the values inside that $3 \times 3$ matrix grows with $d_k$.
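
A quick Monte Carlo sketch can make the growth visible. It assumes the idealized setting used in this note (independent components with mean 0 and variance 1, here drawn from a standard normal); the function name `score_variance`, the sample count, and the seed are illustrative choices.

```python
import random
import statistics

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

def score_variance(d, n_samples=2000, seed=0):
    """Empirical variance of q . k for random q, k with N(0, 1) entries."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scores.append(dot(q, k))
    return statistics.pvariance(scores)

variances = {d: score_variance(d) for d in (2, 64, 512)}
for d, var in variances.items():
    print(f"d_k = {d:3d}  empirical Var(q . k) ~ {var:.1f}")
```

Under these assumptions the printed variances should land near 2, 64, and 512, up to sampling noise. Real query and key vectors from a trained model are not exactly independent or unit-variance, so this only illustrates the trend.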

4. Tiny Intuition: Why Larger Dimensions Create Larger Scores

A 2-dimensional dot product has only two terms:

[10,10] \cdot [10,10] = 10 \cdot 10 + 10 \cdot 10 = 200

A 5-dimensional dot product has five terms:

[10,10,10,10,10] \cdot [10,10,10,10,10] = 500

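Same kind of numbers, more dimensions, larger possible dot product. With random, mixed-sign components the terms partly cancel, so the typical magnitude grows more slowly (like $\sqrt{d}$), but the spread of possible scores still widens.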

This is the basic reason score variance grows with dimension:

More dimensions mean more multiplied terms being added together.
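
The arithmetic above extends to any dimension. As a minimal sketch, a vector of $d$ tens dotted with itself gives $100 \cdot d$; the extra dimensions 64 and 512 are added here only for illustration.

```python
def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

for d in (2, 5, 64, 512):
    v = [10] * d
    print(d, dot(v, v))  # 100 * d: 200, 500, 6400, 51200
```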

5. The Variance Calculation

Assume:

The entries of $q$ and $k$ are independent random variables.
They have mean 0.
They have variance 1.

For one dot product:

s = q \cdot k = \sum_{i=1}^{d_k} q_i k_i

Because the terms are independent:

\text{Var}(s) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i)

If $q_i$ and $k_i$ each have variance 1 and mean 0, then:

\text{Var}(q_i k_i) = 1
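
The value 1 follows directly from the three assumptions. Written out for a single term, using $\text{Var}(Z) = E[Z^2] - (E[Z])^2$ and the fact that mean 0 with variance 1 implies $E[q_i^2] = E[k_i^2] = 1$:

```latex
\begin{aligned}
E[q_i k_i] &= E[q_i]\,E[k_i] = 0 \\
E[(q_i k_i)^2] &= E[q_i^2]\,E[k_i^2] = 1 \cdot 1 = 1 \\
\text{Var}(q_i k_i) &= E[(q_i k_i)^2] - \left(E[q_i k_i]\right)^2 = 1 - 0 = 1
\end{aligned}
```

Both factorizations rely on $q_i$ and $k_i$ being independent of each other.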

So:

\text{Var}(s) = \sum_{i=1}^{d_k} 1 = d_k

Therefore:

\text{Var}(q \cdot k) = d_k

This is the mathematical reason the raw attention scores spread out as $d_k$ grows: their standard deviation is $\sqrt{d_k}$, so for $d_k = 64$ a typical raw score has magnitude around 8.

6. Why High Variance Is Bad Before Softmax

Attention scores are not the final output. They are passed into softmax:

\text{softmax}(QK^T)

Softmax turns scores into probabilities.

The problem is that softmax is very sensitive to large gaps between numbers.

For example:

\text{softmax}([7,3]) \approx [0.982, 0.018]

A difference of 4 becomes an almost one-sided probability distribution.

If the scores are even more spread out:

\text{softmax}([20,5,-3]) \approx [0.9999997, 0.0000003, 0]

The largest score dominates. The others become almost zero.
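
The two examples above can be reproduced with a short softmax sketch. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([7, 3])])  # [0.982, 0.018]
print(softmax([20, 5, -3]))                    # first entry is within 1e-6 of 1
```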

flowchart LR
    S["high-variance scores"] --> SM["softmax"]
    SM --> P["one value near 1, others near 0"]
    P --> G["tiny gradients for ignored positions"]

This is the real training issue:

Large unscaled dot products can push softmax into a saturated region, where attention becomes almost one-hot and gradients through most positions become very small.

This is often described as a vanishing-gradient problem. More precisely, it is softmax saturation causing weak learning signals for many attention links.
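
To see the weak learning signal numerically, one can look at the softmax Jacobian, whose entries are $\partial p_i / \partial s_j = p_i(\delta_{ij} - p_j)$. The sketch below compares the largest Jacobian entry for the saturated example scores with the same scores divided by $\sqrt{64}$. Treating those scores as if they came from $d_k = 64$ is an assumption made only for this illustration, and the helper names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(scores):
    """J[i][j] = d p_i / d s_j = p_i * (delta_ij - p_j)."""
    p = softmax(scores)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def largest_entry(matrix):
    return max(abs(x) for row in matrix for x in row)

raw = [20.0, 5.0, -3.0]
scaled = [s / math.sqrt(64) for s in raw]  # assumes these scores came from d_k = 64

print(largest_entry(softmax_jacobian(raw)))     # on the order of 1e-7
print(largest_entry(softmax_jacobian(scaled)))  # on the order of 1e-1
```

When every Jacobian entry is close to zero, a change in any score barely moves the attention weights, so very little gradient flows back to the queries and keys.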

7. What Scaling Should Accomplish

The goal is not to make all attention scores small.

The goal is to keep their variance stable as $d_k$ changes.

Without scaling:

$d_k$	Approximate score variance
1	1
2	2
8	8
64	64
512	512

That means larger attention dimensions produce larger and sharper softmax inputs.

The desired behavior is:

$d_k$	Desired score variance after scaling
1	about 1
2	about 1
8	about 1
64	about 1
512	about 1

So the scaling factor must cancel the variance growth caused by $d_k$.

8. Why Divide by $\sqrt{d_k}$?

A key variance property is:

\text{Var}(aX) = a^2 \text{Var}(X)

The raw dot product has variance:

\text{Var}(q \cdot k) = d_k

Now scale it by:

\frac{1}{\sqrt{d_k}}

The scaled score is:

\frac{q \cdot k}{\sqrt{d_k}}

Its variance becomes:

\text{Var} \left( \frac{q \cdot k}{\sqrt{d_k}} \right) = \left(\frac{1}{\sqrt{d_k}}\right)^2 \text{Var}(q \cdot k)

Substitute:

\text{Var}(q \cdot k) = d_k

Then:

\left(\frac{1}{\sqrt{d_k}}\right)^2 d_k = \frac{1}{d_k} d_k = 1

That is why the denominator is $\sqrt{d_k}$, not $d_k$.

Dividing by $d_k$ would shrink the variance too much:

\text{Var} \left( \frac{q \cdot k}{d_k} \right) = \frac{1}{d_k^2} d_k = \frac{1}{d_k}

The goal is to normalize the variance, not crush it.
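
The three choices (no scaling, dividing by $\sqrt{d_k}$, dividing by $d_k$) can be compared empirically. This is a sketch under the same idealized assumptions as the derivation (independent N(0, 1) components); `scaled_score_variance`, the sample count, and the seed are illustrative.

```python
import math
import random
import statistics

def scaled_score_variance(d_k, scale, n_samples=2000, seed=0):
    """Empirical variance of (q . k) / scale for N(0, 1) entries."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        s = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        scores.append(s / scale)
    return statistics.pvariance(scores)

d_k = 64
results = {
    "no scaling": scaled_score_variance(d_k, 1.0),
    "divide by sqrt(d_k)": scaled_score_variance(d_k, math.sqrt(d_k)),
    "divide by d_k": scaled_score_variance(d_k, float(d_k)),
}
for name, var in results.items():
    print(f"{name:22s} variance ~ {var:.3f}")
```

With $d_k = 64$ the three variances should come out near 64, 1, and $1/64 \approx 0.016$, up to sampling noise.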

9. The Final Formula

The complete scaled dot-product attention formula is:

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V

Step by step:

Step	Operation	Meaning
1	$QK^T$	compute query-key similarity scores
2	divide by $\sqrt{d_k}$	stabilize score variance
3	softmax	convert scores into attention weights
4	multiply by $V$	mix value vectors into contextual representations

flowchart LR
    Q["Q"] --> S["QK^T"]
    K["K^T"] --> S
    S --> SCALE["divide by sqrt(d_k)"]
    SCALE --> SM["softmax"]
    SM --> MIX["multiply by V"]
    V["V"] --> MIX
    MIX --> OUT["contextual representations"]
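
The four steps can be put together in a minimal, dependency-free sketch. It handles a single attention head with no masking or batching, and the toy `Q`, `K`, and `V` values are made up for illustration rather than taken from a trained model.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v (lists of rows). Returns (output, weights)."""
    d_k = len(K[0])
    # Steps 1 and 2: query-key dot products, divided by sqrt(d_k)
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    # Step 3: row-wise softmax gives one attention distribution per query
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted mix of the value vectors
    d_v = len(V[0])
    output = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
              for row in weights]
    return output, weights

# Toy inputs with d_k = 2 (made-up numbers)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1. In this toy case the first query has equal dot products with the first and third keys, so those two keys receive equal weight.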

10. What Happens Without Scaling?

Without the scaling term:

\text{softmax}(QK^T)

large $d_k$ can make the dot products too spread out.

Then softmax may produce attention weights like:

[0.999, 0.0004, 0.0006]

Instead of a useful distribution like:

[0.55, 0.25, 0.20]

When attention becomes too sharp too early:

most tokens receive almost no attention
many value vectors barely contribute
gradients through ignored positions become tiny
training becomes slower or less stable

Scaling keeps the scores in a range where softmax can still distribute attention meaningfully.
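
The difference shows up clearly in a small simulation. The sketch below draws one random query and three random keys with $d_k = 512$ (independent N(0, 1) components, an idealization) and records the largest attention weight, with and without scaling. The function name `mean_peak_weight` and the trial count are illustrative choices.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_peak_weight(d_k, use_scaling, n_trials=200, n_keys=3, seed=0):
    """Average largest attention weight for one random query over n_keys random keys."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        scores = []
        for _ in range(n_keys):
            k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
            s = sum(qi * ki for qi, ki in zip(q, k))
            scores.append(s / math.sqrt(d_k) if use_scaling else s)
        total += max(softmax(scores))
    return total / n_trials

unscaled = mean_peak_weight(512, use_scaling=False)
scaled = mean_peak_weight(512, use_scaling=True)
print(f"d_k = 512, unscaled: mean peak weight ~ {unscaled:.3f}")
print(f"d_k = 512, scaled:   mean peak weight ~ {scaled:.3f}")
```

Without scaling, the peak weight should sit close to 1 in most trials, which is the near one-hot behavior described above. With scaling, the weights should stay noticeably more spread across the three keys.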

11. Common Confusions

Note

"$d_k$ is the number of words." No. $d_k$ is the dimension of the query/key vectors. In the example, there are 3 words but $d_k = 64$.

Note

"Dividing by $\sqrt{d_k}$ changes the shape of the attention matrix." No. It only rescales the numbers. If $QK^T$ is $3 \times 3$, the scaled matrix is still $3 \times 3$.

Note

No. Scaling is needed because the variance of dot-product scores grows with vector dimension.

Note

Softmax converts scores into probabilities, but large score gaps make it too sharp. Scaling helps before softmax is applied.

Note

"We should divide by $d_k$." Dividing by $d_k$ would make the variance roughly $1/d_k$. Dividing by $\sqrt{d_k}$ makes the variance roughly stable.

12. One-Line Interview Answer

Scaled dot-product attention divides $QK^T$ by $\sqrt{d_k}$ because the variance of query-key dot products grows with the key dimension $d_k$. Without scaling, the scores can become large, softmax can saturate, and gradients can become small. The $\sqrt{d_k}$ factor keeps the score variance approximately constant and makes training more stable.

13. Final Intuition

The dot product is an adding machine.

As the vector dimension grows, it adds more terms. More terms create larger and more variable scores. Large score gaps make softmax behave almost like a hard maximum.

Scaling by $\sqrt{d_k}$ keeps the comparison fair:

Big enough for meaningful differences, but not so big that softmax becomes saturated.

That one denominator is what turns raw dot-product attention into scaled dot-product attention.
