Scaled Dot-Product Attention

Transformers in Deep Learning - Part 5

The self-attention formula is often written as:

$$\text{Attention}(Q,K,V) = \text{softmax}(QK^T)V$$

But the actual Transformer formula from "Attention Is All You Need" is:

$$\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

The extra term is:

$$\frac{1}{\sqrt{d_k}}$$

This note explains why that scaling is needed.

Short answer:

As the dimension of query and key vectors grows, their dot products naturally become larger and more spread out. Large dot products make softmax too sharp, which can make gradients tiny. Dividing by $\sqrt{d_k}$ keeps the scores at a stable scale.


1. Quick Recap: Where the Scaling Appears

Suppose the sentence is:

love Apple phones

Each word begins as an embedding. If each embedding has 512 dimensions, the input matrix is:

$$X \in \mathbb{R}^{3 \times 512}$$

There are 3 tokens and each token has a 512-dimensional embedding.

Self-attention creates query, key, and value matrices:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

If:

$$W_Q, W_K, W_V \in \mathbb{R}^{512 \times 64}$$

then:

$$Q, K, V \in \mathbb{R}^{3 \times 64}$$

Here:

$$d_k = 64$$

$d_k$ is the dimension of each query/key vector.

flowchart LR
    X["X: 3 x 512"] --> WQ["W_Q: 512 x 64"]
    X --> WK["W_K: 512 x 64"]
    X --> WV["W_V: 512 x 64"]

    WQ --> Q["Q: 3 x 64"]
    WK --> K["K: 3 x 64"]
    WV --> V["V: 3 x 64"]

Next, attention compares every query with every key:

$$QK^T$$

The shape is:

$$(3 \times 64)(64 \times 3) = 3 \times 3$$

The output is a matrix of attention scores. Each score is a dot product between one query vector and one key vector.
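
The shapes in this recap can be checked with a small sketch. It uses plain Python lists rather than a tensor library, and the random matrices are only stand-ins (an assumption for illustration), since real embeddings and projection weights are learned. The helper names `matmul`, `transpose`, `random_matrix`, and `shape` are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    """Stand-in for learned values: entries drawn from a standard normal."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def shape(m):
    return (len(m), len(m[0]))

rng = random.Random(0)
n_tokens, d_model, d_k = 3, 512, 64

X = random_matrix(n_tokens, d_model, rng)   # 3 tokens, 512-dim embeddings
W_Q = random_matrix(d_model, d_k, rng)      # 512 x 64 projections
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)
scores = matmul(Q, transpose(K))            # every query against every key

print(shape(Q), shape(K), shape(V), shape(scores))
# prints: (3, 64) (3, 64) (3, 64) (3, 3)
```

The score matrix is 3 x 3 because there are 3 tokens. The value 64 only determines how many terms go into each entry.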


2. What $QK^T$ Really Contains

For the sentence:

love Apple phones

the score matrix contains pairwise relationships:

| Score | Meaning |
| --- | --- |
| $q_{\text{love}} \cdot k_{\text{love}}$ | love compared with love |
| $q_{\text{Apple}} \cdot k_{\text{love}}$ | Apple compared with love |
| $q_{\text{Apple}} \cdot k_{\text{phones}}$ | Apple compared with phones |
| $q_{\text{phones}} \cdot k_{\text{Apple}}$ | phones compared with Apple |

Each score is produced by multiplying and adding 64 pairs of numbers:

$$q \cdot k = q_1k_1 + q_2k_2 + \cdots + q_{64}k_{64}$$

That repeated addition is the key.

The larger $d_k$ becomes, the more terms are added into each dot product.


3. The Core Problem: Dot Products Grow With Dimension

Imagine two random vectors:

$$q = [q_1, q_2, \dots, q_d], \quad k = [k_1, k_2, \dots, k_d]$$

Their dot product is:

$$q \cdot k = \sum_{i=1}^{d} q_i k_i$$

If the components are roughly independent, centered around zero, and similarly scaled, each term $q_i k_i$ contributes some variance.

When $d$ terms are added, the variances add.

So:

$$\text{Var}(q \cdot k) \propto d$$

For attention:

$$\text{Var}(q \cdot k) \propto d_k$$

This means the size of the attention scores naturally grows as the query/key dimension grows.

flowchart TD
    D1["small d_k"] --> T1["few terms in dot product"]
    T1 --> V1["lower score variance"]

    D2["large d_k"] --> T2["many terms in dot product"]
    T2 --> V2["higher score variance"]

The shape of $QK^T$ may stay the same, but the numbers inside it become more spread out.

For 3 tokens:

$$QK^T \in \mathbb{R}^{3 \times 3}$$

whether $d_k = 2$, $64$, or $512$.

But the variance of the values inside that $3 \times 3$ matrix grows with $d_k$.
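
A quick simulation can make the proportionality visible. This is a minimal sketch that assumes independent standard-normal entries for the query and key vectors (trained vectors need not look like this), and `score_variance` is a hypothetical helper name.

```python
import random
import statistics

def dot(q, k):
    """Dot product: multiply matching entries and add them up."""
    return sum(qi * ki for qi, ki in zip(q, k))

def score_variance(d, trials, rng):
    """Sample variance of q . k over `trials` random pairs of d-dimensional vectors."""
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append(dot(q, k))
    return statistics.variance(samples)

rng = random.Random(0)
variances = {d: score_variance(d, 2000, rng) for d in (2, 64, 512)}

for d, v in variances.items():
    # The ratio should hover around 1 if the variance is proportional to d_k.
    print(f"d_k = {d:>3}   variance ~ {v:8.1f}   variance / d_k ~ {v / d:.2f}")
```

The estimates are noisy because they come from a finite sample, but the ratio of variance to dimension should stay close to 1 across all three sizes.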


4. Tiny Intuition: Why Larger Dimensions Create Larger Scores

A 2-dimensional dot product has only two terms:

$$[10,10] \cdot [10,10] = 10 \cdot 10 + 10 \cdot 10 = 200$$

A 5-dimensional dot product has five terms:

$$[10,10,10,10,10] \cdot [10,10,10,10,10] = 500$$

Same kind of numbers, more dimensions, larger possible dot product.

This is the basic reason score variance grows with dimension:

More dimensions mean more multiplied terms being added together.


5. The Variance Calculation

Assume:

  • The entries of $q$ and $k$ are independent random variables.
  • They have mean 0.
  • They have variance 1.

For one dot product:

$$s = q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

Because the terms are independent:

$$\text{Var}(s) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i)$$

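If $q_i$ and $k_i$ are independent, each with mean 0 and variance 1, then $E[q_i k_i] = E[q_i]E[k_i] = 0$ and $E[q_i^2] = E[k_i^2] = 1$, so:

$$\text{Var}(q_i k_i) = E[q_i^2]\,E[k_i^2] - \left(E[q_i k_i]\right)^2 = 1 \cdot 1 - 0 = 1$$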

So:

$$\text{Var}(s) = \sum_{i=1}^{d_k} 1 = d_k$$

Therefore:

$$\text{Var}(q \cdot k) = d_k$$

This is the mathematical reason the raw attention scores spread out as $d_k$ grows: under these assumptions their standard deviation is $\sqrt{d_k}$.
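
The derivation only uses independence, mean 0, and variance 1, not any particular distribution. As a hedged check, the sketch below draws entries from a uniform distribution on $[-\sqrt{3}, \sqrt{3}]$, which has mean 0 and variance 1. The choice of distribution and the helper names `unit_uniform` and `sample_score` are assumptions made for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)
HALF_WIDTH = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has mean 0, variance 1

def unit_uniform():
    """One non-Gaussian entry with mean 0 and variance 1."""
    return rng.uniform(-HALF_WIDTH, HALF_WIDTH)

def sample_score(d_k):
    """One dot product s = sum_i q_i * k_i with independent entries."""
    return sum(unit_uniform() * unit_uniform() for _ in range(d_k))

# Variance of a single product term q_i * k_i (expected: about 1).
term_var = statistics.variance([sample_score(1) for _ in range(20000)])

# Variance of the full dot product for d_k = 64 (expected: about 64).
score_var = statistics.variance([sample_score(64) for _ in range(5000)])

print(f"Var(q_i k_i)      ~ {term_var:.2f}")
print(f"Var(s), d_k = 64  ~ {score_var:.1f}")
```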


6. Why High Variance Is Bad Before Softmax

Attention scores are not the final output. They are passed into softmax:

$$\text{softmax}(QK^T)$$

Softmax turns scores into probabilities.

The problem is that softmax is very sensitive to large gaps between numbers.

For example:

$$\text{softmax}([7,3]) \approx [0.982, 0.018]$$

A difference of 4 becomes an almost one-sided probability distribution.

If the scores are even more spread out:

$$\text{softmax}([20,5,-3]) \approx [0.9999997, 0.0000003, 0]$$

The largest score dominates. The others become almost zero.

flowchart LR
    S["high-variance scores"] --> SM["softmax"]
    SM --> P["one value near 1, others near 0"]
    P --> G["tiny gradients for ignored positions"]

This is the real training issue:

Large unscaled dot products can push softmax into a saturated region, where attention becomes almost one-hot and gradients through most positions become very small.

This is often described as a vanishing-gradient problem. More precisely, it is softmax saturation causing weak learning signals for many attention links.
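
The saturation effect can be inspected numerically. The sketch below implements softmax and its Jacobian, using the standard identity $\partial p_i / \partial s_j = p_i(\delta_{ij} - p_j)$, and reports the largest entry of the Jacobian for a mild score vector and for the two examples above. The helper `largest_gradient` is a hypothetical name, and the mild vector `[1.0, 0.5, -0.5]` is an illustrative assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(probs):
    """J[i][j] = d p_i / d s_j = p_i * (delta_ij - p_j)."""
    n = len(probs)
    return [
        [probs[i] * ((1.0 if i == j else 0.0) - probs[j]) for j in range(n)]
        for i in range(n)
    ]

def largest_gradient(scores):
    """Largest absolute entry of the softmax Jacobian at these scores."""
    jac = softmax_jacobian(softmax(scores))
    return max(abs(x) for row in jac for x in row)

for scores in ([1.0, 0.5, -0.5], [7.0, 3.0], [20.0, 5.0, -3.0]):
    probs = [round(p, 7) for p in softmax(scores)]
    print(scores, "->", probs, f"max |dp/ds| = {largest_gradient(scores):.2e}")
```

As the scores spread out, every entry of the Jacobian shrinks, so a small change in any score barely moves the attention weights. That is the weak learning signal described above.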


7. What Scaling Should Accomplish

The goal is not to make all attention scores small.

The goal is to keep their variance stable as $d_k$ changes.

Without scaling:

| $d_k$ | Approximate score variance |
| --- | --- |
| 1 | 1 |
| 2 | 2 |
| 8 | 8 |
| 64 | 64 |
| 512 | 512 |

That means larger attention dimensions produce larger and sharper softmax inputs.

The desired behavior is:

| $d_k$ | Desired score variance after scaling |
| --- | --- |
| 1 | about 1 |
| 2 | about 1 |
| 8 | about 1 |
| 64 | about 1 |
| 512 | about 1 |

So the scaling factor must cancel the variance growth caused by $d_k$.


8. Why Divide by $\sqrt{d_k}$?

A key variance property is:

$$\text{Var}(aX) = a^2 \text{Var}(X)$$

The raw dot product has variance:

$$\text{Var}(q \cdot k) = d_k$$

Now scale it by:

$$\frac{1}{\sqrt{d_k}}$$

The scaled score is:

$$\frac{q \cdot k}{\sqrt{d_k}}$$

Its variance becomes:

$$\text{Var} \left( \frac{q \cdot k}{\sqrt{d_k}} \right) = \left(\frac{1}{\sqrt{d_k}}\right)^2 \text{Var}(q \cdot k)$$

Substitute:

$$\text{Var}(q \cdot k) = d_k$$

Then:

$$\left(\frac{1}{\sqrt{d_k}}\right)^2 d_k = \frac{1}{d_k} d_k = 1$$

That is why the denominator is $\sqrt{d_k}$, not $d_k$.

Dividing by $d_k$ would shrink the variance too much:

$$\text{Var} \left( \frac{q \cdot k}{d_k} \right) = \frac{1}{d_k^2} d_k = \frac{1}{d_k}$$

The goal is to normalize the variance, not crush it.
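
The two candidate denominators can be compared on the same sample of raw scores. This is a sketch under the earlier assumptions (independent entries with mean 0 and variance 1, drawn here from a standard normal), and `raw_scores` is a hypothetical helper name.

```python
import math
import random
import statistics

rng = random.Random(2)

def raw_scores(d_k, trials):
    """Unscaled dot products of random d_k-dimensional query/key pairs."""
    return [
        sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        for _ in range(trials)
    ]

d_k = 64
raw = raw_scores(d_k, 5000)

var_raw = statistics.variance(raw)                                # about d_k
var_sqrt = statistics.variance([s / math.sqrt(d_k) for s in raw]) # about 1
var_full = statistics.variance([s / d_k for s in raw])            # about 1 / d_k

print(f"no scaling:          {var_raw:.2f}")
print(f"divide by sqrt(d_k): {var_sqrt:.3f}")
print(f"divide by d_k:       {var_full:.4f}")
```

Dividing by $d_k$ leaves scores so close together that softmax would be nearly uniform, which is the opposite failure from saturation.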


9. The Final Formula

The complete scaled dot-product attention formula is:

$$\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

Step by step:

| Step | Operation | Meaning |
| --- | --- | --- |
| 1 | $QK^T$ | compute query-key similarity scores |
| 2 | divide by $\sqrt{d_k}$ | stabilize score variance |
| 3 | softmax | convert scores into attention weights |
| 4 | multiply by $V$ | mix value vectors into contextual representations |

flowchart LR
    Q["Q"] --> S["QK^T"]
    K["K^T"] --> S
    S --> SCALE["divide by sqrt(d_k)"]
    SCALE --> SM["softmax"]
    SM --> MIX["multiply by V"]
    V["V"] --> MIX
    MIX --> OUT["contextual representations"]
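
The four steps can be written out as a minimal sketch in plain Python. It covers a single attention head with no masking, batching, or dropout, which are simplifications assumed here. The function name `scaled_dot_product_attention` and the tiny `Q`, `K`, `V` matrices are illustrative, not taken from any library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])                                    # query/key dimension
    scores = matmul(Q, transpose(K))                   # step 1: QK^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 2
    weights = [softmax(row) for row in scaled]         # step 3: row-wise softmax
    output = matmul(weights, V)                        # step 4: mix value vectors
    return output, weights

# Toy input: 3 tokens, d_k = 2, value dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each row of `output` is a weighted average of the rows of `V`.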

10. What Happens Without Scaling?

Without the scaling term:

$$\text{softmax}(QK^T)$$

large $d_k$ can make the dot products too spread out.

Then softmax may produce attention weights like:

$$[0.999, 0.0004, 0.0006]$$

Instead of a useful distribution like:

$$[0.55, 0.25, 0.20]$$

When attention becomes too sharp too early:

  • most tokens receive almost no attention
  • many value vectors barely contribute
  • gradients through ignored positions become tiny
  • training becomes slower or less stable

Scaling keeps the scores in a range where softmax can still distribute attention meaningfully.
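
The difference in sharpness can be estimated with a small experiment. The sketch below assumes random, untrained query and key vectors with $d_k = 512$ and 3 keys, and measures the average of the largest attention weight with and without scaling. Trained models may behave differently, so this only illustrates the starting conditions.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(3)
d_k, n_keys, trials = 512, 3, 300

top_unscaled = 0.0
top_scaled = 0.0
for _ in range(trials):
    q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]
    raw = [dot(q, k) for k in keys]
    # Same raw scores, with and without the 1 / sqrt(d_k) factor.
    top_unscaled += max(softmax(raw))
    top_scaled += max(softmax([s / math.sqrt(d_k) for s in raw]))

top_unscaled /= trials
top_scaled /= trials

print(f"average largest weight, unscaled: {top_unscaled:.3f}")
print(f"average largest weight, scaled:   {top_scaled:.3f}")
```

Without scaling, the largest weight sits close to 1 on average, so attention is nearly one-hot before any learning has happened. With scaling, the weight is shared more evenly across the three keys.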


11. Common Confusions

Note

"$d_k$ is the number of words." No. $d_k$ is the dimension of the query/key vectors. In the example, there are 3 words but $d_k = 64$.

Note

"$\sqrt{d_k}$ changes the shape of the attention matrix." No. It only rescales the numbers. If $QK^T$ is $3 \times 3$, the scaled matrix is still $3 \times 3$.

Note

Scaling is needed because the variance of dot-product scores grows with vector dimension.

Note

Softmax converts scores into probabilities, but large score gaps make it too sharp. Scaling helps before softmax is applied.

Note

Dividing by $d_k$ instead would make the variance roughly $1/d_k$. Dividing by $\sqrt{d_k}$ keeps the variance roughly stable.


12. One-Line Interview Answer

Scaled dot-product attention divides $QK^T$ by $\sqrt{d_k}$ because the variance of query-key dot products grows with the key dimension $d_k$. Without scaling, the scores can become large, softmax can saturate, and gradients can become small. The $\sqrt{d_k}$ factor keeps the score variance approximately constant and makes training more stable.


13. Final Intuition

The dot product is an adding machine.

As the vector dimension grows, it adds more terms. More terms create larger and more variable scores. Large score gaps make softmax behave almost like a hard maximum.

Scaling by $\sqrt{d_k}$ keeps the comparison fair:

Big enough for meaningful differences, but not so big that softmax becomes saturated.

That one denominator is what turns raw dot-product attention into scaled dot-product attention.