Positional Encoding in Transformers
Transformers in Deep Learning - Part 7
Sibling notes: Self Attention in Transformers - The Deep Dive · Multi-Head Attention - The Deep Dive
Self-attention processes tokens in parallel.
That is one of its biggest strengths:
- it is faster than sequential recurrence
- it can compare distant words directly
- it can build contextual representations efficiently
But parallel processing creates a serious problem:
Self-attention by itself has no built-in sense of word order.
The Transformer solves this by adding positional encoding to the input embeddings.
1. Why Position Matters
Consider these two sentences:
The cat chased the mouse.
The mouse chased the cat.
They contain exactly the same words, but the meanings differ because the order differs.
| Sentence | Meaning |
|---|---|
| cat chased mouse | cat is the subject, mouse is the object |
| mouse chased cat | mouse is the subject, cat is the object |
If a model sees only a bag of token embeddings, it cannot know which word came first, second, or third.
That is the problem positional encoding solves.
2. The Core Problem With Self-Attention
Self-attention compares every token with every other token:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
This operation is powerful, but without positional information it is permutation-equivariant: if the tokens are reordered, the outputs are reordered in the same way, with no built-in knowledge of which order is the correct sentence order.
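A minimal sketch of that permutation behavior, assuming a toy attention layer with identity projections (Q = K = V = X) and made-up 2-dimensional embeddings. The names `self_attention`, `cat`, `chased`, and `mouse` are illustrative only.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Toy self-attention with Q = K = V = X (no learned projections, no positions)."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d_k)])
    return out

# Hypothetical embeddings; a real model would learn these.
cat, chased, mouse = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

out_a = self_attention([cat, chased, mouse])  # "cat chased mouse"
out_b = self_attention([mouse, chased, cat])  # "mouse chased cat"

# Reversing the input reverses the output rows; each word's vector is unchanged.
same = all(
    math.isclose(x, y, abs_tol=1e-12)
    for row_a, row_b in zip(out_a, reversed(out_b))
    for x, y in zip(row_a, row_b)
)
print(same)
```

Each word ends up with the same output vector in both sentences, so nothing downstream can tell which word came first.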
If the model receives the same token embeddings without position information, "cat chased mouse" and "mouse chased cat" yield the same set of token representations, just in a different order.
The model needs two kinds of information:
| Information type | Example |
|---|---|
| semantic information | what the word means |
| positional information | where the word appears |
Word embeddings provide semantic information.
Positional encodings provide order information.
3. A Naive Idea: Add Position Numbers
One first idea is:
Give each token its position number.
For example:
| Token | Position |
|---|---|
| The | 1 |
| cat | 2 |
| chased | 3 |
| the | 4 |
| mouse | 5 |
This seems simple, but it has problems.
Problem 1: Unbounded Values
Longer sequences create ever larger position values, with no upper bound.
Large raw values can create scale problems during training.
Problem 2: Poor Geometry
The model should understand relative distance:
- position 10 is near position 11
- position 10 is far from position 50
- the offset from 10 to 15 is similar to the offset from 80 to 85
Raw integers do not give the model a rich, smooth geometry for this.
Problem 3: Hard Generalization
If position is just a number, the model may learn position-specific quirks rather than reusable patterns like:
"attend to the previous word"
"attend five tokens back"
"attend to a nearby noun"
So a better encoding should be:
- bounded
- continuous
- patterned
- useful for relative position
- compatible with word embeddings
4. Why Sine and Cosine Are Useful
Sine and cosine have exactly the kind of behavior we want:
| Property | Why it helps |
|---|---|
| bounded | values stay between -1 and 1 |
| continuous | nearby positions change smoothly |
| periodic | creates repeating structure |
| frequency-controlled | different dimensions can track different scales |
A single sine wave is not enough:
Because sine is periodic, two different positions may eventually produce the same value.
So the Transformer does not use one sine value.
It uses many sine and cosine waves with different frequencies.
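The collision problem can be sketched in a few lines. The wavelengths 8 and 50 are arbitrary illustrative values, and `single_wave` and `two_waves` are hypothetical helper names.

```python
import math

def single_wave(pos, wavelength=8):
    """Encode a position with one sine wave of the given wavelength (in tokens)."""
    return math.sin(2 * math.pi * pos / wavelength)

# Positions 1 and 9 are exactly one wavelength apart, so one wave cannot separate them.
collision = math.isclose(single_wave(1), single_wave(9), abs_tol=1e-9)

def two_waves(pos):
    """Pair the fast wave with a slower one."""
    return (single_wave(pos, wavelength=8), single_wave(pos, wavelength=50))

# The slower wave takes different values at positions 1 and 9.
distinct = not math.isclose(two_waves(1)[1], two_waves(9)[1], abs_tol=1e-9)
print(collision, distinct)
```

Adding more waves at other frequencies pushes collisions further out, which is the idea behind using a whole vector of them.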
5. From One Wave to Many Waves
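Instead of encoding a position as one number, encode it as a vector:
$$PE(pos) = [\sin(\omega_0 \, pos), \cos(\omega_0 \, pos), \sin(\omega_1 \, pos), \cos(\omega_1 \, pos), \dots]$$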
Each sine/cosine pair uses a different frequency.
Fast-changing waves help distinguish nearby positions.
Slow-changing waves help represent longer-range structure.
flowchart TD
P["position"] --> F1["high-frequency sin/cos"]
P --> F2["medium-frequency sin/cos"]
P --> F3["low-frequency sin/cos"]
F1 --> PE["positional encoding vector"]
F2 --> PE
F3 --> PE
The result is a positional fingerprint: not a single number, but a structured vector.
6. The Transformer Positional Encoding Formula
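The original Transformer uses:
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where:
| Symbol | Meaning |
|---|---|
| $pos$ | token position in the sequence |
| $i$ | dimension-pair index |
| $d_{model}$ | embedding/model dimension |
| $2i$ | even dimension |
| $2i+1$ | odd dimension |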
Even dimensions use sine.
Odd dimensions use cosine.
For a model dimension $d_{model}$ there are $d_{model}/2$ sine/cosine pairs, for example 256 pairs when $d_{model} = 512$.
Each pair uses a different wavelength.
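A minimal sketch of the sinusoidal scheme, where even dimension 2i holds sin(pos / 10000^(2i/d_model)) and odd dimension 2i+1 holds the matching cosine. The function names `sinusoidal_pe` and `pe_matrix` and the tiny `d_model=4` are assumptions chosen for readability.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimension
        pe[2 * i + 1] = math.cos(angle)  # odd dimension
    return pe

def pe_matrix(seq_len, d_model):
    """Stack the encodings of positions 0..seq_len-1, one row per position."""
    return [sinusoidal_pe(pos, d_model) for pos in range(seq_len)]

pe = pe_matrix(seq_len=3, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1 in every pair.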
7. What the Formula Produces
For each position, the formula creates a vector with the same dimension as the word embedding.
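If the embedding dimension is $d_{model}$, then each position's encoding is a vector:
$$PE(pos) \in \mathbb{R}^{d_{model}}$$
For a sequence of 3 tokens, the encodings stack into a matrix:
$$PE \in \mathbb{R}^{3 \times d_{model}}$$
Each row is the positional encoding for one token position: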
| Row | Meaning |
|---|---|
| row 0 | position 0 |
| row 1 | position 1 |
| row 2 | position 2 |
The model receives:
$$X = E + PE$$
where:
- $E$ is the token embedding matrix
- $PE$ is the positional encoding matrix
8. Why Different Frequencies Matter
Low-index dimensions change quickly.
High-index dimensions change slowly.
This creates a multi-scale position signal:
| Dimension type | Behavior |
|---|---|
| high frequency | changes quickly across nearby positions |
| low frequency | changes slowly across far positions |
Nearby positions have similar positional vectors.
Farther positions differ across more dimensions.
This gives the model a smooth way to reason about both local and long-range order.
flowchart LR
P1["nearby positions"] --> S1["similar PE vectors"]
P2["far positions"] --> S2["larger differences across dimensions"]
The encoding does not merely label positions. It creates a structured geometry of positions.
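The multi-scale behavior can be checked numerically. This sketch assumes the sinusoidal formula with base 10000 and an illustrative `d_model = 64`; it compares one nearby pair and one distant pair of positions, which is an example rather than a proof that distance grows with every offset.

```python
import math

def sinusoidal_pe(pos, d_model):
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i], pe[2 * i + 1] = math.sin(angle), math.cos(angle)
    return pe

def distance(u, v):
    """Euclidean distance between two encoding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_model = 64  # illustrative value

# Wavelength (in tokens) of pair i: 2*pi * 10000^(2i/d_model).
wavelengths = [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]

near = distance(sinusoidal_pe(10, d_model), sinusoidal_pe(11, d_model))
far = distance(sinusoidal_pe(10, d_model), sinusoidal_pe(50, d_model))

print(round(wavelengths[0], 2), round(wavelengths[-1], 2))  # shortest, longest
print(near < far)
```

The first pair completes a cycle in roughly six tokens, while the last pair takes tens of thousands of tokens, so the low-index dimensions separate neighbors and the high-index dimensions vary only over long spans.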
9. The Relative Position Property
A key motivation for sine/cosine encodings is that relative shifts have a predictable structure.
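For any fixed offset $k$, the encoding of:
$$PE(pos + k)$$
can be represented as a linear transformation of the encoding of:
$$PE(pos)$$
That is:
$$PE(pos + k) = T_k \, PE(pos)$$
for a transformation matrix $T_k$ determined by the offset $k$ alone, not by $pos$.
This is not magic. It comes from trigonometric identities:
$$\sin(a + b) = \sin a \cos b + \cos a \sin b$$
$$\cos(a + b) = \cos a \cos b - \sin a \sin b$$
For each sine/cosine pair, shifting by $k$ behaves like a rotation.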
flowchart LR
PE1["PE(pos)"] --> T["offset transform T_k"]
T --> PE2["PE(pos + k)"]
This helps the model learn relative patterns like:
- previous token
- next token
- five tokens back
- nearby phrase
- long-range dependency
The model is not explicitly programmed to calculate these offsets. The encoding makes those relationships easier to learn.
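The rotation view can be verified directly. This sketch assumes the standard sinusoidal formula and applies a 2x2 rotation to every sine/cosine pair; `shift` is a hypothetical helper that plays the role of the offset transform, and `d_model = 8` and `k = 5` are arbitrary.

```python
import math

def sinusoidal_pe(pos, d_model):
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i], pe[2 * i + 1] = math.sin(angle), math.cos(angle)
    return pe

def shift(pe, k):
    """Rotate each (sin, cos) pair by the angle that offset k adds to that pair."""
    d_model = len(pe)
    out = [0.0] * d_model
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out[2 * i] = s * math.cos(theta) + c * math.sin(theta)      # sin(a + b)
        out[2 * i + 1] = c * math.cos(theta) - s * math.sin(theta)  # cos(a + b)
    return out

d_model, k = 8, 5
results = []
for pos in (3, 40):
    shifted = shift(sinusoidal_pe(pos, d_model), k)
    target = sinusoidal_pe(pos + k, d_model)
    results.append(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, target)))
print(results)
```

`shift` never looks at the starting position: the same transform moves position 3 to 8 and position 40 to 45, which is what makes a fixed offset learnable as a single linear map.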
10. Why Add Instead of Concatenate?
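Another possible design would be concatenation:
$$X = [E \,;\, PE]$$
If both the token embedding $E$ and the positional encoding $PE$ are 512-dimensional, concatenation creates a 1024-dimensional input.
That would increase the size of the projection matrices:
| Design | Input dimension | Example shape |
|---|---|---|
| addition | 512 | $W_Q \in \mathbb{R}^{512 \times d_k}$ |
| concatenation | 1024 | $W_Q \in \mathbb{R}^{1024 \times d_k}$ |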
Concatenation would add many extra parameters to:
- every attention head
- every Transformer layer
Addition keeps the input dimension unchanged for a sequence of $n$ tokens:
$$X = E + PE, \quad X \in \mathbb{R}^{n \times 512}$$
So the model gets position information without doubling the dimensionality.
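A rough parameter count makes the difference concrete. The head count of 8 and `d_k = 64` are taken from the base configuration of the original Transformer and are used here as illustrative assumptions; the count covers only the Q, K, V projection weights of one layer, ignores biases, and assumes the wider input is carried through the layer.

```python
def qkv_params(d_in, d_k, n_heads):
    """Weights in one layer's Q, K, V projections (biases ignored)."""
    return 3 * n_heads * d_in * d_k

added = qkv_params(d_in=512, d_k=64, n_heads=8)          # addition: input stays 512-wide
concatenated = qkv_params(d_in=1024, d_k=64, n_heads=8)  # concatenation: input is 1024-wide

print(added, concatenated, concatenated // added)
```

Under these assumptions the projection weights double, and that cost repeats in every layer that keeps the wider input.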
11. Why Addition Does Not Simply Destroy the Embedding
At first, addition seems suspicious.
If the token embeddings $E$ and positional encodings $PE$ are summed:
$$X = E + PE$$
then semantic and positional information are mixed into the same vector.
How can the model still use them?
The answer is:
The model learns projections that interpret the combined signal.
Word embeddings are learned semantic vectors.
Sinusoidal positional encodings are fixed, structured wave patterns.
Because the position signal has a regular structure across dimensions, the following learned matrices can use it:
$$W_Q, \quad W_K, \quad W_V$$
and later feed-forward layers can also learn how much to rely on semantic features versus positional features.
Addition does not mean the two signals are perfectly independent. It means the model receives a combined vector with enough structure to recover useful semantic and positional information during training.
12. Shape Walkthrough
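Suppose the input has 3 tokens:
love Apple phones
and the model dimension is $d_{model}$.
The embedding matrix is:
$$E \in \mathbb{R}^{3 \times d_{model}}$$
The positional encoding matrix is:
$$PE \in \mathbb{R}^{3 \times d_{model}}$$
They are added element-wise:
$$X = E + PE$$
So:
$$X \in \mathbb{R}^{3 \times d_{model}}$$
Then self-attention proceeds normally:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$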
The Transformer now has access to both:
- what each token means
- where each token appears
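The walkthrough can be traced with small numbers. This is a sketch only: `d_model = 4` is a tiny illustrative value and the entries of `E` are invented, since a real model would learn them.

```python
import math

def sinusoidal_pe(pos, d_model):
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i], pe[2 * i + 1] = math.sin(angle), math.cos(angle)
    return pe

tokens = ["love", "Apple", "phones"]
d_model = 4  # illustrative value

# Hypothetical token embeddings, one row per token.
E = [
    [0.2, -0.1, 0.4, 0.3],
    [0.7, 0.5, -0.2, 0.1],
    [-0.3, 0.6, 0.1, -0.4],
]

# One positional row per token position 0, 1, 2.
PE = [sinusoidal_pe(pos, d_model) for pos in range(len(tokens))]

# Element-wise addition keeps the shape (3, d_model).
X = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, PE)]

print(len(X), len(X[0]))
```

`X` has the same shape as `E`, so the query, key, and value projections need no change to accept it.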
13. Learned vs Fixed Positional Encodings
The original Transformer used fixed sinusoidal positional encodings.
Many later models use learned positional embeddings instead.
| Type | Description |
|---|---|
| fixed sinusoidal | computed from sine/cosine formula |
| learned positional embedding | learned like ordinary parameters |
Fixed sinusoidal encodings have a useful built-in relative-position structure.
Learned positional embeddings can adapt to the training data, but may not extrapolate as naturally beyond trained sequence lengths.
Modern Transformer variants also use other approaches, such as relative position bias and rotary positional embeddings. The central need is the same:
Attention needs some way to know order.
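One practical difference between the fixed and learned options can be sketched as follows. The limits `max_len = 16` and `d_model = 8` and the randomly initialized `learned_table` are hypothetical stand-ins for a trained position table.

```python
import math
import random

def sinusoidal_pe(pos, d_model):
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i], pe[2 * i + 1] = math.sin(angle), math.cos(angle)
    return pe

max_len, d_model = 16, 8  # hypothetical training-time limits

# A learned table has one trainable row per position seen in training.
random.seed(0)
learned_table = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_len)]

def learned_pe(pos):
    """Look up a learned row; there is no row for positions >= max_len."""
    return learned_table[pos]

# The fixed formula can be evaluated at any position.
fixed_ok = len(sinusoidal_pe(1000, d_model)) == d_model

try:
    learned_pe(1000)
    learned_ok = True
except IndexError:
    learned_ok = False

print(fixed_ok, learned_ok)
```

Being able to compute an encoding for position 1000 does not by itself mean a model trained on short sequences will handle it well; it only means the encoding exists.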
14. Common Confusions
Does self-attention know word order on its own? No. Plain self-attention needs positional information added or built into the attention mechanism.
Why not use raw position integers? They are simple, but not ideal. They are unbounded and do not provide a rich smooth structure for relative position.
Do sine/cosine encodings make every position unique? Not exactly. They provide a highly useful multi-frequency pattern over practical sequence lengths, but periodic functions are not an infinite uniqueness guarantee.
Is the positional encoding added or concatenated? In the original Transformer, it is added element-wise to the token embedding.
How can the model separate meaning from position after addition? The signals are mixed, but the model can learn to use the structured positional pattern through its learned projections.
15. One-Line Interview Answer
Positional encoding gives Transformers information about token order, because self-attention processes tokens in parallel and has no inherent sequence order. The original Transformer adds fixed sine/cosine vectors to token embeddings, using many frequencies so nearby and distant positions have learnable, structured relationships. Addition keeps the model dimension unchanged while giving attention access to both word meaning and position.
16. Final Intuition
Word embeddings answer:
What does this token mean?
Positional encodings answer:
Where does this token appear?
Self-attention needs both.
Without positional encoding, the model sees words but not order.
With positional encoding, the input becomes the sum of token embeddings $E$ and positional encodings $PE$:
$$X = E + PE$$
That lets the Transformer keep its parallel speed while still understanding sequence structure.