Positional Encoding in Transformers

Transformers in Deep Learning - Part 7
Sibling notes: Self Attention in Transformers - The Deep Dive · Multi-Head Attention - The Deep Dive

Self-attention processes tokens in parallel.

That is one of its biggest strengths:

  • it is faster than sequential recurrence
  • it can compare distant words directly
  • it can build contextual representations efficiently

But parallel processing creates a serious problem:

Self-attention by itself has no built-in sense of word order.

The Transformer solves this by adding positional encoding to the input embeddings.


1. Why Position Matters

Consider these two sentences:

The cat chased the mouse.

The mouse chased the cat.

They contain exactly the same words, but the meanings are different because the order is different.

| Sentence | Meaning |
| --- | --- |
| cat chased mouse | cat is the subject, mouse is the object |
| mouse chased cat | mouse is the subject, cat is the object |

If a model sees only a bag of token embeddings, it cannot know which word came first, second, or third.

That is the problem positional encoding solves.


2. The Core Problem With Self-Attention

Self-attention compares every token with every other token:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This operation is powerful, but without positional information it is permutation-equivariant: if the tokens are reordered, the outputs are reordered in the same way, with no built-in knowledge of which order is the correct sentence order.

If the model receives the same token embeddings without position information, then:

"cat chased mouse" and "mouse chased cat" look too similar.

The model needs two kinds of information:

| Information type | Example |
| --- | --- |
| semantic information | what the word means |
| positional information | where the word appears |

Word embeddings provide semantic information.

Positional encodings provide order information.
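The permutation-equivariance claim in this section can be checked numerically. The sketch below is a minimal single-head attention in plain Python. It assumes identity projections (Q = K = V = X) and made-up 3-dimensional toy embeddings, both chosen only to keep the example short; a real model would use learned projection matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with Q = K = V = X (assumption: identity projections)."""
    d_k = len(X[0])
    outputs = []
    for query in X:
        # Scaled dot product of this query with every key.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in X]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * value[j] for w, value in zip(weights, X)) for j in range(d_k)])
    return outputs

def close(u, v, tol=1e-9):
    """True if two vectors agree up to floating-point noise."""
    return all(abs(a - b) < tol for a, b in zip(u, v))

# Hypothetical toy embeddings with no positional information.
cat, chased, mouse = [1.0, 0.0, 0.5], [0.0, 1.0, 0.2], [0.3, 0.4, 1.0]

out_a = self_attention([cat, chased, mouse])  # "cat chased mouse"
out_b = self_attention([mouse, chased, cat])  # "mouse chased cat"

# Each token receives the same output vector in both sentences;
# only the order of the rows changes.
print(close(out_a[0], out_b[2]))  # cat
print(close(out_a[1], out_b[1]))  # chased
print(close(out_a[2], out_b[0]))  # mouse
```

Under these assumptions, the representation of "cat" is identical whether it is the subject or the object, which is why order information has to be supplied separately.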


3. A Naive Idea: Add Position Numbers

One first idea is:

Give each token its position number.

For example:

| Token | Position |
| --- | --- |
| The | 1 |
| cat | 2 |
| chased | 3 |
| the | 4 |
| mouse | 5 |

This seems simple, but it has problems.

Problem 1: Unbounded Values

Longer sequences create larger numbers:

$$1, 2, 3, \dots, 1000, \dots$$

Large raw values can dwarf the token embedding values they are combined with, which creates scale problems during training.

Problem 2: Poor Geometry

The model should understand relative distance:

  • position 10 is near position 11
  • position 10 is far from position 50
  • the offset from 10 to 15 is similar to the offset from 80 to 85

Raw integers do not give the model a rich, smooth geometry for this.

Problem 3: Hard Generalization

If position is just a number, the model may learn position-specific quirks rather than reusable patterns like:

"attend to the previous word"
"attend five tokens back"
"attend to a nearby noun"

So a better encoding should be:

  • bounded
  • continuous
  • patterned
  • useful for relative position
  • compatible with word embeddings

4. Why Sine and Cosine Are Useful

Sine and cosine have exactly the kind of behavior we want:

| Property | Why it helps |
| --- | --- |
| bounded | values stay between -1 and 1 |
| continuous | nearby positions change smoothly |
| periodic | creates repeating structure |
| frequency-controlled | different dimensions can track different scales |

A single sine wave is not enough:

$$\sin(pos)$$

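Because sine is periodic, different positions can produce the same or nearly the same value. For integer positions, $\sin(0) = 0$ while $\sin(22) \approx -0.009$, so positions 0 and 22 are almost indistinguishable.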

So the Transformer does not use one sine value.

It uses many sine and cosine waves with different frequencies.


5. From One Wave to Many Waves

Instead of encoding a position as one number, encode it as a vector:

$$PE(pos) = [\sin(\cdot), \cos(\cdot), \sin(\cdot), \cos(\cdot), \dots]$$

Each sine/cosine pair uses a different frequency.

Fast-changing waves help distinguish nearby positions.

Slow-changing waves help represent longer-range structure.

flowchart TD
    P["position"] --> F1["high-frequency sin/cos"]
    P --> F2["medium-frequency sin/cos"]
    P --> F3["low-frequency sin/cos"]

    F1 --> PE["positional encoding vector"]
    F2 --> PE
    F3 --> PE

The result is a positional fingerprint: not a single number, but a structured vector.
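The difference between one wave and many can be seen with a small experiment. The sketch below compares positions 0 and 22, which a single sine wave nearly confuses because 22 is close to 7π. The frequency list `freqs` and the helper names are hypothetical choices for illustration, not the schedule the Transformer uses.

```python
import math

def single_wave(pos):
    """Encode a position with one sine value."""
    return math.sin(pos)

def multi_wave(pos, frequencies):
    """Encode a position as [sin(f * pos), cos(f * pos)] for each frequency f."""
    vec = []
    for f in frequencies:
        vec.extend([math.sin(f * pos), math.cos(f * pos)])
    return vec

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

freqs = [1.0, 0.1, 0.01]  # hypothetical frequencies, chosen only for this demo

gap_single = abs(single_wave(0) - single_wave(22))
gap_multi = distance(multi_wave(0, freqs), multi_wave(22, freqs))

print(gap_single)  # tiny: one sine value barely separates the two positions
print(gap_multi)   # much larger: the vector separates them clearly
```

Pairing each sine with a cosine already helps here, and the slower frequencies add further separation.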


6. The Transformer Positional Encoding Formula

The original Transformer uses:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where:

| Symbol | Meaning |
| --- | --- |
| $pos$ | token position in the sequence |
| $i$ | dimension-pair index |
| $d_{\text{model}}$ | embedding/model dimension |
| $2i$ | even dimension |
| $2i+1$ | odd dimension |

Even dimensions use sine.

Odd dimensions use cosine.

For $d_{\text{model}} = 512$, there are $256$ sine/cosine pairs.

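Each pair uses a different wavelength. The wavelengths form a geometric progression from $2\pi$ (for $i = 0$) up to nearly $10000 \cdot 2\pi$ (for the last pair).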
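The formula translates almost line for line into code. The following is a minimal sketch in plain Python, assuming an even `d_model` and zero-based positions; the function name `positional_encoding` is a hypothetical helper, and real implementations usually vectorize this with an array library.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model list of sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Same angle for both members of pair i.
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # even dimension
            row[2 * i + 1] = math.cos(angle)  # odd dimension
        table.append(row)
    return table

pe = positional_encoding(num_positions=3, d_model=512)
print(len(pe), len(pe[0]))  # 3 512
print(pe[0][:4])            # position 0: [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1 at every frequency.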


7. What the Formula Produces

For each position, the formula creates a vector with the same dimension as the word embedding.

If $d_{\text{model}} = 512$, then:

$$PE(pos) \in \mathbb{R}^{512}$$

For a sequence of 3 tokens:

$$PE \in \mathbb{R}^{3 \times 512}$$

Each row is the positional encoding for one token position:

| Row | Meaning |
| --- | --- |
| row 0 | position 0 |
| row 1 | position 1 |
| row 2 | position 2 |

The model receives:

$$X_{\text{input}} = E + PE$$

where:

  • $E$ is the token embedding matrix
  • $PE$ is the positional encoding matrix

8. Why Different Frequencies Matter

Low-index dimensions change quickly.

High-index dimensions change slowly.

This creates a multi-scale position signal:

| Dimension type | Behavior |
| --- | --- |
| high frequency | changes quickly across nearby positions |
| low frequency | changes slowly across far positions |

Nearby positions have similar positional vectors.

Farther positions differ across more dimensions.

This gives the model a smooth way to reason about both local and long-range order.

flowchart LR
    P1["nearby positions"] --> S1["similar PE vectors"]
    P2["far positions"] --> S2["larger differences across dimensions"]

The encoding does not merely label positions. It creates a structured geometry of positions.
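One way to make this geometry concrete is to compare dot products between encodings. The sketch below assumes a small width of `d_model = 64` for readability and uses hypothetical helper names. It compares position 10 with its neighbor 11 and with the more distant position 50.

```python
import math

def pe_vector(pos, d_model):
    """Sinusoidal encoding of one position, laid out as [sin, cos, sin, cos, ...]."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

d_model = 64  # assumption: a small width, chosen for the demo
anchor = pe_vector(10, d_model)

same = dot(anchor, anchor)                 # d_model / 2, since sin^2 + cos^2 = 1 per pair
near = dot(anchor, pe_vector(11, d_model)) # neighbor
far = dot(anchor, pe_vector(50, d_model))  # 40 positions away

print(same, near, far)
```

In this setup the neighbor scores higher than the distant position. The decay is a general trend rather than a strictly monotonic guarantee, because each pair contributes a cosine of the offset and cosines oscillate.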


9. The Relative Position Property

A key motivation for sine/cosine encodings is that relative shifts have a predictable structure.

For any fixed offset $k$, the encoding of $pos + k$ can be represented as a linear transformation of the encoding of $pos$.

That is:

$$PE(pos+k) = T_k \, PE(pos)$$

for a transformation matrix $T_k$ that is determined by the offset $k$ alone and does not depend on $pos$.

This is not magic. It comes from trigonometric identities:

$$\sin(a+b) = \sin(a)\cos(b) + \cos(a)\sin(b)$$

$$\cos(a+b) = \cos(a)\cos(b) - \sin(a)\sin(b)$$

For each sine/cosine pair, shifting by $k$ behaves like a rotation.
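Written out for a single pair, the rotation becomes explicit. The shorthand $\omega_i = 1 / 10000^{2i/d_{\text{model}}}$ is introduced here only for convenience. Applying the two angle-addition identities with $a = \omega_i \, pos$ and $b = \omega_i k$ gives:

$$
\begin{bmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{bmatrix}
=
\begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}
\begin{bmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{bmatrix}
$$

The $2 \times 2$ matrix depends only on $k$ and $\omega_i$, not on $pos$, and it is a rotation matrix. Placing one such block per pair along the diagonal gives the full $T_k$.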

flowchart LR
    PE1["PE(pos)"] --> T["offset transform T_k"]
    T --> PE2["PE(pos + k)"]

This helps the model learn relative patterns like:

  • previous token
  • next token
  • five tokens back
  • nearby phrase
  • long-range dependency

The model is not explicitly programmed to calculate these offsets. The encoding makes those relationships easier to learn.
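The linear-shift property can be verified numerically. The sketch below builds the effect of $T_k$ pair by pair from the angle-addition identities and checks that it maps the encoding of `pos` onto the encoding of `pos + k` for several starting positions. The helper names and the small width `d_model = 16` are assumptions made for the demo.

```python
import math

def pe_vector(pos, d_model):
    """Sinusoidal encoding of one position, laid out as [sin, cos, sin, cos, ...]."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def shift(vec, k, d_model):
    """Apply T_k: one 2x2 rotation per sine/cosine pair, independent of pos."""
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = vec[2 * i], vec[2 * i + 1]
        out.append(s * math.cos(omega * k) + c * math.sin(omega * k))  # sin(a + b)
        out.append(c * math.cos(omega * k) - s * math.sin(omega * k))  # cos(a + b)
    return out

d_model, k = 16, 5
errors = []
for pos in (0, 10, 80):
    shifted = shift(pe_vector(pos, d_model), k, d_model)
    direct = pe_vector(pos + k, d_model)
    errors.append(max(abs(a - b) for a, b in zip(shifted, direct)))

print(errors)  # all entries should be at floating-point noise level
```

The same `shift` function works for every starting position, which is what "five tokens back" needs in order to be a reusable pattern.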


10. Why Add Instead of Concatenate?

Another possible design would be concatenation:

$$[E; PE]$$

If both $E$ and $PE$ are 512-dimensional, concatenation creates a 1024-dimensional input.

That would increase the size of the projection matrices:

| Design | Input dimension | Example $W_Q$ shape |
| --- | --- | --- |
| addition | 512 | $512 \times 64$ |
| concatenation | 1024 | $1024 \times 64$ |

Concatenation would add many extra parameters to:

  • $W_Q$
  • $W_K$
  • $W_V$
  • every attention head
  • every Transformer layer

Addition keeps the input dimension unchanged:

$$E + PE \in \mathbb{R}^{n \times d_{\text{model}}}$$

So the model gets position information without doubling the dimensionality.
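A rough parameter count makes the cost visible. The sketch below assumes 8 heads of size 64, ignores biases, and assumes the 1024-wide concatenated input is fed to the attention projections unchanged. The function name `qkv_parameters` is a hypothetical helper.

```python
def qkv_parameters(d_input, d_head, num_heads):
    """Weights in W_Q, W_K, W_V across all heads of one attention layer (biases ignored)."""
    per_matrix = d_input * d_head
    return 3 * num_heads * per_matrix

d_head, num_heads = 64, 8  # assumption: 8 heads, each projecting to 64 dimensions

added = qkv_parameters(512, d_head, num_heads)
concatenated = qkv_parameters(1024, d_head, num_heads)

print(added)                 # 786432
print(concatenated)          # 1572864
print(concatenated - added)  # 786432 extra weights in this one layer
```

Under these assumptions, concatenation doubles the query, key, and value weights of the layer, while addition leaves them unchanged.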


11. Why Addition Does Not Simply Destroy the Embedding

At first, addition seems suspicious.

If:

$$X_{\text{input}} = E + PE$$

then semantic and positional information are mixed into the same vector.

How can the model still use them?

The answer is:

The model learns projections that interpret the combined signal.

Word embeddings are learned semantic vectors.

Sinusoidal positional encodings are fixed, structured wave patterns.

Because the position signal has a regular structure across dimensions, the learned matrices $W_Q$, $W_K$, and $W_V$ can use it, and later feed-forward layers can also learn how much to rely on semantic features versus positional features.

Addition does not mean the two signals are perfectly independent. It means the model receives a combined vector with enough structure to recover useful semantic and positional information during training.
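One way to see why the mixed signal remains usable is that the projections are linear. Writing $e_i$ and $p_i$ for the rows of $E$ and $PE$ at position $i$ (row-vector notation introduced here for illustration), and $q_i$, $k_j$ for the resulting query and key rows, the unscaled attention score between positions $i$ and $j$ expands into four terms:

$$
\begin{aligned}
q_i k_j^T &= (e_i + p_i)\, W_Q W_K^T\, (e_j + p_j)^T \\
&= e_i W_Q W_K^T e_j^T + e_i W_Q W_K^T p_j^T + p_i W_Q W_K^T e_j^T + p_i W_Q W_K^T p_j^T
\end{aligned}
$$

The first term compares content with content, the last compares position with position, and the two middle terms mix them. Training shapes $W_Q$ and $W_K$, so the model can weight these terms as the task requires. This expansion applies directly to the first layer; deeper layers see inputs that earlier layers have already transformed.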


12. Shape Walkthrough

Suppose there are $n = 3$ tokens:

love Apple phones

and $d_{\text{model}} = 512$.

The embedding matrix is:

$$E \in \mathbb{R}^{3 \times 512}$$

The positional encoding matrix is:

$$PE \in \mathbb{R}^{3 \times 512}$$

They are added element-wise:

$$X_{\text{input}} = E + PE$$

So:

$$X_{\text{input}} \in \mathbb{R}^{3 \times 512}$$

Then self-attention proceeds normally:

$$Q = X_{\text{input}} W_Q$$

$$K = X_{\text{input}} W_K$$

$$V = X_{\text{input}} W_V$$

The Transformer now has access to both:

  • what each token means
  • where each token appears
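The shapes in this walkthrough can be traced in code. The sketch below uses random numbers as stand-ins for the learned embedding matrix and for $W_Q$, and assumes a head dimension of 64. Only the shapes matter here, not the values, and all helper names are hypothetical.

```python
import math
import random

def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings as a num_positions x d_model list of lists."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x p)."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]

def shape(M):
    """Shape of a list-of-lists matrix as (rows, columns)."""
    return (len(M), len(M[0]))

random.seed(0)
n, d_model, d_k = 3, 512, 64  # assumption: d_k = 64 for one head

# Random stand-ins for learned parameters (illustration only).
E = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n)]
W_Q = [[random.gauss(0.0, 0.02) for _ in range(d_k)] for _ in range(d_model)]

PE = positional_encoding(n, d_model)
X_input = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, PE)]
Q = matmul(X_input, W_Q)

print(shape(E), shape(PE), shape(X_input), shape(Q))
# (3, 512) (3, 512) (3, 512) (3, 64)
```

The keys and values follow the same pattern with their own projection matrices.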

13. Learned vs Fixed Positional Encodings

The original Transformer used fixed sinusoidal positional encodings.

Many later models use learned positional embeddings instead.

| Type | Description |
| --- | --- |
| fixed sinusoidal | computed from sine/cosine formula |
| learned positional embedding | learned like ordinary parameters |

Fixed sinusoidal encodings have a useful built-in relative-position structure.

Learned positional embeddings can adapt to the training data, but may not extrapolate as naturally beyond trained sequence lengths.
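The extrapolation difference has a simple mechanical side, sketched below. The class `LearnedPositions` and its `max_len` parameter are hypothetical names, and the table is only randomly initialized (no training is shown). It stands in for a lookup table with one trainable row per position.

```python
import math
import random

class LearnedPositions:
    """Lookup table with one trainable row per position (randomly initialized here)."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_len)]

    def encode(self, pos):
        if not 0 <= pos < len(self.table):
            raise IndexError(f"position {pos} has no learned row")
        return self.table[pos]

def sinusoidal(pos, d_model):
    """Fixed encoding computed from the formula; defined for any position."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

learned = LearnedPositions(max_len=128, d_model=8)

print(len(sinusoidal(1000, 8)))  # the formula yields a vector for any position
try:
    learned.encode(1000)
except IndexError as err:
    print("learned table:", err)
```

Being able to compute a sinusoidal vector for an unseen position does not by itself guarantee that a trained model handles longer sequences well; it only means the encoding remains defined.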

Modern Transformer variants also use other approaches, such as relative position bias and rotary positional embeddings. The central need is the same:

Attention needs some way to know order.


14. Common Confusions

Does self-attention know word order by itself?

No. Plain self-attention needs positional information added or built into the attention mechanism.

Are raw position numbers good enough?

They are simple, but not ideal. They are unbounded and do not provide a rich smooth structure for relative position.

Do sinusoidal encodings guarantee a unique vector for every position?

Not exactly. They provide a highly useful multi-frequency pattern over practical sequence lengths, but periodic functions are not an infinite uniqueness guarantee.

Is the positional encoding concatenated to the token embedding?

In the original Transformer, it is added element-wise to the token embedding.

Does addition destroy the word meaning?

The signals are mixed, but the model can learn to use the structured positional pattern through its learned projections.


15. One-Line Interview Answer

Positional encoding gives Transformers information about token order, because self-attention processes tokens in parallel and has no inherent sequence order. The original Transformer adds fixed sine/cosine vectors to token embeddings, using many frequencies so nearby and distant positions have learnable, structured relationships. Addition keeps the model dimension unchanged while giving attention access to both word meaning and position.


16. Final Intuition

Word embeddings answer:

What does this token mean?

Positional encodings answer:

Where does this token appear?

Self-attention needs both.

Without positional encoding, the model sees words but not order.

With positional encoding, the input becomes:

$$\text{meaning} + \text{position}$$

That lets the Transformer keep its parallel speed while still understanding sequence structure.