Positional Encoding in Transformers

Transformers in Deep Learning - Part 7
Sibling notes: Self Attention in Transformers - The Deep Dive · Multi-Head Attention - The Deep Dive

Self-attention processes tokens in parallel.

That is one of its biggest strengths:

it is faster than sequential recurrence
it can compare distant words directly
it can build contextual representations efficiently

But parallel processing creates a serious problem:

Self-attention by itself has no built-in sense of word order.

The Transformer solves this by adding positional encoding to the input embeddings.

1. Why Position Matters

Consider these two sentences:

The cat chased the mouse.

The mouse chased the cat.

They contain exactly the same words, but the meanings differ because the order differs.

Sentence	Meaning
cat chased mouse	cat is the subject, mouse is the object
mouse chased cat	mouse is the subject, cat is the object

If a model sees only a bag of token embeddings, it cannot know which word came first, second, or third.

That is the problem positional encoding solves.

2. The Core Problem With Self-Attention

Self-attention compares every token with every other token:

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V

This operation is powerful, but without positional information it is permutation-equivariant: if the tokens are reordered, the outputs are reordered in the same way, with no built-in knowledge of which order is the correct sentence order.
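A minimal NumPy sketch can make the permutation-equivariance claim concrete. The embeddings and weight matrices below are random stand-ins, and the helper names `softmax` and `self_attention` are illustrative, not taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Single-head scaled dot-product attention, no positional information.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model, d_k = 8, 4  # small sizes chosen only for illustration
emb = {w: rng.normal(size=d_model) for w in ["cat", "chased", "mouse"]}
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

X1 = np.stack([emb[w] for w in ["cat", "chased", "mouse"]])
X2 = np.stack([emb[w] for w in ["mouse", "chased", "cat"]])

out1 = self_attention(X1, W_Q, W_K, W_V)
out2 = self_attention(X2, W_Q, W_K, W_V)

# "cat" is row 0 in the first sentence and row 2 in the second.
same_cat = np.allclose(out1[0], out2[2])
print(same_cat)
```

The output row for "cat" is the same in both sentences, so nothing in the result tells the model whether the cat was chasing or being chased.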

If the model receives the same token embeddings without position information, then:

each token in "cat chased mouse" gets the same output vector as it does in "mouse chased cat"; only the row order differs.

The model needs two kinds of information:

Information type	Example
semantic information	what the word means
positional information	where the word appears

Word embeddings provide semantic information.

Positional encodings provide order information.

3. A Naive Idea: Add Position Numbers

One first idea is:

Give each token its position number.

For example:

Token	Position
The	1
cat	2
chased	3
the	4
mouse	5

This seems simple, but it has problems.

Problem 1: Unbounded Values

Longer sequences create larger numbers:

1,2,3,\dots,1000,\dots

Large raw values can swamp the token embeddings, whose entries are usually small, and make training unstable.

Problem 2: Poor Geometry

The model should understand relative distance:

position 10 is near position 11
position 10 is farther from position 50
the offset from 10 to 15 is similar to the offset from 80 to 85

Raw integers do not give the model a rich, smooth geometry for this.

Problem 3: Hard Generalization

If position is just a number, the model may learn position-specific quirks rather than reusable patterns like:

"attend to the previous word"
"attend five tokens back"
"attend to a nearby noun"

So a better encoding should be:

bounded
continuous
patterned
useful for relative position
compatible with word embeddings

4. Why Sine and Cosine Are Useful

Sine and cosine have exactly the kind of behavior we want:

Property	Why it helps
bounded	values stay between -1 and 1
continuous	nearby positions change smoothly
periodic	creates repeating structure
frequency-controlled	different dimensions can track different scales

A single sine wave is not enough:

\sin(pos)

Because sine is periodic, two different positions can produce the same, or nearly the same, value.

So the Transformer does not use one sine value.

It uses many sine and cosine waves with different frequencies.
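As a small illustration of why one wave is not enough, the sketch below compares positions 1 and 21 under a single sine wave. The second frequency `slow = 0.1` is a hypothetical value picked only for this example; it is not one of the Transformer's frequencies.

```python
import math

a, b = 1, 21

# With a single wave sin(pos), these two positions are almost indistinguishable.
gap = abs(math.sin(a) - math.sin(b))
print(f"sin({a}) = {math.sin(a):.4f}, sin({b}) = {math.sin(b):.4f}, gap = {gap:.4f}")

# A second, slower wave (hypothetical frequency) separates them again.
slow = 0.1
gap_slow = abs(math.sin(slow * a) - math.sin(slow * b))
print(f"gap on the slower wave = {gap_slow:.4f}")
```

Positions that nearly collide on one frequency are usually well separated on another, which is the motivation for combining many waves.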

5. From One Wave to Many Waves

Instead of encoding a position as one number, encode it as a vector:

PE(pos) = [ \sin(\cdot), \cos(\cdot), \sin(\cdot), \cos(\cdot), \dots ]

Each sine/cosine pair uses a different frequency.

Fast-changing waves help distinguish nearby positions.

Slow-changing waves help represent longer-range structure.

flowchart TD
    P["position"] --> F1["high-frequency sin/cos"]
    P --> F2["medium-frequency sin/cos"]
    P --> F3["low-frequency sin/cos"]

    F1 --> PE["positional encoding vector"]
    F2 --> PE
    F3 --> PE

The result is a positional fingerprint: not a single number, but a structured vector.

6. The Transformer Positional Encoding Formula

The original Transformer uses:

PE(pos, 2i) = \sin \left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)

PE(pos, 2i+1) = \cos \left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)

where:

Symbol	Meaning
$pos$	token position in the sequence
$i$	dimension-pair index, $0 \le i < d_{\text{model}}/2$
$d_{\text{model}}$	embedding/model dimension
$2i$	even dimension
$2i+1$	odd dimension

Even dimensions use sine.

Odd dimensions use cosine.

For:

d_{\text{model}} = 512

there are:

256

sine/cosine pairs.

Each pair uses a different wavelength.
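The formula translates fairly directly into NumPy. The following is a minimal sketch, assuming an even `d_model`; the function name `sinusoidal_pe` is illustrative and does not come from a specific library.

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix of sinusoidal encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    pos = np.arange(num_positions)[:, None]            # shape (n, 1)
    i = np.arange(d_model // 2)[None, :]               # shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # shape (n, d_model/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i+1
    return pe

pe = sinusoidal_pe(50, 512)
print(pe.shape)
print(np.abs(pe).max() <= 1.0)
```

Position 0 comes out as alternating 0 and 1, since $\sin(0) = 0$ and $\cos(0) = 1$ in every pair. Every row also has the same length, $\sqrt{d_{\text{model}}/2}$, because each pair contributes $\sin^2 + \cos^2 = 1$.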

7. What the Formula Produces

For each position, the formula creates a vector with the same dimension as the word embedding.

If:

d_{\text{model}} = 512

then:

PE(pos) \in \mathbb{R}^{512}

For a sequence of 3 tokens:

PE \in \mathbb{R}^{3 \times 512}

Each row is the positional encoding for one token position:

Row	Meaning
row 0	position 0
row 1	position 1
row 2	position 2

The model receives:

X_{\text{input}} = E + PE

where:

$E$ is the token embedding matrix
$PE$ is the positional encoding matrix

8. Why Different Frequencies Matter

Low-index dimensions change quickly.

High-index dimensions change slowly: across the pairs, the wavelengths grow geometrically from $2\pi$ toward $10000 \cdot 2\pi$.

This creates a multi-scale position signal:

Dimension type	Behavior
high frequency	changes quickly across nearby positions
low frequency	changes slowly across far positions

Nearby positions have similar positional vectors.

Farther positions differ across more dimensions.

This gives the model a smooth way to reason about both local and long-range order.

flowchart LR
    P1["nearby positions"] --> S1["similar PE vectors"]
    P2["far positions"] --> S2["larger differences across dimensions"]

The encoding does not merely label positions. It creates a structured geometry of positions.
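One way to see this geometry is to compute the per-pair wavelengths and a few dot products between positions. This is a rough sketch using the formula from Section 6; `sinusoidal_pe` is an illustrative helper, and the specific positions (10, 11, 50, and so on) are just the examples used earlier in these notes.

```python
import numpy as np

d_model = 512
i = np.arange(d_model // 2)

# Wavelength of pair i: the number of positions before its sin/cos repeats.
wavelengths = 2 * np.pi * np.power(10000.0, 2 * i / d_model)
print(wavelengths[0], wavelengths[-1])

def sinusoidal_pe(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]
    pair = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * pair / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(100, d_model)

near = pe[10] @ pe[11]      # adjacent positions
far = pe[10] @ pe[50]       # positions 40 apart
offset_a = pe[10] @ pe[15]  # offset of 5 starting at 10
offset_b = pe[80] @ pe[85]  # offset of 5 starting at 80
print(near > far, np.isclose(offset_a, offset_b))
```

In this example the adjacent pair has the larger dot product, and the two offset-5 pairs match, because the dot product of two encodings depends only on the distance between the positions. The similarity is not guaranteed to fall monotonically with distance, so this should be read as an illustration rather than a general law.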

9. The Relative Position Property

The most important reason for sine/cosine encodings is that relative shifts have a predictable structure.

For any fixed offset $k$, the encoding of:

pos + k

can be represented as a linear transformation of the encoding of:

pos

That is:

PE(pos+k) = T_k PE(pos)

for a transformation matrix $T_k$ determined by the offset $k$.

This is not magic. It comes from trigonometric identities:

\sin(a+b) = \sin(a)\cos(b) + \cos(a)\sin(b)

\cos(a+b) = \cos(a)\cos(b) - \sin(a)\sin(b)

For each sine/cosine pair, shifting by $k$ behaves like a rotation.
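Written out for one pair, the two identities combine into a single matrix equation. The shorthand $\omega_i = 1 / 10000^{2i/d_{\text{model}}}$ is introduced here only for readability.

```latex
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
```

The $2 \times 2$ matrix depends on $k$ and $\omega_i$ but not on $pos$. Stacking one such block per pair along the diagonal gives the full $T_k$.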

flowchart LR
    PE1["PE(pos)"] --> T["offset transform T_k"]
    T --> PE2["PE(pos + k)"]

This helps the model learn relative patterns like:

previous token
next token
five tokens back
nearby phrase
long-range dependency

The model is not explicitly programmed to calculate these offsets. The encoding makes those relationships easier to learn.
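The relation $PE(pos+k) = T_k PE(pos)$ can be checked numerically. The sketch below builds $T_k$ as a block-diagonal matrix with one $2 \times 2$ rotation per sine/cosine pair; the helper names `sinusoidal_pe` and `offset_matrix` are illustrative, and the small `d_model = 16` is chosen only to keep the matrix readable.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def offset_matrix(k, d_model):
    """Block-diagonal T_k: one 2x2 rotation per sine/cosine pair."""
    omega = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)
    T = np.zeros((d_model, d_model))
    for i, w in enumerate(omega):
        c, s = np.cos(w * k), np.sin(w * k)
        # Acts on the (sin, cos) pair stored at dimensions 2i and 2i+1.
        T[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return T

d_model, k = 16, 5
pe = sinusoidal_pe(100, d_model)
T_k = offset_matrix(k, d_model)

# The same matrix maps PE(pos) to PE(pos + k) from any starting position.
ok = all(np.allclose(T_k @ pe[p], pe[p + k]) for p in (0, 10, 80))
print(ok)
```

Note that `T_k` is built from `k` alone; no starting position enters its construction.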

10. Why Add Instead of Concatenate?

Another possible design would be concatenation:

[E; PE]

If both $E$ and $PE$ are 512-dimensional, concatenation creates a 1024-dimensional input.

That would increase the size of the projection matrices:

Design	Input dimension	Example $W_Q$ shape
addition	512	$512 \times 64$
concatenation	1024	$1024 \times 64$

If the model kept that wider input throughout, concatenation would add extra parameters to:

$W_Q$
$W_K$
$W_V$

in every attention head of every Transformer layer.

Addition keeps the input dimension unchanged:

E + PE \in \mathbb{R}^{n \times d_{\text{model}}}

So the model gets position information without doubling the dimensionality.
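A rough parameter count shows the size of the difference. This sketch assumes 8 heads, 6 layers, and a per-head projection width of 64, ignores biases, and assumes the concatenated width of 1024 is kept in every layer; the helper `qkv_params` is hypothetical.

```python
def qkv_params(d_in: int, d_k: int = 64, num_heads: int = 8, num_layers: int = 6) -> int:
    """Count weights in W_Q, W_K, W_V across all heads and layers (no biases)."""
    per_head = 3 * d_in * d_k  # three projection matrices of shape (d_in, d_k)
    return per_head * num_heads * num_layers

add_params = qkv_params(512)   # input is E + PE
cat_params = qkv_params(1024)  # input is [E; PE]
print(add_params, cat_params)
```

Under these assumptions the projection weights double, from 4,718,592 to 9,437,184, before counting any other part of the model.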

11. Why Addition Does Not Simply Destroy the Embedding

At first, addition seems suspicious.

If:

X_{\text{input}} = E + PE

then semantic and positional information are mixed into the same vector.

How can the model still use them?

The answer is:

The model learns projections that interpret the combined signal.

Word embeddings are learned semantic vectors.

Sinusoidal positional encodings are fixed, structured wave patterns.

Because the position signal has a regular structure across dimensions, the following learned matrices can use it:

W_Q, W_K, W_V

and later feed-forward layers can also learn how much to rely on semantic features versus positional features.

Addition does not mean the two signals are perfectly independent. It means the model receives a combined vector with enough structure to recover useful semantic and positional information during training.

12. Shape Walkthrough

Suppose:

n = 3

tokens:

love Apple phones

and:

d_{\text{model}} = 512

The embedding matrix is:

E \in \mathbb{R}^{3 \times 512}

The positional encoding matrix is:

PE \in \mathbb{R}^{3 \times 512}

They are added element-wise:

X_{\text{input}} = E + PE

So:

X_{\text{input}} \in \mathbb{R}^{3 \times 512}

Then self-attention proceeds normally:

Q = X_{\text{input}} W_Q

K = X_{\text{input}} W_K

V = X_{\text{input}} W_V

The Transformer now has access to both:

what each token means
where each token appears
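The same walkthrough can be traced in NumPy. The embeddings and projection matrices here are random stand-ins for learned parameters, and the projection width `d_k = 64` is an assumed single-head size matching the $W_Q$ shape used in Section 10.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 512, 64

# Stand-in for the learned embeddings of the three tokens.
E = rng.normal(size=(n, d_model))

# Sinusoidal positional encodings for positions 0, 1, 2.
pos = np.arange(n)[:, None]
i = np.arange(d_model // 2)[None, :]
angles = pos / np.power(10000.0, 2 * i / d_model)
PE = np.zeros((n, d_model))
PE[:, 0::2] = np.sin(angles)
PE[:, 1::2] = np.cos(angles)

# Element-wise addition keeps the shape at (3, 512).
X_input = E + PE

# Random stand-ins for the learned projection matrices of one head.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X_input @ W_Q, X_input @ W_K, X_input @ W_V

print(E.shape, PE.shape, X_input.shape, Q.shape, K.shape, V.shape)
```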

13. Learned vs Fixed Positional Encodings

The original Transformer used fixed sinusoidal positional encodings.

Many later models use learned positional embeddings instead.

Type	Description
fixed sinusoidal	computed from sine/cosine formula
learned positional embedding	learned like ordinary parameters

Fixed sinusoidal encodings have a useful built-in relative-position structure.

Learned positional embeddings can adapt to the training data, but may not extrapolate as naturally beyond trained sequence lengths.
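The difference shows up clearly at lookup time. In this sketch, `max_len = 128` is a hypothetical training-time limit, the learned table is filled with random values in place of trained ones, and the function names are illustrative.

```python
import numpy as np

max_len, d_model = 128, 16  # hypothetical limits, chosen for illustration
rng = np.random.default_rng(0)

# Learned variant: one trainable row per position, up to max_len.
# Random values stand in for parameters that training would update.
learned_table = rng.normal(scale=0.02, size=(max_len, d_model))

def learned_pe(pos: int) -> np.ndarray:
    return learned_table[pos]  # raises IndexError when pos >= max_len

def fixed_pe(pos: int) -> np.ndarray:
    # Fixed variant: computed from the formula, so any position is defined.
    i = np.arange(d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    out = np.empty(d_model)
    out[0::2], out[1::2] = np.sin(angles), np.cos(angles)
    return out

far_vector = fixed_pe(500)  # position beyond max_len is still computable

try:
    learned_pe(500)
    learned_ok = True
except IndexError:
    learned_ok = False

print(far_vector.shape, learned_ok)
```

Being able to compute an encoding for position 500 does not by itself mean a trained model will handle such positions well; it only means the input is defined.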

Modern Transformer variants also use other approaches, such as relative position bias and rotary positional embeddings. The central need is the same:

Attention needs some way to know order.

14. Common Confusions

Note: Does self-attention know word order on its own?

No. Plain self-attention needs positional information added or built into the attention mechanism.

Note: Why not use raw position numbers?

They are simple, but not ideal. They are unbounded and do not provide a rich, smooth structure for relative position.

Note: Do sinusoidal encodings guarantee a unique vector for every position?

Not exactly. They provide a highly useful multi-frequency pattern over practical sequence lengths, but periodic functions do not guarantee uniqueness for arbitrarily long sequences.

Note: Is the positional encoding concatenated with the token embedding?

In the original Transformer, it is added element-wise to the token embedding.

Note: Does addition destroy the word embedding?

The signals are mixed, but the model can learn to use the structured positional pattern through its learned projections.

15. One-Line Interview Answer

Positional encoding gives Transformers information about token order, because self-attention processes tokens in parallel and has no inherent sequence order. The original Transformer adds fixed sine/cosine vectors to token embeddings, using many frequencies so nearby and distant positions have learnable, structured relationships. Addition keeps the model dimension unchanged while giving attention access to both word meaning and position.

16. Final Intuition

Word embeddings answer:

What does this token mean?

Positional encodings answer:

Where does this token appear?

Self-attention needs both.

Without positional encoding, the model sees words but not order.

With positional encoding, the input becomes:

\text{meaning} + \text{position}

That lets the Transformer keep its parallel speed while still understanding sequence structure.