Transformer Encoder Architecture
Transformers in Deep Learning - Part 10
Sibling notes: Multi-Head Attention - The Deep Dive · Positional Encoding in Transformers - The Deep Dive · Residual Connections in Transformers - The Deep Dive
The Transformer encoder is built from components that have already appeared separately:
- token embeddings
- positional encoding
- multi-head self-attention
- residual connections
- layer normalization
- feed-forward networks
The encoder combines these pieces into a repeatable block.
The goal of the encoder is:
Turn input tokens into contextual representations that capture meaning, order, and relationships within the sentence.
1. The Input: Token Embeddings
For an NLP task, the model begins with tokens represented as vectors.
If the sentence is:
love Apple phones
and each token embedding has 512 dimensions, then the input embedding matrix is:
$$E \in \mathbb{R}^{3 \times 512}$$
Each row represents one token.
| Token | Vector shape |
|---|---|
| love | $1 \times 512$ |
| Apple | $1 \times 512$ |
| phones | $1 \times 512$ |
Embeddings carry semantic information: what the token means in general.
But normal embeddings are static. The same embedding for Apple may be used in:
Apple phones
and:
apple juice
The encoder needs to transform static embeddings into contextual representations.
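To make "static" concrete, here is a minimal NumPy sketch of an embedding lookup. The vocabulary, the case-insensitive lookup, the random table, and the helper name `embed` are all assumptions made for illustration; a real model would use a trained tokenizer and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real tokenizer would supply these ids.
vocab = {"love": 0, "apple": 1, "phones": 2, "juice": 3}
d_model = 512

# One row per vocabulary entry (random here, learned in a real model).
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up one embedding row per token; surrounding context is ignored."""
    ids = [vocab[t.lower()] for t in tokens]
    return embedding_table[ids]

phones_ctx = embed(["love", "Apple", "phones"])  # shape (3, 512)
juice_ctx = embed(["apple", "juice"])            # shape (2, 512)

# The row for "apple" is identical in both sentences.
same_vector = np.array_equal(phones_ctx[1], juice_ctx[0])
print(phones_ctx.shape, same_vector)  # (3, 512) True
```

The lookup depends only on the token id, which is why the encoder has to supply the context.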
2. Add Positional Encoding
Self-attention processes tokens in parallel, so it does not naturally know word order.
The model must know that:
The cat chased the mouse.
is different from:
The mouse chased the cat.
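So before the encoder block receives the input, positional encoding is added:
$$X = E + P$$
where:
- $E$ is the token embedding matrix
- $P$ is the positional encoding matrix
If:
$$E \in \mathbb{R}^{3 \times 512}$$
and:
$$P \in \mathbb{R}^{3 \times 512}$$
then:
$$X \in \mathbb{R}^{3 \times 512}$$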
The encoder input now contains both:
- semantic meaning
- position/order information
flowchart LR
E["token embeddings"] --> ADD["+"]
PE["positional encodings"] --> ADD
ADD --> X["encoder input X"]
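The addition can be sketched in NumPy as follows. This assumes the fixed sinusoidal encoding from the original Transformer paper (learned positional embeddings are another common choice); the function name and the random stand-in embeddings are illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(n_positions)[:, None]       # (n, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
E = rng.normal(size=(3, 512))               # stand-in token embeddings
P = sinusoidal_positional_encoding(3, 512)  # one row per position
X = E + P                                   # encoder input

print(X.shape)  # (3, 512)
```

Because the two matrices have the same shape, the sum is a plain element-wise addition and the encoder input keeps the embedding shape.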
3. The First Major Layer: Multi-Head Self-Attention
The first major sublayer inside an encoder block is multi-head self-attention.
Self-attention lets each token look at every other token and build a contextual representation.
For example, in:
I love Apple phones
attention can help Apple move toward the technology-company meaning because of phones.
But one attention pattern is not enough for complex language.
For:
The keys on the table belong to Sara.
the word keys participates in more than one relationship:
| Relationship | Meaning |
|---|---|
| keys -> table | location |
| keys -> Sara | possession |
Multi-head attention allows several attention patterns to happen in parallel.
One head may focus on location. Another may focus on ownership. Another may focus on grammar or long-range dependencies.
4. Shape of Multi-Head Attention
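Using the original Transformer-style dimensions:
$$d_{\text{model}} = 512, \quad h = 8, \quad d_k = d_v = \frac{d_{\text{model}}}{h} = 64$$
The encoder input, for a sequence of $n$ tokens, is:
$$X \in \mathbb{R}^{n \times 512}$$
Each attention head $i$ receives the same input but has its own learned projection matrices:
$$W_i^Q,\ W_i^K,\ W_i^V \in \mathbb{R}^{512 \times 64}$$
So each head produces:
$$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V$$
Each head output is:
$$\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i \in \mathbb{R}^{n \times 64}$$
With 8 heads:
$$8 \times 64 = 512$$
After concatenation:
$$\text{Concat}(\text{head}_1, \dots, \text{head}_8) \in \mathbb{R}^{n \times 512}$$
Then an output projection maps it back into the model dimension:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_8)\, W^O, \quad W^O \in \mathbb{R}^{512 \times 512}$$
So multi-head attention starts with:
$$X \in \mathbb{R}^{n \times 512}$$
and ends with:
$$\text{MultiHead}(X) \in \mathbb{R}^{n \times 512}$$
but the output is now context-aware.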
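The shape bookkeeping above can be checked with a small NumPy sketch. The weights are random stand-ins, and the function and variable names are chosen for illustration; this is a sketch of scaled dot-product multi-head self-attention, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    d_k = W_q.shape[-1]
    Q = np.einsum("nd,hdk->hnk", X, W_q)               # (h, n, d_k)
    K = np.einsum("nd,hdk->hnk", X, W_k)
    V = np.einsum("nd,hdk->hnk", X, W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)
    # Put each token's head outputs side by side: (n, h * d_k).
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)
    return concat @ W_o, heads

rng = np.random.default_rng(0)
n, d_model, h = 3, 512, 8
d_k = d_model // h                                     # 64

X = rng.normal(size=(n, d_model))
W_q = rng.normal(scale=0.02, size=(h, d_model, d_k))
W_k = rng.normal(scale=0.02, size=(h, d_model, d_k))
W_v = rng.normal(scale=0.02, size=(h, d_model, d_k))
W_o = rng.normal(scale=0.02, size=(h * d_k, d_model))

out, heads = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(heads.shape, out.shape)  # (8, 3, 64) (3, 512)
```

Stacking the per-head matrices along a leading axis lets one `einsum` call compute all eight projections at once, which mirrors how the heads run in parallel.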
5. Parallelism: Powerful but Not Free
Self-attention processes tokens in parallel.
That is a major advantage over recurrent networks, which must process tokens step by step.
Parallelism helps because:
- all token representations can be computed at once
- long-range dependencies can be modeled directly
- GPU hardware can be used efficiently
But this does not mean a 1000-token sentence costs the same as a 10-token sentence.
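Attention compares every token with every other token, so its score matrix grows with sequence length $n$:
$$QK^\top \in \mathbb{R}^{n \times n}$$
The compute cost of attention grows roughly as:
$$O(n^2 \cdot d)$$
where $d$ is the representation dimension.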
The key point is:
Transformers are parallel over sequence positions, but longer sequences still require more computation and memory.
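A quick back-of-the-envelope count shows the quadratic growth. The helper name is hypothetical, and the count covers only the attention score entries for one layer with 8 heads, not the full cost of a model.

```python
def attention_score_entries(n_tokens, n_heads=8):
    """Number of entries across all n x n score matrices in one attention layer."""
    return n_heads * n_tokens * n_tokens

short = attention_score_entries(10)    # 800
long = attention_score_entries(1000)   # 8_000_000
ratio = long // short                  # 10_000

print(short, long, ratio)
```

A sequence that is 100 times longer needs 10,000 times as many score entries, even though all of them can be computed in parallel.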
6. Add and Norm After Attention
After multi-head attention, the encoder applies Add and Norm.
This means:
- Add the attention input back to the attention output.
- Normalize the result with layer normalization.
If the input to attention is $X$, then:
$$Y = \text{LayerNorm}(X + \text{MultiHeadAttention}(X))$$
The addition is a residual connection.
It gives information a shortcut path:
flowchart LR
X["X"] --> MHA["multi-head attention"]
MHA --> ADD["+"]
X --> ADD
ADD --> LN["layer norm"]
LN --> Y["Y"]
This helps with:
- forward information flow
- backward gradient flow
- stable training in deep stacks
Layer normalization keeps activations controlled after the residual addition.
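A minimal NumPy sketch of Add and Norm follows. The attention output is a random stand-in, and `gamma` and `beta` (the learned scale and shift of layer normalization) are set to their usual initial values of one and zero; the function names are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector (the last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    """Residual addition followed by layer normalization."""
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))
attn_out = rng.normal(size=(3, 512))   # stand-in for the attention output

gamma = np.ones(512)
beta = np.zeros(512)
Y = add_and_norm(X, attn_out, gamma, beta)

print(Y.shape)  # (3, 512)
```

Normalization runs over the feature axis of each token independently, so it does not depend on batch size or sequence length.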
7. Why the Feed-Forward Network Is Needed
After attention and normalization, the encoder uses a position-wise feed-forward network.
Attention mixes information across tokens.
The feed-forward network transforms each token representation independently.
Why is this needed?
Because attention projections are mostly linear transformations plus softmax-based mixing. The encoder still needs stronger nonlinear processing to learn complex patterns.
Without nonlinear activation functions, stacking linear layers would not add much power:
$$(x W_1) W_2 = x (W_1 W_2) = x W$$
Multiple linear layers can collapse into one linear layer.
A nonlinear activation prevents that collapse and gives the model more expressive power.
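The collapse can be verified numerically. This sketch uses small random matrices (the sizes are arbitrary) and omits bias terms for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 8))

# Two stacked linear layers equal one linear layer with weights W1 @ W2.
two_linear = (x @ W1) @ W2
one_linear = x @ (W1 @ W2)
collapses = np.allclose(two_linear, one_linear)

# With ReLU in between, that single merged matrix no longer reproduces the output.
with_relu = np.maximum(0.0, x @ W1) @ W2
relu_differs = not np.allclose(with_relu, one_linear)

print(collapses, relu_differs)  # True True
```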
8. The Feed-Forward Network Formula
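In the original Transformer, each encoder block uses a two-layer feed-forward network:
$$\text{FFN}(x) = \max(0,\ x W_1 + b_1)\, W_2 + b_2$$
The activation:
$$\max(0, z)$$
is ReLU.
The dimensions are:
$$W_1 \in \mathbb{R}^{512 \times 2048}, \quad W_2 \in \mathbb{R}^{2048 \times 512}$$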
So for each token:
- Expand from 512 dimensions to 2048 dimensions.
- Apply ReLU.
- Project back from 2048 dimensions to 512 dimensions.
flowchart LR
X["token vector: 512"] --> D1["dense: 2048"]
D1 --> R["ReLU"]
R --> D2["dense: 512"]
The expansion gives the model more capacity to build richer features.
The projection back to 512 keeps the shape compatible with residual connections and the next encoder block.
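The expand, activate, and project steps can be sketched in NumPy. The weights are random stand-ins and the function name is illustrative.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, project back to d_model."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # (n, 2048)
    return hidden @ W2 + b2                 # (n, 512)

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

Y = rng.normal(size=(3, d_model))           # stand-in for the attention sublayer output
out = feed_forward(Y, W1, b1, W2, b2)

# "Position-wise": running one token alone gives the same row as running all tokens.
single = feed_forward(Y[1:2], W1, b1, W2, b2)[0]
position_wise = np.allclose(out[1], single)

print(out.shape, position_wise)  # (3, 512) True
```

The same weights are applied to every row, so no information moves between token positions in this sublayer.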
9. Add and Norm After Feed-Forward
The feed-forward network is followed by another Add and Norm step.
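If $Y$ is the result after the attention Add and Norm, then:
$$Z = \text{LayerNorm}(Y + \text{FFN}(Y))$$
Again:
- the residual addition preserves information and gradient flow
- layer normalization stabilizes activations
- the output shape remains unchanged
If:
$$Y \in \mathbb{R}^{n \times 512}$$
then:
$$Z \in \mathbb{R}^{n \times 512}$$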
This is the output of one complete encoder block.
10. One Complete Encoder Block
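One encoder block can be summarized as:
$$Y = \text{LayerNorm}(X + \text{MultiHeadAttention}(X))$$
$$Z = \text{LayerNorm}(Y + \text{FFN}(Y))$$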
More explicitly:
flowchart TD
X["input: embeddings + position"] --> MHA["multi-head self-attention"]
MHA --> ADD1["add residual"]
X --> ADD1
ADD1 --> LN1["layer norm"]
LN1 --> FFN["feed-forward network"]
FFN --> ADD2["add residual"]
LN1 --> ADD2
ADD2 --> LN2["layer norm"]
LN2 --> OUT["encoder block output"]
This is the encoder block used in the original Transformer architecture.
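Putting the pieces together, one block can be sketched end to end in NumPy. This is a simplified sketch under stated assumptions: random weights, no dropout, no batching or masking, and layer normalization without its learned scale and shift. The helper names (`init_block`, `encoder_block`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Learned scale and shift are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_block(rng, d_model=512, n_heads=8, d_ff=2048, scale=0.02):
    """Random stand-in parameters for one encoder block."""
    d_k = d_model // n_heads
    return {
        "W_q": rng.normal(scale=scale, size=(n_heads, d_model, d_k)),
        "W_k": rng.normal(scale=scale, size=(n_heads, d_model, d_k)),
        "W_v": rng.normal(scale=scale, size=(n_heads, d_model, d_k)),
        "W_o": rng.normal(scale=scale, size=(d_model, d_model)),
        "W1": rng.normal(scale=scale, size=(d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=scale, size=(d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

def self_attention(X, p):
    Q = np.einsum("nd,hdk->hnk", X, p["W_q"])
    K = np.einsum("nd,hdk->hnk", X, p["W_k"])
    V = np.einsum("nd,hdk->hnk", X, p["W_v"])
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, n, n)
    heads = weights @ V                                         # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)   # (n, d_model)
    return concat @ p["W_o"]

def feed_forward(Y, p):
    return np.maximum(0.0, Y @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def encoder_block(X, p):
    Y = layer_norm(X + self_attention(X, p))   # attention, then Add and Norm
    Z = layer_norm(Y + feed_forward(Y, p))     # feed-forward, then Add and Norm
    return Z

rng = np.random.default_rng(0)
params = init_block(rng)
X = rng.normal(size=(3, 512))   # stand-in for embeddings plus positional encoding
Z = encoder_block(X, params)

print(Z.shape)  # (3, 512)
```

The output has the same shape as the input, which is what allows identical blocks to be chained.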
11. Stacking Encoder Blocks
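The original Transformer from "Attention Is All You Need" stacks 6 encoder blocks.
The output of one encoder block becomes the input to the next:
$$X^{(\ell)} = \text{EncoderBlock}_\ell\left(X^{(\ell-1)}\right), \quad \ell = 1, \dots, 6$$
where $X^{(0)}$ is the encoder input (token embeddings plus positional encodings).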
Each encoder block has its own parameters.
That means:
Encoder block 1 and encoder block 2 have the same architecture, but not the same weights.
The multi-head attention weights, feed-forward weights, and layer norm parameters are learned separately for each block.
flowchart LR
X["input"] --> E1["encoder block 1"]
E1 --> E2["encoder block 2"]
E2 --> E3["..."]
E3 --> E6["encoder block 6"]
E6 --> O["encoder output"]
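The stacking pattern can be sketched with a loop. To keep the example short, `toy_block` is a deliberately simplified stand-in for a full encoder block (one residual sublayer plus layer normalization); the point is only that each of the 6 blocks owns separate weights and that the shape survives every step.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_block(X, W):
    """Stand-in for an encoder block: residual sublayer followed by layer norm."""
    return layer_norm(X + np.maximum(0.0, X @ W))

rng = np.random.default_rng(0)
n_blocks, d_model = 6, 512

# Same architecture for every block, but an independent weight matrix for each.
block_weights = [
    rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(n_blocks)
]

H = rng.normal(size=(3, d_model))   # stand-in encoder input
shapes = []
for W in block_weights:
    H = toy_block(H, W)             # output of one block feeds the next
    shapes.append(H.shape)

print(shapes[0], shapes[-1])  # (3, 512) (3, 512)
```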
12. Positional Encoding Is Added Once
Positional encoding $P$ is added to the token embeddings $E$ once, before the first encoder block:
$$X = E + P$$
Then the result flows through the encoder stack.
The model does not normally re-add the same positional encoding between every encoder block.
The position information becomes part of the hidden representation and is transformed through the stack.
13. Why Every Block Keeps the Same Shape
The encoder carefully preserves shape.
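If the input is:
$$X \in \mathbb{R}^{n \times 512}$$
then:
| Stage | Shape |
|---|---|
| embeddings + positional encoding | $n \times 512$ |
| multi-head attention output | $n \times 512$ |
| first Add and Norm output | $n \times 512$ |
| feed-forward output | $n \times 512$ |
| second Add and Norm output | $n \times 512$ |
| encoder block output | $n \times 512$ |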
This shape consistency matters because residual connections require element-wise addition:
$$X + \text{Sublayer}(X)$$
Both tensors must have the same shape.
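A small NumPy check illustrates the constraint. The arrays are placeholders: one has the model width of 512, the other the 2048-wide hidden size of the feed-forward network before it is projected back.

```python
import numpy as np

X = np.zeros((3, 512))
same_shape = np.ones((3, 512))   # e.g. a sublayer output projected back to 512
wider = np.ones((3, 2048))       # e.g. the FFN hidden layer before projection

ok = (X + same_shape).shape      # element-wise addition works

try:
    X + wider                    # shapes (3, 512) and (3, 2048) cannot be added
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(ok, mismatch_rejected)  # (3, 512) True
```

This is why the feed-forward network projects back to 512 dimensions before the residual addition.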
14. What Lower and Higher Encoder Layers Learn
Stacking encoder blocks lets the model build representations gradually.
Early layers may capture simpler or more local patterns:
- nearby word relationships
- phrase structure
- local syntax
Later layers can build more abstract patterns:
- sentence-level meaning
- long-range dependencies
- entity relationships
- task-relevant abstractions
This is similar to deep networks in vision:
Lower layers capture simpler features; higher layers combine them into richer concepts.
15. Encoder Output
After the final encoder block, the model has a contextual representation for every input token.
For 3 input tokens and $d_{\text{model}} = 512$:
$$\text{EncoderOutput} \in \mathbb{R}^{3 \times 512}$$
Each row is now a deep contextual representation.
The output can be used by:
- a decoder in sequence-to-sequence models
- a classification head
- a token-level prediction head
- another task-specific module
In the original encoder-decoder Transformer, the encoder output is passed to the decoder, where it is used in encoder-decoder attention.
16. Common Confusions
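Does the encoder receive only token embeddings? Not exactly. The encoder receives token embeddings plus positional encodings.
Do all encoder blocks share the same weights? No. The blocks have the same structure, but each block has its own learned parameters.
Does parallelism make sequence length free? No. Attention is parallelizable, but its computation and memory still grow with sequence length.
Does the feed-forward network mix information across tokens? No. Self-attention mixes information across tokens. The feed-forward network is applied independently to each token position.
Is Add and Norm a single operation? It is two operations: residual addition and layer normalization.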
17. One-Line Interview Answer
The Transformer encoder takes token embeddings plus positional encodings and passes them through a stack of encoder blocks. Each block contains multi-head self-attention, residual addition with layer normalization, a position-wise feed-forward network, and another residual addition with layer normalization. The output is a contextual representation for every input token.
18. Final Intuition
The encoder is a context-building machine.
Token embeddings provide meaning.
Positional encodings provide order.
Multi-head attention lets tokens exchange information.
Residual connections preserve signal and gradients.
Layer normalization stabilizes training.
The feed-forward network adds nonlinear processing and feature capacity.
Stack these blocks, and the model gradually turns raw token vectors into deep contextual representations.