Transformer Encoder Architecture

Transformers in Deep Learning - Part 10
Sibling notes: Multi-Head Attention - The Deep Dive · Positional Encoding in Transformers - The Deep Dive · Residual Connections in Transformers - The Deep Dive

The Transformer encoder is built from components that have already appeared separately:

  • token embeddings
  • positional encoding
  • multi-head self-attention
  • residual connections
  • layer normalization
  • feed-forward networks

The encoder combines these pieces into a repeatable block.

The goal of the encoder is:

Turn input tokens into contextual representations that capture meaning, order, and relationships inside the sentence.


1. The Input: Token Embeddings

For an NLP task, the model begins with tokens represented as vectors.

If the sentence is:

love Apple phones

and each token embedding has 512 dimensions, then the input embedding matrix is:

$$E \in \mathbb{R}^{3 \times 512}$$

Each row represents one token.

| Token | Vector shape |
| --- | --- |
| love | $1 \times 512$ |
| Apple | $1 \times 512$ |
| phones | $1 \times 512$ |

Embeddings carry semantic information: what the token means in general.

But normal embeddings are static. The same embedding for Apple may be used in:

Apple phones

and:

apple juice

The encoder needs to transform static embeddings into contextual representations.
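
The static behaviour described above can be sketched with a plain lookup table. The toy vocabulary, the lower-casing step, and the random values standing in for learned weights are all illustrative assumptions, not part of any real tokenizer or model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real tokenizer would be far larger.
vocab = {"love": 0, "apple": 1, "phones": 2, "juice": 3}
d_model = 512

# One row per vocabulary entry. Random values stand in for learned weights.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up one static vector per token (lower-cased for this toy vocab)."""
    ids = [vocab[t.lower()] for t in tokens]
    return embedding_table[ids]

E_phones = embed(["love", "Apple", "phones"])   # shape (3, 512)
E_juice = embed(["apple", "juice"])             # shape (2, 512)

# The row for "apple" is identical in both sentences: no context yet.
same_vector = np.array_equal(E_phones[1], E_juice[0])
print(E_phones.shape, same_vector)
```

The lookup ignores neighbouring tokens entirely, which is the gap the encoder blocks are meant to close.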


2. Add Positional Encoding

Self-attention processes tokens in parallel, so it does not naturally know word order.

The model must know that:

The cat chased the mouse.

is different from:

The mouse chased the cat.

So before the encoder block receives the input, positional encoding is added:

$$X = E + PE$$

where:

  • $E$ is the token embedding matrix
  • $PE$ is the positional encoding matrix

If:

$$E \in \mathbb{R}^{3 \times 512}$$

and:

$$PE \in \mathbb{R}^{3 \times 512}$$

then:

$$X \in \mathbb{R}^{3 \times 512}$$

The encoder input now contains both:

  • semantic meaning
  • position/order information

flowchart LR
    E["token embeddings"] --> ADD["+"]
    PE["positional encodings"] --> ADD
    ADD --> X["encoder input X"]
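
A minimal sketch of the addition step. This section does not say which positional encoding is used, so the fixed sine/cosine scheme of the original Transformer is assumed here; the function name and the random stand-in embeddings are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine encoding in the style of the original Transformer.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even feature indices
    pe[:, 1::2] = np.cos(angles)   # odd feature indices
    return pe

rng = np.random.default_rng(0)
E = rng.normal(size=(3, 512))                 # stand-in token embeddings
PE = sinusoidal_positional_encoding(3, 512)
X = E + PE                                    # element-wise sum, shape unchanged
print(X.shape)  # (3, 512)
```

Because the sum is element-wise, `E` and `PE` must have the same shape, and the encoder input keeps it.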

3. The First Major Layer: Multi-Head Self-Attention

The first major sublayer inside an encoder block is multi-head self-attention.

Self-attention lets each token look at every other token and build a contextual representation.

For example, in:

I love Apple phones

attention can help Apple move toward the technology-company meaning because of phones.

But one attention pattern is not enough for complex language.

For:

The keys on the table belong to Sara.

the word keys participates in more than one relationship:

| Relationship | Meaning |
| --- | --- |
| keys -> table | location |
| keys -> Sara | possession |

Multi-head attention allows several attention patterns to happen in parallel.

One head may focus on location. Another may focus on ownership. Another may focus on grammar or long-range dependencies.


4. Shape of Multi-Head Attention

Using the original Transformer-style dimensions:

$$d_{\text{model}} = 512, \quad h = 8, \quad d_k = d_v = 64$$

The encoder input is:

$$X \in \mathbb{R}^{3 \times 512}$$

Each attention head receives the same input but has its own learned projection matrices:

$$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{512 \times 64}$$

So each head produces:

$$Q_i, K_i, V_i \in \mathbb{R}^{3 \times 64}$$

Each head output is:

$$\text{head}_i \in \mathbb{R}^{3 \times 64}$$

With 8 heads:

$$8 \times 64 = 512$$

After concatenation:

$$\text{Concat}(\text{head}_1, \dots, \text{head}_8) \in \mathbb{R}^{3 \times 512}$$

Then an output projection maps it back into the model dimension:

$$\text{MHA}(X) \in \mathbb{R}^{3 \times 512}$$

So multi-head attention starts with:

$$3 \times 512$$

and ends with:

$$3 \times 512$$

but the output is now context-aware.
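
The shape walk-through above can be checked with a small NumPy sketch. It assumes standard scaled dot-product attention inside each head (scores divided by the square root of the head dimension), which this section does not spell out; the weights are random stand-ins and the function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i               # each (n, d_k)
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (n, n)
        heads.append(weights @ V)                            # (n, d_k)
    concat = np.concatenate(heads, axis=-1)                  # (n, h * d_k)
    return concat @ W_o                                      # (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h, d_k = 3, 512, 8, 64
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) * 0.02 for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model)) * 0.02

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (3, 512)
```

The explicit loop over heads is for readability; practical implementations usually batch all heads into one tensor operation.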


5. Parallelism: Powerful but Not Free

Self-attention processes tokens in parallel.

That is a major advantage over recurrent networks, which must process tokens step by step.

Parallelism helps because:

  • all token representations can be computed at once
  • long-range dependencies can be modeled directly
  • GPU hardware can be used efficiently

But this does not mean a 1000-token sentence costs the same as a 10-token sentence.

Attention compares every token with every other token, so its score matrix grows with sequence length:

$$n \times n$$

For a fixed model dimension, the compute and memory cost of attention therefore grows roughly as:

$$O(n^2)$$

The key point is:

Transformers are parallel over sequence positions, but longer sequences still require more computation and memory.
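
A quick arithmetic sketch of the quadratic growth, counting only the entries of one head's score matrix. The helper name is hypothetical.

```python
def attention_score_entries(n):
    """Number of entries in the n x n score matrix of one attention head."""
    return n * n

short_len, long_len = 10, 1000
ratio = attention_score_entries(long_len) / attention_score_entries(short_len)

print(attention_score_entries(short_len))  # 100
print(attention_score_entries(long_len))   # 1000000
print(ratio)                               # 10000.0
```

A sequence that is 100 times longer needs 10,000 times as many score entries, even though all of them can be computed in parallel.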


6. Add and Norm After Attention

After multi-head attention, the encoder applies Add and Norm.

This means:

  1. Add the attention input back to the attention output.
  2. Normalize the result with layer normalization.

If the input to attention is $X$, then:

$$Y = \text{LayerNorm}(X + \text{MHA}(X))$$

The addition is a residual connection.

It gives information a shortcut path:

flowchart LR
    X["X"] --> MHA["multi-head attention"]
    MHA --> ADD["+"]
    X --> ADD
    ADD --> LN["layer norm"]
    LN --> Y["Y"]

This helps with:

  • forward information flow
  • backward gradient flow
  • stable training in deep stacks

Layer normalization keeps activations controlled after the residual addition.
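
A minimal sketch of Add and Norm. The layer norm below assumes the usual form (normalize each token over its features, then apply a learnable scale `gamma` and shift `beta`); the `eps` value and the random stand-in for the attention output are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) over the feature dimension, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    """Residual addition followed by layer normalization."""
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))
attn_out = rng.normal(size=(3, 512)) * 5.0    # stand-in for MHA(X)
gamma, beta = np.ones(512), np.zeros(512)     # typical initial scale and shift

Y = add_and_norm(X, attn_out, gamma, beta)
print(Y.shape)  # (3, 512)
```

With `gamma` at one and `beta` at zero, each row of `Y` has mean close to 0 and standard deviation close to 1, whatever the scale of the sublayer output.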


7. Why the Feed-Forward Network Is Needed

After attention and normalization, the encoder uses a position-wise feed-forward network.

Attention mixes information across tokens.

The feed-forward network transforms each token representation independently.

Why is this needed?

Because the attention sublayer is mostly linear: its query, key, value, and output projections are linear maps, and the softmax only sets the weights used to average the value vectors. The encoder still needs a stronger per-token nonlinearity to learn complex patterns.

Without nonlinear activation functions, stacking linear layers would not add much power:

$$W_2(W_1 x) = (W_2 W_1)x$$

Multiple linear layers can collapse into one linear layer.

A nonlinear activation prevents that collapse and gives the model more expressive power.
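
The collapse can be verified on a tiny hand-picked example. The 2x2 matrices and the input vector are arbitrary illustrative values.

```python
import numpy as np

W1 = np.array([[3.0, 1.0],
               [1.0, 2.0]])
W2 = np.array([[1.0, -1.0],
               [0.0, 2.0]])
x = np.array([1.0, -2.0])

# Two stacked linear layers equal one layer with the merged matrix W2 W1.
two_layers = W2 @ (W1 @ x)      # [ 4., -6.]
one_layer = (W2 @ W1) @ x       # [ 4., -6.]
collapses = np.allclose(two_layers, one_layer)

# With ReLU between the layers, the merged matrix no longer matches.
with_relu = W2 @ np.maximum(0.0, W1 @ x)   # [1., 0.]
relu_differs = not np.allclose(with_relu, one_layer)

print(collapses, relu_differs)  # True True
```

Here `W1 @ x` is `[1, -3]`; ReLU zeroes the negative entry, so no single matrix reproduces the result for all inputs.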


8. The Feed-Forward Network Formula

In the original Transformer, each encoder block uses a two-layer feed-forward network:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The activation:

$$\max(0, \cdot)$$

is ReLU.

The dimensions are:

$$512 \rightarrow 2048 \rightarrow 512$$

So for each token:

  1. Expand from 512 dimensions to 2048 dimensions.
  2. Apply ReLU.
  3. Project back from 2048 dimensions to 512 dimensions.

flowchart LR
    X["token vector: 512"] --> D1["dense: 2048"]
    D1 --> R["ReLU"]
    R --> D2["dense: 512"]

The expansion gives the model more capacity to build richer features.

The projection back to 512 keeps the shape compatible with residual connections and the next encoder block.
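
The formula and the three steps above translate directly into a short sketch. Weights are random stand-ins and the variable names are hypothetical; the last lines check the position-wise property by running one token on its own.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff, then ReLU
    return hidden @ W2 + b2                 # project back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

Y = rng.normal(size=(3, d_model))
out = feed_forward(Y, W1, b1, W2, b2)
print(out.shape)  # (3, 512)

# Position-wise: running one token alone gives the same row as in the batch.
row_alone = feed_forward(Y[1:2], W1, b1, W2, b2)
position_wise = np.allclose(out[1], row_alone[0])
```

The same `W1` and `W2` are shared by every token position within a block; only the input row changes.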


9. Add and Norm After Feed-Forward

The feed-forward network is followed by another Add and Norm step.

If $Y$ is the result after the attention Add and Norm, then:

$$Z = \text{LayerNorm}(Y + \text{FFN}(Y))$$

Again:

  • the residual addition preserves information and gradient flow
  • layer normalization stabilizes activations
  • the output shape remains unchanged

If:

$$Y \in \mathbb{R}^{3 \times 512}$$

then:

$$Z \in \mathbb{R}^{3 \times 512}$$

This $Z$ is the output of one complete encoder block.


10. One Complete Encoder Block

One encoder block can be summarized as:

$$X \rightarrow \text{Multi-Head Attention} \rightarrow \text{Add and Norm} \rightarrow \text{Feed-Forward Network} \rightarrow \text{Add and Norm}$$

More explicitly:

$$Y = \text{LayerNorm}(X + \text{MHA}(X))$$

$$Z = \text{LayerNorm}(Y + \text{FFN}(Y))$$

flowchart TD
    X["input: embeddings + position"] --> MHA["multi-head self-attention"]
    MHA --> ADD1["add residual"]
    X --> ADD1
    ADD1 --> LN1["layer norm"]
    LN1 --> FFN["feed-forward network"]
    FFN --> ADD2["add residual"]
    LN1 --> ADD2
    ADD2 --> LN2["layer norm"]
    LN2 --> OUT["encoder block output"]

This is the encoder block used in the original Transformer architecture.
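
Putting the two equations together, one block might be sketched as follows. This is an illustrative NumPy version under stated assumptions: scaled dot-product attention inside each head, random stand-in weights, and a hypothetical parameter dictionary whose key names are chosen here for readability.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def init_block(rng, d_model=512, h=8, d_ff=2048):
    """Random stand-in parameters for one encoder block (hypothetical layout)."""
    d_k = d_model // h

    def w(*shape):
        return rng.normal(size=shape) * 0.02

    return {
        "W_q": w(h, d_model, d_k), "W_k": w(h, d_model, d_k),
        "W_v": w(h, d_model, d_k), "W_o": w(h * d_k, d_model),
        "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "W2": w(d_ff, d_model), "b2": np.zeros(d_model),
        "g1": np.ones(d_model), "be1": np.zeros(d_model),
        "g2": np.ones(d_model), "be2": np.zeros(d_model),
    }

def mha(X, p):
    # Matmul broadcasts X over the head axis: each result is (h, n, d_k).
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])   # (h, n, n)
    heads = softmax(scores) @ V                                 # (h, n, d_k)
    concat = np.concatenate(list(heads), axis=-1)               # (n, h * d_k)
    return concat @ p["W_o"]                                    # (n, d_model)

def ffn(x, p):
    return np.maximum(0.0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def encoder_block(X, p):
    Y = layer_norm(X + mha(X, p), p["g1"], p["be1"])   # Add and Norm after attention
    Z = layer_norm(Y + ffn(Y, p), p["g2"], p["be2"])   # Add and Norm after FFN
    return Z

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))
Z = encoder_block(X, init_block(rng))
print(Z.shape)  # (3, 512)
```

Dropout and padding masks are left out to keep the sketch short.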


11. Stacking Encoder Blocks

The original Transformer from "Attention Is All You Need" stacks 6 encoder blocks.

The output of one encoder block becomes the input to the next:

$$X_1 \rightarrow \text{Encoder}_1 \rightarrow X_2 \rightarrow \text{Encoder}_2 \rightarrow \dots \rightarrow \text{Encoder}_6$$

Each encoder block has its own parameters.

That means:

Encoder block 1 and encoder block 2 have the same architecture, but not the same weights.

The multi-head attention weights, feed-forward weights, and layer norm parameters are learned separately for each block.

flowchart LR
    X["input"] --> E1["encoder block 1"]
    E1 --> E2["encoder block 2"]
    E2 --> E3["..."]
    E3 --> E6["encoder block 6"]
    E6 --> O["encoder output"]
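
The stacking pattern itself is just a loop over separately initialized parameter sets. To keep the sketch short, each block below is a deliberately simplified stand-in (a residual plus one linear map), not a full encoder block; all names are hypothetical.

```python
import numpy as np

def init_block(rng, d_model):
    """Parameters for a simplified stand-in block (not a full encoder block)."""
    return {"W": rng.normal(size=(d_model, d_model)) * 0.02}

def apply_block(x, params):
    """Stand-in computation: residual plus one linear map, shape preserved."""
    return x + x @ params["W"]

def run_stack(x, blocks):
    """Feed the output of each block into the next one."""
    for params in blocks:
        x = apply_block(x, params)
    return x

rng = np.random.default_rng(0)
d_model, num_blocks = 512, 6
blocks = [init_block(rng, d_model) for _ in range(num_blocks)]  # six separate weight sets

X1 = rng.normal(size=(3, d_model))
encoder_output = run_stack(X1, blocks)

# Same architecture, different weights.
shares_weights = np.array_equal(blocks[0]["W"], blocks[1]["W"])
print(encoder_output.shape, shares_weights)  # (3, 512) False
```

Swapping `apply_block` for a full attention-plus-FFN block changes nothing about the loop, because every block maps a `(3, 512)` input to a `(3, 512)` output.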

12. Positional Encoding Is Added Once

Positional encoding is added before the first encoder block:

$$X = E + PE$$

Then the result flows through the encoder stack.

The model does not normally re-add the same positional encoding between every encoder block.

The position information becomes part of the hidden representation and is transformed through the stack.


13. Why Every Block Keeps the Same Shape

The encoder carefully preserves shape.

If the input is:

$$3 \times 512$$

then:

| Stage | Shape |
| --- | --- |
| embeddings + positional encoding | $3 \times 512$ |
| multi-head attention output | $3 \times 512$ |
| first Add and Norm output | $3 \times 512$ |
| feed-forward output | $3 \times 512$ |
| second Add and Norm output | $3 \times 512$ |
| encoder block output | $3 \times 512$ |

This shape consistency matters because residual connections require element-wise addition:

$$x + F(x)$$

Both tensors must have the same shape.
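
A small sketch of what goes wrong when a sublayer does not return to the model dimension. The arrays are placeholders; the `wider` tensor mimics the 2048-wide hidden layer of the feed-forward network before it is projected back.

```python
import numpy as np

x = np.zeros((3, 512))
same_shape = np.ones((3, 512))
wider = np.ones((3, 2048))      # e.g. the FFN hidden layer before projecting back

residual_shape = (x + same_shape).shape   # (3, 512)

try:
    x + wider
    mismatch_raises = False
except ValueError:
    mismatch_raises = True      # 512 and 2048 cannot be broadcast together

print(residual_shape, mismatch_raises)  # (3, 512) True
```

NumPy broadcasting would silently accept some other shapes, such as `(1, 512)`, so an explicit shape check is a reasonable safeguard in real code.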


14. What Lower and Higher Encoder Layers Learn

Stacking encoder blocks lets the model build representations gradually.

Early layers may capture simpler or more local patterns:

  • nearby word relationships
  • phrase structure
  • local syntax

Later layers can build more abstract patterns:

  • sentence-level meaning
  • long-range dependencies
  • entity relationships
  • task-relevant abstractions

This is similar to deep networks in vision:

Lower layers capture simpler features; higher layers combine them into richer concepts.


15. Encoder Output

After the final encoder block, the model has a contextual representation for every input token.

For 3 input tokens and $d_{\text{model}} = 512$:

$$\text{EncoderOutput} \in \mathbb{R}^{3 \times 512}$$

Each row is now a deep contextual representation.

The output can be used by:

  • a decoder in sequence-to-sequence models
  • a classification head
  • a token-level prediction head
  • another task-specific module

In the original encoder-decoder Transformer, the encoder output is passed to the decoder, where it is used in encoder-decoder attention.
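
As a rough illustration of two of the consumers listed above, the sketch below attaches a token-level head and a sequence-level classification head to a stand-in encoder output. The label counts, the random weights, and the use of mean pooling are assumptions for illustration; other pooling choices exist.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(3, 512))   # stand-in for the final block output

# Token-level head: one prediction per token (5 hypothetical tag classes).
W_token = rng.normal(size=(512, 5)) * 0.02
token_logits = encoder_output @ W_token           # (3, 5)

# Sequence-level head: pool over tokens first. Mean pooling is an assumed
# choice here, followed by a map to 2 hypothetical classes.
W_cls = rng.normal(size=(512, 2)) * 0.02
sentence_vector = encoder_output.mean(axis=0)     # (512,)
class_logits = sentence_vector @ W_cls            # (2,)

print(token_logits.shape, class_logits.shape)  # (3, 5) (2,)
```

The token-level head keeps one row per input token, while the classification head reduces the whole sequence to a single vector before predicting.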


16. Common Confusions

Does the encoder receive only token embeddings?

Not exactly. The encoder receives token embeddings plus positional encodings.

Do all encoder blocks share the same weights?

No. The blocks have the same structure, but each block has its own learned parameters.

Does parallel attention make sequence length free?

No. Attention is parallelizable, but its computation and memory still grow with sequence length.

Does the feed-forward network mix information across tokens?

No. Self-attention mixes information across tokens. The feed-forward network is applied independently to each token position.

Is Add and Norm a single operation?

It is two operations: residual addition and layer normalization.


17. One-Line Interview Answer

The Transformer encoder takes token embeddings plus positional encodings, passes them through stacked encoder blocks, and each block contains multi-head self-attention, residual addition with layer normalization, a position-wise feed-forward network, and another residual addition with layer normalization. The output is a contextual representation for every input token.


18. Final Intuition

The encoder is a context-building machine.

Token embeddings provide meaning.

Positional encodings provide order.

Multi-head attention lets tokens exchange information.

Residual connections preserve signal and gradients.

Layer normalization stabilizes training.

The feed-forward network adds nonlinear processing and feature capacity.

Stack these blocks, and the model gradually turns raw token vectors into deep contextual representations.