Residual Connections in Transformers

Transformers in Deep Learning - Part 9

Deep neural networks are powerful, but they are difficult to train.

As information moves through many layers, two things can go wrong:

  • the forward signal can become distorted
  • the backward gradient can become too small or too large

Residual connections mitigate both problems by giving information a shortcut path through the network.


1. The Problem With Very Deep Networks

Imagine an untrained deep network with randomly initialized weights.

The input passes through layer after layer:

$$x \rightarrow f_1(x) \rightarrow f_2(f_1(x)) \rightarrow \dots$$

Each layer transforms the signal.

Early in training, those transformations are mostly random.

After many layers, useful information from the original input may become distorted.

flowchart LR
    X["input"] --> L1["layer 1"]
    L1 --> L2["layer 2"]
    L2 --> L3["layer 3"]
    L3 --> LN["many more layers"]
    LN --> O["noisy output"]

If the output is mostly noise, learning becomes harder.

The network must discover useful transformations while also trying not to destroy the information it already has.


2. The Vanishing Gradient Problem

Training uses backpropagation.

Gradients flow backward from the loss through all layers.

In a very deep network, gradients may repeatedly get multiplied by small values.

If many values are less than 1, their product can become extremely small:

$$0.5^{20} \approx 0.000001$$

This means early layers may receive almost no useful update.

That is the vanishing gradient problem.

flowchart RL
    LOSS["loss"] --> G3["later gradients"]
    G3 --> G2["smaller gradients"]
    G2 --> G1["tiny gradients near early layers"]

If the early layers barely update, the whole deep model may train slowly or fail to converge.
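
The shrinking product can be checked directly. The sketch below assumes, purely for illustration, that every layer contributes the same local derivative of 0.5; in a real network the factor differs from layer to layer, and `gradient_after` is a hypothetical helper name.

```python
# Toy model of backpropagation through a plain stack: the gradient that
# reaches the first layer is the product of one local derivative per layer.
# Assumption: every local derivative equals 0.5 (illustrative only).

def gradient_after(depth, local_derivative=0.5):
    """Multiply `depth` identical local derivatives together."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_derivative
    return grad

for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: gradient factor = {gradient_after(depth):.2e}")
```

The factor halves with every extra layer, so by depth 20 it is already below one millionth.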


3. The Core Idea of Residual Connections

Instead of forcing information to pass only through a long chain of transformations, give it a shortcut.

For a block that computes:

$$F(x)$$

a residual connection outputs:

$$y = x + F(x)$$

The input $x$ skips around the block and is added to the block's transformed output.

flowchart LR
    X["x"] --> F["block F(x)"]
    F --> ADD["+"]
    X --> ADD
    ADD --> Y["y = x + F(x)"]

This shortcut is also called a skip connection.

The block no longer has to learn the entire desired output from scratch.

It only needs to learn the residual:

What should be added to the input?
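
A minimal sketch of the idea, assuming vectors are plain Python lists of floats. The names `residual` and `small_correction` are hypothetical and exist only for this illustration.

```python
# A residual connection as a wrapper: y = x + F(x).
# `block` is any function mapping a list of floats to a list of the same length.

def residual(block):
    """Return a function that computes x + block(x) element-wise."""
    def wrapped(x):
        fx = block(x)
        if len(fx) != len(x):
            raise ValueError("block must preserve the length of x")
        return [xi + fi for xi, fi in zip(x, fx)]
    return wrapped

# Hypothetical block: it only proposes a small correction to its input.
def small_correction(x):
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
y = residual(small_correction)(x)
print(y)  # each entry is the input plus 10% of itself
```

If the block returned all zeros, the wrapped function would return its input unchanged, which is the "do nothing" behavior the shortcut makes available for free.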


4. Why It Helps Forward Information Flow

Without a residual connection, the next block receives only:

$$F(x)$$

If $F$ is poor early in training, the next block receives poor information.

With a residual connection, the next block receives:

$$x + F(x)$$

Even if $F(x)$ is not useful yet, the original input signal $x$ still passes forward.

This gives later layers access to cleaner information.

flowchart TD
    X["input signal"] --> LONG["transformed path F(x)"]
    X --> SHORT["shortcut path x"]
    LONG --> ADD["combine"]
    SHORT --> ADD
    ADD --> NEXT["next block"]

The network can start with something close to an identity path and gradually learn useful refinements.
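
The effect on the forward signal can be observed in a small experiment. This is a sketch under stated assumptions, not a benchmark: each untrained layer is taken to be `tanh(W x)` with small random Gaussian weights (scale 0.1), the width is 32, the depth is 20, and cosine similarity to the original input stands in for "how much of the input survives". All names are hypothetical.

```python
import math
import random

def random_layer(dim, scale, rng):
    """Return F(x) = tanh(W x) with W drawn from N(0, (scale / sqrt(dim))^2)."""
    std = scale / math.sqrt(dim)
    w = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]
    def layer(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return layer

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb)

rng = random.Random(0)          # fixed seed so the run is repeatable
dim, depth = 32, 20             # assumed sizes, chosen only for illustration
layers = [random_layer(dim, scale=0.1, rng=rng) for _ in range(depth)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

plain, res = x, x
for f in layers:
    plain = f(plain)                                # next input is only F(x)
    res = [r + fr for r, fr in zip(res, f(res))]    # next input is x + F(x)

sim_plain = cosine(x, plain)
sim_res = cosine(x, res)
print(f"plain stack    : cosine similarity to input = {sim_plain:+.3f}")
print(f"residual stack : cosine similarity to input = {sim_res:+.3f}")
```

With these settings the residual output should stay well aligned with the input, while the plain output points in an essentially unrelated direction. The exact numbers depend on the seed and on the assumed weight scale.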


5. Why It Helps Gradient Flow

Residual connections also help during backpropagation.

For:

$$y = x + F(x)$$

the gradient has a direct path through the addition.

The derivative contains an identity term:

$$\frac{\partial y}{\partial x} = I + \frac{\partial F(x)}{\partial x}$$

That $I$ term matters.

It means gradients can flow backward through the shortcut even when the transformed path is weak.
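
Stacking several residual blocks makes the role of the identity term clearer. Assume, as notation introduced here for illustration, that block $l$ computes $x_{l+1} = x_l + F_l(x_l)$ for $l = 0, \dots, L-1$, and that the product below is ordered with later blocks on the left. The chain rule then gives:

```latex
\frac{\partial x_L}{\partial x_0}
  = \prod_{l=0}^{L-1} \left( I + \frac{\partial F_l(x_l)}{\partial x_l} \right)
  = I + \sum_{l=0}^{L-1} \frac{\partial F_l(x_l)}{\partial x_l}
      + \text{(products of two or more block Jacobians)}
```

Expanding the product always leaves a bare $I$, so part of the gradient reaches $x_0$ without passing through any block. In a plain stack the same derivative is $\prod_{l} \partial F_l(x_l) / \partial x_l$, which has no such term and shrinks whenever the individual Jacobians are small.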

So residual connections improve:

  • forward information flow
  • backward gradient flow
  • early training stability
  • optimization of deep networks
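
The gradient claim can be checked on a scalar toy stack. The sketch assumes every sublayer is $F(x) = \tanh(0.5\,x)$, so each local derivative lies between 0 and 0.5; the weight, the depth of 20, and the function names are all illustrative choices rather than anything prescribed by the Transformer architecture.

```python
import math

W = 0.5       # assumed weight shared by every toy sublayer
DEPTH = 20    # assumed number of stacked sublayers

def f(x):
    """Toy sublayer F(x) = tanh(W * x)."""
    return math.tanh(W * x)

def f_prime(x):
    """dF/dx = W * (1 - tanh(W * x)^2), which is positive and at most W."""
    return W * (1.0 - math.tanh(W * x) ** 2)

def input_gradient(x, use_residual):
    """Return d(output)/d(input) for a DEPTH-layer scalar stack via the chain rule."""
    grad = 1.0
    for _ in range(DEPTH):
        local = f_prime(x)
        if use_residual:
            grad *= 1.0 + local   # identity term plus the block's derivative
            x = x + f(x)
        else:
            grad *= local         # only the block's derivative
            x = f(x)
    return grad

grad_plain = input_gradient(1.0, use_residual=False)
grad_res = input_gradient(1.0, use_residual=True)
print(f"plain stack gradient    : {grad_plain:.3e}")
print(f"residual stack gradient : {grad_res:.3e}")
```

Every factor in the plain stack is at most 0.5, so its gradient cannot exceed $0.5^{20}$. Every factor in the residual stack is at least 1, so its gradient cannot vanish. The same identity term can let gradients grow, which is one reason residual connections are paired with normalization.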

6. Learning Residual Information

A residual block does not need to learn:

"Create the entire next representation."

It can instead learn:

"Keep the input, and add the useful difference."

This reduces the burden on each block.

If the best transformation is close to doing nothing, the block can learn a small residual.

If the model needs richer behavior, the block can add more through $F(x)$.

This is why residual networks can be much deeper than plain feed-forward stacks.
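
A scalar toy problem illustrates the reduced burden. The sketch assumes the ideal mapping is close to the identity (target $t = 1.05\,x$), a single weight `w` initialized to zero, and plain gradient descent on squared error. These choices and all names are hypothetical.

```python
# Toy scalar regression where the ideal mapping is close to the identity.
# Assumptions (illustrative): target t = 1.05 * x, one weight w, w starts at 0.

TARGET_SLOPE = 1.05
X = 1.0
LR = 0.1

def plain_output(w, x):
    return w * x            # the block must produce the whole output

def residual_output(w, x):
    return x + w * x        # the block only produces the difference

def train(output_fn, steps):
    """Run gradient descent on squared error; return the loss before each step."""
    w, losses = 0.0, []
    for _ in range(steps):
        error = output_fn(w, X) - TARGET_SLOPE * X
        losses.append(error ** 2)
        # d(error^2)/dw = 2 * error * x for both parameterizations
        w -= LR * 2.0 * error * X
    return losses

plain_losses = train(plain_output, steps=10)
residual_losses = train(residual_output, steps=10)
for step, (p, r) in enumerate(zip(plain_losses, residual_losses)):
    print(f"step {step}: plain loss = {p:.6f}, residual loss = {r:.6f}")
```

Both versions follow the same update rule, but the residual version starts with an error of 0.05 instead of 1.05 because the identity part is already in place.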


7. Why Shapes Must Match

Residual addition is element-wise:

$$x + F(x)$$

So the two tensors must have the same shape.

If:

$$x \in \mathbb{R}^{3 \times 512}$$

then:

$$F(x) \in \mathbb{R}^{3 \times 512}$$

is needed for direct addition.

This shape requirement influences architecture design.

Transformers are built so the major sublayers preserve the model dimension:

$$d_{\text{model}} \rightarrow d_{\text{model}}$$

That makes residual addition simple.
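
The shape requirement can be made explicit with a small guard. This sketch represents a tensor as a list of rows and uses hypothetical helpers `shape` and `residual_add`; a tensor library would perform a similar check internally.

```python
def shape(matrix):
    """Return (rows, cols) of a rectangular list-of-lists matrix."""
    return (len(matrix), len(matrix[0]) if matrix else 0)

def residual_add(x, fx):
    """Element-wise x + fx; refuses to add tensors with different shapes."""
    if shape(x) != shape(fx):
        raise ValueError(f"shape mismatch: {shape(x)} vs {shape(fx)}")
    return [[a + b for a, b in zip(row_x, row_fx)] for row_x, row_fx in zip(x, fx)]

x = [[1.0] * 512 for _ in range(3)]        # 3 tokens, d_model = 512
fx_ok = [[0.5] * 512 for _ in range(3)]    # same shape: addition works
fx_bad = [[0.5] * 256 for _ in range(3)]   # wrong width: addition is rejected

y = residual_add(x, fx_ok)
print(shape(y))

try:
    residual_add(x, fx_bad)
except ValueError as err:
    print("rejected:", err)
```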


8. Addition vs Concatenation

Instead of adding, one could concatenate:

$$[x; F(x)]$$

Concatenation does not require matching feature dimensions, but it increases the representation size.

| Method | Shape requirement | Cost |
| --- | --- | --- |
| addition | same shape | keeps dimension fixed |
| concatenation | dimensions can differ | increases dimension and parameters |

Transformers usually use addition because it keeps the model width stable.

If every residual path doubled the dimension, later layers would become much more expensive.
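
The growth is easy to quantify. The sketch below tracks only the feature width and assumes each block's $F$ outputs the same width as its input, so concatenation doubles the width at every block. The function name `widths` is hypothetical.

```python
def widths(d_model, num_blocks, combine):
    """Track feature width after each block.

    Assumes F(x) has the same width as x, so 'concat' doubles the width
    and 'add' leaves it unchanged.
    """
    if combine not in ("add", "concat"):
        raise ValueError("combine must be 'add' or 'concat'")
    width, history = d_model, [d_model]
    for _ in range(num_blocks):
        f_width = width                  # width of F(x)
        if combine == "concat":
            width = width + f_width      # [x; F(x)]: widths add up
        history.append(width)            # 'add': width stays the same
    return history

print("add   :", widths(512, 6, "add"))
print("concat:", widths(512, 6, "concat"))
```

After six blocks, addition still has width 512, while concatenation has reached $512 \cdot 2^6 = 32768$.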


9. Residual Connections in Transformers

A Transformer block has two major sublayers:

  • multi-head attention
  • feed-forward network

Each sublayer uses a residual connection.

Conceptually:

$$y = x + \text{MultiHeadAttention}(x)$$

then:

$$z = y + \text{FeedForward}(y)$$

The original Transformer combines residual connections with layer normalization:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

This pattern is often called:

Add and Norm

because it adds the residual connection and then normalizes the result.
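
The Add and Norm pattern can be sketched in a few lines. The example is a simplification under stated assumptions: `layer_norm` has no learned gain or bias, and `toy_attention` and `toy_ffn` are hypothetical stand-ins that only preserve shape, not real attention or a real feed-forward network.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def add_and_norm(x, sublayer):
    """Post-norm pattern: LayerNorm(x + Sublayer(x)), applied per token."""
    out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(rx, ro)]) for rx, ro in zip(x, out)]

# Hypothetical stand-ins, NOT real attention or a real feed-forward network.
def toy_attention(x):
    """Mix tokens by replacing each one with the average of all tokens."""
    n = len(x)
    mean_token = [sum(col) / n for col in zip(*x)]
    return [list(mean_token) for _ in x]

def toy_ffn(x):
    """Apply a position-wise ReLU to each feature."""
    return [[max(0.0, v) for v in row] for row in x]

def transformer_block(x):
    y = add_and_norm(x, toy_attention)   # y = LayerNorm(x + attention(x))
    z = add_and_norm(y, toy_ffn)         # z = LayerNorm(y + ffn(y))
    return z

x = [[1.0, 2.0, 3.0, 4.0],
     [0.0, -1.0, 1.0, 0.5],
     [2.0, 2.0, -2.0, 0.0]]
z = transformer_block(x)
print(len(z), "tokens of width", len(z[0]))
```

Each sublayer sees the normalized output of the previous Add and Norm step, and the block returns a tensor with the same shape as its input.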


10. Dimension Walkthrough

Suppose the input to a Transformer block is:

$$X \in \mathbb{R}^{3 \times 512}$$

The multi-head attention block returns:

$$\text{MHA}(X) \in \mathbb{R}^{3 \times 512}$$

So residual addition is valid:

$$Y = X + \text{MHA}(X) \in \mathbb{R}^{3 \times 512}$$

Then the feed-forward network also returns:

$$\text{FFN}(Y) \in \mathbb{R}^{3 \times 512}$$

So another residual addition is valid:

$$Z = Y + \text{FFN}(Y) \in \mathbb{R}^{3 \times 512}$$

The sequence length stays the same.

The model dimension stays the same.

But the representation becomes more contextual and refined after each block.
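
The walkthrough can be replayed as a shape-only trace. No values are computed here; the sketch only propagates `(sequence length, width)` tuples. It assumes attention as a whole maps $d_{\text{model}}$ to $d_{\text{model}}$ and that the feed-forward network expands to an inner width of 2048, the value used in the original Transformer, before projecting back. The helper names are hypothetical.

```python
def linear(shape, d_in, d_out):
    """Shape after applying a (d_in -> d_out) linear map to every token."""
    seq_len, width = shape
    if width != d_in:
        raise ValueError(f"expected width {d_in}, got {width}")
    return (seq_len, d_out)

def residual_add(shape_x, shape_fx):
    """Shape of x + F(x); the two shapes must be identical."""
    if shape_x != shape_fx:
        raise ValueError(f"cannot add {shape_x} and {shape_fx}")
    return shape_x

D_MODEL = 512
D_FF = 2048   # assumed inner width of the feed-forward network

X = (3, D_MODEL)
mha_out = linear(X, D_MODEL, D_MODEL)     # attention modeled only by its overall shape
Y = residual_add(X, mha_out)
hidden = linear(Y, D_MODEL, D_FF)         # feed-forward network expands ...
ffn_out = linear(hidden, D_FF, D_MODEL)   # ... then projects back to d_model
Z = residual_add(Y, ffn_out)

for name, s in [("X", X), ("MHA(X)", mha_out), ("Y", Y),
                ("FFN hidden", hidden), ("FFN(Y)", ffn_out), ("Z", Z)]:
    print(f"{name:10s} {s}")
```

The inner width of the feed-forward network can differ from $d_{\text{model}}$ because only the sublayer's final output has to match the shortcut.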


11. Why Transformers Need Residual Connections

Transformers stack many blocks.

In large language models, the stack can be dozens of blocks deep; the largest GPT-3 model, for example, stacks 96.

Without residual connections:

  • early input information could be degraded
  • gradients could struggle to reach early layers
  • very deep stacks would be harder to optimize
  • training would be slower and less stable

Residual connections let the model go deep while preserving trainability.

They are one of the small architectural details that make large Transformers practical.


12. Common Confusions

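A residual connection does not skip the layer's computation. The layer still computes $F(x)$; the shortcut adds the original input back to that output.

Residual connections do not help only the forward pass. They also create a shorter gradient path during backpropagation.

A residual block does not merely copy its input. Copying is available as a stable starting path, but the learned block can add useful transformations through $F(x)$.

Residual addition does not work across mismatched shapes. Direct addition requires matching shapes.

Residual connections and layer normalization do not do the same job. Residual connections help information and gradients flow; layer normalization stabilizes activation scale.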


13. One-Line Interview Answer

Residual connections add a sublayer's input to its output, usually written as $x + F(x)$. This gives information and gradients a shortcut path through deep networks, reduces the burden on each layer, and makes very deep Transformer stacks easier to train.


14. Final Intuition

A plain deep network says:

Every layer must completely transform the signal for the next layer.

A residual network says:

Keep the original signal available, and let each block add what is missing.

That small change makes depth much easier to train.

In Transformers, residual connections preserve information across attention and feed-forward blocks, while layer normalization keeps the resulting activations stable.