Residual Connections in Transformers
Transformers in Deep Learning - Part 9
Sibling notes: Batch Normalization and Layer Normalization - The Deep Dive · Multi-Head Attention - The Deep Dive
Deep neural networks are powerful, but they are difficult to train.
As information moves through many layers, two things can go wrong:
- the forward signal can become distorted
- the backward gradient can become too small or too large
Residual connections mitigate both problems by giving information a shortcut path through the network.
1. The Problem With Very Deep Networks
Imagine an untrained deep network with randomly initialized weights.
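The input passes through layer after layer:
$$x \rightarrow \text{layer}_1 \rightarrow \text{layer}_2 \rightarrow \cdots \rightarrow \text{layer}_L \rightarrow \text{output}$$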
Each layer transforms the signal.
Early in training, those transformations are mostly random.
After many layers, useful information from the original input may become distorted.
flowchart LR
X["input"] --> L1["layer 1"]
L1 --> L2["layer 2"]
L2 --> L3["layer 3"]
L3 --> LN["many more layers"]
LN --> O["noisy output"]
If the output is mostly noise, learning becomes harder.
The network must discover useful transformations while also trying not to destroy the information it already has.
2. The Vanishing Gradient Problem
Training uses backpropagation.
Gradients flow backward from the loss through all layers.
In a very deep network, gradients may repeatedly get multiplied by small values.
If many values are less than 1, their product can become extremely small. For example, twenty factors of 0.5 give:
$$0.5^{20} \approx 9.5 \times 10^{-7}$$
This means early layers may receive almost no useful update.
That is the vanishing gradient problem.
flowchart RL
LOSS["loss"] --> G3["later gradients"]
G3 --> G2["smaller gradients"]
G2 --> G1["tiny gradients near early layers"]
If the early layers barely update, the whole deep model may train slowly or fail to converge.
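The shrinking product can be checked with a few lines of Python. This is only a sketch: the per-layer factor of 0.5 and the helper name `backprop_scale` are illustrative assumptions, and real per-layer gradients are matrices rather than single numbers.

```python
import math

def backprop_scale(local_gradients):
    """Multiply per-layer gradient factors, as the chain rule does across a plain stack."""
    return math.prod(local_gradients)

# Illustrative assumption: every layer scales the gradient by 0.5.
shallow = backprop_scale([0.5] * 5)
deep = backprop_scale([0.5] * 20)

print(f"5 layers:  {shallow}")
print(f"20 layers: {deep}")
```

With these assumed factors, the 20-layer product falls below one millionth, while the 5-layer product is still about 0.03.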
3. The Core Idea of Residual Connections
Instead of forcing information to pass only through a long chain of transformations, give it a shortcut.
For a block that computes:
$$F(x)$$
a residual connection outputs:
$$y = x + F(x)$$
The input skips around the block and is added to the block's transformed output.
flowchart LR
X["x"] --> F["block F(x)"]
F --> ADD["+"]
X --> ADD
ADD --> Y["y = x + F(x)"]
This shortcut is also called a skip connection.
The block no longer has to learn the entire desired output from scratch.
It only needs to learn the residual:
What should be added to the input?
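A minimal sketch of the idea, using plain Python lists as vectors. The wrapper name `residual` and the toy block `small_update` are hypothetical names chosen for illustration, not part of any library.

```python
from typing import Callable, List

Vector = List[float]

def residual(block: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return y = x + F(x), where F is the wrapped block."""
    fx = block(x)
    if len(fx) != len(x):
        raise ValueError("block output must have the same length as its input")
    return [xi + fi for xi, fi in zip(x, fx)]

def small_update(x: Vector) -> Vector:
    """Hypothetical block F: proposes a 10% nudge on each feature."""
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 4.0]
y = residual(small_update, x)
print(y)
```

The block only has to produce the adjustment; the wrapper is responsible for carrying the input forward.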
4. Why It Helps Forward Information Flow
Without a residual connection, the next block receives only:
$$F(x)$$
If $F(x)$ is poor early in training, the next block receives poor information.
With a residual connection, the next block receives:
$$x + F(x)$$
Even if $F(x)$ is not useful yet, the original input signal still passes forward.
This gives later layers access to cleaner information.
flowchart TD
X["input signal"] --> LONG["transformed path F(x)"]
X --> SHORT["shortcut path x"]
LONG --> ADD["combine"]
SHORT --> ADD
ADD --> NEXT["next block"]
The network can start with something close to an identity path and gradually learn useful refinements.
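To make the contrast concrete, here is a deliberately extreme sketch. It assumes a block whose output ignores its input entirely (`uninformative_block`, a hypothetical stand-in for a poorly trained block) and compares a plain stack against a residual stack.

```python
def uninformative_block(x):
    """Stand-in for a poorly trained block: its output ignores the input."""
    return [0.01 for _ in x]

def plain_stack(x, depth):
    """Each layer replaces the signal with the block output."""
    for _ in range(depth):
        x = uninformative_block(x)
    return x

def residual_stack(x, depth):
    """Each layer adds the block output to the signal it received."""
    for _ in range(depth):
        x = [xi + fi for xi, fi in zip(x, uninformative_block(x))]
    return x

a = [1.0, 2.0]
b = [3.0, -1.0]

plain_a, plain_b = plain_stack(a, 10), plain_stack(b, 10)
res_a, res_b = residual_stack(a, 10), residual_stack(b, 10)

print("plain outputs identical:", plain_a == plain_b)
print("residual outputs identical:", res_a == res_b)
```

In the plain stack the two inputs become indistinguishable after the first layer. In the residual stack they stay distinct, because each input is carried along the shortcut and only shifted by the block outputs.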
5. Why It Helps Gradient Flow
Residual connections also help during backpropagation.
For:
$$y = x + F(x)$$
the gradient has a direct path through the addition.
The derivative contains an identity term:
$$\frac{\partial y}{\partial x} = I + \frac{\partial F(x)}{\partial x}$$
That term matters.
It means gradients can flow backward through the shortcut even when the transformed path is weak.
So residual connections improve:
- forward information flow
- backward gradient flow
- early training stability
- optimization of deep networks
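The effect of the identity term can be sketched with a scalar toy model. It assumes each block is `F(x) = w * x` with a small weight `w`, so a plain layer has derivative `w` and a residual layer has derivative `1 + w`. The function names and the values of `w` and `depth` are illustrative assumptions.

```python
def plain_gradient(w: float, depth: int) -> float:
    """d(output)/d(input) for a stack of plain layers y = w * x."""
    return w ** depth

def residual_gradient(w: float, depth: int) -> float:
    """d(output)/d(input) for a stack of residual layers y = x + w * x."""
    return (1.0 + w) ** depth

w, depth = 0.01, 50  # a weak transformed path, repeated 50 times

print("plain:   ", plain_gradient(w, depth))
print("residual:", residual_gradient(w, depth))
```

Under these assumptions the plain gradient is on the order of 1e-100, while the residual gradient stays near 1.6. The identity term does not bound the gradient from above, so a large `w` could still make it grow.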
6. Learning Residual Information
A residual block does not need to learn:
"Create the entire next representation."
It can instead learn:
"Keep the input, and add the useful difference."
This reduces the burden on each block.
If the best transformation is close to doing nothing, the block can learn a small residual.
If the model needs richer behavior, the block can add more through $F(x)$.
This is why residual networks can be much deeper than plain feed-forward stacks.
7. Why Shapes Must Match
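Residual addition is element-wise:
$$y_{ij} = x_{ij} + F(x)_{ij}$$
So the two tensors must have the same shape.
If:
$$x \in \mathbb{R}^{n \times d}$$
then:
$$F(x) \in \mathbb{R}^{n \times d}$$
is needed for direct addition.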
This shape requirement influences architecture design.
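Transformers are built so the major sublayers preserve the model dimension:
$$x \in \mathbb{R}^{n \times d_{\text{model}}} \;\Rightarrow\; \text{Sublayer}(x) \in \mathbb{R}^{n \times d_{\text{model}}}$$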
That makes residual addition simple.
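The shape requirement can be enforced explicitly. The sketch below represents a tensor as a list of rows; `shape` and `add_residual` are hypothetical helper names used only for this illustration.

```python
def shape(t):
    """Return (rows, cols) for a list-of-rows tensor, rejecting ragged rows."""
    cols = len(t[0])
    if any(len(row) != cols for row in t):
        raise ValueError("ragged tensor")
    return (len(t), cols)

def add_residual(x, fx):
    """Element-wise x + fx, allowed only when both shapes match."""
    if shape(x) != shape(fx):
        raise ValueError(f"shape mismatch: {shape(x)} vs {shape(fx)}")
    return [[a + b for a, b in zip(rx, rf)] for rx, rf in zip(x, fx)]

x = [[1.0, 2.0], [3.0, 4.0]]                # shape (2, 2)
fx_ok = [[0.5, 0.5], [0.5, 0.5]]            # shape (2, 2)
fx_bad = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]  # shape (2, 3)

print(add_residual(x, fx_ok))
try:
    add_residual(x, fx_bad)
except ValueError as err:
    print("rejected:", err)
```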
8. Addition vs Concatenation
Instead of adding, one could concatenate:
$$y = \text{concat}(x, F(x))$$
Concatenation does not require matching feature dimensions, but it increases the representation size.
| Method | Shape requirement | Cost |
|---|---|---|
| addition | same shape | keeps dimension fixed |
| concatenation | dimensions can differ | increases dimension and parameters |
Transformers usually use addition because it keeps the model width stable.
If every residual path doubled the dimension, later layers would become much more expensive.
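A quick calculation shows how fast the width would grow. The sketch assumes each block's output has the same width as its input, so concatenation doubles the width at every block. The function name `width_after` and the example sizes are illustrative.

```python
def width_after(d_model: int, num_blocks: int, combine: str) -> int:
    """Track the feature width after stacking blocks with the given combine rule."""
    width = d_model
    for _ in range(num_blocks):
        if combine == "add":
            pass  # x + F(x) has the same width as x
        elif combine == "concat":
            width = width + width  # concat(x, F(x)), assuming F(x) matches x's width
        else:
            raise ValueError(f"unknown combine rule: {combine}")
    return width

print(width_after(512, 6, "add"))
print(width_after(512, 6, "concat"))
```

With a starting width of 512 and six blocks, addition stays at 512 while concatenation reaches 512 × 2^6 = 32768 under this assumption.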
9. Residual Connections in Transformers
A Transformer block has two major sublayers:
- multi-head attention
- feed-forward network
Each sublayer uses a residual connection.
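Conceptually:
$$h = x + \text{MultiHeadAttention}(x)$$
then:
$$y = h + \text{FFN}(h)$$
The original Transformer combines residual connections with layer normalization:
$$h = \text{LayerNorm}(x + \text{MultiHeadAttention}(x))$$
$$y = \text{LayerNorm}(h + \text{FFN}(h))$$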
This pattern is often called:
Add and Norm
because it adds the residual connection and then normalizes the result.
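The Add and Norm pattern can be sketched without any framework. In the code below, `toy_attention` and `toy_ffn` are placeholders invented for this example, not real attention or feed-forward layers, and `layer_norm` omits the learned gain and bias that real implementations usually include.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one token vector to zero mean and roughly unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """x is a list of token vectors. Returns LayerNorm(x + sublayer(x)) per token."""
    out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(tx, to)]) for tx, to in zip(x, out)]

def toy_attention(x):
    """Placeholder for attention: every token receives the mean of all tokens."""
    n = len(x)
    avg = [sum(col) / n for col in zip(*x)]
    return [avg[:] for _ in x]

def toy_ffn(x):
    """Placeholder for the feed-forward network: a per-feature ReLU."""
    return [[max(0.0, a) for a in tok] for tok in x]

def transformer_block(x):
    h = add_and_norm(x, toy_attention)  # first sublayer: Add and Norm
    return add_and_norm(h, toy_ffn)     # second sublayer: Add and Norm

x = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 2.0, 5.0]]
y = transformer_block(x)
print(len(y), len(y[0]))
```

The output keeps the input's shape (2 tokens, 4 features), which is what lets the same block be stacked repeatedly.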
10. Dimension Walkthrough
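Suppose the input to a Transformer block is:
$$x \in \mathbb{R}^{n \times d_{\text{model}}}$$
where $n$ is the sequence length and $d_{\text{model}}$ is the model dimension.
The multi-head attention block returns:
$$\text{MultiHeadAttention}(x) \in \mathbb{R}^{n \times d_{\text{model}}}$$
So residual addition is valid:
$$h = x + \text{MultiHeadAttention}(x) \in \mathbb{R}^{n \times d_{\text{model}}}$$
Then the feed-forward network also returns:
$$\text{FFN}(h) \in \mathbb{R}^{n \times d_{\text{model}}}$$
So another residual addition is valid:
$$y = h + \text{FFN}(h) \in \mathbb{R}^{n \times d_{\text{model}}}$$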
The sequence length stays the same.
The model dimension stays the same.
But the representation becomes more contextual and refined after each block.
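The walkthrough can be traced with shapes alone. This sketch tracks only `(sequence_length, width)` tuples. The helper names are hypothetical, and the sizes (10 tokens, model width 512, feed-forward hidden width 2048) are example values chosen for illustration.

```python
def attention_shape(in_shape, d_model):
    """Attention maps (n, d_model) to (n, d_model)."""
    n, d = in_shape
    if d != d_model:
        raise ValueError("input width must equal d_model")
    return (n, d_model)

def ffn_shape(in_shape, d_model, d_ff):
    """The feed-forward network expands to d_ff internally, then projects back to d_model."""
    n, d = in_shape
    if d != d_model:
        raise ValueError("input width must equal d_model")
    hidden = (n, d_ff)
    return (hidden[0], d_model)

def residual_shape(a, b):
    """Addition is valid only for identical shapes and preserves that shape."""
    if a != b:
        raise ValueError(f"cannot add shapes {a} and {b}")
    return a

def block_shapes(n, d_model, d_ff):
    x = (n, d_model)
    h = residual_shape(x, attention_shape(x, d_model))
    y = residual_shape(h, ffn_shape(h, d_model, d_ff))
    return x, h, y

print(block_shapes(10, 512, 2048))
```

All three shapes come out as `(10, 512)`: the hidden width of the feed-forward network never reaches the residual path.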
11. Why Transformers Need Residual Connections
Transformers stack many blocks.
In large language models, stacks of dozens of blocks are common.
Without residual connections:
- early input information could be degraded
- gradients could struggle to reach early layers
- very deep stacks would be harder to optimize
- training would be slower and less stable
Residual connections let the model go deep while preserving trainability.
They are one of the small architectural details that make large Transformers practical.
12. Common Confusions
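Does a residual connection skip the layer's computation?
No. The layer still computes $F(x)$. The shortcut adds the original input back to that output.
Do residual connections only help the forward pass?
No. They also create a shorter gradient path during backpropagation.
Does a residual block just copy its input?
No. Copying is available as a stable starting path, but the learned block can add useful transformations through $F(x)$.
Can tensors of any shape be added in a residual connection?
No. Direct addition requires matching shapes.
Are residual connections and layer normalization the same thing?
No. Residual connections help information and gradients flow. Layer normalization stabilizes activation scale.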
13. One-Line Interview Answer
Residual connections add a sublayer's input to its output, usually written as $y = x + F(x)$. This gives information and gradients a shortcut path through deep networks, reduces the burden on each layer, and makes very deep Transformer stacks easier to train.
14. Final Intuition
A plain deep network says:
Every layer must completely transform the signal for the next layer.
A residual network says:
Keep the original signal available, and let each block add what is missing.
That small change makes depth much easier to train.
In Transformers, residual connections preserve information across attention and feed-forward blocks, while layer normalization keeps the resulting activations stable.