Transformer Decoder Architecture

Transformers in Deep Learning - Part 11
Sibling notes: Transformer Encoder Architecture - The Deep Dive · Multi-Head Attention - The Deep Dive · Residual Connections in Transformers - The Deep Dive

The Transformer decoder generates the output sequence.

In a translation model, the encoder reads the source sentence:

I like pizza.

and the decoder generates the target sentence:

me gusta la pizza.

The decoder has a harder job than the encoder:

It must generate one token at a time, while using both the tokens generated so far and the encoder's representation of the input sentence.


1. Encoder-Decoder Strategy

In an encoder-decoder model, the encoder reads the input sequence and turns it into useful internal representations.

The decoder then uses those representations to produce the output sequence.

For translation:

flowchart LR
    EN["English input: I like pizza"] --> ENC["encoder"]
    ENC --> C["encoder output"]
    C --> DEC["decoder"]
    DEC --> ES["Spanish output: me gusta la pizza"]

Older RNN-based encoder-decoder models processed tokens sequentially.

The encoder compressed the whole input into a hidden state, and the decoder generated tokens one by one from that summary.

Transformers keep the encoder-decoder idea, but change the mechanics:

  • the encoder processes the input sequence in parallel
  • the decoder uses attention instead of only one fixed summary vector
  • training can process target positions in parallel with a causal mask
  • inference still generates tokens sequentially

2. Training vs Inference

The decoder behaves differently during training and inference.

During Inference

At inference time, the model does not know the future output tokens.

It starts with a start token:

$$[\text{<start>}]$$

Then it predicts the first token:

$$\text{me}$$

Then the decoder receives:

$$[\text{<start>}, \text{me}]$$

and predicts:

$$\text{gusta}$$

This continues until the model predicts an end token.

flowchart TD
    S["<start>"] --> P1["predict me"]
    P1 --> I2["<start>, me"]
    I2 --> P2["predict gusta"]
    P2 --> I3["<start>, me, gusta"]
    I3 --> P3["predict la"]

Inference is sequential because each next token depends on the tokens generated so far.
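The loop above can be sketched in a few lines. This is a minimal illustration, not a real decoder: `toy_predict_next` and its lookup table `TOY_NEXT_TOKEN` are hypothetical stand-ins for a trained model, and the sketch uses greedy selection (always take the single predicted token).

```python
# Hypothetical stand-in for a trained decoder: maps the current prefix to the
# next token. A real model would return a distribution over the vocabulary.
TOY_NEXT_TOKEN = {
    ("<start>",): "me",
    ("<start>", "me"): "gusta",
    ("<start>", "me", "gusta"): "la",
    ("<start>", "me", "gusta", "la"): "pizza",
    ("<start>", "me", "gusta", "la", "pizza"): "<end>",
}


def toy_predict_next(prefix):
    """Return the next token for a prefix (toy lookup, not a neural network)."""
    return TOY_NEXT_TOKEN.get(tuple(prefix), "<end>")


def greedy_decode(predict_next, max_steps=10):
    """Generate one token at a time, feeding each prediction back as input."""
    prefix = ["<start>"]
    for _ in range(max_steps):
        token = predict_next(prefix)
        if token == "<end>":
            break
        prefix.append(token)  # the next step sees everything generated so far
    return prefix[1:]  # drop the start token


generated = greedy_decode(toy_predict_next)
print(generated)
```

The `max_steps` limit is a common safeguard so that generation stops even if the model never predicts the end token.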

During Training

During training, the full target sentence is already known from the dataset.

So instead of feeding the model's own, possibly wrong, predictions back into itself, the decoder receives the correct previous target tokens.

This is called teacher forcing.

For the target:

me gusta la pizza

the decoder input is shifted right:

$$[\text{<start>}, \text{me}, \text{gusta}, \text{la}]$$

and the expected output is:

$$[\text{me}, \text{gusta}, \text{la}, \text{pizza}]$$

This lets training compute losses for all positions in one forward pass.
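As a small sketch of the shift, assuming the target is already tokenized into a list of strings (the helper name `make_teacher_forcing_pair` is made up for this example):

```python
def make_teacher_forcing_pair(target_tokens, start_token="<start>"):
    """Build the shifted-right decoder input and the labels it should predict."""
    decoder_input = [start_token] + target_tokens[:-1]
    labels = list(target_tokens)
    return decoder_input, labels


target = ["me", "gusta", "la", "pizza"]
decoder_input, labels = make_teacher_forcing_pair(target)

# Position i of the input is paired with the token that should come next.
for position, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(position, seen, "->", expected)
```

Both lists have the same length, so every input position has exactly one label.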


3. The Data Leakage Problem

If the decoder receives the whole target sequence at once, self-attention creates a problem.

Plain self-attention allows every token to look at every other token.

That means the token at position 1 could look at positions 2, 3, and 4.

During training, this would let the model see future answers.

For example, when predicting:

$$\text{me}$$

the model should not already see:

$$\text{gusta la pizza}$$

That would be data leakage.

The model would learn under conditions that do not exist during inference.

The solution is masked self-attention, also called causal self-attention.


4. Masked Self-Attention

Masked self-attention is self-attention with one extra rule:

A token may attend only to itself and earlier tokens, not future tokens.

For decoder input:

$$[\text{<start>}, \text{me}, \text{gusta}, \text{la}, \text{pizza}]$$

the allowed attention pattern is:

| Token | Can attend to |
| --- | --- |
| `<start>` | `<start>` |
| me | `<start>`, me |
| gusta | `<start>`, me, gusta |
| la | `<start>`, me, gusta, la |
| pizza | `<start>`, me, gusta, la, pizza |

Future positions are blocked.

flowchart TD
    A["position t"] --> B["can see positions <= t"]
    A --> C["cannot see positions > t"]

This keeps training parallel while preserving the same causal constraint used during inference.
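The rule "position t can see positions <= t" is just a lower-triangular boolean pattern. A minimal sketch that rebuilds the attention pattern for the example tokens (the function names are illustrative only):

```python
def causal_allowed(length):
    """allowed[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(length)] for t in range(length)]


def visible_tokens(tokens):
    """Map each token to the list of tokens it is allowed to attend to."""
    allowed = causal_allowed(len(tokens))
    return {
        tokens[t]: [tokens[s] for s in range(len(tokens)) if allowed[t][s]]
        for t in range(len(tokens))
    }


tokens = ["<start>", "me", "gusta", "la", "pizza"]
visible = visible_tokens(tokens)
for token, can_see in visible.items():
    print(token, "->", can_see)
```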


5. How the Causal Mask Works

Self-attention first computes score values:

$$\frac{QK^T}{\sqrt{d_k}}$$

For 5 decoder tokens, the score matrix has shape:

$$5 \times 5$$

Before softmax, the decoder adds a mask.

Allowed positions receive:

$$0$$

Future positions receive:

$$-\infty$$

So the mask looks conceptually like:

$$\begin{bmatrix} 0 & -\infty & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty & -\infty \\ 0 & 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

Then:

$$\text{softmax}(-\infty) = 0$$

So future tokens contribute no attention weight.

The masked attention formula is:

$$\text{MaskedAttention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} + M \right) V$$

where $M$ is the causal mask.

The mask is not learned. It is a fixed rule.
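To see the additive mask at work, here is a minimal pure-Python sketch. The score values are arbitrary numbers chosen for illustration and are assumed to be already scaled by $\sqrt{d_k}$.

```python
import math

NEG_INF = float("-inf")


def causal_mask(length):
    """Additive mask M: 0 where s <= t, -inf where s > t."""
    return [[0.0 if s <= t else NEG_INF for s in range(length)] for t in range(length)]


def softmax(row):
    peak = max(row)  # always finite, because the diagonal is never masked
    exps = [math.exp(value - peak) for value in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [value / total for value in exps]


def masked_attention_weights(scores):
    """Row-wise softmax(scores + M)."""
    mask = causal_mask(len(scores))
    return [
        softmax([score + m for score, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


# Arbitrary illustrative scores for a 3-token sequence.
scores = [[0.5, 1.0, -0.3], [0.2, 0.1, 0.9], [1.5, -0.4, 0.0]]
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but all of the weight is spread over the current and earlier positions only.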


6. Decoder Input: Embeddings Plus Position

Just like the encoder, the decoder needs token meaning and token order.

The decoder input begins as target-token embeddings:

$$E_{\text{target}}$$

Then positional encodings are added:

$$X = E_{\text{target}} + PE_{\text{target}}$$

If the target input length is 5 and the model dimension is 512:

$$X \in \mathbb{R}^{5 \times 512}$$

This is the input to masked multi-head self-attention.
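A sketch of this addition under two assumptions that the text above does not fix: the positional encoding is the sinusoidal one from the original Transformer, and the embedding table is filled with random numbers in place of learned values.

```python
import math
import random


def sinusoidal_positions(length, d_model):
    """Sinusoidal positional encoding (assumed variant): sin on even, cos on odd dims."""
    table = []
    for pos in range(length):
        row = []
        for j in range(d_model):
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def embed_with_positions(tokens, embedding_table, d_model):
    """X = E_target + PE_target, computed row by row."""
    E = [embedding_table[token] for token in tokens]
    PE = sinusoidal_positions(len(tokens), d_model)
    return [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, PE)]


rng = random.Random(0)
d_model = 512
tokens = ["<start>", "me", "gusta", "la", "pizza"]
# Hypothetical embedding table: random vectors stand in for learned embeddings.
embedding_table = {t: [rng.gauss(0.0, 1.0) for _ in range(d_model)] for t in tokens}

X = embed_with_positions(tokens, embedding_table, d_model)
print(len(X), len(X[0]))
```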


7. First Decoder Sublayer: Masked Multi-Head Self-Attention

The first sublayer of the decoder is:

masked multi-head self-attention

It is like encoder self-attention, but with a causal mask.

Its job is to build contextual representations from the target tokens generated so far.

For example, when processing:

$$[\text{<start>}, \text{me}, \text{gusta}]$$

the representation of gusta can use <start> and me, but not future tokens like la or pizza.

The output shape remains:

$$5 \times 512$$

Then the decoder applies Add and Norm:

$$Y = \text{LayerNorm}(X + \text{MaskedMHA}(X))$$
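A minimal sketch of the Add and Norm step on its own. It assumes a simplified layer normalization without the learned scale and shift that real implementations usually include, and the sublayer output is a hand-written placeholder rather than real attention output.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one position's features to zero mean and (nearly) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def add_and_norm(x, sublayer_output):
    """LayerNorm(x + sublayer(x)), applied independently at every position."""
    return [
        layer_norm([a + b for a, b in zip(x_row, s_row)])
        for x_row, s_row in zip(x, sublayer_output)
    ]


# Two positions with four features each; the values are arbitrary.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5]]
sublayer_out = [[0.1, -0.2, 0.3, 0.0], [1.0, 0.0, -1.0, 0.0]]  # placeholder output
y = add_and_norm(x, sublayer_out)
print(len(y), len(y[0]))
```

The shape is unchanged, which is what allows the same pattern to wrap every sublayer in the block.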

8. Second Decoder Sublayer: Cross-Attention

The second major decoder sublayer is cross-attention.

Masked self-attention answers:

How do the target tokens relate to previous target tokens?

Cross-attention answers:

How do the target tokens relate to the source sentence encoded by the encoder?

This is where the decoder uses the encoder output.

If:

$$H = \text{EncoderOutput}$$

then cross-attention uses:

  • queries from the decoder side
  • keys from the encoder side
  • values from the encoder side

$$Q = YW_Q, \quad K = HW_K, \quad V = HW_V$$

Then:

$$\text{CrossAttention}(Y,H) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

flowchart LR
    Y["decoder hidden states"] --> Q["queries"]
    H["encoder output"] --> K["keys"]
    H --> V["values"]
    Q --> CA["cross-attention"]
    K --> CA
    V --> CA
    CA --> OUT["source-aware decoder states"]

This lets each target token attend to the most relevant source tokens.
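A single-head sketch of this computation in pure Python. The dimensions are deliberately tiny, and the projection matrices are random placeholders for learned weights, so the numbers are meaningless; the point is where each input comes from and what shapes result.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def cross_attention(Y, H, W_Q, W_K, W_V):
    """Queries come from decoder states Y; keys and values from encoder output H."""
    Q = matmul(Y, W_Q)
    K = matmul(H, W_K)
    V = matmul(H, W_V)
    d_k = len(Q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    weights = [softmax(row) for row in scores]  # no causal mask: the source is fully visible
    return matmul(weights, V), weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_k = 8, 4                      # toy sizes, not the 512 and 64 of the notes
Y = random_matrix(5, d_model, rng)       # 5 target positions
H = random_matrix(3, d_model, rng)       # 3 source positions
W_Q = random_matrix(d_model, d_k, rng)   # placeholders for learned projections
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

output, weights = cross_attention(Y, H, W_Q, W_K, W_V)
print(len(weights), len(weights[0]))  # target length x source length
print(len(output), len(output[0]))    # target length x d_k
```

The weight matrix has one row per target position and one column per source position, which is why the output keeps the target length even though the values come from the source.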


9. Why Cross-Attention Is Powerful

In older encoder-decoder models, the encoder often compressed the input sentence into one summary vector.

That creates a bottleneck.

For long sentences, early information can be forgotten.

Transformers avoid that bottleneck.

The encoder produces one contextual vector for every source token:

$$H \in \mathbb{R}^{n_{\text{source}} \times 512}$$

The decoder can attend to all of them dynamically.

For translation:

  • gusta may attend strongly to like
  • pizza may attend strongly to pizza
  • function words may attend to syntax or phrase structure

Each target position gets its own source-aware context.


10. Add and Norm After Cross-Attention

Cross-attention is followed by another Add and Norm step.

If $Y$ is the input to cross-attention:

$$Z = \text{LayerNorm}(Y + \text{CrossAttention}(Y,H))$$

This preserves the residual path and stabilizes activations.

The shape remains:

$$5 \times 512$$

11. Third Decoder Sublayer: Feed-Forward Network

After cross-attention, the decoder uses a position-wise feed-forward network.

It is the same kind of feed-forward network used in the encoder:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

with dimensions:

$$512 \rightarrow 2048 \rightarrow 512$$

The feed-forward network:

  • adds nonlinearity
  • increases feature capacity
  • transforms each token position independently
  • returns to 512 dimensions for residual compatibility

Then the decoder applies another Add and Norm:

$$O = \text{LayerNorm}(Z + \text{FFN}(Z))$$

This $O$ is the output of one complete decoder block.
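The "position-wise" property of the feed-forward network can be checked directly. The sketch below uses toy sizes (8 and 32 instead of 512 and 2048) and random placeholder weights; it only covers the FFN itself, not the surrounding Add and Norm.

```python
import random


def linear(x, W, b):
    """x @ W + b for one vector x, with W stored as rows of shape (in, out)."""
    return [sum(xi * w for xi, w in zip(x, column)) + bias
            for column, bias in zip(zip(*W), b)]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)


def position_wise_ffn(Z, W1, b1, W2, b2):
    """Apply the same FFN, with the same weights, to every position separately."""
    return [ffn(row, W1, b1, W2, b2) for row in Z]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff = 8, 32  # toy stand-ins for 512 and 2048
W1, b1 = random_matrix(d_model, d_ff, rng), [0.0] * d_ff
W2, b2 = random_matrix(d_ff, d_model, rng), [0.0] * d_model
Z = random_matrix(5, d_model, rng)  # 5 target positions

out = position_wise_ffn(Z, W1, b1, W2, b2)
print(len(out), len(out[0]))
```

Because each row is processed on its own, running the FFN on a single position gives the same result as running it on the whole sequence and picking that row.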


12. One Complete Decoder Block

One decoder block contains:

  1. masked multi-head self-attention
  2. Add and Norm
  3. encoder-decoder cross-attention
  4. Add and Norm
  5. feed-forward network
  6. Add and Norm

flowchart TD
    X["target embeddings + position"] --> MMHA["masked multi-head self-attention"]
    MMHA --> ADD1["add residual"]
    X --> ADD1
    ADD1 --> LN1["layer norm"]

    LN1 --> CA["cross-attention"]
    ENC["encoder output"] --> CA
    CA --> ADD2["add residual"]
    LN1 --> ADD2
    ADD2 --> LN2["layer norm"]

    LN2 --> FFN["feed-forward network"]
    FFN --> ADD3["add residual"]
    LN2 --> ADD3
    ADD3 --> LN3["layer norm"]
    LN3 --> OUT["decoder block output"]

In equations:

$$Y = \text{LayerNorm}(X + \text{MaskedMHA}(X))$$

$$Z = \text{LayerNorm}(Y + \text{CrossAttention}(Y,H))$$

$$O = \text{LayerNorm}(Z + \text{FFN}(Z))$$

13. Stacking Decoder Blocks

The original Transformer uses 6 decoder blocks.

Each decoder block has the same structure but its own learned parameters.

The output of one decoder block becomes the input to the next:

$$X_1 \rightarrow \text{Decoder}_1 \rightarrow X_2 \rightarrow \text{Decoder}_2 \rightarrow \dots \rightarrow \text{Decoder}_6$$

Positional encoding is added before the first decoder block, not between every block.

The encoder output is available to the cross-attention sublayer in every decoder block.
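The control flow of the stack is a plain loop. In the sketch below, `make_toy_block` is a hypothetical stand-in for a real decoder block: it works on one number per position and only exists to show that each block has its own parameter, receives the previous block's output, and is handed the same encoder output.

```python
def run_decoder_stack(X, H, blocks):
    """Feed X through each block in order; every block also receives H."""
    for block in blocks:
        X = block(X, H)
    return X


def make_toy_block(scale):
    """Hypothetical block with its own parameter (scale); not a real decoder block."""
    def block(X, H):
        source_mean = sum(H) / len(H)  # stands in for cross-attention over H
        return [scale * x + source_mean for x in X]
    return block


# Six blocks with the same structure but separate parameters.
blocks = [make_toy_block(scale) for scale in (0.5, 1.0, 2.0, 0.5, 1.0, 2.0)]
X1 = [1.0, 2.0, 3.0]  # stands in for embeddings plus positional encoding
H = [0.0, 3.0]        # stands in for the encoder output

output = run_decoder_stack(X1, H, blocks)
print(output)
```

Position information enters once, through `X1`, and the same `H` is passed unchanged to every block.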


14. Final Prediction Layer

After the final decoder block, the model has a contextual vector for each target position.

If the decoder input length is 5:

$$O \in \mathbb{R}^{5 \times 512}$$

These 512-dimensional vectors must be converted into vocabulary predictions.

Suppose the target vocabulary has 100,000 tokens.

The model applies a linear projection:

$$\text{logits} = OW_{\text{vocab}} + b$$

where:

$$W_{\text{vocab}} \in \mathbb{R}^{512 \times 100000}$$

So:

$$\text{logits} \in \mathbb{R}^{5 \times 100000}$$

Then softmax converts logits into probabilities:

$$P(\text{next token}) = \text{softmax}(\text{logits})$$

The highest-probability token can be selected as the prediction, or a decoding strategy such as sampling or beam search can be used.
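A toy version of the projection and greedy selection, with a 6-token vocabulary and 3-dimensional states instead of 100,000 and 512. The weights in `W_vocab` are hand-picked for the example so that the outcome is easy to follow; they are not learned values.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def vocab_probabilities(O, W_vocab, b):
    """softmax(O W_vocab + b), one distribution per target position."""
    probabilities = []
    for state in O:
        logits = [sum(s * w for s, w in zip(state, column)) + bias
                  for column, bias in zip(zip(*W_vocab), b)]
        probabilities.append(softmax(logits))
    return probabilities


vocab = ["<start>", "<end>", "me", "gusta", "la", "pizza"]
O = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two positions, toy hidden states
W_vocab = [                              # shape: 3 x len(vocab), hand-picked
    [0.0, 0.0, 2.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
b = [0.0] * len(vocab)

probabilities = vocab_probabilities(O, W_vocab, b)
# Greedy choice: index of the largest probability at each position.
greedy = [vocab[max(range(len(p)), key=p.__getitem__)] for p in probabilities]
print(greedy)
```

Sampling or beam search would replace only the last step; the projection and softmax stay the same.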


15. Why Decoder Output Count Matches Input Count

The decoder produces one output distribution for each input position.

During training:

Decoder input:

$$[\text{<start>}, \text{me}, \text{gusta}, \text{la}]$$

Expected output:

$$[\text{me}, \text{gusta}, \text{la}, \text{pizza}]$$

So a length-4 decoder input produces 4 prediction distributions.

Each position predicts the next token.
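Because every position has its own distribution and its own label, a training loss can be computed per position and averaged. The sketch below assumes cross-entropy (negative log-likelihood of the correct token), which is a common choice but is not specified in these notes; the uniform distributions are placeholders for real model output.

```python
import math


def next_token_loss(distributions, labels, vocab):
    """Average negative log-probability assigned to the correct next token."""
    losses = [
        -math.log(dist[vocab.index(label)])
        for dist, label in zip(distributions, labels)
    ]
    return sum(losses) / len(losses)


vocab = ["<end>", "me", "gusta", "la", "pizza"]
labels = ["me", "gusta", "la", "pizza"]  # one label per decoder input position

# Placeholder model output: an untrained model that spreads probability evenly.
uniform = [1.0 / len(vocab)] * len(vocab)
distributions = [uniform for _ in labels]

loss = next_token_loss(distributions, labels, vocab)
print(round(loss, 4))
```

For a uniform guess over 5 tokens the loss equals $\ln 5$; a model that puts more probability on the correct tokens scores lower.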


16. What Happens During Inference

During inference, generation is still sequential.

Step 1:

$$[\text{<start>}] \rightarrow \text{predict me}$$

Step 2:

$$[\text{<start>}, \text{me}] \rightarrow \text{predict gusta}$$

Step 3:

$$[\text{<start>}, \text{me}, \text{gusta}] \rightarrow \text{predict la}$$

At step 3, the decoder technically produces output distributions for every position in the prefix.

But only the prediction at the final position is used.

Earlier position predictions are ignored because they were already generated in previous steps.

In practical implementations, caching is often used so the model does not recompute all previous key/value states from scratch at every step.
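A minimal sketch of the caching idea for one attention head. It assumes the per-token query, key, and value vectors are already given (arbitrary numbers here), and the class name `KVCache` is made up for this example. Caching is valid because, under causal masking, the keys and values of earlier positions do not change when new tokens are appended.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    return [sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))]


class KVCache:
    """Stores the keys and values of all tokens processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_causal_attention(queries, keys, values):
    """Recompute every position from scratch; position t sees positions <= t."""
    return [attend(queries[t], keys[: t + 1], values[: t + 1])
            for t in range(len(queries))]


# Arbitrary per-token vectors for a 3-token prefix.
queries = [[0.1, 0.3], [0.7, -0.2], [0.4, 0.9]]
keys = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in zip(queries, keys, values)]
recomputed = full_causal_attention(queries, keys, values)
print(cached_outputs == recomputed)
```

With the cache, each generation step computes attention for one new query instead of for the whole prefix.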


17. Is the Mask Used During Inference?

The decoder architecture remains causal during inference.

If the full prefix is processed at each step, the causal mask can still be applied.

However, because the input prefix contains no future tokens, the mask has no effect at the final position, which may attend to the entire prefix. It still restricts the earlier positions in the prefix.

The key point is:

The decoder is trained and used as a causal model. A position is never allowed to depend on future target tokens.

The mask is a fixed operation, not a trainable parameter.


18. Decoder Shape Summary

Assume:

  • target length = 5
  • source length = 3
  • $d_{\text{model}} = 512$
  • $h = 8$
  • $d_k = d_v = 64$

| Stage | Shape |
| --- | --- |
| target embeddings + positional encoding | $5 \times 512$ |
| masked self-attention output | $5 \times 512$ |
| first Add and Norm | $5 \times 512$ |
| encoder output | $3 \times 512$ |
| cross-attention output | $5 \times 512$ |
| second Add and Norm | $5 \times 512$ |
| feed-forward output | $5 \times 512$ |
| third Add and Norm | $5 \times 512$ |
| vocabulary logits | $5 \times$ vocabulary size ($5 \times 100000$ for the 100,000-token vocabulary of section 14) |
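The per-head shapes behind the cross-attention row can be worked out from the assumptions above. This is a sketch using the standard multi-head layout; the per-head projection names $W_Q^{(i)}$, $W_K^{(i)}$, $W_V^{(i)}$ (each $512 \times 64$) are notation introduced here, not taken from the notes.

```latex
\begin{aligned}
Q_i &= Y W_Q^{(i)} \in \mathbb{R}^{5 \times 64}, \qquad
K_i = H W_K^{(i)} \in \mathbb{R}^{3 \times 64}, \qquad
V_i = H W_V^{(i)} \in \mathbb{R}^{3 \times 64} \\
Q_i K_i^T &\in \mathbb{R}^{5 \times 3} \\
\text{head}_i &= \text{softmax}\left( \frac{Q_i K_i^T}{\sqrt{64}} \right) V_i \in \mathbb{R}^{5 \times 64} \\
\text{Concat}(\text{head}_1, \dots, \text{head}_8) &\in \mathbb{R}^{5 \times (8 \cdot 64)} = \mathbb{R}^{5 \times 512}
\end{aligned}
```

The source length 3 appears only in the intermediate score matrix, so the output keeps the target length 5 and the model width 512.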

19. Decoder vs Encoder

| Component | Encoder | Decoder |
| --- | --- | --- |
| input | source tokens | target tokens generated so far |
| self-attention | unmasked | masked/causal |
| cross-attention | no | yes, attends to encoder output |
| feed-forward network | yes | yes |
| Add and Norm | after each major sublayer | after each major sublayer |
| output | contextual source vectors | next-token prediction states |

The decoder has everything the encoder has, plus:

  • causal masking
  • cross-attention to encoder outputs
  • final vocabulary projection

20. Cross-Attention Beyond Translation

Cross-attention is not limited to translation.

It is a general mechanism for connecting one representation stream to another.

For example:

| Query side | Key/value side | Meaning |
| --- | --- | --- |
| text tokens | image patches | text attends to image content |
| generated text | audio features | text generation uses audio context |
| decoder tokens | encoder tokens | translation or summarization |

This is why cross-attention is important in multimodal models.

It lets one kind of representation pull relevant information from another kind of representation.


21. Common Confusions

  • Training is not sequential. It can process all target positions in parallel using teacher forcing and a causal mask.
  • During training the decoder does receive the whole shifted target sequence, but the causal mask prevents each position from attending to future target tokens.
  • The mask does not block the diagonal. The diagonal is usually allowed because a token may attend to itself. Only future positions, above the diagonal, are masked.
  • The mask is not learned. It is fixed. The attention projection matrices are learned.
  • Cross-attention does not take its queries, keys, and values from the same sequence. Queries come from the decoder hidden states, while keys and values come from the encoder output.
  • Not every output position is used during inference. At each generation step, only the final position's prediction is used.


22. One-Line Interview Answer

The Transformer decoder uses masked self-attention to look only at previous target tokens, cross-attention to attend to encoder outputs, and a feed-forward network for nonlinear processing, with residual connections and layer normalization around each sublayer. During training, shifted targets and causal masking allow parallel training; during inference, tokens are generated sequentially.


23. Final Intuition

The encoder reads the source.

The decoder writes the target.

Masked self-attention tells the decoder:

Use only what has already been generated.

Cross-attention tells it:

Look back at the source sentence and pull the relevant information.

The feed-forward network refines each token representation.

The vocabulary projection turns the final hidden states into actual token predictions.

Together, these pieces let the Transformer generate sequences while staying faithful to the input and avoiding future-token leakage.