Epochs, Batches, and Normalization

Deep Learning Foundations - The Intuition
Sibling notes: Batch Normalization and Layer Normalization - The Deep Dive · Gradient Descent - The Deep Dive · Residual Connections in Transformers - The Deep Dive

These are the small "bits and bytes" of training that are easy to nod along to and then quietly lose over time:

  • Epoch - one full pass over the training data (the number of epochs is how many times the model sees it)
  • Batch - how much data it sees at once
  • Batch normalization - keep the numbers stable using a group of examples
  • Layer normalization - keep the numbers stable using one example

This is the intuition companion. For the full math (the scale and shift parameters, padding, pre-norm vs post-norm), see Batch Normalization and Layer Normalization - The Deep Dive.

flowchart LR
    D["full dataset"] --> B["split into batches"]
    B --> I["one batch = one update = one iteration"]
    I --> E["all batches seen once = one epoch"]
    E --> R["repeat for many epochs"]

1. Epoch: One Full Pass Through the Data

(small correction first: the word is epoch, not "e-pouch".)

An epoch is one complete pass through the entire training dataset.

Picture studying for an exam from a 1,000-page book:

  • read all 1,000 pages once -> 1 epoch
  • read the whole book again -> 2 epochs
  • read it ten times -> 10 epochs

In training, the "book" is your training data. If you have 10,000 images and the model has seen all 10,000 of them once, that is 1 epoch.

Why more than one epoch?

A model does not learn everything in a single pass, just like a student rarely masters a book on the first read.

  • first pass: "okay, I roughly get it"
  • second pass: "now I see the pattern"
  • third pass: "now it is sticking"

Each epoch gives the model another chance to adjust its weights and shrink its mistakes.

Note

Read the book too many times and you start memorizing exact pages instead of understanding. That is overfitting. In practice you watch the validation loss and stop when it stops improving (early stopping).
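
The early-stopping rule in the note above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `early_stopping_epoch`, the `patience` parameter, and the validation losses are all assumptions made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation loss has failed to improve
    for `patience` epochs in a row (hypothetical rule for illustration).
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: training ran for every epoch provided.
    return len(val_losses)


# Made-up validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.65]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)
```

With these numbers the loss bottoms out at epoch 4, so with a patience of 2 the loop stops at epoch 6. In practice you would also keep the weights from the best epoch rather than the last one.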


2. Batch: A Chunk Seen at Once

Showing the model all 10,000 examples in one shot is heavy on memory and gives only one weight update per pass. So we split the data into small groups called batches.

With 10,000 examples and a batch size of 100:

  • Batch 1: examples 1-100
  • Batch 2: examples 101-200
  • ...
  • Batch 100: examples 9901-10000

After all 100 batches, the model has seen the whole dataset once. That is 1 epoch.
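
The split above can be sketched with plain list slicing. The helper name `make_batches` is hypothetical, and the integers 1 to 10,000 simply stand in for real training examples.

```python
def make_batches(examples, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


examples = list(range(1, 10_001))          # stand-ins for 10,000 examples
batches = make_batches(examples, batch_size=100)

print(len(batches))                        # batches in one epoch
print(batches[0][0], batches[0][-1])       # first and last example of batch 1
print(batches[-1][0], batches[-1][-1])     # first and last example of batch 100
```

Real data loaders usually shuffle the examples before each epoch so the batches differ from pass to pass; that step is left out here to keep the numbering identical to the list above.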

Note

A batch is one plate of food. An epoch is the full meal. You finish the meal one plate at a time.


3. Epoch vs Batch vs Iteration (the part that gets confused)

There is a third word that completes the picture: iteration (also called a step).

  • one batch flows through the model
  • the model predicts, measures the loss, and updates its weights once
  • that single update is one iteration

So the three connect by simple arithmetic:

$$\text{iterations per epoch} = \frac{\text{dataset size}}{\text{batch size}}$$

Tiny example:

Quantity               Value
dataset size           10,000
batch size             100
iterations per epoch   100
epochs                 10
total weight updates   1,000

Term               Meaning
Batch              a small chunk of examples processed together
Iteration / step   one weight update (one batch)
Epoch              the whole dataset seen once (all batches)
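
The same arithmetic can be checked with a bare training-loop skeleton. This is only a sketch: the loop body is a counter standing in for the real forward pass, loss, and weight update. Rounding up with `math.ceil` is an assumption that a smaller final batch is kept when the dataset size is not a multiple of the batch size.

```python
import math

dataset_size = 10_000
batch_size = 100
epochs = 10

# Round up so a smaller final batch still counts as one iteration (assumption).
iterations_per_epoch = math.ceil(dataset_size / batch_size)

total_updates = 0
for epoch in range(epochs):
    for iteration in range(iterations_per_epoch):
        # A real loop would predict, measure the loss, and update the weights here.
        total_updates += 1

print(iterations_per_epoch, total_updates)
```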

4. Why Normalize Anything?

A neural network just passes numbers from layer to layer:

layer 1 -> layer 2 -> layer 3 -> ...

During training, every layer's weights keep changing, so the inputs each later layer receives keep shifting in scale and distribution underneath it.

Note

It is like learning to drive while, every few minutes, the steering wheel changes size and the pedals move. You can still learn, but it stays shaky.

Normalization says: before passing the numbers on, put them back into a comfortable, predictable range. That makes training steadier and usually faster.


5. Batch Normalization

Batch normalization cleans up the activations using a batch of examples.

For each feature, it looks across the examples in the batch and rescales the values so they have mean close to 0 and spread close to 1:

$$\hat{x} = \frac{x - \mu}{\sigma}$$

where μ is the mean and σ is the spread (standard deviation), both measured over the batch.

Note

A batch of marks like 5, 10, 90, 200, -30 is messy. Batch norm rescales them into a balanced range so the next layer receives something tidy to work with.

It is called batch normalization because the average and spread come from the current batch. If a batch holds 32 images, it asks "across these 32, what is the average activation, and how spread out is it?" and normalizes from that.

Why it helps: steadier training, often faster learning, and less chance of activations exploding or collapsing. It is like adding traffic rules so the numbers flow smoothly instead of chaotically.
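
As a rough sketch, here is the normalization step applied to the messy marks from the note above, treated as one feature seen across a batch of five examples. The function name `batch_norm_feature` is hypothetical, and the small `eps` term is an assumption added to avoid dividing by zero; it does not appear in the formula above.

```python
import math


def batch_norm_feature(values, eps=1e-5):
    """Normalize one feature across the batch: subtract the mean, divide by the spread."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


marks = [5, 10, 90, 200, -30]            # one feature, five examples in the batch
normalized = batch_norm_feature(marks)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print([round(v, 2) for v in normalized])
print(round(new_mean, 6), round(new_var, 6))
```

After the step the values have mean close to 0 and spread close to 1, while keeping their original ordering.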


6. Layer Normalization

Layer normalization has the same goal, stable numbers, but it measures along a different direction:

  • Batch norm normalizes across the batch, for each feature.
  • Layer norm normalizes across the features, inside one example.

Note

Batch norm asks: "for this feature, what is the average across many examples?" Layer norm asks: "for this one example, what is the average across its own features?"


7. The Clean Difference

Picture a classroom where each student has marks in a few subjects:

Student   Maths   Science   English
A         60      80        50
B         90      40        70
C         30      95        65

Batch norm runs down the columns: normalize Maths across all students, then Science across all students, and so on. It compares one feature across many examples.

Layer norm runs across the rows: normalize all of Student A's subjects together, then all of Student B's, and so on. It balances the features inside a single example.

              Maths  Science  English
 Student A     60      80       50     --> layer norm runs across a row
 Student B     90      40       70
 Student C     30      95       65
                 |
                 v
           batch norm runs down a column

A tiny number example makes it concrete:

Example 1: [2, 4, 6, 8]
Example 2: [1, 3, 5, 7]
Example 3: [10, 20, 30, 40]

Batch norm view (down each column):

  • feature 1 -> 2, 1, 10
  • feature 2 -> 4, 3, 20

Layer norm view (across each row):

  • Example 1 -> 2, 4, 6, 8
  • Example 3 -> 10, 20, 30, 40
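
Both views can be run on the three examples above. This is a minimal sketch without the learnable scale and shift; the helper `standardize` and the `eps` term are assumptions for illustration. The only difference between the two views is whether the helper is applied to columns or to rows.

```python
import math


def standardize(values, eps=1e-5):
    """Rescale a list of numbers to mean ~0 and spread ~1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


batch = [
    [2, 4, 6, 8],        # Example 1
    [1, 3, 5, 7],        # Example 2
    [10, 20, 30, 40],    # Example 3
]

# Batch norm view: normalize each feature (column) across the examples.
columns = [list(col) for col in zip(*batch)]
batch_normed_columns = [standardize(col) for col in columns]
batch_normed = [list(row) for row in zip(*batch_normed_columns)]

# Layer norm view: normalize each example (row) across its own features.
layer_normed = [standardize(row) for row in batch]

for row in batch_normed:
    print("batch norm:", [round(v, 2) for v in row])
for row in layer_normed:
    print("layer norm:", [round(v, 2) for v in row])
```

One thing to notice in the layer norm output: Example 1 and Example 3 come out almost identical, because Example 3 is just Example 1 multiplied by 5 and each row is judged only against itself. Batch norm, comparing down the columns, keeps Example 3 clearly apart from the other two.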

8. Why Transformers Use Layer Normalization

Transformers process sequences of tokens, like:

I love machine learning

Each token is a vector, and the model wants to stabilize each token's own representation. Layer norm fits this naturally because it works on one example's features and does not lean on the batch.

Batch norm leans on batch statistics, which gets awkward when:

  • the batch is small
  • sequences have different lengths, so padding distorts the average
  • you generate one token at a time at inference

So in Transformers you almost always see LayerNorm: each token is cleaned up based on its own feature vector. (More detail in Batch Normalization and Layer Normalization - The Deep Dive.)
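
The padding problem can be seen with a toy batch. This is only an illustration of which statistics get pulled around, not how any particular library handles sequences: the token vectors are made up, and `feature_means` and `layer_norm` are hypothetical helpers.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one token's feature vector using only its own values."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


def feature_means(token_vectors):
    """Per-feature mean over a list of token vectors (a batch-style statistic)."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


# Made-up 3-dimensional token vectors for two sequences of different length.
seq_a = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]   # 3 tokens
seq_b = [[4.0, 8.0, 12.0]]                                     # 1 token
pad = [0.0, 0.0, 0.0]
seq_b_padded = seq_b + [pad, pad]                              # padded to length 3

real_tokens = seq_a + seq_b
padded_tokens = seq_a + seq_b_padded

means_real = feature_means(real_tokens)       # statistics over real tokens only
means_padded = feature_means(padded_tokens)   # padding drags the means toward 0

print(means_real)
print([round(m, 3) for m in means_padded])

# Layer norm on a token depends only on that token, so padding elsewhere
# in the batch cannot change the result.
print([round(v, 3) for v in layer_norm(seq_a[0])])
```

The batch-style means shift as soon as padding rows are counted, while the layer norm of any real token stays exactly the same no matter what else sits in the batch.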


9. One Important Detail: Scale and Shift

Normalizing to mean 0 and spread 1 is only half the move. Both batch norm and layer norm then apply two learnable numbers per feature:

$$y = \gamma \hat{x} + \beta$$

  • gamma (γ) rescales
  • beta (β) shifts

This lets the network keep the stability when it helps, or partly undo the normalization when a layer genuinely prefers a different range. The model decides this during training, the same way it learns its weights.
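
A small sketch shows the two extremes of scale and shift on a single feature. In a real layer gamma and beta are learned during training; here they are set by hand purely for illustration. The helper names and the `eps` term are assumptions.

```python
import math


def normalize(values, eps=1e-5):
    """Return the normalized values together with the mean and spread used."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in values], mean, std


def scale_and_shift(x_hat, gamma, beta):
    """Apply y = gamma * x_hat + beta to every value of one feature."""
    return [gamma * v + beta for v in x_hat]


values = [5.0, 10.0, 90.0, 200.0, -30.0]
x_hat, mean, std = normalize(values)

# gamma = 1, beta = 0 keeps the normalized values as they are.
kept = scale_and_shift(x_hat, gamma=1.0, beta=0.0)

# gamma = spread, beta = mean undoes the normalization entirely.
undone = scale_and_shift(x_hat, gamma=std, beta=mean)

print([round(v, 2) for v in kept])
print([round(v, 2) for v in undone])
```

Anything between those two extremes is available to the network, which is why normalization followed by scale and shift does not limit what a layer can represent.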


10. Common Confusions

Note

"Is an iteration the same as an epoch?" No. An iteration is one weight update (one batch). An epoch is a full pass over the data, which is many iterations.

Note

"Does the batch size decide how many epochs I need?" Not directly. Batch size changes how many iterations fit in one epoch and how noisy each update is, not how many passes the model needs to learn.

Note

"Are batch norm and layer norm the same thing?" No. Batch norm normalizes a feature across the batch. Layer norm normalizes the features inside one example.

Note

"Does normalization clamp values into a fixed range?" It centers and rescales (mean 0, spread 1), then applies a learnable scale and shift. It is not a hard clamp.


11. One-Line Answers

  • Epoch: one full pass through the entire training dataset.
  • Batch: a small group of examples processed together in one step.
  • Iteration: one weight update, produced by one batch.
  • Batch normalization: stabilize activations using the mean and spread of a batch, per feature.
  • Layer normalization: stabilize activations using the mean and spread of one example's own features.

12. Final Mental Model

Think of training as school.

  • An epoch is the student going through the whole textbook once.
  • An iteration is studying one chapter and adjusting based on it.
  • Batch normalization is the teacher comparing students against the class average and rescaling.
  • Layer normalization is the teacher looking inside one student's own subjects and balancing them.

And this connects straight back to residual connections:

The residual connection helps information flow through a deep network. Normalization keeps that flowing information numerically stable.

That is why the two travel together in every Transformer block. In the pre-norm ordering:

x -> LayerNorm -> Attention/MLP -> add residual

or, in the original Transformer ordering (post-norm):

x -> Attention/MLP -> add residual -> LayerNorm

The exact order varies, but the partnership is the same: residual paths carry the signal, normalization keeps it calm.
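
The two orderings can be compared in a toy sketch. The `sublayer` function is a hypothetical stand-in for attention or an MLP (here just a fixed elementwise transform), and the learnable scale and shift are left out for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector across its own features (no scale and shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for Attention/MLP: a fixed elementwise transform."""
    return [0.5 * v + 1.0 for v in x]


def pre_norm_block(x):
    # x -> LayerNorm -> sublayer -> add residual
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x):
    # x -> sublayer -> add residual -> LayerNorm
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])


x = [2.0, 4.0, 6.0, 8.0]
pre = pre_norm_block(x)
post = post_norm_block(x)

print([round(v, 3) for v in pre])
print([round(v, 3) for v in post])
```

In the first ordering the input `x` is added back untouched, so the block's output is not itself normalized. In the second, the sum is normalized last, so the output always has mean close to 0 and spread close to 1.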