Epochs, Batches, and Normalization
Deep Learning Foundations - The Intuition
Sibling notes: Batch Normalization and Layer Normalization - The Deep Dive · Gradient Descent - The Deep Dive · Residual Connections in Transformers - The Deep Dive
These are the small "bits and bytes" of training that are easy to nod along to and then quietly lose over time:
- Epoch - how many times the model sees the data
- Batch - how much data it sees at once
- Batch normalization - keep the numbers stable using a group of examples
- Layer normalization - keep the numbers stable using one example
This is the intuition companion. For the full math (the scale and shift parameters, padding, pre-norm vs post-norm), see Batch Normalization and Layer Normalization - The Deep Dive.
flowchart LR
D["full dataset"] --> B["split into batches"]
B --> I["one batch = one update = one iteration"]
I --> E["all batches seen once = one epoch"]
E --> R["repeat for many epochs"]
1. Epoch: One Full Pass Through the Data
(small correction first: the word is epoch, not "e-pouch".)
An epoch is one complete pass through the entire training dataset.
Picture studying for an exam from a 1,000-page book:
- read all 1,000 pages once -> 1 epoch
- read the whole book again -> 2 epochs
- read it ten times -> 10 epochs
In training, the "book" is your training data. If you have 10,000 images and the model has seen all 10,000 of them once, that is 1 epoch.
Why more than one epoch?
A model does not learn everything in a single pass, just like a student rarely masters a book on the first read.
- first pass: "okay, I roughly get it"
- second pass: "now I see the pattern"
- third pass: "now it is sticking"
Each epoch gives the model another chance to adjust its weights and shrink its mistakes.
Read the book too many times and you start memorizing exact pages instead of understanding. That is overfitting. In practice you watch the validation loss and stop when it stops improving (early stopping).
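The early-stopping rule above can be sketched in a few lines. This is a minimal illustration, not a framework API: the function name `early_stopping_epoch`, the `patience` setting, and the validation losses are all made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop.

    val_losses: validation loss recorded after each epoch (made-up numbers here).
    patience: how many epochs without improvement we tolerate (an assumed setting).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop here; the best loss was seen earlier
    return len(val_losses)  # never triggered: all epochs were used


# Hypothetical validation losses: improving at first, then getting worse.
losses = [0.90, 0.60, 0.45, 0.47, 0.50, 0.58]
stop_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch)  # 5
```

The best loss is at epoch 3, and the loop stops at epoch 5, after two epochs in a row without improvement. In practice you would also keep a copy of the weights from the best epoch.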
2. Batch: A Chunk Seen at Once
Showing the model all 10,000 examples in one shot is heavy on memory and gives only one weight update per pass. So we split the data into small groups called batches.
With 10,000 examples and a batch size of 100:
- Batch 1: examples 1-100
- Batch 2: examples 101-200
- ...
- Batch 100: examples 9901-10000
After all 100 batches, the model has seen the whole dataset once. That is 1 epoch.
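As a rough sketch, the same split can be written in plain Python. The helper name `make_batches` is hypothetical, and the "examples" are just the numbers 1 to 10,000 standing in for real data.

```python
def make_batches(examples, batch_size):
    """Yield consecutive slices of `examples`, each at most `batch_size` long."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


dataset = list(range(1, 10_001))  # stand-in for 10,000 examples, numbered 1..10000
batches = list(make_batches(dataset, batch_size=100))

print(len(batches))                     # number of batches in one epoch
print(batches[0][0], batches[0][-1])    # first and last example of batch 1
print(batches[-1][0], batches[-1][-1])  # first and last example of batch 100
```

Real training code usually shuffles the data before each epoch so the batches differ from pass to pass. That step is left out here to keep the numbering easy to follow.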
A batch is one plate of food. An epoch is the full meal. You finish the meal one plate at a time.
3. Epoch vs Batch vs Iteration (the part that gets confused)
There is a third word that completes the picture: iteration (also called a step).
- one batch flows through the model
- the model predicts, measures the loss, and updates its weights once
- that single update is one iteration
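So the three connect by simple arithmetic:
$$\text{iterations per epoch} = \frac{\text{dataset size}}{\text{batch size}}$$
$$\text{total weight updates} = \text{iterations per epoch} \times \text{epochs}$$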
Tiny example:
| Quantity | Value |
|---|---|
| dataset size | 10,000 |
| batch size | 100 |
| iterations per epoch | 100 |
| epochs | 10 |
| total weight updates | 1,000 |
| Term | Meaning |
|---|---|
| Batch | a small chunk of examples processed together |
| Iteration / step | one weight update (one batch) |
| Epoch | the whole dataset seen once (all batches) |
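The numbers in the tiny example can be checked with a skeleton training loop. This is only a counting sketch: the loop body is a placeholder where a real forward pass, loss, and weight update would go.

```python
import math

dataset_size = 10_000
batch_size = 100
epochs = 10

# ceil covers a final, smaller batch when the sizes do not divide evenly.
iterations_per_epoch = math.ceil(dataset_size / batch_size)

total_updates = 0
for epoch in range(epochs):                    # one epoch = one full pass
    for step in range(iterations_per_epoch):   # one step = one batch
        # placeholder: predict, measure the loss, update the weights once
        total_updates += 1

print(iterations_per_epoch, total_updates)  # 100 1000
```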
4. Why Normalize Anything?
A neural network just passes numbers from layer to layer:
layer 1 -> layer 2 -> layer 3 -> ...
During training, every layer's weights keep changing, so each layer keeps receiving inputs whose scale and distribution keep shifting underneath it.
It is like learning to drive while, every few minutes, the steering wheel changes size and the pedals move. You can still learn, but it stays shaky.
Normalization says: before passing the numbers on, put them back into a comfortable, predictable range. That makes training steadier and usually faster.
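A small sketch of that idea, using only the standard library. The helper `standardize` and both lists of activations are invented for illustration: the second list has the same pattern as the first, but its scale has drifted by a factor of 100.

```python
import statistics


def standardize(values):
    """Shift to mean 0 and rescale to spread (standard deviation) 1."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]


early = [0.1, 0.2, 0.3, 0.4]      # hypothetical activations early in training
later = [10.0, 20.0, 30.0, 40.0]  # same pattern later, but the scale has drifted

print([round(v, 3) for v in standardize(early)])
print([round(v, 3) for v in standardize(later)])
```

Both lines print the same values, so the next layer would see the same input even though the raw numbers moved.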
5. Batch Normalization
Batch normalization cleans up the activations using a batch of examples.
For each feature, it looks across the examples in the batch and rescales the values so they have mean close to 0 and spread close to 1:
$$\hat{x} = \frac{x - \text{mean}}{\text{spread}}$$
where the mean and spread (standard deviation) are measured over the batch.
A batch of marks like 5, 10, 90, 200, -30 is messy. Batch norm rescales them into a balanced range so the next layer receives something tidy to work with.
It is called batch normalization because the average and spread come from the current batch. If a batch holds 32 images, it asks "across these 32, what is the average activation, and how spread out is it?" and normalizes from that.
Why it helps: steadier training, often faster learning, and less chance of activations exploding or collapsing. It is like adding traffic rules so the numbers flow smoothly instead of chaotically.
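Here is a minimal NumPy sketch of the normalization step, applied to the messy marks from above as a batch of five examples with one feature. The function name `batch_norm` and the small constant `eps` (added to avoid dividing by zero) are assumptions for this example. It covers training-time behaviour only and leaves out the learnable scale and shift.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) with the mean and variance of the batch.

    x has shape (batch_size, num_features). eps is an assumed small constant.
    """
    mean = x.mean(axis=0, keepdims=True)  # one mean per feature
    var = x.var(axis=0, keepdims=True)    # one variance per feature
    return (x - mean) / np.sqrt(var + eps)


marks = np.array([[5.0], [10.0], [90.0], [200.0], [-30.0]])  # 5 examples, 1 feature
normalized = batch_norm(marks)

print(marks.mean())          # average of the raw marks
print(normalized.round(3))   # rescaled values
print(normalized.mean(), normalized.std())
```

After normalization the mean is essentially 0 and the spread essentially 1, while the ordering of the marks is unchanged.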
6. Layer Normalization
Layer normalization has the same goal, stable numbers, but it measures along a different direction:
- Batch norm normalizes across the batch, for each feature.
- Layer norm normalizes across the features, inside one example.
Batch norm asks: "for this feature, what is the average across many examples?" Layer norm asks: "for this one example, what is the average across its own features?"
7. The Clean Difference
Picture a classroom where each student has marks in a few subjects:
| Student | Maths | Science | English |
|---|---|---|---|
| A | 60 | 80 | 50 |
| B | 90 | 40 | 70 |
| C | 30 | 95 | 65 |
Batch norm runs down the columns: normalize Maths across all students, then Science across all students, and so on. It compares one feature across many examples.
Layer norm runs across the rows: normalize all of Student A's subjects together, then all of Student B's, and so on. It balances the features inside a single example.
Maths Science English
Student A 60 80 50 --> layer norm runs across a row
Student B 90 40 70
Student C 30 95 65
|
v
batch norm runs down a column
A tiny number example makes it concrete:
Example 1: [2, 4, 6, 8]
Example 2: [1, 3, 5, 7]
Example 3: [10, 20, 30, 40]
Batch norm view (down each column):
- feature 1 -> 2, 1, 10
- feature 2 -> 4, 3, 20
Layer norm view (across each row):
- Example 1 -> 2, 4, 6, 8
- Example 3 -> 10, 20, 30, 40
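The two views differ only in which axis the statistics are taken over. The NumPy sketch below uses the three examples above. The constant `eps` is an assumed small value to avoid dividing by zero, and the learnable scale and shift are left out.

```python
import numpy as np

x = np.array([
    [2.0, 4.0, 6.0, 8.0],      # Example 1
    [1.0, 3.0, 5.0, 7.0],      # Example 2
    [10.0, 20.0, 30.0, 40.0],  # Example 3
])
eps = 1e-5  # assumed small constant

# Batch norm view: statistics down each column (axis=0), one pair per feature.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer norm view: statistics across each row (axis=1), one pair per example.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(x.mean(axis=0))  # per-feature means, used by batch norm
print(x.mean(axis=1))  # per-example means, used by layer norm
print(bn.round(3))
print(ln.round(3))
```

In `bn` every column averages to about 0, and in `ln` every row does.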
8. Why Transformers Use Layer Normalization
Transformers process sequences of tokens, like:
I love machine learning
Each token is a vector, and the model wants to stabilize each token's own representation. Layer norm fits this naturally because it works on one example's features and does not lean on the batch.
Batch norm leans on batch statistics, which gets awkward when:
- the batch is small
- sequences have different lengths, so padding distorts the average
- you generate one token at a time at inference
So in Transformers you almost always see LayerNorm: each token is cleaned up based on its own feature vector. (More detail in Batch Normalization and Layer Normalization - The Deep Dive.)
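To see why layer norm does not lean on the batch, the sketch below normalizes one token twice: once as part of the whole sentence, and once on its own. The token vectors are made-up 3-dimensional numbers, and `layer_norm` is a hand-written helper (no learnable scale or shift), not a library call.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the last axis, i.e. each token's own feature vector."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# Hypothetical embeddings for a 4-token sentence (made-up numbers).
tokens = np.array([
    [0.5, -1.0, 2.0],    # "I"
    [3.0, 0.0, -3.0],    # "love"
    [10.0, 12.0, 8.0],   # "machine"
    [-0.2, 0.1, 0.4],    # "learning"
])

whole_sentence = layer_norm(tokens)
one_token_alone = layer_norm(tokens[1:2])  # the same token with no neighbours

print(np.allclose(whole_sentence[1], one_token_alone[0]))  # True
```

The result for "love" is the same either way, so sequence length, padding, and batch size have no effect on it.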
9. One Important Detail: Scale and Shift
Normalizing to mean 0 and spread 1 is only half the move. Both batch norm and layer norm then apply two learnable numbers per feature:
- gamma rescales
- beta shifts
This lets the network keep the stability when it helps, or partly undo the normalization when a layer genuinely prefers a different range. The model decides this during training, the same way it learns its weights.
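A short NumPy sketch of the scale-and-shift step. The values chosen for gamma and beta are hand-picked for illustration, not learned. Starting from gamma = 1 and beta = 0 is a common initialization, and the second pair shows how particular values would undo the normalization entirely.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # one example's features
eps = 1e-5                          # assumed small constant

mean, var = x.mean(), x.var()
x_hat = (x - mean) / np.sqrt(var + eps)  # normalized: mean 0, spread 1

# Common starting point: gamma = 1, beta = 0 leaves x_hat unchanged.
gamma = np.ones_like(x)
beta = np.zeros_like(x)
out_default = gamma * x_hat + beta

# Hand-picked values that undo the normalization and recover the input.
gamma_undo = np.full_like(x, np.sqrt(var + eps))
beta_undo = np.full_like(x, mean)
out_undo = gamma_undo * x_hat + beta_undo

print(out_default.round(3))
print(out_undo.round(3))
```

Training can settle anywhere between these two extremes, separately for each feature.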
10. Common Confusions
- An iteration is not an epoch. An iteration is one weight update (one batch). An epoch is a full pass over the data, which is many iterations.
- Batch size does not directly set how many epochs you need. It changes how many iterations fit in one epoch and how noisy each update is, not how many passes the model needs to learn.
- Batch norm and layer norm are not the same operation. Batch norm normalizes a feature across the batch. Layer norm normalizes the features inside one example.
- Normalization is not a hard clamp. It centers and rescales (mean 0, spread 1), then applies a learnable scale and shift.
11. One-Line Answers
- Epoch: one full pass through the entire training dataset.
- Batch: a small group of examples processed together in one step.
- Iteration: one weight update, produced by one batch.
- Batch normalization: stabilize activations using the mean and spread of a batch, per feature.
- Layer normalization: stabilize activations using the mean and spread of one example's own features.
12. Final Mental Model
Think of training as school.
- An epoch is the student going through the whole textbook once.
- An iteration is studying one chapter and adjusting based on it.
- Batch normalization is the teacher comparing students against the class average and rescaling.
- Layer normalization is the teacher looking inside one student's own subjects and balancing them.
And this connects straight back to residual connections:
The residual connection helps information flow through a deep network. Normalization keeps that flowing information numerically stable.
That is why the two travel together in every Transformer block:
x -> LayerNorm -> Attention/MLP -> add residual
or, in the original ordering:
x -> Attention/MLP -> add residual -> LayerNorm
The exact order varies, but the partnership is the same: residual paths carry the signal, normalization keeps it calm.
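The two orderings can be sketched side by side in NumPy. Everything here is a toy: `sublayer` is a hypothetical stand-in for Attention/MLP (a fixed linear map), and `layer_norm` is a hand-written helper without the learnable scale and shift.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the last axis (each token's feature vector)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def sublayer(x):
    """Toy stand-in for Attention/MLP (hypothetical, not a real layer)."""
    return 0.5 * x + 1.0


def pre_norm_block(x):
    # x -> LayerNorm -> sublayer -> add residual
    return x + sublayer(layer_norm(x))


def post_norm_block(x):
    # x -> sublayer -> add residual -> LayerNorm
    return layer_norm(x + sublayer(x))


x = np.array([[1.0, 2.0, 3.0, 10.0]])  # one token with four features

pre = pre_norm_block(x)
post = post_norm_block(x)
print(pre.round(3), pre.mean())
print(post.round(3), post.mean())
```

In the post-norm version the block output itself is normalized, so its mean is about 0. In the pre-norm version the residual path carries `x` through unnormalized and only the sublayer's input is cleaned up.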