Hyperparameters in Machine Learning
Deep Learning Foundations - The Deep Dive
Sibling notes: Gradient Descent - The Deep Dive · Epochs, Batches, and Normalization - The Intuition · Residual Connections in Transformers - The Deep Dive
Hyperparameters are the choices you set around training. The model does not learn them through normal backpropagation. They decide how learning happens, how large the model is, how long training runs, and how strongly the model is discouraged from memorizing.
Five matter most:
- Learning rate - how big each update step is
- Batch size - how many examples per update
- Epochs - how many full passes over the data
- Network depth - how many layers of representation
- Regularization strength - how strongly we discourage memorizing
1. Parameters vs Hyperparameters
A model holds two kinds of important quantities, and they are set in completely different ways: one is learned from the data, the other is chosen by you.
| Concept | What it is | Example |
|---|---|---|
| Parameter | a value the model learns from the data | weights and biases |
| Hyperparameter | a setting you choose to shape training | learning rate, batch size, epochs, depth, regularization |
Backpropagation changes the parameters step by step to reduce the loss. It does not choose the hyperparameters for you.
Training a model is like teaching a student. The parameters are what the student learns. The hyperparameters are the teaching plan: how fast to move, how many examples to review at once, how many times to revise, how complex the lessons are, and how much discipline keeps the student from memorizing.
flowchart TB
H["hyperparameters: learning rate, batch size, epochs, depth, regularization"] --> T["shape how training happens"]
T --> P["parameters: weights and biases learned by backprop"]
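The split can be sketched in a few lines. This is a minimal illustration, assuming a one-feature linear model trained with plain gradient descent on made-up data; the names `learning_rate`, `epochs`, `w`, and `b` are illustrative and not tied to any library.

```python
# Hyperparameters: chosen by hand before training starts.
learning_rate = 0.05
epochs = 500

# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Parameters: start at arbitrary values and are learned from the data.
w, b = 0.0, 0.0

n = len(xs)
for _ in range(epochs):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # The update changes the parameters; the hyperparameters stay fixed.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")
```

Nothing in the loop ever modifies `learning_rate` or `epochs`; only `w` and `b` move toward the values that generated the data.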
Quick map
| Hyperparameter | Controls | Main risk if poorly chosen |
|---|---|---|
| Learning rate | step size of parameter updates | training diverges or crawls |
| Batch size | examples used before each update | noisy, slow, memory-heavy, or worse generalization |
| Epochs | full passes over the training data | too few underfits; too many overfits |
| Network depth | number of layers | too shallow underfits; too deep overfits or is hard to train |
| Regularization strength | how hard complexity is penalized | too weak overfits; too strong underfits |
2. Learning Rate
The learning rate is the size of the step the optimizer takes when it updates the parameters. After the model predicts, measures the loss, and computes gradients, the learning rate decides how far the weights move in the suggested direction:
$$w \leftarrow w - \eta \, \nabla_w L$$
Here $\eta$ is the learning rate and $\nabla_w L$ is the gradient of the loss $L$ with respect to the weights $w$.
Picture walking downhill in fog to reach the lowest point of a valley. The gradient tells you which way is downhill. The learning rate decides whether you take a tiny step, a normal step, or a reckless leap.
This connects directly to gradient descent, where the learning rate is the walker's level of trust in each step.
Why it matters
- Too high: the model overshoots good solutions, bounces around the minimum, and the loss can become unstable or explode.
- Too low: training crawls. Each update is tiny, so the model looks like it is barely learning.
- Good range: the loss falls smoothly while still moving fast enough to finish in reasonable time.
The learning rate often matters more than any other training hyperparameter, because it directly controls the optimization path. A strong architecture can still fail with the wrong learning rate.
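The three regimes can be reproduced on a toy problem. The sketch below assumes the simplest possible loss, f(w) = w**2, whose gradient is 2*w; the function name and the three rates are chosen for illustration, and real models will have very different thresholds.

```python
def run_gradient_descent(learning_rate, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # step against the gradient
    return w

too_high = run_gradient_descent(1.1)     # each step overshoots, so |w| grows
too_low = run_gradient_descent(0.001)    # barely moves away from w0
reasonable = run_gradient_descent(0.3)   # shrinks steadily toward the minimum at 0

print(too_high, too_low, reasonable)
```

On this loss every step multiplies `w` by `1 - 2 * learning_rate`, which is why a rate above 1.0 makes the iterate grow instead of shrink.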
Reading the symptoms
| Symptom | Likely issue | Action |
|---|---|---|
| loss becomes NaN, explodes, or jumps wildly | too high | reduce by 3x to 10x |
| loss decreases, but painfully slowly | too low | increase it, or use a schedule |
| improves first, then turns unstable | fine early, too high later | use decay or reduce-on-plateau |
| validation improves, then stalls | not fine enough for final convergence | lower the rate for fine-tuning |
Start from a common default (around 0.001 for Adam-style optimizers, 0.01 to 0.1 for some SGD setups), then adjust from the loss curve. A schedule starts larger to learn fast and shrinks later for fine adjustment. A learning-rate finder trains briefly while raising the rate to see where loss starts improving and where it goes unstable.
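One simple schedule is step decay. The following is a sketch under assumed settings (halve the rate every 10 epochs); `step_decay`, `drop_factor`, and `epochs_per_drop` are illustrative names, not a specific framework's API.

```python
def step_decay(initial_rate, epoch, drop_factor=0.5, epochs_per_drop=10):
    """Return the learning rate for a 0-indexed epoch under step decay."""
    num_drops = epoch // epochs_per_drop      # how many times the rate has dropped
    return initial_rate * (drop_factor ** num_drops)

# Larger steps early, smaller steps later.
rates = [step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)]
print(rates)
```

Most deep learning frameworks ship their own scheduler classes; this version only shows the arithmetic behind the idea.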
3. Batch Size
Batch size is the number of examples the model processes before making one parameter update. With 10,000 examples:
- batch size 100 -> 100 updates per epoch
- batch size 1 -> update after every single example
- batch size 10,000 -> one update per epoch
(The epoch, batch, and iteration relationship is built up in Epochs, Batches, and Normalization - The Intuition.)
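The arithmetic behind these counts is a ceiling division. A small sketch, assuming the final partial batch is kept unless `drop_last` is set (a hypothetical flag named after a common data-loader option):

```python
import math

def updates_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of parameter updates in one pass over the data."""
    if drop_last:
        # Discard the final, smaller batch.
        return num_examples // batch_size
    # Keep the final partial batch as its own update.
    return math.ceil(num_examples / batch_size)

for size in (100, 1, 10_000):
    print(size, updates_per_epoch(10_000, size))
```

When the batch size does not divide the dataset evenly, the last batch is smaller, which is why the two branches can differ by one.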
A teacher wants to know if the class understands a topic. Asking one student gives fast but noisy feedback. Asking the whole class gives stable feedback but takes longer. Batch size is how many opinions you gather before adjusting the lesson.
Small vs large batches
| Batch size | Benefits | Costs |
|---|---|---|
| Small | more frequent updates; the noise can help escape poor solutions; less memory | noisier loss curve; less efficient hardware use; may need a smaller learning rate |
| Large | more stable gradient estimate; efficient on GPUs/TPUs; fewer updates per epoch | more memory; can generalize worse if too large; may need learning-rate scaling or warmup |
A larger batch gives a steadier gradient estimate, so it can often tolerate a larger learning rate. A tiny batch gives noisy updates, so pairing it with a very large learning rate tends to blow up. When you change batch size a lot, retune the learning rate rather than leaving it fixed.
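The "steadier estimate" claim can be checked with a simulation. This sketch assumes hypothetical per-example gradients equal to a true value of 1.0 plus unit Gaussian noise; it measures how much the mini-batch average wobbles at different batch sizes.

```python
import random
import statistics

def batch_gradient_spread(batch_size, trials=200, seed=0):
    """Standard deviation of the mini-batch mean of noisy per-example gradients."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Hypothetical per-example gradients: true value 1.0 plus Gaussian noise.
        batch = [1.0 + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)

spread = {size: batch_gradient_spread(size) for size in (1, 32, 512)}
print(spread)
```

Under these assumptions the spread shrinks roughly with the square root of the batch size, so going from 32 to 512 examples buys a steadier estimate at sixteen times the compute per update.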
Common starting points are 32, 64, 128, or 256, depending on data and hardware. Use the largest batch that fits memory only when raw speed is the goal; the largest batch is not automatically the best model.
4. Epochs
An epoch is one complete pass through the entire training dataset. With 50,000 examples, one epoch means the model has had a chance to learn from all 50,000 once; 20 epochs means twenty passes.
An epoch is one full revision of a textbook. A few pages is not enough. One full read helps. Many revisions deepen understanding, but after too many the student memorizes the examples instead of learning the subject.
Too few, just right, too many
| Situation | What it means | What to do |
|---|---|---|
| Too few epochs | not enough time to learn; train and validation both weak | train longer, tune the learning rate, or add capacity |
| Good number | training loss falls, validation improves, then stabilizes | keep the best validation checkpoint |
| Too many epochs | training keeps improving but validation gets worse | early stopping, stronger regularization, or more data |
Do not guess the exact epoch count blindly. Train while watching validation loss, and if it stops improving for several epochs, stop and keep the best validation checkpoint. This treats epochs as a budget, not a fixed promise.
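The stopping rule described above can be sketched as a function over a recorded validation curve. The `patience` and `min_delta` names follow common usage but are assumptions here, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch              # this is the checkpoint to keep
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # The budget ran out before the patience did.
    return best_epoch, len(val_losses) - 1

val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.61]  # illustrative values
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

In a real training loop the same bookkeeping runs live, and the model weights are saved whenever `best_epoch` is updated.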
How epochs interact with the rest: a low learning rate usually needs more epochs (each step is small); strong regularization lets you train longer without overfitting; a very powerful deep network can overfit a small dataset in just a few epochs.
5. Network Depth
Depth is the number of layers in the network, usually counted as the hidden layers between input and output. More depth lets the model build more abstract representations in stages.
A shallow model decides straight from raw facts. A deeper model builds intermediate concepts first. For images, early layers find edges, middle layers combine edges into shapes, and later layers combine shapes into objects.
Why depth helps
- hierarchical features: simple patterns early, complex patterns later
- some complex functions are far more efficient to represent deep than shallow
- for vision, speech, language, and recommendation, depth often wins when data and compute are available
Why depth can hurt
- Overfitting: a deep model can memorize limited data
- Optimization difficulty: gradients may vanish, explode, or become poorly conditioned (problems that residual connections and normalization layers help tame)
- Compute cost: more layers mean more memory and slower training and inference
- Unnecessary complexity: on simple tabular problems, a very deep network may not beat a simpler model
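The compute-cost point can be made concrete by counting parameters. A rough sketch, assuming a fully connected network where every hidden layer has the same width; the sizes below are arbitrary examples.

```python
def mlp_parameter_count(input_size, hidden_size, depth, output_size):
    """Count weights and biases in a fully connected network with `depth` hidden layers."""
    layer_sizes = [input_size] + [hidden_size] * depth + [output_size]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out   # weight matrix plus bias vector
    return total

counts = {depth: mlp_parameter_count(10, 32, depth, 1) for depth in (1, 3, 8)}
print(counts)
```

Each extra hidden layer of width 32 adds 32 * 32 + 32 = 1,056 parameters in this setup, and every one of them costs memory and time in both training and inference.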
| Situation | Depth recommendation |
|---|---|
| small tabular dataset | start shallow: 1-3 hidden layers before going deeper |
| images, audio, text, complex data | use proven architectures or pretrained models, not depth invented from scratch |
| training and validation scores both low | underfitting: try more depth/width or better features |
| training score high but validation score low | overfitting, model too complex: add regularization, reduce depth, or get more data |
6. Regularization Strength
Regularization strength controls how strongly you discourage the model from becoming too complex or too dependent on the training data. The goal is not the lowest training loss; it is generalization to new data.
Regularization is forcing a student to explain the general rule instead of memorizing the answer key. A little discipline improves understanding. Too much prevents the student from learning real details.
Common forms
| Method | What it does | "Stronger" means |
|---|---|---|
| L2 / weight decay | penalizes large weights for smoother solutions | larger coefficient |
| L1 | pushes some weights to exactly zero (sparse models) | larger coefficient |
| Dropout | randomly switches off units so no single pathway is over-trusted | higher dropout rate |
| Data augmentation | adds modified examples (crop, flip, small changes) | more variation added |
| Early stopping | stops when validation stops improving | patience and validation criteria |
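For the first two rows, "larger coefficient" has a direct numeric meaning. The sketch below assumes the convention where the L2 penalty is half the coefficient times the sum of squared weights, so its gradient is simply the coefficient times each weight; other texts and libraries may scale it differently.

```python
def l2_penalty(weights, strength):
    """0.5 * strength * sum of squared weights (one common convention)."""
    return 0.5 * strength * sum(w * w for w in weights)

def l1_penalty(weights, strength):
    """strength * sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)

def sgd_step_with_l2(weights, data_grads, learning_rate, strength):
    """One SGD step where the L2 penalty adds strength * w to each gradient."""
    return [w - learning_rate * (g + strength * w)
            for w, g in zip(weights, data_grads)]

weights = [2.0, -1.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]   # pretend the data gradient is zero
shrunk = sgd_step_with_l2(weights, zero_grads, learning_rate=0.1, strength=0.5)
print(shrunk)
```

With no signal from the data, every weight shrinks by the factor `1 - learning_rate * strength`, which is where the name "weight decay" comes from.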
Too weak vs too strong
| Setting | Training behavior | Validation behavior |
|---|---|---|
| Too weak | very low training loss; memorizes details | validation loss rises; poor generalization |
| Balanced | good training loss, not suspiciously perfect | validation improves and stays stable |
| Too strong | training loss stays high; cannot fit useful patterns | validation also weak; the model underfits |
If training is excellent but validation is poor, increase regularization. If both are poor, regularization may be too strong or the model too simple. Common starting values to try for weight decay are 1e-3, 1e-4, and 1e-5. The right value is the one that improves out-of-sample performance, not the one with the lowest training loss.
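This decision rule can be written down as a rough heuristic. The thresholds `good_loss` and `gap_ratio` below are hypothetical placeholders; sensible values depend entirely on the task and the loss function.

```python
def diagnose(train_loss, val_loss, good_loss=0.2, gap_ratio=1.5):
    """Rough fit diagnosis from final losses, using hypothetical thresholds."""
    if train_loss > good_loss:
        # The model cannot even fit the training data well.
        return "underfitting: weaken regularization or add capacity"
    if val_loss > gap_ratio * train_loss:
        # Training looks good but validation lags far behind.
        return "overfitting: strengthen regularization or get more data"
    return "balanced: keep the current regularization"

print(diagnose(0.05, 0.60))
print(diagnose(0.70, 0.75))
print(diagnose(0.15, 0.18))
```

A check like this is a prompt for looking at the full curves, not a replacement for them.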
7. How They Work Together
These are not isolated switches. Changing one shifts the best value of another.
| Interaction | Meaning |
|---|---|
| learning rate + batch size | large batches often allow larger learning rates; small batches need more careful ones |
| epochs + regularization | more epochs raise overfitting risk; regularization and early stopping lower it |
| depth + regularization | deeper models often need stronger regularization, more data, or better architecture |
| learning rate + epochs | a lower rate may need more epochs; a schedule blends fast early learning with careful final tuning |
| depth + batch size | deeper models use more memory, often forcing smaller batches |
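For the first row, one widely used heuristic (not stated in this note, so treat it as an assumption) is to scale the learning rate in proportion to the batch size and ramp up to it with a short linear warmup. A minimal sketch with illustrative function names:

```python
def scaled_learning_rate(base_rate, base_batch_size, new_batch_size):
    """Linear scaling heuristic: the rate grows in proportion to the batch size."""
    return base_rate * new_batch_size / base_batch_size

def warmup_rate(step, warmup_steps, target_rate):
    """Linearly ramp from a small rate up to target_rate over warmup_steps updates."""
    fraction = min(1.0, (step + 1) / warmup_steps)
    return target_rate * fraction

target = scaled_learning_rate(0.1, base_batch_size=256, new_batch_size=1024)
ramp = [warmup_rate(step, warmup_steps=5, target_rate=target) for step in range(7)]
print(target, ramp)
```

The scaled value is a starting point for retuning, not a guarantee; very large batches often need a smaller rate than the rule suggests.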
8. A Practical Tuning Workflow
flowchart TD
A["1. Simple baseline, not the biggest model"] --> B["2. Tune learning rate first"]
B --> C["3. Batch size that fits hardware and trains stably"]
C --> D["4. Enough epochs + early stopping on validation"]
D --> E["5. Add depth only if clearly underfitting"]
E --> F["6. Add regularization if training >> validation"]
F --> G["7. Track every experiment"]
Hyperparameter tuning is controlled experimentation, not random guessing. You read the symptoms in the training and validation curves, form a hypothesis, change one or two things, and check whether generalization improved. Record hyperparameters, metrics, dataset version, and random seed so results are reproducible.
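A lightweight way to keep that record is one JSON line per run. The sketch below assumes a hypothetical `ExperimentRecord` structure; the field values are placeholders, not real results.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ExperimentRecord:
    """Everything needed to rerun and compare one training run."""
    learning_rate: float
    batch_size: int
    epochs: int
    depth: int
    weight_decay: float
    dataset_version: str
    seed: int
    metrics: dict = field(default_factory=dict)

def to_json_line(record):
    """Serialize a record as one line, suitable for appending to a log file."""
    return json.dumps(asdict(record), sort_keys=True)

# Placeholder values for illustration only.
record = ExperimentRecord(
    learning_rate=1e-3, batch_size=64, epochs=20, depth=3,
    weight_decay=1e-4, dataset_version="v1-hypothetical", seed=42,
    metrics={"best_val_loss": 0.31},
)
line = to_json_line(record)
print(line)
```

Appending one such line per experiment makes it easy to load all runs later and see which single change actually moved the validation metric.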
9. Common Confusions
Are hyperparameters learned by backpropagation? No. Backprop learns parameters (weights and biases). Hyperparameters are set around training and tuned by experiment or search.
Is the lowest training loss the goal? No. The goal is validation/out-of-sample performance. Suspiciously perfect training loss usually signals overfitting.
Is a deeper model or a bigger batch always better? Bigger is not automatically better. Excess depth overfits and is harder to train; oversized batches can generalize worse and need learning-rate adjustments.
How many epochs should I train for? Treat epochs as a budget and use early stopping. The best stopping point depends on the learning rate, regularization, and the data.
10. Final Summary
| Hyperparameter | Simple explanation |
|---|---|
| Learning rate | how big each learning step is |
| Batch size | how many examples before one update |
| Epochs | how many times the model sees the full dataset |
| Network depth | how many layers of representation the model builds |
| Regularization strength | how strongly the model is prevented from memorizing |
The throughline: parameters are learned, hyperparameters are chosen. Tune them by reading the training and validation curves like symptoms, changing little, and keeping whatever genuinely improves generalization.