Hyperparameters in Machine Learning

Deep Learning Foundations - The Deep Dive

Hyperparameters are the settings you choose around training; the model does not learn them through backpropagation. They decide how learning happens, how large the model is, how long training runs, and how strongly the model is discouraged from memorizing.

Five matter most:

Learning rate - how big each update step is
Batch size - how many examples per update
Epochs - how many full passes over the data
Network depth - how many layers of representation
Regularization strength - how strongly memorization is discouraged

1. Parameters vs Hyperparameters

A model holds two kinds of important quantities, and they are determined in completely different ways: one kind is learned from data, the other is chosen by you.

Concept	What it is	Example
Parameter	a value the model learns from the data	weights and biases
Hyperparameter	a setting you choose to shape training	learning rate, batch size, epochs, depth, regularization

Backpropagation changes the parameters step by step to reduce the loss. It does not choose the hyperparameters for you.
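A minimal sketch of the split, assuming a one-feature linear model trained with plain full-batch gradient descent. The names `hyperparams` and `train`, the toy data, and the chosen values are illustrative assumptions, not part of any library.

```python
# Hyperparameters: chosen by you before training starts (illustrative values).
hyperparams = {"learning_rate": 0.05, "epochs": 1000}

# Toy data generated from y = 2x + 1 (assumed for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def train(xs, ys, learning_rate, epochs):
    """Fit y = w*x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(xs, ys, **hyperparams)
print(round(w, 2), round(b, 2))  # approximately 2.0 1.0
```

The loop only ever writes to `w` and `b`. Nothing inside it changes `learning_rate` or `epochs`; those stay whatever you set them to.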

Note

Training a model is like teaching a student. The parameters are what the student learns. The hyperparameters are the teaching plan: how fast to move, how many examples to review at once, how many times to revise, how complex the lessons are, and how much discipline keeps the student from memorizing.

flowchart TB
    H["hyperparameters: learning rate, batch size, epochs, depth, regularization"] --> T["shape how training happens"]
    T --> P["parameters: weights and biases learned by backprop"]

Quick map

Hyperparameter	Controls	Main risk if poorly chosen
Learning rate	step size of parameter updates	training diverges or crawls
Batch size	examples used before each update	noisy or slow training, high memory use, or worse generalization
Epochs	full passes over the training data	too few underfits; too many overfits
Network depth	number of layers	too shallow underfits; too deep overfits or is hard to train
Regularization strength	how hard complexity is penalized	too weak overfits; too strong underfits

2. Learning Rate

The learning rate is the size of the step the optimizer takes when it updates the parameters. After the model predicts, measures the loss, and computes gradients, the learning rate decides how far the weights move in the suggested direction:

$$w \leftarrow w - \eta \, \nabla L$$

Here $\eta$ is the learning rate and $\nabla L$ is the gradient.
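As a worked instance of one update, assume the toy loss $L(w) = w^2$ (chosen only for illustration), so that $\nabla L = 2w$. Starting from $w = 3$ with $\eta = 0.1$:

```latex
\begin{aligned}
\nabla L &= 2w = 2 \cdot 3 = 6 \\
w &\leftarrow 3 - 0.1 \cdot 6 = 2.4
\end{aligned}
```

One step moves $w$ from 3 to 2.4, and the loss falls from 9 to 5.76.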

Note

Picture walking downhill in fog to reach the lowest point of a valley. The gradient tells you which way is downhill. The learning rate decides whether you take a tiny step, a normal step, or a reckless leap.

This connects directly to gradient descent, where the learning rate is the walker's level of trust in each step.

Why it matters

Too high: the model overshoots good solutions, bounces around the minimum, and the loss can become unstable or explode.
Too low: training crawls. Each update is tiny, so the model looks like it is barely learning.
Good range: the loss falls smoothly while still moving fast enough to finish in reasonable time.
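The three regimes above can be reproduced on a toy problem. This is a sketch that assumes the loss $L(w) = w^2$ and hand-picked rates; the function name `run_gradient_descent` and the specific values are illustrative, and real models will have different thresholds.

```python
def run_gradient_descent(learning_rate, steps=50, start=3.0):
    """Minimize L(w) = w**2 and return the loss after each step."""
    w = start
    losses = []
    for _ in range(steps):
        grad = 2 * w  # dL/dw for L(w) = w**2
        w = w - learning_rate * grad
        losses.append(w * w)
    return losses


too_high = run_gradient_descent(1.1)   # each step multiplies w by -1.2, so it grows
too_low = run_gradient_descent(0.001)  # each step shrinks w by only 0.2%
good = run_gradient_descent(0.1)       # each step shrinks w by 20%

for name, losses in [("too high", too_high), ("too low", too_low), ("good", good)]:
    print(f"{name:>8}: final loss = {losses[-1]:.3g}")
```

On this loss each update multiplies $w$ by $(1 - 2\eta)$, so any rate above 1.0 makes the magnitude grow. That clean threshold is specific to this toy loss, but the pattern of diverge, crawl, or converge carries over.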

The learning rate often matters more than any other training hyperparameter, because it directly controls the optimization path. A strong architecture can still fail with the wrong learning rate.

Reading the symptoms

Symptom	Likely issue	Action
loss becomes NaN, explodes, or jumps wildly	too high	reduce by 3x to 10x
loss decreases, but painfully slowly	too low	increase it, or use a schedule
improves first, then turns unstable	fine early, too high later	use decay or reduce-on-plateau
validation improves, then stalls	steps too coarse for final convergence	lower the rate late in training

Note

Start from a common default (around 0.001 for Adam-style optimizers, 0.01 to 0.1 for some SGD setups), then adjust from the loss curve. A schedule starts larger to learn fast and shrinks later for fine adjustment. A learning-rate finder trains briefly while raising the rate to see where loss starts improving and where it goes unstable.
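One way to express "start larger, shrink later" is exponential decay. The sketch below is one of many possible schedules, written from scratch; the function name `exponential_decay` and its arguments are assumptions for illustration, not a library API.

```python
def exponential_decay(initial_rate, decay_rate, step, decay_steps):
    """Learning rate after `step` updates.

    The rate is multiplied by `decay_rate` once every `decay_steps` updates,
    with smooth interpolation in between.
    """
    return initial_rate * decay_rate ** (step / decay_steps)


# Illustrative values: start at 1e-3 and halve every 1000 updates.
for step in (0, 1000, 2000, 5000):
    rate = exponential_decay(initial_rate=1e-3, decay_rate=0.5, step=step, decay_steps=1000)
    print(step, f"{rate:.2e}")
```

Reduce-on-plateau differs in that it lowers the rate only when validation loss stops improving, rather than on a fixed timetable.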

3. Batch Size

Batch size is the number of examples the model processes before making one parameter update. With 10,000 examples:

batch size 100 -> 100 updates per epoch
batch size 1 -> update after every single example
batch size 10,000 -> one update per epoch
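The same arithmetic as a small helper. This sketch assumes the final, smaller batch is kept when the dataset does not divide evenly (some training loops drop it instead); the name `updates_per_epoch` is illustrative.

```python
import math


def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one epoch, keeping a smaller last batch."""
    return math.ceil(num_examples / batch_size)


for batch_size in (1, 100, 128, 10_000):
    print(batch_size, updates_per_epoch(10_000, batch_size))
# 1 -> 10000, 100 -> 100, 128 -> 79, 10000 -> 1
```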

(The epoch, batch, and iteration relationship is built up in Epochs, Batches, and Normalization - The Intuition.)

Note

A teacher wants to know if the class understands a topic. Asking one student gives fast but noisy feedback. Asking the whole class gives stable feedback but takes longer. Batch size is how many opinions you gather before adjusting the lesson.

Small vs large batches

Batch size	Benefits	Costs
Small	more frequent updates; the noise can help escape poor solutions; less memory	noisier loss curve; less efficient hardware use; may need a smaller learning rate
Large	more stable gradient estimate; efficient on GPUs/TPUs; fewer updates per epoch	more memory; can generalize worse if too large; may need learning-rate scaling or warmup

Note

A larger batch gives a steadier gradient estimate, so it can often tolerate a larger learning rate. A tiny batch gives noisy updates, so pairing it with a very large learning rate tends to blow up. When you change batch size a lot, retune the learning rate rather than leaving it fixed.
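The "steadier estimate" claim can be checked with a simulation. This sketch assumes made-up per-example gradients (a true gradient of 1.0 plus unit Gaussian noise) and measures how much the mini-batch average varies from batch to batch; `batch_gradient_spread` is an illustrative name.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical per-example gradients: true gradient 1.0 plus unit Gaussian noise.
per_example_grads = [random.gauss(1.0, 1.0) for _ in range(10_000)]


def batch_gradient_spread(grads, batch_size, num_batches=500):
    """Standard deviation of the mini-batch mean gradient across random batches."""
    estimates = [
        statistics.fmean(random.sample(grads, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.stdev(estimates)


for batch_size in (1, 16, 256):
    print(batch_size, round(batch_gradient_spread(per_example_grads, batch_size), 3))
```

For independent examples the spread shrinks roughly like one over the square root of the batch size, so going from 1 to 256 examples cuts the noise by about a factor of 16. The returns diminish: each further halving of the noise costs four times the batch.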

Common starting points are 32, 64, 128, or 256, depending on data and hardware. Use the largest batch that fits memory only when raw speed is the goal; the largest batch is not automatically the best model.

4. Epochs

An epoch is one complete pass through the entire training dataset. With 50,000 examples, one epoch means the model has had a chance to learn from all 50,000 once; 20 epochs means twenty passes.

Note

An epoch is one full revision of a textbook. A few pages is not enough. One full read helps. Many revisions deepen understanding, but after too many the student memorizes the examples instead of learning the subject.

Too few, just right, too many

Situation	What it means	What to do
Too few epochs	not enough time to learn; train and validation both weak	train longer, better learning rate, or more capacity
Good number	training loss falls, validation improves, then stabilizes	keep the best validation checkpoint
Too many epochs	training keeps improving but validation gets worse	early stopping, stronger regularization, or more data

Note

Do not guess the exact epoch count blindly. Train while watching validation loss, and if it stops improving for several epochs, stop and keep the best validation checkpoint. This treats epochs as a budget, not a fixed promise.
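The stopping rule just described can be sketched as a function over a list of per-epoch validation losses. The names `early_stopping_epoch`, `patience`, and `min_delta` are illustrative, and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once validation loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Budget ran out before the patience limit was reached.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improve, then drift upward.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.61, 0.65]
print(early_stopping_epoch(val_losses, patience=3))  # (3, 6)
```

Training halts at epoch 6, but the checkpoint worth keeping is the one from epoch 3, which is why the two are returned separately.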

How epochs interact with the rest: a low learning rate usually needs more epochs (each step is small); strong regularization lets you train longer without overfitting; a very powerful deep network can overfit a small dataset in just a few epochs.

5. Network Depth

Depth is how many layers the network has, usually counted as the number of hidden layers between input and output. More depth lets the model build more abstract representations in stages.

Note

A shallow model decides straight from raw facts. A deeper model builds intermediate concepts first. For images, early layers find edges, middle layers combine edges into shapes, and later layers combine shapes into objects.

Why depth helps

hierarchical features: simple patterns early, complex patterns later
some complex functions are far more efficient to represent deep than shallow
for vision, speech, language, and recommendation, depth often wins when data and compute are available

Why depth can hurt

Overfitting: a deep model can memorize limited data
Optimization difficulty: gradients may vanish, explode, or become poorly conditioned (residual connections and normalization layers are the standard tools for taming this)
Compute cost: more layers mean more memory and slower training and inference
Unnecessary complexity: on simple tabular problems, a very deep network may not beat a simpler model
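To make the memory and capacity cost concrete, the sketch below counts weights and biases in a plain fully connected network as depth grows. It assumes every hidden layer has the same width; the function name and the sizes (20 inputs, 64 hidden units, 1 output) are illustrative.

```python
def mlp_parameter_count(input_dim, hidden_dim, output_dim, num_hidden_layers):
    """Count weights and biases in a fully connected network."""
    layer_sizes = [input_dim] + [hidden_dim] * num_hidden_layers + [output_dim]
    return sum(
        (fan_in + 1) * fan_out  # +1 accounts for the bias of each output unit
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )


for depth in (1, 3, 10):
    print(depth, mlp_parameter_count(20, 64, 1, depth))
# 1 -> 1409, 3 -> 9729, 10 -> 38849
```

Each extra hidden layer here adds 4,160 parameters. On a dataset of a few hundred rows, the 10-layer version has far more free parameters than examples, which is where memorization becomes easy.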

Situation	Depth recommendation
small tabular dataset	start shallow: 1-3 hidden layers before going deeper
images, audio, text, complex data	use proven architectures or pretrained models, not depth invented from scratch
training and validation accuracy both low	underfitting: try more depth/width or better features
training accuracy high but validation accuracy low	too complex: add regularization, reduce depth, or get more data

6. Regularization Strength

Regularization strength controls how strongly you discourage the model from becoming too complex or too dependent on the training data. The goal is not the lowest training loss; it is generalization to new data.

Note

Regularization is forcing a student to explain the general rule instead of memorizing the answer key. A little discipline improves understanding. Too much prevents the student from learning real details.

Common forms

Method	What it does	"Stronger" means
L2 / weight decay	penalizes large weights for smoother solutions	larger coefficient
L1	pushes some weights to exactly zero (sparse models)	larger coefficient
Dropout	randomly switches off units so no single pathway is over-trusted	higher dropout rate
Data augmentation	adds modified examples (crop, flip, small changes)	more variation added
Early stopping	stops when validation stops improving	shorter patience (stop sooner after validation stalls)
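As an illustration of what "larger coefficient" does for L2, here is a sketch of one SGD step on a loss with an added penalty of (weight_decay / 2) times the sum of squared weights. The function name `sgd_step_with_l2` and the values are assumptions for the example; optimizers in real libraries may apply weight decay differently.

```python
def sgd_step_with_l2(weights, grads, learning_rate, weight_decay):
    """One SGD step on loss + (weight_decay / 2) * sum(w**2).

    The penalty adds weight_decay * w to each gradient, which pulls every
    weight toward zero in proportion to its size.
    """
    return [
        w - learning_rate * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


weights = [2.0, -0.5, 0.0]
zero_grads = [0.0, 0.0, 0.0]  # data gradient set to zero to isolate the penalty

mild = sgd_step_with_l2(weights, zero_grads, learning_rate=0.1, weight_decay=0.01)
strong = sgd_step_with_l2(weights, zero_grads, learning_rate=0.1, weight_decay=1.0)
print(mild)
print(strong)
```

With the data gradient at zero, each weight shrinks by the fraction `learning_rate * weight_decay` per step: 0.1% in the mild case and 10% in the strong one. Large weights move the most and a weight at zero stays put.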

Too weak vs too strong

Setting	Training behavior	Validation behavior
Too weak	very low training loss; memorizes details	validation loss rises; poor generalization
Balanced	good training loss, not suspiciously perfect	validation improves and stays stable
Too strong	training loss stays high; cannot fit useful patterns	validation also weak; the model underfits

Note

If training is excellent but validation is poor, increase regularization. If both are poor, regularization may be too strong or the model too simple. Common starting values to try for weight decay are 1e-3, 1e-4, and 1e-5. The right value is the one that improves out-of-sample performance, not the one with the lowest training loss.

7. How They Work Together

These are not isolated switches. Changing one shifts the best value of another.

Interaction	Meaning
learning rate + batch size	large batches often allow larger learning rates; small batches need more careful ones
epochs + regularization	more epochs raise overfitting risk; regularization and early stopping lower it
depth + regularization	deeper models often need stronger regularization, more data, or better architecture
learning rate + epochs	a lower rate may need more epochs; a schedule blends fast early learning with careful final tuning
depth + batch size	deeper models use more memory, often forcing smaller batches

8. A Practical Tuning Workflow

flowchart TD
    A["1. Simple baseline, not the biggest model"] --> B["2. Tune learning rate first"]
    B --> C["3. Batch size that fits hardware and trains stably"]
    C --> D["4. Enough epochs + early stopping on validation"]
    D --> E["5. Add depth only if clearly underfitting"]
    E --> F["6. Add regularization if training >> validation"]
    F --> G["7. Track every experiment"]

Note

Hyperparameter tuning is controlled experimentation, not random guessing. You read the symptoms in the training and validation curves, form a hypothesis, change one or two things, and check whether generalization improved. Record hyperparameters, metrics, dataset version, and random seed so results are reproducible.
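A minimal sketch of such a record, assuming one JSON line per run. The class name, field names, and all values are illustrative; the validation loss in particular is a placeholder, not a real result.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExperimentRecord:
    """One row of an experiment log (field names are illustrative)."""
    learning_rate: float
    batch_size: int
    epochs: int
    num_hidden_layers: int
    weight_decay: float
    dataset_version: str
    seed: int
    best_val_loss: float


def to_json_line(record):
    """Serialize a record as a single JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


record = ExperimentRecord(
    learning_rate=1e-3,
    batch_size=64,
    epochs=20,
    num_hidden_layers=2,
    weight_decay=1e-4,
    dataset_version="v1",
    seed=42,
    best_val_loss=0.412,  # placeholder value, not a real result
)
print(to_json_line(record))
```

Keeping one line per run makes it easy to load the whole log later and compare runs that differ in a single hyperparameter.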

9. Common Confusions

Note

Backprop does not learn hyperparameters. It learns parameters (weights and biases); hyperparameters are set around training and tuned by experiment or search.

Note

The lowest training loss is not the goal. The goal is validation/out-of-sample performance, and a suspiciously perfect training loss usually signals overfitting.

Note

Bigger is not automatically better. Excess depth overfits and is harder to train; oversized batches can generalize worse and need learning-rate adjustments.

Note

Treat epochs as a budget and use early stopping. The best stopping point depends on the learning rate, regularization, and the data.

10. Final Summary

Hyperparameter	Simple explanation
Learning rate	how big each learning step is
Batch size	how many examples before one update
Epochs	how many times the model sees the full dataset
Network depth	how many layers of representation the model builds
Regularization strength	how strongly the model is prevented from memorizing

The throughline: parameters are learned, hyperparameters are chosen. Tune them by reading the training and validation curves like symptoms, changing little, and keeping whatever genuinely improves generalization.