Hyperparameters in Machine Learning

Deep Learning Foundations - The Deep Dive
Sibling notes: Gradient Descent - The Deep Dive · Epochs, Batches, and Normalization - The Intuition · Residual Connections in Transformers - The Deep Dive

Hyperparameters are the settings you choose around training. The model does not learn them through backpropagation. They decide how learning happens, how large the model is, how long training runs, and how strongly the model is discouraged from memorizing.

Five matter most:

  • Learning rate - how big each update step is
  • Batch size - how many examples per update
  • Epochs - how many full passes over the data
  • Network depth - how many layers of representation
  • Regularization strength - how strongly we discourage memorizing

1. Parameters vs Hyperparameters

A model holds two kinds of important quantities, and they are learned in completely different ways.

| Concept | What it is | Example |
| --- | --- | --- |
| Parameter | a value the model learns from the data | weights and biases |
| Hyperparameter | a setting you choose to shape training | learning rate, batch size, epochs, depth, regularization |

Backpropagation changes the parameters step by step to reduce the loss. It does not choose the hyperparameters for you.
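The split can be made concrete with a minimal sketch. The toy data (generated from y = 2x), the single weight `w`, and the values in the `hyperparameters` dict are all assumptions chosen for illustration, not settings from a real model.

```python
# Hyperparameters: chosen before training, never touched by the update rule.
hyperparameters = {"learning_rate": 0.05, "epochs": 50}

# Toy data generated from y = 2 * x (an assumption made for this illustration).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Parameter: a single weight the model learns from the data.
w = 0.0

for _ in range(hyperparameters["epochs"]):
    for x, y in data:
        error = w * x - y        # prediction minus target
        grad = 2.0 * error * x   # d/dw of the squared error (w*x - y)**2
        w -= hyperparameters["learning_rate"] * grad

print(f"learned parameter w = {w:.4f}")
print(f"hyperparameters are unchanged: {hyperparameters}")
```

The loop moves `w` toward 2, while the dictionary of hyperparameters is only read, never updated.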

Note

Training a model is like teaching a student. The parameters are what the student learns. The hyperparameters are the teaching plan: how fast to move, how many examples to review at once, how many times to revise, how complex the lessons are, and how much discipline keeps the student from memorizing.

flowchart TB
    H["hyperparameters: learning rate, batch size, epochs, depth, regularization"] --> T["shape how training happens"]
    T --> P["parameters: weights and biases learned by backprop"]

Quick map

| Hyperparameter | Controls | Main risk if poorly chosen |
| --- | --- | --- |
| Learning rate | step size of parameter updates | training diverges or crawls |
| Batch size | examples used before each update | noisy, slow, memory-heavy, or worse generalization |
| Epochs | full passes over the training data | too few underfits; too many overfits |
| Network depth | number of layers | too shallow underfits; too deep overfits or is hard to train |
| Regularization strength | how hard complexity is penalized | too weak overfits; too strong underfits |

2. Learning Rate

The learning rate is the size of the step the optimizer takes when it updates the parameters. After the model predicts, measures the loss, and computes gradients, the learning rate decides how far the weights move in the suggested direction:

$$w \leftarrow w - \eta \, \nabla L$$

Here $\eta$ is the learning rate and $\nabla L$ is the gradient of the loss.

Note

Picture walking downhill in fog to reach the lowest point of a valley. The gradient tells you which way is downhill. The learning rate decides whether you take a tiny step, a normal step, or a reckless leap.

This connects directly to gradient descent, where the learning rate is the walker's level of trust in each step.

Why it matters

  • Too high: the model overshoots good solutions, bounces around the minimum, and the loss can become unstable or explode.
  • Too low: training crawls. Each update is tiny, so the model looks like it is barely learning.
  • Good range: the loss falls smoothly while still moving fast enough to finish in reasonable time.
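The three regimes above can be reproduced on a toy problem. This is a sketch under stated assumptions: the loss is L(w) = w², the starting point is w = 1, and the three learning rates are picked only to make each regime visible.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize L(w) = w**2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w                # dL/dw for L(w) = w**2
        w = w - learning_rate * grad  # w <- w - eta * grad
    return w

too_high = gradient_descent(1.1)     # each step overshoots and |w| grows
too_low = gradient_descent(0.001)    # w barely moves away from 1.0
reasonable = gradient_descent(0.3)   # w shrinks quickly toward the minimum at 0

print(too_high, too_low, reasonable)
```

On this loss each update multiplies `w` by (1 - 2η), so η = 1.1 gives a factor of -1.2 (the iterate flips sign and grows), η = 0.001 gives 0.998 (almost no progress in 20 steps), and η = 0.3 gives 0.4 (fast, stable convergence).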

The learning rate often matters more than any other training hyperparameter, because it directly controls the optimization path. A strong architecture can still fail with the wrong learning rate.

Reading the symptoms

| Symptom | Likely issue | Action |
| --- | --- | --- |
| loss becomes NaN, explodes, or jumps wildly | too high | reduce by 3x to 10x |
| loss decreases, but painfully slowly | too low | increase it, or use a schedule |
| improves first, then turns unstable | fine early, too high later | use decay or reduce-on-plateau |
| validation improves, then stalls | not fine enough for final convergence | lower the rate for fine-tuning |

Note

Start from a common default (around 0.001 for Adam-style optimizers, 0.01 to 0.1 for some SGD setups), then adjust from the loss curve. A schedule starts larger to learn fast and shrinks later for fine adjustment. A learning-rate finder trains briefly while raising the rate to see where loss starts improving and where it goes unstable.
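One simple schedule of the kind described here is exponential decay. The sketch below is only one possible shape; the function name `exponential_decay`, the initial rate 0.1, and the decay factor 0.9 are illustrative assumptions.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch

# Starts at 0.1 and shrinks by 10% every epoch.
schedule = [exponential_decay(0.1, 0.9, epoch) for epoch in range(5)]
print(schedule)
```

Early epochs use the larger rate to make fast progress, and later epochs take smaller steps for fine adjustment.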


3. Batch Size

Batch size is the number of examples the model processes before making one parameter update. With 10,000 examples:

  • batch size 100 -> 100 updates per epoch
  • batch size 1 -> update after every single example
  • batch size 10,000 -> one update per epoch
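The arithmetic behind these counts is a ceiling division. In this short sketch the helper name `updates_per_epoch` is hypothetical, and it assumes the final, smaller batch is still used when the batch size does not divide the dataset evenly.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one epoch; the last batch may be smaller."""
    return math.ceil(num_examples / batch_size)

for batch_size in (1, 100, 10_000):
    print(batch_size, updates_per_epoch(10_000, batch_size))
```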

(The epoch, batch, and iteration relationship is built up in Epochs, Batches, and Normalization - The Intuition.)

Note

A teacher wants to know if the class understands a topic. Asking one student gives fast but noisy feedback. Asking the whole class gives stable feedback but takes longer. Batch size is how many opinions you gather before adjusting the lesson.

Small vs large batches

| Batch size | Benefits | Costs |
| --- | --- | --- |
| Small | more frequent updates; the noise can help escape poor solutions; less memory | noisier loss curve; less efficient hardware use; may need a smaller learning rate |
| Large | more stable gradient estimate; efficient on GPUs/TPUs; fewer updates per epoch | more memory; can generalize worse if too large; may need learning-rate scaling or warmup |

Note

A larger batch gives a steadier gradient estimate, so it can often tolerate a larger learning rate. A tiny batch gives noisy updates, so pairing it with a very large learning rate tends to blow up. When you change batch size a lot, retune the learning rate rather than leaving it fixed.
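The claim that larger batches give steadier gradient estimates can be checked with a small simulation. Everything here is an assumption for illustration: the per-example gradients are synthetic (mean 1.0, Gaussian noise), and `batch_gradient_spread` is a hypothetical helper, not part of any library.

```python
import random
import statistics

random.seed(0)

# Synthetic per-example gradients: true mean 1.0 plus Gaussian noise.
per_example_grads = [random.gauss(1.0, 2.0) for _ in range(10_000)]

def batch_gradient_spread(batch_size, trials=500):
    """Standard deviation of the mini-batch mean gradient across random batches."""
    estimates = [
        statistics.fmean(random.sample(per_example_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

spread_small = batch_gradient_spread(4)
spread_large = batch_gradient_spread(256)
print(spread_small, spread_large)
```

The spread of the batch-mean gradient falls roughly with the square root of the batch size, which is why a batch of 256 gives a much steadier update direction than a batch of 4.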

Common starting points are 32, 64, 128, or 256, depending on data and hardware. Use the largest batch that fits memory only when raw speed is the goal; the largest batch is not automatically the best model.


4. Epochs

An epoch is one complete pass through the entire training dataset. With 50,000 examples, one epoch means the model has had a chance to learn from all 50,000 once; 20 epochs means twenty passes.

Note

An epoch is one full revision of a textbook. A few pages is not enough. One full read helps. Many revisions deepen understanding, but after too many the student memorizes the examples instead of learning the subject.

Too few, just right, too many

| Situation | What it means | What to do |
| --- | --- | --- |
| Too few epochs | not enough time to learn; train and validation both weak | train longer, better learning rate, or more capacity |
| Good number | training loss falls, validation improves, then stabilizes | keep the best validation checkpoint |
| Too many epochs | training keeps improving but validation gets worse | early stopping, stronger regularization, or more data |

Note

Do not guess the exact epoch count blindly. Train while watching validation loss, and if it stops improving for several epochs, stop and keep the best validation checkpoint. This treats epochs as a budget, not a fixed promise.
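The stopping rule described here can be sketched in a few lines. The function name `early_stopping`, the 0-indexed epochs, the patience of 3, and the validation-loss values are all illustrative assumptions.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves until epoch 3, then drifts upward.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best_epoch, stop_epoch = early_stopping(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

With this curve the best checkpoint is epoch 3 and training stops at epoch 6, so the last scheduled epoch is never run. In a real training loop the same logic would sit next to code that saves the checkpoint whenever the best loss improves.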

How epochs interact with the rest: a low learning rate usually needs more epochs (each step is small); strong regularization lets you train longer without overfitting; a very powerful deep network can overfit a small dataset in just a few epochs.


5. Network Depth

Depth is the number of layers in the network, usually counted as the hidden layers between input and output. More depth lets the model build more abstract representations in stages.

Note

A shallow model decides straight from raw facts. A deeper model builds intermediate concepts first. For images, early layers find edges, middle layers combine edges into shapes, and later layers combine shapes into objects.

Why depth helps

  • hierarchical features: simple patterns early, complex patterns later
  • some complex functions are far more efficient to represent deep than shallow
  • for vision, speech, language, and recommendation, depth often wins when data and compute are available

Why depth can hurt

  • Overfitting: a deep model can memorize limited data
  • Optimization difficulty: gradients may vanish, explode, or become poorly conditioned (this is the problem that residual connections and normalization help tame)
  • Compute cost: more layers mean more memory and slower training and inference
  • Unnecessary complexity: on simple tabular problems, a very deep network may not beat a simpler model

| Situation | Depth recommendation |
| --- | --- |
| small tabular dataset | start shallow: 1-3 hidden layers before going deeper |
| images, audio, text, complex data | use proven architectures or pretrained models, not depth invented from scratch |
| training and validation scores both low | underfitting: try more depth/width or better features |
| training score high but validation score low | too complex: add regularization, reduce depth, or get more data |
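The compute-cost point above can be made concrete by counting parameters. This sketch assumes a plain fully connected network; the helper name `mlp_parameter_count` and the layer widths (20 inputs, hidden width 64, 1 output) are hypothetical.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

shallow = mlp_parameter_count([20, 64, 1])          # 1 hidden layer
deeper = mlp_parameter_count([20, 64, 64, 64, 1])   # 3 hidden layers
print(shallow, deeper)
```

Going from one to three hidden layers of the same width raises the count from 1,409 to 9,729 parameters, and memory and compute grow along with it.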

6. Regularization Strength

Regularization strength controls how strongly you discourage the model from becoming too complex or too dependent on the training data. The goal is not the lowest training loss; it is generalization to new data.

Note

Regularization is forcing a student to explain the general rule instead of memorizing the answer key. A little discipline improves understanding. Too much prevents the student from learning real details.

Common forms

| Method | What it does | "Stronger" means |
| --- | --- | --- |
| L2 / weight decay | penalizes large weights for smoother solutions | larger coefficient |
| L1 | pushes some weights to exactly zero (sparse models) | larger coefficient |
| Dropout | randomly switches off units so no single pathway is over-trusted | higher dropout rate |
| Data augmentation | adds modified examples (crop, flip, small changes) | more variation added |
| Early stopping | stops when validation stops improving | patience and validation criteria |
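For the L2 row, "larger coefficient" has a direct effect on the update rule. The sketch below assumes plain SGD on the loss L(w) + (λ/2)·w², which adds a term λ·w to the gradient. The name `sgd_step_with_l2` is hypothetical, and adaptive optimizers may apply weight decay differently.

```python
def sgd_step_with_l2(w, grad, learning_rate, weight_decay):
    """One SGD update on the loss L(w) + (weight_decay / 2) * w**2."""
    return w - learning_rate * (grad + weight_decay * w)

# With a zero data gradient, only the penalty acts: the weight shrinks toward 0.
weak = sgd_step_with_l2(w=1.0, grad=0.0, learning_rate=0.1, weight_decay=0.01)
strong = sgd_step_with_l2(w=1.0, grad=0.0, learning_rate=0.1, weight_decay=0.5)
print(weak, strong)
```

A larger coefficient pulls every weight toward zero harder on each step, which is what makes the solution smoother and, past some point, too constrained to fit useful patterns.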

Too weak vs too strong

| Setting | Training behavior | Validation behavior |
| --- | --- | --- |
| Too weak | very low training loss; memorizes details | validation loss rises; poor generalization |
| Balanced | good training loss, not suspiciously perfect | validation improves and stays stable |
| Too strong | training loss stays high; cannot fit useful patterns | validation also weak; the model underfits |

Note

If training is excellent but validation is poor, increase regularization. If both are poor, regularization may be too strong or the model too simple. Common starting values to try for weight decay are 1e-3, 1e-4, and 1e-5. The right value is the one that improves out-of-sample performance, not the one with the lowest training loss.


7. How They Work Together

These are not isolated switches. Changing one shifts the best value of another.

| Interaction | Meaning |
| --- | --- |
| learning rate + batch size | large batches often allow larger learning rates; small batches need more careful ones |
| epochs + regularization | more epochs raise overfitting risk; regularization and early stopping lower it |
| depth + regularization | deeper models often need stronger regularization, more data, or better architecture |
| learning rate + epochs | a lower rate may need more epochs; a schedule blends fast early learning with careful final tuning |
| depth + batch size | deeper models use more memory, often forcing smaller batches |

8. A Practical Tuning Workflow

flowchart TD
    A["1. Simple baseline, not the biggest model"] --> B["2. Tune learning rate first"]
    B --> C["3. Batch size that fits hardware and trains stably"]
    C --> D["4. Enough epochs + early stopping on validation"]
    D --> E["5. Add depth only if clearly underfitting"]
    E --> F["6. Add regularization if training >> validation"]
    F --> G["7. Track every experiment"]

Note

Hyperparameter tuning is controlled experimentation, not random guessing. You read the symptoms in the training and validation curves, form a hypothesis, change one or two things, and check whether generalization improved. Record hyperparameters, metrics, dataset version, and random seed so results are reproducible.
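One lightweight way to record a run is a single JSON line per experiment. This is a sketch, not a prescribed format: the helper `make_experiment_record`, the field names, the metric values, and the dataset label `"v1"` are all made up for illustration.

```python
import json

def make_experiment_record(hyperparameters, metrics, dataset_version, seed):
    """Bundle everything needed to reproduce and compare one training run."""
    return {
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "dataset_version": dataset_version,
        "seed": seed,
    }

record = make_experiment_record(
    hyperparameters={"learning_rate": 1e-3, "batch_size": 64, "epochs": 20,
                     "depth": 3, "weight_decay": 1e-4},
    metrics={"train_loss": 0.21, "val_loss": 0.27},  # made-up numbers
    dataset_version="v1",                            # hypothetical label
    seed=42,
)

# One JSON line per run, suitable for appending to a log file.
line = json.dumps(record, sort_keys=True)
print(line)
```

Keeping one line per run makes it easy to compare experiments later and to see which single change produced an improvement.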


9. Common Confusions

Note

Backpropagation does not learn hyperparameters. It learns parameters (weights and biases). Hyperparameters are set around training and tuned by experiment or search.

Note

The lowest training loss is not the goal. The goal is validation/out-of-sample performance. Suspiciously perfect training loss usually signals overfitting.

Note

Bigger is not automatically better. Excess depth overfits and is harder to train; oversized batches can generalize worse and need learning-rate adjustments.

Note

Treat epochs as a budget and use early stopping. The best stopping point depends on the learning rate, regularization, and the data.


10. Final Summary

| Hyperparameter | Simple explanation |
| --- | --- |
| Learning rate | how big each learning step is |
| Batch size | how many examples before one update |
| Epochs | how many times the model sees the full dataset |
| Network depth | how many layers of representation the model builds |
| Regularization strength | how strongly the model is prevented from memorizing |

The throughline: parameters are learned, hyperparameters are chosen. Tune them by reading the training and validation curves like symptoms, changing little, and keeping whatever genuinely improves generalization.