2.5 How Neural Networks Learn

Adjust the weights yourself: see how tuning numbers inside a neural network changes its predictions.

🎯 Core Goals

  • Understand that “training” means adjusting numbers (weights) inside the network.
  • Get hands-on intuition by adjusting weights yourself and seeing predictions change.
  • Learn key vocabulary: weights, loss, backpropagation, gradient descent, epochs.
  • Clarify what “Mixture of Experts” actually means.

Training a neural network is like tuning thousands of tiny knobs. Each knob (called a “weight”) controls how much one piece of information influences another. During training, the network adjusts these knobs over and over until its predictions match the right answers. The process that figures out which direction to turn each knob is called backpropagation.
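That knob-turning can be shown in a few lines. Here's a minimal sketch in plain Python, with a single weight and made-up numbers — the idea of one training step, not how a real LLM is implemented:

```python
# A single "knob turn": one weight, one training example.
# All numbers here are invented for illustration.
w = 0.2                      # the knob, starting at an arbitrary value
x, target = 1.0, 0.8         # input and the "right answer"

prediction = w * x           # the network's guess
loss = (prediction - target) ** 2    # how wrong it is (squared error)

# Backpropagation: the derivative of the loss tells us
# which direction to turn the knob.
gradient = 2 * (prediction - target) * x

# Gradient descent: step a little way "downhill."
learning_rate = 0.1
w = w - learning_rate * gradient

print(round(w, 3))           # 0.32 — the knob moved toward a better value
```

One step nudges the weight from 0.2 to 0.32, closer to the value that predicts the target. Real training repeats this across billions of weights and examples.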

👁️ Visuals & Interactives

Inside a Neural Network

Adjust the weight sliders and watch predictions shift. This is how training works — tuning numbers until the answer is right.

"The cat sat on the ___"

Remember tokenization? The LLM reads these as tokens, not words.

[Interactive: three weight sliders (w₁, w₂, w₃) control the network's prediction. At the starting slider values the network guesses "south"; adjust the weights and watch it learn the right answer.]

📝 Key Concepts

Weights: The Knobs Inside the Network

  • The Analogy: Imagine a mixing board in a recording studio. Each slider controls how loud one instrument is. In a neural network, each “slider” (weight) controls how much one piece of input influences the next layer.
  • Before Training: All weights start random — like the starting position in the interactive above. The network’s predictions are gibberish. It might complete “The cat sat on the ___” with “banana.”
  • After Training: The knobs have been fine-tuned so the network reliably predicts “mat” (or another sensible word). This isn’t because it “knows” what a mat is — it’s because the numbers have been adjusted to produce the right output, billions of times.
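The before/after contrast can be sketched in code. The candidate words and weight values below are invented to mirror the slider interactive above — in a real network the "weights" sit inside layers of matrix math, but the principle is the same:

```python
# A toy "next word" picker, mirroring the slider interactive.
# One knob per candidate word; the words and values are made up.
def predict(weights):
    # The candidate with the highest weight wins.
    return max(weights, key=weights.get)

# After training: the knobs favor the sensible word.
trained = {"mat": 0.8, "banana": 0.1, "south": 0.1}
print(predict(trained))    # "mat"

# Before training: the knobs are effectively random.
untrained = {"mat": 0.2, "banana": 0.7, "south": 0.1}
print(predict(untrained))  # "banana" — gibberish, as described above
```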
  • ⚖️ Loss: How wrong the prediction is. High loss = very wrong, near zero = correct.
  • 🔙 Backpropagation: Works backward through every knob to figure out: "which ones caused the mistake, and which way should I turn them?"
  • ⛰️ Gradient Descent: Like walking downhill in fog — always stepping in the direction that reduces the loss. The "learning rate" is how big each step is.
  • 🔁 Epochs: One full pass through all the training examples — like rehearsing a speech over and over. Smaller models often train for many epochs; the largest LLMs may see most of their data only once.

If this doesn't quite make sense, that's perfectly fine. You don't need to understand how a neural network is trained in an engineering sense to use LLMs effectively. The key takeaway: training means adjusting numbers until the model gets better at predicting.
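For readers who do want to see the engineering view, the four terms above fit together in one loop. A minimal sketch with one weight and invented data (plain Python, not a real training setup):

```python
# Loss, backpropagation, gradient descent, and epochs in one tiny loop.
# The data follows y = 2x, so the "right" value for the knob is 2.0.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # inputs and right answers
w = 0.0                                        # knob starts untrained
learning_rate = 0.05

for epoch in range(100):                       # each epoch = one full pass
    for x, target in data:
        prediction = w * x
        loss = (prediction - target) ** 2           # loss: how wrong
        gradient = 2 * (prediction - target) * x    # backprop: which way
        w -= learning_rate * gradient               # gradient descent: step

print(round(w, 2))  # 2.0 — the knob has been tuned to fit the data
```

After enough passes the loss shrinks toward zero and the weight settles at the value that makes the predictions right — exactly the "adjust numbers until it gets better" takeaway above.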

A large LLM might have hundreds of billions of weights. Training all of them means doing gradient descent across an almost incomprehensible number of dimensions. That’s why LLM training requires massive GPU clusters and can cost millions of dollars.

🧠 Mixture of Experts: Not What You Think

Remember the 3 weights you just tuned? Now imagine a network with 128 weight groups — but here’s the trick: only 1 or 2 groups are active at a time. A tiny “router” decides which group to activate for each word.

Common Misconception

"Multiple AI Agents Talking to Each Other"

People imagine a panel of tiny AIs sitting around a table, debating and voting on the answer — like a committee meeting.

🤖💬🤖
💬🤖💬

Reality

"A Router Dispatches Tokens to Specialists"

Like a switchboard operator — for each word, the router connects it to the most relevant specialist (a group of knobs). No discussion. No voting. Just efficient routing.

🚦 → 🧠
🚦 → 🧮
  • The Router: A tiny sub-network that looks at each incoming word and picks which expert handles it — like a switchboard operator connecting a call to the right department.
  • The Experts: Specialized groups of knobs, each good at certain types of words (e.g., one group handles code, another handles French, another handles math). Only 1 or 2 groups are active per word — the rest stay idle.
  • Why It Matters: This lets models be much larger without proportionally increasing compute cost. More knobs total, but you only turn on a few at a time — so the model is smarter without being dramatically slower. That’s why some of the largest models today can still respond in seconds: even with hundreds of billions of parameters, only a fraction are active for any single word.
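The routing idea can be sketched in a few lines. Everything here is a toy — the "experts" are plain functions and the router's scores are hard-coded, whereas in a real model both are learned groups of weights — but the shape of the mechanism is the same:

```python
# A sketch of Mixture-of-Experts routing with made-up scores.
# Each "expert" is just a function; in a real model it's a block of weights.
experts = {
    "code":   lambda tok: f"[code expert handled {tok!r}]",
    "french": lambda tok: f"[french expert handled {tok!r}]",
    "math":   lambda tok: f"[math expert handled {tok!r}]",
}

def router_scores(token):
    # A real router is a tiny learned sub-network; here we fake its scores.
    made_up = {
        "def":     {"code": 0.9, "french": 0.05, "math": 0.05},
        "bonjour": {"code": 0.1, "french": 0.8,  "math": 0.1},
    }
    return made_up.get(token, {"code": 1/3, "french": 1/3, "math": 1/3})

def route(token, k=1):
    scores = router_scores(token)
    # Keep only the top-k experts; the rest stay idle (no compute spent).
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    return [experts[name](token) for name in chosen]

print(route("def"))      # only the "code" expert runs
print(route("bonjour"))  # only the "french" expert runs
```

Note there is no back-and-forth between experts: the router picks, the chosen expert processes the token, and the rest never run — switchboard, not committee.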

“Mixture of Experts” does NOT mean multiple AI agents collaborating. The “experts” aren’t separate models with their own goals — they’re specialized groups of knobs within a single model, selected one at a time by a simple routing mechanism. Think switchboard operator, not committee meeting.
