AI Foundations · Lesson 6 · Phase 3 — How a network learns

Loss — Measuring How Wrong We Are

Lessons 1–5 built a machine that turns inputs into a prediction: numbers in, forward pass through the layers, number out. But that machine starts life with random weights, so its predictions start out as garbage. Phase 3 is about fixing that — teaching the network. And all of teaching rests on one almost embarrassingly simple idea: before anything can get better, it needs a score that says how wrong it currently is. Today you compute that score by hand.

Learning needs a score

Three new words, all of them friendly. The number your network outputs for an example is its prediction, written ŷ — read it aloud as "y-hat"; the hat is math's way of saying "an estimate of y". The true answer stored in your training data is the target y (also called a label). And the single number that summarizes how far all the predictions are from all the targets is the loss — in some books, like Nielsen's, it's called the cost; same thing.

Why one smooth number instead of just counting how many answers the network got right? Nielsen's argument: a tiny nudge to one weight usually doesn't flip any answer from wrong to right, so the count doesn't move — you get no signal about whether the nudge helped. A smooth loss moves a little with every nudge, so it always tells you whether you just made things better or worse. That property is the entire reason learning is possible, and lesson 7 cashes it in.
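To see that argument with numbers, here is a minimal sketch. Everything in it is an assumption made for illustration: the three toy inputs, the one-weight sigmoid neuron with no bias, and the helper names `count_correct` and `smooth_loss`. The smooth score it uses is the average squared miss, which the rest of this lesson builds by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: one input number per example, target 1 or 0.
inputs = [2.0, -1.0, -0.5]
targets = [1.0, 0.0, 1.0]

def predict_all(w):
    # One-weight neuron with no bias: prediction = sigmoid(w * x).
    return [sigmoid(w * x) for x in inputs]

def count_correct(w):
    # An answer counts as right when it lands on the target's side of 0.5.
    preds = predict_all(w)
    return sum(1 for p, y in zip(preds, targets) if (p > 0.5) == (y == 1.0))

def smooth_loss(w):
    # Average squared miss over all examples.
    preds = predict_all(w)
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)

w_before, w_after = 0.30, 0.31  # a tiny nudge to the weight

print("correct:", count_correct(w_before), "->", count_correct(w_after))
print("loss:   ", smooth_loss(w_before), "->", smooth_loss(w_after))
```

With these made-up numbers the count of right answers stays the same across the nudge, while the smooth loss moves slightly, so only the loss tells you the nudge helped.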

Feel it: three photos, three misses

A tiny cat detector looked at three photos. Its output neuron ends in a sigmoid (lesson 1), so each prediction is between 0 and 1; the target is 1 for a cat, 0 for not-a-cat. Photos 1 and 3 are cats, photo 2 isn't. The sliders are the network's predictions — the bars show the squared error of each photo and the average of the three, the MSE. Try this: fix photo 3 (drag it toward 1.00) and watch the average collapse. Then drag it to 0.00 and watch one terrible miss take over the whole score. Then match all three targets exactly.

[Interactive widget. Starting predictions: 0.90, 0.20, 0.40. Squared errors: 0.01, 0.04, 0.36. MSE = average of the three = 0.137]

Now by hand — the worked table

The recipe you were just driving has three steps. Step 1: for each example, take the signed miss, ŷ − y. Step 2: square it. Step 3: average the squares. Here is the widget's starting position, worked out completely:

photo | target y | prediction ŷ | miss ŷ−y | squared (ŷ−y)²
1 (cat) | 1.0 | 0.9 | −0.1 | 0.01
2 (not cat) | 0.0 | 0.2 | +0.2 | 0.04
3 (cat) | 1.0 | 0.4 | −0.6 | 0.36

MSE = (0.01 + 0.04 + 0.36) ⁄ 3 = 0.41 ⁄ 3 ≈ 0.1367

Why square the misses instead of just averaging them? Two reasons, and both are visible in the table (they're also the standard justification on Wikipedia's MSE page):

1. Squaring kills the sign. The misses −0.1 and +0.2 point in opposite directions, but their squares 0.01 and 0.04 are both positive — wrongness can't cancel out. If we averaged raw misses instead, a model that overshoots one example by +0.5 and undershoots another by −0.5 would score a raw average of exactly 0.0: a perfect grade for a terrible model. (Even our table's raw misses average to a misleadingly small −0.1667.)

2. Squaring punishes big misses harder. Photo 3's miss (0.6) is only 3× bigger than photo 2's (0.2), but its penalty is 9× bigger: 0.36 versus 0.04. Doubling a miss quadruples its penalty — so one confident wrong answer hurts the score far more than several small wobbles. That's the "one bar dominates" effect you saw in the widget.

One more property worth saying out loud: since every term is a square, the loss is never negative. Zero is the floor, and it's reached exactly when every prediction equals its target.
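The two reasons and the zero floor can be checked in a few lines. This is only a sketch: the `mean` helper and the variable names are invented here, and the numbers are the ones already used in the text and the table.

```python
def mean(values):
    return sum(values) / len(values)

# The terrible model from reason 1: overshoots by +0.5, undershoots by -0.5.
cancelling_misses = [0.5, -0.5]
# Signed misses from the worked table (prediction minus target).
table_misses = [0.9 - 1.0, 0.2 - 0.0, 0.4 - 1.0]

raw_cancel = mean(cancelling_misses)                   # opposite signs cancel
mse_cancel = mean([m * m for m in cancelling_misses])  # squares cannot cancel
raw_table = mean(table_misses)
mse_table = mean([m * m for m in table_misses])

# Reason 2: photo 3's miss is 3x photo 2's, so its penalty is 3 * 3 times bigger.
penalty_ratio = (0.6 ** 2) / (0.2 ** 2)

# Zero floor: predictions that equal their targets give a loss of exactly 0.
perfect = mean([(p - y) ** 2 for p, y in zip([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])])

print("raw average, cancelling model:", raw_cancel)
print("MSE, cancelling model:        ", mse_cancel)
print("raw average, table:           ", round(raw_table, 4))
print("MSE, table:                   ", round(mse_table, 4))
print("penalty ratio:                ", round(penalty_ratio, 2))
print("perfect predictions:          ", perfect)
```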

Write it in code

The whole thing is one loop: subtract, square, accumulate, divide. Run the Python version in Google Colab — you should see 0.13666666666666666, which is the table's 0.1367 and the widget's 0.137 before rounding. The Swift version prints the identical value in an Xcode playground.

let predictions = [0.9, 0.2, 0.4]   // what the network said
let targets     = [1.0, 0.0, 1.0]   // what the truth was

var total = 0.0
for (yHat, y) in zip(predictions, targets) {
    let error = yHat - y            // signed miss
    total += error * error          // square it
}
let mse = total / Double(predictions.count)  // average it

print(mse)   // 0.13666666666666666
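For Colab, a Python counterpart of the Swift loop might look like the sketch below. It follows the Swift code line for line with the same three photos; the snake_case variable names are this sketch's own choice.

```python
predictions = [0.9, 0.2, 0.4]   # what the network said
targets = [1.0, 0.0, 1.0]       # what the truth was

total = 0.0
for y_hat, y in zip(predictions, targets):
    error = y_hat - y           # signed miss
    total += error * error      # square it
mse = total / len(predictions)  # average it

print(mse)
```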

Slider widget, paper table, code — all three computed the same number. If you can write this loop (you just read it twice), you already know the formula; we're only going to name it.

The real formula — three ways to write it

What you just computed is the mean squared error (MSE): the mean of the squared errors. The name is literally the recipe read backwards. Same drill as lesson 1: one formula, three levels of shorthand.

① The easy way — spelled out

For our three photos, every term written by hand:

MSE = ( (ŷ₁−y₁)² + (ŷ₂−y₂)² + (ŷ₃−y₃)² ) ⁄ 3

Read it left to right: miss on photo 1, squared; miss on photo 2, squared; miss on photo 3, squared; add them; divide by how many there are. That's the table.

② The compact way — with Σ

For n examples instead of 3, use the Σ "loop and add" shorthand you met in lesson 1 (Wikipedia's definition, with prediction and target swapped — squaring makes the order irrelevant, since (a−b)² = (b−a)²):

MSE = (1 ⁄ n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²

In code terms: total += (yHat[i] - y[i])**2 inside the loop, / n after it. You already ran exactly this.
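Wrapped up as a reusable function for any n, the Σ form might look like this. It is a sketch: the function name `mse` and the length check are additions for illustration, and it assumes at least one example.

```python
def mse(predictions, targets):
    """Mean squared error: (1/n) * sum over i of (y_hat_i - y_i)^2."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(predictions)
    return sum((y_hat - y) ** 2 for y_hat, y in zip(predictions, targets)) / n

photos_pred = [0.9, 0.2, 0.4]
photos_true = [1.0, 0.0, 1.0]

print(mse(photos_pred, photos_true))
# Swapping the two arguments gives the same score, since (a - b)^2 == (b - a)^2.
print(mse(photos_true, photos_pred))
```

The generator expression inside `sum(...)` plays the role of Σ, and the final `/ n` is the 1 ⁄ n out front.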

③ The professional way — how Nielsen writes it

In Nielsen's book (our primary source) the same idea appears dressed for work:

C(w, b) = (1 ⁄ (2n)) · Σₓ ‖ y(x) − a ‖²

Decode it piece by piece: C is the cost, his word for loss. y(x) is the desired output for input x, our target y. a is the network's output for input x, our ŷ. The sum Σ runs over every training input x, and n is how many there are. The double bars ‖…‖ mean "length of a vector": when a network has several output neurons (say 10, one per digit), the miss is a whole list of numbers, and you square its length instead of a single difference. And why 1/(2n) instead of 1/n? The extra ÷2 is cosmetic: halving every score doesn't change which weights produce the smallest one, so the winner of training is identical. Nielsen keeps the ½ because squares produce a factor of 2 when you take derivatives, and the ½ cancels it, a bookkeeping convenience that pays off in lesson 8. If a paper's loss looks twice as big or half as big as yours, this is usually why.
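Here is a rough sketch of that formula for a network with several output neurons. The function name `quadratic_cost` and all the numbers are made up for illustration (two examples, three output neurons each); this is not code from Nielsen's book.

```python
def quadratic_cost(targets, outputs):
    """C = 1/(2n) * sum over examples of ||y(x) - a||^2."""
    n = len(targets)
    total = 0.0
    for y, a in zip(targets, outputs):
        # Squared length of the miss vector: the sum of its squared components.
        total += sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a))
    return total / (2 * n)

# Hypothetical data: 2 examples, 3 output neurons each.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
outputs_a = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]]
outputs_b = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.1]]  # closer to the targets

cost_a = quadratic_cost(targets, outputs_a)
cost_b = quadratic_cost(targets, outputs_b)

print("cost A:", cost_a)
print("cost B:", cost_b)
# Doubling both costs (the 1/n convention) leaves the ranking unchanged.
print("B wins under both conventions:", cost_b < cost_a, 2 * cost_b < 2 * cost_a)
```

With a single output neuron the inner sum has one term, and the result is exactly half of the MSE from earlier.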

The reframe that unlocks training

Now the most important sentence of phase 3. Look at Nielsen's left-hand side again: he writes the cost as C(w, b) — a function of the weights and biases. Not of the photos. The training data is frozen; you can't make the cat more cat-like. The only knobs that exist are w and b — and every time you turn one, every prediction shifts, so the loss shifts.

Make that concrete with the lesson-1 neuron a = σ(w·x + b), one input x = 1, bias 0, target y = 1. Turn only the weight knob:

w = 0:  a = σ(0) = 0.5,  loss = (0.5 − 1)² = 0.25
w = 2:  a = σ(2) ≈ 0.8808,  loss ≈ (0.8808 − 1)² ≈ 0.0142

Same data, same target — one knob moved, and the loss collapsed from 0.25 to about 0.014. So here's the reframe: the loss is a landscape over the weights, and training is nothing more than searching that landscape for the weights with the lowest loss. The mysterious word "learning" has just become an optimization problem a programmer can respect. The obvious next question — which way should each knob turn, without trying every combination? — has a beautiful answer called gradient descent. That's lesson 7: walking downhill on the loss.
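To look at a slice of that landscape, you can sweep the weight knob and print the loss at each stop. This sketch uses the same single neuron and single example as above (x = 1, bias 0, target 1); the `loss` helper and the particular weights tried are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b=0.0, x=1.0, y=1.0):
    """Squared error of the neuron a = sigmoid(w*x + b) on one example."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

# Same data every time; only the weight changes.
for w in [-2.0, 0.0, 1.0, 2.0, 4.0]:
    print(f"w = {w:+.1f}   loss = {loss(w):.4f}")
```

The data never changes between lines of output. Only w does, and the loss follows it, which is what "a function of the weights" means in practice.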

Parked for later: MSE isn't the only loss. For classification tasks ("which digit is this?") a loss called cross-entropy usually trains better — Nielsen devotes a section of chapter 3 to it. Everything you learned today (one number, function of the weights, minimize it) transfers unchanged; only the scoring recipe swaps.

Check yourself

No peeking back. Pull it from memory.

1. Why square each miss before averaging?
2. In the MSE formula, what is ŷ ("y-hat")?
3. Loss is a function of the weights. Training therefore means:

Read this next

Primary source: Nielsen, Neural Networks and Deep Learning, chapter 1 — scroll to the section "Learning with gradient descent". Its first half is exactly today's lesson in his notation: the quadratic cost, why it must be smooth, and why minimizing it over weights and biases is learning. The second half is a preview of lesson 7 — read it if you're hungry.