Lessons 1–5 built a machine that turns inputs into a prediction: numbers in, forward pass through the layers, number out. But that machine starts life with random weights, so its predictions start out as garbage. Phase 3 is about fixing that — teaching the network. And all of teaching rests on one almost embarrassingly simple idea: before anything can get better, it needs a score that says how wrong it currently is. Today you compute that score by hand.
Three new words, all of them friendly. The number your network outputs for an example is its prediction, written ŷ — read it aloud as "y-hat"; the hat is math's way of saying "an estimate of y". The true answer stored in your training data is the target y (also called a label). And the single number that summarizes how far all the predictions are from all the targets is the loss — in some books, like Nielsen's, it's called the cost; same thing.
Why one smooth number instead of just counting how many answers the network got right? Nielsen's argument: a tiny nudge to one weight usually doesn't flip any answer from wrong to right, so the count doesn't move — you get no signal about whether the nudge helped. A smooth loss moves a little with every nudge, so it always tells you whether you just made things better or worse. That property is the entire reason learning is possible, and lesson 7 cashes it in.
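To see that argument in numbers, here is a minimal sketch. The three inputs, the starting weight 0.5, the nudge of 0.01, and the 0.5 decision threshold are all made-up values for illustration, not taken from the lesson's cat detector. It nudges one weight slightly and compares the count of correct answers with the smooth loss, before and after.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data (an assumption, not the lesson's cat photos).
inputs = [1.0, -1.0, -2.0]
targets = [1.0, 0.0, 1.0]

def predict(w, b=0.0):
    # One sigmoid neuron applied to every input.
    return [sigmoid(w * x + b) for x in inputs]

def count_correct(predictions):
    # Call a prediction "correct" if it lands on the target's side of 0.5.
    return sum((p >= 0.5) == (t == 1.0) for p, t in zip(predictions, targets))

def mse(predictions):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

before = predict(w=0.5)
after = predict(w=0.51)   # a tiny nudge to the single weight

print(count_correct(before), count_correct(after))  # the count does not move
print(mse(before), mse(after))                      # the loss does move
```

The count gives the same answer both times, so it cannot say whether the nudge helped; the two loss values differ, so the loss can.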
A tiny cat detector looked at three photos. Its output neuron ends in a sigmoid (lesson 1), so each prediction is between 0 and 1; the target is 1 for a cat, 0 for not-a-cat. Photos 1 and 3 are cats, photo 2 isn't. The sliders are the network's predictions — the bars show the squared error of each photo and the average of the three, the MSE. Try this: fix photo 3 (drag it toward 1.00) and watch the average collapse. Then drag it to 0.00 and watch one terrible miss take over the whole score. Then match all three targets exactly.
The recipe you were just driving has three steps. Step 1: for each example, take the signed miss, ŷ − y. Step 2: square it. Step 3: average the squares. Here is the widget's starting position, worked out completely:
| photo | target y | prediction ŷ | miss ŷ−y | squared (ŷ−y)² |
|---|---|---|---|---|
| 1 (cat) | 1.0 | 0.9 | −0.1 | 0.01 |
| 2 (not cat) | 0.0 | 0.2 | +0.2 | 0.04 |
| 3 (cat) | 1.0 | 0.4 | −0.6 | 0.36 |
| mean | | | −0.1667 | 0.1367 (the MSE) |
Why square the misses instead of just averaging them? Two reasons, and both are visible in the table (they're also the standard justification on Wikipedia's MSE page):
1. Squaring kills the sign. The misses −0.1 and +0.2 point in opposite directions, but their squares 0.01 and 0.04 are both positive — wrongness can't cancel out. If we averaged raw misses instead, a model that overshoots one example by +0.5 and undershoots another by −0.5 would score a raw average of exactly 0.0: a perfect grade for a terrible model. (Even our table's raw misses average to a misleadingly small −0.1667.)
2. Squaring punishes big misses harder. Photo 3's miss (0.6) is only 3× bigger than photo 2's (0.2), but its penalty is 9× bigger: 0.36 versus 0.04. Doubling a miss quadruples its penalty — so one confident wrong answer hurts the score far more than several small wobbles. That's the "one bar dominates" effect you saw in the widget.
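Both reasons can be checked in a few lines. This sketch reuses the +0.5 / −0.5 pair and the table's misses from above; the helper name `mean` is just illustrative.

```python
def mean(values):
    return sum(values) / len(values)

# Reason 1: raw misses cancel, squared misses cannot.
cancelling = [0.5, -0.5]                        # one overshoot, one undershoot
raw_average = mean(cancelling)
squared_average = mean([m * m for m in cancelling])
print(raw_average, squared_average)             # 0.0 versus 0.25

table_misses = [0.9 - 1.0, 0.2 - 0.0, 0.4 - 1.0]
print(round(mean(table_misses), 4))             # the misleadingly small raw average

# Reason 2: a miss three times bigger earns a penalty about nine times bigger.
small, big = 0.2, 0.6
print(round((big * big) / (small * small), 6))
```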
One more property worth saying out loud: since every term is a square, the loss is never negative. Zero is the floor, and it's reached exactly when every prediction equals its target.
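A quick sanity check of that floor, as a sketch. The `mse` helper below is our own wrapper around the three-step recipe, not a library function, and the "worst" predictions are a made-up extreme case.

```python
def mse(predictions, targets):
    """Mean squared error: subtract, square, average."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [1.0, 0.0, 1.0]

perfect = mse(targets, targets)            # every prediction equals its target
starting = mse([0.9, 0.2, 0.4], targets)   # the widget's starting position
worst = mse([0.0, 1.0, 0.0], targets)      # hypothetical: every prediction fully wrong

print(perfect, starting, worst)
```

Because a sigmoid output stays between 0 and 1, no single squared miss can exceed 1, so for this detector the loss is also capped at 1.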
The whole thing is one loop: subtract, square, accumulate, divide. Run the Python version in Google Colab — you should see 0.13666666666666666, which is the table's 0.1367 and the widget's 0.137 before rounding. The Swift version prints the identical value in an Xcode playground.
```swift
let predictions = [0.9, 0.2, 0.4]   // what the network said
let targets = [1.0, 0.0, 1.0]       // what the truth was

var total = 0.0
for (yHat, y) in zip(predictions, targets) {
    let error = yHat - y             // signed miss
    total += error * error           // square it
}
let mse = total / Double(predictions.count)   // average it
print(mse)   // 0.13666666666666666
```
```python
predictions = [0.9, 0.2, 0.4]   # what the network said
targets = [1.0, 0.0, 1.0]       # what the truth was

total = 0.0
for y_hat, y in zip(predictions, targets):
    error = y_hat - y            # signed miss
    total += error * error       # square it
mse = total / len(predictions)   # average it
print(mse)   # 0.13666666666666666
```
Slider widget, paper table, code — all three computed the same number. If you can write this loop (you just read it twice), you already know the formula; we're only going to name it.
What you just computed is the mean squared error: the mean of the squared errors. The name is literally the recipe read backwards. Same drill as lesson 1: one formula, three levels of shorthand.
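For our three photos, every term written by hand:

$$\mathrm{MSE} = \frac{(0.9 - 1)^2 + (0.2 - 0)^2 + (0.4 - 1)^2}{3} = \frac{0.01 + 0.04 + 0.36}{3} \approx 0.1367$$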
Read it left to right: miss on photo 1, squared; miss on photo 2, squared; miss on photo 3, squared; add them; divide by how many there are. That's the table.
For n examples instead of 3, use the Σ "loop and add" shorthand you met in lesson 1 (Wikipedia's definition, with prediction and target swapped; squaring makes the order irrelevant, since (a−b)² = (b−a)²):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

- n: how many examples you're grading (we had 3).
- i: the loop counter: example 1, example 2, … example n.
- ŷᵢ: "y-hat-i", the prediction for example i.
- yᵢ: the target for example i, so (ŷᵢ − yᵢ) is the signed miss.
- Σ … (1/n): add them all up, divide by the count: an average.

In code terms: total += (yHat[i] - y[i])**2 inside the loop, / n after it. You already ran exactly this.
In Nielsen's book (our primary source) the same idea appears dressed for work:

$$C(w, b) \equiv \frac{1}{2n} \sum_{x} \lVert y(x) - a \rVert^2$$
Decode it piece by piece: C is the cost — his word for loss. a is the network's output for input x — his ŷ. The double bars ‖…‖ mean "length of a vector": when a network has several output neurons (say 10, one per digit), the miss is a whole list of numbers, and you square its length instead of a single difference. And 1/2n instead of 1/n? The extra ÷2 is cosmetic: halving every score doesn't change which weights produce the smallest one, so the winner of training is identical. Nielsen keeps the ½ because squares produce a factor of 2 when you take derivatives, and the ½ cancels it — a bookkeeping convenience we'll actually see pay off in lesson 8. If a paper's loss looks twice as big or small as yours, this is usually why.
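As a rough sketch of that decoding, the function below computes a Nielsen-style quadratic cost for outputs that are lists of numbers. The name `quadratic_cost` and the second candidate's predictions are made up for illustration. With a single output neuron the squared length is just the squared miss, so the cost comes out as half our MSE, and the candidate that wins under one score wins under the other.

```python
def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over examples of the squared length of (target - output)."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        # Squared length of the miss vector: sum of its squared components.
        total += sum((yj - aj) ** 2 for aj, yj in zip(a, y))
    return total / (2 * n)

def mse(outputs, targets):
    # Same sum divided by n instead of 2n, i.e. twice the quadratic cost.
    return 2 * quadratic_cost(outputs, targets)

# One output neuron per example, so each output is a one-element list.
targets = [[1.0], [0.0], [1.0]]
candidate_a = [[0.9], [0.2], [0.4]]   # the widget's starting position
candidate_b = [[0.9], [0.2], [0.8]]   # hypothetical: photo 3 improved

for name, outputs in [("a", candidate_a), ("b", candidate_b)]:
    print(name, quadratic_cost(outputs, targets), mse(outputs, targets))
```

The same function would accept ten-element outputs for a digit classifier without changes, since the inner sum already runs over every component of the miss.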
Now the most important sentence of phase 3. Look at Nielsen's left-hand side again: he writes the cost as C(w, b) — a function of the weights and biases. Not of the photos. The training data is frozen; you can't make the cat more cat-like. The only knobs that exist are w and b — and every time you turn one, every prediction shifts, so the loss shifts.
Make that concrete with the lesson-1 neuron a = σ(w·x + b), one input x = 1, bias 0, target y = 1. Turn only the weight knob:
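A minimal sketch of turning that knob in code. The three weight values are an illustrative choice, not necessarily the ones the lesson tabulates; w = 0 gives a loss of 0.25 and w = 2 gives roughly 0.014.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, b, y = 1.0, 0.0, 1.0          # one input, bias 0, target 1 (from the text)

def loss(w):
    a = sigmoid(w * x + b)       # the lesson-1 neuron
    return (a - y) ** 2          # squared error of the single example

for w in [0.0, 1.0, 2.0]:        # assumed knob positions
    print(f"w = {w:.1f}  prediction = {sigmoid(w * x + b):.4f}  loss = {loss(w):.4f}")
```

Nothing about the data changes between the three rows; only w does, which is the sense in which the loss is a function of the weights.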
Same data, same target — one knob moved, and the loss collapsed from 0.25 to about 0.014. So here's the reframe: the loss is a landscape over the weights, and training is nothing more than searching that landscape for the weights with the lowest loss. The mysterious word "learning" has just become an optimization problem a programmer can respect. The obvious next question — which way should each knob turn, without trying every combination? — has a beautiful answer called gradient descent. That's lesson 7: walking downhill on the loss.
Primary source: Nielsen, Neural Networks and Deep Learning, chapter 1 — scroll to the section "Learning with gradient descent". Its first half is exactly today's lesson in his notation: the quadratic cost, why it must be smooth, and why minimizing it over weights and biases is learning. The second half is a preview of lesson 7 — read it if you're hungry.