Lesson 6 ended with one number: the loss, which measures how wrong the network is. Learning is nothing more mysterious than making that number smaller. Today you do the making-smaller yourself — by hand and in six lines of code — on a curve simple enough to see everything. The move you practice here is, unchanged, how the model you'll train at the end of this course learns.
Take the world's smallest learning problem: a "network" with a single weight w. Lesson 6 said the loss is a number you can compute for any setting of the weights. Suppose measuring it gives this curve:

$$L(w) = (w - 3)^2$$
Stand at any w, and the loss is the gap to 3, squared. The perfect weight is obviously w = 3, where the loss hits 0 — but the code below never gets to "see the whole curve." Like a real network with thousands of weights, it only ever knows the ground under its feet: the loss here, and which way is downhill here. We drop it at w = 0, where the loss is (0 − 3)² = 9, and let it walk:
```swift
import Foundation

var w = 0.0
let eta = 0.1                    // learning rate: the step size
for step in 1...8 {
    let slope = 2 * (w - 3)      // dL/dw: the slope at the current w
    w -= eta * slope             // the update rule: step AGAINST the slope
    print(String(format: "step %d: w = %.4f loss = %.4f",
                 step, w, (w - 3) * (w - 3)))
}
```

```python
w = 0.0
eta = 0.1                        # learning rate: the step size
for step in range(1, 9):
    slope = 2 * (w - 3)          # dL/dw: the slope at the current w
    w = w - eta * slope          # the update rule: step AGAINST the slope
    print(f"step {step}: w = {w:.4f} loss = {(w - 3) ** 2:.4f}")
```
Run it: the Python version in a fresh Google Colab cell, the Swift version in a playground. You will get exactly these values (step 0 is the starting point before any update, which the loop itself does not print):
| step | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| w | 0.0000 | 0.6000 | 1.0800 | 1.4640 | 1.7712 | 2.0170 | 2.2136 | 2.3709 | 2.4967 |
| loss | 9.0000 | 5.7600 | 3.6864 | 2.3593 | 1.5099 | 0.9664 | 0.6185 | 0.3958 | 0.2533 |
The loss falls every single step: 9 → 5.76 → 3.69 → … → 0.25. The weight crawls toward 3 without ever being told that 3 is the answer. Only two lines matter — slope = 2 * (w - 3) and w -= eta * slope. The rest is bookkeeping. Let's earn those two lines.
The derivative of the loss at a point (written dL/dw) is just the slope of the curve exactly where you are standing. Steep means a big number; downhill-to-the-right means negative; uphill-to-the-right means positive; flat means zero. For our curve, calculus gives a tidy formula (verified in Wolfram Alpha):

$$\frac{dL}{dw} = 2(w - 3)$$
You don't have to take that on faith — you're an engineer, so unit-test it. The slope is rise over run: nudge w by a tiny h in both directions, measure how much L changed, divide. If the formula is right, formula and measurement must agree:
```swift
func L(_ w: Double) -> Double { (w - 3) * (w - 3) }   // the loss curve
let h = 1e-6                                           // a tiny nudge
for w in [0.0, 5.0] {
    let formula = 2 * (w - 3)
    let measured = (L(w + h) - L(w - h)) / (2 * h)     // rise over run
    print("w = \(w): formula \(formula), measured \(measured)")
}
// w = 0.0: formula -6.0, measured -6.000000000838668
// w = 5.0: formula 4.0, measured 4.000000000559112
```

```python
L = lambda w: (w - 3) ** 2                      # the loss curve
h = 1e-6                                        # a tiny nudge
for w in [0.0, 5.0]:
    formula = 2 * (w - 3)
    measured = (L(w + h) - L(w - h)) / (2 * h)  # rise over run
    print(f"w = {w}: formula {formula}, measured {measured:.6f}")
# w = 0.0: formula -6.0, measured -6.000000
# w = 5.0: formula 4.0, measured 4.000000
```
Match. So at our start, w = 0, the slope is 2·(0 − 3) = −6: negative, meaning the curve falls to the right — the minimum is somewhere rightward. At w = 5 the slope is +4: uphill to the right, so the minimum is leftward. As 3Blue1Brown puts it: "If the slope is negative, shift to the right. If the slope is positive, shift to the left." Notice the trick that makes the code so short: subtracting the slope does both cases automatically. Subtract a negative slope and you move right; subtract a positive slope and you move left. Always downhill, no if needed.
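One way to check the "no if needed" claim is to take a single update from each side of the minimum. This is a minimal sketch on the lesson's curve with η = 0.1; the helper name `one_step` is made up for illustration.

```python
def one_step(w, eta=0.1):
    """Apply the update rule once on the curve L(w) = (w - 3)**2."""
    slope = 2 * (w - 3)       # dL/dw at the current w
    return w - eta * slope    # subtracting the slope heads downhill on either side

for start in [0.0, 5.0]:
    after = one_step(start)
    direction = "right" if after > start else "left"
    print(f"start {start}: slope {2 * (start - 3):+.1f}, moved {direction} to {after:.2f}")
# start 0.0: slope -6.0, moved right to 0.60
# start 5.0: slope +4.0, moved left to 4.60
```

The same line of arithmetic produces both directions; the sign of the slope does the deciding.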
Here is the same curve with a ball standing at w = 0. The dashed green line is the slope under the ball. Press step to perform one update — the exact arithmetic of the code above, so with η = 0.1 your first eight presses reproduce the table line for line. Then experiment with the η slider:
Time to name what you just did, at the usual three zoom levels. First, spelled out in words (this is the whole algorithm):

new weight = old weight − learning rate × slope at the current weight
Second, the compact way every paper and textbook writes it (Wikipedia: gradient descent):

$$w \leftarrow w - \eta \, \frac{dL}{dw}$$
Symbol by symbol:
- w is the weight being trained. Just a Double.
- ← means "becomes": assignment, math's way of writing w -= ….
- η is the Greek letter eta, the learning rate: how big a step you take. Wikipedia calls it "step size or learning rate" (same dial, two names).
- dL/dw is read as one symbol, not a fraction to compute: "the derivative of L with respect to w", the slope of the loss curve at the current w. For our curve it happens to equal 2(w − 3).

That single line is learning. Lessons 8 and 9 only add machinery for computing the slope and for repeating the line efficiently; the line itself never changes.
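To make "the line itself never changes" concrete, here is a sketch in which the update line lives inside a small function and only the source of the slope is swapped. The names `descend`, `formula_slope`, and `measured_slope` are illustrative, not from any library; the measured slope reuses the nudge-and-divide check from earlier in the lesson.

```python
def L(w):
    return (w - 3) ** 2                      # the lesson's loss curve

def formula_slope(w):
    return 2 * (w - 3)                       # slope from calculus

def measured_slope(w, h=1e-6):
    return (L(w + h) - L(w - h)) / (2 * h)   # slope from rise over run

def descend(slope_fn, w=0.0, eta=0.1, steps=8):
    for _ in range(steps):
        w = w - eta * slope_fn(w)            # the update rule, identical either way
    return w

print(f"{descend(formula_slope):.4f}")
print(f"{descend(measured_slope):.4f}")
```

Both calls should land on the same w as step 8 of the table, to four decimals. How the slope is obtained is a detail the update line does not care about.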
You felt it in the widget; here is the η = 1.1 run from w = 0, computed exactly:
| step | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| w | 0.00 | 6.60 | −1.32 | 8.18 | −3.22 | 10.46 | −5.96 | 13.75 | −9.90 |
| loss | 9.00 | 12.96 | 18.66 | 26.87 | 38.70 | 55.73 | 80.24 | 115.55 | 166.40 |
Each step is so large it doesn't just overshoot the minimum — it lands on the far slope higher than it started, where the slope is steeper, so the next step is even bigger. Every bounce ends 1.2× farther from 3, on the opposite side, and the loss climbs forever: the run diverges. This is a real failure mode, not a toy artifact — in Wikipedia's words, "a η too large would lead to overshoot and divergence," and picking a good η is a genuine practical problem in training real networks. Too small wastes compute crawling; too big explodes. (You'll meet the standard tricks in phase 4 — for now, 0.1 served us well.)
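The 1.2× figure can be read straight off the update rule. Subtract 3 from both sides of w ← w − η · 2(w − 3) and the distance to the minimum becomes (1 − 2η)(w − 3): every step multiplies it by |1 − 2η|. That is 0.8 for η = 0.1 and 1.2 for η = 1.1. This shortcut holds for this particular curve only. The sketch below tries a few step sizes; `final_distance` is a made-up helper name.

```python
def final_distance(eta, w=0.0, steps=8):
    """Run the update rule on L(w) = (w - 3)**2 and return |w - 3| at the end."""
    for _ in range(steps):
        w = w - eta * 2 * (w - 3)
    return abs(w - 3)

for eta in [0.01, 0.1, 0.5, 1.0, 1.1]:
    factor = abs(1 - 2 * eta)   # per-step multiplier on the distance to 3
    print(f"eta = {eta}: factor {factor:.2f}, "
          f"distance after 8 steps {final_distance(eta):.4f}")
```

On this curve, η = 0.01 barely moves, η = 0.1 shrinks the distance steadily, η = 0.5 lands on 3 in a single step, η = 1.0 bounces between 0 and 6 without progress, and η = 1.1 grows the distance every step.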
A real network's loss isn't a curve over one weight; it's a landscape over all of them (the small digit-reading network in the 3Blue1Brown video has 13,002 weights and biases). But nothing new happens: each weight gets its own slope, answering "if I nudge this weight, how fast does the loss change?" Stack all those slopes into one vector and you get the gradient, written ∇L ("nabla L"). The third, fully professional form of the update rule is then:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla L$$
Bold w is the whole weight list (a [Double], like the vectors of lesson 3), and the line means: every weight steps against its own slope, all at once. The gradient points in the direction of steepest ascent, so stepping along the negative gradient is the change to the weights that decreases the loss fastest — which is why the whole algorithm is called gradient descent (3Blue1Brown, Wikipedia). One question remains: how do you get thousands of slopes without nudging thousands of weights one by one? That trick is backpropagation — lesson 8.
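As a sketch of the vector form, the loop below trains two weights at once on a made-up bowl-shaped loss, (w₀ − 3)² + (w₁ + 1)², chosen only because its minimum at [3, −1] is easy to check. Each slope is measured the slow way, by nudging one weight at a time, which is the cost that grows with the number of weights. The names `loss` and `numerical_gradient` and the target values are assumptions for illustration, not part of the lesson's code.

```python
def loss(w):
    # Hypothetical two-weight loss with its minimum at w = [3, -1]
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Measure each weight's slope by nudging that weight alone."""
    grad = []
    for i in range(len(w)):
        up, down = w.copy(), w.copy()
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))   # rise over run for weight i
    return grad

w = [0.0, 0.0]
eta = 0.1
for step in range(50):
    grad = numerical_gradient(loss, w)
    # every weight steps against its own slope, all at once
    w = [wi - eta * gi for wi, gi in zip(w, grad)]

print([round(wi, 3) for wi in w], round(loss(w), 6))
```

After 50 steps the weights should round to [3.0, -1.0]. Note that `numerical_gradient` evaluates the loss twice per weight, so a 13,002-weight network would need 26,004 loss evaluations for a single update.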
No peeking back. Pull it from memory.
Primary source: "Gradient descent, how neural networks learn" by 3Blue1Brown. It animates today's ball-on-a-curve, then lifts it to the full 13,002-dimensional landscape of a digit-reading network. (Heads-up: the video says cost where we say loss — same number, two names.) The piece it deliberately postpones — how all those slopes get computed — is exactly our Lesson 8.