AI Foundations · Lesson 7 · Phase 3 — How a network learns

Gradient Descent — Rolling Downhill

Lesson 6 ended with one number: the loss, which measures how wrong the network is. Learning is nothing more mysterious than making that number smaller. Today you do the making-smaller yourself — by hand and in six lines of code — on a curve simple enough to see everything. The move you practice here is, unchanged, how the model you'll train at the end of this course learns.

Six lines of code that learn

Take the world's smallest learning problem: a "network" with a single weight w. Lesson 6 said the loss is a number you can compute for any setting of the weights. Suppose measuring it gives this curve:

L(w) = (w − 3)²

Stand at any w, and the loss is the gap to 3, squared. The perfect weight is obviously w = 3, where the loss hits 0 — but the code below never gets to "see the whole curve." Like a real network with thousands of weights, it only ever knows the ground under its feet: the loss here, and which way is downhill here. We drop it at w = 0, where the loss is (0 − 3)² = 9, and let it walk:

import Foundation

var w = 0.0
let eta = 0.1                  // learning rate: the step size
for step in 1...8 {
    let slope = 2 * (w - 3)    // dL/dw — the slope at the current w
    w -= eta * slope           // the update rule: step AGAINST the slope
    print(String(format: "step %d: w = %.4f   loss = %.4f",
                 step, w, (w - 3) * (w - 3)))
}
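The lesson mentions a Python version of this loop, which is not shown here. The following is a sketch of what an equivalent might look like, assuming nothing beyond the Swift code: same start, same eta, same eight steps. The `history` list is an addition for inspection only and is not part of the original.

```python
# Python sketch of the Swift loop above (same start, step size, and step count).
w = 0.0
eta = 0.1                      # learning rate: the step size
history = []                   # (step, w, loss) rows, kept only so they can be inspected
for step in range(1, 9):       # steps 1 through 8
    slope = 2 * (w - 3)        # dL/dw, the slope at the current w
    w -= eta * slope           # the update rule: step AGAINST the slope
    loss = (w - 3) ** 2
    history.append((step, w, loss))
    print(f"step {step}: w = {w:.4f}   loss = {loss:.4f}")
```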

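Run it: the Python version in a fresh Google Colab cell, the Swift version in a playground. Step 0 is the starting point before any update, which the loop does not print; steps 1 through 8 will match these values exactly: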

step  0        1        2        3        4        5        6        7        8
w     0.0000   0.6000   1.0800   1.4640   1.7712   2.0170   2.2136   2.3709   2.4967
loss  9.0000   5.7600   3.6864   2.3593   1.5099   0.9664   0.6185   0.3958   0.2533

The loss falls every single step: 9 → 5.76 → 3.69 → … → 0.25. The weight crawls toward 3 without ever being told that 3 is the answer. Only two lines matter — slope = 2 * (w - 3) and w -= eta * slope. The rest is bookkeeping. Let's earn those two lines.

The only calculus you need today: the slope under your feet

The derivative of the loss at a point — written dL/dw — is just the slope of the curve exactly where you are standing. Steep means a big number; downhill-to-the-right means negative; uphill-to-the-right means positive; flat means zero. For our curve, calculus gives a tidy formula (verified in Wolfram Alpha):

dL/dw = 2(w − 3)

You don't have to take that on faith — you're an engineer, so unit-test it. The slope is rise over run: nudge w by a tiny h in both directions, measure how much L changed, divide. If the formula is right, formula and measurement must agree:

func L(_ w: Double) -> Double { (w - 3) * (w - 3) }  // the loss curve
let h = 1e-6                                         // a tiny nudge
for w in [0.0, 5.0] {
    let formula  = 2 * (w - 3)
    let measured = (L(w + h) - L(w - h)) / (2 * h)   // rise over run
    print("w = \(w): formula \(formula), measured \(measured)")
}
// w = 0.0: formula -6.0, measured -6.000000000838668
// w = 5.0: formula 4.0, measured 4.000000000559112

Match. So at our start, w = 0, the slope is 2·(0 − 3) = −6: negative, meaning the curve falls to the right — the minimum is somewhere rightward. At w = 5 the slope is +4: uphill to the right, so the minimum is leftward. As 3Blue1Brown puts it: "If the slope is negative, shift to the right. If the slope is positive, shift to the left." Notice the trick that makes the code so short: subtracting the slope does both cases automatically. Subtract a negative slope and you move right; subtract a positive slope and you move left. Always downhill, no if needed.
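To see the "no if needed" point in numbers, here is a small sketch that takes one update from each of the two test points used above (w = 0 and w = 5) with η = 0.1. The helper name `slope` and the `results` dictionary are illustrative choices, not part of the lesson's code.

```python
def slope(w):
    """dL/dw for L(w) = (w - 3)**2."""
    return 2 * (w - 3)

eta = 0.1
results = {}
for start in (0.0, 5.0):                 # one start left of the minimum, one right of it
    new_w = start - eta * slope(start)   # the same subtraction in both cases
    results[start] = new_w
    direction = "right" if new_w > start else "left"
    print(f"start {start}: slope {slope(start):+.1f}, moved {direction} to {new_w:.1f}")
# start 0.0: slope -6.0, moved right to 0.6
# start 5.0: slope +4.0, moved left to 4.6
```

In both cases the new w is closer to 3 than the start was, without the code ever checking which side of the minimum it is on.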

Feel it: roll the ball yourself

Here is the same curve with a ball standing at w = 0. The dashed green line is the slope under the ball. Press step to perform one update — the exact arithmetic of the code above, so with η = 0.1 your first eight presses reproduce the table line for line. Then experiment with the η slider:

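[Interactive figure: the curve L(w) = (w − 3)² on a w axis from −2 to 8, with the minimum marked at w = 3 and the ball at w = 0. Initial readout: η = 0.10, step 0, w = 0.0000, slope = −6.0000, loss = 9.0000.]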

The most important line in machine learning

Time to name what you just did, at the usual three zoom levels. First, spelled out in words — this is the whole algorithm:

new w  =  current w  −  (step size) × (slope of the loss at the current w)

Second, the compact way every paper and textbook writes it (Wikipedia: gradient descent):

w  ←  w − η · dL/dw

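Symbol by symbol:

w: the weight, the one number being adjusted.
←: "becomes"; the value on the right replaces the old w (the w -= eta * slope line in the code).
η (eta): the learning rate, the step size; 0.1 in the first run.
dL/dw: the slope of the loss at the current w.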

That single line is learning. Lessons 8 and 9 only add machinery for computing the slope and for repeating the line efficiently; the line itself never changes.
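As a worked check, here are the first two updates written out with the update rule above, using the values from the code earlier (start at w = 0, η = 0.1, dL/dw = 2(w − 3)). The subscripts w_0, w_1, w_2 are notation added here to mark the step number; they do not appear in the lesson.

```latex
\begin{aligned}
w_1 &= w_0 - \eta \cdot \left.\frac{dL}{dw}\right|_{w_0}
     = 0 - 0.1 \cdot 2(0 - 3) = 0 + 0.6 = 0.6 \\
w_2 &= w_1 - \eta \cdot \left.\frac{dL}{dw}\right|_{w_1}
     = 0.6 - 0.1 \cdot 2(0.6 - 3) = 0.6 + 0.48 = 1.08
\end{aligned}
```

Both results agree with steps 1 and 2 of the η = 0.1 run.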

η — the dial between crawling and exploding

You felt it in the widget; here is the η = 1.1 run from w = 0, computed exactly:

step  0        1        2        3        4        5        6        7        8
w     0.00     6.60     −1.32    8.18     −3.22    10.46    −5.96    13.75    −9.90
loss  9.00     12.96    18.66    26.87    38.70    55.73    80.24    115.55   166.40

Each step is so large it doesn't just overshoot the minimum — it lands on the far slope higher than it started, where the slope is steeper, so the next step is even bigger. Every bounce ends 1.2× farther from 3, on the opposite side, and the loss climbs forever: the run diverges. This is a real failure mode, not a toy artifact — in Wikipedia's words, "a η too large would lead to overshoot and divergence," and picking a good η is a genuine practical problem in training real networks. Too small wastes compute crawling; too big explodes. (You'll meet the standard tricks in phase 4 — for now, 0.1 served us well.)
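Where the 1.2× comes from: for this particular curve, substituting dL/dw = 2(w − 3) into the update rule shows that one step multiplies the gap (w − 3) by (1 − 2η). With η = 0.1 that factor is 0.8 (the gap shrinks); with η = 1.1 it is −1.2 (the gap flips sides and grows). The sketch below compares a few step sizes over eight steps. The values 0.01 and 0.5 are illustrative additions, not taken from the lesson, and the function name `run` is hypothetical.

```python
def run(eta, steps=8, w=0.0):
    """Apply the update rule `steps` times on L(w) = (w - 3)**2; return final w and loss."""
    for _ in range(steps):
        w -= eta * 2 * (w - 3)
    return w, (w - 3) ** 2

final_loss = {}
for eta in (0.01, 0.1, 0.5, 1.1):
    w, loss = run(eta)
    final_loss[eta] = loss
    factor = 1 - 2 * eta          # each step multiplies the gap (w - 3) by this
    print(f"eta = {eta}: gap factor {factor:+.2f}, w = {w:.4f}, loss = {loss:.4f}")
```

On this curve the loss shrinks only when the factor's magnitude is below 1. That threshold is specific to L(w) = (w − 3)²; other loss curves have different limits.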

From one slope to thousands: the gradient

A real network's loss isn't a curve over one weight — it's a landscape over all of them (the small digit-reading network in the 3Blue1Brown video has 13,002 weights and biases). But nothing new happens: each weight gets its own slope, answering "if I nudge this weight, how fast does the loss change?" Stack all those slopes into one vector and you get the gradient, written ∇L ("nabla L"). The third, fully professional form of the update rule is then:

w  ←  w − η · ∇L(w)

Bold w is the whole weight list (a [Double], like the vectors of lesson 3), and the line means: every weight steps against its own slope, all at once. The gradient points in the direction of steepest ascent, so stepping along the negative gradient is the change to the weights that decreases the loss fastest — which is why the whole algorithm is called gradient descent (3Blue1Brown, Wikipedia). One question remains: how do you get thousands of slopes without nudging thousands of weights one by one? That trick is backpropagation — lesson 8.
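A minimal sketch of the vector form, on a hypothetical two-weight loss L(w) = (w₀ − 3)² + (w₁ + 1)² that is not from the lesson. The gradient is estimated the slow way the paragraph describes, by nudging each weight on its own; backpropagation is the faster replacement for that step. The function names and the 50-step count are illustrative assumptions.

```python
def loss(w):
    """Hypothetical two-weight loss with its minimum at w = [3, -1]."""
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Estimate each weight's slope by nudging that weight alone (central difference)."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

w = [0.0, 0.0]
eta = 0.1
for step in range(50):
    grad = numerical_gradient(loss, w)
    # Every weight steps against its own slope, all at once.
    w = [wi - eta * gi for wi, gi in zip(w, grad)]

print([round(wi, 3) for wi in w], round(loss(w), 6))
```

Note that `numerical_gradient` evaluates the loss twice per weight, which is exactly the cost that becomes impractical with thousands of weights.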

Check yourself

No peeking back. Pull it from memory.

1. Standing at some w, what does dL/dw tell you?
2. In the update rule w ← w − η·dL/dw, the η controls:
3. In the verified run with η = 1.1, what happened?

Watch this next

Primary source: "Gradient descent, how neural networks learn" by 3Blue1Brown. It animates today's ball-on-a-curve, then lifts it to the full 13,002-dimensional landscape of a digit-reading network. (Heads-up: the video says cost where we say loss — same number, two names.) The piece it deliberately postpones — how all those slopes get computed — is exactly our Lesson 8.