AI Foundations · Lesson 8 · Phase 3 — How a network learns

Backpropagation — the Error Flows Backwards

Lesson 7 ended on a cliffhanger: gradient descent can walk a weight downhill, but only if someone hands it the slope dL/dw — how much the loss changes when that weight nudges. Today you compute that slope yourself, by hand, through a whole neuron. The trick is one rule, and it's just multiplication. This is the algorithm that makes every trained model — including the one you'll train in Phase 4 — possible.

The problem: w is buried at the end of a chain

Take the smallest possible "network": one neuron, one input, learning from one example. You know every stage already — the weighted sum from lesson 1, the squared-error loss from lesson 6 (with a single example, the average in MSE is just that one term):

x  →  z = w·x + b  →  a = σ(z)  →  L = (a − y)²

Here's the trouble. The loss L never touches w directly. Wiggle w and the effect travels through a chain: w moves z, z moves a, and only a moves L. So "how much does L change per nudge of w" has to account for every link. In a real network a weight in layer 1 might be ten links away from the loss — same problem, longer chain.
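To see the chain in motion, here is a minimal Python sketch of the three stages. The function names (`weighted_sum`, `sigmoid`, `loss`, `forward`) and the concrete values are illustrative choices for this sketch, not something earlier lessons fixed. It wiggles w and shows that the effect reaches L only by passing through z and a.

```python
import math

def weighted_sum(w, x, b):
    """Link 1: z = w*x + b."""
    return w * x + b

def sigmoid(z):
    """Link 2: a = sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def loss(a, y):
    """Link 3: squared error for a single example."""
    return (a - y) ** 2

def forward(w, x, b, y):
    """Run the whole chain and return every intermediate value."""
    z = weighted_sum(w, x, b)
    a = sigmoid(z)
    L = loss(a, y)
    return z, a, L

# Illustrative values for this sketch
x, y, b = 1.0, 1.0, 0.0

z1, a1, L1 = forward(0.5, x, b, y)
z2, a2, L2 = forward(0.6, x, b, y)   # wiggle w by 0.1

print(f"w=0.5: z={z1:.6f}  a={a1:.6f}  L={L1:.6f}")
print(f"w=0.6: z={z2:.6f}  a={a2:.6f}  L={L2:.6f}")
```

Notice that `loss` never receives w as an argument. Every change in L on the second line arrived second-hand, through z and then a.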

Feel it: step through the chain

Concrete numbers: input x = 1.0, target y = 1.0, current weight w = 0.5, bias b = 0.0. Press forward three times — values fill left to right, exactly the forward pass from lesson 4. Then press backward three times and watch each link report its own local slope ("if my input nudges by a tiny amount, my output moves this many times that nudge") while a running product accumulates right to left. The final number is the prize: dL/dw.

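[Interactive widget: the chain drawn as four boxes. Input: x = 1.0 (given). Weighted sum: z = w·x + b with w = 0.5, b = 0.0; local slopes dz/dw = x = 1.0 and dz/db = 1.0. Activation: a = σ(z); local slope da/dz. Loss: L = (a − y)² with y = 1.0; local slope dL/da. The values of z, a, L, da/dz, and dL/da start blank and fill in as you press forward and backward.]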

Name what you just did: the chain rule

The rule you just used has a name — the chain rule — and it says exactly one thing: to get the slope through a chain, multiply the local slopes of the links. Three ways to write it, easiest first.

① The easy way — spelled out

(how L moves per nudge of w) =
  (how L moves per nudge of a) × (how a moves per nudge of z) × (how z moves per nudge of w)

Why multiplication? Nudge w by a hair. That hair gets scaled by the first link to become a nudge in z, scaled again into a nudge in a, scaled again into a change in L. Three scalings in a row — scalings multiply. If you only remember this one, you understand backpropagation.
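You can watch the three scalings happen. The sketch below is a rough illustration, assuming the lesson's numbers (x = 1.0, y = 1.0, w = 0.5, b = 0.0) and an arbitrarily chosen nudge size `eps`; the variable names are made up for this example. It nudges w once, measures how much each link scaled the nudge it received, and compares the product of those scalings with the change in L measured end to end.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, w, b = 1.0, 1.0, 0.5, 0.0
eps = 1e-6  # the "hair": a small nudge to w (size chosen for illustration)

# Before the nudge
z0 = w * x + b
a0 = sigmoid(z0)
L0 = (a0 - y) ** 2

# After the nudge
z1 = (w + eps) * x + b
a1 = sigmoid(z1)
L1 = (a1 - y) ** 2

# How much each link scaled the nudge it received
scale_w_to_z = (z1 - z0) / eps          # nudge in z per nudge in w
scale_z_to_a = (a1 - a0) / (z1 - z0)    # nudge in a per nudge in z
scale_a_to_L = (L1 - L0) / (a1 - a0)    # change in L per nudge in a

product = scale_a_to_L * scale_z_to_a * scale_w_to_z
end_to_end = (L1 - L0) / eps            # measured directly, skipping the links

print(scale_w_to_z, scale_z_to_a, scale_a_to_L)
print(product, end_to_end)
```

The two numbers on the last line agree because the intermediate nudges cancel when you multiply the three ratios. That cancellation is the chain rule in miniature.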

② The compact way — slope notation

"How L moves per nudge of w" is what lesson 7 called a derivative, written dL/dw. Same sentence, shorter symbols:

dL/dw = dL/da · da/dz · dz/dw
dL/db = dL/da · da/dz · dz/db

Read each fraction-looking thing as one symbol: "slope of the top per nudge of the bottom." Notice the bookkeeping beauty: da appears on the bottom then on the top, dz likewise, so the chain links snap together like LEGO. (In papers you'll see the curly ∂ instead of d; 3Blue1Brown writes ∂C/∂w, where their C, "cost", is our loss L, as lesson 7 flagged. The ∂ flags "this function has several knobs; we're nudging just this one." Read it exactly the same way.)

③ The pro way — local slopes filled in

Each link in our chain has a known formula, so each local slope does too:

dL/da = 2(a − y)    because L = (a − y)²
da/dz = σ(z)·(1 − σ(z)) = a·(1 − a)    the sigmoid's own slope
dz/dw = x   and   dz/db = 1    because z = w·x + b

Where do these come from?

dL/da = 2(a − y). This is the slope of the loss bowl from lesson 7: L = (a − y)² is a parabola in a, and its slope at any point is twice the miss. (Sanity check by nudging, lesson-7 style: at our numbers 2(a − y) = −0.755081, and measuring L just before and just after a tiny nudge of a gives −0.755081 too.)

da/dz = σ(z)(1 − σ(z)). The famous sigmoid identity: the sigmoid's slope at any point is its own output times one minus its own output (Wikipedia: Logistic function, where it appears as f(x)(1 − f(x))). Two gifts in one. First, no new function to evaluate — you already computed a = σ(z) during the forward pass, so the slope is just a·(1 − a), almost free. Second, you can verify it numerically, no algebra required: at z = 0.5, σ(z)(1 − σ(z)) = 0.2350037, and nudging z and re-measuring σ gives 0.2350037. Identity confirmed by experiment.

dz/dw = x. Since z = w·x + b, the weight's effect on z is scaled by whatever input rides on that connection. Nudge w by 0.01 and z moves by 0.01·x. And dz/db = 1: the bias adds in raw, so its nudge passes straight through.
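All four local slopes can be audited the same way, by nudging. The sketch below is one possible check, assuming the lesson's numbers; `numeric_slope` is a helper invented for this example, and it nudges both ways (just before and just after the point) rather than one way.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def numeric_slope(f, at, eps=1e-6):
    """Estimate the slope of f at a point by nudging both ways (central difference)."""
    return (f(at + eps) - f(at - eps)) / (2 * eps)

x, y, w, b = 1.0, 1.0, 0.5, 0.0
z = w * x + b
a = sigmoid(z)

# Each entry: (name, value from the formula, value measured by nudging)
checks = [
    ("dL/da", 2 * (a - y), numeric_slope(lambda a_: (a_ - y) ** 2, a)),
    ("da/dz", a * (1 - a), numeric_slope(sigmoid, z)),
    ("dz/dw", x,           numeric_slope(lambda w_: w_ * x + b, w)),
    ("dz/db", 1.0,         numeric_slope(lambda b_: w * x + b_, b)),
]

for name, formula, measured in checks:
    print(f"{name}: formula={formula:+.6f}  measured={measured:+.6f}")
```

Each link is checked in isolation: only its own input is nudged, and only its own output is re-measured.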

The worked numbers, end to end

Here is the full pass you stepped through in the widget, on paper. Forward, left to right:

z = 0.5·1.0 + 0.0 = 0.5
a = σ(0.5) = 0.622459
L = (0.622459 − 1)² = 0.142537

Backward, right to left — each local slope, then the multiplication:

dL/da = 2(0.622459 − 1) = −0.755081
da/dz = 0.622459·(1 − 0.622459) = 0.235004
dz/dw = x = 1.0   and   dz/db = 1.0
dL/dw = (−0.755081)·(0.235004)·(1.0) = −0.177447
dL/db = (−0.755081)·(0.235004)·(1.0) = −0.177447

Sanity-check the sign, lesson-7 style: the slope is negative, so gradient descent steps opposite it — it will increase w. Does that make sense? Our activation 0.622 is below the target 1.0; a bigger w raises z, which raises a, which shrinks the miss. The math agrees with common sense. That's the feeling to chase.
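The sign check can be run as an experiment as well. This is a small sketch, assuming lesson 7's update rule w ← w − η·dL/dw and a learning rate of η = 1.0 picked purely for illustration (the lesson does not specify one). It takes a single step and confirms that w goes up and the loss goes down.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_slope(w, x=1.0, y=1.0, b=0.0):
    """Forward pass, then the chain-rule slope dL/dw."""
    a = sigmoid(w * x + b)
    L = (a - y) ** 2
    dL_dw = 2 * (a - y) * a * (1 - a) * x
    return L, dL_dw

eta = 1.0   # learning rate: a hypothetical value, not one the lesson specifies
w = 0.5

L_before, slope = loss_and_slope(w)
w_new = w - eta * slope            # lesson 7's update: step opposite the slope
L_after, _ = loss_and_slope(w_new)

print(f"slope={slope:.6f}  w: {w} -> {w_new:.6f}")
print(f"loss: {L_before:.6f} -> {L_after:.6f}")
```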

Why do dL/dw and dL/db come out identical? Only because x = 1.0 here, which makes dz/dw = x = 1 = dz/db. Change the input to x = 2.0 in the code below and the shared part recomputes through z = 1.0, giving dL/dw = −0.211508 but dL/db = −0.105754. The weight's slope is now exactly x = 2 times the bias's slope, because dz/dw = x while dz/db stays at 1. Try it.
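The x = 2.0 experiment fits in a short Python sketch. The function name `gradients` and the variable `shared` are made up for this example; the formulas are the three local slopes from above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(x, y, w, b):
    """Return (dL/dw, dL/db) for one sigmoid neuron with squared-error loss."""
    z = w * x + b
    a = sigmoid(z)
    shared = 2 * (a - y) * a * (1 - a)   # dL/da * da/dz, common to both
    return shared * x, shared * 1.0      # times dz/dw = x, times dz/db = 1

for x in (1.0, 2.0):
    dL_dw, dL_db = gradients(x, y=1.0, w=0.5, b=0.0)
    print(f"x={x}: dL/dw={dL_dw:.6f}  dL/db={dL_db:.6f}")
# x=1.0: dL/dw=-0.177447  dL/db=-0.177447
# x=2.0: dL/dw=-0.211508  dL/db=-0.105754
```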

Write it in code — and let the computer audit you

The whole backward pass is six lines: the local slopes of the three links (the last link reports two, one for w and one for b), then two chain-rule products, done. The last block is the best habit in this course: don't trust the calculus, measure it. Nudge w by 0.00001, re-run the forward pass, and see how much L actually moved per unit of nudge. Paste the Python into Colab and run it; the Swift runs in an Xcode playground.

import Foundation

func sigmoid(_ z: Double) -> Double { 1 / (1 + exp(-z)) }

let x = 1.0, y = 1.0      // one training example
let w = 0.5, b = 0.0      // current parameters

// ---- forward pass: left to right ----
let z = w * x + b         // 0.5
let a = sigmoid(z)        // 0.622459...
let L = (a - y) * (a - y) // 0.142537...

// ---- backward pass: three local slopes ----
let dL_da = 2 * (a - y)   // -0.755081...
let da_dz = a * (1 - a)   // sigmoid's slope = 0.235004...
let dz_dw = x             // 1.0
let dz_db = 1.0

// ---- chain rule: multiply along the chain ----
let dL_dw = dL_da * da_dz * dz_dw
let dL_db = dL_da * da_dz * dz_db
print(dL_dw, dL_db)       // -0.17744691734927373 twice

// ---- trust, but verify: nudge w and re-measure L ----
let eps = 1e-5
let aNudged = sigmoid((w + eps) * x + b)
let LNudged = (aNudged - y) * (aNudged - y)
print((LNudged - L) / eps)   // -0.1774461477838107

The nudge experiment prints −0.1774461 against the chain rule's −0.1774469, so the first six digits match. The leftover tail is the price of using a finite nudge instead of an infinitely small one (shrink eps and it gets closer, until eps is so small that floating-point rounding swamps the difference). When the two methods agree, your backward pass is correct. Real ML engineers use this exact trick. Stanford's CS231n notes call it a gradient check: "comparing the analytic gradient to the numerical gradient."
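How the leftover tail depends on eps can be measured too. The Python sketch below is one way to set it up, assuming the same numbers as above; `loss_at` is a helper invented for this example. It repeats the nudge for several sizes, once one-sided (as in the code above) and once two-sided (nudge up and down, then split the difference), which is a common refinement.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_at(w, x=1.0, y=1.0, b=0.0):
    """Full forward pass, returning only the loss."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

x, y, w, b = 1.0, 1.0, 0.5, 0.0

# Analytic slope from the chain rule
a = sigmoid(w * x + b)
analytic = 2 * (a - y) * a * (1 - a) * x

# Numerical slope for a range of nudge sizes
one_sided_errors, two_sided_errors = [], []
for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    one_sided = (loss_at(w + eps) - loss_at(w)) / eps
    two_sided = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)
    one_sided_errors.append(abs(one_sided - analytic))
    two_sided_errors.append(abs(two_sided - analytic))
    print(f"eps={eps:g}  one-sided error={one_sided_errors[-1]:.2e}  "
          f"two-sided error={two_sided_errors[-1]:.2e}")
```

In this range the one-sided error shrinks roughly in step with eps, and the two-sided version lands closer at every size.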

Zoom out: that was backpropagation

What you did — walk the chain backwards, multiply local slopes, read off a slope for every parameter — is the algorithm. Backpropagation computes the gradient of the loss with respect to every weight and bias in a network; in 3Blue1Brown's words, it is "an algorithm for computing that negative gradient" — the vector that gradient descent then follows. Two things make it scale to millions of weights:

Shared work. Notice dL/dw and dL/db reused the same running product dL/da · da/dz = −0.177447 and only differed in the last factor. In a deep network that running product keeps flowing right to left, layer by layer; each layer multiplies in its local slopes and passes the product on. Nothing is recomputed — one backward sweep prices every parameter. That is why it's called back-propagation: the error signal literally flows backwards through the same wiring the data flowed forward through.
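To make the shared work concrete, here is a sketch of the same idea on a slightly longer chain: two single-neuron layers in a row, x → a1 → a2 → L. This is an extension invented for illustration, not a network from the lesson; the parameter values are hypothetical, and `running` is simply a name for the product that gets passed from link to link.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x, y):
    """Two single-neuron layers in a row; returns the values the backward pass needs."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return a1, a2, (a2 - y) ** 2

def backward(params, x, y):
    """One right-to-left sweep; `running` is the product carried between links."""
    w1, b1, w2, b2 = params
    a1, a2, _ = forward(params, x, y)

    running = 2 * (a2 - y)          # dL/da2
    running *= a2 * (1 - a2)        # now dL/dz2
    dL_dw2 = running * a1           # local slope dz2/dw2 = a1
    dL_db2 = running                # local slope dz2/db2 = 1

    running *= w2                   # through dz2/da1 = w2: now dL/da1
    running *= a1 * (1 - a1)        # now dL/dz1
    dL_dw1 = running * x            # local slope dz1/dw1 = x
    dL_db1 = running                # local slope dz1/db1 = 1
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]

# Hypothetical starting values, chosen only for illustration
params = [0.5, 0.0, -0.3, 0.1]
x, y = 1.0, 1.0

analytic = backward(params, x, y)

# Audit every slope with a two-sided nudge
eps = 1e-6
numeric = []
for i in range(len(params)):
    up = list(params)
    up[i] += eps
    down = list(params)
    down[i] -= eps
    numeric.append((forward(up, x, y)[2] - forward(down, x, y)[2]) / (2 * eps))

for name, g, n in zip(["w1", "b1", "w2", "b2"], analytic, numeric):
    print(f"dL/d{name}: chain rule={g:+.6f}  nudge={n:+.6f}")
```

The layer-1 slopes are built from the product already computed for layer 2, multiplied by two more local slopes. Nothing to the right of layer 1 is recomputed.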

Local rules. No link needs to know the whole network — each stage only reports its own local slope, like each station on an assembly line knowing only its own machine. The chain rule glues the reports together. With lesson 7's update step w ← w − η·dL/dw, you now hold both halves of learning. Lesson 9 snaps them into a loop.

Check yourself

No peeking back. Pull it from memory.

1. How does the chain rule combine the local slopes along w → z → a → L?
2. Using a = σ(z), the sigmoid's own slope da/dz equals:
3. In z = w·x + b, the local slope dz/dw is:

Watch this next

Primary source: "What is backpropagation really doing?" by 3Blue1Brown — the error-flowing-backwards intuition, animated across a full digit-recognizing network. Then, if you want to see today's exact chain (same (a − y)² loss, same three local slopes) drawn layer by layer with the ∂ notation, the follow-up "Backpropagation calculus" is this lesson's big sibling.