Lesson 6 gave you a number for "how wrong" (the loss). Lesson 7 gave you the direction to nudge each weight (gradient descent). Lesson 8 gave you a fast way to compute all those nudges (backpropagation). Today they snap together into one short loop — and that loop trains a neuron to be the AND gate in front of you. In lesson 2 you set the AND weights by hand. Today the machine finds them itself. That is the "learning" in machine learning, and it's the mission promise starting to cash out.
Everything any neural network does to learn — from our four-row AND table to a model with billions of weights — is this loop:
```
for each epoch:                 # one full pass over the training data
    forward  - run every example through the network   (lesson 4)
    loss     - measure how wrong the outputs are       (lesson 6)
    backward - compute every gradient with backprop    (lesson 8)
    update   - each weight: w -= η · gradient          (lesson 7)
# repeat until the loss is low enough
```
One new word: an epoch is one full pass over the training data — "one epoch means that every example has been seen once" (Stanford CS231n). Michael Nielsen says it the long way: once we've used up all the training inputs, that is "said to complete an epoch of training. At that point we start over with a new training epoch" (Nielsen, ch. 1). So "train for 300 epochs" just means: run the loop body 300 times.
And this isn't a cartoon of real practice — it is real practice. In PyTorch (lesson 11) the loop appears almost verbatim: you call loss.backward() for our backward line and optimizer.step() for our update line (PyTorch optimization tutorial). Learn the five lines once, use them forever.
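To see the four steps in motion before the AND neuron arrives, here is a minimal sketch on the smallest model I could think of. Everything in it is an assumption made for illustration, not part of the lesson's setup: a hypothetical one-weight model a = w·x, a single made-up training example (x = 2, y = 6), and a learning rate of 0.1.

```python
# Hypothetical toy problem: learn w so that w * x matches y for one example.
x, y = 2.0, 6.0      # the single training example (made up for illustration)
w = 0.0              # start from zero
lr = 0.1             # learning rate (eta), chosen for this toy only

for epoch in range(1, 21):      # 20 epochs; each epoch sees the one example once
    a = w * x                   # forward
    loss = (a - y) ** 2         # loss: squared error
    grad = 2 * (a - y) * x      # backward: dL/dw by the chain rule
    w -= lr * grad              # update: step against the gradient

print(f"w after 20 epochs: {w:.4f}")
```

The weight should settle very close to 3, the value that makes w·2 equal 6. The loop body has the same four steps as the outline above; only the model is smaller.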
Here is the loop running live, in your browser. The setup: one sigmoid neuron (lesson 1), the four AND examples with targets 0, 0, 0, 1, MSE loss, learning rate η = 5, and a deliberately boring start: w₁ = w₂ = b = 0. No randomness, so your run will match mine exactly. With all-zero weights, z = 0 for every input, so every output starts at σ(0) = 0.5. Each output is then off by exactly 0.5 from its target (0 or 1), so every squared error is 0.5² = 0.25 and the loss, their mean, also starts at 0.25. Press train and watch the loss curve fall and the four bars split apart: three sinking toward 0, one climbing toward 1.
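The starting numbers are easy to confirm by hand or in a few lines. This is a small sketch of the untrained neuron only; the variable names are mine, not the widget's.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]   # the four AND rows
targets = [0, 0, 0, 1]
w1 = w2 = b = 0.0                           # the all-zero start

# Forward pass with untrained weights: z = 0 for every row.
outputs = [sigmoid(w1 * x1 + w2 * x2 + b) for x1, x2 in inputs]

# MSE: mean of the squared errors over the four rows.
loss = sum((a - y) ** 2 for a, y in zip(outputs, targets)) / len(targets)

print(outputs)  # [0.5, 0.5, 0.5, 0.5]
print(loss)     # 0.25
```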
Look at the shape of that curve: a steep plunge, then a long patient tail. That shape is the signature of gradient descent — big confident steps while the slope is steep, tiny ones as the valley flattens out. After 300 epochs the neuron lands on w₁ = 5.14, w₂ = 5.14, b = −7.81 — and check what those weights mean: z is positive only when both inputs are 1 (z = +2.48), and negative otherwise (z = −2.67 with one input on, z = −7.81 with none). The neuron rediscovered the AND rule from lesson 2 — nobody told it.
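You can check that reading of the weights yourself. The sketch below plugs in the rounded values quoted above, so z for the 1 AND 1 row may come out a hair different from +2.48 (which presumably comes from the unrounded weights). The 0.5 cutoff implied by `round` is my assumption for turning an output into a yes/no answer.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w1, w2, b = 5.14, 5.14, -7.81   # rounded values from the text

preds = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = w1 * x1 + w2 * x2 + b   # weighted sum before the sigmoid
    a = sigmoid(z)              # the neuron's output
    preds.append(round(a))      # assumed decision rule: above 0.5 counts as 1
    print(f"{x1} AND {x2}: z = {z:+.2f}, output = {a:.3f}, rounds to {round(a)}")
```

Only the last row has a positive z, so only the last row rounds to 1: the AND truth table.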
Forward, loss and backward you already own from lessons 4–8. The only line we haven't stared at as a formula is update. Spelled out, it's one subtraction per parameter:
w₁ ← w₁ − η · ∂L/∂w₁
w₂ ← w₂ − η · ∂L/∂w₂
b ← b − η · ∂L/∂b
Symbol by symbol: η (eta) is the learning rate from lesson 7 — the step size, how far you move per update. ∂L/∂w₁ is the gradient from lesson 8 — "if w₁ grew a little, how fast would the loss L rise?" The minus sign is the whole trick: move against the rise, downhill.
The compact form treats all parameters as one vector w (a list of numbers, like lesson 3) and all gradients as one vector ∇L:
w ← w − η∇L
That is exactly the gradient descent rule on Wikipedia (xₙ₊₁ = xₙ − η∇f(xₙ), "for a small enough step size or learning rate η"). And the third way is the one you'll actually type — in the programs below, the update step is literally:
w1 -= lr * gw1 // same formula, shortest spelling
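To put numbers on that line, here is a sketch of the very first update in the AND run, worked from the all-zero start. It uses the same gradient expression as the full programs; the shortcut of hard-coding a = 0.5 is mine and only holds for this first epoch, where every z is 0.

```python
# State at the start of epoch 1: all weights zero, so every activation is 0.5.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
w1 = w2 = b = 0.0
lr = 5.0
a = 0.5   # sigmoid(0), identical for all four examples in this first epoch

gw1 = gw2 = gb = 0.0
for (x1, x2), y in zip(inputs, targets):
    d = 2 * (a - y) * a * (1 - a)   # dL/dz for this example
    gw1 += d * x1
    gw2 += d * x2
    gb += d
gw1 /= 4; gw2 /= 4; gb /= 4         # average over the four examples

# The update itself: one subtraction per parameter.
w1 -= lr * gw1
w2 -= lr * gw2
b -= lr * gb

print(gw1, gw2, gb)   # 0.0 0.0 0.125
print(w1, w2, b)      # 0.0 0.0 -0.625
```

In this first step the two weight gradients cancel out and only the bias moves, down to −0.625: three of the four targets are 0, so the quickest first improvement is to lower every output at once.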
One last connector. The loss is the MSE from lesson 6, "the average of the squares of the errors" (Wikipedia): the mean of (a − y)² over the 4 examples, where a is the activation and y the target. Backprop (lesson 8) chains through it to give, for each example:
∂/∂w₁ (a − y)² = 2(a − y) · a(1 − a) · x₁
Read it as three chained links: 2(a − y) is the slope of the squared error; a(1 − a) is the sigmoid's own derivative — σ′(z) = σ(z)(1 − σ(z)) (Wikipedia, logistic function); and x₁ appears because w₁ enters z only through the product w₁·x₁. For the bias, that last factor is just 1. Average those over the 4 examples and you have the gradients the loop uses.
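One way to convince yourself the three links are chained correctly is a numerical gradient check: nudge each parameter a tiny bit, measure how the loss moves, and compare with the formula. This is a sketch under my own assumptions: the function names, the test point (0.3, −0.2, 0.1) and the nudge size 1e-6 are all arbitrary choices, not something the lesson prescribes.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# The four AND examples: ((x1, x2), target)
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def mse(w1, w2, b):
    """Mean of (a - y)^2 over the four examples."""
    total = 0.0
    for (x1, x2), y in DATA:
        a = sigmoid(w1 * x1 + w2 * x2 + b)
        total += (a - y) ** 2
    return total / len(DATA)

def analytic_grads(w1, w2, b):
    """Gradients from the chain-rule formula 2(a - y) * a(1 - a) * x."""
    gw1 = gw2 = gb = 0.0
    for (x1, x2), y in DATA:
        a = sigmoid(w1 * x1 + w2 * x2 + b)
        d = 2 * (a - y) * a * (1 - a)
        gw1 += d * x1
        gw2 += d * x2
        gb += d              # for the bias, the last factor is just 1
    n = len(DATA)
    return gw1 / n, gw2 / n, gb / n

def numeric_grads(w1, w2, b, eps=1e-6):
    """Central differences: nudge one parameter, watch the loss move."""
    return (
        (mse(w1 + eps, w2, b) - mse(w1 - eps, w2, b)) / (2 * eps),
        (mse(w1, w2 + eps, b) - mse(w1, w2 - eps, b)) / (2 * eps),
        (mse(w1, w2, b + eps) - mse(w1, w2, b - eps)) / (2 * eps),
    )

point = (0.3, -0.2, 0.1)   # arbitrary test point, chosen only for illustration
analytic = analytic_grads(*point)
numeric = numeric_grads(*point)
for name, g_a, g_n in zip(("w1", "w2", "b"), analytic, numeric):
    print(f"{name}: analytic {g_a:+.6f}  numeric {g_n:+.6f}")
```

The two columns should agree to many decimal places. If they did not, one of the three links would be wrong.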
Now the same training as a complete program, in both languages, standard library only. The Python runs in Google Colab as-is; the Swift runs in an Xcode playground. This is the exact code I ran — and below it, the exact output it printed.
```swift
import Foundation

func sigmoid(_ z: Double) -> Double { 1 / (1 + exp(-z)) }

let data: [(x: [Double], y: Double)] = [
    ([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)
]

var w1 = 0.0, w2 = 0.0, b = 0.0
let lr = 5.0

for epoch in 1...300 {
    var gw1 = 0.0, gw2 = 0.0, gb = 0.0, loss = 0.0
    for (x, y) in data {
        let a = sigmoid(w1*x[0] + w2*x[1] + b)      // forward
        loss += (a - y) * (a - y)                   // loss
        let d = 2 * (a - y) * a * (1 - a)           // backward: dL/dz
        gw1 += d * x[0]; gw2 += d * x[1]; gb += d
    }
    loss /= 4; gw1 /= 4; gw2 /= 4; gb /= 4
    w1 -= lr * gw1; w2 -= lr * gw2; b -= lr * gb    // update
    if epoch == 1 || epoch % 50 == 0 {
        print(String(format: "epoch %3d loss %.4f", epoch, loss))
    }
}

print("")
for (x, y) in data {
    let a = sigmoid(w1*x[0] + w2*x[1] + b)
    print(String(format: "%d AND %d -> %.3f (target %d)", Int(x[0]), Int(x[1]), a, Int(y)))
}
print("")
print(String(format: "learned: w1=%.2f w2=%.2f b=%.2f", w1, w2, b))
```
```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

data = [([0.0,0.0],0.0), ([0.0,1.0],0.0), ([1.0,0.0],0.0), ([1.0,1.0],1.0)]

w1, w2, b = 0.0, 0.0, 0.0
lr = 5.0

for epoch in range(1, 301):
    gw1 = gw2 = gb = 0.0
    loss = 0.0
    for (x1, x2), y in data:
        a = sigmoid(w1*x1 + w2*x2 + b)          # forward
        loss += (a - y)**2                      # loss
        d = 2 * (a - y) * a * (1 - a)           # backward: dL/dz
        gw1 += d * x1; gw2 += d * x2; gb += d
    loss /= 4; gw1 /= 4; gw2 /= 4; gb /= 4
    w1 -= lr*gw1; w2 -= lr*gw2; b -= lr*gb      # update
    if epoch == 1 or epoch % 50 == 0:
        print(f"epoch {epoch:3d} loss {loss:.4f}")

print()
for (x1, x2), y in data:
    a = sigmoid(w1*x1 + w2*x2 + b)
    print(f"{int(x1)} AND {int(x2)} -> {a:.3f} (target {int(y)})")
print()
print(f"learned: w1={w1:.2f} w2={w2:.2f} b={b:.2f}")
```
Both programs print exactly this (run them — same start, same steps, same floats):
```
epoch   1 loss 0.2500
epoch  50 loss 0.0261
epoch 100 loss 0.0124
epoch 150 loss 0.0079
epoch 200 loss 0.0057
epoch 250 loss 0.0044
epoch 300 loss 0.0036

0 AND 0 -> 0.000 (target 0)
0 AND 1 -> 0.065 (target 0)
1 AND 0 -> 0.065 (target 0)
1 AND 1 -> 0.923 (target 1)

learned: w1=5.14 w2=5.14 b=-7.81
```
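The printed losses also put numbers on the "steep plunge, long tail" shape. As a rough sketch, the snippet below copies the seven loss values from the output above and measures how much the loss fell between consecutive checkpoints. The list name and the formatting are mine.

```python
# Loss values copied from the printed output above: (epoch, loss).
checkpoints = [(1, 0.2500), (50, 0.0261), (100, 0.0124), (150, 0.0079),
               (200, 0.0057), (250, 0.0044), (300, 0.0036)]

drops = []
for (e0, l0), (e1, l1) in zip(checkpoints, checkpoints[1:]):
    drop = l0 - l1              # how far the loss fell over this stretch
    drops.append(drop)
    print(f"epochs {e0:3d} -> {e1:3d}: loss fell by {drop:.4f}")
```

Almost all of the improvement happens in the first stretch, and each later stretch buys less than the one before it.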
Pause on what just happened. Thirty lines of stdlib code, no ML framework, no magic — and a program found weights that compute AND, starting from nothing, by repeatedly measuring its error and stepping downhill. In lesson 2, the intelligence was yours: you chose the weights. Here, the intelligence is in the loop. Scale this exact loop up — more neurons, more layers, more data — and you get the systems making headlines. That's not a metaphor; it's the same five lines.
Primary source: Andrej Karpathy — "The spelled-out intro to neural networks and backpropagation: building micrograd". Karpathy (former Tesla AI director) builds this exact loop — forward, loss, backward, update — from empty Python file to a trained network, narrating every line. His course page promises it "only assumes basic knowledge of Python and a vague recollection of calculus from high school" — which, after lessons 7 and 8, you now comfortably exceed. It's long; even the first hour will make lesson 10 feel familiar.