AI Foundations · Lesson 9 · Phase 3 — How a network learns

The Training Loop

Lesson 6 gave you a number for "how wrong" (the loss). Lesson 7 gave you the direction to nudge each weight (gradient descent). Lesson 8 gave you a fast way to compute all those nudges (backpropagation). Today they snap together into one short loop — and that loop trains a neuron to be the AND gate in front of you. In lesson 2 you set the AND weights by hand. Today the machine finds them itself. That is the "learning" in machine learning, and it's the mission promise starting to cash out.

The whole algorithm is five lines

Everything any neural network does to learn — from our four-row AND table to a model with billions of weights — is this loop:

for each epoch:                       # one full pass over the training data
    forward   - run every example through the network      (lesson 4)
    loss      - measure how wrong the outputs are          (lesson 6)
    backward  - compute every gradient with backprop       (lesson 8)
    update    - each weight:  w -= η · gradient            (lesson 7)
# repeat until the loss is low enough

One new word: an epoch is one full pass over the training data — "one epoch means that every example has been seen once" (Stanford CS231n). Michael Nielsen says it the long way: once we've used up all the training inputs, that is "said to complete an epoch of training. At that point we start over with a new training epoch" (Nielsen, ch. 1). So "train for 300 epochs" just means: run the loop body 300 times.

And this isn't a cartoon of real practice — it is real practice. In PyTorch (lesson 11) the loop appears almost verbatim: you call loss.backward() for our backward line and optimizer.step() for our update line (PyTorch optimization tutorial). Learn the five lines once, use them forever.

Watch a neuron learn AND

Here is the loop running live, in your browser. The setup: one sigmoid neuron (lesson 1), the four AND examples — targets 0, 0, 0, 1 — MSE loss, learning rate η = 5, and a deliberately boring start: w₁ = w₂ = b = 0. No randomness, so your run will match mine exactly. With all-zero weights, z = 0 for every input, so every output starts at σ(0) = 0.5 and the loss starts at 0.5² = 0.25. Press train and watch the loss curve fall and the four bars split apart — three sinking toward 0, one climbing toward 1.
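If you'd like to confirm that starting point before pressing train, here is a minimal sketch in plain Python. The names `examples`, `outputs` and `loss` are mine, not part of the demo; the setup (all-zero parameters, sigmoid, MSE, the four AND rows) is the one described above.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# The four AND examples: ((x1, x2), target)
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1 = w2 = b = 0.0   # the deliberately boring start

# Forward pass: with all-zero parameters, z = 0 for every input.
outputs = [sigmoid(w1*x1 + w2*x2 + b) for (x1, x2), _ in examples]

# MSE: mean of (a - y)^2 over the four examples.
loss = sum((a - y)**2 for a, (_, y) in zip(outputs, examples)) / len(examples)

print(outputs)  # [0.5, 0.5, 0.5, 0.5]
print(loss)     # 0.25
```

Every example is off by exactly 0.5, three too high and one too low, so each contributes 0.25 to the average.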

Look at the shape of that curve: a steep plunge, then a long patient tail. That shape is the signature of gradient descent — big confident steps while the slope is steep, tiny ones as the valley flattens out. After 300 epochs the neuron lands on w₁ = 5.14, w₂ = 5.14, b = −7.81 — and check what those weights mean: z is positive only when both inputs are 1 (z = +2.48), and negative otherwise (z = −2.67 with one input on, z = −7.81 with none). The neuron rediscovered the AND rule from lesson 2 — nobody told it.
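You can check the sign pattern yourself with a few lines of Python. This is only a sketch: it plugs in the two-decimal weights quoted above, so the printed z for both inputs on comes out near 2.47 rather than 2.48 (the unrounded weights carry a little more). The list name `predictions` is my own.

```python
w1, w2, b = 5.14, 5.14, -7.81   # rounded weights quoted in the text

predictions = []
for x1 in (0, 1):
    for x2 in (0, 1):
        z = w1*x1 + w2*x2 + b          # weighted sum before the sigmoid
        fires = int(z > 0)             # z > 0 means the sigmoid output is above 0.5
        predictions.append(((x1, x2), fires))
        print(f"{x1} AND {x2}: z = {z:+.2f} -> {fires}")
```

Only the last row gets a positive z, which is the AND rule expressed as a threshold.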

The update step — three ways to write it

Forward, loss and backward you already own from lessons 4–8. The only line we haven't stared at as a formula is update. Spelled out, it's one subtraction per parameter:

new w₁ = w₁ − η · ∂L/∂w₁
new w₂ = w₂ − η · ∂L/∂w₂
new b = b − η · ∂L/∂b

Symbol by symbol: η (eta) is the learning rate from lesson 7 — the step size, how far you move per update. ∂L/∂w₁ is the gradient from lesson 8 — "if w₁ grew a little, how fast would the loss L rise?" The minus sign is the whole trick: move against the rise, downhill.

The compact form treats all parameters as one vector w (a list of numbers, like lesson 3) and all gradients as one vector ∇L:

w ← w − η ∇L

That is exactly the gradient descent rule on Wikipedia (xₙ₊₁ = xₙ − η∇f(xₙ), "for a small enough step size or learning rate η"). And the third way is the one you'll actually type — in the programs below, the update step is literally:

w1 -= lr * gw1     // same formula, shortest spelling
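To see the compact vector form do the same job, here is a small sketch that takes the very first update step of the AND run with all parameters held in one list. The helper name `gradient` and the list layout `[w1, w2, b]` are my choices for illustration; the per-example gradient it uses is the same expression the full programs below use.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]

def gradient(params):
    """Return [dL/dw1, dL/dw2, dL/db] for the MSE loss, averaged over the data."""
    w1, w2, b = params
    g = [0.0, 0.0, 0.0]
    for (x1, x2), y in data:
        a = sigmoid(w1*x1 + w2*x2 + b)
        d = 2 * (a - y) * a * (1 - a)   # slope of this example's squared error w.r.t. z
        g[0] += d * x1
        g[1] += d * x2
        g[2] += d
    return [gi / len(data) for gi in g]

lr = 5.0
params = [0.0, 0.0, 0.0]                               # w1, w2, b
grads = gradient(params)
params = [p - lr * g for p, g in zip(params, grads)]   # w <- w - eta * grad L

print(grads)   # [0.0, 0.0, 0.125]
print(params)  # [0.0, 0.0, -0.625]
```

Notice that only the bias moves on the first step. At the all-zero start, the (1, 0) example pushes w₁ down exactly as hard as the (1, 1) example pushes it up, so the two cancel; the weights start growing once the bias has shifted the outputs.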

One last connector. The loss is the MSE from lesson 6 — "the average of the squares of the errors" (Wikipedia): mean of (a − y)² over the 4 examples, where a is the activation and y the target. Backprop (lesson 8) chains through it to give, for each example:

∂L/∂w₁ = 2(a − y) · a(1 − a) · x₁

Read it as three chained links: 2(a − y) is the slope of the squared error; a(1 − a) is the sigmoid's own derivative — σ′(z) = σ(z)(1 − σ(z)) (Wikipedia, logistic function); and x₁ appears because w₁ enters z only through the product w₁·x₁. For the bias, that last factor is just 1. Average those over the 4 examples and you have the gradients the loop uses.
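If you don't want to take the chain-rule formula on trust, one common sanity check is to compare it with a numerical slope: nudge w₁ a tiny bit each way and see how much the loss changes. The sketch below does that at an arbitrary test point. The function names, the test point and the nudge size `eps` are my own choices, not values from the lesson.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]

def mse(w1, w2, b):
    """Mean squared error over the four AND examples."""
    return sum((sigmoid(w1*x1 + w2*x2 + b) - y)**2 for (x1, x2), y in DATA) / len(DATA)

def analytic_grad_w1(w1, w2, b):
    """dL/dw1 from the chain-rule formula, averaged over the examples."""
    total = 0.0
    for (x1, x2), y in DATA:
        a = sigmoid(w1*x1 + w2*x2 + b)
        total += 2 * (a - y) * a * (1 - a) * x1
    return total / len(DATA)

w1, w2, b = 0.3, -0.2, 0.1    # arbitrary test point
eps = 1e-6                    # size of the nudge

# Central difference: (L(w1 + eps) - L(w1 - eps)) / (2 * eps)
numeric = (mse(w1 + eps, w2, b) - mse(w1 - eps, w2, b)) / (2 * eps)
analytic = analytic_grad_w1(w1, w2, b)

print(f"analytic {analytic:.8f}   numeric {numeric:.8f}")
```

The two numbers should agree to many decimal places. If they didn't, one of the three chained links would be wrong.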

The real run — 30 lines, no libraries

Now the same training as a complete program in two languages, standard library only: Swift first, then Python. The Python runs in Google Colab as-is; the Swift runs in an Xcode playground. This is the exact code I ran, and below it, the exact output it printed.

import Foundation

func sigmoid(_ z: Double) -> Double { 1 / (1 + exp(-z)) }

let data: [(x: [Double], y: Double)] = [
    ([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)
]
var w1 = 0.0, w2 = 0.0, b = 0.0
let lr = 5.0

for epoch in 1...300 {
    var gw1 = 0.0, gw2 = 0.0, gb = 0.0, loss = 0.0
    for (x, y) in data {
        let a = sigmoid(w1*x[0] + w2*x[1] + b)        // forward
        loss += (a - y) * (a - y)                     // loss
        let d = 2 * (a - y) * a * (1 - a)             // backward: dL/dz
        gw1 += d * x[0]; gw2 += d * x[1]; gb += d
    }
    loss /= 4; gw1 /= 4; gw2 /= 4; gb /= 4            // average over the 4 examples
    w1 -= lr * gw1; w2 -= lr * gw2; b -= lr * gb      // update
    if epoch == 1 || epoch % 50 == 0 {
        print(String(format: "epoch %3d   loss %.4f", epoch, loss))
    }
}
print("")
for (x, y) in data {
    let a = sigmoid(w1*x[0] + w2*x[1] + b)
    print(String(format: "%d AND %d  ->  %.3f   (target %d)", Int(x[0]), Int(x[1]), a, Int(y)))
}
print("")
print(String(format: "learned: w1=%.2f  w2=%.2f  b=%.2f", w1, w2, b))

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

data = [([0.0,0.0],0.0), ([0.0,1.0],0.0), ([1.0,0.0],0.0), ([1.0,1.0],1.0)]
w1, w2, b = 0.0, 0.0, 0.0
lr = 5.0

for epoch in range(1, 301):
    gw1 = gw2 = gb = 0.0
    loss = 0.0
    for (x1, x2), y in data:
        a = sigmoid(w1*x1 + w2*x2 + b)        # forward
        loss += (a - y)**2                    # loss
        d = 2 * (a - y) * a * (1 - a)         # backward: dL/dz
        gw1 += d * x1; gw2 += d * x2; gb += d
    loss /= 4; gw1 /= 4; gw2 /= 4; gb /= 4    # average over the 4 examples
    w1 -= lr*gw1; w2 -= lr*gw2; b -= lr*gb    # update
    if epoch == 1 or epoch % 50 == 0:
        print(f"epoch {epoch:3d}   loss {loss:.4f}")

print()
for (x1, x2), y in data:
    a = sigmoid(w1*x1 + w2*x2 + b)
    print(f"{int(x1)} AND {int(x2)}  ->  {a:.3f}   (target {int(y)})")
print()
print(f"learned: w1={w1:.2f}  w2={w2:.2f}  b={b:.2f}")

Both programs print exactly this (run them — same start, same steps, same floats):

epoch   1   loss 0.2500
epoch  50   loss 0.0261
epoch 100   loss 0.0124
epoch 150   loss 0.0079
epoch 200   loss 0.0057
epoch 250   loss 0.0044
epoch 300   loss 0.0036

0 AND 0  ->  0.000   (target 0)
0 AND 1  ->  0.065   (target 0)
1 AND 0  ->  0.065   (target 0)
1 AND 1  ->  0.923   (target 1)

learned: w1=5.14  w2=5.14  b=-7.81

Pause on what just happened. Thirty lines of stdlib code, no ML framework, no magic — and a program found weights that compute AND, starting from nothing, by repeatedly measuring its error and stepping downhill. In lesson 2, the intelligence was yours: you chose the weights. Here, the intelligence is in the loop. Scale this exact loop up — more neurons, more layers, more data — and you get the systems making headlines. That's not a metaphor; it's the same five lines.

Batch vs. SGD: one vocabulary note. Our loop averages the gradients of all 4 examples before making one update per epoch. That's "batch" gradient descent: w ← w − (η/n)·Σ∇Lᵢ, where n is the number of examples. With 60,000 examples, one update per full pass is slow, so real training updates after every small random mini-batch of examples, or even after every single one. The single-example version is called stochastic gradient descent (SGD), where "the true gradient … is approximated by a gradient at a single sample" (Wikipedia). Same loop, just more frequent, noisier updates. Nielsen trains his digit network with mini-batches of 10 and η = 3.0 (ch. 1), so our η = 5 on a toy problem is less exotic than it looks.
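To make the difference concrete, here is a sketch of the same AND training with the batch size as a knob. It is my own variation, not the code I ran above: the function `train`, the fixed shuffle seed, the 500 epochs and the gentler learning rate of 1.0 are all assumptions chosen for illustration. Batch size 4 is full-batch gradient descent, 2 is mini-batch, 1 is SGD.

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]

def train(batch_size, lr, epochs, seed=0):
    """Train one sigmoid neuron on AND; return (w1, w2, b) and the update count."""
    rng = random.Random(seed)          # fixed seed so the shuffles are repeatable
    w1 = w2 = b = 0.0
    updates = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)             # visit the examples in random order each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            gw1 = gw2 = gb = 0.0
            for (x1, x2), y in batch:
                a = sigmoid(w1*x1 + w2*x2 + b)      # forward
                d = 2 * (a - y) * a * (1 - a)       # backward
                gw1 += d * x1; gw2 += d * x2; gb += d
            n = len(batch)                          # average over this batch only
            w1 -= lr * gw1 / n; w2 -= lr * gw2 / n; b -= lr * gb / n   # update
            updates += 1
    return (w1, w2, b), updates

results = {}
for batch_size in (4, 2, 1):
    (w1, w2, b), updates = train(batch_size, lr=1.0, epochs=500)
    outs = [sigmoid(w1*x1 + w2*x2 + b) for (x1, x2), _ in data]
    results[batch_size] = (updates, outs)
    print(f"batch size {batch_size}: {updates:4d} updates, outputs "
          + "  ".join(f"{a:.3f}" for a in outs))
```

The epoch count is the same in all three runs, but the number of updates is not: 500, 1000 and 2000. That is the whole trade: smaller batches buy more steps per pass over the data, at the price of each step being based on a noisier gradient.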

Check yourself

No peeking back. Pull it from memory.

1. What is the correct order of one trip around the training loop?

2. What does one epoch of training mean?

3. In today's run, where did the working AND weights come from?

Watch this next

Primary source: Andrej Karpathy — "The spelled-out intro to neural networks and backpropagation: building micrograd". Karpathy (former Tesla AI director) builds this exact loop — forward, loss, backward, update — from empty Python file to a trained network, narrating every line. His course page promises it "only assumes basic knowledge of Python and a vague recollection of calculus from high school" — which, after lessons 7 and 8, you now comfortably exceed. It's long; even the first hour will make lesson 10 feel familiar.