AI Foundations · Lesson 10 · Phase 4 — Build your own

Train a Tiny Network from Scratch

This is the lesson the whole course has been walking toward. Today you assemble everything — forward pass, loss, backpropagation, gradient descent, the training loop — into one short program with no libraries at all, and watch it learn something a single neuron provably never can. After today, "I trained a neural network from scratch" is simply a true sentence about you.

The score to settle: XOR

Back in lesson 2 you saw that one neuron draws a single straight line through its input space, so it can only learn problems where one line separates the answers. XOR (output 1 when the two inputs differ) is the famous problem where no such line exists. This isn't course folklore: Minsky and Papert's 1969 book Perceptrons proved that a single-layer perceptron cannot compute XOR, and the result is widely credited with stalling neural-net research for about a decade (Wikipedia: Perceptron).

x₁	x₂	x₁ XOR x₂
0	0	0
0	1	1
1	0	1
1	1	0
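If you'd like to poke at that claim yourself, here is a small spot check (a brute-force search, not Minsky and Papert's proof). It tries every weight and bias on a coarse grid for one neuron and counts how many of the four rows it gets right. The grid range and step, and the names xor_cases and neuron_says_one, are illustrative choices, not part of the course code. It leans on the fact that σ(z) is above 0.5 exactly when z is above 0, so the sigmoid never has to be evaluated.

```python
import itertools

# The four rows of the XOR table: ((x1, x2), target)
xor_cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def neuron_says_one(w1, w2, b, x1, x2):
    # sigmoid(z) > 0.5 exactly when z > 0, so we only test the sign of z
    return w1*x1 + w2*x2 + b > 0

# Illustrative grid (an assumption): -5.0 to 5.0 in steps of 0.5
grid = [i / 2 for i in range(-10, 11)]

best = 0
for w1, w2, b in itertools.product(grid, repeat=3):
    correct = sum(
        neuron_says_one(w1, w2, b, x1, x2) == (y == 1)
        for (x1, x2), y in xor_cases
    )
    best = max(best, correct)

print(f"best single neuron on this grid: {best} of 4 rows correct")
```

On this grid the best any single neuron manages is three rows out of four. One such setting is w₁ = w₂ = 1, b = −0.5, which gets only 1 XOR 1 wrong.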

The fix, as lesson 4 showed, is a hidden layer: neurons between input and output whose activations the network invents for itself. Our network today is the smallest one that works, a 2-2-1: 2 inputs → 2 hidden neurons → 1 output. Count its trainable numbers: each hidden neuron has 2 weights and a bias (6 numbers), the output neuron has 2 weights and a bias (3 more). Nine numbers. Training means: nudge those nine numbers, over and over, until the network's outputs match the table above.
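The same count works for other shapes. A minimal sketch, assuming fully connected layers like ours (the helper name count_parameters is made up for illustration): every neuron owns one weight per incoming value, plus one bias.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected net, e.g. [2, 2, 1]."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # one weight per (input, neuron) pair
        total += n_out          # one bias per neuron
    return total

print(count_parameters([2, 2, 1]))   # 6 hidden + 3 output = 9
```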

hidden: h₁ = σ(w₁₁·x₁ + w₁₂·x₂ + b₁)
        h₂ = σ(w₂₁·x₁ + w₂₂·x₂ + b₂)
output: out = σ(v₁·h₁ + v₂·h₂ + c)

Nothing new here — that is lesson 4's forward pass, written out for our exact net. Each line is just lesson 1's neuron: weighted sum, plus bias, through the sigmoid. w₁₂ reads "weight into hidden neuron 1, from input 2"; v₁, v₂, c are the output neuron's weights and bias.
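Those three lines translate almost symbol for symbol into code. The sketch below uses the nine starting values from this lesson's program; wrapping them in a function called forward is just one convenient packaging, not the layout the full program uses.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Starting values copied from this lesson's program
w11, w12, b1 = 0.5, -0.4, 0.1    # hidden neuron 1
w21, w22, b2 = 0.9,  0.3, -0.2   # hidden neuron 2
v1,  v2,  c  = 0.8, -0.6, 0.05   # output neuron

def forward(x1, x2):
    h1 = sigmoid(w11*x1 + w12*x2 + b1)
    h2 = sigmoid(w21*x1 + w22*x2 + b2)
    out = sigmoid(v1*h1 + v2*h2 + c)
    return h1, h2, out

# Untrained, every output sits a little above 0.5
untrained = [forward(x1, x2)[2] for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
for out in untrained:
    print(f"{out:.3f}")
```

All four untrained outputs land between 0.5 and 0.56, which is the "shrug" the network starts from.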

Watch it learn — right here

Code first, as always. Below is the exact network we'll build, running live in this page, starting from the same nine hand-picked numbers as the programs further down. Untrained, it shrugs ≈ 0.5 at everything. Press train and watch the loss curve fall off a cliff as the four outputs split toward 0 and 1.

input	output	target	verdict
0 XOR 0		0
0 XOR 1		1
1 XOR 0		1
1 XOR 1		0

A "verdict" is earned by landing on the correct side of 0.5 — the same rounding rule lesson 2 used to turn an activation into a yes/no decision. What you just watched is the entire rest of this lesson; now we name it.
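The verdict rule is small enough to write down directly. A minimal sketch: the function name verdict is made up, treating an output of exactly 0.5 as a 0 is an assumption (the lesson does not say which way a tie goes), and the sample activations are the four final outputs this lesson reports further down.

```python
def verdict(out, target):
    """Round the activation at 0.5 and compare with the 0/1 target."""
    decision = 1 if out > 0.5 else 0   # assumption: a tie at 0.5 counts as 0
    return "correct" if decision == target else "wrong"

# (activation, target) pairs: the trained outputs reported later in this lesson
samples = [(0.020, 0), (0.977, 1), (0.981, 1), (0.018, 0)]
verdicts = [verdict(out, target) for out, target in samples]
print(verdicts)
```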

The only new math: two deltas instead of one

In lesson 9 your training loop updated one neuron. The single genuinely new thing today is bookkeeping: with two layers, the error signal must flow through the output neuron back into the hidden ones — exactly what lesson 8 called backpropagation. As usual, two zoom levels.

① The easy way — spelled out for our nine numbers

For one training example with target y, the loss is lesson 6's squared error, (out − y)². Start at the output and compute its delta — lesson 8's name for "how much the loss changes per nudge of this neuron's weighted sum z":

δ_out = 2·(out − y) · out·(1 − out)

Read it symbol by symbol: 2·(out − y) is the slope of the squared loss (lesson 6); out·(1 − out) is the sigmoid's own slope σ′(z) = σ(z)·(1 − σ(z)) from lesson 8 (Nielsen, ch. 2, his sigmoid_prime). Multiplying them is the chain rule: loss-per-output times output-per-z. With δ_out in hand, the output neuron's three gradients are one multiplication each — each weight's gradient is the delta times whatever activation came in through that weight; the bias gradient is the delta itself:

∂L/∂v₁ = δ_out·h₁
∂L/∂v₂ = δ_out·h₂
∂L/∂c = δ_out
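A quick way to trust a gradient formula is to compare it with a literal nudge. The sketch below is a sanity check, not part of the lesson's program: it takes the input (0, 1) with the lesson's starting weights, computes ∂L/∂v₁ from δ_out, then estimates the same slope by wiggling v₁ a tiny amount. The step size eps and the helper loss_given are illustrative choices.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hidden activations for input (0, 1) under the lesson's starting weights
h1, h2 = sigmoid(-0.3), sigmoid(0.1)
y = 1.0
v1, v2, c = 0.8, -0.6, 0.05

def loss_given(v1_value):
    # squared error for this one example, as a function of v1 alone
    out = sigmoid(v1_value*h1 + v2*h2 + c)
    return (out - y)**2

# The formulas from the text
out = sigmoid(v1*h1 + v2*h2 + c)
delta_out = 2*(out - y) * out*(1 - out)
grad_v1, grad_v2, grad_c = delta_out*h1, delta_out*h2, delta_out

# The same slope, measured by nudging v1 up and down
eps = 1e-6
numeric_grad_v1 = (loss_given(v1 + eps) - loss_given(v1 - eps)) / (2*eps)

print(f"formula: {grad_v1:.8f}   nudge: {numeric_grad_v1:.8f}")
```

The two numbers should agree to many decimal places. δ_out comes out negative here because the output is below its target of 1, so the update will push v₁ up.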

(Reminder from lesson 7: ∂L/∂v₁ reads "how much the loss L changes when v₁ nudges" — a slope, nothing more.) Now the step that earns the name backpropagation. How much is hidden neuron 1 to blame? Its activation h₁ only reaches the loss through the weight v₁, so: take δ_out, carry it backwards across v₁, then multiply by h₁'s own sigmoid slope:

δ_h₁ = δ_out·v₁ · h₁·(1 − h₁)
δ_h₂ = δ_out·v₂ · h₂·(1 − h₂)

And then the hidden gradients follow the exact same one-multiplication pattern as before — delta times the incoming activation (here, the raw inputs), bias gets the bare delta:

∂L/∂w₁₁ = δ_h₁·x₁
∂L/∂w₁₂ = δ_h₁·x₂
∂L/∂b₁ = δ_h₁
(same for row 2)
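With all nine formulas on the table, a finite-difference check can cover the lot. This is a sketch for building confidence, not something the lesson's program does: the parameters are kept in a dictionary purely for convenience, and backprop_gradients, numeric_gradient, and eps are names and values chosen for this illustration. It uses the input (1, 1) so that none of the gradients is trivially zero.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(p, x1, x2):
    h1 = sigmoid(p["w11"]*x1 + p["w12"]*x2 + p["b1"])
    h2 = sigmoid(p["w21"]*x1 + p["w22"]*x2 + p["b2"])
    out = sigmoid(p["v1"]*h1 + p["v2"]*h2 + p["c"])
    return h1, h2, out

def backprop_gradients(p, x1, x2, y):
    # The deltas and gradients exactly as written in the text
    h1, h2, out = forward(p, x1, x2)
    d_out = 2*(out - y) * out*(1 - out)
    d_h1 = d_out*p["v1"] * h1*(1 - h1)
    d_h2 = d_out*p["v2"] * h2*(1 - h2)
    return {"v1": d_out*h1, "v2": d_out*h2, "c": d_out,
            "w11": d_h1*x1, "w12": d_h1*x2, "b1": d_h1,
            "w21": d_h2*x1, "w22": d_h2*x2, "b2": d_h2}

def numeric_gradient(p, name, x1, x2, y, eps=1e-6):
    # Nudge one parameter up and down and measure the change in loss
    up, down = dict(p), dict(p)
    up[name] += eps
    down[name] -= eps
    loss_up = (forward(up, x1, x2)[2] - y)**2
    loss_down = (forward(down, x1, x2)[2] - y)**2
    return (loss_up - loss_down) / (2*eps)

# The lesson's starting values
params = {"w11": 0.5, "w12": -0.4, "b1": 0.1,
          "w21": 0.9, "w22": 0.3, "b2": -0.2,
          "v1": 0.8, "v2": -0.6, "c": 0.05}

x1, x2, y = 1.0, 1.0, 0.0
analytic = backprop_gradients(params, x1, x2, y)
worst_gap = max(abs(analytic[name] - numeric_gradient(params, name, x1, x2, y))
                for name in params)
print(f"largest gap between formula and nudge: {worst_gap:.2e}")
```

A tiny gap for every one of the nine parameters is good evidence that the deltas are wired up correctly.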

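Finally lesson 7's update rule, applied to all nine numbers at once: subtract the learning rate times the gradient. (The formulas above are for a single training example; the program below averages each gradient over all four XOR cases before taking the step.)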

every parameter p: p → p − η·∂L/∂p (here η = 2.0)
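As a function, the rule is one line. A minimal sketch, assuming parameters and gradients are stored in dictionaries keyed by name; gradient_step and the sample gradient values are made up for illustration and are not taken from a real training run.

```python
def gradient_step(params, grads, lr):
    """One update: every p becomes p - lr * dL/dp."""
    return {name: value - lr*grads[name] for name, value in params.items()}

params = {"v1": 0.8, "c": 0.05}      # two of the nine, for brevity
grads = {"v1": 0.1, "c": -0.02}      # made-up gradients
stepped = gradient_step(params, grads, lr=2.0)
print(stepped)
```

A positive gradient pulls the parameter down (v1 drops to about 0.6) and a negative one pushes it up (c rises to about 0.09): always a step downhill on the loss.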

② The professional way — the same thing, compressed

In Nielsen's chapter 2 these appear as the four fundamental equations of backpropagation, written for any size of network: δ at the output is ∂C/∂a·σ′(z) (BP1); each earlier layer's δ is the next layer's δ carried back through the weights, times σ′(z) (BP2); every bias gradient is its δ (BP3); every weight gradient is δ times the activation entering it (BP4). Look back at ① — you have now personally written out all four, for the 2-2-1 case. The update rule v → v − η∇C is equation (11) of chapter 1.
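To see that the four equations really are the same recipe at any size, here is a sketch of them for an arbitrary stack of sigmoid layers, using plain lists and no libraries. It is an illustration in this lesson's notation, not Nielsen's code: the function name backprop, the nested-list layout, and the use of this lesson's squared error (without the ½) are choices made here.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def backprop(weights, biases, x, y):
    """Gradients of sum((a - y)^2) for one example.

    weights[layer][j][k] is the weight into neuron j of that layer from
    neuron k of the layer before it; biases[layer][j] is neuron j's bias.
    """
    # Forward pass, remembering every layer's activations
    activations = [list(x)]
    for W, b in zip(weights, biases):
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w*a for w, a in zip(row, prev)) + bias)
            for row, bias in zip(W, b)
        ])

    # BP1: output delta = slope of the loss times the sigmoid's slope
    delta = [2*(a - t) * a*(1 - a) for a, t in zip(activations[-1], y)]

    grad_w = [None] * len(weights)
    grad_b = [None] * len(weights)
    for layer in reversed(range(len(weights))):
        incoming = activations[layer]          # activations feeding this layer
        grad_b[layer] = list(delta)                                  # BP3
        grad_w[layer] = [[d*a for a in incoming] for d in delta]     # BP4
        if layer > 0:
            # BP2: carry delta back through this layer's weights, then
            # multiply by the earlier layer's own sigmoid slope
            delta = [
                sum(weights[layer][j][k]*delta[j] for j in range(len(delta)))
                * incoming[k]*(1 - incoming[k])
                for k in range(len(incoming))
            ]
    return grad_w, grad_b

# The lesson's 2-2-1 net and starting values, on the example (0, 1) -> 1
weights = [[[0.5, -0.4], [0.9, 0.3]], [[0.8, -0.6]]]
biases = [[0.1, -0.2], [0.05]]
grad_w, grad_b = backprop(weights, biases, x=[0.0, 1.0], y=[1.0])
print("dL/dv1, dL/dv2:", grad_w[1][0])
print("dL/db1, dL/db2:", grad_b[0])
```

Fed the 2-2-1 shape, it should reproduce the hand-written formulas from ① term for term; hand it a different list of layers and the same loop keeps working.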

Where did the 2 come from? We define the loss as (out − y)², so its slope is 2(out − y). Nielsen defines cost with a ½ in front precisely so that the 2 cancels (ch. 1, eqn. 6). Both are correct; only the bookkeeping differs.
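"Only the bookkeeping differs" can be checked with arithmetic. In the sketch below the activation and target are made-up numbers; it computes the output delta under both loss definitions and shows that the ½ version takes the same step once the learning rate is doubled.

```python
out, y = 0.7, 1.0                    # made-up activation and target
sigmoid_slope = out*(1 - out)

delta_lesson = 2*(out - y) * sigmoid_slope     # loss = (out - y)^2
delta_nielsen = (out - y) * sigmoid_slope      # loss = 1/2 * (out - y)^2

eta = 2.0
step_lesson = eta * delta_lesson               # bias step under the lesson's loss
step_nielsen = (2*eta) * delta_nielsen         # same step, twice the learning rate
print(step_lesson, step_nielsen)
```

So choosing between the two conventions amounts to rescaling η by a factor of 2.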

The whole program — 47 lines, zero libraries

Here it is: the complete trainer, first in Swift, then in Python. Pure standard library: import Foundation / import math and nothing else. (The 47-line count refers to the Python version; the Swift version runs a few lines longer.) Paste the Python into Google Colab and run it; the Swift runs as-is in an Xcode playground. The nine starting weights are hard-coded, so your run will print exactly the output shown after the code, and both languages produce it character for character.

import Foundation

func sigmoid(_ z: Double) -> Double { 1 / (1 + exp(-z)) }

// the data: all four XOR cases
let data: [(x1: Double, x2: Double, y: Double)] = [
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)
]

// deterministic starting weights (same as the Python version)
var w11 = 0.5, w12 = -0.4, b1 = 0.1     // hidden neuron 1
var w21 = 0.9, w22 =  0.3, b2 = -0.2    // hidden neuron 2
var v1  = 0.8, v2  = -0.6, c  = 0.05    // output neuron

let lr = 2.0
for epoch in 0...4000 {
    var gw11 = 0.0, gw12 = 0.0, gb1 = 0.0
    var gw21 = 0.0, gw22 = 0.0, gb2 = 0.0
    var gv1  = 0.0, gv2  = 0.0, gc  = 0.0
    var loss = 0.0
    for (x1, x2, y) in data {
        // forward
        let h1  = sigmoid(w11*x1 + w12*x2 + b1)
        let h2  = sigmoid(w21*x1 + w22*x2 + b2)
        let out = sigmoid(v1*h1 + v2*h2 + c)
        loss += (out - y)*(out - y)
        // backward
        let dOut = 2*(out - y) * out*(1 - out)
        gv1 += dOut*h1; gv2 += dOut*h2; gc += dOut
        let dH1 = dOut*v1 * h1*(1 - h1)
        let dH2 = dOut*v2 * h2*(1 - h2)
        gw11 += dH1*x1; gw12 += dH1*x2; gb1 += dH1
        gw21 += dH2*x1; gw22 += dH2*x2; gb2 += dH2
    }
    let n = Double(data.count)
    loss /= n
    w11 -= lr*gw11/n; w12 -= lr*gw12/n; b1 -= lr*gb1/n
    w21 -= lr*gw21/n; w22 -= lr*gw22/n; b2 -= lr*gb2/n
    v1  -= lr*gv1/n;  v2  -= lr*gv2/n;  c  -= lr*gc/n
    if epoch % 500 == 0 {
        print(String(format: "epoch %4d   loss %.4f", epoch, loss))
    }
}

print()
for (x1, x2, y) in data {
    let h1  = sigmoid(w11*x1 + w12*x2 + b1)
    let h2  = sigmoid(w21*x1 + w22*x2 + b2)
    let out = sigmoid(v1*h1 + v2*h2 + c)
    print(String(format: "%d XOR %d -> %.3f   (target %d)", Int(x1), Int(x2), out, Int(y)))
}

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# --- the data: all four XOR cases (lesson 2's impossible problem) ---
data = [([0.0, 0.0], 0.0),
        ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0),
        ([1.0, 1.0], 0.0)]

# --- deterministic starting weights (small, asymmetric on purpose) ---
w11, w12, b1 = 0.5, -0.4, 0.1    # hidden neuron 1
w21, w22, b2 = 0.9,  0.3, -0.2   # hidden neuron 2
v1,  v2,  c  = 0.8, -0.6, 0.05   # output neuron

lr = 2.0
for epoch in range(4001):
    gw11 = gw12 = gb1 = gw21 = gw22 = gb2 = gv1 = gv2 = gc = 0.0
    loss = 0.0
    for (x1, x2), y in data:
        # forward
        h1 = sigmoid(w11*x1 + w12*x2 + b1)
        h2 = sigmoid(w21*x1 + w22*x2 + b2)
        out = sigmoid(v1*h1 + v2*h2 + c)
        loss += (out - y)**2
        # backward
        d_out = 2*(out - y) * out*(1 - out)
        gv1 += d_out*h1; gv2 += d_out*h2; gc += d_out
        d_h1 = d_out*v1 * h1*(1 - h1)
        d_h2 = d_out*v2 * h2*(1 - h2)
        gw11 += d_h1*x1; gw12 += d_h1*x2; gb1 += d_h1
        gw21 += d_h2*x1; gw22 += d_h2*x2; gb2 += d_h2
    n = len(data)
    loss /= n
    w11 -= lr*gw11/n; w12 -= lr*gw12/n; b1 -= lr*gb1/n
    w21 -= lr*gw21/n; w22 -= lr*gw22/n; b2 -= lr*gb2/n
    v1  -= lr*gv1/n;  v2  -= lr*gv2/n;  c  -= lr*gc/n
    if epoch % 500 == 0:
        print(f"epoch {epoch:4d}   loss {loss:.4f}")

print()
for (x1, x2), y in data:
    h1 = sigmoid(w11*x1 + w12*x2 + b1)
    h2 = sigmoid(w21*x1 + w22*x2 + b2)
    out = sigmoid(v1*h1 + v2*h2 + c)
    print(f"{int(x1)} XOR {int(x2)} -> {out:.3f}   (target {int(y)})")

And this is what it actually prints (run today, both languages, identical output):

epoch    0   loss 0.2518
epoch  500   loss 0.0888
epoch 1000   loss 0.0034
epoch 1500   loss 0.0015
epoch 2000   loss 0.0010
epoch 2500   loss 0.0007
epoch 3000   loss 0.0006
epoch 3500   loss 0.0005
epoch 4000   loss 0.0004

0 XOR 0 -> 0.020   (target 0)
0 XOR 1 -> 0.977   (target 1)
1 XOR 0 -> 0.981   (target 1)
1 XOR 1 -> 0.018   (target 0)

All four outputs on the correct side of 0.5 — two near 0, two near 1. The thing lesson 2 proved impossible for one neuron, nine numbers and a hidden layer just learned from data.

You already knew every block

Walk the program top to bottom and notice there is nothing in it you haven't already built:

forward (the three sigmoid(...) lines) — lesson 4's forward pass, lesson 1's neuron three times.
loss ((out − y)², averaged over the 4 cases) — lesson 6's mean squared error.
backward (d_out, d_h1, d_h2 and the g… sums) — lesson 8's deltas, the formulas from ① above, accumulated across the four examples.
update (the -= lr*…/n block) — lesson 7's gradient descent step on all nine parameters.
the epoch loop around it all — lesson 9's training loop, unchanged.

Why these starting numbers? They are small and deliberately unequal. If both hidden neurons started identical (same incoming weights, same bias, same outgoing weight), they would compute the same output, receive the same gradients, and stay clones forever: every nudge keeps them in lockstep. Breaking that tie is called symmetry breaking, and it's why real networks initialize weights randomly (CS231n: Neural Networks 2). We hard-code ours instead of using a random seed so that your run matches this page digit for digit. And η = 2.0 is a luxuriously large learning rate that this four-example problem happily tolerates; real datasets need far gentler steps.
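You can watch the lockstep happen. The sketch below reuses the lesson's training loop but, as a deliberate experiment, starts both hidden neurons as exact clones, outgoing weights included. The cloned starting values and the 1000-epoch budget are made up for this demo.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

# Experiment: hidden neuron 2 is an exact copy of hidden neuron 1
w11 = w21 = 0.5
w12 = w22 = -0.4
b1 = b2 = 0.1
v1 = v2 = 0.8
c = 0.05

lr = 2.0
n = len(data)
for epoch in range(1000):
    gw11 = gw12 = gb1 = gw21 = gw22 = gb2 = gv1 = gv2 = gc = 0.0
    for (x1, x2), y in data:
        h1 = sigmoid(w11*x1 + w12*x2 + b1)
        h2 = sigmoid(w21*x1 + w22*x2 + b2)
        out = sigmoid(v1*h1 + v2*h2 + c)
        d_out = 2*(out - y) * out*(1 - out)
        gv1 += d_out*h1; gv2 += d_out*h2; gc += d_out
        d_h1 = d_out*v1 * h1*(1 - h1)
        d_h2 = d_out*v2 * h2*(1 - h2)
        gw11 += d_h1*x1; gw12 += d_h1*x2; gb1 += d_h1
        gw21 += d_h2*x1; gw22 += d_h2*x2; gb2 += d_h2
    w11 -= lr*gw11/n; w12 -= lr*gw12/n; b1 -= lr*gb1/n
    w21 -= lr*gw21/n; w22 -= lr*gw22/n; b2 -= lr*gb2/n
    v1 -= lr*gv1/n; v2 -= lr*gv2/n; c -= lr*gc/n

# After training, are the two hidden neurons still identical?
still_clones = (w11 == w21 and w12 == w22 and b1 == b2 and v1 == v2)

# And how many rows end up on the correct side of 0.5?
correct = 0
for (x1, x2), y in data:
    h1 = sigmoid(w11*x1 + w12*x2 + b1)
    h2 = sigmoid(w21*x1 + w22*x2 + b2)
    out = sigmoid(v1*h1 + v2*h2 + c)
    if (out > 0.5) == (y == 1.0):
        correct += 1

print("still clones:", still_clones)
print("rows on the correct side of 0.5:", correct, "of 4")
```

The clones never separate, so the pair behaves like a single hidden neuron, and a net with one hidden neuron cannot get all four XOR rows right.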

Check yourself

No peeking back. Pull it from memory.

1. Lesson 2 proved one neuron alone can never learn XOR. What made it learnable today?

2. In the backward pass, the gradient for output weight v₁ is:

3. During the update step, every one of the nine parameters is changed by:

Read this next

Primary source: Michael Nielsen, Neural Networks and Deep Learning, chapters 1–2. Chapter 1 builds exactly this kind of network from scratch (his is bigger — it reads handwritten digits); chapter 2 derives the four backprop equations you wrote out by hand today. You now have every prerequisite to read both cover to cover — try it and notice how much is familiar.