This is the lesson the whole course has been walking toward. Today you assemble everything — forward pass, loss, backpropagation, gradient descent, the training loop — into one short program with no libraries at all, and watch it learn something a single neuron provably never can. After today, "I trained a neural network from scratch" is simply a true sentence about you.
Back in lesson 2 you saw that one neuron draws a single straight line through its input space, so it can only learn problems where one line separates the answers. XOR — output 1 when the two inputs differ — is the famous problem where no such line exists. This isn't course folklore: Minsky and Papert's 1969 book Perceptrons proved a single-layer network cannot learn XOR, and the result froze neural-net research for about a decade (Wikipedia: Perceptron).
| x₁ | x₂ | x₁ XOR x₂ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
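The claim can be spot-checked numerically. The sketch below is not a proof; it only tries a coarse grid of weights and biases for one threshold neuron and records the best score on the four rows. The grid range and the names `xor_rows` and `best_single_neuron_score` are illustrative choices, not part of the course code.

```python
import itertools

# The four rows of the XOR table: ((x1, x2), target)
xor_rows = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def best_single_neuron_score(grid):
    """Try every (w1, w2, b) on the grid; return the most rows any one neuron gets right."""
    best = 0
    for w1, w2, b in itertools.product(grid, repeat=3):
        # A single neuron's verdict: which side of the line w1*x1 + w2*x2 + b = 0 the point is on
        correct = sum(
            (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == y
            for (x1, x2), y in xor_rows
        )
        best = max(best, correct)
    return best

# Hypothetical search grid: -3.0 to 3.0 in steps of 0.25
grid = [i / 4 for i in range(-12, 13)]
best = best_single_neuron_score(grid)
print(best)
```

On this grid the best score is 3 out of 4: whichever line you pick, one row always ends up on the wrong side.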
The fix, as lesson 4 showed, is a hidden layer: neurons between input and output whose activations the network invents for itself. Our network today is the smallest one that works, a 2-2-1: 2 inputs → 2 hidden neurons → 1 output. Count its trainable numbers: each hidden neuron has 2 weights and a bias (6 numbers), the output neuron has 2 weights and a bias (3 more). Nine numbers. Training means: nudge those nine numbers, over and over, until the network's outputs match the table above.
h₁ = σ(w₁₁·x₁ + w₁₂·x₂ + b₁)
h₂ = σ(w₂₁·x₁ + w₂₂·x₂ + b₂)
out = σ(v₁·h₁ + v₂·h₂ + c)
Nothing new here: that is lesson 4's forward pass, written out for our exact net. Each line is just lesson 1's neuron: weighted sum, plus bias, through the sigmoid. w₁₂ reads "weight into hidden neuron 1, from input 2"; v₁, v₂, c are the output neuron's weights and bias.
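The forward pass described above can be sketched in a few lines of Python. The dictionary `params` and the function name `forward` are illustrative choices (the trainer later in this lesson uses nine separate variables instead); the starting values are the ones that trainer hard-codes.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x1, x2, p):
    """Forward pass of the 2-2-1 net; p holds the nine parameters by name."""
    h1 = sigmoid(p["w11"] * x1 + p["w12"] * x2 + p["b1"])   # hidden neuron 1
    h2 = sigmoid(p["w21"] * x1 + p["w22"] * x2 + p["b2"])   # hidden neuron 2
    out = sigmoid(p["v1"] * h1 + p["v2"] * h2 + p["c"])     # output neuron
    return h1, h2, out

# Starting values copied from the trainer in this lesson
params = {"w11": 0.5, "w12": -0.4, "b1": 0.1,
          "w21": 0.9, "w22": 0.3, "b2": -0.2,
          "v1": 0.8, "v2": -0.6, "c": 0.05}

untrained = [forward(x1, x2, params)[2] for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print([round(o, 3) for o in untrained])
```

With these starting values all four outputs sit a little above 0.5, which is the untrained "shrug" before any learning has happened.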
Code first, as always. Below is the exact network we'll build, running live in this page, starting from the same nine hand-picked numbers as the programs further down. Untrained, it shrugs ≈ 0.5 at everything. Press train and watch the loss curve fall off a cliff as the four outputs split toward 0 and 1.
A "verdict" is earned by landing on the correct side of 0.5 — the same rounding rule lesson 2 used to turn an activation into a yes/no decision. What you just watched is the entire rest of this lesson; now we name it.
In lesson 9 your training loop updated one neuron. The single genuinely new thing today is bookkeeping: with two layers, the error signal must flow through the output neuron back into the hidden ones — exactly what lesson 8 called backpropagation. As usual, two zoom levels.
For one training example with target y, the loss is lesson 6's squared error, (out − y)². Start at the output and compute its delta, lesson 8's name for "how much the loss changes per nudge of this neuron's weighted sum z":
δ_out = 2·(out − y) · out·(1 − out)
Read it symbol by symbol: 2·(out − y) is the slope of the squared loss (lesson 6); out·(1 − out) is the sigmoid's own slope σ′(z) = σ(z)·(1 − σ(z)) from lesson 8 (Nielsen, ch. 2, his sigmoid_prime). Multiplying them is the chain rule: loss-per-output times output-per-z. With δ_out in hand, the output neuron's three gradients are one multiplication each: each weight's gradient is the delta times whatever activation came in through that weight, and the bias gradient is the delta itself:
∂L/∂v₁ = δ_out·h₁
∂L/∂v₂ = δ_out·h₂
∂L/∂c = δ_out
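A worked instance may help. The sketch below evaluates the output delta and the three output gradients for one example, using the starting weights from this lesson's trainer; the choice of the row (0, 0) → 0 and the variable names `loss_slope`, `sigmoid_slope` and `grad_*` are illustrative.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Starting values from this lesson's trainer; the example row (0, 0) -> 0 is an arbitrary pick
w11, w12, b1 = 0.5, -0.4, 0.1
w21, w22, b2 = 0.9, 0.3, -0.2
v1, v2, c = 0.8, -0.6, 0.05
x1, x2, y = 0.0, 0.0, 0.0

# forward
h1 = sigmoid(w11*x1 + w12*x2 + b1)
h2 = sigmoid(w21*x1 + w22*x2 + b2)
out = sigmoid(v1*h1 + v2*h2 + c)

# output delta: slope of the squared loss times slope of the sigmoid
loss_slope = 2 * (out - y)
sigmoid_slope = out * (1 - out)
d_out = loss_slope * sigmoid_slope

# output neuron's gradients: delta times the incoming activation; bias gets the bare delta
grad_v1 = d_out * h1
grad_v2 = d_out * h2
grad_c = d_out

print(f"out={out:.4f}  d_out={d_out:.4f}")
print(f"grad_v1={grad_v1:.4f}  grad_v2={grad_v2:.4f}  grad_c={grad_c:.4f}")
```

The untrained output (about 0.55) overshoots the target 0, so the delta is positive. Both hidden activations are positive too, so all three gradients are positive and a gradient-descent step would push v₁, v₂ and c down.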
(Reminder from lesson 7: ∂L/∂v₁ reads "how much the loss L changes when v₁ nudges", a slope and nothing more.) Now the step that earns the name backpropagation. How much is hidden neuron 1 to blame? Its activation h₁ only reaches the loss through the weight v₁, so: take δ_out, carry it backwards across v₁, then multiply by h₁'s own sigmoid slope. Hidden neuron 2 is treated identically, across v₂:
δ_h₁ = δ_out·v₁ · h₁·(1 − h₁)
δ_h₂ = δ_out·v₂ · h₂·(1 − h₂)
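To see the "carry it backwards" step with numbers, here is a small sketch. The helper name `hidden_deltas` is made up for this example, and the inputs are the starting weights evaluated on the row (0, 0) → 0, an arbitrary pick.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def hidden_deltas(d_out, v1, v2, h1, h2):
    """Carry the output delta back across v1 and v2, then scale by each hidden sigmoid's slope."""
    d_h1 = d_out * v1 * h1 * (1 - h1)
    d_h2 = d_out * v2 * h2 * (1 - h2)
    return d_h1, d_h2

# Illustrative inputs: with x1 = x2 = 0 only the hidden biases (0.1 and -0.2) contribute
h1, h2 = sigmoid(0.1), sigmoid(-0.2)
v1, v2, c = 0.8, -0.6, 0.05
out = sigmoid(v1*h1 + v2*h2 + c)
d_out = 2 * (out - 0.0) * out * (1 - out)   # target y = 0

d_h1, d_h2 = hidden_deltas(d_out, v1, v2, h1, h2)
print(f"d_h1={d_h1:+.4f}  d_h2={d_h2:+.4f}")
```

The two deltas come out with opposite signs because v₁ is positive and v₂ is negative: the same output error blames the two hidden neurons in opposite directions.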
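And then the hidden gradients follow the exact same one-multiplication pattern as before: delta times the incoming activation (here, the raw inputs), and the bias gets the bare delta:
∂L/∂w₁₁ = δ_h₁·x₁   ∂L/∂w₁₂ = δ_h₁·x₂   ∂L/∂b₁ = δ_h₁
∂L/∂w₂₁ = δ_h₂·x₁   ∂L/∂w₂₂ = δ_h₂·x₂   ∂L/∂b₂ = δ_h₂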
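One way to convince yourself the delta formulas are right is a gradient check: compare each backprop gradient with a finite-difference estimate obtained by nudging that parameter and re-measuring the loss. This is a standard technique rather than part of this lesson's trainer; the function names, the list layout in `NAMES`, the step size `eps`, and the choice of the row (1, 0) → 1 are all assumptions made for the sketch.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

NAMES = ["w11", "w12", "b1", "w21", "w22", "b2", "v1", "v2", "c"]

def loss_one(p, x1, x2, y):
    """Squared error of the 2-2-1 net on one example; p is nine floats in NAMES order."""
    w11, w12, b1, w21, w22, b2, v1, v2, c = p
    h1 = sigmoid(w11*x1 + w12*x2 + b1)
    h2 = sigmoid(w21*x1 + w22*x2 + b2)
    out = sigmoid(v1*h1 + v2*h2 + c)
    return (out - y)**2

def backprop_grads(p, x1, x2, y):
    """The nine gradients from the delta formulas, in NAMES order."""
    w11, w12, b1, w21, w22, b2, v1, v2, c = p
    h1 = sigmoid(w11*x1 + w12*x2 + b1)
    h2 = sigmoid(w21*x1 + w22*x2 + b2)
    out = sigmoid(v1*h1 + v2*h2 + c)
    d_out = 2*(out - y) * out*(1 - out)
    d_h1 = d_out*v1 * h1*(1 - h1)
    d_h2 = d_out*v2 * h2*(1 - h2)
    return [d_h1*x1, d_h1*x2, d_h1,
            d_h2*x1, d_h2*x2, d_h2,
            d_out*h1, d_out*h2, d_out]

def numeric_grads(p, x1, x2, y, eps=1e-6):
    """Central finite differences: nudge each parameter up and down, measure the loss change."""
    grads = []
    for i in range(len(p)):
        up = p[:i] + [p[i] + eps] + p[i+1:]
        down = p[:i] + [p[i] - eps] + p[i+1:]
        grads.append((loss_one(up, x1, x2, y) - loss_one(down, x1, x2, y)) / (2*eps))
    return grads

p0 = [0.5, -0.4, 0.1, 0.9, 0.3, -0.2, 0.8, -0.6, 0.05]   # the lesson's starting values
analytic = backprop_grads(p0, 1.0, 0.0, 1.0)
numeric = numeric_grads(p0, 1.0, 0.0, 1.0)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))

for name, a, n in zip(NAMES, analytic, numeric):
    print(f"{name:>3}  backprop {a:+.6f}  numeric {n:+.6f}")
```

The two columns should agree to many decimal places. Note that the gradients for w₁₂ and w₂₂ are exactly zero on this row, because x₂ = 0 means nothing flowed in through those weights.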
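Finally lesson 7's update rule, applied to all nine numbers at once: subtract the learning rate times the gradient. Writing η for the learning rate (lr in the code) and p for any one of the nine parameters:
p ← p − η·∂L/∂p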
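As a tiny illustration of the update rule on its own, the sketch below applies one step to three parameters. The helper `gradient_step` and the gradient values are hypothetical, chosen only to make the arithmetic easy to follow; the learning rate 2.0 matches the trainer in this lesson.

```python
def gradient_step(params, grads, lr):
    """Return new parameters: each one moved against its gradient, scaled by the learning rate."""
    return {name: value - lr * grads[name] for name, value in params.items()}

# Hypothetical gradients, not taken from a real training run
params = {"v1": 0.8, "v2": -0.6, "c": 0.05}
grads = {"v1": 0.1, "v2": -0.05, "c": 0.0}

stepped = gradient_step(params, grads, lr=2.0)
print(stepped)
```

A positive gradient lowers the parameter (v1 goes from 0.8 to about 0.6), a negative gradient raises it (v2 goes from −0.6 to about −0.5), and a zero gradient leaves it alone.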
In Nielsen's chapter 2 these appear as the four fundamental equations of backpropagation, written for any size of network: δ at the output is ∂C/∂a·σ′(z) (BP1); each earlier layer's δ is the next layer's δ carried back through the weights, times σ′(z) (BP2); every bias gradient is its δ (BP3); every weight gradient is δ times the activation entering it (BP4). Look back at ① — you have now personally written out all four, for the 2-2-1 case. The update rule v → v − η∇C is equation (11) of chapter 1.
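For reference, here are the four equations in the general notation of Nielsen's chapter 2, reproduced from that chapter as best recalled, so check the source for the exact typesetting. The superscript l indexes the layer, L is the last layer, a are activations, z are weighted sums, and ⊙ is the elementwise product.

```latex
\begin{align}
\delta^{L} &= \nabla_a C \odot \sigma'(z^{L}) \tag{BP1} \\
\delta^{l} &= \big((w^{l+1})^{T}\,\delta^{l+1}\big) \odot \sigma'(z^{l}) \tag{BP2} \\
\frac{\partial C}{\partial b^{l}_{j}} &= \delta^{l}_{j} \tag{BP3} \\
\frac{\partial C}{\partial w^{l}_{jk}} &= a^{l-1}_{k}\,\delta^{l}_{j} \tag{BP4}
\end{align}
```

In the 2-2-1 net the output layer has a single neuron, so the matrix product in BP2 collapses to one multiplication per hidden neuron: δ_out·v₁ for hidden neuron 1 and δ_out·v₂ for hidden neuron 2.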
Here it is: the complete trainer, Python and Swift. Pure standard library — import math / import Foundation and nothing else. Paste the Python into Google Colab and run it; the Swift runs as-is in an Xcode playground. The nine starting weights are hard-coded, so your run will print exactly the output shown after the code — both languages produce it character for character.
```swift
import Foundation

func sigmoid(_ z: Double) -> Double { 1 / (1 + exp(-z)) }

// the data: all four XOR cases
let data: [(x1: Double, x2: Double, y: Double)] = [
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)
]

// deterministic starting weights (same as the Python version)
var w11 = 0.5, w12 = -0.4, b1 = 0.1    // hidden neuron 1
var w21 = 0.9, w22 = 0.3, b2 = -0.2    // hidden neuron 2
var v1 = 0.8, v2 = -0.6, c = 0.05      // output neuron
let lr = 2.0

for epoch in 0...4000 {
    var gw11 = 0.0, gw12 = 0.0, gb1 = 0.0
    var gw21 = 0.0, gw22 = 0.0, gb2 = 0.0
    var gv1 = 0.0, gv2 = 0.0, gc = 0.0
    var loss = 0.0

    for (x1, x2, y) in data {
        // forward
        let h1 = sigmoid(w11*x1 + w12*x2 + b1)
        let h2 = sigmoid(w21*x1 + w22*x2 + b2)
        let out = sigmoid(v1*h1 + v2*h2 + c)
        loss += (out - y)*(out - y)

        // backward
        let dOut = 2*(out - y) * out*(1 - out)
        gv1 += dOut*h1; gv2 += dOut*h2; gc += dOut
        let dH1 = dOut*v1 * h1*(1 - h1)
        let dH2 = dOut*v2 * h2*(1 - h2)
        gw11 += dH1*x1; gw12 += dH1*x2; gb1 += dH1
        gw21 += dH2*x1; gw22 += dH2*x2; gb2 += dH2
    }

    let n = Double(data.count)
    loss /= n
    w11 -= lr*gw11/n; w12 -= lr*gw12/n; b1 -= lr*gb1/n
    w21 -= lr*gw21/n; w22 -= lr*gw22/n; b2 -= lr*gb2/n
    v1 -= lr*gv1/n; v2 -= lr*gv2/n; c -= lr*gc/n

    if epoch % 500 == 0 {
        print(String(format: "epoch %4d loss %.4f", epoch, loss))
    }
}

print()
for (x1, x2, y) in data {
    let h1 = sigmoid(w11*x1 + w12*x2 + b1)
    let h2 = sigmoid(w21*x1 + w22*x2 + b2)
    let out = sigmoid(v1*h1 + v2*h2 + c)
    print(String(format: "%d XOR %d -> %.3f (target %d)", Int(x1), Int(x2), out, Int(y)))
}
```
```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# --- the data: all four XOR cases (lesson 2's impossible problem) ---
data = [([0.0, 0.0], 0.0),
        ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0),
        ([1.0, 1.0], 0.0)]

# --- deterministic starting weights (small, asymmetric on purpose) ---
w11, w12, b1 = 0.5, -0.4, 0.1    # hidden neuron 1
w21, w22, b2 = 0.9, 0.3, -0.2    # hidden neuron 2
v1, v2, c = 0.8, -0.6, 0.05      # output neuron
lr = 2.0

for epoch in range(4001):
    gw11 = gw12 = gb1 = gw21 = gw22 = gb2 = gv1 = gv2 = gc = 0.0
    loss = 0.0

    for (x1, x2), y in data:
        # forward
        h1 = sigmoid(w11*x1 + w12*x2 + b1)
        h2 = sigmoid(w21*x1 + w22*x2 + b2)
        out = sigmoid(v1*h1 + v2*h2 + c)
        loss += (out - y)**2

        # backward
        d_out = 2*(out - y) * out*(1 - out)
        gv1 += d_out*h1; gv2 += d_out*h2; gc += d_out
        d_h1 = d_out*v1 * h1*(1 - h1)
        d_h2 = d_out*v2 * h2*(1 - h2)
        gw11 += d_h1*x1; gw12 += d_h1*x2; gb1 += d_h1
        gw21 += d_h2*x1; gw22 += d_h2*x2; gb2 += d_h2

    n = len(data)
    loss /= n
    w11 -= lr*gw11/n; w12 -= lr*gw12/n; b1 -= lr*gb1/n
    w21 -= lr*gw21/n; w22 -= lr*gw22/n; b2 -= lr*gb2/n
    v1 -= lr*gv1/n; v2 -= lr*gv2/n; c -= lr*gc/n

    if epoch % 500 == 0:
        print(f"epoch {epoch:4d} loss {loss:.4f}")

print()
for (x1, x2), y in data:
    h1 = sigmoid(w11*x1 + w12*x2 + b1)
    h2 = sigmoid(w21*x1 + w22*x2 + b2)
    out = sigmoid(v1*h1 + v2*h2 + c)
    print(f"{int(x1)} XOR {int(x2)} -> {out:.3f} (target {int(y)})")
```
And this is what it actually prints (run today, both languages, identical output):
```
epoch    0 loss 0.2518
epoch  500 loss 0.0888
epoch 1000 loss 0.0034
epoch 1500 loss 0.0015
epoch 2000 loss 0.0010
epoch 2500 loss 0.0007
epoch 3000 loss 0.0006
epoch 3500 loss 0.0005
epoch 4000 loss 0.0004

0 XOR 0 -> 0.020 (target 0)
0 XOR 1 -> 0.977 (target 1)
1 XOR 0 -> 0.981 (target 1)
1 XOR 1 -> 0.018 (target 0)
```
All four outputs on the correct side of 0.5 — two near 0, two near 1. The thing lesson 2 proved impossible for one neuron, nine numbers and a hidden layer just learned from data.
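The "correct side of 0.5" check can be written out explicitly. In this sketch the function name `verdict` is illustrative, and treating an output of exactly 0.5 as a 1 is an assumption (the lesson does not say which way a tie goes); the numbers are the final outputs printed above.

```python
def verdict(out):
    """Round an activation to a yes/no decision at the 0.5 threshold (ties go to 1 by assumption)."""
    return 1 if out >= 0.5 else 0

# Final outputs and targets copied from the printed run above
results = [(0.020, 0), (0.977, 1), (0.981, 1), (0.018, 0)]

verdicts = [verdict(out) for out, _ in results]
all_correct = all(verdict(out) == target for out, target in results)
print(verdicts, all_correct)   # prints: [0, 1, 1, 0] True
```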
Walk the program top to bottom and notice there is nothing in it you haven't already built:
- forward (the three sigmoid(...) lines): lesson 4's forward pass, lesson 1's neuron three times.
- loss ((out − y)², averaged over the 4 cases): lesson 6's mean squared error.
- backward (d_out, d_h1, d_h2 and the g… sums): lesson 8's deltas, the formulas from ① above, accumulated across the four examples.
- update (the -= lr*…/n block): lesson 7's gradient descent step on all nine parameters.
- the epoch loop around it all: lesson 9's training loop, unchanged.
Primary source: Michael Nielsen, Neural Networks and Deep Learning, chapters 1–2. Chapter 1 builds exactly this kind of network from scratch (his is bigger — it reads handwritten digits); chapter 2 derives the four backprop equations you wrote out by hand today. You now have every prerequisite to read both cover to cover — try it and notice how much is familiar.