AI Foundations · Lesson 12 · Phase 4 — Build your own

Real Data — Handwritten Digits

This is the finish line. Eleven lessons ago, a single neuron multiplied two numbers. Today you hand a network 70,000 scans of real human handwriting and watch it learn to read digits it has never seen, at roughly 97% accuracy. Your own model, on real data — the exact thing this course promised on day one.

Meet MNIST: 70,000 real handwritten digits

MNIST is the classic dataset every ML engineer trains on first: 70,000 grayscale images of handwritten digits 0–9, each 28×28 pixels, split into 60,000 training images and 10,000 test images (Wikipedia). The handwriting is real — collected by NIST from US Census Bureau employees and high-school students, then normalized into those neat 28×28 boxes. The split matters: the network learns from the 60,000, and we grade it on the 10,000 it has never seen. Good marks there mean it actually learned to read, not just memorized.

The leap from XOR: 784 in, 10 out

Your net from lessons 10–11 had 2 inputs and 1 output. Reading digits needs the same machine, just wider at both ends:

Input — 784 neurons. A 28×28 image is a grid of 784 pixel brightnesses (28×28 = 784). Lesson 3's move pays off: unroll the grid row by row into one long vector — a plain list of 784 numbers, each pixel scaled to 0…1. The network never knows it was a picture. This unrolling is called flattening, and PyTorch has a layer for it: nn.Flatten.
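To make the unrolling concrete, here is a minimal plain-Python sketch of flattening. The grid below is a made-up stand-in for one image (all black except one bright pixel), not real MNIST data, and the names grid and x are chosen only for illustration.

```python
# Hypothetical stand-in for one MNIST image: a 28x28 grid of pixel bytes (0-255).
# Every pixel is 0 except one fully bright pixel at row 2, column 5.
grid = [[0] * 28 for _ in range(28)]
grid[2][5] = 255

# Flatten row by row and scale each pixel byte to the 0..1 range.
x = [pixel / 255 for row in grid for pixel in row]

print(len(x))         # 784
print(x[2 * 28 + 5])  # 1.0 (row 2, column 5 lands at index 2*28 + 5 = 61)
```

The index arithmetic is the whole trick: the pixel at row r, column c ends up at position r·28 + c in the long vector.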

Output — 10 neurons, one per digit. Each output neuron produces a raw score: "how much does this look like a 0… a 1… a 9?" These raw, unsquashed scores are called logits. The verdict rule is one word: the biggest score wins.

Spelled out, the whole pipeline is four steps:

1. flatten: 28×28 grid → x = (x₁, x₂, …, x₇₈₄)
2. hidden layer: h = ReLU(W₁·x + b₁) → 128 activations
3. scores: s = W₂·h + b₂ → 10 logits
4. verdict: answer = argmax(s) → index of the biggest score

Symbol by symbol: x is the 784 pixels. W₁ is the first weight grid — 128 neurons × 784 inputs = 100,352 weights — and b₁ is its 128 biases. ReLU is lesson 5 paying off: the kink that keeps stacked layers from collapsing into one. W₂, b₂ map those 128 activations to 10 scores (1,290 more parameters). argmax just means "index of the largest entry" — Swift's scores.firstIndex(of: scores.max()!). Total: 101,770 learnable knobs, every one trained by the same gradient-descent loop from lessons 7–9. The compact, professional form is one line:

ŷ = argmax( W₂ · ReLU(W₁·x + b₁) + b₂ )
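The four steps can also be sketched in plain Python, without PyTorch. This is an untrained illustration under stated assumptions: the weights are small random numbers, the "image" is random noise, and the helpers make_layer, linear, relu and argmax are hypothetical names written for this sketch. The verdict is therefore meaningless; only the shapes and the parameter count are the point.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

def make_layer(n_out, n_in):
    """Return (weights, biases): small random weights, zero biases."""
    W = [[random.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def linear(W, b, v):
    """Compute W·v + b for a list-of-lists W and plain lists b and v."""
    return [sum(w * vi for w, vi in zip(row, v)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Replace every negative entry with 0."""
    return [max(0.0, vi) for vi in v]

def argmax(v):
    """Index of the largest entry."""
    return max(range(len(v)), key=lambda i: v[i])

W1, b1 = make_layer(128, 784)  # 100,352 weights + 128 biases
W2, b2 = make_layer(10, 128)   # 1,280 weights + 10 biases

x = [random.random() for _ in range(784)]  # step 1: a fake, already-flattened image

h = relu(linear(W1, b1, x))  # step 2: 128 activations
s = linear(W2, b2, h)        # step 3: 10 logits
answer = argmax(s)           # step 4: index of the biggest score

n_params = (sum(len(row) for row in W1) + len(b1)
            + sum(len(row) for row in W2) + len(b2))
print(len(h), len(s), n_params)  # 128 10 101770
```

Training changes only the numbers inside W1, b1, W2 and b2; the four steps themselves stay exactly as written.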

Feel it: from pixels to a verdict

Hover (or tap) a digit and watch it flow through the net:

Honesty note: this widget is a cartoon of the data flow — the bars are staged, no real network runs in this page. The real one, with all 101,770 trained weights, is the program below.

The loss for classification: cross-entropy

Lesson 6 measured "how wrong" with squared error and promised a better tool for classification. Here it is. First the network's 10 logits are converted into probabilities that sum to 1 (a step called softmax: the bigger the score, the bigger the share). Then cross-entropy loss looks at a single number, the probability the network gave to the correct digit, and takes its negative logarithm (ln is the natural logarithm: log() in Swift, math.log() in Python):

loss = −ln( p_{correct digit} )

Feel the shape of it: 90% sure and right → loss 0.105, tiny. A coin-flip 50% → 0.693. Only 10% on the right answer → 2.303, and as that probability heads toward zero the loss blows up toward infinity. Confident wrongness gets punished hardest — exactly the pressure that drives the weights toward reading digits. In PyTorch this whole story is one object, nn.CrossEntropyLoss, which takes the raw logits directly and does the softmax internally (PyTorch docs: "the input is expected to contain the unnormalized logits for each class").
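The numbers above can be checked by hand. The sketch below is a plain-Python illustration of softmax followed by the negative log, not PyTorch's internal implementation; the function names and the example logits are made up for this sketch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, correct):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(softmax(logits)[correct])

# The three probabilities from the text.
for p in (0.9, 0.5, 0.1):
    print(f"p = {p}: loss = {-math.log(p):.3f}")  # 0.105, 0.693, 2.303

# Made-up logits: the network leans toward digit 3.
logits = [0.0] * 10
logits[3] = 2.0
probs = softmax(logits)

loss_if_3 = cross_entropy(logits, 3)  # the favoured digit is the right one
loss_if_7 = cross_entropy(logits, 7)  # the right digit got a small share

# Ten equal logits mean 10% on every digit.
uniform_loss = cross_entropy([0.0] * 10, 0)
```

One handy consequence: ten equal logits give every digit a 10% share, so a network that is purely guessing has a loss of about 2.303. A first-epoch loss well below that is a sign that learning has started.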

The complete program

Everything below is lesson 11's PyTorch skeleton with exactly three changes: real data instead of XOR, 784→128→10 instead of 2→2→1, and cross-entropy instead of squared error. It is Python only, because PyTorch is a Python library and there is no Swift build of it. The structure follows the official PyTorch Quickstart. The dataset call is torchvision's MNIST, and ToTensor() converts each image from 0–255 pixel bytes to floats scaled 0…1, with shape 1×28×28; the unrolling to 784 is nn.Flatten's job, inside the model.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# 1. Data: 60,000 training digits + 10,000 test digits, 28x28 each
train_data = datasets.MNIST(root="data", train=True,  download=True, transform=ToTensor())
test_data  = datasets.MNIST(root="data", train=False, download=True, transform=ToTensor())

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=64)

# 2. The network: 784 -> 128 -> 10
model = nn.Sequential(
    nn.Flatten(),         # 28x28 grid -> vector of 784 (lesson 3)
    nn.Linear(784, 128),  # weights + biases (lessons 3-4)
    nn.ReLU(),            # the kink that lets layers matter (lesson 5)
    nn.Linear(128, 10),   # 10 scores, one per digit
)

loss_fn   = nn.CrossEntropyLoss()                       # classification loss (lesson 6)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1) # gradient descent (lesson 7)

# 3. Train: the loop from lesson 9, run by PyTorch (lesson 11)
for epoch in range(3):
    model.train()
    for X, y in train_loader:
        loss = loss_fn(model(X), y)   # forward pass + how wrong?
        optimizer.zero_grad()
        loss.backward()               # backpropagation (lesson 8)
        optimizer.step()              # nudge all 101,770 knobs
    print(f"epoch {epoch + 1}: last batch loss {loss.item():.3f}")

# 4. Test on the 10,000 digits the network has never seen
model.eval()
correct = 0
with torch.no_grad():
    for X, y in test_loader:
        pred = model(X)                                # 10 scores per image
        correct += (pred.argmax(1) == y).sum().item()  # biggest score wins
print(f"test accuracy: {correct / len(test_data):.1%}")

Reading guide: each DataLoader serves its data in mini-batches of 64; the training loader (shuffle=True) also reshuffles every epoch, where one epoch is one full pass over all 60,000 training images (lesson 9). pred.argmax(1) picks the index of the winning score for each image; comparing those indices with the true labels y and summing counts the hits. Note what the loop does not contain: there is no derivative code anywhere, because loss.backward() is lesson 8, done for you.
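The batching and hit-counting arithmetic can be sketched in plain Python. The scores and labels below are invented for illustration, and the sketch uses 4 classes instead of 10 to stay short; it mirrors what argmax(1), the comparison and the sum do, but it is not how PyTorch implements them.

```python
import math

# Mini-batches per epoch: 60,000 training images served 64 at a time.
batches_per_epoch = math.ceil(60_000 / 64)  # 938
last_batch_size = 60_000 - 937 * 64         # 32 (the final batch is smaller)

# Made-up scores for a batch of 3 images, with their true labels.
scores = [
    [0.1, 2.3, -1.0, 0.4],  # biggest score at index 1
    [1.5, 0.2, 0.3, 0.1],   # biggest score at index 0
    [0.0, 0.1, 0.2, 3.0],   # biggest score at index 3
]
labels = [1, 2, 3]

# Per-image argmax: the index of the biggest score in each row.
preds = [max(range(len(row)), key=lambda i: row[i]) for row in scores]

# Compare with the true labels and count the hits.
correct = sum(p == y for p, y in zip(preds, labels))

print(batches_per_epoch, last_batch_size)   # 938 32
print(preds, correct)                       # [1, 0, 3] 2
print(f"{correct / len(labels):.1%}")       # 66.7%
```

With 938 batches per epoch, three epochs amount to 2,814 weight updates in total.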

Run it in Colab — three steps

① Open colab.research.google.com and click New notebook (torch and torchvision are preinstalled — nothing touches your Mac). ② Paste the program into the empty cell. ③ Press Runtime → Run all. The first run downloads MNIST into the notebook (a few seconds), then trains: three epochs take a couple of minutes on the free CPU runtime.

Expect a test accuracy of roughly 97%. Runs vary because the weights start random, but this ballpark is typical for a plain MLP like ours: Wikipedia's MNIST results table lists a plain two-layer 784–800–10 network at 1.6% error, i.e. 98.4% accuracy. Our smaller, briefly trained 128-neuron version reading about 97 out of every 100 unseen handwritten digits correctly is exactly where it should be. Stop and let that land: you just trained that.

Check yourself

No peeking back. Pull it from memory.

1. Why does the input layer need exactly 784 neurons?

2. How does the finished network choose which digit to answer?

3. When does cross-entropy loss blow up toward huge values?

The whole course, one breath

A neuron is multiply-add-squash (1); it draws a decision line (2); a layer of them is vectors and weights in bulk (3); layers stack into a forward pass (4); ReLU keeps depth meaningful (5); loss measures how wrong (6); gradients point downhill and we step (7); backprop computes every gradient through the layers (8); the training loop repeats the nudge (9); you built all of it raw in Python (10); PyTorch compressed it into a few lines (11); and today it read real handwriting (12).

Read this next

Primary source: the official PyTorch Quickstart tutorial — the same pipeline you just ran, written by the PyTorch team, plus the pieces we skipped (GPU device selection, saving and loading a trained model). You can now read every line of it.

And what comes after this finish line? The same neurons, weights and gradient descent you now own are the literal substrate of GPT-style models, and that's exactly where this course goes next: Phase 5 — Under the Hood of GPT, six lessons from tokens to the transformer to why ChatGPT chats and codes. Your foundation is complete; now we climb into the machine you actually came for.