In lesson 10 you trained an XOR network from scratch — you wrote the forward pass, the loss, the backward pass, the update, all by hand. Today you rebuild the exact same net in PyTorch, the library most of the AI world runs on, in about 25 lines. The point is not new math — there is none. The point is discovering you can already read every line, because you have personally written the thing each line replaces. That fluency is what lesson 12 (real handwritten digits — the mission's finish line) is built on.
Your lesson-10 net worked. So why bother with a library? Three reasons:
1. Autograd. The hardest code you wrote in this course was the lesson-8 backward pass — the chain rule, applied by hand, one derivative at a time. PyTorch has an engine called autograd (automatic gradient) that, in the official tutorial's words, "supports automatic computation of gradient for any computational graph" (pytorch.org). You build the forward pass; it derives the backward pass. Your whole lesson 8, one function call.
2. Tensors. A tensor is PyTorch's array type — "a specialized data structure very similar to arrays and matrices", like a nested Swift [[Double]] but heavily optimized, and able to "run on GPUs or other hardware accelerators" (pytorch.org). Your Python lists did one multiply at a time; tensors do whole layers at once.
3. Everything is pre-built. Layers, losses, update rules — debugged, fast, ready. You stop re-typing the plumbing and spend your time on the actual network.
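To make the first reason concrete, here is a minimal sketch of autograd on a single weight and bias. The numbers (w = 0.5, b = -0.2, x = 2, target = 1) and the one-neuron squared-error setup are illustrative assumptions, not values from the earlier lessons. The hand-derived gradients use the same chain rule you applied in lesson 8.

```python
import torch

# Illustrative values (assumptions, not taken from the lessons).
w = torch.tensor(0.5, requires_grad=True)    # a weight autograd should track
b = torch.tensor(-0.2, requires_grad=True)   # a bias autograd should track
x = torch.tensor(2.0)
target = torch.tensor(1.0)

# Forward pass: one linear neuron with a squared-error loss.
loss = (w * x + b - target) ** 2

# Backward pass: autograd fills in w.grad and b.grad.
loss.backward()

# The same derivatives by hand, via the chain rule.
error = 0.5 * 2.0 + (-0.2) - 1.0   # prediction minus target
hand_dw = 2 * error * 2.0          # d(loss)/dw = 2 * error * x
hand_db = 2 * error                # d(loss)/db = 2 * error

print("autograd:", w.grad.item(), b.grad.item())
print("by hand: ", hand_dw, hand_db)
```

The two printed lines should agree up to floating-point rounding, since tensors default to 32-bit floats while plain Python uses 64-bit.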
Nothing gets installed on your Mac. Open colab.research.google.com, create a new notebook — and that's it: Colab's environment already has PyTorch in it, so import torch just works. (The official PyTorch-in-Colab guide warns of only one thing — a version of PyTorch "that has just been released … might not be yet available in Google Colab" — irrelevant here; if you ever do need a newer one, a single pip cell inside Colab upgrades it. Never on your machine.)
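If you want to confirm this in a fresh notebook, a one-cell check like the sketch below should do. It assumes nothing beyond PyTorch being importable; the version string and whether a GPU is attached depend on your Colab session.

```python
import torch

# Which PyTorch build does this environment ship with?
print("PyTorch version:", torch.__version__)

# True only if the runtime has a CUDA GPU attached; XOR trains fine without one.
print("GPU available:", torch.cuda.is_available())
```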
Python only today. That's the one exception to our two-language rule: the ML ecosystem (PyTorch, the tutorials, Colab itself) is Python-first, so Swift sits this lesson out. Everything still transfers, because you already wrote this exact net by hand. Paste this into a Colab cell. Don't run it yet. First read it and try to recognize each line:
import torch
import torch.nn as nn

# the XOR data, as tensors (PyTorch's array type)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

# the lesson-10 architecture: 2 inputs -> 2 hidden -> 1 output
model = nn.Sequential(
    nn.Linear(2, 2),  # hidden layer: 4 weights + 2 biases, made for you
    nn.Sigmoid(),     # your sigmoid, as a layer
    nn.Linear(2, 1),  # output layer: 2 weights + 1 bias
    nn.Sigmoid(),
)

loss_fn = nn.MSELoss()  # lesson 6
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)  # lesson 7

for epoch in range(5000):  # lesson 9: the training loop
    pred = model(X)  # forward pass (lessons 3-4)
    loss = loss_fn(pred, Y)  # how wrong? (lesson 6)
    optimizer.zero_grad()  # clear old gradients
    loss.backward()  # lesson 8, done for you
    optimizer.step()  # w -= lr * grad (lesson 7)
    if epoch % 1000 == 0:
        print(f"epoch {epoch:4d} loss {loss.item():.4f}")

print(model(X))  # should land near 0, 1, 1, 0
Two small new words in there. An optimizer is the object that owns the update step: you hand it the model's parameters and a learning rate, and it applies lesson 7's rule for you. SGD stands for stochastic gradient descent, the plain version of what you already know. The torch.optim.SGD docs list a timid default lr of 0.001; we pass lr=1.0 explicitly, like in lesson 10. And loss.item() just pulls the plain Python number out of a one-element tensor: the docs say it "returns the value of this tensor as a standard Python number" (pytorch.org).
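As a small illustration of what .item() does, the sketch below builds a one-element tensor by hand (the value 0.25 is an arbitrary stand-in for a real loss) and pulls the number out.

```python
import torch

loss = torch.tensor(0.25)   # stand-in for a loss tensor; value chosen arbitrarily
value = loss.item()         # a plain Python float, no tensor wrapper

print(type(loss).__name__, type(value).__name__)
print(f"loss {value:.4f}")  # formats like any other float
```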
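What does optimizer.step() actually do? Exactly your lesson-7 update, once per parameter. Spelled out:

$$\text{new parameter} = \text{old parameter} - \text{learning rate} \times \text{gradient}$$

The SGD docs write the same thing compactly:

$$\theta_t \leftarrow \theta_{t-1} - \gamma \, g_t$$

Symbol by symbol: θ (theta) is "any parameter", meaning each of this net's 9 weights and biases in turn. γ (gamma) is the learning rate, our lr=1.0. g is the gradient of the loss for that parameter, the number loss.backward() just computed and stored on the parameter's .grad attribute. The subscript t counts update steps, so the θ with t−1 is the value before the step and the θ with t is the value after it. And ← means "assign back": it's a -=. Same rule you hand-coded, new outfit.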
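One way to convince yourself of this is to apply the rule by hand and compare it with what the optimizer does. The sketch below rebuilds the lesson's net, counts its parameters, and checks a single update. The seed value and the variable names (expected, same) are illustrative choices, and the comparison assumes plain SGD with no momentum or weight decay, which is what the constructor call here requests.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so reruns match

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])
model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())
lr = 1.0
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Count every individual weight and bias the optimizer will update.
n_params = sum(p.numel() for p in model.parameters())
print("parameters:", n_params)

# One forward and backward pass so every parameter has a .grad.
loss = nn.MSELoss()(model(X), Y)
optimizer.zero_grad()
loss.backward()

# Apply the rule by hand on copies: theta - lr * g.
expected = [p.detach().clone() - lr * p.grad for p in model.parameters()]

# Let the optimizer do the real update, then compare.
optimizer.step()
same = all(torch.allclose(p.detach(), e)
           for p, e in zip(model.parameters(), expected))
print("manual rule matches optimizer.step():", same)
```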
Here is the entire mapping. Left column: the jobs you coded by hand in lessons 6–10. Right column: the PyTorch line that does that job. Nothing on the left disappeared — it all still runs, just inside the library.
| You wrote (lessons 6–10) | PyTorch writes |
|---|---|
| Nested lists of numbers for inputs and targets | torch.tensor([...]) |
| Weight and bias lists, filled with small random starting values | nn.Linear(2, 2) — creates and initializes them, from a documented uniform range (docs) |
| The weighted sum z = w·x + b for a whole layer (lesson 3) | Also nn.Linear: its docs write it y = xAᵀ + b, where A is the weight matrix (the ᵀ is a storage detail, not new math) |
| Your sigmoid() function after each layer (lessons 1, 5) | nn.Sigmoid(): same 1/(1+exp(−x)) (docs) |
| Calling layer after layer, output feeding the next input (lesson 4) | nn.Sequential(...) — "chains outputs to inputs sequentially" (docs); then pred = model(X) runs the whole forward pass |
| loss = mean((pred − y)²) (lesson 6) | nn.MSELoss(): squared error, averaged by default (docs) |
| The entire chain-rule backward pass (lesson 8 — the hard one) | loss.backward() |
| Resetting gradient accumulators to zero each pass (lesson 9) | optimizer.zero_grad() — needed because "gradients by default add up" (tutorial) |
| w -= lr * grad for every weight and bias (lesson 7) | optimizer.step() |
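Several rows of the table can be checked directly rather than taken on trust. The sketch below compares each PyTorch piece with the hand-written formula from the left column. The seed, the made-up predictions, and the toy function 2w used for the accumulation check are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so reruns match

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

with torch.no_grad():  # plain comparisons, no gradients needed
    # Row: weighted sum. nn.Linear versus x @ A^T + b written out.
    layer = nn.Linear(2, 2)
    z = layer(X)
    linear_ok = torch.allclose(z, X @ layer.weight.T + layer.bias)

    # Row: sigmoid. nn.Sigmoid versus 1 / (1 + exp(-z)).
    sigmoid_ok = torch.allclose(nn.Sigmoid()(z), 1 / (1 + torch.exp(-z)))

    # Row: loss. nn.MSELoss versus the mean of squared differences.
    pred = torch.tensor([[0.1], [0.8], [0.7], [0.2]])  # made-up predictions
    mse_ok = torch.allclose(nn.MSELoss()(pred, Y), ((pred - Y) ** 2).mean())

# Row: zero_grad. Two backward calls without a reset add their gradients.
w = torch.tensor(3.0, requires_grad=True)
(w * 2).backward()
first = w.grad.item()    # gradient of 2w with respect to w
(w * 2).backward()
second = w.grad.item()   # the second gradient was added onto the first

print(linear_ok, sigmoid_ok, mse_ok)
print(first, second)     # 2.0 4.0
```

The last line is the reason the training loop clears gradients every pass: without the reset, each step would use the sum of all gradients computed so far.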
This lesson was written on a no-install machine, so the output below was not produced by PyTorch. It is representative: computed by running the mathematically identical from-scratch net (same architecture, same initialization range, same loss, same update rule) in plain Python. Your job — the real exercise of this lesson — is to paste the code into Colab, run it, and confirm the shape of the story matches:
epoch 0 loss 0.2504
epoch 1000 loss 0.0993
epoch 2000 loss 0.0037
epoch 3000 loss 0.0016
epoch 4000 loss 0.0010
final predictions ≈ 0.03, 0.97, 0.97, 0.03 → XOR learned
Your exact numbers will differ — the starting weights are random — but the story must be the same: loss starts near 0.25 (that's the score for always answering 0.5, the do-nothing baseline: mean of four (0.5 − target)² is exactly 0.25), then falls toward zero, and the four predictions land near 0, 1, 1, 0.
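The 0.25 baseline is quick to verify in plain Python, with no PyTorch needed. The variable names below are only for illustration.

```python
targets = [0.0, 1.0, 1.0, 0.0]   # the four XOR answers
guess = 0.5                      # the do-nothing prediction

# Squared error of the constant guess against each target.
squared_errors = [(guess - t) ** 2 for t in targets]
baseline = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [0.25, 0.25, 0.25, 0.25]
print(baseline)        # 0.25
```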
If your loss gets stuck near 0.25 and won't budge after thousands of epochs: run the cell again. A fresh run re-rolls the random starting weights, and some starts are simply bad for this tiny net — when I ran the identical from-scratch math across 20 random starts, 16 converged and 4 stalled. That's not a bug in your code; it's lesson 7's landscape having flat spots.
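If you would rather see good and bad starts side by side than re-run by hand, one option is to fix the random seed before building the model. The sketch below relies on torch.manual_seed seeding PyTorch's default random generator, which nn.Linear draws its starting weights from. The helper name train_xor, the seeds 0 to 3, and the 0.05 cutoff for "converged" are illustrative choices, not part of the lesson. Which seeds stall may differ between PyTorch versions, so no particular outcome is promised here.

```python
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

def train_xor(seed, epochs=5000, lr=1.0):
    """Train the lesson's 2-2-1 net from a seeded start; return the final loss."""
    torch.manual_seed(seed)  # fixes the random starting weights
    model = nn.Sequential(
        nn.Linear(2, 2), nn.Sigmoid(),
        nn.Linear(2, 1), nn.Sigmoid(),
    )
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = loss_fn(model(X), Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():  # final score only, no gradient bookkeeping
        return loss_fn(model(X), Y).item()

# Try a handful of starts and label each one.
final_losses = {seed: train_xor(seed) for seed in range(4)}
for seed, final in final_losses.items():
    status = "converged" if final < 0.05 else "stalled"
    print(f"seed {seed}: final loss {final:.4f} ({status})")
```

A seed that stalls here will stall every time you run it, which makes a bad start something you can inspect instead of something that vanishes on the next re-run.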
No peeking back. Pull it from memory.
1. Why call optimizer.zero_grad() every iteration?
2. What does optimizer.step() do to each parameter?

Primary source: the official PyTorch tutorial "Learn the Basics". It has seven short sections (tensors, datasets, building models, autograd, the optimization loop…), each with a "Run in Google Colab" link at the top so you stay zero-install. You have now hand-built everything it describes, so it will read like a tour of your own house. The autograd and optimization sections are the ones that map straight onto today.