Every formula of the course on one printable page. Numbers in examples are real computed values. Lesson numbers in brackets.
Weights set the line's direction; bias slides it. Fires (a > 0.5) exactly when z > 0. Hand-built AND gate: w = (10, 10), b = −15. Same weights, b = −5 → OR.
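A minimal sketch of the hand-built gates, assuming a sigmoid activation as in the numbers on this sheet. The helper names `sigmoid` and `neuron` are illustrative only.

```python
import math

def sigmoid(z):
    """Squash any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Return (z, a) for one neuron: weighted sum plus bias, then sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
w = (10, 10)

# Same weights, different bias: b = -15 gives AND, b = -5 gives OR.
and_fires = [neuron(x, w, -15)[1] > 0.5 for x in inputs]
or_fires = [neuron(x, w, -5)[1] > 0.5 for x in inputs]

z11, a11 = neuron((1, 1), w, -15)
print(and_fires)           # [False, False, False, True]
print(or_fires)            # [False, True, True, True]
print(z11, round(a11, 4))  # 5 0.9933
```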
Shape rule: (n×m matrix)·(vector of m) → vector of n. Output entry j = dot product of row j with x, which is one neuron's weighted sum (before bias and activation).
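The shape rule can be checked in a few lines. The weights below are hypothetical, picked only to make the shapes visible; they are not the weights behind the lesson's numbers.

```python
def matvec(W, x):
    """Multiply an n×m matrix (list of n rows) by a vector of m entries."""
    assert all(len(row) == len(x) for row in W), "each row needs m entries"
    # Entry j of the result is the dot product of row j with x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical 2×3 weight matrix: 3 inputs feeding 2 neurons.
W = [[1.0, 0.0, 2.0],
     [0.5, -1.0, 0.0]]
x = [0.5, -1.0, 2.0]

z = matvec(W, x)
print(len(W), len(W[0]), len(x), len(z))  # 2 3 3 2
print(z)                                  # [4.5, 1.25]
```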
Hand-built XOR (2→2→1): hidden h₁ = OR (20, 20, −10), h₂ = NAND (−20, −20, +30), output = AND (20, 20, −30).
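A sketch of the same 2→2→1 network, assuming sigmoid units and reading each triple as (w₁, w₂, b). The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit(x1, x2, w1, w2, b):
    """One sigmoid neuron with two inputs."""
    return sigmoid(w1 * x1 + w2 * x2 + b)

def xor_net(x1, x2):
    h1 = unit(x1, x2, 20, 20, -10)    # hidden unit 1: OR
    h2 = unit(x1, x2, -20, -20, 30)   # hidden unit 2: NAND
    return unit(h1, h2, 20, 20, -30)  # output: AND of the two hidden units

table = [round(xor_net(a, b), 5) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(table)  # [5e-05, 0.99995, 0.99995, 5e-05]
```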
Without a nonlinearity, stacked layers collapse to one linear map: W₂(W₁x) = (W₂W₁)x. ReLU is the modern default for hidden layers; sigmoid survives at outputs for probabilities.
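The collapse is easy to see numerically. This sketch uses hypothetical 2×2 weights with whole-number values so the arithmetic is exact.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product of A (n×k) and B (k×m)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, t) for t in v]

# Hypothetical weights, not taken from any lesson.
W1 = [[1.0, -2.0], [3.0, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers in a row
merged = matvec(matmul(W2, W1), x)           # one layer with weights W2·W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # ReLU in between breaks the equality

print(stacked, merged)  # [-1.0, 23.0] [-1.0, 23.0]
print(with_relu)        # [5.0, 20.0]
```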
Squaring kills the sign and magnifies big misses. Floor is 0, hit when every ŷ = y. Nielsen writes C = (1/2n)Σ‖y−a‖² — the extra ½ is cosmetic (cancels the derivative's 2). Loss is a function of the weights; training = minimizing it.
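A short sketch of the mean squared error, reproducing the [L6] value from this sheet. The function name `mse` is illustrative.

```python
def mse(preds, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

loss = mse([0.9, 0.2, 0.4], [1, 0, 1])   # (0.01 + 0.04 + 0.36) / 3
perfect = mse([1, 0, 1], [1, 0, 1])      # every prediction equals its target

print(round(loss, 4))  # 0.1367
print(perfect)         # 0.0
```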
∂L/∂w = slope of the loss at the current w. η (eta) = learning rate: too small crawls, too big diverges. The derivative of (w−3)² is 2(w−3).
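The descent on (w−3)² can be replayed as below, assuming the usual update w ← w − η·∂L/∂w. It reproduces the [L7] values w₁ and w₈ from this sheet.

```python
def grad(w):
    """Derivative of the loss (w - 3)**2."""
    return 2 * (w - 3)

w, eta = 0.0, 0.1
history = []
for step in range(8):
    w = w - eta * grad(w)   # step against the slope
    history.append(w)

print(round(history[0], 4), round(history[7], 4))  # 0.6 2.4967
```

Each step shrinks the distance to the minimum at w = 3 by a factor of 1 − 2η = 0.8, which is why a larger η would overshoot.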
Every weight's gradient = its δ × the activation entering it; every bias gradient = the bare δ. These are Nielsen's BP1–BP4, written small.
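A sketch of the δ rule on a single sigmoid neuron with squared-error loss, checked against finite differences. The example values, the loss L = (a − y)², and the helper names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return (a - y) ** 2

def shifted(w, i, amount):
    """Copy of w with entry i nudged by amount."""
    return [wj + amount if j == i else wj for j, wj in enumerate(w)]

# Hypothetical single example and parameters.
x, y = [0.5, -1.0], 1.0
w, b = [0.3, 0.8], 0.1

a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
delta = 2 * (a - y) * a * (1 - a)   # dL/dz for this neuron
grad_w = [delta * xi for xi in x]   # delta times the activation entering the weight
grad_b = delta                      # the bare delta

# Finite-difference estimates of the same gradients.
eps = 1e-6
num_w = [(loss(shifted(w, i, eps), b, x, y) - loss(shifted(w, i, -eps), b, x, y)) / (2 * eps)
         for i in range(len(w))]
num_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

weights_match = all(abs(g - n) < 1e-6 for g, n in zip(grad_w, num_w))
bias_matches = abs(grad_b - num_b) < 1e-6
print(weights_match, bias_matches)  # True True
```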
Batch GD = one update per pass, computed from the summed gradients of all examples; SGD = one update per example (or per mini-batch).
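The difference is only in when the update happens. This sketch fits a single weight to hypothetical data for y = 2x; the learning rate and epoch count are assumed values, and the examples are visited in a fixed order where real SGD would usually shuffle them.

```python
# Hypothetical data for y = 2x, fitted with one weight w.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
lr, epochs = 0.01, 50   # assumed values

def grad(w, x, y):
    """Derivative of (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

# Batch GD: sum every example's gradient, then update once per epoch.
w_batch, batch_updates = 0.0, 0
for _ in range(epochs):
    total = sum(grad(w_batch, x, y) for x, y in zip(xs, ys))
    w_batch -= lr * total
    batch_updates += 1

# SGD with mini-batch size 1: update after every example.
w_sgd, sgd_updates = 0.0, 0
for _ in range(epochs):
    for x, y in zip(xs, ys):
        w_sgd -= lr * grad(w_sgd, x, y)
        sgd_updates += 1

print(batch_updates, sgd_updates)          # 50 150
print(round(w_batch, 3), round(w_sgd, 3))  # 2.0 2.0
```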
| you wrote | PyTorch |
|---|---|
| forward pass through the layers | pred = model(X) |
| mean of (ŷ−y)² | nn.MSELoss() |
| reset gradient accumulators | optimizer.zero_grad() |
| the whole backward pass | loss.backward() |
| w -= lr*grad everywhere | optimizer.step() |
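
The left column of the table, written out by hand for one sigmoid neuron learning AND, with the matching PyTorch call noted in comments. This is a sketch in plain Python: the learning rate, epoch count, and zero initialization are assumptions, so the trained weights need not match the [L9] values on this sheet.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 0, 0, 1]                       # AND targets

w, b = [0.0, 0.0], 0.0
lr, epochs = 1.0, 5000                 # assumed hyperparameters

for _ in range(epochs):
    grad_w, grad_b = [0.0, 0.0], 0.0   # reset accumulators   ~ optimizer.zero_grad()
    preds = [sigmoid(w[0] * x1 + w[1] * x2 + b) for x1, x2 in X]  # ~ pred = model(X)
    # mean of (pred - y)^2             ~ nn.MSELoss()
    loss = sum((p - y) ** 2 for p, y in zip(preds, Y)) / len(X)
    # the whole backward pass          ~ loss.backward()
    for (x1, x2), p, y in zip(X, preds, Y):
        delta = 2 * (p - y) * p * (1 - p) / len(X)
        grad_w[0] += delta * x1
        grad_w[1] += delta * x2
        grad_b += delta
    # w -= lr*grad everywhere          ~ optimizer.step()
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b -= lr * grad_b

rounded = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for x1, x2 in X]
print(rounded)  # [0, 0, 0, 1]
```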
Bigram: P(b|a) = count(a→b) ⁄ count(a→anything). Greedy (always argmax) repeats one answer forever; sampling buys variety.
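A minimal bigram sketch over a hypothetical three-name corpus (not the lessons' names data), with "." as an assumed end-of-name marker. It shows the count ratio, the greedy pick, and count-weighted sampling.

```python
from collections import Counter, defaultdict
import random

# Hypothetical toy corpus; "." marks the end of a name.
names = ["ana", "ann", "nan"]

counts = defaultdict(Counter)
for name in names:
    chars = list(name) + ["."]
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def prob(a, b):
    """P(b | a) = count(a -> b) / count(a -> anything)."""
    return counts[a][b] / sum(counts[a].values())

def greedy(a):
    """Always pick the most frequent follower of a."""
    return counts[a].most_common(1)[0][0]

def sample(a, rng):
    """Pick a follower of a in proportion to its count."""
    followers = list(counts[a])
    weights = [counts[a][f] for f in followers]
    return rng.choices(followers, weights=weights)[0]

print(prob("a", "n"))                # 0.75
print(greedy("a"))                   # n
print(sample("a", random.Random(0)))
```

Calling `greedy("a")` returns the same follower every time, while `sample` will occasionally end the name instead.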
Similar contexts → nearby vectors; directions = concepts (king − man + woman ≈ queen).
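The analogy arithmetic can be sketched with hand-made 2-D vectors. These are hypothetical values, with one axis standing for gender and the other for royalty, chosen so the result lands exactly on queen; learned embeddings only approximate this.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the two lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical vectors: axis 0 ~ gender, axis 1 ~ royalty.
vec = {
    "man":   [1, 0],
    "woman": [-1, 0],
    "king":  [1, 2],
    "queen": [-1, 2],
}

target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max(vec, key=lambda word: cosine(target, vec[word]))

print(target, best)                            # [-1, 2] queen
print(round(cosine(target, vec["queen"]), 3))  # 1.0
```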
q = what I seek · k = what I am · v = what I give. Causal mask: attend only to self + earlier tokens. New vector = weighted sum of values. Heads = several attentions in parallel.
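A single-head sketch of causally masked attention in plain Python. The queries, keys, and values are hypothetical, and dividing the scores by √d is the common scaled dot-product convention, assumed here rather than taken from this sheet.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single head: token i attends to tokens 0..i only."""
    d = len(Q[0])
    out, all_weights = [], []
    for i, q in enumerate(Q):
        # Score the query against itself and earlier tokens; later ones are masked.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        all_weights.append(weights)
        # New vector = weighted sum of the visible values.
        out.append([sum(wt * V[j][c] for j, wt in enumerate(weights))
                    for c in range(len(V[0]))])
    return out, all_weights

# Hypothetical 3-token example with 2-D queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

out, weights = causal_attention(Q, K, V)
print([len(w) for w in weights])  # [1, 2, 3]
print(out[0])                     # [1.0, 0.0]
```

Several heads would run this same routine in parallel on different projections of the input and concatenate the results.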
GPT-2: 124M–1.5B params, 12–48 layers. GPT-3: 175B params, 96 layers, 96 heads, 12,288-D embeddings, 2,048-token context. Decoder-only = causally masked, generates left to right.
Pipeline: pretraining (next-token on huge corpora) → supervised finetuning (dialogue examples) → RLHF (humans rank, reward model scores, weights nudge). Hallucination = fluent sampling past the edge of the weights' knowledge — mechanism, not bug.
σ(1.0) = 0.7310585786300049 [L1] · AND(1,1): z = +5, a = 0.9933 [L2] · layer [0.5,−1,2] → [0.881, 0.060] [L3] · XOR table: 0.00005 / 0.99995 / 0.99995 / 0.00005 [L4] · MSE of (0.9, 0.2, 0.4) vs (1, 0, 1) = 0.41/3 ≈ 0.1367 [L6] · descent from w=0, η=0.1 on (w−3)²: w₁ = 0.6, w₈ = 2.4967 [L7] · trained AND: w₁=w₂=5.14, b=−7.81 [L9] · trained XOR: 0.020 / 0.977 / 0.981 / 0.018 [L10] · "to be" → [6, 4, 0, 1, 2] [L13] · P(end|'a') = 0.68 in the names model [L14] · king − man + woman = [−1, 2] = queen, cosine 1.000 [L15] · cat attends 0.446 / 0.446 / 0.108 [L16] · logits 4.0/2.5/2.0/0.5/0.1 → softmax 0.710/0.158/0.096/0.021/0.014 [L17] · at T = 0.5 the favorite jumps to 0.935 [L18]. All verified by running the lessons' own code.
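The softmax and temperature values [L17, L18] can be rechecked with a short sketch like this one. It assumes temperature divides the logits before the softmax; the function name is illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize the exponentials."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.5, 2.0, 0.5, 0.1]
probs = [round(p, 3) for p in softmax(logits)]
sharp = round(softmax(logits, 0.5)[0], 3)

print(probs)  # [0.71, 0.158, 0.096, 0.021, 0.014]
print(sharp)  # 0.935
```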