AI Foundations · Reference

Formulas Cheat Sheet

Every formula from the course on one printable page. Numbers in the examples are real computed values; lesson numbers appear in brackets.

The neuron — three forms [L1]

z = w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b   →   a = σ(z)
z = ( Σᵢ wᵢ·xᵢ ) + b   ·   Σ = "loop and add"
a = σ( w·x + b )   ·   dot product = multiply pairwise, add up

Sigmoid [L1, L5, L8]

σ(z) = 1 ⁄ (1 + e⁻ᶻ)   ·   range (0, 1) · σ(0) = 0.5 · σ(1) ≈ 0.7311 · σ(2) ≈ 0.8808
σ′(z) = σ(z)·(1 − σ(z))   ·   the derivative is computed from σ's own output, which is what backprop reuses
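A minimal Python sketch of the two sigmoid formulas above, using only the standard library. The function names `sigmoid` and `sigmoid_prime` are illustrative choices, not names taken from the lessons. The finite-difference check at the end is an added sanity test for the derivative formula.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z: float) -> float:
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Spot checks against the values listed above.
for z in (0.0, 1.0, 2.0):
    print(f"sigma({z}) = {sigmoid(z):.4f}   sigma'({z}) = {sigmoid_prime(z):.4f}")

# Central finite difference at z = 1 should agree with the closed form.
h = 1e-6
numeric = (sigmoid(1.0 + h) - sigmoid(1.0 - h)) / (2 * h)
print(abs(numeric - sigmoid_prime(1.0)) < 1e-8)
```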

The decision boundary [L2]

w·x + b = 0   ·   solved for x₂:  x₂ = −(w₁/w₂)·x₁ − b/w₂

Weights set the line's direction; bias slides it. Fires (a > 0.5) exactly when z > 0. Hand-built AND gate: w = (10, 10), b = −15. Same weights, b = −5 → OR.
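The hand-built gates can be checked with a few lines of Python. This is a sketch under the assumption that inputs are 0 or 1; the helper names `neuron` and `boundary_x2` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Return (z, a) for one neuron: weighted sum plus bias, then sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)

def boundary_x2(x1, w, b):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (assumes w2 != 0)."""
    return -(w[0] / w[1]) * x1 - b / w[1]

AND = ((10, 10), -15)   # weights and bias from the text
OR = ((10, 10), -5)     # same weights, bias slid from -15 to -5

for name, (w, b) in (("AND", AND), ("OR", OR)):
    for x in ((0, 0), (0, 1), (1, 0), (1, 1)):
        z, a = neuron(x, w, b)
        print(f"{name}{x}: z = {z:+d}, a = {a:.4f}, fires = {a > 0.5}")
```

Changing only the bias moves the line x₂ = −x₁ + 1.5 (AND) down to x₂ = −x₁ + 0.5 (OR) without rotating it.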

A layer [L3]

a = σ( W·x + b )   ·   one row of W per neuron · σ applied element-wise

Shape rule: (n×m matrix)·(vector of m) → vector of n. Output entry j = dot product of row j with x = one neuron.
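The shape rule can be sketched in plain Python as one dot product per row. The weights and biases below are invented for illustration (the lesson's own values are not shown on this page); they were picked so that z = [2, −2.75], which lands on the same rounded output as the L3 sanity-check entry.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    """One dense layer: row j of W dotted with x, plus b[j], through sigmoid."""
    assert all(len(row) == len(x) for row in W), "every row needs m entries"
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

# Hypothetical 2x3 weight matrix: 2 neurons, 3 inputs each.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -1.0]]
b = [0.5, 0.25]          # hypothetical biases, one per neuron
x = [0.5, -1.0, 2.0]     # vector of m = 3 inputs

a = layer(W, b, x)       # vector of n = 2 activations
print([round(v, 3) for v in a])
```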

The forward pass [L4]

a⁽ˡ⁾ = σ( W⁽ˡ⁾·a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ )   ·   a⁽⁰⁾ = x  ·  superscript = layer label, not a power

Hand-built XOR (2→2→1): hidden h₁ = OR (20, 20, −10), h₂ = NAND (−20, −20, +30), output = AND (20, 20, −30).
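As a quick check, the hand-built XOR network can be run as two forward-pass steps in Python. The weights are the ones listed above; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def xor_net(x1, x2):
    """2 -> 2 -> 1 forward pass with the hand-set weights."""
    h1 = neuron((x1, x2), (20, 20), -10)     # hidden 1: OR
    h2 = neuron((x1, x2), (-20, -20), 30)    # hidden 2: NAND
    return neuron((h1, h2), (20, 20), -30)   # output: AND of the two

for x1, x2 in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(f"XOR({x1}, {x2}) = {xor_net(x1, x2):.5f}")
```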

Activation functions [L5]

sigmoid → (0,1)  ·  tanh → (−1,1)  ·  ReLU(z) = max(0, z)

Without a nonlinearity, stacked layers collapse to one linear map: W₂(W₁x) = (W₂W₁)x. ReLU is the modern default for hidden layers; sigmoid survives at outputs for probabilities.
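The collapse claim is easy to verify numerically. The sketch below uses two made-up 2×2 matrices (chosen so the arithmetic is exact in floating point) and shows that stacking them equals one combined matrix, while a ReLU in between breaks that equality.

```python
def matvec(W, x):
    """Matrix times vector: one dot product per row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix times matrix, plain nested loops."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def relu(v):
    return [max(0.0, z) for z in v]

# Hypothetical weights and input, for illustration only.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers in a row
collapsed = matvec(matmul(W2, W1), x)        # one layer with W = W2·W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # nonlinearity in between

print(stacked, collapsed, with_relu)
```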

Loss — MSE [L6]

MSE = (1 ⁄ n) · Σᵢ (ŷᵢ − yᵢ)²   ·   ŷ "y-hat" = prediction, y = target

Squaring kills the sign and magnifies big misses. Floor is 0, hit when every ŷ = y. Nielsen writes C = (1/2n)Σ‖y−a‖² — the extra ½ is cosmetic (cancels the derivative's 2). Loss is a function of the weights; training = minimizing it.
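A short Python sketch of the MSE formula, run on the L6 sanity-check numbers. The function name `mse` is an illustrative choice.

```python
def mse(predictions, targets):
    """Mean squared error: average of the squared misses."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

loss = mse([0.9, 0.2, 0.4], [1, 0, 1])   # (0.01 + 0.04 + 0.36) / 3
perfect = mse([1, 0, 1], [1, 0, 1])      # the floor: every prediction exact

print(round(loss, 4), perfect)
```

The third prediction misses by 0.6 and contributes 0.36 of the 0.41 total, which is the "magnifies big misses" effect in numbers.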

Gradient descent [L7]

w ← w − η · ∂L/∂w   ·   vector form: w ← w − η·∇L

∂L/∂w = slope of the loss at the current w. η (eta) = learning rate: too small crawls, too big diverges. The derivative of (w−3)² is 2(w−3).
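The update rule can be traced on the (w−3)² example in a few lines of Python. This is a sketch; `descend` is an illustrative helper, and the η = 1.1 run is an added example of a learning rate that is too large.

```python
def grad(w):
    """Derivative of the loss L(w) = (w - 3)**2."""
    return 2 * (w - 3)

def descend(eta, steps, w=0.0):
    """Run gradient descent and return every w visited, starting value included."""
    history = [w]
    for _ in range(steps):
        w = w - eta * grad(w)     # w <- w - eta * dL/dw
        history.append(w)
    return history

history = descend(eta=0.1, steps=8)
diverging = descend(eta=1.1, steps=20)   # each step overshoots further

print(round(history[1], 4), round(history[8], 4))
print(abs(diverging[-1] - 3) > 100)
```

With η = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with η = 1.1 it grows by a factor of 1.2 per step.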

Backpropagation [L8, L10]

chain rule:  dL/dw = dL/da · da/dz · dz/dw
one sigmoid neuron + squared error:  dL/dw = 2(a − y) · a(1 − a) · x
two layers (2-2-1):  δ_out = 2(out−y)·out(1−out)  ·  δ_h = δ_out·v · h(1−h)

Every weight's gradient = its δ × the activation entering it; every bias gradient = the bare δ. These are Nielsen's BP1–BP4, written small.
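One way to build trust in the single-neuron formula is to compare it against a finite-difference slope. The sketch below does that in Python; the values of w, b, x and y are arbitrary illustrative numbers, not taken from the lessons.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of one sigmoid neuron on one example."""
    return (sigmoid(w * x + b) - y) ** 2

def grads(w, b, x, y):
    """Chain rule: dL/da * da/dz gives delta; weight gets delta*x, bias gets delta."""
    a = sigmoid(w * x + b)
    delta = 2 * (a - y) * a * (1 - a)
    return delta * x, delta

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # hypothetical values
dw, db = grads(w, b, x, y)

# Numerical slopes: nudge one parameter, watch the loss move.
h = 1e-6
num_dw = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
num_db = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)

print(abs(dw - num_dw) < 1e-8, abs(db - num_db) < 1e-8)
```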

The training loop [L9]

for each epoch:              # one full pass over the data
    forward                  (L4)
    loss                     (L6)
    backward                 (L8)
    update: w -= η·grad      (L7)

Batch GD = one update from all examples' summed gradients; SGD = one update per example (or per mini-batch of examples).
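The loop above can be sketched end to end for one sigmoid neuron learning AND with batch gradient descent. The zero initialization, η = 0.5 and 10,000 epochs are assumptions chosen for this example, so the learned weights need not match the lesson's trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # AND truth table

def forward(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def mean_loss(w, b):
    return sum((forward(w, b, x) - y) ** 2 for x, y in DATA) / len(DATA)

def train(epochs=10_000, eta=0.5):
    w, b = [0.0, 0.0], 0.0                       # assumed starting point
    for _ in range(epochs):                      # one epoch = one pass over DATA
        gw, gb = [0.0, 0.0], 0.0
        for x, y in DATA:
            a = forward(w, b, x)                 # forward
            delta = 2 * (a - y) * a * (1 - a)    # backward: dL/dz
            gw[0] += delta * x[0]                # sum gradients over the batch
            gw[1] += delta * x[1]
            gb += delta
        w[0] -= eta * gw[0]                      # update
        w[1] -= eta * gw[1]
        b -= eta * gb
    return w, b

start_loss = mean_loss([0.0, 0.0], 0.0)
w, b = train()
print(w, b, mean_loss(w, b))
```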

PyTorch ↔ from scratch [L11]

you wrote                         | PyTorch
forward pass through the layers   | pred = model(X)
mean of (ŷ−y)²                    | nn.MSELoss()
reset gradient accumulators       | optimizer.zero_grad()
the whole backward pass           | loss.backward()
w -= lr*grad everywhere           | optimizer.step()

Language models [L13–L14]

tokenize: text → tokens → IDs  ·  BPE: keep merging the most frequent adjacent pair
language model: P(next token | context)  ·  generate: predict → sample → append → repeat

Bigram: P(b|a) = count(a→b) ⁄ count(a→anything). Greedy (always argmax) repeats one answer forever; sampling buys variety.
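A minimal bigram model in Python makes the count formula and the greedy-versus-sampling contrast concrete. The toy sentence and the helper names are invented for this sketch; it is word-level, not the lessons' names model.

```python
import random
from collections import Counter, defaultdict

def bigram_counts(tokens):
    """counts[a][b] = how often token b directly follows token a."""
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def prob(counts, a, b):
    """P(b | a) = count(a -> b) / count(a -> anything)."""
    return counts[a][b] / sum(counts[a].values())

def generate(counts, start, steps, rng, greedy=False):
    """predict -> pick -> append -> repeat, stopping if a token has no successor."""
    out = [start]
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break
        if greedy:
            token = max(followers, key=followers.get)      # always the argmax
        else:
            choices, weights = zip(*followers.items())
            token = rng.choices(choices, weights=weights)[0]
        out.append(token)
    return out

tokens = "the cat sat on the mat the cat ran".split()   # hypothetical corpus
counts = bigram_counts(tokens)

greedy_run = generate(counts, "the", 6, random.Random(0), greedy=True)
sampled_run = generate(counts, "the", 6, random.Random(0))
print(prob(counts, "the", "cat"), greedy_run, sampled_run)
```

The greedy run cycles through the same four words; the sampled run can leave that loop through "mat" or "ran".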

Embeddings [L15]

token ID → row of the embedding matrix (learned like any weight)
cos(u, v) = (u·v) ⁄ (‖u‖·‖v‖)  ·  +1 same direction · 0 unrelated · −1 opposite

Similar contexts → nearby vectors; directions = concepts (king − man + woman ≈ queen).
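Cosine similarity and the analogy arithmetic can be sketched with tiny 2-D vectors. The four embeddings below are hypothetical, hand-picked so that king − man + woman lands exactly on queen; learned embeddings only do this approximately.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """cos(u, v) = (u·v) / (|u|·|v|)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical 2-D embeddings: axis 0 ~ gender, axis 1 ~ royalty.
vectors = {
    "king": [1.0, 2.0],
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "queen": [-1.0, 2.0],
}

target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
nearest = max(vectors, key=lambda word: cosine(target, vectors[word]))

print(target, nearest, round(cosine(target, vectors["queen"]), 3))
```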

Attention [L16]

Attention(Q, K, V) = softmax( QKᵀ ⁄ √dₖ ) · V
softmax(x)ᵢ = e^(xᵢ) ⁄ Σⱼ e^(xⱼ)   ·   scores → weights that sum to 1

q = what I seek · k = what I am · v = what I give. Causal mask: attend only to self + earlier tokens. New vector = weighted sum of values. Heads = several attentions in parallel.
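A single causally masked attention head fits in a short pure-Python sketch. The Q, K and V rows below are invented for illustration (three tokens, dₖ = 2); they were chosen so the last token's weights come out at 0.446 / 0.446 / 0.108, the same figures as the L16 sanity-check entry, but they are not the lesson's vectors.

```python
import math

def softmax(xs):
    """Scores -> positive weights that sum to 1 (max subtracted for stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention where token i sees only tokens 0..i."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(qa * ka for qa, ka in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]              # causal mask: j <= i
        weights = softmax(scores)
        mixed = [sum(w * V[j][c] for j, w in enumerate(weights))
                 for c in range(len(V[0]))]           # weighted sum of values
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical queries, keys and values, one row per token.
Q = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
K = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, weights = causal_attention(Q, K, V)
print([round(w, 3) for w in weights[2]])
```

Several heads would run this same computation in parallel with different Q, K and V projections.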

The transformer [L17]

tokens → embedding + position → N × [x ← x + attention(x); x ← x + MLP(x)] → logits → softmax → sample

GPT-2: 124M–1.5B params, 12–48 layers (48 in the 1.5B model). GPT-3: 175B params, 96 layers, 96 heads, 12,288-D embeddings, 2,048-token context. Decoder-only = causally masked, generates left to right.

Training GPT [L18]

pretraining loss = −ln( p(true next token) )   ·   cross-entropy on the next token, at internet scale
temperature: probabilities = softmax( logits ⁄ T )  ·  T→0 greedy · big T adventurous

Pipeline: pretraining (next-token on huge corpora) → supervised finetuning (dialogue examples) → RLHF (humans rank, reward model scores, weights nudge). Hallucination = fluent sampling past the edge of the weights' knowledge — mechanism, not bug.
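The two formulas in this section can be tried on the logits from the sanity-check list. This is a Python sketch; the function names are illustrative, and treating token 0 as the true next token is an assumption made only to show the loss.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """probabilities = softmax(logits / T); small T sharpens, large T flattens."""
    scaled = [l / T for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(probs, true_index):
    """Cross-entropy on one position: -ln(p of the true next token)."""
    return -math.log(probs[true_index])

logits = [4.0, 2.5, 2.0, 0.5, 0.1]
p_normal = softmax_with_temperature(logits, T=1.0)
p_sharp = softmax_with_temperature(logits, T=0.5)
p_flat = softmax_with_temperature(logits, T=2.0)

print([round(p, 3) for p in p_normal])
print(round(p_sharp[0], 3), round(p_flat[0], 3))
print(next_token_loss(p_normal, 0), next_token_loss(p_normal, 4))
```

A confident correct prediction costs little, while assigning the true token a low probability is penalized heavily.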

Worked numbers to sanity-check yourself

σ(1.0) = 0.7310585786300049 [L1] · AND(1,1): z = +5, a = 0.9933 [L2] · layer [0.5,−1,2] → [0.881, 0.060] [L3] · XOR table: 0.00005 / 0.99995 / 0.99995 / 0.00005 [L4] · MSE of (0.9, 0.2, 0.4) vs (1, 0, 1) = 0.41/3 ≈ 0.1367 [L6] · descent from w=0, η=0.1 on (w−3)²: w₁ = 0.6, w₈ = 2.4967 [L7] · trained AND: w₁=w₂=5.14, b=−7.81 [L9] · trained XOR: 0.020 / 0.977 / 0.981 / 0.018 [L10] · "to be" → [6, 4, 0, 1, 2] [L13] · P(end|'a') = 0.68 in the names model [L14] · king − man + woman = [−1, 2] = queen, cosine 1.000 [L15] · cat attends 0.446 / 0.446 / 0.108 [L16] · logits 4.0/2.5/2.0/0.5/0.1 → softmax 0.710/0.158/0.096/0.021/0.014 [L17] · at T = 0.5 the favorite jumps to 0.935 [L18]. All verified by running the lessons' own code.