Today is assembly day. There is no new math in this lesson — only the discovery that the parts you've built since lesson 13 snap together into the architecture behind every modern LLM: the transformer. By the end of this page you'll be able to trace a sentence from raw text to the next sampled token through the actual structure of GPT — whose name you'll also finally own: Generative (it generates, lesson 14), Pretrained (lesson 18's story), Transformer (today).
Five arrows, and you have personally built or hand-computed every single one. Let's walk the two pieces that need a word of introduction — positions and the block — then run the bottom of the machine for real.
Lesson 15 ended on a confession: the embedding lookup hands out the same vector for a token whether it's the first word or the fifteenth. But "dog bites man" and "man bites dog" differ only in positions — attention alone can't recover word order from identical vectors. The fix is almost embarrassingly simple: keep a vector for each position (position 1, position 2, …), and add it — plain lesson-3 vector addition — to each token's embedding. As Wikipedia puts it, the positional encoding "provides the transformer model with information about where the words are in the input sequence": "the token embedding vectors are added to their respective positional encoding vectors, producing the sequence of input vectors." After that, every vector entering the blocks carries both what it is and where it stands.
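To make the addition concrete, here is a minimal sketch. The 4-number vectors and the names `token_embedding`, `position_vector`, and `embed` are made up for illustration; a real model learns these values or computes them by a fixed formula.

```python
# Toy embeddings: every number here is invented for illustration only.
token_embedding = {
    "dog":   [0.9, 0.1, 0.0, 0.3],
    "bites": [0.2, 0.8, 0.5, 0.1],
    "man":   [0.7, 0.0, 0.2, 0.6],
}

# One vector per position (index 0 is the first position).
position_vector = [
    [0.01, 0.02, 0.03, 0.04],
    [0.10, 0.20, 0.30, 0.40],
    [1.00, 2.00, 3.00, 4.00],
]

def add_vectors(a, b):
    """Plain element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]

def embed(tokens):
    """Token embedding plus the vector for the position it sits in."""
    return [add_vectors(token_embedding[t], position_vector[i])
            for i, t in enumerate(tokens)]

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

# Without positions, "dog" is the same vector wherever it appears.
print(token_embedding["dog"] == token_embedding["dog"])  # True
# With positions, "dog" first and "dog" last are different vectors.
print(a[0] == b[2])  # False
# "bites" sits in the middle of both sentences, so it matches.
print(a[1] == b[1])  # True
```

The two sentences now produce different input sequences even though they contain the same three tokens.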
The transformer's body is one unit repeated over and over. Each block contains exactly two machines you already know (Wikipedia: "a self-attention mechanism and a feed-forward layer"):
Attention (lesson 16) — the only place tokens talk to each other: queries, keys, values, causal mask, weighted sums. Information moves between positions.
The MLP — and here three lessons of "boring" foundations cash out: this is literally lessons 3–5's network. Each token's vector, alone, is pushed through linear layers with a nonlinearity between them. No token sees any other here; each gets private processing of whatever attention just mixed into it.
One wiring detail worth knowing because you'll see it everywhere: each part's output is added to its own input rather than replacing it (plus a normalization step) — Wikipedia: the layers "contain residual connections and layer normalization steps". The intuition: blocks edit the token vectors, nudge by nudge, rather than rewriting them from scratch — so even after 96 blocks, the original signal is still in there, refined. (Bonus: those residual "shortcuts" also give lesson 8's backward pass a fast lane through a very deep stack.)
After the last block, each position holds a context-soaked vector. The model now needs lesson 14's product: probabilities for the next token. So the final position's vector goes through one last linear layer with one output per vocabulary token — your lesson-12 output layer, except instead of 10 digits it scores the whole vocabulary. Those raw scores are the logits (lesson 12's word), and softmax turns them into probabilities. Run the exit for real:
```swift
import Foundation

// the final layer's raw scores (logits) for the next token after "the cat sat on the"
let candidates = ["mat", "sofa", "roof", "dog", "moon"]
let logits = [4.0, 2.5, 2.0, 0.5, 0.1]

func softmax(_ xs: [Double]) -> [Double] {
    let es = xs.map(exp)
    let s = es.reduce(0, +)
    return es.map { $0 / s }
}

let probs = softmax(logits)
for ((c, l), p) in zip(zip(candidates, logits), probs) {
    let pad = String(repeating: " ", count: 5 - c.count)
    print(" \(c)\(pad) logit " + String(format: "%4.1f", l)
          + " -> probability " + String(format: "%.3f", p))
}
print(" sum of probabilities: " + String(format: "%.1f", probs.reduce(0, +)))
```
```python
import math

# the final layer's raw scores (logits) for the next token after "the cat sat on the"
candidates = ["mat", "sofa", "roof", "dog", "moon"]
logits = [4.0, 2.5, 2.0, 0.5, 0.1]

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

probs = softmax(logits)
for c, l, p in zip(candidates, logits, probs):
    print(f" {c:5s} logit {l:4.1f} -> probability {p:.3f}")
print(f" sum of probabilities: {sum(probs):.1f}")
```
Both print exactly this:
```
 mat   logit  4.0 -> probability 0.710
 sofa  logit  2.5 -> probability 0.158
 roof  logit  2.0 -> probability 0.096
 dog   logit  0.5 -> probability 0.021
 moon  logit  0.1 -> probability 0.014
 sum of probabilities: 1.0
```
Then the dice (lesson 14) pick one — probably mat, occasionally roof — it's appended to the text, and the whole machine runs again for the token after that. ChatGPT "typing" is this loop spinning.
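Here is a small sketch of the dice roll, using the rounded probabilities printed above. The seed, the number of rolls, and the names `rng` and `counts` are arbitrary choices for the demonstration, not part of any model.

```python
import random

candidates = ["mat", "sofa", "roof", "dog", "moon"]
probs = [0.710, 0.158, 0.096, 0.021, 0.014]  # rounded softmax output from above

rng = random.Random(0)  # fixed seed so reruns give the same rolls

# Roll the dice many times and count how often each token wins.
counts = {c: 0 for c in candidates}
for _ in range(10_000):
    pick = rng.choices(candidates, weights=probs, k=1)[0]
    counts[pick] += 1

for c in candidates:
    print(f" {c:5s} picked {counts[c]:5d} times")

# One step of the generation loop: sample a token and append it to the text.
text = "the cat sat on the"
text = text + " " + rng.choices(candidates, weights=probs, k=1)[0]
print(text)
```

Over many rolls the counts should land near the probabilities: "mat" wins most of the time, while the unlikely tokens still show up now and then.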
Every row below is the same architecture you just stepped through — only the knob counts differ. All figures verified against the sources linked:
| model | parameters | blocks (layers) | details |
|---|---|---|---|
| GPT-2 (2019) | 124M – 1.5B | 48 (full model) | the first famous one |
| GPT-3 (2020) | 175B | 96 | 12,288-D embeddings · 96 attention heads · 2,048-token context |
Sources for the details: embedding width and heads from 3Blue1Brown's GPT chapter and attention chapter; the rest from the linked Wikipedia articles, which also note GPT-3 is a "decoder-only transformer model" — decoder-only being the official name for exactly the causally-masked, generate-left-to-right variant you've learned. Newer models (GPT-4 and beyond) keep details undisclosed, but the published architecture family is this one.
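As a rough sanity check, the GPT-3 row can be tied together with a common back-of-the-envelope estimate. This sketch assumes things the table does not state: that each block's MLP hidden layer is 4 times the embedding width, and that embeddings, biases, and normalization parameters are small enough to ignore. Under those assumptions each block holds about 12 × width² weights (4 × width² for attention, 8 × width² for the MLP).

```python
n_blocks = 96    # from the table
width = 12_288   # embedding width, from the table

# Assumption: attention uses 4 width-by-width matrices (query, key, value, output).
attention_weights = 4 * width * width
# Assumption: the MLP goes width -> 4*width -> width, so two matrices of 4*width*width.
mlp_weights = 2 * (4 * width * width)

per_block = attention_weights + mlp_weights
total = n_blocks * per_block

print(f"per block: {per_block:,}")          # per block: 1,811,939,328
print(f"all blocks: {total:,}")             # all blocks: 173,946,175,488
print(f"in billions: {total / 1e9:.1f}")    # in billions: 173.9
```

The estimate lands close to the 175B figure, which suggests most of the parameters sit in the repeated blocks.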
And here is the sentence that should make 175,000,000,000 feel less mystical: every one of those parameters is the same kind of number as the nine weights of your lesson-10 XOR net — a knob, nudged downhill by lesson 7's rule, with gradients from lesson 8's backprop. 175B ÷ 9 ≈ 19 billion of your XOR nets. The machine is not deep magic; it is shallow magic, repeated at a scale that's hard to feel.
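The division, spelled out as a quick check (the variable names are just for illustration):

```python
gpt3_parameters = 175_000_000_000
xor_net_weights = 9

xor_nets = gpt3_parameters / xor_net_weights
print(f"{xor_nets:,.0f}")  # 19,444,444,444
```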
Primary source: Andrej Karpathy — "Let's build GPT: from scratch, in code, spelled out." He assembles this exact architecture line by line in PyTorch — embeddings, position vectors, attention, blocks, logits — and trains it. It's long and worth every minute; after lessons 11 and 16 you can genuinely follow the code. (It's also precisely the program you'll run in Colab as lesson 18's capstone.)