AI Foundations · Lesson 17 · Phase 5 — Under the hood of GPT

The Transformer

Today is assembly day. There is no new math in this lesson — only the discovery that the parts you've built since lesson 13 snap together into the architecture behind every modern LLM: the transformer. By the end of this page you'll be able to trace a sentence from raw text to the next sampled token through the actual structure of GPT — whose name you'll also finally own: Generative (it generates, lesson 14), Pretrained (lesson 18's story), Transformer (today).

The whole machine, top to bottom

text
→ tokens & IDs (lesson 13)
→ embedding lookup + position vector (lesson 15 + today)
→ N × transformer block: attention, then MLP, each adding to what came in (lessons 16 + 3–5)
→ final layer → one logit per vocabulary token (lesson 12's output, vocabulary-sized)
→ softmax → probabilities → sample → append → repeat (lessons 16 + 14)

Five arrows, and you have personally built or hand-computed every single one. Let's walk the two pieces that need a word of introduction — positions and the block — then run the bottom of the machine for real.

New piece 1: position vectors

Lesson 15 ended on a confession: the embedding lookup hands out the same vector for a token whether it's the first word or the fifteenth. But "dog bites man" and "man bites dog" differ only in positions — attention alone can't recover word order from identical vectors. The fix is almost embarrassingly simple: keep a vector for each position (position 1, position 2, …), and add it — plain lesson-3 vector addition — to each token's embedding. As Wikipedia puts it, the positional encoding "provides the transformer model with information about where the words are in the input sequence": "the token embedding vectors are added to their respective positional encoding vectors, producing the sequence of input vectors." After that, every vector entering the blocks carries both what it is and where it stands.
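A minimal sketch of that addition, assuming a made-up three-word vocabulary with 3-D vectors. Every number below is invented for illustration, the names `token_embedding`, `position_vector` and `embed` are hypothetical, and positions are counted from 0 here rather than 1.

```python
# Toy token embeddings: one vector per word, the same wherever the word appears.
token_embedding = {
    "dog":   [0.9, 0.1, 0.0],
    "bites": [0.0, 0.8, 0.3],
    "man":   [0.2, 0.1, 0.9],
}

# Toy position vectors: one per slot in the sequence (index 0 = first word).
position_vector = [
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1],
]

def add(u, v):
    """Element-wise vector addition (lesson 3)."""
    return [a + b for a, b in zip(u, v)]

def embed(words):
    """Token embedding plus the vector for the position the token sits in."""
    return [add(token_embedding[w], position_vector[i]) for i, w in enumerate(words)]

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

print("dog first :", a[0])
print("dog third :", b[2])
print("same vector?", a[0] == b[2])
```

The word "dog" now enters the blocks as a different vector depending on where it stands, which is exactly the information attention was missing.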

New piece 2: the block — attention, then MLP

The transformer's body is one unit repeated over and over. Each block contains exactly two machines you already know (Wikipedia: "a self-attention mechanism and a feed-forward layer"):

Attention (lesson 16) — the only place tokens talk to each other: queries, keys, values, causal mask, weighted sums. Information moves between positions.

The MLP — and here three lessons of "boring" foundations cash out: this is literally lessons 3–5's network. Each token's vector, alone, is pushed through linear layers with a nonlinearity between them. No token sees any other here; each gets private processing of whatever attention just mixed into it.
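To see the "private processing" claim in code, here is a small sketch with hypothetical weights. It assumes ReLU as the nonlinearity and a 2-D token vector widened to 3 hidden units; real models use far larger sizes and may use a different nonlinearity.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, b, v):
    """One output per row of W: dot(row, v) + bias."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

# Hypothetical weights: 2-D in -> 3 hidden units -> 2-D out.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, 0.1, 0.0]
W2 = [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]]
b2 = [0.0, 0.0]

def mlp(v):
    """Linear layer, nonlinearity, linear layer, applied to ONE token vector."""
    return linear(W2, b2, relu(linear(W1, b1, v)))

def mlp_all(tokens):
    """The same MLP applied to every position independently."""
    return [mlp(v) for v in tokens]

seq     = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
changed = [[1.0, 2.0], [9.0,  9.0], [3.0, 0.0]]  # only the middle token differs

out_a = mlp_all(seq)
out_b = mlp_all(changed)
print("neighbours unchanged:", out_a[0] == out_b[0], out_a[2] == out_b[2])
print("middle token changed:", out_a[1] != out_b[1])
```

Changing one token's vector changes only that token's output. The MLP never looks sideways; only attention does.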

One wiring detail worth knowing because you'll see it everywhere: each part's output is added to its own input rather than replacing it (plus a normalization step) — Wikipedia: the layers "contain residual connections and layer normalization steps". The intuition: blocks edit the token vectors, nudge by nudge, rather than rewriting them from scratch — so even after 96 blocks, the original signal is still in there, refined. (Bonus: those residual "shortcuts" also give lesson 8's backward pass a fast lane through a very deep stack.)

block(x): x ← x + attention(x) then x ← x + MLP(x)
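The one-line formula above can be sketched as runnable code under heavy simplifications, all of them assumptions made for brevity: each vector acts as its own query, key and value (no learned projections), the MLP is a tiny fixed stand-in, and the normalization step is left out.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical safety
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def attention(xs):
    """Toy causal self-attention: position i mixes positions 0..i only."""
    out = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]            # causal mask: no peeking at later tokens
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, visible))
                 for d in range(len(q))]
        out.append(mixed)
    return out

def mlp(v):
    """Stand-in for the per-token network: a small fixed nonlinear function."""
    return [0.1 * max(0.0, x) for x in v]

def block(xs):
    xs = [add(x, a) for x, a in zip(xs, attention(xs))]  # x <- x + attention(x)
    xs = [add(x, mlp(x)) for x in xs]                    # x <- x + MLP(x)
    return xs

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for _ in range(3):                       # N = 3 blocks stacked
    xs = block(xs)
print(xs)
```

Because each step adds to `x` instead of overwriting it, the input is carried forward through every block, and the causal mask means a later token can never change what an earlier position computes.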

The exit: from vectors back to words

After the last block, each position holds a context-soaked vector. The model now needs lesson 14's product: probabilities for the next token. So the final position's vector goes through one last linear layer with one output per vocabulary token: your lesson-12 output layer, except instead of 10 digits it scores the whole vocabulary. Those raw scores are the logits (lesson 12's word), and softmax turns them into probabilities. Run the exit for real, first in Swift, then the same program in Python:

import Foundation

// the final layer's raw scores (logits) for the next token after "the cat sat on the"
let candidates = ["mat", "sofa", "roof", "dog", "moon"]
let logits     = [4.0,   2.5,    2.0,    0.5,   0.1]

func softmax(_ xs: [Double]) -> [Double] {
    let es = xs.map(exp)
    let s = es.reduce(0, +)
    return es.map { $0 / s }
}

let probs = softmax(logits)
for ((c, l), p) in zip(zip(candidates, logits), probs) {
    let pad = String(repeating: " ", count: 5 - c.count)
    print("  \(c)\(pad)  logit " + String(format: "%4.1f", l)
          + "  ->  probability " + String(format: "%.3f", p))
}
print("  sum of probabilities: " + String(format: "%.1f", probs.reduce(0, +)))

import math

# the final layer's raw scores (logits) for the next token after "the cat sat on the"
candidates = ["mat", "sofa", "roof", "dog", "moon"]
logits     = [4.0,   2.5,    2.0,    0.5,   0.1]

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

probs = softmax(logits)
for c, l, p in zip(candidates, logits, probs):
    print(f"  {c:5s}  logit {l:4.1f}  ->  probability {p:.3f}")
print(f"  sum of probabilities: {sum(probs):.1f}")

Both print exactly this:

  mat    logit  4.0  ->  probability 0.710
  sofa   logit  2.5  ->  probability 0.158
  roof   logit  2.0  ->  probability 0.096
  dog    logit  0.5  ->  probability 0.021
  moon   logit  0.1  ->  probability 0.014
  sum of probabilities: 1.0

Then the dice (lesson 14) pick one (probably mat, occasionally sofa or roof), it's appended to the text, and the whole machine runs again for the token after that. ChatGPT "typing" is this loop spinning.
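Here is a sketch of the dice and the loop, reusing the logits from the code above. The function `next_token_logits` is a hypothetical stand-in for the entire transformer: it ignores its input and always returns the same five scores, which a real model would not do.

```python
import math
import random

candidates = ["mat", "sofa", "roof", "dog", "moon"]
logits     = [4.0,   2.5,    2.0,    0.5,   0.1]

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

rng = random.Random(0)   # fixed seed so repeated runs give the same draws

# Roll the dice many times to see how often each candidate wins.
probs = softmax(logits)
draws = rng.choices(candidates, weights=probs, k=10_000)
counts = {c: draws.count(c) for c in candidates}
print(counts)

def next_token_logits(tokens):
    """Hypothetical stand-in for the whole model: same scores for any context."""
    return logits

# sample -> append -> repeat
tokens = ["the", "cat", "sat", "on", "the"]
for _ in range(3):
    p = softmax(next_token_logits(tokens))
    tokens.append(rng.choices(candidates, weights=p, k=1)[0])
print(" ".join(tokens))
```

The counts should land close to the probabilities in the table: "mat" about 71% of the time, "sofa" about 16%, and so on. The generated text is nonsense only because the stand-in never looks at its context.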

Feel it: one trip through the machine

Press step to push "the cat sat on the" through each stage. The numbers in the logits stage are the verified ones from the code above.

Now read the real machines' spec sheets

Every row below is the same architecture you just stepped through — only the knob counts differ. All figures verified against the sources linked:

model	parameters	blocks (layers)	details
GPT-2 (2019)	124M – 1.5B	48 (full model)	the first famous one
GPT-3 (2020)	175B	96	12,288-D embeddings · 96 attention heads · 2,048-token context

Sources for the details: embedding width and heads from 3Blue1Brown's GPT chapter and attention chapter; the rest from the linked Wikipedia articles, which also note GPT-3 is a "decoder-only transformer model" — decoder-only being the official name for exactly the causally-masked, generate-left-to-right variant you've learned. Newer models (GPT-4 and beyond) keep details undisclosed, but the published architecture family is this one.

And here is the sentence that should make 175,000,000,000 feel less mystical: every one of those parameters is the same kind of number as the nine weights of your lesson-10 XOR net — a knob, nudged downhill by lesson 7's rule, with gradients from lesson 8's backprop. 175B ÷ 9 ≈ 19 billion of your XOR nets. The machine is not deep magic; it is shallow magic, repeated at a scale that's hard to feel.
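If you want to see where a number like 175B comes from, here is a back-of-the-envelope count using the GPT-3 figures from the table. Two inputs are assumptions not stated in this lesson: a vocabulary of 50,257 tokens and an MLP hidden layer four times wider than the embedding. The estimate also ignores biases, normalization parameters and any separate output-layer weights, so treat it as a rough check rather than an exact accounting.

```python
# Figures from the table above.
n_blocks = 96
d_model  = 12_288
context  = 2_048

# Assumptions (not stated in this lesson).
vocab_size = 50_257   # assumed tokenizer vocabulary size
mlp_ratio  = 4        # assumed: MLP hidden layer is 4x the embedding width

# Attention: four d_model x d_model matrices (query, key, value, output).
attention_per_block = 4 * d_model * d_model
# MLP: one matrix up to the hidden width, one back down.
mlp_per_block = 2 * mlp_ratio * d_model * d_model

blocks_total = n_blocks * (attention_per_block + mlp_per_block)
embeddings   = vocab_size * d_model   # one vector per vocabulary token
positions    = context * d_model      # one vector per position

total = blocks_total + embeddings + positions
print(f"blocks:     {blocks_total:,}")
print(f"embeddings: {embeddings:,}")
print(f"positions:  {positions:,}")
print(f"total:      {total:,}  (about {total / 1e9:.1f}B)")
print(f"XOR nets:   about {175e9 / 9 / 1e9:.1f} billion")
```

Nearly all of the knobs live in the blocks' weight matrices. The embedding and position tables are a rounding error by comparison.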

Why is THIS the architecture that won? Two engineering reasons, both visible from your seat. Parallelism: attention computes all pairs at once with matrix multiplications (lesson 3's machine), which is exactly what GPUs are fastest at, so training scales. Reach: any token can pull from any earlier token directly (lesson 16), so long-range structure (a variable defined 200 lines up, a plot thread from chapter one) survives. Speed to train × power to remember: that combination is why "Attention Is All You Need" reshaped the field.
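The "all pairs at once" point can be made concrete with a small sketch. The query and key values are made up, and `matmul` is a plain-Python stand-in for what a GPU library would do in a single call.

```python
def matmul(A, B):
    """Plain matrix product: rows of A dotted with columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(r) for r in zip(*M)]

# One query and one key per token (3 tokens, 2-D, made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# One matrix multiplication scores every (query, key) pair:
# scores[i][j] = dot(Q[i], K[j])
scores = matmul(Q, transpose(K))

# The same numbers, computed pair by pair in nested loops.
looped = [[sum(q * k for q, k in zip(Q[i], K[j])) for j in range(len(K))]
          for i in range(len(Q))]
print(scores == looped)

# The causal mask then hides later positions before softmax.
masked = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
          for i, row in enumerate(scores)]
for row in masked:
    print(row)
```

The nested loops and the single product give identical scores. The difference is that the product is one operation hardware can spread across thousands of cores, instead of a sequence of small steps.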

Check yourself

No peeking back. Pull it from memory.

1. Why are position vectors added to the token embeddings?

2. Inside a transformer block, which part lets tokens exchange information?

3. What does the model's final layer produce?

Watch this next

Primary source: Andrej Karpathy — "Let's build GPT: from scratch, in code, spelled out." He assembles this exact architecture line by line in PyTorch — embeddings, position vectors, attention, blocks, logits — and trains it. It's long and worth every minute; after lessons 11 and 16 you can genuinely follow the code. (It's also precisely the program you'll run in Colab as lesson 18's capstone.)