AI Foundations · Lesson 18 · Phase 5 — Under the hood of GPT

How GPT Is Trained

Lesson 17 assembled the machine — but a freshly built transformer is like your lesson-10 net before training: 175 billion knobs full of random noise, emitting gibberish. Today, the last lesson: where the knowledge comes from (pretraining), how a text-completer is tamed into an assistant (finetuning and RLHF), what the dials in every LLM API actually do (temperature), and why the machine sometimes lies with a straight face (hallucination). None of it will be magic — every piece lands on something you've already computed.

Step 1 — Pretraining: the P in GPT

The recipe is lesson 9's training loop, fed with the internet. Take mountains of text; for every position in every snippet, the training example is: context → the token that actually came next. The text grades itself — no human labels needed. The loss for each prediction is one you already know by name and by hand:

loss = −ln( p_{true next token} )

— lesson 12's cross-entropy, unchanged: be confident about the token that really came next, or pay. Then backprop (8), update (7), repeat (9), across the corpus. That's the whole intellectual content of pretraining. The rest is scale, and the scale is the story: GPT-3 was trained on "hundreds of billions of words" — about 60% filtered Common Crawl (a crawl of the public web), plus web text, two book corpora, and Wikipedia. The compute is the part nobody's intuition survives: 3Blue1Brown's illustration is that at one billion operations per second, the largest models' training would take over 100 million years. (Data centers do it in months by running astronomically wide in parallel — matrix multiplications again, lesson 17's sidenote.)
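Here is the recipe above as a minimal sketch, not GPT's actual pipeline. It splits a sentence on spaces as a stand-in for real tokenization, builds the context → next-token pairs, and evaluates the loss for a few made-up probabilities. The helper names `training_pairs` and `next_token_loss` are hypothetical, chosen for this illustration only.

```python
import math

def training_pairs(tokens):
    """Every position yields one example: (context so far, token that came next)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def next_token_loss(p_true):
    """Cross-entropy for one prediction: -ln(probability given to the true token)."""
    return -math.log(p_true)

# Assumption: whitespace splitting stands in for a real tokenizer (lesson 13).
tokens = "the cat sat on the mat".split()
pairs = training_pairs(tokens)

# The text grades itself: each target is simply the word that came next.
for context, target in pairs:
    print(" ".join(context), "->", target)

# Made-up probabilities a model might assign to the true next token.
for p in [0.9, 0.5, 0.01]:
    print(f"p = {p}: loss = {next_token_loss(p):.3f}")
```

Six tokens give five training examples with no human labeling. The loss is about 0.105 when the model gave the true token 90% and about 4.605 when it gave it 1%: be confident about what really came next, or pay.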

Why does next-token prediction produce knowledge? Because predicting well is impossible without absorbing structure. To continue "the capital of Armenia is", the weights must encode geography; to continue func sigmoid(_ z: Double) ->, they must encode Swift's grammar. Facts, styles, languages, code idioms — all get pressed into the weights as side effects of one humble objective. And this is also why LLMs are good at coding: source code is just text with an unusually strict grammar, the public internet contains oceans of it, and a model that predicts code tokens well has been forced to internalize syntax, APIs, and the patterns of how functions get used. Your coding assistant is a next-token predictor that read more code than any human ever will.

Step 2 — From autocomplete to assistant: finetuning & RLHF

A pretrained model is a completer, not a helper. Ask it "How do I center a view in SwiftUI?" and a raw completer might continue with… more forum questions, because that's a statistically plausible continuation of a question on the internet. Two finishing steps close the gap — here's how ChatGPT was made, per the public record: "the fine-tuning process involved supervised learning and reinforcement learning from human feedback (RLHF)."

Supervised finetuning: more lesson-9 training, but now on a curated dataset of dialogues, where human "trainers acted as both the user and the AI assistant, providing examples of how the chatbot is expected to respond" (Wikipedia). Same loop, better-mannered data: the model learns the shape of being an assistant.

RLHF (reinforcement learning from human feedback): humans rank several model answers from best to worst; those rankings train a reward model that scores answers automatically; then the LLM's weights are nudged toward answers the reward model likes. You can hear the lesson-6 reframe inside it: human preference becomes a number, and a number can be optimized. The result is the difference between a wild text-predictor and the polite thing you chat with.
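One way to picture "rankings train a reward model" is a pairwise loss: each ranking is split into (better, worse) pairs, and the reward model pays whenever it scores the worse answer higher. The sketch below assumes that pairwise formulation, which the text above does not specify. The scores are made-up numbers rather than the output of any real reward model, and `ranking_to_pairs` and `pairwise_loss` are hypothetical names.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_answers):
    """A best-to-worst ranking implies a (better, worse) pair for every two answers."""
    return list(combinations(ranked_answers, 2))

def pairwise_loss(score_better, score_worse):
    """-ln(sigmoid(score gap)): small when the better answer scores higher."""
    gap = score_better - score_worse
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# A human ranked three answers, best first.
ranking = ["answer_A", "answer_B", "answer_C"]

# Made-up reward-model scores; it currently prefers C over B, against the human.
scores = {"answer_A": 2.0, "answer_B": 0.0, "answer_C": 0.5}

pairs = ranking_to_pairs(ranking)
for better, worse in pairs:
    loss = pairwise_loss(scores[better], scores[worse])
    print(f"{better} > {worse}: loss = {loss:.3f}")
```

The pair where the reward model disagrees with the human (B ranked above C, but scored lower) gets the largest loss. Shrinking that loss with the usual loop is what turns human preference into a number that can be optimized.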

Step 3 — The dial you've seen everywhere: temperature

Generation still ends with lesson 17's logits → softmax → dice. Temperature is one number, T, that divides the logits before softmax:

probabilities = softmax( logits ⁄ T )

Divide by a small T and the gaps between scores stretch, so the favorite dominates. Divide by a big T and the gaps shrink, so outsiders get real chances. In 3Blue1Brown's words: with larger T "more weight is given to lower values, making the distribution more uniform", while with smaller T "the larger values will dominate the distribution, and in the extreme setting when T is equal to 0, all the weight goes to the maximum value". Read T = 0 as a limit: you can't literally divide by zero, so in practice it means "always pick the top-scoring token". That is exactly lesson 14's greedy trap, useful when you want deterministic answers (like code), dull when you want prose. Run it on lesson 17's verified logits, first in Swift, then in Python:

import Foundation

let candidates = ["mat", "sofa", "roof", "dog", "moon"]
let logits     = [4.0,   2.5,    2.0,    0.5,   0.1]

func softmax(_ xs: [Double]) -> [Double] {
    let es = xs.map(exp)
    let s = es.reduce(0, +)
    return es.map { $0 / s }
}

for T in [0.5, 1.0, 2.0] {
    let probs = softmax(logits.map { $0 / T })
    let row = zip(candidates, probs).map { "\($0) " + String(format: "%.3f", $1) }
                                    .joined(separator: "  ")
    print("T = \(T):  \(row)")
}

import math

candidates = ["mat", "sofa", "roof", "dog", "moon"]
logits     = [4.0,   2.5,    2.0,    0.5,   0.1]

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

for T in [0.5, 1.0, 2.0]:
    probs = softmax([l / T for l in logits])
    row = "  ".join(f"{c} {p:.3f}" for c, p in zip(candidates, probs))
    print(f"T = {T}:  {row}")

Both the Swift and the Python version print exactly this:

T = 0.5:  mat 0.935  sofa 0.047  roof 0.017  dog 0.001  moon 0.000
T = 1.0:  mat 0.710  sofa 0.158  roof 0.096  dog 0.021  moon 0.014
T = 2.0:  mat 0.464  sofa 0.219  roof 0.171  dog 0.081  moon 0.066
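The table shows probabilities; generation then rolls the dice over them. Here is a minimal sketch of that last step, using a hypothetical helper `sample_token`. Dividing by T = 0 is undefined, so the sketch treats T = 0 as a special case and returns the top-scoring token directly, which is the behavior softmax approaches as T shrinks. That special-casing is an assumption of this sketch, not a description of any particular API.

```python
import math
import random

candidates = ["mat", "sofa", "roof", "dog", "moon"]
logits     = [4.0,   2.5,    2.0,    0.5,   0.1]

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sample_token(candidates, logits, T, rng):
    """Pick one token; T = 0 is handled as greedy argmax to avoid dividing by zero."""
    if T == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return candidates[best]
    probs = softmax([l / T for l in logits])
    return rng.choices(candidates, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so a run is repeatable
for T in [0, 0.5, 1.0, 2.0]:
    draws = [sample_token(candidates, logits, T, rng) for _ in range(1000)]
    counts = {c: draws.count(c) for c in candidates}
    print(f"T = {T}: {counts}")
```

At T = 0 every one of the 1000 draws is "mat". As T grows, the counts spread toward the outsiders, roughly tracking the probabilities in the table.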


Step 4 — Why it lies: hallucination, from the mechanism

Now you can explain LLMs' most famous flaw without hand-waving. The machine's only move — its entire repertoire — is "emit a plausible next token." It is not consulting a database and reporting lookup failures; there is no lookup. So when the context calls for a fact the weights don't reliably encode (an obscure paper, a niche API, your cousin's birthday), the most statistically plausible continuation still gets generated — fluent, well-formatted, and wrong. The field calls this hallucination: "a response generated by AI that contains false or misleading information presented as fact" (Wikipedia) — "plausible-sounding random falsehoods" delivered with perfect grammar. Notice it's not a bug bolted on top; it's the default behavior of a fluency machine at the edge of its knowledge. That's why serious workflows ask LLMs to cite, check, run, or verify — exactly what this book did to itself.
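Here is a toy illustration of "there is no lookup", using made-up logits rather than a real model. Softmax always returns a full probability distribution, so some token comes out whether the scores are sharply peaked (the weights encode the fact) or nearly flat (they don't). Entropy is used here only as a rough uncertainty gauge. The years and scores are invented for the example.

```python
import math

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def entropy(probs):
    """Uncertainty in nats: 0 for a sure thing, ln(n) for a uniform guess over n options."""
    return -sum(p * math.log(p) for p in probs if p > 0)

years = ["1987", "1991", "1994", "2003"]

# Made-up logits for continuing "the paper was published in ...".
well_known = [9.0, 1.0, 0.5, 0.2]   # one clear favorite
obscure    = [1.1, 1.0, 0.9, 1.0]   # near-flat scores: the weights hold nothing useful

for label, logits in [("well known", well_known), ("obscure", obscure)]:
    probs = softmax(logits)
    best = max(range(len(years)), key=lambda i: probs[i])
    print(f"{label}: emits {years[best]}  "
          f"(p = {probs[best]:.3f}, entropy = {entropy(probs):.3f})")
```

Both cases emit a year in the same fluent format. Only the internal distribution differs, and that distribution is not part of the text you read.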

One more honest limit while we're here: the model only attends (lesson 16) over its context window — GPT-3's was 2,048 tokens (Wikipedia); modern models stretch much further, but the principle stands: what's outside the window does not exist for the model. "It forgot what I said an hour ago" usually means "it slid out of the window."
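The sliding-out effect can be sketched in a few lines. Two assumptions: whitespace splitting stands in for a real tokenizer, and overflow is handled by simply dropping the oldest tokens (real chat products may do something different, such as summarizing). `visible_context` is a hypothetical helper name.

```python
def visible_context(tokens, window):
    """Keep only the most recent `window` tokens; everything earlier is dropped."""
    return tokens[-window:] if window > 0 else []

# An early fact, some filler conversation, then a question about the early fact.
history = "my name is Ada .".split() + ["filler"] * 10 + "what is my name ?".split()

window = 8  # tiny on purpose; GPT-3's was 2,048 tokens
seen = visible_context(history, window)

print("model sees:", " ".join(seen))
print("name still visible:", "Ada" in seen)
# prints:
# model sees: filler filler filler what is my name ?
# name still visible: False
```

The question arrives intact, but the sentence that answers it is no longer in the input, so for the model it does not exist.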

The capstone: train your own GPT

You've earned the real thing. Karpathy's "Let's build GPT: from scratch, in code, spelled out" builds and trains a small character-level transformer on Shakespeare — embeddings, position vectors, attention, blocks, cross-entropy, the loop — every one a lesson from this book, in PyTorch you can read since lesson 11. Follow it in Colab (zero install, as always), and at the end you'll watch a transformer you trained generate Shakespeare-ish verse. That moment is this course's true finish line.

Check yourself

No peeking back. Pull it from memory.

1. What is the training objective during LLM pretraining?

2. What does RLHF add on top of the pretrained model?

3. Why do hallucinations happen, mechanically speaking?

The whole book, one breath

A neuron multiplies, adds, squashes (1–2); layers stack it (3–5); loss, gradients, backprop, and the loop make it learn (6–9); you trained nets from scratch and in PyTorch (10–12); then text became tokens (13), prediction became generation (14), IDs became meaning-vectors (15), tokens learned to look at each other (16), it all stacked into the transformer (17), and scale plus human feedback turned it into the thing you talk to (18).

What this book deliberately did not cover — honest list, for your future roadmap: multimodal models (images/audio in LLMs), retrieval (RAG) and tool use, agents, and the fast-moving training tricks beyond RLHF. You now have the substrate to learn any of them without hand-waving. For "is my mental model right?" questions, the communities in Resources are the place to test yourself against real practitioners.