AI Foundations · Lesson 14 · Phase 5 — Under the hood of GPT

Your First Language Model

Today you build a model that invents girls' names, and in doing so you'll hold, in a few dozen lines, the exact job description of GPT. A language model does one thing: given some text, it assigns a probability to every possible next token, and generation is just sampling from those probabilities, over and over (3Blue1Brown). GPT does it with a network of billions of adjustable knobs over a vocabulary of roughly 100k tokens; yours will do it with a counting table over 16 characters. Same machine, different size, and yours you can read whole.

The job: predict the next token

Look back at lesson 12: your digit reader output 10 scores, biggest wins. A language model is the same shape with two twists: it has one output per vocabulary token (lesson 13), and instead of always taking the biggest, it turns the scores into probabilities and rolls dice. Generation is a loop:

predict probabilities for the next token → sample one → append it → repeat

That loop, called autoregressive generation, is all that's happening when ChatGPT "types" at you: one token at a time, each one sampled, then fed back in as context for the next. See it plainly now and you'll never be mystified by it again.
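
To make the four steps of the loop concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the tiny word table `TOY_TABLE`, the `<s>` and `</s>` start and stop markers, and the helper names `predict` and `generate`. A real model would replace `predict` with something far larger, but the loop around it would keep this shape.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token only.
TOY_TABLE = {
    "<s>":    {"the": 0.6, "a": 0.4},
    "the":    {"cat": 0.5, "dog": 0.5},
    "a":      {"cat": 0.5, "dog": 0.5},
    "cat":    {"sleeps": 0.7, "</s>": 0.3},
    "dog":    {"sleeps": 0.7, "</s>": 0.3},
    "sleeps": {"</s>": 1.0},
}

def predict(context):
    """Return {token: probability} for the next token, given the text so far."""
    return TOY_TABLE[context[-1]]

def generate(rng, max_tokens=20):
    context = ["<s>"]
    for _ in range(max_tokens):
        probs = predict(context)                        # 1. predict probabilities
        tokens, weights = zip(*probs.items())
        nxt = rng.choices(tokens, weights=weights)[0]   # 2. sample one
        if nxt == "</s>":                               # the stop marker ends the text
            break
        context.append(nxt)                             # 3. append it, 4. repeat
    return context[1:]                                  # drop the start marker

print(" ".join(generate(random.Random(0))))
```

Changing the seed passed to `random.Random` changes which sentence comes out; the table itself never changes.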

The simplification that makes it buildable today: bigrams

GPT predicts from everything written so far. Today we predict from only the one previous character — a model called a bigram ("pair") model. It's the classic first language model — lesson one of Andrej Karpathy's language-modeling course builds exactly this — and it needs no training loop at all, because for one-character context you can get the probabilities by counting:

P(next = b | current = a) = count(a→b) ⁄ count(a→anything)

Symbol by symbol: P(next = b | current = a) reads "the probability that the next character is b, given that the current one is a" — the vertical bar | means "given", and it's the only new notation today. The right side is just counting: of all the times a appeared, what fraction had b right after it? Count the pairs in real names, divide — done. No gradient descent needed (we'll reconnect to it at the end).
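
As a worked instance of the formula, the sketch below computes one entry, P(next = . | current = a), from this lesson's twenty names. It assumes each name is wrapped in dots (the next section explains that trick), and the variable names `pair_counts`, `a_to_dot` and `a_to_any` are chosen here for illustration.

```python
from collections import Counter

# The twenty names used in this lesson.
names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]

pair_counts = Counter()              # count(a -> b) for every adjacent pair
for name in names:
    s = "." + name + "."             # the dot marks both start and end
    pair_counts.update(zip(s, s[1:]))

a_to_dot = pair_counts[("a", ".")]   # count(a -> .)
a_to_any = sum(c for (cur, _), c in pair_counts.items() if cur == "a")  # count(a -> anything)

print(f"P(next='.' | current='a') = {a_to_dot}/{a_to_any} = {a_to_dot / a_to_any:.2f}")
# prints: P(next='.' | current='a') = 19/28 = 0.68
```

Of the 28 times a appears in the list, 19 are the last letter of a name, so the fraction is 19/28.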

Build it: count, divide, sample

Twenty real names go in. One trick to know: we wrap each name as .emma. — the dot is a special token meaning both "a name starts" and "a name ends". Because the dot is in the counting, the model learns where names stop, all by itself. (Real LLMs do the same with special tokens — remember lesson 13's 258 special entries in GPT-4's vocabulary.) The random generator is written out by hand so that Swift and Python produce bit-for-bit identical names — run either and compare with the output below:

import Foundation

let names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
             "harper", "luna", "nora", "anna", "maria", "elena", "dina",
             "lara", "nina", "vera", "alina", "dana", "mila"]

// vocabulary: '.' marks both start and end of a name
let vocab = ["."] + Array(Set(names.joined())).sorted().map(String.init)
var stoi: [String: Int] = [:]
for (i, ch) in vocab.enumerated() { stoi[ch] = i }
let V = vocab.count

// count every adjacent pair across all names
var N = Array(repeating: Array(repeating: 0, count: V), count: V)
for name in names {
    let s = ".\(name).".map(String.init)
    for k in 0..<s.count - 1 {
        N[stoi[s[k]]!][stoi[s[k + 1]]!] += 1
    }
}

// counts -> probabilities, row by row
var P: [[Double]] = []
for row in N {
    let total = row.reduce(0, +)
    P.append(total > 0 ? row.map { Double($0) / Double(total) }
                       : Array(repeating: 1.0 / Double(V), count: V))
}

// what does the model believe comes after 'a'?
let rowA = P[stoi["a"]!]
let top = (0..<V).sorted { rowA[$0] > rowA[$1] }.prefix(3)
print("after 'a':", top.map { "'\(vocab[$0])' " + String(format: "%.2f", rowA[$0]) }
                       .joined(separator: ", "))

// deterministic random numbers (same algorithm as the Python version)
var state: UInt32 = 42
func rand() -> Double {
    state = 1664525 &* state &+ 1013904223
    return Double(state) / 4294967296.0
}

func sample(_ probs: [Double]) -> Int {
    let r = rand()
    var c = 0.0
    for (i, p) in probs.enumerated() {
        c += p
        if r < c { return i }
    }
    return V - 1
}

print("\ninvented names:")
for _ in 0..<8 {
    var out: [String] = [], cur = 0
    while true {
        cur = sample(P[cur])          // roll the dice on row P[cur]
        if cur == 0 { break }         // sampled the dot: the name is over
        out.append(vocab[cur])
    }
    print("  " + (out.isEmpty ? "(empty)" : out.joined()))
}
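
The Python twin of the Swift listing is not reproduced in this text. A line-by-line port might look like the sketch below, assuming Python 3; the `% 2**32` stands in for Swift's wrapping UInt32 arithmetic, and the `invented` list is an addition here so the results can be inspected afterwards.

```python
names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]

# vocabulary: '.' marks both start and end of a name
vocab = ["."] + sorted(set("".join(names)))
stoi = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

# count every adjacent pair across all names
N = [[0] * V for _ in range(V)]
for name in names:
    s = "." + name + "."
    for a, b in zip(s, s[1:]):
        N[stoi[a]][stoi[b]] += 1

# counts -> probabilities, row by row (uniform fallback for an empty row)
P = []
for row in N:
    total = sum(row)
    P.append([c / total for c in row] if total > 0 else [1.0 / V] * V)

# what does the model believe comes after 'a'?
row_a = P[stoi["a"]]
top = sorted(range(V), key=lambda i: row_a[i], reverse=True)[:3]
print("after 'a':", ", ".join(f"'{vocab[i]}' {row_a[i]:.2f}" for i in top))

# deterministic random numbers: the same 32-bit generator as the Swift version
state = 42
def rand():
    global state
    state = (1664525 * state + 1013904223) % 2**32   # mimic UInt32 wraparound
    return state / 4294967296.0

def sample(probs):
    r = rand()
    c = 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c:
            return i
    return V - 1

print("\ninvented names:")
invented = []
for _ in range(8):
    out, cur = [], 0
    while True:
        cur = sample(P[cur])       # roll the dice on row P[cur]
        if cur == 0:               # sampled the dot: the name is over
            break
        out.append(vocab[cur])
    invented.append("".join(out))
    print("  " + (invented[-1] or "(empty)"))
```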

Both programs print exactly this:

after 'a': '.' 0.68, 'r' 0.11, 'n' 0.07

invented names:
  da
  da
  harpha
  opelunna
  lariana
  a
  nna
  linivela

Read the first line like a fortune teller's notebook: after an a, the model is 68% sure the name is over. It has learned, purely from counting, that these names usually end in a. And the inventions: harpha, lariana, and linivela were never in the list; the model composed them. It also produced da (twice) and a bare a, because a bigram remembers only one character of context. It knows a often ends a name; it cannot know the name so far is only a letter or two long. Hold that thought; it's the whole motivation for lesson 16.
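
One way to see how a counting table can compose names it never saw: every adjacent pair inside an invented name does occur somewhere in the list, even though the whole name does not. The sketch below checks that; the helper `unseen_pairs` and the counter-example "mae" are illustrative choices, not part of the lesson's programs.

```python
names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]

# every adjacent pair the model has counted, including the start/end dot
seen_pairs = set()
for name in names:
    s = "." + name + "."
    seen_pairs.update(zip(s, s[1:]))

def unseen_pairs(candidate):
    """Return the adjacent pairs in candidate that never occur in the name list."""
    s = "." + candidate + "."
    return [a + b for a, b in zip(s, s[1:]) if (a, b) not in seen_pairs]

for word in ["harpha", "lariana", "linivela", "mae"]:
    status = "in the list" if word in names else "new"
    print(f"{word}: {status}, unseen pairs: {unseen_pairs(word)}")
```

The three invented names come back with no unseen pairs, so each step of each name had nonzero probability. "mae" contains pairs the table has never counted (a followed by e, and e at the end of a name), so this model can never produce it.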

Feel it: open the model's head

This widget holds the same table P. Pick a character to see the model's beliefs about what follows it. Then press invent a name — it uses the same dice as the programs, so your first eight names will be exactly the eight above.

Why sample? The greedy trap

Why roll dice instead of always taking the most likely next character? Press the second button and see: starting from the dot, the most likely first character is a (4 of our 20 names start with it), and after a the most likely "next" is the end-dot (that 0.68). So greedy generation produces the name a — and, being deterministic, it produces a again, forever. One boring answer, repeated. Sampling is what buys variety — and the dial that adjusts how adventurous the dice are is called temperature, which you'll meet properly in lesson 18 (it's the same knob you see in every LLM API).
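
The greedy trap can also be reproduced without the widget. This sketch rebuilds the pair counts and always follows the most frequent next character; the function name `greedy_name` and the `max_len` guard are assumptions added here, the guard being there in case a greedy path never reaches the dot.

```python
names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]

vocab = ["."] + sorted(set("".join(names)))
stoi = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

counts = [[0] * V for _ in range(V)]
for name in names:
    s = "." + name + "."
    for a, b in zip(s, s[1:]):
        counts[stoi[a]][stoi[b]] += 1

def greedy_name(max_len=20):
    out, cur = [], 0                    # start at the dot
    for _ in range(max_len):            # guard: stop even if the dot is never reached
        row = counts[cur]
        cur = row.index(max(row))       # no dice: always the most frequent follower
        if cur == 0:                    # reached the end-dot
            break
        out.append(vocab[cur])
    return "".join(out)

print([greedy_name() for _ in range(3)])   # ['a', 'a', 'a']
```

Picking the largest count is the same as picking the largest probability, because dividing a row by its total does not change which entry is biggest.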

The bridge to GPT — two upgrades, that's all

Write our model's recipe next to GPT's:

bigram:  P(next | one previous character) — stored in a counted table
GPT:    P(next | everything so far) — computed by a trained network

Upgrade 1: more context. One character of memory produced da. Real text needs the whole story so far — but "the story so far" is different for every position in every text, so something must let each new token look back at all the others and decide what matters. That mechanism is attention, lesson 16.

Upgrade 2: compute, don't store. Our table is 16×16, which is fine. But a table over longer contexts explodes: two characters of context need V² rows, three need V³, and so on. With GPT-4's 100,258-token vocabulary and contexts thousands of tokens long, the table would need more rows than there are atoms in the universe. Nobody stores that table. Instead a neural network computes the needed row on demand: context in, probabilities out. The network's weights are trained with exactly your lesson-9 loop, using exactly lesson 12's cross-entropy loss on the true next token. (That sentence is the entire secret of LLM training; lesson 18 unpacks it.)
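
The explosion is easy to check with exact integer arithmetic. The sketch below assumes the common rough estimate of 10^80 atoms in the observable universe (an outside figure, not one from this lesson) and uses the vocabulary sizes quoted above; `table_rows` is a name chosen here.

```python
V_CHARS = 16          # this lesson's character vocabulary
V_GPT4 = 100_258      # the token-vocabulary size quoted in this lesson
ATOMS = 10**80        # assumption: rough estimate for the observable universe

def table_rows(vocab_size, context_len):
    """Rows needed for a lookup table keyed by context_len previous tokens."""
    return vocab_size ** context_len

for k in (1, 2, 3):
    print(f"{k} character(s) of context: {table_rows(V_CHARS, k)} rows")   # 16, 256, 4096

# smallest context length at which the GPT-sized table outgrows the atom estimate
smallest = 1
while table_rows(V_GPT4, smallest) <= ATOMS:
    smallest += 1
print("context length where rows exceed 10^80:", smallest)   # 16
```

Under that assumption the table is already hopeless at 16 tokens of context, long before contexts reach the thousands.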

Full circle with lesson 12. Your digit reader: 784 pixels in → 10 scores → biggest wins. A GPT-style model: context tokens in → 100,258 scores → softmax → sample. The output layer of ChatGPT is your MNIST output layer with a bigger vocabulary and dice at the end.
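
The last two steps, softmax then sample, fit in a few lines. This is a sketch with three made-up scores rather than 100,258 real ones; the `softmax` helper below is the standard formula with the maximum subtracted for numerical stability, not code taken from any particular model.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(scores)                               # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                          # hypothetical scores for three tokens
probs = softmax(scores)
print([round(p, 3) for p in probs])               # [0.659, 0.242, 0.099]

# lesson-12 style: biggest wins, the same answer every time
argmax = probs.index(max(probs))
print("biggest wins:", argmax)

# language-model style: roll the dice according to the probabilities
rng = random.Random(0)
draws = rng.choices(range(len(probs)), weights=probs, k=1000)
print("sampled counts:", [draws.count(i) for i in range(len(probs))])
```

The most likely token still wins most of the time, but the other two get their turn roughly in proportion to their probabilities.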

Check yourself

No peeking back. Pull it from memory.

1. What is the one job of any language model?
2. Our model invented "harpha", which is in no name list. How?
3. Why does greedy generation (always the most likely) disappoint?

Watch this next

Primary source: Andrej Karpathy — "The spelled-out intro to language modeling: building makemore". He builds this exact bigram model on 32,000 real names, then — and this is the part to watch for — replaces the counting with a trained one-layer network and shows they land on the same probabilities. It's the cleanest demonstration ever filmed that "counting" and "training with cross-entropy" are two roads to the same place.