Today you build a model that invents girls' names, and in doing so you'll hold, in a few dozen lines of code, the exact job description of GPT. A language model does one thing: given some text, it assigns a probability to every possible next token, and generation is just sampling from those probabilities, over and over (3Blue1Brown). GPT does it with a network of billions of knobs over 100k tokens; yours will do it with a counting table over 16 characters. Same machine, different size, and yours you can read whole.
Look back at lesson 12: your digit reader output 10 scores, biggest wins. A language model is the same shape with two twists: it has one output per vocabulary token (lesson 13), and instead of always taking the biggest, it treats the scores as probabilities and rolls dice. Generation is a loop:
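A minimal sketch of that loop, under loud assumptions: `next_token_probs` is a hypothetical stand-in for whatever model supplies the probabilities (here a fixed toy table that ignores its context), and `"."` is assumed to be the token that ends the text.

```python
import random

def next_token_probs(context):
    """Hypothetical stand-in for a model: returns {token: probability}.

    This toy version ignores the context; a real model would use it.
    """
    return {"a": 0.5, "b": 0.3, ".": 0.2}

def generate(max_tokens=20, seed=0):
    rng = random.Random(seed)              # seeded so reruns give the same text
    context = []
    for _ in range(max_tokens):
        probs = next_token_probs(context)  # 1. a probability for every token
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights)[0]  # 2. roll the dice
        if token == ".":                   # 3. the end token stops the loop
            break
        context.append(token)              # 4. feed the sample back in as context
    return "".join(context)

print(generate())
```

Swap the toy table for a real model and the four numbered steps stay the same.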
That loop, called autoregressive generation, is all that's happening when ChatGPT "types" at you: one token at a time, each one sampled, then fed back in as context for the next. Take the drama out of it now and you'll never be mystified by it again.
GPT predicts from everything written so far. Today we predict from only the one previous character — a model called a bigram ("pair") model. It's the classic first language model — lesson one of Andrej Karpathy's language-modeling course builds exactly this — and it needs no training loop at all, because for one-character context you can get the probabilities by counting:
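Spelled out as a formula, in a form reconstructed from the description that follows (the wording of the numerator and denominator is a paraphrase, not a quotation):

```latex
P(\text{next} = b \mid \text{current} = a)
  = \frac{\text{number of times } b \text{ comes right after } a}
         {\text{number of times } a \text{ is followed by anything}}
```

In the programs further down, the numerator is one cell of the count table N and the denominator is that cell's row total.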
Symbol by symbol: P(next = b | current = a) reads "the probability that the next character is b, given that the current one is a" — the vertical bar | means "given", and it's the only new notation today. The right side is just counting: of all the times a appeared, what fraction had b right after it? Count the pairs in real names, divide — done. No gradient descent needed (we'll reconnect to it at the end).
Twenty real names go in. One trick to know: we wrap each name as .emma. — the dot is a special token meaning both "a name starts" and "a name ends". Because the dot is in the counting, the model learns where names stop, all by itself. (Real LLMs do the same with special tokens — remember lesson 13's 258 special entries in GPT-4's vocabulary.) The random generator is written out by hand so that Swift and Python produce bit-for-bit identical names — run either and compare with the output below:
// Swift version
import Foundation

let names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
             "harper", "luna", "nora", "anna", "maria", "elena", "dina",
             "lara", "nina", "vera", "alina", "dana", "mila"]

// vocabulary: '.' marks both start and end of a name
let vocab = ["."] + Array(Set(names.joined())).sorted().map(String.init)
var stoi: [String: Int] = [:]
for (i, ch) in vocab.enumerated() { stoi[ch] = i }
let V = vocab.count

// count every adjacent pair across all names
var N = Array(repeating: Array(repeating: 0, count: V), count: V)
for name in names {
    let s = ".\(name).".map(String.init)
    for k in 0..<s.count - 1 {
        N[stoi[s[k]]!][stoi[s[k + 1]]!] += 1
    }
}

// counts -> probabilities, row by row
// (a row with no counts falls back to a uniform distribution)
var P: [[Double]] = []
for row in N {
    let total = row.reduce(0, +)
    P.append(total > 0 ? row.map { Double($0) / Double(total) }
                       : Array(repeating: 1.0 / Double(V), count: V))
}

// what does the model believe comes after 'a'?
let rowA = P[stoi["a"]!]
let top = (0..<V).sorted { rowA[$0] > rowA[$1] }.prefix(3)
print("after 'a':", top.map { "'\(vocab[$0])' " + String(format: "%.2f", rowA[$0]) }
    .joined(separator: ", "))

// deterministic random numbers (same algorithm as the Python version)
var state: UInt32 = 42
func rand() -> Double {
    state = 1664525 &* state &+ 1013904223   // &* and &+ wrap around at 2^32
    return Double(state) / 4294967296.0
}

// walk the cumulative probabilities until they pass the random number r
func sample(_ probs: [Double]) -> Int {
    let r = rand()
    var c = 0.0
    for (i, p) in probs.enumerated() {
        c += p
        if r < c { return i }
    }
    return V - 1
}

print("\ninvented names:")
for _ in 0..<8 {
    var out: [String] = [], cur = 0
    while true {
        cur = sample(P[cur])      // roll the dice on row P[cur]
        if cur == 0 { break }     // sampled the dot: the name is over
        out.append(vocab[cur])
    }
    print(" " + (out.isEmpty ? "(empty)" : out.joined()))
}

# Python version
names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]

# vocabulary: '.' marks both start and end of a name
vocab = ["."] + sorted(set("".join(names)))
stoi = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

# count every adjacent pair across all names
N = [[0] * V for _ in range(V)]
for name in names:
    s = "." + name + "."
    for a, b in zip(s, s[1:]):
        N[stoi[a]][stoi[b]] += 1

# counts -> probabilities, row by row
# (a row with no counts falls back to a uniform distribution)
P = []
for row in N:
    total = sum(row)
    P.append([c / total for c in row] if total else [1.0 / V] * V)

# what does the model believe comes after 'a'?
row = P[stoi["a"]]
top = sorted(range(V), key=lambda i: -row[i])[:3]
print("after 'a':", ", ".join(f"'{vocab[i]}' {row[i]:.2f}" for i in top))

# deterministic random numbers (same algorithm as the Swift version)
state = 42
def rand():
    global state
    state = (1664525 * state + 1013904223) % 2**32
    return state / 2**32

# walk the cumulative probabilities until they pass the random number r
def sample(probs):
    r = rand()
    c = 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c: return i
    return V - 1

print("\ninvented names:")
for _ in range(8):
    out, cur = [], 0
    while True:
        cur = sample(P[cur])      # roll the dice on row P[cur]
        if cur == 0: break        # sampled the dot: the name is over
        out.append(vocab[cur])
    print(" " + ("".join(out) or "(empty)"))

Both programs print exactly this:
after 'a': '.' 0.68, 'r' 0.11, 'n' 0.07
invented names:
da
da
harpha
opelunna
lariana
a
nna
linivela
Read the first line like a fortune teller's notebook: after an 'a', the model is 68% sure the name is over. It has learned, purely from counting, that these names usually end in 'a'. And the inventions: harpha, lariana, linivela were never in the list; the model composed them. It also produced da and a bare a, because a bigram only remembers one character of context. It knows that 'a' often ends a name; it cannot know that the name so far is only one or two letters long. Hold that thought; it's the whole motivation for lesson 16.
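To put numbers on that, here is a small sketch that rebuilds the pair counts from the same twenty names and multiplies the bigram probabilities along a name. The helper names `prob` and `name_prob` are illustrative choices, not part of the programs above.

```python
from collections import Counter, defaultdict

names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]

# counts[cur][nxt] = how often nxt came right after cur
counts = defaultdict(Counter)
for name in names:
    s = "." + name + "."
    for cur, nxt in zip(s, s[1:]):
        counts[cur][nxt] += 1

def prob(cur, nxt):
    """P(next = nxt | current = cur), by counting."""
    return counts[cur][nxt] / sum(counts[cur].values())

def name_prob(name):
    """Chance that one sampled name is exactly `name`: product of its bigram steps."""
    s = "." + name + "."
    p = 1.0
    for cur, nxt in zip(s, s[1:]):
        p *= prob(cur, nxt)
    return p

print(f"P('.' | 'a') = {counts['a']['.']}/{sum(counts['a'].values())}")  # 19/28
print(f"P(name is 'a')  = {name_prob('a'):.3f}")   # 4/20 * 19/28
print(f"P(name is 'da') = {name_prob('da'):.3f}")  # 2/20 * 1/2 * 19/28
```

Of the 28 times 'a' appears in the list, 19 are followed by the closing dot, which is where the 0.68 comes from. By the same arithmetic, roughly one sampled name in seven is a bare a and about one in thirty is da, so seeing them among eight samples is not bad luck.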
This widget holds the same table P. Pick a character to see the model's beliefs about what follows it. Then press "invent a name": it uses the same dice as the programs, so your first eight names will be exactly the eight above.
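"The same dice" can be traced by hand for the very first roll. The sketch below reuses the generator from the programs above and rebuilds only the first row of the table (which letters start a name); the variable names `starts` and `first` are illustrative.

```python
names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]
vocab = ["."] + sorted(set("".join(names)))

# first row of P: probability of each character right after the opening dot
starts = [sum(name[0] == ch for name in names) / len(names) for ch in vocab]

# same generator as in the lesson's programs, seeded with 42
state = 42
def rand():
    global state
    state = (1664525 * state + 1013904223) % 2**32
    return state / 2**32

r = rand()
c = 0.0
for ch, p in zip(vocab, starts):
    c += p                 # running total of probability "slots"
    if r < c:              # r fell inside this character's slot
        first = ch
        break

print(f"r = {r:.4f} -> first letter {first!r}")  # r = 0.2523 -> first letter 'd'
```

The slots run in vocabulary order: 'a' owns 0.0 to 0.2 (four names), 'b' owns nothing, and 'd' owns 0.2 to 0.3 (two names). The first random number, about 0.2523, lands in the 'd' slot, which matches the first invented name starting with d.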
Why roll dice instead of always taking the most likely next character? Press the second button and see: starting from the dot, the most likely first character is 'a' (4 of our 20 names start with it), and after 'a' the most likely "next" is the end-dot (that 0.68). So greedy generation produces the name "a" and, being deterministic, it produces "a" again, forever. One boring answer, repeated. Sampling is what buys variety, and the dial that adjusts how adventurous the dice are is called temperature, which you'll meet properly in lesson 18 (it's the same knob you see in nearly every LLM API).
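The greedy behaviour is easy to check in code. This is a sketch rather than the widget's actual implementation: `greedy_name` is a hypothetical helper that always takes the largest count in the current row (the same choice as the largest probability), with a length cap added as a safety net.

```python
names = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
         "harper", "luna", "nora", "anna", "maria", "elena", "dina",
         "lara", "nina", "vera", "alina", "dana", "mila"]
vocab = ["."] + sorted(set("".join(names)))
stoi = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

# pair counts, exactly as in the lesson's program
N = [[0] * V for _ in range(V)]
for name in names:
    s = "." + name + "."
    for a, b in zip(s, s[1:]):
        N[stoi[a]][stoi[b]] += 1

def greedy_name(max_len=10):
    """Always pick the most frequent follower instead of sampling."""
    out, cur = [], 0                       # start at the dot
    for _ in range(max_len):               # cap is an added safeguard
        cur = max(range(V), key=lambda i: N[cur][i])
        if cur == 0:                       # the dot was the top choice: stop
            break
        out.append(vocab[cur])
    return "".join(out)

print([greedy_name() for _ in range(3)])   # ['a', 'a', 'a']
```

No random number appears anywhere in `greedy_name`, so every call walks the same path: dot, then 'a', then dot.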
Write our model's recipe next to GPT's:
Upgrade 1: more context. One character of memory produced da. Real text needs the whole story so far — but "the story so far" is different for every position in every text, so something must let each new token look back at all the others and decide what matters. That mechanism is attention, lesson 16.
Upgrade 2: compute, don't store. Our table is 16×16 — fine. But a table over contexts explodes: even just two characters of context needs V² rows, three needs V³… with GPT-4's 100,258-token vocabulary and contexts thousands of tokens long, the table would need more rows than there are atoms in the universe. Nobody stores that table. Instead a neural network computes its row on demand: context in, probabilities out — and the network's weights are trained with exactly your lesson-9 loop, using exactly lesson 12's cross-entropy loss on the true next token. (That sentence is the entire secret of LLM training; lesson 18 unpacks it.)
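The explosion can be checked with a few lines of arithmetic. In this sketch the vocabulary sizes come from the text, while the context length of 1,000 tokens and the figure of roughly 10^80 atoms in the observable universe are assumptions used for scale (the latter is a commonly quoted estimate).

```python
import math

def table_rows(vocab_size, context_len):
    """Rows in a lookup table that has one row per possible context."""
    return vocab_size ** context_len

# our character model: still tiny
for k in (1, 2, 3):
    print(f"V=16, context={k}: {table_rows(16, k):,} rows")

# GPT scale: work in powers of ten, the raw number is too long to be useful
V_GPT = 100_258     # vocabulary size quoted in the text
CONTEXT = 1_000     # assumption for illustration: 1,000 tokens of context
exponent = CONTEXT * math.log10(V_GPT)
print(f"V={V_GPT:,}, context={CONTEXT:,}: about 10^{exponent:.0f} rows")
print("atoms in the observable universe: roughly 10^80")
```

Even a context of 17 tokens already passes 10^80 rows at this vocabulary size, since each extra token of context multiplies the row count by about 10^5.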
Primary source: Andrej Karpathy — "The spelled-out intro to language modeling: building makemore". He builds this exact bigram model on 32,000 real names, then — and this is the part to watch for — replaces the counting with a trained one-layer network and shows they land on the same probabilities. It's the cleanest demonstration ever filmed that "counting" and "training with cross-entropy" are two roads to the same place.