Lesson 13 left a problem hiding in plain sight. Token IDs are arbitrary: in our toy vocabulary e was 2 and n was 3, but e isn't "one less" than n in any meaningful way — the numbers are just list positions. A network fed raw IDs would be learning from noise. So every LLM's first real move is to trade each ID for something with meaning: a learned vector — a list of numbers, exactly lesson 3's kind — called an embedding. And meaning, it turns out, becomes geometry: similar words sit near each other, and directions in the space stand for concepts.
The machinery could not be smaller. The model owns a big table, the embedding matrix, with one learned vector per vocabulary token. In 3Blue1Brown's description of GPT: "the first matrix of the transformer, known as the embedding matrix, will have one column for each of these words. These columns determine what vector each word turns into." (One vector per token; whether they're stored as rows or columns is a bookkeeping convention, like lesson 11's xAᵀ.) Then embedding a token means reading out the entry stored at its ID:

vector = table[ID]
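A minimal sketch of that lookup in Python, assuming the table is stored with one row per token. IDs 2 and 3 echo the toy vocabulary's e and n; the other rows, the numbers in `table`, and the helper name `embed` are made up purely for illustration.

```python
# Hypothetical embedding table: one row (vector) per token ID.
# The values are invented for illustration; a real model learns them.
table = [
    [ 0.1, -0.3],   # ID 0
    [ 0.7,  0.2],   # ID 1
    [-0.5,  0.9],   # ID 2 ("e" in the toy vocabulary)
    [ 0.4,  0.4],   # ID 3 ("n" in the toy vocabulary)
]

def embed(token_ids):
    """Swap each token ID for the vector stored at that index."""
    return [table[i] for i in token_ids]

vectors = embed([3, 2, 3])
print(vectors)
```

Note that the two occurrences of ID 3 come back as the very same vector: the lookup knows nothing about where a token sits in the sequence.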
That's it — an array lookup. The interesting part is where the table's numbers come from: nobody writes them. They start random (lesson 10's initialization) and are nudged by the same gradient-descent loop as every other weight, because the embedding matrix is just more trainable parameters. The meanings below emerge from training on next-token prediction. For scale: in GPT-3, each of these vectors has 12,288 dimensions (3Blue1Brown); our demo today has 2, so we can draw it.
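To make the "nudged like every other weight" point concrete, here is a toy sketch in Python. It assumes a made-up loss (the squared distance between one token's vector and a hypothetical target) rather than real next-token prediction, and the table size, seed, and learning rate are arbitrary choices for illustration.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

# 4 tokens x 2 dimensions, starting from small random numbers.
table = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]
before = [row[:] for row in table]

token_id = 3            # the token that appeared in this training step
target = [1.0, -1.0]    # hypothetical vector the toy loss pulls toward
learning_rate = 0.1

def loss(vec):
    # Toy loss: squared distance to the target (a stand-in for a real loss).
    return sum((x - t) ** 2 for x, t in zip(vec, target))

loss_before = loss(table[token_id])

# Gradient of the toy loss with respect to the looked-up vector: 2 * (x - t).
grad = [2 * (x - t) for x, t in zip(table[token_id], target)]

# One gradient-descent step: in this toy, only the looked-up row changes.
table[token_id] = [x - learning_rate * g for x, g in zip(table[token_id], grad)]

loss_after = loss(table[token_id])
print(f"loss before: {loss_before:.4f}  after: {loss_after:.4f}")
```

In real training the gradient comes from the next-token prediction error rather than a fixed target, but the update rule has the same shape.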
Why does training produce meaningful vectors? Because words used in similar contexts get pushed toward similar predictions, and the cheapest way to make similar predictions is to have similar vectors. This was made famous by word2vec (2013), which "represents a word as a high-dimension vector of numbers which capture relationships between words" so that "words which appear in similar contexts are mapped to vectors which are nearby" (Wikipedia). Two consequences, both measurable:
Nearness = similarity. And "near" is measured with a tool you've owned since lesson 1: the dot product. Two vectors pointing the same way → big positive dot product; unrelated → near zero; opposite → negative. Usually it's normalized into cosine similarity, which ranges from −1 to +1:

cosine(u, v) = (u·v) / (‖u‖ ‖v‖)
Symbol by symbol: u·v is the lesson-1 dot product; ‖u‖ is the vector's length (lesson 6's double bars — the square root of the vector dotted with itself); dividing by both lengths removes "loudness" so only direction counts. +1 means same direction, 0 unrelated, −1 opposite.
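A quick numeric check of the "loudness" point, sketched in Python with two arbitrary example vectors (the values are made up for illustration): stretching a vector changes its dot product with another vector but leaves the cosine alone.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def length(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (length(u) * length(v))

u = [3.0, 4.0]
v = [4.0, 3.0]
loud_v = [10 * x for x in v]   # same direction as v, ten times longer

# The dot product grows with length...
print(dot(u, v), dot(u, loud_v))                             # 24.0 240.0
# ...but the cosine only sees direction.
print(round(cosine(u, v), 3), round(cosine(u, loud_v), 3))   # 0.96 0.96
```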
Directions = concepts. The space organizes itself so that, e.g., the displacement from man to woman is roughly the same arrow wherever you apply it. Word2vec's famous party trick (Wikipedia): "the vector representation of 'Brother' − 'Man' + 'Woman' produces a result which is closest to the vector representation of 'Sister'". 3Blue1Brown shows the same with royalty — the difference between woman and man "is quite similar to the difference between king and queen" — and even Italy − Germany + Hitler landing near Mussolini. Arithmetic on meaning.
Real embeddings are learned; ours today are hand-made so every number is checkable. Two axes: masculine↔feminine, and royal↕ordinary. Five words. Watch the dot products sort friends from strangers, and watch the famous arithmetic land exactly on queen:
import Foundation

// a hand-made 2D embedding space:
//   axis 1: masculine +1 ... feminine -1
//   axis 2: royal +2 ... ordinary 0 ... object -2
let emb: [(String, [Double])] = [
    ("king",  [ 1.0,  2.0]),
    ("queen", [-1.0,  2.0]),
    ("man",   [ 1.0,  0.0]),
    ("woman", [-1.0,  0.0]),
    ("apple", [ 0.0, -2.0]),
]

func vec(_ w: String) -> [Double] { emb.first { $0.0 == w }!.1 }
func dot(_ u: [Double], _ v: [Double]) -> Double { zip(u, v).map(*).reduce(0, +) }
func length(_ u: [Double]) -> Double { sqrt(dot(u, u)) }
func cosine(_ u: [Double], _ v: [Double]) -> Double { dot(u, v) / (length(u) * length(v)) }

print("how similar is 'king' to ...")
for (w, v) in emb {
    let pad = String(repeating: " ", count: 6 - w.count)
    print(" \(w)\(pad) dot = " + String(format: "%+5.1f", dot(vec("king"), v))
        + " cosine = " + String(format: "%+.3f", cosine(vec("king"), v)))
}

// the famous arithmetic: king - man + woman = ?
let t = zip(zip(vec("king"), vec("man")), vec("woman")).map { $0.0 - $0.1 + $1 }
print("\nking - man + woman = [\(t[0]), \(t[1])]")
let best = emb.max { cosine(t, $0.1) < cosine(t, $1.1) }!.0
print("nearest word: '\(best)' (cosine " + String(format: "%.3f", cosine(t, vec(best))) + ")")
import math

# a hand-made 2D embedding space:
#   axis 1: masculine +1 ... feminine -1
#   axis 2: royal +2 ... ordinary 0 ... object -2
emb = {"king":  [ 1.0,  2.0],
       "queen": [-1.0,  2.0],
       "man":   [ 1.0,  0.0],
       "woman": [-1.0,  0.0],
       "apple": [ 0.0, -2.0]}

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def length(u): return math.sqrt(dot(u, u))
def cosine(u, v): return dot(u, v) / (length(u) * length(v))

print("how similar is 'king' to ...")
for w, v in emb.items():
    print(f" {w:6s} dot = {dot(emb['king'], v):+5.1f} cosine = {cosine(emb['king'], v):+.3f}")

# the famous arithmetic: king - man + woman = ?
t = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
print(f"\nking - man + woman = {t}")
best = max(emb, key=lambda w: cosine(t, emb[w]))
print(f"nearest word: '{best}' (cosine {cosine(t, emb[best]):.3f})")
Both print exactly this:
how similar is 'king' to ...
 king   dot =  +5.0 cosine = +1.000
 queen  dot =  +3.0 cosine = +0.600
 man    dot =  +1.0 cosine = +0.447
 woman  dot =  -1.0 cosine = -0.447
 apple  dot =  -4.0 cosine = -0.894

king - man + woman = [-1.0, 2.0]
nearest word: 'queen' (cosine 1.000)
Check the story the numbers tell: king↔queen similar (+0.600, both royal), king↔apple near-opposite (−0.894), and king − man + woman lands on [−1, 2], which is queen's vector, cosine exactly 1.000. Subtracting man cancelled the masculine component, adding woman put a feminine one in its place, and the royalty component never moved. You can verify every line with paper and a square root.
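The same check can be wrapped in a small helper for trying other A − B + C combinations. This is a sketch in Python that reuses the five hand-made vectors; the name `analogy` is invented here. One caveat of so tiny a space: some combinations cancel to the zero vector, which has no direction, so the helper returns None in that case (an illustrative choice, not a standard convention).

```python
import math

emb = {"king":  [ 1.0,  2.0],
       "queen": [-1.0,  2.0],
       "man":   [ 1.0,  0.0],
       "woman": [-1.0,  0.0],
       "apple": [ 0.0, -2.0]}

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def length(u): return math.sqrt(dot(u, u))
def cosine(u, v): return dot(u, v) / (length(u) * length(v))

def analogy(a, b, c):
    """Return the word nearest (by cosine) to emb[a] - emb[b] + emb[c]."""
    t = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    if length(t) == 0:
        return None  # zero vector has no direction, so cosine is undefined
    return max(emb, key=lambda w: cosine(t, emb[w]))

# The man -> woman arrow and the king -> queen arrow are the same displacement.
man_to_woman = [w - m for m, w in zip(emb["man"], emb["woman"])]
king_to_queen = [q - k for k, q in zip(emb["king"], emb["queen"])]
print(man_to_woman, king_to_queen)        # [-2.0, 0.0] [-2.0, 0.0]

print(analogy("king", "man", "woman"))    # queen
print(analogy("queen", "woman", "man"))   # king
print(analogy("king", "man", "apple"))    # None (the result is [0.0, 0.0])
```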
[Interactive figure: the same five vectors drawn as arrows. Choosing any A − B + C shows the arrow arithmetic and highlights the nearest word by cosine.]
One missing ingredient before the talking starts: the lookup gives the same vector for "bank" in "river bank" and "bank account" — and the same vector for a word whether it's the first or the fifteenth token. So the model also mixes position information into each vector (you'll see where in lesson 17), and then lets context reshape meaning. The mechanism that does the reshaping — each token's vector looking at all the others and updating itself — is attention. That's next, and it's the heart of the whole machine.
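The limitation is easy to see in a sketch. The Python below assumes a hypothetical word-level vocabulary and made-up 2D vectors (real tokenizers and tables differ); it only shows that a plain lookup cannot tell the two uses of "bank" apart.

```python
# Hypothetical word-level vocabulary and made-up 2D vectors.
vocab = {"river": 0, "bank": 1, "account": 2}
table = [[0.2, 0.9], [0.5, 0.5], [0.8, -0.1]]

def embed(words):
    """Look up each word's vector; context and position play no part."""
    return [table[vocab[w]] for w in words]

phrase_a = embed(["river", "bank"])
phrase_b = embed(["bank", "account"])

# "bank" is second in one phrase and first in the other,
# yet the lookup returns the identical vector both times.
print(phrase_a[1] == phrase_b[0])   # True
```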
Primary source: "But what is a GPT?" by 3Blue1Brown — the first chapter of his transformer series. The embedding section animates today's lesson in real learned spaces (including the king/queen and Mussolini demos), and the final minutes preview attention, which is exactly where we go next.