AI Foundations · Lesson 15 · Phase 5 — Under the hood of GPT

Embeddings: Words as Vectors

Lesson 13 left a problem hiding in plain sight. Token IDs are arbitrary: in our toy vocabulary e was 2 and n was 3, but e isn't "one less" than n in any meaningful way; the numbers are just list positions. A network fed raw IDs would treat those positions as magnitudes and try to learn from an ordering that carries no information. So every LLM's first real move is to trade each ID for something with meaning: a learned vector (a list of numbers, exactly lesson 3's kind) called an embedding. And meaning, it turns out, becomes geometry: similar words sit near each other, and directions in the space stand for concepts.

One lookup, that's the whole mechanism

The machinery could not be smaller. The model owns a big table — the embedding matrix — with one learned vector per vocabulary token. In 3Blue1Brown's description of GPT: "the first matrix of the transformer, known as the embedding matrix, will have one column for each of these words. These columns determine what vector each word turns into." (One vector per token — whether they're stored as rows or columns is a bookkeeping convention, like lesson 11's xAᵀ.) Then:

token ID 6  →  take vector number 6 from the table  →  [0.31, −1.20, 0.07, …]

That's it — an array lookup. The interesting part is where the table's numbers come from: nobody writes them. They start random (lesson 10's initialization) and are nudged by the same gradient-descent loop as every other weight, because the embedding matrix is just more trainable parameters. The meanings below emerge from training on next-token prediction. For scale: in GPT-3, each of these vectors has 12,288 dimensions (3Blue1Brown); our demo today has 2, so we can draw it.
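The lookup described above can be sketched in a few lines. This is a minimal illustration, not how any real model stores its table: the 8-token, 3-dimensional `embeddingTable` and the helper `embed` are made up for the example, and every number except the first three entries of row 6 (taken from the line above) is arbitrary.

```swift
// A toy embedding table: one row per token ID.
// Assumption: 8 tokens, 3 dimensions, values invented for illustration.
// In a real model these numbers start random and are learned by gradient descent.
let embeddingTable: [[Double]] = [
    [ 0.12,  0.85, -0.40],   // token ID 0
    [-0.77,  0.10,  0.33],   // token ID 1
    [ 0.05, -0.62,  0.91],   // token ID 2
    [ 0.48,  0.29, -0.15],   // token ID 3
    [-0.30, -0.44,  0.60],   // token ID 4
    [ 0.95,  0.02, -0.81],   // token ID 5
    [ 0.31, -1.20,  0.07],   // token ID 6
    [-0.58,  0.73,  0.26],   // token ID 7
]

// The whole embedding step: use the token ID as an index into the table.
func embed(_ tokenID: Int) -> [Double] {
    embeddingTable[tokenID]
}

// A sequence of token IDs becomes a sequence of vectors, one lookup each.
let tokenIDs = [6, 2, 3]
let vectors = tokenIDs.map(embed)
for (id, v) in zip(tokenIDs, vectors) {
    print("token \(id) -> \(v)")
}
```

Nothing is computed here; the ID only selects a row. Training changes the numbers inside the rows, never the lookup itself.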

Meaning becomes geometry

Why does training produce meaningful vectors? Because words used in similar contexts get pushed toward similar predictions, and the cheapest way to make similar predictions is to have similar vectors. This was made famous by word2vec (2013), which "represents a word as a high-dimension vector of numbers which capture relationships between words" so that "words which appear in similar contexts are mapped to vectors which are nearby" (Wikipedia). Two consequences, both measurable:

Nearness = similarity. And "near" is measured with a tool you've owned since lesson 1 — the dot product. Two vectors pointing the same way → big positive dot product; unrelated → near zero; opposite → negative. Usually it's normalized into cosine similarity, which ranges from −1 to +1:

cos(u, v) = (u·v) ⁄ (‖u‖·‖v‖)

Symbol by symbol: u·v is the lesson-1 dot product; ‖u‖ is the vector's length (lesson 6's double bars — the square root of the vector dotted with itself); dividing by both lengths removes "loudness" so only direction counts. +1 means same direction, 0 unrelated, −1 opposite.
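To see what "removes loudness" means in numbers, here is a small sketch meant to run as its own file. The vectors `u`, `louder`, `sideways`, and `opposite` are made-up examples, chosen so the lengths come out as whole numbers.

```swift
import Foundation

// dot product: multiply matching components, then sum
func dot(_ u: [Double], _ v: [Double]) -> Double { zip(u, v).map(*).reduce(0, +) }
// length: square root of the vector dotted with itself
func length(_ u: [Double]) -> Double { sqrt(dot(u, u)) }
// cosine similarity: dot product divided by both lengths
func cosine(_ u: [Double], _ v: [Double]) -> Double { dot(u, v) / (length(u) * length(v)) }

let u        = [ 3.0,  4.0]    // length 5
let louder   = [30.0, 40.0]    // same direction as u, ten times longer
let sideways = [-4.0,  3.0]    // perpendicular to u
let opposite = [-3.0, -4.0]    // u flipped

// The raw dot product grows with length...
print(dot(u, u), dot(u, louder))          // 25.0 250.0
// ...but cosine sees only direction.
print(cosine(u, u), cosine(u, louder))    // 1.0 1.0
print(cosine(u, sideways))                // 0.0
print(cosine(u, opposite))                // -1.0
```

The dot product of `u` with `louder` is ten times bigger than with itself, yet both cosines are 1: same direction, so same score.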

Directions = concepts. The space organizes itself so that, e.g., the displacement from man to woman is roughly the same arrow wherever you apply it. Word2vec's famous party trick (Wikipedia): "the vector representation of 'Brother' − 'Man' + 'Woman' produces a result which is closest to the vector representation of 'Sister'". 3Blue1Brown shows the same with royalty — the difference between woman and man "is quite similar to the difference between king and queen" — and even Italy − Germany + Hitler landing near Mussolini. Arithmetic on meaning.

Build it: a 2-D meaning space you can check by hand

Real embeddings are learned; ours today are hand-made so every number is checkable. Two axes: masculine↔feminine, and royal↕ordinary. Five words. Watch the dot products sort friends from strangers, and watch the famous arithmetic land exactly on queen:

import Foundation

// a hand-made 2D embedding space:
// axis 1: masculine +1 ... feminine -1    axis 2: royal +2 ... ordinary 0 ... object -2
let emb: [(String, [Double])] = [
    ("king",  [ 1.0,  2.0]),
    ("queen", [-1.0,  2.0]),
    ("man",   [ 1.0,  0.0]),
    ("woman", [-1.0,  0.0]),
    ("apple", [ 0.0, -2.0]),
]
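// look up a word's vector (crashes on a word that isn't in emb; fine for a demo)
func vec(_ w: String) -> [Double] { emb.first { $0.0 == w }!.1 }

// dot product: multiply matching components, then sum
func dot(_ u: [Double], _ v: [Double]) -> Double { zip(u, v).map(*).reduce(0, +) }
// length: square root of the vector dotted with itself
func length(_ u: [Double]) -> Double { sqrt(dot(u, u)) }
// cosine similarity: dot product divided by both lengths, so only direction counts
func cosine(_ u: [Double], _ v: [Double]) -> Double { dot(u, v) / (length(u) * length(v)) }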

print("how similar is 'king' to ...")
for (w, v) in emb {
    let pad = String(repeating: " ", count: 6 - w.count)
    print("  \(w)\(pad)  dot = " + String(format: "%+5.1f", dot(vec("king"), v))
          + "   cosine = " + String(format: "%+.3f", cosine(vec("king"), v)))
}

// the famous arithmetic: king - man + woman = ?
let t = zip(zip(vec("king"), vec("man")), vec("woman")).map { $0.0 - $0.1 + $1 }
print("\nking - man + woman = [\(t[0]), \(t[1])]")
let best = emb.max { cosine(t, $0.1) < cosine(t, $1.1) }!.0
print("nearest word: '\(best)'  (cosine " + String(format: "%.3f", cosine(t, vec(best))) + ")")

The program prints exactly this:

how similar is 'king' to ...
  king    dot =  +5.0   cosine = +1.000
  queen   dot =  +3.0   cosine = +0.600
  man     dot =  +1.0   cosine = +0.447
  woman   dot =  -1.0   cosine = -0.447
  apple   dot =  -4.0   cosine = -0.894

king - man + woman = [-1.0, 2.0]
nearest word: 'queen'  (cosine 1.000)

Check the story the numbers tell: king↔queen similar (+0.600, both royal), king↔apple near-opposite (−0.894), and king − man + woman lands on [−1, 2], which is queen's vector, cosine exactly 1.000. The subtraction removed maleness, the addition supplied femaleness, and royalty never moved. You can verify every line with paper and a square root.

Feel it: walk the meaning space

The same five vectors, drawn. Pick any A − B + C and watch the arrows do meaning-arithmetic; the nearest word by cosine lights up green.
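The same A − B + C experiment can also be run in code. The sketch below is meant to run as its own file: it repeats the lesson's five hand-made vectors and wraps the arithmetic in a helper called `analogy`, a name invented here for illustration.

```swift
import Foundation

// The lesson's five hand-made vectors, repeated so this file runs alone.
let emb: [(word: String, vector: [Double])] = [
    ("king",  [ 1.0,  2.0]),
    ("queen", [-1.0,  2.0]),
    ("man",   [ 1.0,  0.0]),
    ("woman", [-1.0,  0.0]),
    ("apple", [ 0.0, -2.0]),
]

func vec(_ w: String) -> [Double]? { emb.first { $0.word == w }?.vector }
func dot(_ u: [Double], _ v: [Double]) -> Double { zip(u, v).map(*).reduce(0, +) }
func length(_ u: [Double]) -> Double { sqrt(dot(u, u)) }
func cosine(_ u: [Double], _ v: [Double]) -> Double { dot(u, v) / (length(u) * length(v)) }

// Hypothetical helper: compute a - b + c, then return the word whose vector
// has the highest cosine with the result. Returns nil for unknown words or
// when the result is the zero vector (cosine is undefined there).
func analogy(_ a: String, _ b: String, _ c: String) -> (word: String, cosine: Double)? {
    guard let va = vec(a), let vb = vec(b), let vc = vec(c) else { return nil }
    let target = va.indices.map { va[$0] - vb[$0] + vc[$0] }
    guard length(target) > 0 else { return nil }
    guard let best = emb.max(by: { cosine(target, $0.vector) < cosine(target, $1.vector) })
    else { return nil }
    return (best.word, cosine(target, best.vector))
}

let questions = [("king", "man", "woman"), ("man", "king", "queen"), ("queen", "woman", "man")]
for (a, b, c) in questions {
    if let r = analogy(a, b, c) {
        print("\(a) - \(b) + \(c) -> \(r.word)  " + String(format: "(cosine %.3f)", r.cosine))
    }
}
```

In this rigged space the three questions land on queen, woman, and king respectively, each with cosine 1.000. Unlike many real analogy demos, this sketch does not exclude the input words from the candidates; with learned vectors that exclusion is often needed because the result tends to stay close to one of its inputs.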

Honesty corner. Our cartoon has 2 axes, so "man" and "woman" come out as opposites (cosine −1) — in a real 12,288-dimensional space they're actually close (both human, person, noun, animate…) and differ along only a few directions. More dimensions means a word can be similar to thousands of words in thousands of different ways at once. That's the real reason for 12,288 instead of 2 — meaning needs the room. Also: real spaces are learned from data, so the analogies are approximate ("closest to", not "equal to"); our hand-built one is exact only because we rigged it.
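A rough way to see why extra dimensions help: give both words one more axis on which they agree. In the sketch below, meant to run as its own file, the third "human" axis and the score of 3 on it are invented for illustration; real learned spaces do not come with labeled axes like this.

```swift
import Foundation

func dot(_ u: [Double], _ v: [Double]) -> Double { zip(u, v).map(*).reduce(0, +) }
func cosine(_ u: [Double], _ v: [Double]) -> Double {
    dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))
}

// The lesson's 2-D vectors: the gender axis is the only thing they have.
let man2D   = [ 1.0, 0.0]
let woman2D = [-1.0, 0.0]

// Hypothetical 3-D versions: the same two axes plus an invented "human" axis,
// on which both words score 3.
let man3D   = [ 1.0, 0.0, 3.0]
let woman3D = [-1.0, 0.0, 3.0]

print(String(format: "2-D cosine(man, woman) = %+.3f", cosine(man2D, woman2D)))   // -1.000
print(String(format: "3-D cosine(man, woman) = %+.3f", cosine(man3D, woman3D)))   // +0.800
```

One shared axis is enough to move the pair from exact opposites to close neighbors, while the gender difference is still there along axis 1.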

Where this sits in GPT

text → tokens (L13) → embedding lookup (today) → vectors that talk to each other (L16) → … → next-token probabilities (L14)

One missing ingredient before the talking starts: the lookup gives the same vector for "bank" in "river bank" and "bank account" — and the same vector for a word whether it's the first or the fifteenth token. So the model also mixes position information into each vector (you'll see where in lesson 17), and then lets context reshape meaning. The mechanism that does the reshaping — each token's vector looking at all the others and updating itself — is attention. That's next, and it's the heart of the whole machine.

Check yourself

No peeking back. Pull it from memory.

1. What does the embedding step actually do with a token ID?
2. Where do the numbers inside embedding vectors come from?
3. Cosine similarity of two embedding vectors measures whether they:

Watch this next

Primary source: "But what is a GPT?" by 3Blue1Brown — the first chapter of his transformer series. The embedding section animates today's lesson in real learned spaces (including the king/queen and Mussolini demos), and the final minutes preview attention, which is exactly where we go next.