AI Foundations · Lesson 13 · Phase 5 — Under the hood of GPT

Text Becomes Numbers: Tokens

Welcome to Phase 5. You know how a network computes (lessons 1–5), how it learns (6–9), and you've trained your own (10–12). Now we open the machine you actually came for. A large language model, in 3Blue1Brown's definition, is "a sophisticated mathematical function that predicts what word comes next for any piece of text" — and over the next six lessons you'll see that this function is built from exactly the parts you already own. Today: the very first step. Networks eat numbers — lesson 12 fed yours pixel brightnesses — so before GPT can predict anything, text must become numbers. That conversion is tokenization, and it quietly explains some of the strangest things LLMs do.

The map for Phase 5

Six lessons, one machine: text becomes tokens (today) → a model learns to predict the next token (14) → tokens become meaning-vectors called embeddings (15) → tokens look at each other with attention (16) → it all stacks into the transformer (17) → and training at scale plus a few finishing tricks turns that into ChatGPT and your coding assistant (18). Every step reuses lessons 1–12. Nothing you learned was a detour.

A vocabulary, a token, an ID

Three words, all of them honest:

A vocabulary is the fixed, finite list of all text pieces the model knows — like an alphabet, except the "letters" can be any chunks we choose. A token is one piece from that list. A token ID is simply its position in the list — and that integer is what the network actually receives. Not letters, not sounds, not words: positions in a list.

text → [chop into tokens] → token IDs → the network
"to be" → [t, o, ␣, b, e] → [6, 4, 0, 1, 2]

Where did 6, 4, 0, 1, 2 come from? From the smallest possible tokenizer, one we can build completely, right now, in about ten lines.

Build a tokenizer in 10 lines

Code first, as always. Take a tiny "training text", collect every distinct character, sort them — that's the vocabulary. Encoding = look up each character's position; decoding = look positions back up. Run either version and you'll get exactly the outputs shown in the comments:

Swift:

let text = "to be or not to be"

let vocab = Array(Set(text)).sorted()       // every distinct character, sorted
var stoi: [Character: Int] = [:]            // "string to int" lookup
for (i, ch) in vocab.enumerated() { stoi[ch] = i }

func encode(_ s: String) -> [Int] { s.map { stoi[$0]! } }
func decode(_ ids: [Int]) -> String { String(ids.map { vocab[$0] }) }

print("vocabulary:", vocab)   // [" ", "b", "e", "n", "o", "r", "t"]
print("size:", vocab.count)   // 7
print(encode("to be"))        // [6, 4, 0, 1, 2]
print(decode(encode("to be"))) // to be (perfect round trip)

Python:

text = "to be or not to be"

vocab = sorted(set(text))                   # every distinct character, sorted
stoi = {ch: i for i, ch in enumerate(vocab)}  # "string to int" lookup
itos = {i: ch for ch, i in stoi.items()}

def encode(s):   return [stoi[c] for c in s]
def decode(ids): return "".join(itos[i] for i in ids)

print("vocabulary:", vocab)     # [' ', 'b', 'e', 'n', 'o', 'r', 't']
print("size:", len(vocab))      # 7
print(encode("to be"))          # [6, 4, 0, 1, 2]
print(decode(encode("to be")))  # to be  — perfect round trip

Read the vocabulary: space gets ID 0, b gets 1, e gets 2, … t gets 6. So "to be" becomes [6, 4, 0, 1, 2] — and decoding turns it back, losslessly. You just built a real tokenizer. Every tokenizer in every LLM is this same idea with a bigger, smarter vocabulary.

Feel it: tokenize live

The widget below runs the exact tokenizer you just built — same 7-character vocabulary, learned from "to be or not to be". Type and watch the chips. Then try typing an x or a capital T…

The red ? chip that appears for x or T is the most important thing on this page. A tokenizer can only emit IDs from its vocabulary; anything else simply has no number. Our toy vocabulary is 7 characters, so almost everything is unknown. The fix is obvious: a bigger vocabulary. The interesting question is: bigger how?
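
To see the ? in code rather than in the widget, here is a minimal Python sketch that rebuilds the same 7-character vocabulary. The name encode_safe and the choice of None as the "unknown" marker are illustrative assumptions, not part of the lesson's tokenizer, and the widget may handle unknowns differently.

```python
text = "to be or not to be"
vocab = sorted(set(text))                      # [' ', 'b', 'e', 'n', 'o', 'r', 't']
stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> ID

UNK = None  # hypothetical marker meaning "this character has no ID"

def encode_safe(s):
    # dict.get returns UNK instead of raising KeyError for unseen characters
    return [stoi.get(c, UNK) for c in s]

print(encode_safe("to be"))    # [6, 4, 0, 1, 2]
print(encode_safe("To box"))   # [None, 4, 0, 1, 4, None]
```

Capital T and x come back as None because they never appeared in the training text; the plain encode from earlier would raise a KeyError on them instead.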

Three ways to chop text

Same pattern as the formulas in lesson 1 — three zoom levels, from simplest to what the pros ship:

① Characters — small and safe, but slow

What we just built. The vocabulary stays tiny, and once it covers every character the text can contain, nothing is ever unknown (our 7-character toy fell short only because it was learned from a single sentence). The price: "to be or not to be" costs 18 tokens, and the model must reassemble meaning letter by letter. Sequences get long, and, as you'll feel in lesson 16, attention pays for length.

② Words — short and meaningful, but brittle

One token per word: "to be" is 2 tokens, meaning comes pre-assembled. But the vocabulary explodes (every word, every name, every typo needs its own entry), and the first word not in the list — Tokenizer3000 — is a red ? chip again. Real text always contains words nobody put in a list.
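
A rough sketch of the word-level trade-off, assuming the simplest possible rule: split on whitespace. The names word_vocab, wtoi and encode_words are made up for this illustration, and None again stands in for the red ? chip.

```python
text = "to be or not to be"

words = text.split()                             # split on whitespace
word_vocab = sorted(set(words))                  # ['be', 'not', 'or', 'to']
wtoi = {w: i for i, w in enumerate(word_vocab)}  # word -> ID

def encode_words(s):
    # None marks a word that is missing from the vocabulary
    return [wtoi.get(w) for w in s.split()]

print(len(text))                             # 18 (one token per character)
print(encode_words(text))                    # [3, 0, 2, 1, 3, 0] (6 tokens)
print(encode_words("to be Tokenizer3000"))   # [3, 0, None]
```

Same sentence, a third of the tokens, and one unseen word is enough to break it. Splitting on whitespace also throws the spaces away, so this version cannot round-trip text the way the character tokenizer did.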

③ Subwords — the pro way: byte pair encoding

The compromise every modern LLM uses: tokens are frequent chunks — whole common words, pieces of rare ones. The algorithm that picks the chunks is byte pair encoding (BPE), and it's beautifully dumb: start from tiny pieces, then repeatedly — in Wikipedia's words — "the most frequent pair of adjacent tokens is merged into a new, longer n-gram and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of prescribed size is obtained." Frequent words end up as single tokens; a rare word like tokenization splits into a few familiar chunks (token + ization); a truly alien word falls back to characters. Nothing is ever unknown, sequences stay short.
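
The merge loop described above fits in a few lines of Python. This is a toy sketch under stated assumptions: it starts from characters rather than bytes, breaks ties by whichever pair appears first, and skips the pre-splitting rules real tokenizers add. The helper names most_frequent_pair and merge are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent pairs; ties go to the pair encountered first
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0]          # (pair, count)

def merge(tokens, pair):
    # Replace every non-overlapping occurrence of `pair` with one joined token
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("to be or not to be")
print(len(tokens))                          # 18
for _ in range(4):                          # four merges, chosen for illustration
    pair, count = most_frequent_pair(tokens)
    tokens = merge(tokens, pair)
    print(pair, count, len(tokens))
print(tokens)  # ['to be', ' ', 'o', 'r', ' ', 'n', 'o', 't', ' ', 'to be']
```

Four merges take the sentence from 18 tokens to 10, because the repeated chunk "to be" collapses into a single token. After that every remaining pair occurs only once, so each further merge saves just one token.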

BPE on our text: t·o → to (appears 2×, tied with o+␣, ␣+b and b+e) → next merge: to+␣? b+e? Always the most frequent pair
"to be or not to be" as characters: 18 tokens · after a few merges: ~8 · as words: 6

Two verified facts about the real thing. GPT-style tokenizers run BPE on raw bytes: the text is "converted into UTF-8 first, and treated as a stream of bytes" (Wikipedia), which guarantees any text in any language can be encoded. And the vocabulary they stop at is big but finite: for GPT-3.5 and GPT-4 it is roughly 100,000 tokens (the 256 single-byte tokens plus about 100,000 BPE merges, with a handful of special tokens on top). Compare: our toy had 7.
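
The byte trick is easy to try yourself. A small sketch using only Python's built-in UTF-8 encoding (to_bytes is a hypothetical helper name): every string, in any script, becomes a list of integers between 0 and 255, which is the starting alphabet a byte-level BPE merges from.

```python
def to_bytes(s):
    # UTF-8 encode, then view each byte as an integer in 0..255
    return list(s.encode("utf-8"))

print(to_bytes("to be"))          # [116, 111, 32, 98, 101]
print(to_bytes("é"))              # [195, 169] (one character, two bytes)
print(len(to_bytes("日本語")))     # 9 (three bytes per character)

# Round trip: the bytes decode back to the original text
print(bytes(to_bytes("naïve 日本語")).decode("utf-8"))
```

With only 256 possible starting symbols, no input can ever fall outside the vocabulary, which is why the red ? chip never shows up in a byte-level tokenizer.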

Why LLMs are weird about letters. Remember what the network receives: token IDs, not characters. If strawberry reaches the model as one or two chunky tokens, the model never directly sees the letters inside — asking it to count the r's is like asking you how many times the digit 7 appears in a phone number you only ever heard as a melody. It can often still answer from memorized spelling knowledge, but the information isn't in front of it the way it's in front of you. Tokenization is also why models can stumble on rhymes, anagrams, and arithmetic on long numbers (digits get chunked unevenly). When an LLM does something inexplicably dumb with the insides of words — think tokens first.
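
To make that concrete, here is a sketch with a made-up subword vocabulary. Both the vocabulary and the greedy longest-match lookup are simplifying assumptions for illustration; real tokenizers apply learned BPE merges and will split these strings differently.

```python
# Hypothetical subword vocabulary, invented for this example
vocab = ["straw", "berry", "123", "456", "7"]
stoi = {tok: i for i, tok in enumerate(vocab)}

def greedy_encode(s):
    # Greedy longest-match lookup: a stand-in for real BPE, not the same thing
    ids, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):       # try the longest piece first
            if s[i:j] in stoi:
                ids.append(stoi[s[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {s[i]!r}")
    return ids

print(greedy_encode("strawberry"))   # [0, 1]
print(greedy_encode("1234567"))      # [2, 3, 4]
```

The network would see [0, 1], with no r anywhere in sight, and a seven-digit number arrives as uneven chunks of three, three and one digits.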

Check yourself

No peeking back. Pull it from memory.

1. What does the network actually receive as its input?

2. Why did modern LLMs settle on subword tokens (BPE)?

3. What does BPE repeatedly merge while building its vocabulary?

Watch this next

Primary source: "Large language models explained briefly" by 3Blue1Brown, a short aerial view of everything Phase 5 will build, from tokens to attention to the giant training runs. Watch it now as a trailer; by lesson 18 every sentence in it will be something you can explain. For tokenization depth, Wikipedia's byte pair encoding article covers the exact merge algorithm with examples.