AI Foundations · Lesson 16 · Phase 5 — Under the hood of GPT

Attention: Every Token Looks Back

This is the lesson the modern AI era is named after — the 2017 paper that introduced the transformer is titled "Attention Is All You Need". Lesson 14's bigram embarrassed itself because it remembered exactly one character. Lesson 15 gave every token a meaning-vector, but each vector stood alone — "cat" in fluffy blue cat knew nothing about fluffy or blue. Attention fixes both at once: it lets every token look back at all the tokens before it, decide which ones matter, and absorb their meaning into its own vector. And the machinery is — you can guess by now — dot products and weighted sums. Lesson 1, wearing a crown.

The idea in one sentence

Each token asks a question, every earlier token answers, and the token's new meaning is a weighted average of what the relevant ones offer.

To make "asks", "answers", and "offers" precise, every token's embedding is turned into three small vectors (by three learned weight matrices — ordinary lesson-3 layers, trained like everything else):

q — the query: what am I looking for? (A noun might be looking for its adjectives.)
k — the key: what am I? (An adjective advertises: "I'm an adjective!")
v — the value: what do I hand over if you attend to me? (Fluffy hands over fluffiness.)

Then three steps, all familiar:

1. score: how well does my q match each earlier token's k? — a dot product per pair
2. weigh: squash the scores into weights that sum to 1 — softmax
3. mix: new me = weighted sum of everyone's v

One new tool, softmax: it exponentiates each score and divides by the sum, which — in 3Blue1Brown's words — "turns an arbitrary list of numbers into a valid distribution, in such a way that the largest values end up closest to 1, and the smaller values end up closer to 0". Big score → big share of the mix. (It's also exactly how lesson 14's "scores → probabilities" step works in real LLMs.)
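
To see "big score → big share" in numbers, here is a minimal Python sketch of softmax using only the standard library. The max-subtraction line is a common numerical safeguard added for this sketch; it is not part of the lesson's own program, and it does not change the result.

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the total so the weights sum to 1."""
    # Subtracting the largest score first leaves the result unchanged
    # (the common factor cancels) but keeps exp() from overflowing.
    # This safeguard is an addition in this sketch, not the lesson's code.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([1.0, 2.0, 3.0])
print([round(w, 3) for w in weights])   # [0.09, 0.245, 0.665]
print(round(sum(weights), 6))           # 1.0
```

The scores differ by only 1 each, yet the largest takes about two thirds of the total: exponentiation stretches the gaps before the division shares things out.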

The worked example: cat absorbs its adjectives

Three tokens — fluffy blue cat — with hand-set q, k, v vectors (2-D, so you can check everything; in the real model they'd be learned). The design: fluffy and blue carry the key "I'm an adjective" = [1, 0]; cat's query "I seek adjectives" = [2, 0]; values carry what each donates. Watch cat's three steps, every number verifiable:

1. scores (q_cat·k, scaled): fluffy 2⁄√2 = 1.414 · blue 2⁄√2 = 1.414 · cat 0⁄√2 = 0
2. softmax → weights: fluffy 0.446 · blue 0.446 · cat 0.108
3. new cat = 0.446·[3,0] + 0.446·[0,3] + 0.108·[1,1] = [1.446, 1.446]

Read step 3 slowly, because it's the entire point of Phase 5 so far: the vector called "cat" now contains 45% fluffiness and 45% blueness. It is no longer the dictionary vector for cat — it is the vector for this particular fluffy blue cat. Context has been mixed into meaning. That's what attention is for.

Run it — and meet the causal mask

The program computes attention for all three tokens, and shows one more rule: each token may only attend to itself and earlier tokens. Why? Remember the job (lesson 14): predict the next token. If tokens could peek forward during training, the answer would be visible in the input — the exam would contain the answer key. Blocking the peek is called causal masking: attention weights to later positions are forced to zero (Wikipedia).

import Foundation

// three tokens; q = what I'm looking for, k = what I am, v = what I give
let tokens = ["fluffy", "blue", "cat"]
let q = ["fluffy": [0.0, 1.0], "blue": [0.0, 1.0], "cat": [2.0, 0.0]]  // cat seeks adjectives
let k = ["fluffy": [1.0, 0.0], "blue": [1.0, 0.0], "cat": [0.0, 1.0]]  // fluffy/blue ARE adjectives
let v = ["fluffy": [3.0, 0.0], "blue": [0.0, 3.0], "cat": [1.0, 1.0]]  // what each token hands over

func dot(_ u: [Double], _ w: [Double]) -> Double { zip(u, w).map(*).reduce(0, +) }

func softmax(_ xs: [Double]) -> [Double] {
    let es = xs.map(exp)
    let s = es.reduce(0, +)
    return es.map { $0 / s }
}

let dK = 2.0  // our key vectors have 2 dimensions

for (i, t) in tokens.enumerated() {
    let visible = Array(tokens[0...i])                 // causal mask: only look back
    let scores = visible.map { dot(q[t]!, k[$0]!) / sqrt(dK) }
    let weights = softmax(scores)
    var new = [0.0, 0.0]
    for (w, u) in zip(weights, visible) {
        new[0] += w * v[u]![0]; new[1] += w * v[u]![1]
    }
    let pad = String(repeating: " ", count: 6 - t.count)
    let wtxt = zip(visible, weights).map { "\($0) " + String(format: "%.3f", $1) }
                                    .joined(separator: ", ")
    print("\(t)\(pad) attends to: \(wtxt)")
    print("        new vector: [" + String(format: "%.3f", new[0]) + ", "
                                  + String(format: "%.3f", new[1]) + "]")
}

import math

# three tokens; q = what I'm looking for, k = what I am, v = what I give
tokens = ["fluffy", "blue", "cat"]
q = {"fluffy": [0, 1], "blue": [0, 1], "cat": [2, 0]}   # cat seeks adjectives
k = {"fluffy": [1, 0], "blue": [1, 0], "cat": [0, 1]}   # fluffy/blue ARE adjectives
v = {"fluffy": [3, 0], "blue": [0, 3], "cat": [1, 1]}   # what each token hands over

def dot(u, w): return sum(a * b for a, b in zip(u, w))

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

d_k = 2  # our key vectors have 2 dimensions

for i, t in enumerate(tokens):
    visible = tokens[:i + 1]                       # causal mask: only look back
    scores = [dot(q[t], k[u]) / math.sqrt(d_k) for u in visible]
    weights = softmax(scores)
    new = [sum(w * v[u][d] for w, u in zip(weights, visible)) for d in range(2)]
    wtxt = ", ".join(f"{u} {w:.3f}" for u, w in zip(visible, weights))
    print(f"{t:6s} attends to: {wtxt}")
    print(f"        new vector: [{new[0]:.3f}, {new[1]:.3f}]")

Both versions (Swift first, then Python) print exactly this:

fluffy attends to: fluffy 1.000
        new vector: [3.000, 0.000]
blue   attends to: fluffy 0.500, blue 0.500
        new vector: [1.500, 1.500]
cat    attends to: fluffy 0.446, blue 0.446, cat 0.108
        new vector: [1.446, 1.446]

Three rows, three little dramas. fluffy is first: the mask leaves it nothing to look at but itself, so its weight is 1.000. blue can see fluffy and itself; its query [0, 1] scores 0 against both keys [1, 0], a tie, so softmax gives 0.500 each. And cat splits its attention roughly 45/45/11 (the rounded weights 0.446, 0.446, 0.108): the noun drinks in its adjectives, exactly as in the hand computation above.
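
The program enforces the mask by slicing the token list. A hedged sketch of an equivalent route, often used when the scores live in one big table, is to set the score of every later position to minus infinity before softmax, so its weight comes out as exactly zero. The snippet below does this for blue only, reusing the lesson's keys; the names raw, masked, peeking, and causal are invented for this illustration.

```python
import math

def softmax(xs):
    es = [math.exp(x) for x in xs]      # math.exp(-inf) is exactly 0.0
    s = sum(es)
    return [e / s for e in es]

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

tokens = ["fluffy", "blue", "cat"]
k = {"fluffy": [1, 0], "blue": [1, 0], "cat": [0, 1]}
q_blue = [0, 1]
i = tokens.index("blue")                # blue sits at position 1

# Scores against ALL keys, including the later token "cat".
raw = [dot(q_blue, k[u]) / math.sqrt(2) for u in tokens]
# Causal mask: positions after i get a score of minus infinity.
masked = [s if j <= i else -math.inf for j, s in enumerate(raw)]

peeking = softmax(raw)                  # what blue would do if it could look ahead
causal = softmax(masked)                # what the mask allows

print([round(w, 3) for w in peeking])   # [0.248, 0.248, 0.503]
print([round(w, 3) for w in causal])    # [0.5, 0.5, 0.0]
```

Without the mask, blue would put half its attention on cat, a token that comes after it. With the mask, the weights match the "blue" row printed above.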

The official formula — you already computed it

In the paper and everywhere else, the three steps are written as one line (Wikipedia: Attention):

Attention(Q, K, V) = softmax( QKᵀ ⁄ √d_k ) · V

Symbol by symbol, with nothing new in it:

Q, K, V: all tokens' queries, keys, and values stacked into matrices, one row per token (lesson 3's stacking move).
QKᵀ: every query dotted with every key in one matrix multiplication, giving the full table of scores, all pairs at once (the ᵀ is the same transpose bookkeeping as lesson 11's xAᵀ).
⁄√d_k: divide by the square root of d_k, the number of dimensions in each key vector (our √2). In big models, dot products of long vectors get large, and this rescaling keeps softmax from saturating: lesson 5's flat-ends problem in new clothing (Wikipedia: the scaling "prevents excessive variance" in the scores).
softmax: scores to weights, row by row.
·V: weighted sums of the values, all tokens at once.

One line of matrix algebra; you did it with school arithmetic.
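
As a check that the one-line formula really is the same computation, here is a small Python sketch that stacks the lesson's q, k, v vectors into matrices (rows in the order fluffy, blue, cat) and evaluates it with plain lists. Two assumptions to flag: the helpers matmul, transpose, and attention are written for this sketch rather than taken from a library, and the causal mask is added by hand, since the formula as written does not show it.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    d_k = len(K[0])                                   # dimensions per key vector
    scores = matmul(Q, transpose(K))                  # QKᵀ: every query dotted with every key
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Causal mask (not in the one-line formula): row i may not see columns j > i.
    masked = [[s if j <= i else -math.inf for j, s in enumerate(row)]
              for i, row in enumerate(scaled)]
    weights = [softmax(row) for row in masked]        # softmax, row by row
    return matmul(weights, V)                         # weighted sums of the values

Q = [[0, 1], [0, 1], [2, 0]]   # queries: fluffy, blue, cat
K = [[1, 0], [1, 0], [0, 1]]   # keys
V = [[3, 0], [0, 3], [1, 1]]   # values

result = attention(Q, K, V)
for row in result:
    print([round(x, 3) for x in row])
# [3.0, 0.0]
# [1.5, 1.5]
# [1.446, 1.446]
```

The three rows are the same new vectors the token-by-token program printed.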

Heads. A real transformer runs many of these attention machines side by side on the same tokens — each with its own learned Q/K/V matrices, each free to track a different relationship (one head might link adjectives to nouns, another might track quotes, another commas). They're called attention heads, their outputs are concatenated, and "multi-head attention" is just this — several lesson-16s in parallel. Nothing deeper than that.
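
A rough sketch of "several lesson-16s in parallel", under stated assumptions: head A reuses the lesson's hand-set vectors, while head B's numbers are hypothetical, invented here only to give a second head something different to do. Real transformers learn these values and also pass the concatenated result through one more learned matrix, which this sketch leaves out.

```python
import math

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def head(q, k, v):
    """One causal attention head over lists of per-token q, k, v vectors."""
    d_k = len(k[0])
    out = []
    for i in range(len(q)):
        # Only positions 0..i are visible to token i (causal mask).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

# Head A: the lesson's adjective-seeking vectors for fluffy, blue, cat.
head_a = head(q=[[0, 1], [0, 1], [2, 0]],
              k=[[1, 0], [1, 0], [0, 1]],
              v=[[3, 0], [0, 3], [1, 1]])

# Head B: hypothetical numbers. Every query and key is identical,
# so each token spreads its attention evenly over what it can see.
head_b = head(q=[[1, 0], [1, 0], [1, 0]],
              k=[[1, 0], [1, 0], [1, 0]],
              v=[[1, 0], [0, 1], [0, 0]])

# Multi-head output: concatenate the heads' vectors token by token.
multi = [a + b for a, b in zip(head_a, head_b)]
for token, vec in zip(["fluffy", "blue", "cat"], multi):
    print(token, [round(x, 3) for x in vec])
# fluffy [3.0, 0.0, 1.0, 0.0]
# blue [1.5, 1.5, 0.5, 0.5]
# cat [1.446, 1.446, 0.333, 0.333]
```

Each token ends up with a 4-number vector: the first two numbers come from head A, the last two from head B.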

Why this was the breakthrough

Two properties made attention conquer everything. Reach: the weighted sum connects any two positions directly, so token 1,000 can pull from token 3 as easily as from token 999; the bigram's one-step memory and its successors' fading memories are gone. Learnability: the Q, K, V matrices are ordinary weights, trained by your lesson-9 loop, so the model learns what to look for rather than being told. Resolving a pronoun ("…the cat, because it was fluffy"), matching brackets in code, referring back to a variable defined 200 lines up: all become "learn a query that matches that key". That's why your coding assistant can use the function signature from the top of your file: attention is literally how it looks it up.

Check yourself

No peeking back. Pull it from memory.

1. A token's attention weights are computed by matching its:

2. After attention, a token's new vector is the weighted sum of:

3. Why does the causal mask block attention to later tokens?

Watch this next

Primary source: "Attention in transformers, step-by-step" by 3Blue1Brown — today's exact mechanics (queries, keys, values, the adjective-updates-noun example — his phrase is "a fluffy blue creature" — masking, heads) animated in high resolution. After today's hand computation you'll watch it like a film of a book you've read.