This is the lesson the modern AI era is named after — the 2017 paper that introduced the transformer is titled "Attention Is All You Need". Lesson 14's bigram embarrassed itself because it remembered exactly one character. Lesson 15 gave every token a meaning-vector, but each vector stood alone — "cat" in fluffy blue cat knew nothing about fluffy or blue. Attention fixes both at once: it lets every token look back at all the tokens before it, decide which ones matter, and absorb their meaning into its own vector. And the machinery is — you can guess by now — dot products and weighted sums. Lesson 1, wearing a crown.
To make "asks", "answers", and "offers" precise, every token's embedding is turned into three small vectors (by three learned weight matrices — ordinary lesson-3 layers, trained like everything else):
q — the query: what am I looking for? (A noun might be looking for its adjectives.)
k — the key: what am I? (An adjective advertises: "I'm an adjective!")
v — the value: what do I hand over if you attend to me? (Fluffy hands over fluffiness.)
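A minimal sketch of that projection step, assuming a made-up 3-D embedding and made-up 3×2 weight matrices. The names `embedding`, `W_q`, `W_k`, `W_v`, `project` and every number are illustrative, not taken from a trained model, and the sketch leaves out any bias term. Each of q, k, and v is the same embedding multiplied by its own matrix.

```python
# Hypothetical numbers throughout: a real model learns these matrices.
embedding = [1.0, 0.5, -1.0]          # one token's 3-D embedding (made up)

# Three separate weight matrices, each mapping 3 dimensions down to 2.
W_q = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
W_k = [[0.5, 0.0], [0.0, 1.0], [0.0, 1.0]]
W_v = [[2.0, 1.0], [0.0, 0.0], [1.0, 1.0]]

def project(x, W):
    """Row vector times matrix: one dot product per output column."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

q = project(embedding, W_q)   # what this token is looking for
k = project(embedding, W_k)   # what this token advertises
v = project(embedding, W_v)   # what this token hands over

print("q =", q)
print("k =", k)
print("v =", v)
```

The same embedding yields three different vectors because the three matrices differ; training adjusts the matrices, not the recipe.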
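Then three steps, all familiar:
1. Score: dot the token's query q with the key k of every token it can see, and divide by √dₖ.
2. Weigh: pass the scores through softmax to get weights that sum to 1.
3. Mix: add up the visible tokens' values v, each multiplied by its weight.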
One new tool, softmax: it exponentiates each score and divides by the sum, which — in 3Blue1Brown's words — "turns an arbitrary list of numbers into a valid distribution, in such a way that the largest values end up closest to 1, and the smaller values end up closer to 0". Big score → big share of the mix. (It's also exactly how lesson 14's "scores → probabilities" step works in real LLMs.)
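To see softmax behave as described, here is a small sketch with arbitrary example scores (the numbers are made up for illustration). It checks that the outputs sum to 1 and that larger, more spread-out scores push almost all the weight onto the biggest one.

```python
import math

def softmax(xs):
    """Exponentiate each score, then divide by the total so the results sum to 1."""
    es = [math.exp(x) for x in xs]
    total = sum(es)
    return [e / total for e in es]

scores = [2.0, 1.0, 0.0]                # arbitrary example scores
weights = softmax(scores)
print([round(w, 3) for w in weights])   # largest score gets the largest share
print(round(sum(weights), 6))           # the shares add up to 1

# Scaling the same scores up pushes nearly all the weight onto the largest one.
sharp = softmax([20.0, 10.0, 0.0])
print([round(w, 3) for w in sharp])
```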
Three tokens — fluffy blue cat — with hand-set q, k, v vectors (2-D, so you can check everything; in the real model they'd be learned). The design: fluffy and blue carry the key "I'm an adjective" = [1, 0]; cat's query "I seek adjectives" = [2, 0]; values carry what each donates. Watch cat's three steps, every number verifiable:
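Cat's three steps can be reproduced in a few lines. This sketch uses the hand-set vectors from the program in this lesson and assumes the same √dₖ scaling with dₖ = 2; the variable names are illustrative.

```python
import math

# Hand-set vectors from the lesson (cat can see all three tokens).
q_cat = [2, 0]
keys = {"fluffy": [1, 0], "blue": [1, 0], "cat": [0, 1]}
values = {"fluffy": [3, 0], "blue": [0, 3], "cat": [1, 1]}
d_k = 2

# Step 1: score = query . key, scaled by sqrt(d_k)
scores = {t: (q_cat[0] * k[0] + q_cat[1] * k[1]) / math.sqrt(d_k)
          for t, k in keys.items()}

# Step 2: softmax turns scores into weights that sum to 1
total = sum(math.exp(s) for s in scores.values())
weights = {t: math.exp(s) / total for t, s in scores.items()}

# Step 3: new vector = weighted sum of the values
new_cat = [sum(weights[t] * values[t][d] for t in keys) for d in range(2)]

for t in keys:
    print(f"{t:6s} score {scores[t]:.3f}  weight {weights[t]:.3f}")
print(f"new cat vector: [{new_cat[0]:.3f}, {new_cat[1]:.3f}]")
```

The two adjectives score 2/√2 ≈ 1.414 each and cat scores 0 against its own key, which softmax turns into weights of about 0.446, 0.446, and 0.108.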
Read step 3 slowly, because it's the entire point of Phase 5 so far: the vector called "cat" now contains 45% fluffiness and 45% blueness. It is no longer the dictionary vector for cat — it is the vector for this particular fluffy blue cat. Context has been mixed into meaning. That's what attention is for.
The program computes attention for all three tokens, and shows one more rule: each token may only attend to itself and earlier tokens. Why? Remember the job (lesson 14): predict the next token. If tokens could peek forward during training, the answer would be visible in the input — the exam would contain the answer key. Blocking the peek is called causal masking: attention weights to later positions are forced to zero (Wikipedia).
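One common way to force those weights to zero, shown here as a sketch rather than as this lesson's method (the program below simply slices off the later tokens), is to replace the scores of later positions with negative infinity before softmax. The helper `causal_mask` and the raw scores are hypothetical.

```python
import math

def softmax(xs):
    es = [math.exp(x) for x in xs]    # math.exp(-inf) is 0.0
    total = sum(es)
    return [e / total for e in es]

def causal_mask(scores, position):
    """Replace scores for positions after `position` with -inf."""
    return [s if j <= position else -math.inf for j, s in enumerate(scores)]

# Hypothetical raw scores for the middle token (position 1) of a 3-token sequence.
raw = [0.5, 0.5, 4.0]                 # position 2 would dominate if it were visible
masked = causal_mask(raw, position=1)
weights = softmax(masked)
print(masked)    # [0.5, 0.5, -inf]
print(weights)   # [0.5, 0.5, 0.0]
```

Because exp(-inf) is exactly 0, the hidden position contributes nothing to the sum and receives exactly zero weight, no matter how large its raw score was.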
import Foundation

// three tokens; q = what I'm looking for, k = what I am, v = what I give
let tokens = ["fluffy", "blue", "cat"]
let q = ["fluffy": [0.0, 1.0], "blue": [0.0, 1.0], "cat": [2.0, 0.0]] // cat seeks adjectives
let k = ["fluffy": [1.0, 0.0], "blue": [1.0, 0.0], "cat": [0.0, 1.0]] // fluffy/blue ARE adjectives
let v = ["fluffy": [3.0, 0.0], "blue": [0.0, 3.0], "cat": [1.0, 1.0]] // what each token hands over

func dot(_ u: [Double], _ w: [Double]) -> Double { zip(u, w).map(*).reduce(0, +) }

func softmax(_ xs: [Double]) -> [Double] {
    let es = xs.map(exp)
    let s = es.reduce(0, +)
    return es.map { $0 / s }
}

let dK = 2.0 // our key vectors have 2 dimensions
for (i, t) in tokens.enumerated() {
    let visible = Array(tokens[0...i]) // causal mask: only look back
    let scores = visible.map { dot(q[t]!, k[$0]!) / sqrt(dK) }
    let weights = softmax(scores)
    var new = [0.0, 0.0]
    for (w, u) in zip(weights, visible) {
        new[0] += w * v[u]![0]; new[1] += w * v[u]![1]
    }
    let pad = String(repeating: " ", count: 6 - t.count)
    let wtxt = zip(visible, weights).map { "\($0) " + String(format: "%.3f", $1) }
        .joined(separator: ", ")
    print("\(t)\(pad) attends to: \(wtxt)")
    print(" new vector: [" + String(format: "%.3f", new[0]) + ", "
        + String(format: "%.3f", new[1]) + "]")
}
import math

# three tokens; q = what I'm looking for, k = what I am, v = what I give
tokens = ["fluffy", "blue", "cat"]
q = {"fluffy": [0, 1], "blue": [0, 1], "cat": [2, 0]}  # cat seeks adjectives
k = {"fluffy": [1, 0], "blue": [1, 0], "cat": [0, 1]}  # fluffy/blue ARE adjectives
v = {"fluffy": [3, 0], "blue": [0, 3], "cat": [1, 1]}  # what each token hands over

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def softmax(xs):
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

d_k = 2  # our key vectors have 2 dimensions
for i, t in enumerate(tokens):
    visible = tokens[:i + 1]  # causal mask: only look back
    scores = [dot(q[t], k[u]) / math.sqrt(d_k) for u in visible]
    weights = softmax(scores)
    new = [sum(w * v[u][d] for w, u in zip(weights, visible)) for d in range(2)]
    wtxt = ", ".join(f"{u} {w:.3f}" for u, w in zip(visible, weights))
    print(f"{t:6s} attends to: {wtxt}")
    print(f" new vector: [{new[0]:.3f}, {new[1]:.3f}]")
Both print exactly this:
fluffy attends to: fluffy 1.000
 new vector: [3.000, 0.000]
blue   attends to: fluffy 0.500, blue 0.500
 new vector: [1.500, 1.500]
cat    attends to: fluffy 0.446, blue 0.446, cat 0.108
 new vector: [1.446, 1.446]
Three rows, three little dramas. fluffy is first — the mask leaves it nothing to look at but itself, weight 1.000. blue can see fluffy and itself; its query matches both keys equally, so 0.500 each. And cat splits its attention 45/45/11 — the noun drinks in its adjectives, exactly the hand computation above.
In the paper and everywhere else, the three steps are written as one line (Wikipedia: Attention):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Symbol by symbol, with nothing new in it:
Q, K, V: all tokens' queries, keys, and values stacked into matrices, one row per token (lesson 3's stacking move).
QKᵀ: every query dotted with every key in one matrix multiplication, giving the full table of scores for all pairs at once (the ᵀ is the same transpose bookkeeping as lesson 11's xAᵀ).
/√dₖ: divide by the square root of the key dimension dₖ, our √2. In big models dot products of long vectors get large, and this rescaling keeps softmax from saturating, which is lesson 5's flat-ends problem in new clothing (Wikipedia: the scaling "prevents excessive variance" in the scores).
softmax: scores to weights, row by row.
·V: weighted sums of the values, all tokens at once.
One line of matrix algebra; you did it with school arithmetic.
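The one-line formula can be checked against the lesson's numbers. This is a sketch in plain Python, with hypothetical helpers `matmul` and `transpose` standing in for a matrix library; it adds the causal mask from earlier in the lesson, which the bare formula does not show, by setting later positions to negative infinity before softmax.

```python
import math

# Rows are fluffy, blue, cat: the lesson's hand-set vectors stacked into matrices.
Q = [[0, 1], [0, 1], [2, 0]]
K = [[1, 0], [1, 0], [0, 1]]
V = [[3, 0], [0, 3], [1, 1]]
d_k = 2

def matmul(A, B):
    """Plain matrix product: each entry is a row of A dotted with a column of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(xs):
    es = [math.exp(x) for x in xs]
    total = sum(es)
    return [e / total for e in es]

# QK^T / sqrt(d_k): the full table of scores, one row per query.
scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]

# Causal mask: row i may only see columns 0..i.
masked = [[s if j <= i else -math.inf for j, s in enumerate(row)]
          for i, row in enumerate(scores)]

weights = [softmax(row) for row in masked]   # softmax row by row
output = matmul(weights, V)                  # weighted sums of the values

for row in output:
    print([round(x, 3) for x in row])
```

The three printed rows agree with the new vectors the token-by-token program reports: [3.0, 0.0], [1.5, 1.5], and [1.446, 1.446].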
Two properties made attention conquer everything. Reach: the weighted sum connects any two positions directly. Token 1,000 can pull from token 3 as easily as from token 999; the bigram's one-step memory and its successors' fading memories are gone. Learnability: the three matrices that produce Q, K, and V are ordinary weights, trained by your lesson-9 loop, so the model learns what to look for rather than being told. Pronoun resolution ("…the cat, because it was fluffy"), matching brackets in code, calling back to a variable defined 200 lines up: all become "learn a query that matches that key". That's why your coding assistant can use the function signature from the top of your file: attention is literally how it looks it up.
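The reach claim can be illustrated with a toy sketch. Everything here is hypothetical: a 1,000-token sequence in which exactly one key matches the last token's query, a made-up query of [20, 0], and no positional information at all. Under those assumptions the weight on the matching token comes out the same whether it sits near the start or right next door.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract the max for numerical safety
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def last_token_weights(match_position, n=1000, d_k=2):
    """Attention weights of the final token when one key matches its query."""
    query = [20.0, 0.0]                           # hypothetical: strongly seeks [1, 0] keys
    keys = [[0.0, 1.0]] * n                       # every key mismatches...
    keys[match_position] = [1.0, 0.0]             # ...except one
    scores = [(query[0] * k[0] + query[1] * k[1]) / math.sqrt(d_k) for k in keys]
    return softmax(scores)

far = last_token_weights(match_position=2)        # token 3, far back in the sequence
near = last_token_weights(match_position=998)     # token 999, right next door
print(f"weight on far match:  {far[2]:.4f}")
print(f"weight on near match: {near[998]:.4f}")
```

The score depends only on how well the query matches the key, not on how many tokens lie between them, so distance alone costs nothing in this toy setup.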
Primary source: "Attention in transformers, step-by-step" by 3Blue1Brown — today's exact mechanics (queries, keys, values, the adjective-updates-noun example — his phrase is "a fluffy blue creature" — masking, heads) animated in high resolution. After today's hand computation you'll watch it like a film of a book you've read.