AI Foundations · Lesson 4 · Phase 2 — Many neurons → a network

Stacking Layers — the Forward Pass

In lesson 3 you built one layer: a column of neurons all reading the same inputs. Today you plug the output of one layer into the input of the next, and a stack of layers is, finally, a neural network. The trip that data takes through that stack, from input to output, is called the forward pass, and it is the only thing the model you'll train at the end of this course ever does when you use it. As a bonus, stacking cracks the puzzle lesson 2 left open: XOR.

A network is layers plugged into each other

You already know everything inside a layer: each neuron computes its weighted sum z, adds its bias, and squashes through σ. The single new idea today is the hand-off: the activations coming out of layer 1 become the inputs going into layer 2. Nothing else changes — layer 2's neurons don't know or care that their inputs were produced by other neurons. As Michael Nielsen puts it, these are networks "where the output from one layer is used as input to the next layer" — called feedforward networks, because information always flows forward, never backward in a loop (Wikipedia).

Names you need, all three boring once you see them: the input layer is just your raw numbers; the output layer is the last column of neurons, whose activations are the network's answer; and every layer in between is a hidden layer. Nielsen admits the word fooled him too — "it really means nothing more than 'not an input or an output'". One more word: depth is how many layers a network has, and by convention the input layer is not counted (CS231n). Today's network has 2 inputs, one hidden layer of 2 neurons, and 1 output neuron — a "2→2→1" network, depth 2.

Feel it: flip two bits, watch XOR appear

Lesson 2 ended on a cliffhanger: a single neuron draws one straight line through its input space, and the four points of XOR ("exclusive or" — output 1 exactly when the two inputs differ) cannot be split by any single straight line. A network with a hidden layer can do it (Wikipedia). Below is a real 2→2→1 sigmoid network with the weights written on the wires and the biases under each neuron. Flip the inputs and watch the values flow left to right — each neuron shows its z and its activation a live.
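To make the cliffhanger concrete, here is a small brute-force sketch in Python. It assumes a hard threshold (the neuron "fires" when z > 0, which is the same as σ(z) > 0.5) and tries every weight and bias on an arbitrary grid from −10 to 10 in steps of 0.5. The grid and the helper name count_solutions are choices made for this illustration, not part of the lesson's code. A search like this cannot prove anything about values outside the grid, but it shows the pattern: plenty of single neurons reproduce OR, and none reproduce XOR.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

def count_solutions(table, values):
    """Count (w1, w2, b) triples whose single straight line reproduces the table."""
    hits = 0
    for w1, w2, b in product(values, repeat=3):
        # The neuron "fires" when z > 0; compare that with the target bit.
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(target)
               for (x1, x2), target in table.items()):
            hits += 1
    return hits

grid = [v / 2 for v in range(-20, 21)]   # -10.0, -9.5, ..., 10.0 (arbitrary grid)
or_hits = count_solutions(OR, grid)
xor_hits = count_solutions(XOR, grid)
print("single neurons that solve OR: ", or_hits)
print("single neurons that solve XOR:", xor_hits)   # 0: no single line works
```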

The trick, neuron by neuron. Hidden neuron h₁ (weights 20, 20, bias −10) fires when at least one input is 1 — it's an OR gate. Hidden neuron h₂ (weights −20, −20, bias +30) fires unless both inputs are 1 — a NAND gate ("not-and"). The output neuron (weights 20, 20, bias −30) fires only when both hidden neurons fire — an AND gate. And "at least one, but not both" is exactly XOR. Why the dramatic ±20s instead of gentle weights? Sigmoid saturation: you saw in lesson 1 that big z pins σ near 0 or 1. Here σ(10) ≈ 0.99995 and σ(−10) ≈ 0.00005, so each neuron's activation is a near-perfect logical bit.
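The gate reading above can be checked with a short Python sketch. It uses the weights and biases quoted in this lesson and rounds each activation to the nearest bit, so the logic is easy to see. Feeding rounded bits into the output neuron is a simplification for illustration (the real network passes the unrounded activations along), and the helper name neuron is made up for this example.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def neuron(x1, x2, w1, w2, b):
    """One sigmoid neuron with two inputs."""
    return sigmoid(w1 * x1 + w2 * x2 + b)

gate_rows = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = round(neuron(x1, x2, 20, 20, -10))     # behaves like OR
    h2 = round(neuron(x1, x2, -20, -20, 30))    # behaves like NAND
    y = round(neuron(h1, h2, 20, 20, -30))      # behaves like AND of h1, h2
    gate_rows.append((x1, x2, h1, h2, y))
    print(f"x=({x1},{x2})  OR={h1}  NAND={h2}  AND(OR,NAND)={y}")
```

The last value on each printed row is 1 exactly when the two inputs differ, which is XOR.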

All four cases, checked by hand

Here is the full forward pass for every input pair. The final activations (out: a) are exactly what the code below prints, the other columns are the intermediate values on the way there, and everything matches what the diagram shows:

x₁	x₂	h₁: z	h₁: a	h₂: z	h₂: a	out: z	out: a	≈ XOR
0	0	−10	0.00005	30	1.00000	−9.9991	0.00005	0
0	1	10	0.99995	10	0.99995	9.9982	0.99995	1
1	0	10	0.99995	10	0.99995	9.9982	0.99995	1
1	1	30	1.00000	−10	0.00005	−9.9991	0.00005	0

Notice the output's z is −9.9991, not a clean −10 — because the hidden activations feeding it are 0.00005 and 1.00000, not perfect bits. Tiny imprecision flows through the chain, and the chain shrugs it off. Also notice what just happened conceptually: each hidden neuron still draws a straight line, but the output neuron draws its line in h₁–h₂ coordinates — and seen from the original inputs, the combined boundary is no longer one straight line. That's why networks with hidden layers can "distinguish data that is not linearly separable" (Wikipedia), and it only gets richer with more neurons and depth.
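To see where a value like −9.9991 comes from, the sketch below keeps every intermediate z and a instead of only the final answer. It hard-codes this lesson's 2→2→1 weights. The function name trace is an illustrative choice, not something from the lesson's code.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def trace(x1, x2):
    """Forward pass through the 2→2→1 network, keeping every z and a."""
    z1 = 20 * x1 + 20 * x2 - 10      # h1 (OR)
    z2 = -20 * x1 - 20 * x2 + 30     # h2 (NAND)
    a1, a2 = sigmoid(z1), sigmoid(z2)
    z3 = 20 * a1 + 20 * a2 - 30      # output (AND): reads h1 and h2, not x
    return z1, a1, z2, a2, z3, sigmoid(z3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z1, a1, z2, a2, z3, y = trace(x1, x2)
    print(f"{x1} {x2} | h1 z={z1:+d} a={a1:.5f} | "
          f"h2 z={z2:+d} a={a2:.5f} | out z={z3:+.4f} a={y:.5f}")
```

The hidden z values are whole numbers because the inputs are exact bits. The output's z is not, because it is computed from activations that are only almost 0 or almost 1.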

Write it in code

The whole network is your lesson-3 layer function called twice: the second call eats the first call's result. Paste the Python into a fresh Google Colab notebook and run it; the Swift runs as-is in an Xcode playground. Both print the table's out: a column, the network's final answer for each input pair.

// Swift: runs in an Xcode playground
import Foundation

func sigmoid(_ z: Double) -> Double { 1 / (1 + exp(-z)) }

// one layer: every neuron does weighted sum + bias, then sigmoid
func layer(_ inputs: [Double], _ weights: [[Double]], _ biases: [Double]) -> [Double] {
    var out: [Double] = []
    for n in 0..<weights.count {
        var z = biases[n]
        for i in 0..<inputs.count { z += weights[n][i] * inputs[i] }
        out.append(sigmoid(z))
    }
    return out
}

let W1 = [[20.0, 20.0], [-20.0, -20.0]]   // h1 = OR, h2 = NAND
let b1 = [-10.0, 30.0]
let W2 = [[20.0, 20.0]]                   // output = AND
let b2 = [-30.0]

for x in [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] {
    let h = layer(x, W1, b1)    // layer 1: input -> hidden
    let y = layer(h, W2, b2)    // layer 2: hidden -> output
    print("\(Int(x[0])) XOR \(Int(x[1]))  ->  \(String(format: "%.5f", y[0]))")
}
// 0 XOR 0  ->  0.00005
// 0 XOR 1  ->  0.99995
// 1 XOR 0  ->  0.99995
// 1 XOR 1  ->  0.00005

# Python: runs in a Google Colab notebook
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# one layer: every neuron does weighted sum + bias, then sigmoid
def layer(inputs, weights, biases):
    out = []
    for w_row, b in zip(weights, biases):
        z = b
        for x, w in zip(inputs, w_row):
            z += w * x
        out.append(sigmoid(z))
    return out

W1 = [[20.0, 20.0], [-20.0, -20.0]]   # h1 = OR, h2 = NAND
b1 = [-10.0, 30.0]
W2 = [[20.0, 20.0]]                   # output = AND
b2 = [-30.0]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = layer(x, W1, b1)    # layer 1: input -> hidden
    y = layer(h, W2, b2)    # layer 2: hidden -> output
    print(f"{x[0]} XOR {x[1]}  ->  {y[0]:.5f}")

# 0 XOR 0  ->  0.00005
# 0 XOR 1  ->  0.99995
# 1 XOR 0  ->  0.99995
# 1 XOR 1  ->  0.00005

Look at the loop body: h = layer(x, …) then y = layer(h, …). The variable h is the entire hand-off — layer 1's output becoming layer 2's input. Two function calls. That is a neural network running.

The real formula — three ways to write it

Same drill as lesson 1: you've driven it and coded it, now name it — spelled out first, then tighter.

① The easy way — every neuron spelled out

z₁ = w₁₁·x₁ + w₁₂·x₂ + b₁ ,   h₁ = σ(z₁)
z₂ = w₂₁·x₁ + w₂₂·x₂ + b₂ ,   h₂ = σ(z₂)
z₃ = v₁·h₁ + v₂·h₂ + c ,   y = σ(z₃)

Lines 1–2 are plain lesson-1 neurons reading the inputs: w₁₂ just means "h₁'s weight for input 2", and b₁ and b₂ are the hidden biases. Line 3 is the only new thing in this whole lesson: the output neuron's inputs are h₁ and h₂, not x. (Its weights are named v and its bias c only so you can see at a glance they belong to a different layer.) If you only remember this, you understand the forward pass.

② The layer way — lesson 3's shorthand, twice

h = σ( W⁽¹⁾·x + b⁽¹⁾ )
y = σ( W⁽²⁾·h + b⁽²⁾ )

Read the symbols: x is the input vector; W⁽¹⁾ is layer 1's weight matrix (one row per hidden neuron — exactly lesson 3); b⁽¹⁾ its biases; the superscript (1) means "belongs to layer 1" — it is a label, not a power. Now watch h: it ends line one as the result and starts line two as the input. That hand-off, repeated layer after layer, is the forward pass — "one matrix multiplication followed by a bias offset and an activation function" per layer (CS231n).

③ The popular way — one rule for any depth

a^(ℓ) = σ( W^(ℓ)·a^(ℓ−1) + b^(ℓ) )

Here ℓ counts layers: a⁽⁰⁾ is defined as the input x, and the rule says "layer ℓ's activations come from layer ℓ−1's activations". Apply it for ℓ = 1, then ℓ = 2, … up to the last layer — a for loop over layers, one line per turn. Nielsen writes the same rule as a′ = σ(w·a + b): new activations from old. In code: var a = x; for l in layers { a = l.forward(a) }. A 100-layer network and our 2→2→1 toy run the exact same one-liner — only the loop count differs.
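The "one rule for any depth" idea can be sketched in Python as a loop over a list of (weights, biases) pairs. This is a minimal illustration, not a library API: the name forward and the list-of-pairs layout are assumptions made here, and the third layer added at the end is a made-up neuron that exists only to show the same loop running one more turn.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: weighted sum + bias, then sigmoid, for every neuron (row)."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Apply a = sigmoid(W·a + b) once per layer; a starts out as the input x."""
    a = x
    for W, b in layers:
        a = layer(a, W, b)
    return a

xor_net = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),   # layer 1: OR, NAND
    ([[20.0, 20.0]], [-30.0]),                         # layer 2: AND
]
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", f"{forward(x, xor_net)[0]:.5f}")

# Hypothetical third layer: one neuron that roughly copies its input bit.
deeper_net = xor_net + [([[20.0]], [-10.0])]
print(len(deeper_net), "layers:", f"{forward([0, 1], deeper_net)[0]:.5f}")
```

Nothing inside forward changes between the two networks. Only the length of the list it loops over does.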

One word for the road: inference. Training (phase 3 of this course) is how the weights get their values. But once training is done, using the network — every autocomplete, every photo tag — is just the forward pass you ran today, with better weights. That deployed, weights-frozen mode is called inference. You now know everything a trained network does at runtime.

Check yourself

No peeking back. Pull it from memory.

1. What happens during the forward pass?

2. A "hidden" layer is called hidden because it:

3. Our network solves XOR because its hidden neurons compute:

Watch this next

Primary source: rewatch "But what is a Neural Network?" by 3Blue1Brown — this time focus on the layers section. In lesson 1 you watched it for the single neuron; now you'll recognise the whole pipeline: how "the activation of each neuron in one layer … has some influence on the activation of each neuron in the next layer", through input, two hidden layers of 16, and out to ten digits. It's the same hand-off you just built, at scale.