AI Foundations · Lesson 5 · Phase 2 — Many neurons → a network

Activation Functions — and Why ReLU Took Over

Lesson 4 gave you a tower of layers: matrix, bias, repeat. Here is the uncomfortable secret — without the little squashing step between layers, that whole tower is a fraud. It computes nothing a single layer couldn't. Today you'll watch depth collapse with your own eyes, learn why one bent function rescues it, and meet the three classics: sigmoid, tanh, and ReLU — the modern default, and the one inside the digit-reading model you'll train at the end of this course.

Run this first: the tower that collapses

The experiment: build a 2-layer network but skip the activation function — no squash, just matrix after matrix. Then build a 1-layer network whose single matrix is M = W₂·W₁ (the two weight matrices multiplied together). Feed both the same input x = [3, 5]. (Quick Lesson 3 refresher: each row of a weight matrix holds one neuron's weights, so matvec(W, x) computes every neuron's weighted sum at once.)

// Swift version
let W1 = [[2.0, -1.0],
          [0.0,  1.0]]
let W2 = [[ 1.0, 1.0],
          [-2.0, 3.0]]
let x  = [3.0, 5.0]

func matvec(_ W: [[Double]], _ v: [Double]) -> [Double] {
    W.map { row in zip(row, v).map(*).reduce(0, +) }
}

func matmat(_ A: [[Double]], _ B: [[Double]]) -> [[Double]] {
    A.map { row in
        (0..<B[0].count).map { j in
            (0..<B.count).map { k in row[k] * B[k][j] }.reduce(0, +)
        }
    }
}

let twoLayers = matvec(W2, matvec(W1, x))  // layer 1, then layer 2
let M = matmat(W2, W1)                     // merge the two matrices
let oneLayer = matvec(M, x)                // a single layer using M

print(twoLayers)  // [6.0, 13.0]
print(oneLayer)   // [6.0, 13.0]  — identical. Depth bought nothing.
print(M)          // [[2.0, 0.0], [-4.0, 5.0]]

# Python version of the same experiment
W1 = [[2.0, -1.0],
      [0.0,  1.0]]
W2 = [[ 1.0, 1.0],
      [-2.0, 3.0]]
x  = [3.0, 5.0]

def matvec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v)))
            for i in range(len(W))]

def matmat(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

two_layers = matvec(W2, matvec(W1, x))  # layer 1, then layer 2
M = matmat(W2, W1)                      # merge the two matrices
one_layer = matvec(M, x)                # a single layer using M

print(two_layers)  # [6.0, 13.0]
print(one_layer)   # [6.0, 13.0]  — identical. Depth bought nothing.
print(M)           # [[2.0, 0.0], [-4.0, 5.0]]

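Both routes print [6.0, 13.0]. Try changing W1, W2, or x: the two routes will always agree, apart from tiny floating-point rounding differences in the last digits when the numbers aren't round. As CS231n puts it: leave the nonlinearity out and "the two matrices could be collapsed to a single matrix." Biases don't save you either. With b₁ = [1, −1] and b₂ = [2, 0], the two-layer route W₂·(W₁·x + b₁) + b₂ and the one-layer route M·x + c both give [8.0, 8.0], because the two biases fold into a single bias c = W₂·b₁ + b₂ = [2, −5].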
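
The bias claim can be checked the same way. Here is a minimal sketch in Python that reuses the lesson's numbers; the helper vecadd and the names two_layers_b and one_layer_b are made up for this illustration.

```python
# Check that two layers WITH biases still fold into one layer.
W1 = [[2.0, -1.0],
      [0.0,  1.0]]
W2 = [[ 1.0, 1.0],
      [-2.0, 3.0]]
b1 = [1.0, -1.0]
b2 = [2.0, 0.0]
x  = [3.0, 5.0]

def matvec(W, v):
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def matmat(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def vecadd(u, v):
    # hypothetical helper: element-wise sum of two vectors
    return [a + b for a, b in zip(u, v)]

# Route 1: two layers, each with its own bias, no activation.
h = vecadd(matvec(W1, x), b1)
two_layers_b = vecadd(matvec(W2, h), b2)

# Route 2: one layer with the merged matrix M and the merged bias c.
M = matmat(W2, W1)
c = vecadd(matvec(W2, b1), b2)
one_layer_b = vecadd(matvec(M, x), c)

print(c)             # [2.0, -5.0]
print(two_layers_b)  # [8.0, 8.0]
print(one_layer_b)   # [8.0, 8.0]
```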

Name the math: linear · linear = linear

What you just witnessed has a one-line explanation. Spelled out with our numbers, the two-layer route is:

layer 1: h = W₁·x = [1, 5]
layer 2: y = W₂·h = [6, 13]

Symbols: x is the input vector, W₁ and W₂ are the two layers' weight matrices, h is the hidden layer's output. Now substitute h into the second line and regroup the parentheses — matrix multiplication lets you do that, just like (2·3)·4 = 2·(3·4) with ordinary numbers:

y = W₂·(W₁·x) = (W₂·W₁)·x = M·x

M = W₂·W₁ is one fixed matrix — exactly the [[2, 0], [−4, 5]] your code printed. Two layers, one layer, same machine. The professional one-liner: composing linear maps gives a linear map. Stack a hundred matrix-only layers and they still flatten into one. Wikipedia states it the same way: with identity (do-nothing) activations, "the entire network is equivalent to a single-layer model."
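
To see the hundred-layer claim in action, here is a small sketch. The random integer weights and the seed are arbitrary choices for this illustration, not part of the lesson; integers are used so the arithmetic is exact and the comparison can be strict.

```python
import random

def matvec(W, v):
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def matmat(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

random.seed(0)  # arbitrary; any seed gives the same verdict

# 100 matrix-only layers with made-up integer weights in {-1, 0, 1}.
layers = [[[random.randint(-1, 1) for _ in range(2)] for _ in range(2)]
          for _ in range(100)]
x = [3, 5]

# Route 1: push x through all 100 layers, one after another.
deep = x
for W in layers:
    deep = matvec(W, deep)

# Route 2: merge everything into one matrix first.
# Each later layer multiplies on the left: M = W100 · ... · W2 · W1.
M = [[1, 0], [0, 1]]  # start from the identity (do-nothing) matrix
for W in layers:
    M = matmat(W, M)
flat = matvec(M, x)

print(deep == flat)  # True
```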

A single matrix can only do straight-line things to its input: scale, rotate, tilt. It can never bend. So a no-activation network of any depth can never learn a curve, a circle, an XOR, a cat. The activation function is what buys curves. Insert a nonlinear f after each layer, a = f(W·x + b), applied element-wise (to each neuron's z separately), and W₂·f(W₁·x) can no longer be merged: f is not a linear map, so no matrix can stand in for it and the parentheses can't be regrouped around it. As CS231n says: "the non-linearity is where we get the wiggle."
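
Here is a sketch of that failure to merge, using the lesson's W1 and W2 with ReLU (introduced in the next section) as the nonlinear f. The function names relu_layer, relu_net, and merged_net and the two extra inputs are assumptions made for this illustration.

```python
W1 = [[2.0, -1.0],
      [0.0,  1.0]]
W2 = [[ 1.0, 1.0],
      [-2.0, 3.0]]

def matvec(W, v):
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def matmat(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def relu_layer(v):
    # element-wise: each hidden value is thresholded on its own
    return [max(0.0, z) for z in v]

def relu_net(x):
    # two layers with a nonlinearity in between
    return matvec(W2, relu_layer(matvec(W1, x)))

M = matmat(W2, W1)

def merged_net(x):
    # the single-matrix stand-in from the collapse experiment
    return matvec(M, x)

for x in ([3.0, 5.0], [1.0, 5.0], [-3.0, -5.0]):
    print(x, relu_net(x), merged_net(x))
# [3.0, 5.0] [6.0, 13.0] [6.0, 13.0]
# [1.0, 5.0] [5.0, 15.0] [2.0, 21.0]
# [-3.0, -5.0] [0.0, 0.0] [-6.0, -13.0]
```

The first input agrees only because both hidden values happen to be positive, so ReLU leaves them alone. The last line rules out every matrix, not just M: a matrix always maps the flipped input to the flipped output, but the ReLU network maps [-3, -5] to [0, 0] rather than [-6, -13].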

The menu: three classic curves

① Sigmoid — the original squash

σ(z) = 1 / (1 + e^−z) → range (0, 1)

Your friend from Lesson 1: e ≈ 2.718, and the fraction bends any z into a smooth S between 0 and 1 — σ(0) = 0.5, σ(2) ≈ 0.881. But sigmoid has a disease: it saturates. At the far ends the S goes flat — at z = 4 the output is already 0.982 and the slope (how much the output moves when you nudge z) is a measly 0.018; its best slope, at z = 0, is only 0.25. Why care about slope? Training (Phase 3, coming up) works by nudging weights and watching the output respond. Where the curve is flat, nudges do nothing — the learning signal (formally, the gradient) dies. CS231n: "when the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero."
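
The slope numbers can be reproduced by doing literally what the paragraph describes: nudge z a little and see how far the output moves. This is a rough sketch; the helper slope and its nudge size are choices made for this illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def slope(f, z, nudge=1e-6):
    # nudge z a hair in both directions and measure the output change
    return (f(z + nudge) - f(z - nudge)) / (2 * nudge)

for z in [0.0, 2.0, 4.0, 8.0]:
    print(f"z={z:.1f}  sigmoid={sigmoid(z):.3f}  slope={slope(sigmoid, z):.3f}")
# z=0.0  sigmoid=0.500  slope=0.250
# z=2.0  sigmoid=0.881  slope=0.105
# z=4.0  sigmoid=0.982  slope=0.018
# z=8.0  sigmoid=1.000  slope=0.000
```

By z = 8 the output has rounded to 1.000 and the slope to 0.000: a nudge there changes essentially nothing.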

② Tanh — the centered sigmoid

tanh(z) = (e^z − e^−z) / (e^z + e^−z) → range (−1, 1)

The formula looks scary; the truth is friendly: tanh is just a stretched, recentered sigmoid, exactly tanh(z) = 2σ(2z) − 1 (CS231n; check: 2σ(2·1) − 1 = 2(0.881) − 1 ≈ 0.762 ≈ tanh(1)). Its win over sigmoid: outputs are zero-centered (negative z gives negative output, zero gives zero), which keeps the numbers flowing between layers nicely balanced, so CS231n notes tanh "is always preferred to the sigmoid" for hidden layers. Same disease though: flat at both ends, so gradients still die out there.
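
The identity is easy to spot-check at more than one point. This is a quick sketch; the sample z values are arbitrary.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for z in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    rescaled = 2 * sigmoid(2 * z) - 1   # stretch and recenter the sigmoid
    print(f"z={z:+.1f}  2*sigmoid(2z)-1={rescaled:+.3f}  tanh={math.tanh(z):+.3f}")
# z=-2.0  2*sigmoid(2z)-1=-0.964  tanh=-0.964
# z=-0.5  2*sigmoid(2z)-1=-0.462  tanh=-0.462
# z=+0.0  2*sigmoid(2z)-1=+0.000  tanh=+0.000
# z=+1.0  2*sigmoid(2z)-1=+0.762  tanh=+0.762
# z=+3.0  2*sigmoid(2z)-1=+0.995  tanh=+0.995
```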

③ ReLU — the cheap winner

ReLU(z) = max(0, z) → range [0, ∞)

Read it as code: if z is positive, keep it; otherwise output 0. ReLU(−3) = 0, ReLU(2.5) = 2.5. That's it: the rectified linear unit (the name on Wikipedia, which calls it "one of the most popular activation functions for artificial neural networks"). Why did this trivial kink beat the elegant curves? Two reasons.

Dirt cheap: no e, no division. Wikipedia: it "only requires comparison and addition," and CS231n adds that on whole layers it's just thresholding a matrix.

No saturation for z > 0: the right side is a straight line with slope exactly 1, forever, so the learning signal passes through undamaged ("better gradient propagation: fewer vanishing gradient problems compared to sigmoidal activation functions," according to Wikipedia). CS231n cites a famous result that ReLU accelerated a deep net's training convergence by a factor of 6 compared with sigmoid/tanh (Krizhevsky et al., the 2012 AlexNet work).

The fine print: for z < 0 ReLU is completely flat, so a neuron pushed deep into negative territory can stop learning entirely. This is the "dying ReLU" problem. In practice the speed and simplicity win anyway.
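
Here is a rough sketch of both selling points and the fine print. The sample z values are made up, and the last few lines use a deliberately simplified picture that is an assumption of this example: the learning signal is scaled by the activation's slope at each layer it crosses, ignoring the weights.

```python
def relu(z):
    return max(0.0, z)

def relu_slope(z):
    # slope 1 on the right, 0 on the left; at the kink z = 0 we pick 0
    return 1.0 if z > 0 else 0.0

zs = [-3.0, -0.5, 0.5, 2.5, 100.0]   # one layer's z values (made up)
print([relu(z) for z in zs])         # [0.0, 0.0, 0.5, 2.5, 100.0]
print([relu_slope(z) for z in zs])   # [0.0, 0.0, 1.0, 1.0, 1.0]

# Simplified picture: crossing 10 layers multiplies 10 slopes together.
sigmoid_best = 0.25 ** 10   # sigmoid at its very best slope, every layer
relu_active = 1.0 ** 10     # ReLU on its right-hand side, every layer
print(f"{sigmoid_best:.2e}")  # 9.54e-07
print(relu_active)            # 1.0
```

The two zeros in the slope list are the fine print: a neuron whose z stays negative passes no signal at all.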

Feel it: plot all three

Drive the curves. Pick a function, drag z, and watch two numbers: the output f(z) and the slope at that point. The shaded zones mark where the slope drops below 0.05 — the flat ends where learning signals die. Notice: sigmoid and tanh are shaded at both ends; ReLU is shaded on its whole left half but never on the right.

The numbers above are the real functions, not sketches — e.g. sigmoid at z = 2 shows 0.881, tanh at 2 shows 0.964 with slope 0.071, all matching Python's math module to three decimals. One corner case: at exactly z = 0 ReLU's slope is undefined (the kink); software just picks 0 or 1 there (Wikipedia). This widget shows 0.
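
If you'd rather check the shaded zones in code, here is a small sketch. It assumes the standard slope formulas σ(z)·(1 − σ(z)) for sigmoid and 1 − tanh(z)² for tanh, which this lesson does not derive, and it picks slope 0 at ReLU's kink to match the widget. The names FLAT, slopes, and flat_at are made up for this illustration.

```python
import math

FLAT = 0.05  # the shading threshold described above

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Slope of each activation at z (standard derivative formulas).
slopes = {
    "sigmoid": lambda z: sigmoid(z) * (1 - sigmoid(z)),
    "tanh":    lambda z: 1 - math.tanh(z) ** 2,
    "relu":    lambda z: 1.0 if z > 0 else 0.0,  # 0 at the kink
}

zs = [-4.0, -2.0, 0.0, 2.0, 4.0]

# For each function, list the sample points that fall in a flat zone.
flat_at = {name: [z for z in zs if slope(z) < FLAT]
           for name, slope in slopes.items()}
print(flat_at)
# {'sigmoid': [-4.0, 4.0], 'tanh': [-4.0, 4.0], 'relu': [-4.0, -2.0, 0.0]}

print(f"{slopes['tanh'](2.0):.3f}")  # 0.071
```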

Write it in code

All three functions are one-liners. Paste the Python into a Google Colab cell and run it; the Swift runs as-is in an Xcode playground. Both print the identical table below — compare it against the widget.

// Swift version
import Foundation

func sigmoid(_ z: Double) -> Double { 1 / (1 + exp(-z)) }
func relu(_ z: Double) -> Double { max(0, z) }
// tanh ships with Foundation

for z in [-3.0, -1.0, 0.0, 1.0, 3.0] {
    print(String(format: "z=%+.1f  sigmoid=%.3f  tanh=%+.3f  relu=%.1f",
                 z, sigmoid(z), tanh(z), relu(z)))
}
// z=-3.0  sigmoid=0.047  tanh=-0.995  relu=0.0
// z=-1.0  sigmoid=0.269  tanh=-0.762  relu=0.0
// z=+0.0  sigmoid=0.500  tanh=+0.000  relu=0.0
// z=+1.0  sigmoid=0.731  tanh=+0.762  relu=1.0
// z=+3.0  sigmoid=0.953  tanh=+0.995  relu=3.0

# Python version
import math

def sigmoid(z): return 1 / (1 + math.exp(-z))
def relu(z):    return max(0.0, z)
# tanh ships with the math module

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  "
          f"tanh={math.tanh(z):+.3f}  relu={relu(z):.1f}")

# z=-3.0  sigmoid=0.047  tanh=-0.995  relu=0.0
# z=-1.0  sigmoid=0.269  tanh=-0.762  relu=0.0
# z=+0.0  sigmoid=0.500  tanh=+0.000  relu=0.0
# z=+1.0  sigmoid=0.731  tanh=+0.762  relu=1.0
# z=+3.0  sigmoid=0.953  tanh=+0.995  relu=3.0

Read the table like a story: sigmoid squeezes everything into (0, 1) and is already 0.953 at z = 3; tanh mirrors it around zero in (−1, 1); ReLU zeroes the negatives and passes positives straight through, unbounded.

Who goes where in a real network

Don't retire the sigmoid; reassign it.

Hidden layers: ReLU. That's the modern default; CS231n's blunt advice is simply "Use the ReLU non-linearity."

Output layer: it depends on the question. When the network must answer with a probability ("is this a cat, 0 to 1?"), sigmoid's (0, 1) range is exactly the right shape, so it still lives at the output. Saturation hurts less there because it's a single stop: the learning signal has to get through only one sigmoid, rather than being shrunk again at every hidden layer.

You'll use both in Phase 4: sigmoid throughout the tiny net you build from scratch (one function keeps the math gentle), then ReLU in the hidden layer of the digit-reading model that ends the course.
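
Here is a minimal sketch of that division of labor: ReLU in the hidden layer, sigmoid at the output. The hidden weights reuse W1 from the collapse experiment; the output weights w_out, the bias b_out, and the input are made up for illustration and are not the Phase 4 model.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

W_hidden = [[2.0, -1.0],
            [0.0,  1.0]]      # W1 from the collapse experiment
w_out = [1.0, -0.5]           # made-up output weights
b_out = 0.5                   # made-up output bias
x = [1.0, 5.0]

# Hidden layer: weighted sums, then ReLU on each neuron.
hidden = [relu(sum(w * xj for w, xj in zip(row, x))) for row in W_hidden]

# Output layer: one weighted sum, then sigmoid to land in (0, 1).
z_out = sum(w * h for w, h in zip(w_out, hidden)) + b_out
p = sigmoid(z_out)

print(hidden)      # [0.0, 5.0]
print(z_out)       # -2.0
print(f"{p:.3f}")  # 0.119
```

The first hidden neuron's weighted sum is −3, so ReLU zeroes it; the output's sigmoid then turns z = −2 into a value that can be read as a probability of about 12%.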

Check yourself

No peeking back. Pull it from memory.

1. What happens when you stack layers with no activation in between?

2. Why do learning signals die at the sigmoid's far ends?

3. Which activation is the modern default for hidden layers?

Read this next

Primary source: CS231n, Neural Networks Part 1: activation functions (Stanford's course notes). Read the "Commonly used activation functions" section: most of this lesson's claims about sigmoid, tanh, and ReLU come from there, and it adds a few exotic cousins (Leaky ReLU, Maxout) you can now understand. It's blunt, practical, and written by people who train real networks.