Transformation & ReLU activation function

Part 1: The ReLU Activation Function

This is one of the best illustrations of the power of the ReLU activation function, a fundamental concept in modern neural networks.

The main goal of this chapter is to answer the question: "How can a simple function like ReLU, which is just a straight line 'broken' at point 0, help a neural network learn incredibly complex non-linear relationships (like a sine wave)?"

Let's dissect it step by step.

The Big Idea

Imagine you only have straight LEGO bricks. How would you use them to build a curved wall?

The answer is: You can't create a smooth curve, but you can approximate it by piecing together many short, straight bricks. The more short bricks you use, the more curved and smooth your wall will look.

In a neural network:

  • The ReLU function is our straight brick.
  • Each neuron (or pair of neurons) is a builder, tasked with placing one brick.
  • The neural network is the entire construction, assembling these bricks to create the final complex shape (like a sine curve).

Now let's dive into the details of how the neuron "builder" works.


The Basic Building Block - A Single ReLU Neuron

A neuron receives an input, multiplies it by a weight, adds a bias, and then passes it through the ReLU activation function.

Formula: output = ReLU(weight * input + bias)

ReLU function: ReLU(x) = max(0, x). This means:

  • If x <= 0, the result is 0.
  • If x > 0, the result is x.

It's like a gate:

  • Calculated input is negative -> Gate closes -> Output = 0.
  • Calculated input is positive -> Gate opens -> Output = the calculated value.

The Role of Weight and Bias:

  1. Weight: Changes the slope.

    • A large weight -> The line slopes up faster.
    • A small weight -> The line slopes up more gently.
    • A negative weight -> The line is flipped, sloping down.
  2. Bias: Shifts the "break" point.

    • The bias determines at which input value the neuron starts to "activate" (i.e., produce an output > 0). It shifts the graph left or right.

ASCII Illustration:

Let's look at the graph of a ReLU neuron. The horizontal axis is input, the vertical axis is output.

Text Only
        Basic (w=1, b=0)          Increase Weight (w=2)           Add Bias (w=1, b=-1)
           /                          //                              /
          /                          / /                             /
         /                          / /                             /
        +------- (input)          +------- (input)                +------- (input)
                                                                 |
                                                               (break point shifted right)

In summary: With one neuron, we can create a "ramp" that starts at an arbitrary point and has an arbitrary slope. But it's still just a straight line.
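
To make this concrete, here is a tiny Python/NumPy sketch of a single ReLU neuron (the helper name relu_neuron is just for illustration, not from the book):

Python
import numpy as np

def relu_neuron(x, weight, bias):
    """Output of one neuron: ReLU(weight * x + bias)."""
    return np.maximum(0, weight * x + bias)

x = np.linspace(-2.0, 2.0, 9)
print(relu_neuron(x, weight=1.0, bias=0.0))   # basic ramp, breaks at x = 0
print(relu_neuron(x, weight=2.0, bias=0.0))   # larger weight: steeper ramp
print(relu_neuron(x, weight=1.0, bias=-1.0))  # negative bias: break point shifts right to x = 1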


The Magic Begins - Combining Two Neurons

This is the most crucial and magical part. The book illustrates how, by using a pair of neurons (one in the first hidden layer, one in the second), we can create something called an "area of effect."

Imagine this:

  1. Neuron 1 (The "Activator" Neuron): Creates an upward-sloping ramp.

    • Example: y1 = ReLU(1.0 * x - 0.2)
    • It outputs 0 until x = 0.2, then slopes upward.
  2. Neuron 2 (The "Deactivator" Neuron): Also creates an upward-sloping ramp, but the next layer gives it a negative weight, so its contribution pulls the sum back down.

    • Example: y2 = ReLU(1.0 * x - 0.5), combined with an output weight of -2
    • Its weighted contribution is 0 until x = 0.5, then slopes downward twice as steeply as Neuron 1 slopes upward.

When the next layer combines these two neurons (sum = 1.0 * y1 - 2.0 * y2), something interesting occurs:

  • When x < 0.2: Both neurons output 0. The sum is 0.
  • When 0.2 < x < 0.5: Neuron 1 is active (sloping up), Neuron 2 is still 0. The sum is an upward-sloping line.
  • When x > 0.5: Both neurons are active. Neuron 2's steeper, negatively weighted slope overpowers Neuron 1's, so the sum slopes back down, reaching 0 at x = 0.8. (The next layer's own ReLU clips anything below 0, so beyond x = 0.8 the sum stays at 0.)

The result is a "tent" or "hat" shape!

ASCII Illustration:

Text Only
   Neuron 1's contribution      +    Neuron 2's contribution      =       Final Result
   (slope up from x=0.2)             (weight -2: slope down               (a tent shape)
                                      from x=0.5)
            /                                                                  /\
           /                                                                 /    \
          /                                                                /        \
---------+---------- (input)    --------------+---------- (input)     ---+----+----+---- (input)
       0.2                                   0.5\                        0.2  0.5  0.8
                                                 \
                                                  \

This is our triangular LEGO brick! By adjusting the weights and biases of this pair of neurons, we can control:

  • The position of the tent (shift left/right).
  • The height of the tent.
  • The slope of the tent's sides.
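
Here is a minimal NumPy sketch of that pair (relu and tent are hypothetical helper names, and the weights match the example above, not the book's hand-tuned values):

Python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def tent(x):
    """Two hidden ReLU ramps combined by the next layer with weights +1 and -2.
    Zero before x=0.2, peak of 0.3 at x=0.5, back to zero at x=0.8; the next
    layer's own ReLU clips the negative tail beyond 0.8."""
    y1 = relu(1.0 * x - 0.2)   # "activator": ramp up from x = 0.2
    y2 = relu(1.0 * x - 0.5)   # "deactivator": ramp up from x = 0.5
    return relu(1.0 * y1 - 2.0 * y2)

x = np.linspace(0.0, 1.0, 11)
print(np.round(tent(x), 2))    # 0 outside [0.2, 0.8], tent shape inside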


Building a Sine Wave - Assembling Many "Bricks"

Now that you have these "tent-shaped bricks," approximating a sine curve becomes simple:

You just use many pairs of neurons, with each pair creating one "tent" to simulate a segment of the sine wave.

  • Neuron pair 1: Creates the first upward slope of the sine wave.
  • Neuron pair 2: Creates the next downward slope.
  • Neuron pair 3: Creates the downward slope into the negative trough.
  • ... and so on.

The "hand-tuning" process in the book (from Fig 4.20 to 4.33) is just to demonstrate to you: * "Ah, if I adjust this weight, this slope gets steeper." * "If I adjust that bias, this tent shifts to the right." * "If I use a negative weight on the output, the tent gets flipped upside down (creating the trough of the sine wave)."

Abstract ASCII Illustration:

Text Only
   Target Sine Wave:
       __/ \__
      /     \
     /       \
            / \
    _______/   \______

   Neural Network approximates by adding "tents":

     Tent 1      Tent 2 (inverted)      Tent 3...
       /\               _                /\
      /  \             / \              /  \
     /    \           /   \            /    \
    -------  +      \/     +  ...   =   Result approximates a sine wave
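
As a rough sketch of this assembly (hand-chosen values for illustration only, not the book's parameters; hat and rough_sine are hypothetical helper names), each "tent" below is built from three ReLU ramps so the pieces, including the inverted one, can simply be added:

Python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def hat(x, a, m, b, h):
    """A 'tent': zero outside [a, b], rising to height h at the peak m.
    Built from three ReLU ramps combined with fixed output weights."""
    up, down = h / (m - a), h / (b - m)
    return up * relu(x - a) - (up + down) * relu(x - m) + down * relu(x - b)

def rough_sine(x):
    # One upright tent for the positive hump, one inverted tent for the trough.
    return (hat(x, 0.0, np.pi / 2, np.pi, 1.0)
            + hat(x, np.pi, 3 * np.pi / 2, 2 * np.pi, -1.0))

x = np.linspace(0.0, 2 * np.pi, 9)
print(np.round(rough_sine(x), 2))   # crude, piecewise-linear "sine"
print(np.round(np.sin(x), 2))       # the real thing, for comparison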

Summary

  1. The Power of ReLU: ReLU itself is very simple, but when combined in a multi-layer network, its units are capable of creating incredibly complex piecewise linear functions. These functions can approximate any continuous function (this is the idea behind the Universal Approximation Theorem).

  2. Why Do We Need Hidden Layers? The first hidden layer creates the "ramps." The second hidden layer combines those "ramps" to form "tents." Subsequent layers can combine these "tents" into even more complex shapes.

  3. From "Hand-Tuning" to "Self-Learning": Manually tuning the parameters in the book is tedious and for illustrative purposes only. In reality, the process of training a neural network is the process of the computer automatically finding the best weight and bias values for all neurons, so that the network's output matches the target data as closely as possible. An optimizer like Gradient Descent is the "chief architect" that does this job.

  4. More Neurons, Better Approximation: By increasing the number of neurons to, for example, 64, the network has more "bricks" to build with. The resulting approximation of the curve looks much smoother and more accurate.

ReLU, despite its simplicity, is the foundation for the extraordinary power of modern neural networks.
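
Point 3 above says the optimizer finds these weights automatically. Below is a minimal sketch of that idea, assuming a single hidden layer of 64 ReLU units trained by full-batch gradient descent with hand-derived gradients; the initialization, learning rate, and step count are illustrative and untuned, and this is not the book's exact training code:

Python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0.0, 2 * np.pi, 256).reshape(-1, 1)   # inputs
y = np.sin(X)                                         # targets

# A 1 -> 64 -> 1 network with a ReLU hidden layer.
W1 = rng.normal(0.0, 0.5, (1, 64))
b1 = np.zeros((1, 64))
W2 = rng.normal(0.0, 0.5, (64, 1))
b2 = np.zeros((1, 1))

lr = 0.01   # illustrative learning rate
for step in range(20001):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)            # ReLU
    pred = a1 @ W2 + b2
    err = pred - y
    loss = np.mean(err ** 2)          # mean squared error

    # Backward pass (hand-derived gradients of the MSE loss)
    dpred = 2 * err / len(X)
    dW2 = a1.T @ dpred
    db2 = dpred.sum(axis=0, keepdims=True)
    da1 = dpred @ W2.T
    dz1 = da1 * (z1 > 0)              # derivative of ReLU
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if step % 5000 == 0:
        print(step, loss)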


Part 2: A Deep Dive into the Transformation of a Single Data Point.

Let's assume we have:

  • A single input data point O with coordinates (0.5, 1.0).
  • A Layer_Dense layer with 2 inputs and 3 neurons.

1. Initialization

The dense1 layer is initialized with weights and biases. Let's assume that after random initialization, we have these specific values:

Weight Matrix W (self.weights) - Size (2, 3)

  • Row 0: Weights for the first input (i1 = 0.5).
  • Row 1: Weights for the second input (i2 = 1.0).
  • Cols 0, 1, 2: Correspond to Neuron 0, 1, and 2.
Text Only
          Neuron 0   Neuron 1   Neuron 2
         +----------+----------+----------+
Input 0  |   0.2    |   0.8    |  -0.5    |
(i1=0.5) +----------+----------+----------+
Input 1  |  -0.9    |   0.2    |   0.4    |
(i2=1.0) +----------+----------+----------+

Bias Vector b (self.biases) - Size (1, 3)

  • Each value corresponds to the bias of one neuron.
Text Only
         +----------+----------+----------+
         |   2.0    |   3.0    |   0.5    |
         +----------+----------+----------+
           Neuron 0   Neuron 1   Neuron 2

2. The Transformation Process - dense1.forward(O)

We perform the operation: v' = v · W + b

Step 2.1: Dot Product v · W

  • v is the input vector: [0.5, 1.0] (Size 1x2)
  • W is the weight matrix (Size 2x3)
  • The result will be a vector of size 1x3.
Text Only
                                       +-------+-------+-------+
                                       |  0.2  |  0.8  | -0.5  |
                                       | -0.9  |  0.2  |  0.4  |
                                       +-------+-------+-------+
                                                 ^
                                                 |
                                                 · (Dot Product)
+-------+-------+
|  0.5  |  1.0  |
+-------+-------+
      |
      +-------------------------------------------------------------+
      |                                                             |
      v                                                             v
    Calculation for Neuron 0:                                     Calculation for Neuron 1:
    (0.5 * 0.2) + (1.0 * -0.9)                                    (0.5 * 0.8) + (1.0 * 0.2)
    = 0.1 - 0.9                                                   = 0.4 + 0.2
    = -0.8                                                        = 0.6

                                                                     Calculation for Neuron 2:
                                                                     (0.5 * -0.5) + (1.0 * 0.4)
                                                                     = -0.25 + 0.4
                                                                     = 0.15

The result of the dot product is the vector [-0.8, 0.6, 0.15].

Step 2.2: Add Bias Vector + b

Now, we take the result from above and add the bias vector.

Text Only
      Result from v · W                 Bias Vector b                  Output Vector v'
+--------+-------+--------+     +     +-------+-------+-------+     =     +-------+-------+-------+
|  -0.8  |  0.6  |  0.15  |           |  2.0  |  3.0  |  0.5  |           |  1.2  |  3.6  |  0.65 |
+--------+-------+--------+           +-------+-------+-------+           +-------+-------+-------+
     |        |        |                 |        |        |                 |        |        |
     |        |        +-----------------|--------|--------|-----------------+        |
     |        +--------------------------|--------|--------+--------------------------+
     +-----------------------------------|--------+-----------------------------------+

     -0.8 + 2.0 = 1.2
           0.6 + 3.0 = 3.6
                 0.15 + 0.5 = 0.65

Result: The vector v' (the output of dense1) is [1.2, 3.6, 0.65]. This is the coordinate of point O in the new 3-dimensional space after the linear transformation.
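
The same arithmetic in NumPy, using the example values from the tables above (the variable names are just for illustration):

Python
import numpy as np

# The example values from the tables above
v = np.array([[0.5, 1.0]])               # input point O, shape (1, 2)
W = np.array([[ 0.2, 0.8, -0.5],
              [-0.9, 0.2,  0.4]])        # weight matrix, shape (2, 3)
b = np.array([[2.0, 3.0, 0.5]])          # bias vector, shape (1, 3)

print(np.dot(v, W))        # [[-0.8   0.6   0.15]]  (up to float rounding)
print(np.dot(v, W) + b)    # [[ 1.2   3.6   0.65]]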

3. ReLU Activation - activation1.forward(v')

Now, we pass the vector v' through the ReLU function. This function operates element-wise.

Text Only
     Input vector v' to ReLU             Action of ReLU              Final vector v''
+-------+-------+-------+        max(0, x)       +-------+-------+-------+
|  1.2  |  3.6  |  0.65 |  ---------------->     |  1.2  |  3.6  |  0.65 |
+-------+-------+-------+                        +-------+-------+-------+
     |        |        |
     |        |        +-----> max(0, 0.65) = 0.65
     |        +--------------> max(0, 3.6)  = 3.6
     +-----------------------> max(0, 1.2)  = 1.2

In this example, because all components of v' are positive, the output of ReLU v'' is identical to v'.

If v' were [-0.8, 0.6, 0.15] (before adding the bias), the result would be different:

Text Only
     Input vector v' to ReLU             Action of ReLU              Final vector v''
+--------+-------+--------+        max(0, x)       +-------+-------+--------+
|  -0.8  |  0.6  |  0.15  |  ---------------->     |  0.0  |  0.6  |  0.15  |
+--------+-------+--------+                        +-------+-------+--------+
     |        |        |
     |        |        +-----> max(0, 0.15) = 0.15
     |        +--------------> max(0, 0.6)  = 0.6
     +-----------------------> max(0, -0.8) = 0.0

This diagram describes the entire mathematical process from an input vector v to the final vector v'' after passing through a dense layer and a ReLU activation layer.
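
Putting the walkthrough together in code, here is a minimal sketch in the style of the dense1.forward / activation1.forward calls referenced above; the random initialization is overwritten with the example values so the printed numbers match the diagrams (treat the class bodies as a simplified sketch, not the book's exact code):

Python
import numpy as np

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        # Normally initialized randomly; only the shapes matter here.
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

# Overwrite the random initialization with the example values above.
dense1 = Layer_Dense(2, 3)
dense1.weights = np.array([[ 0.2, 0.8, -0.5],
                           [-0.9, 0.2,  0.4]])
dense1.biases = np.array([[2.0, 3.0, 0.5]])
activation1 = Activation_ReLU()

O = np.array([[0.5, 1.0]])
dense1.forward(O)
activation1.forward(dense1.output)
print(dense1.output)        # v'  -> [[1.2  3.6  0.65]]
print(activation1.output)   # v'' -> [[1.2  3.6  0.65]] (all positive, so unchanged)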


Abstraction

How does each neuron contribute to creating the final "signature"?

Analysis

  1. "Neuron 1 contributes a part" to creating the final signature. This is like a musician in an orchestra. The violinist doesn't "carry" the symphony; they just play their violin part. The symphony (the signature) is the combination of all the musicians. A more accurate phrasing: "Neuron 1 has its own set of criteria (its weights and bias)."

  2. "transforms O(x,y) → O'(x,y,z)": The entire layer (composed of all 3 neurons) collectively performs this transformation.

Interpretation

  1. Each Neuron is a "Feature Detector":

    • Neuron 1 is equipped with a set of criteria (w1, b1). It measures how well the point O(x,y) fits these criteria and outputs a score x'.
    • Neuron 2 is equipped with a set of criteria (w2, b2). It measures how well the point O(x,y) fits these criteria and outputs a score y'.
    • Neuron 3 is equipped with a set of criteria (w3, b3). It measures how well the point O(x,y) fits these criteria and outputs a score z'.
  2. Creating the "Signature":

    • The "signature" of point O is not created by a single neuron. The "signature" is the resulting vector O'(x', y', z'). It is the collection of scores that all the "detectors" have produced.
  3. The Goal of Training:

    • The training process will adjust the criteria (w, b) of each neuron such that:
      • All points O belonging to the "Blue" class, when passed through these 3 "detectors," will produce O' vectors (signatures) that lie close to each other in one region of space.
      • All points O belonging to the "Red" class will produce signatures that lie close to each other in a different region of space.
      • And similarly for the "Green" class.

ASCII Diagram Reflecting This Idea

Text Only
  Input Point O(x,y)
          |
          |
+---------+---------+-------------------+
|                   |                   |
v                   v                   v
Detector 1          Detector 2          Detector 3
(Criteria w1, b1)   (Criteria w2, b2)   (Criteria w3, b3)
|                   |                   |
v                   v                   v
Score x'            Score y'            Score z'
|                   |                   |
+---------+---------+-------------------+
          |
          v
"Signature" = O'(x', y', z')
(Resulting vector in a new space)

Example: After training, it might be the case that:

  • Criterion 1 (of Neuron 1) becomes "detect upward curves."
  • Criterion 2 (of Neuron 2) becomes "detect proximity to the origin."
  • A point O from the "Blue" class might be both curving up and close to the origin. Its signature would be O'(HIGH, HIGH, ...).
  • A point O from the "Red" class might be curving up but far from the origin. Its signature would be O'(HIGH, LOW, ...).

Conclusion: Each neuron has a specific role. That role is to "measure a feature." The final "signature" of a data point is the combination of results from all of those feature measurements.