
Gradients, Partial Derivatives and the Chain Rule

Abstract Example

We will combine the intuitive story of "responsibility" with diagrams and specific explanations for each node in a neural network. This will create a solid bridge between abstract theory and practical computation.

Story Summary
  • Neural Network: An archer with hundreds of technical "knobs" (weights, biases).
  • Objective: To hit the bullseye (reduce Error/Loss to 0).
  • Backpropagation: The process of "Distributing Responsibility," figuring out how much responsibility each knob bears for the final error.
  • Gradient: The instruction sheet that compiles all "responsibilities." We will adjust the knobs in the opposite direction of this instruction sheet.
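
In formula form, that adjustment is a plain gradient descent step. A minimal sketch in Python, assuming a hand-picked learning rate and made-up "responsibility" numbers:

Python
# Move each knob opposite to its "responsibility" (the gradient).
learning_rate = 0.01                    # step size chosen by the Coach (illustrative)
weight, bias = 0.5, -0.2                # current knob settings (illustrative)
grad_weight, grad_bias = 30.0, 10.0     # "responsibility" of each knob

weight = weight - learning_rate * grad_weight
bias = bias - learning_rate * grad_bias
print(weight, bias)                     # roughly 0.2 and -0.3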

Now, let's act as engineers, open the hood of the "archer machine," and examine each part. We will see how "responsibility" (the gradient) flows backward through each type of component.

Conventions:

  • Forward: Data flows from left to right.
  • Backward: "Responsibility" (Gradient) flows backward from right to left.
  • Input Responsibility: The gradient that a node receives from the next layer.
  • Output Responsibility: The gradient that a node calculates and sends to the previous layer (against the data flow).

Analyzing Each Component (Each Computational Node)

1. The ADDITION Component (The Synergy Unit)

This is the simplest component. It takes two or more input values and adds them together.

  • Example in a Neural Network: Sum = (Input * Weight) + Bias. The step that adds the Bias is an addition node.
Responsibility Distribution Diagram
Text Only
                     Responsibility for 'Sum' = 10
                                    |
                                    v
                    +-----------------------------+
--[Input * Weight]->|                             |
                    |      ADDITION NODE (+)      |-->[Sum]--
--[Bias]----------->|                             |
                    +-----------------------------+
         ^                                    ^
         |                                    |
Responsibility for 'Input * Weight' = 10    Responsibility for 'Bias' = 10
  • Forward: Takes the value of (Input * Weight) and adds it to the value of Bias to get the Sum.

  • Backward (Distributing Responsibility):

    • Principle: In addition, all inputs contribute equally and directly to the result. The influence of each input on the output is 1-to-1 (increasing an input by 1 unit increases the output by 1 unit).
    • Action: The addition node receives "responsibility" from the next layer and copies it exactly to all of its input branches.
    • Derivative at this node: The derivative of Sum with respect to Bias is 1. The derivative of Sum with respect to (Input * Weight) is also 1. Therefore, the responsibility is multiplied by 1, meaning it remains unchanged.
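
A minimal Python sketch of this behavior; the names add_forward and add_backward are illustrative, not taken from any library:

Python
# Addition node: forward adds the inputs; backward copies the incoming
# "responsibility" unchanged to every input branch (both local derivatives are 1).
def add_forward(a, b):
    return a + b

def add_backward(grad_output):
    grad_a = grad_output * 1.0   # dSum/da = 1
    grad_b = grad_output * 1.0   # dSum/db = 1
    return grad_a, grad_b

total = add_forward(15.0, 2.0)   # Sum = 17.0
print(add_backward(10.0))        # (10.0, 10.0): responsibility is copied to both branches
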
2. The MULTIPLICATION Component (The Leverage Unit)

This component takes two values and multiplies them together.

  • Example in a Neural Network: Result = Input * Weight. This is the core operation in a layer.
Responsibility Distribution Diagram
Text Only
                     Responsibility for 'Result' = 10
                                    |
                                    v
                    +-----------------------------+
--[Input = 3]------>|                             |
                    |   MULTIPLICATION NODE (*)   |-->[Result = 15]--
--[Weight = 5]----->|                             |
                    +-----------------------------+
         ^                                    ^
         |                                    |
Responsibility for 'Input'              Responsibility for 'Weight'
 = 10 * 5 = 50                           = 10 * 3 = 30
  • Forward: Takes the Input (let's assume its value is 3) and multiplies it by the Weight (let's assume its value is 5) to get the Result (15).

  • Backward (Distributing Responsibility):

    • Principle: In multiplication, the influence of one input on the result depends on the value of the other input. This is the leverage effect.
    • Action: To calculate the responsibility for one branch, the multiplication node takes the "responsibility" from the next layer and multiplies it by the value of the other branch (the value that was stored from the Forward pass).
    • Derivative at this node: The derivative of Result with respect to Input is the value of the Weight. The derivative of Result with respect to Weight is the value of the Input.
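
A minimal Python sketch, again with illustrative names (mul_forward, mul_backward); note that the forward pass must remember its inputs so the backward pass can use "the other branch":

Python
# Multiplication node: forward multiplies and remembers its inputs; backward
# multiplies the incoming "responsibility" by the value of the *other* branch.
def mul_forward(x, w):
    cache = (x, w)               # stored during the forward pass
    return x * w, cache

def mul_backward(grad_output, cache):
    x, w = cache
    grad_x = grad_output * w     # dResult/dInput  = Weight
    grad_w = grad_output * x     # dResult/dWeight = Input
    return grad_x, grad_w

result, cache = mul_forward(3.0, 5.0)   # Result = 15.0
print(mul_backward(10.0, cache))        # (50.0, 30.0), matching the diagram above
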
3. The ReLU Component (The Gatekeeper)

This component is very special. It has only one input. If the input is a positive number, it lets it pass through unchanged. If the input is a negative number, it blocks it and returns 0. Its definition is ReLU(x) = max(0, x).

  • Example in a Neural Network: This is a popular activation function that helps the network learn complex relationships.
Responsibility Distribution Diagram

Case 1: Positive Input (Gate is open)

Text Only
                     Responsibility for 'Output' = 10
                                    |
                                    v
                    +-----------------------------+
--[Input = 3]------>|                             |
                    |       ReLU NODE (Open)      |-->[Output = 3]--
                    +-----------------------------+
                                    ^
                                    |
                     Responsibility for 'Input' = 10
Case 2: Negative Input (Gate is closed)

Text Only
                     Responsibility for 'Output' = 10
                                    |
                                    v
                    +-----------------------------+
--[Input = -2]----->|                             |
                    |      ReLU NODE (Closed)     |-->[Output = 0]--
                    +-----------------------------+
                                    ^
                                    |
                      Responsibility for 'Input' = 0
  • Forward: Checks the input. If positive, it remains unchanged. If negative, it becomes 0.

  • Backward (Distributing Responsibility):

    • Principle: ReLU acts as a gatekeeper for the flow of "responsibility."
    • Action:
      • If during the forward pass, the input was positive (the gate was open), then during the backward pass, the "responsibility" is allowed to pass through unchanged.
      • If during the forward pass, the input was negative (the gate was closed), then during the backward pass, the "responsibility" is completely blocked (becomes 0). Because this input was clamped to 0, small changes to it have no effect on the final result, so it bears no responsibility.
    • Derivative at this node: The derivative of ReLU(x) is 1 if x > 0 and 0 if x < 0 (at x = 0 it is undefined; in practice it is usually treated as 0). Therefore, the "responsibility" is multiplied by 1 (unchanged) or by 0 (annihilated).
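
A minimal Python sketch of the gatekeeper, with illustrative names (relu_forward, relu_backward); it remembers the forward input to decide whether the gate is open:

Python
# ReLU node: the gate. Backward lets the "responsibility" through only where
# the input seen during the forward pass was positive.
def relu_forward(x):
    cache = x                    # remember the input to decide the gate later
    return max(0.0, x), cache

def relu_backward(grad_output, cache):
    return grad_output if cache > 0 else 0.0

out, cache = relu_forward(3.0)           # gate open
print(relu_backward(10.0, cache))        # 10.0
out, cache = relu_forward(-2.0)          # gate closed
print(relu_backward(10.0, cache))        # 0.0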

Putting It All Together: A Chain of Responsibility

Now, let's see how these components work together in a simple neuron:

Input and Weight --> [Multiplication Node] --> Intermediate Product; Intermediate Product and Bias --> [Addition Node] --> Final Result

Text Only
Final responsibility (from the next layer)
      |
      v
 [Addition Node] <--- Responsibility is copied to Bias and 'Intermediate Product'
      |
      v
 [Multiplication Node] <--- Responsibility is multiplied by the value of the other branch
      |
      v
Responsibility of the Weight and Input
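
Chaining the multiplication and addition sketches reproduces this diagram end to end. A toy example with made-up values (Input = 3, Weight = 5, Bias = 2) and an incoming responsibility of 10:

Python
# A single toy neuron: Final = (Input * Weight) + Bias.
input_value, weight, bias = 3.0, 5.0, 2.0

# Forward pass
intermediate_product = input_value * weight      # Multiplication node: 15.0
final_result = intermediate_product + bias       # Addition node: 17.0

# Backward pass: assume the next layer assigns responsibility 10 to the final result
grad_final = 10.0
grad_intermediate = grad_final * 1.0             # addition copies responsibility
grad_bias = grad_final * 1.0                     # -> 10.0
grad_input = grad_intermediate * weight          # multiplication uses the other branch -> 50.0
grad_weight = grad_intermediate * input_value    # -> 30.0

print(grad_weight, grad_bias, grad_input)        # 30.0 10.0 50.0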

This "Responsibility Distribution" process flows backward through the entire network, through hundreds, thousands of components like these. It's like a reverse investigation, starting from the "crime scene" (the Loss Function) and following the traces back to find the "responsibility" of each "suspect" (each weight and bias).

When the investigation is complete, each weight and bias will have received its own "responsibility" number. That is the Gradient. Based on that, the Coach (you) will know how to adjust each knob to make the next shot more accurate.