Derivative

Chapter 7 is one of the most pivotal chapters in the book. It marks the transition from the "trial and error" (random search) method to a systematic, intelligent, and efficient optimization method based on calculus. This is the foundation for understanding the Backpropagation algorithm.

1. Context - Why Do We Need Derivatives?
  • Easy-to-understand Explanation:

    In previous chapters, we tried to find the best set of weights and biases by changing them randomly. The author points out that this method is ineffective because the space containing all possible combinations of weights and biases is infinite. Random searching is like looking for a needle in a haystack. We need a "smarter" way to know which direction to adjust the parameters to minimize the loss function.

  • Abstract Analogy: The Blind Mountain Climber

    Imagine you are standing on a mountainside, blindfolded. Your goal is to go down to the valley (the lowest point, corresponding to the lowest loss).

    • Random Method (Random Search): You instantly teleport to a random location on the mountain. You might be lucky and land near the valley, but it's more likely you'll end up in a worse spot.
    • Smart Method (Using Derivatives): Instead of jumping around, you gently tap your feet on the ground around you to feel the slope of the terrain. You will then step in the steepest downward direction. This step might be small, but it ensures you are heading in the right direction to descend the mountain. "Feeling the slope" is precisely the role of the derivative.
  • Conceptual Comparison:

    The book refers to this as understanding the "impact" of a weight/bias on the loss function. In mathematics, this concept is precisely defined as the derivative. The derivative of a function at a point tells us the "instantaneous rate of change" of the function at that point.

    Info

    Source: According to Khan Academy, "The derivative of a function... can be interpreted as the instantaneous rate of change... or the slope of the tangent line to the graph of the function at that point."

2. "Impact" - From Slope to Derivative

The chapter begins to build intuition about derivatives starting from the familiar concept of slope from algebra.

2.1. Simple Case - Linear Function (y = 2x)

  • Easy-to-understand Explanation:

    For a straight line, the slope is constant at every point. The slope is calculated using the "rise over run" formula, or Δy / Δx (change in y divided by change in x). For the function y = 2x, every time x increases by 1 unit, y increases by 2 units. Therefore, the "impact" of x on y is always 2.

  • Diagram of Slope Calculation:

    Choose any two points on the line y=2x, for example, p1 = (1, 2) and p2 = (3, 6).

Text Only
    p1 = (x1, y1) = (1, 2)
    p2 = (x2, y2) = (3, 6)

    Change in x (Δx) = x2 - x1 = 3 - 1 = 2
    Change in y (Δy) = y2 - y1 = 6 - 2 = 4

    Slope = Δy / Δx = 4 / 2 = 2
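
    The same slope calculation as a minimal Python sketch (an illustration, not the book's exact code):

Python
# Slope of the linear function y = 2x between two points on the line.
def f(x):
    return 2 * x

x1, x2 = 1, 3
slope = (f(x2) - f(x1)) / (x2 - x1)
print(slope)  # 2.0 -- the same value for any pair of points on this line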

2.2. The Challenge - Nonlinear Function (y = 2x²)

  • Easy-to-understand Explanation:

    When the graph is a curve, the slope is no longer fixed. If you calculate the slope between x=1 and x=2, you will get a different result than when calculating it between x=3 and x=4. The slope of the curve changes depending on where you are looking. This leads to the question: "What is the slope at a single point?"
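
    A short sketch (illustrative) makes the problem concrete: the measured slope of y = 2x² depends on which interval you choose.

Python
# The slope of y = 2x^2 differs between intervals.
def f(x):
    return 2 * x**2

def slope(x1, x2):
    return (f(x2) - f(x1)) / (x2 - x1)

print(slope(1, 2))  # 6.0
print(slope(3, 4))  # 14.0 -- the curve gets steeper as x grows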

2.3. Numerical Derivative - An Approximation

  • Easy-to-understand Explanation:

    To find the slope at a single point (e.g., at x=1), we can't use the Δy / Δx formula directly because Δx would be 0 (division by zero is undefined). The book introduces a trick: instead of choosing two points far apart, we choose two points that are extremely close to each other. For example, we choose point x1 = 1 and x2 = 1.0001. The distance Δx is now very small (0.0001) but not zero. The slope calculated between these two points is a very good approximation of the true slope of the tangent line at x=1. This is called the numerical derivative; a short code sketch appears at the end of this subsection.

  • Conceptual Comparison:

    This method is a practical implementation of the limit definition of the derivative in calculus.

    The formal definition of the derivative:

    f'(x) = lim (h→0) [f(x+h) - f(x)] / h

    Here, h plays the role of the value p2_delta = 0.0001 that the book uses. Since a computer cannot actually take the limit as h approaches 0, we choose a sufficiently small value for h instead. The book also notes the trade-off: h must be small enough for an accurate approximation, but not so small that it causes floating-point precision errors.
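
    As a sketch (my own illustration in the spirit of the book's p2_delta, not its exact code), the numerical derivative of f(x) = 2x² at x = 1 and the effect of the step size h look like this:

Python
# Numerical derivative of f(x) = 2x^2 at x = 1 using two very close points.
def f(x):
    return 2 * x**2

x1 = 1
delta = 0.0001                      # the tiny step (the book's p2_delta)
x2 = x1 + delta

print((f(x2) - f(x1)) / (x2 - x1))  # ~4.0002, close to the true slope of 4

# The trade-off: the step should be small, but not so small that round-off dominates.
for h in (1e-1, 1e-4, 1e-8, 1e-15):
    print(f"h = {h:.0e} -> {(f(1 + h) - f(1)) / h}")
# h = 1e-04 is already very accurate; h = 1e-15 shows visible floating-point error.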

3. Two Methods for Calculating Derivatives: Numerical vs. Analytical

3.1. Limitations of Numerical Derivative in Neural Networks

  • Easy-to-understand Explanation:

    Although the numerical derivative is easy to understand and program, it is extremely slow for neural networks. Imagine a neural network with 1 million parameters (weights and biases). To update these parameters just once, we would need to:

    1. Calculate the initial loss (1 forward pass).
    2. For each of the 1 million parameters:
        • Nudge the parameter slightly (`+h`).
        • Recalculate the loss (another forward pass).
        • Calculate the approximate derivative.
        • Revert the parameter to its original value.

    This means we would need about 1,000,001 forward passes (1 + 1,000,000) just for a single update step! This "brute-force" approach is infeasible; the sketch after the analogy below illustrates its cost.

  • Abstract Analogy:

    Returning to the blind mountain climber on a million-dimensional "mountain." To find the steepest direction, they would have to take a small step in each of the million dimensions, measure the change in altitude, and only then decide on the composite step. This is incredibly time-consuming.
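
    The sketch below (a stand-in example; the toy loss function and parameter vector are not from the book) shows the loop structure that makes the numerical approach so expensive: one extra forward pass per parameter.

Python
import numpy as np

# Stand-in "network": the loss is just a simple function of a parameter vector.
def loss(params):
    return np.sum(params ** 2)               # placeholder for a full forward pass

def numerical_gradient(params, h=1e-4):
    base = loss(params)                      # 1 forward pass
    grad = np.empty_like(params)
    for i in range(params.size):             # one extra forward pass per parameter
        original = params[i]
        params[i] = original + h             # nudge the parameter slightly
        grad[i] = (loss(params) - base) / h  # approximate derivative
        params[i] = original                 # revert the parameter
    return grad

params = np.random.randn(1_000)              # imagine 1,000,000 instead
print(numerical_gradient(params)[:3])
# With a million parameters this would take ~1,000,001 forward passes per update.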

3.2. Analytical Derivative - The Exact Solution

  • Easy-to-understand Explanation:

    Instead of approximating numerically, the analytical method gives us an exact formula for the derivative. For example, instead of having to approximate the derivative of f(x) = 2x² at x=1 and x=2, we find the general derivative formula f'(x) = 4x. Now, to calculate the derivative at any point, we just plug the number into this formula:

    • At x=1, the derivative is 4 * 1 = 4.
    • At x=2, the derivative is 4 * 2 = 8.

    Evaluating an exact formula like this is orders of magnitude faster than nudging each parameter and re-running the network to approximate the derivative numerically.
  • Basic Derivative Rules (Building Blocks):

    The chapter introduces basic rules to "break down" complex functions into simpler parts and find their derivatives:

    1. Constant Rule: d/dx (c) = 0 (The slope of a horizontal line is 0).
    2. Power Rule: d/dx (x^n) = n * x^(n-1) (This is the most powerful rule).
    3. Constant Multiple Rule: d/dx (c*f(x)) = c * d/dx (f(x)).
    4. Sum/Difference Rule: d/dx (f(x) + g(x)) = d/dx (f(x)) + d/dx (g(x)).
  • Diagram of Analytical Calculation:

    For example, calculating the derivative of f(x) = 3x² + 5x.

Text Only
f(x) = 3x² + 5x

// Apply the Sum Rule to split into 2 parts
f'(x) = d/dx(3x²) + d/dx(5x)

// Consider the first part: d/dx(3x²)
// Apply the Constant Multiple Rule
// -> 3 * d/dx(x²)
// Apply the Power Rule with n=2
// -> 3 * (2 * x^(2-1))
// -> 3 * (2x¹) = 6x

// Consider the second part: d/dx(5x) or d/dx(5x¹)
// Apply the Constant Multiple Rule
// -> 5 * d/dx(x¹)
// Apply the Power Rule with n=1
// -> 5 * (1 * x^(1-1))
// -> 5 * (1 * x⁰)
// -> 5 * 1 = 5

// Combine them
f'(x) = 6x + 5

This is a structured process that allows a computer to find the derivative formula efficiently, instead of having to perform repetitive approximations.
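
To connect the two approaches, a short sketch (illustrative) evaluates the analytical formula f'(x) = 6x + 5 and checks that it agrees with a numerical approximation at a few points:

Python
# Analytical derivative of f(x) = 3x^2 + 5x versus a numerical approximation.
def f(x):
    return 3 * x**2 + 5 * x

def f_prime(x):  # result of the power, constant-multiple, and sum rules above
    return 6 * x + 5

h = 1e-6
for x in (0.0, 1.0, 2.5):
    numerical = (f(x + h) - f(x)) / h
    print(x, f_prime(x), round(numerical, 4))
# The exact formula matches the approximation at every point, with no extra passes.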

4. Summary & Connection
  • Main takeaways:

    1. Optimizing a neural network requires a "smarter" method than random search.
    2. The derivative tells us the "impact" or "slope" of a parameter on the loss, thus indicating the direction to adjust the parameter.
    3. There are two ways to calculate derivatives: Numerical (approximate, slow but easy) and Analytical (exact, fast, the choice for neural networks).
    4. The analytical derivative works by using basic rules (power, sum, constant...) to find an exact derivative formula.
  • Bridge to the Next Chapter:

    This chapter taught us how to calculate derivatives for functions with a single input variable (x). However, a neural network's loss function depends on millions of variables (all the weights and biases). When a function has multiple input variables, calculating the "slope" with respect to each variable individually is called taking the partial derivative. This is the topic of the next chapter and the final stepping stone before we learn about the backpropagation algorithm, where all these concepts will come together.