
Introducing Optimization

This chapter marks a significant turning point in the book. Having built a complete neural network capable of:

  1. Forward Pass: Taking input and calculating an output.
  2. Loss Calculation: Measuring the network's degree of error.

...the next big question is: "How does the network learn and improve on its own?" Chapter 6 answers this question by introducing Optimization: the process of adjusting parameters (weights and biases) to minimize the loss function.

The purpose of this chapter is not to immediately present the best method, but to help the reader understand the nature of the problem by experimenting with two "naive" methods and recognizing their limitations.


Part 1: Analysis of the Core Concept - Optimization

1. Simple Explanation

Optimization in machine learning is the journey of finding the "best" set of parameters (weights and biases) for a model. "Best" here means the set of parameters that makes the loss function's value as small as possible.

2. Abstract Analogy: The Journey to the Bottom of the Valley

Imagine:

  • The Earth's surface is the Parameter Space: Each point on the surface (defined by longitude, latitude) corresponds to a unique set of weights and biases.
  • The altitude at each point is the Loss Value: A place with high altitude is where the model predicts very poorly (high loss). A place with low altitude is where the model predicts well (low loss).
  • Our goal is to find the point with the lowest altitude on the entire map (the bottom of the deepest chasm) – which is the Global Minimum.

Optimization is the process of finding the path down to the lowest point in this valley.


Part 2: Analysis of the 2 "Naive" Optimization Methods in the Book

The book approaches the problem by proposing two very intuitive but inefficient strategies.

Method 1: Randomly Generating New Weights (Random Search)

  • How it works (see the code sketch below):

    1. Generate a completely random set of weights and biases.
    2. Use this set of parameters to calculate the loss on the entire dataset.
    3. If this loss is lower than the lowest loss ever recorded, save this set of parameters.
    4. Repeat from step 1 many times.
  • Abstract Analogy: "Blind Parachuting"

    Imagine you are on an airplane and want to find the lowest point in a vast mountain range below. This method is like randomly parachuting to an arbitrary location, measuring the altitude, then getting back on the plane and jumping to another random location. After thousands of jumps, you hope that one of them will land at the bottom of the valley.

  • Results in the book:

    • Extremely inefficient. After 1 billion iterations, the loss decreased insignificantly, and the accuracy remained almost unchanged.
    • Reason: The parameter space of a neural network (even a small one) is extremely large. The probability of success with a blind "trial and error" approach is incredibly small, like finding a needle in a haystack.
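
To make the procedure concrete, here is a minimal sketch of such a random search using only NumPy. The tiny two-feature, two-class dataset and the single-layer "network" below are illustrative assumptions, not the book's Dense/ReLU/Softmax implementation or its spiral data:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy data: 100 samples, 2 features, 2 classes (an assumption)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def loss_for(weights, biases):
    # Single dense layer -> softmax -> categorical cross-entropy
    logits = X @ weights + biases
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    correct_conf = np.clip(probs[np.arange(len(y)), y], 1e-7, 1.0)
    return -np.mean(np.log(correct_conf))

lowest_loss = 9999999  # any real loss will beat this initial value
best_w, best_b = None, None

for i in range(10000):
    # Step 1: a completely new random parameter set every iteration
    w = 0.05 * rng.normal(size=(2, 2))
    b = 0.05 * rng.normal(size=(1, 2))
    # Steps 2-3: evaluate it and keep it only if it beats the best so far
    loss = loss_for(w, b)
    if loss < lowest_loss:
        lowest_loss = loss
        best_w, best_b = w.copy(), b.copy()

print("Best loss found by pure random search:", lowest_loss)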

Method 2: Randomly Adjusting the Current Best Weights (Random Local Search)

  • How it works (see the code sketch below):

    1. Start with the best set of weights and biases found so far (best_weights).
    2. Create a new set of weights/biases by adding a small random value to the best_weights set.
    3. Calculate the loss with this new set of parameters.
    4. If the new loss is lower: Update best_weights with the new set of parameters.
    5. If the new loss is higher: Discard the change, revert to the old best_weights.
    6. Repeat from step 2.
  • Abstract Analogy: "The Blindfolded Mountaineer"

    Now, instead of parachuting, you have been dropped onto a hillside. You are blindfolded and don't know which direction is downhill. You do the following:

    1. Take a small trial step in a random direction.
    2. If you feel you are going down (altitude decreases), stay in the new position.
    3. If you feel you are going up or staying level (altitude increases or stays the same), return to your previous position and try a step in another random direction.

  • Results in the book:

    • With simple data (vertical_data): It worked much better! The loss decreased significantly, and the accuracy reached ~93%.
    • With complex data (spiral_data): It failed almost completely. It got stuck in a Local Minimum.
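
A matching sketch of the random local search, under the same illustrative assumptions (toy two-feature data and a single dense layer rather than the book's full network):

import numpy as np

rng = np.random.default_rng(0)

# Same illustrative toy data and loss as in the random-search sketch above
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def loss_for(weights, biases):
    logits = X @ weights + biases
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.clip(probs[np.arange(len(y)), y], 1e-7, 1.0)))

# Step 1: start from an initial guess and remember it as the best so far
best_w = 0.05 * rng.normal(size=(2, 2))
best_b = 0.05 * rng.normal(size=(1, 2))
lowest_loss = loss_for(best_w, best_b)

for i in range(10000):
    # Step 2: nudge the best parameters by a small random amount
    w = best_w + 0.05 * rng.normal(size=best_w.shape)
    b = best_b + 0.05 * rng.normal(size=best_b.shape)
    # Steps 3-5: keep the nudge only if the loss improves; otherwise discard it
    loss = loss_for(w, b)
    if loss < lowest_loss:
        lowest_loss, best_w, best_b = loss, w, b

print("Best loss found by random local search:", lowest_loss)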

Part 3: Illustrative Diagrams of Calculations (ASCII Art)

Diagram 1: General Calculation Flow (Forward Pass & Loss)
+----------------+
|    Input X     |
+----------------+
        |
        V
+----------------+
|    Dense 1     |
|   (w1, b1)     |
+----------------+
        |
        V
+----------------+
| Activation ReLU|
+----------------+
        |
        V
+----------------+
|    Dense 2     |
|   (w2, b2)     |
+----------------+
        |
        V
+------------------+
|Activation Softmax|
+------------------+
        |
        V
+----------------+     +----------------+
|   Predictions  |     |   True Labels  |
|     (y_pred)   |     |      (y)       |
+----------------+     +----------------+
        |                    |
        +--------+   +-------+
                 |   |
                 V   V
          +----------------+
          |  Loss Function |
          | (CrossEntropy) |
          +----------------+
                 |
                 V
          +----------------+
          |   Final Loss   |
          +----------------+
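
The same flow can be sketched in NumPy as a single function. The layer sizes and the toy batch below are arbitrary assumptions for illustration, not the book's class-based implementation:

import numpy as np

rng = np.random.default_rng(0)

def forward_and_loss(X, y, w1, b1, w2, b2):
    # Dense 1 -> ReLU
    z1 = X @ w1 + b1
    a1 = np.maximum(0, z1)
    # Dense 2 -> Softmax
    z2 = a1 @ w2 + b2
    exp = np.exp(z2 - z2.max(axis=1, keepdims=True))
    y_pred = exp / exp.sum(axis=1, keepdims=True)
    # Categorical cross-entropy between predictions (y_pred) and true labels (y)
    correct_conf = np.clip(y_pred[np.arange(len(y)), y], 1e-7, 1.0)
    return -np.mean(np.log(correct_conf))

# Toy batch: 5 samples, 2 features, 4 hidden units, 3 classes (assumed sizes)
X = rng.normal(size=(5, 2))
y = rng.integers(0, 3, size=5)
w1, b1 = 0.01 * rng.normal(size=(2, 4)), np.zeros((1, 4))
w2, b2 = 0.01 * rng.normal(size=(4, 3)), np.zeros((1, 3))
print("Final loss:", forward_and_loss(X, y, w1, b1, w2, b2))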

Diagram 2: Calculation Flow of Method 1 (Random Search)
         +----------+
         |  Start   |
         +----------+
              |
              V
+-------------+------------------+
| init lowest_loss = 9999999     |
+--------------------------------+
              |
              |   +--------------------------------------+
              +-->|          Loop (10000 times)          |
                  +--------------------------------------+
                                |
                                V
                  +--------------------------------------+
                  | Generate COMPLETELY NEW weights/biases|
                  +--------------------------------------+
                                |
                                V
                  +--------------------------------------+
                  |   Perform Forward Pass & calc Loss   |
                  +--------------------------------------+
                                |
                                V
                  +--------------------------------------+
                  |      loss < lowest_loss ?            |
                  +------------------+-------------------+
                                     |
                       +-------------+-------------+
                       |                           |
                       V (Yes)                     V (No)
        +----------------------------+
        | lowest_loss = loss         |         (Do nothing)
        | Save the weights/biases    |
        +----------------------------+
                       |                           |
                       +-------------+-------------+
                                     |
                                     | (Return to start of Loop)
                                     +----------------------+
                                                            |
(After loop ends)                                           |
              |                                             |
              V                                             ^
         +----------+                                       |
         |   End    |<--------------------------------------+
         +----------+

Diagram 3: Calculation Flow of Method 2 (Random Local Search)
          +----------+
          |  Start   |
          +----------+
               |
               V
+--------------+-------------------+
| init best_weights/biases          |
| init lowest_loss = 9999999        |
+-----------------------------------+
               |
               |   +-----------------------------------------+
               +-->|             Loop (10000 times)          |
                   +-----------------------------------------+
                                 |
                                 V
                   +-----------------------------------------+
                   | weights += small_random_value           |
                   | biases  += small_random_value           |
                   +-----------------------------------------+
                                 |
                                 V
                   +-----------------------------------------+
                   |  Perform Forward Pass & calc Loss       |
                   +-----------------------------------------+
                                 |
                                 V
                   +-----------------------------------------+
                   |         loss < lowest_loss ?            |
                   +-------------------+---------------------+
                                       |
                         +-------------+-------------+
                         |                           |
                         V (Yes)                     V (No)
         +------------------------------+  +--------------------------------+
         | lowest_loss = loss           |  | Revert:                        |
         | best_weights = weights.copy()|  | weights = best_weights.copy()  |
         | best_biases = biases.copy()  |  | biases = best_biases.copy()    |
         +------------------------------+  +--------------------------------+
                         |                           |
                         +-------------+-------------+
                                       |
                                       | (Return to start of Loop)
                                       +-----------------------+
                                                               |
(After loop ends)                                              |
               |                                               |
               V                                               ^
          +----------+                                         |
          |   End    |<----------------------------------------+
          +----------+

Part 4: Conclusion and Lessons Learned

Chapter 6 plays an excellent pedagogical role. By guiding the reader through "naive" methods, it highlights the core challenges of training a neural network:

  1. The inefficiency of blind search: The parameter space is too vast to be searched randomly.
  2. The Local Minima problem: Local search methods can easily get "stuck," preventing the model from reaching optimal performance.

This chapter sets the perfect stage for the following chapters, where a "smarter" method will be introduced. Instead of stepping randomly, we need a way to determine which direction will decrease the loss fastest. That direction is the gradient, and the method that uses it is called Gradient Descent – the gold standard in neural network optimization.
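
As a one-dimensional preview of that idea (a toy example, not the book's implementation): for the loss L(w) = (w - 3)^2 the gradient is 2*(w - 3), and repeatedly stepping against the gradient walks straight to the minimum.

w = 0.0
learning_rate = 0.1
for step in range(50):
    gradient = 2 * (w - 3)         # direction of steepest increase of the loss
    w -= learning_rate * gradient  # step the opposite way to decrease the loss
print(w)  # converges toward 3, the minimum of the toy loss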

Conceptual References:

  • Optimization Algorithms: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Chapter 8).
  • Local Minima: A fundamental concept in the fields of mathematical optimization and computer science.