Softmax

Explanation of Softmax in Neural Networks

Part 1: Why Do We Need Softmax? What's the Problem?

In previous chapters, you may have become familiar with the ReLU (Rectified Linear Unit) activation function. ReLU is excellent for hidden layers but has a few problems if used for the final output layer of a classification network:

  1. Unbounded: The output of ReLU can be any positive number (e.g., [12, 99, 318]). These numbers don't mean much on their own. 318 is larger than 99, but "how much" larger? Is it significantly more "certain"? We don't have a "context" for comparison.
  2. Not Normalized: The output values have no overall relationship. Their sum doesn't equal any fixed number.
  3. Exclusive: Each neuron's output is independent of the other neurons, so the values carry no information about how the classes relate to one another — a problem when we want them to represent competing class confidences.

Our Goal: For a classification problem, we want the neural network to "tell" us which class it "thinks" the input belongs to, with a clear level of confidence. For example, with 3 classes (dog, cat, bird), we want an output that looks like [0.05, 0.9, 0.05], which means: "I am 90% certain this is a cat, 5% it's a dog, and 5% it's a bird."

=> Softmax was created to solve this problem. It takes any real numbers (positive, negative, large, or small) and transforms them into a probability distribution. The characteristics of this probability distribution are:

  • All output values are in the range [0, 1].
  • The sum of all output values always equals 1.

These values are precisely the confidence scores we need.

Section 2: "Anatomy" of the Softmax Formula

The formula in the book might look intimidating:

\[ S_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}} \]

Don't worry, let's break it down into two extremely simple steps:

Step 1: Exponentiation - The Numerator e^z

  • z are the output values from the previous layer (e.g., layer_outputs = [4.8, 1.21, 2.385]).
  • e is Euler's number (approximately 2.71828), the base of the natural logarithm.
  • "Exponentiation" simply means raising e to the power of those z values. In Python, we use E ** output or math.exp(output).
Python
# Example from the book
layer_outputs = [4.8, 1.21, 2.385]
E = 2.71828182846  # Euler's number

# Calculate e^z for each value in the layer's output
exp_values = [E ** output for output in layer_outputs]
# Result: [121.51, 3.35, 10.86]

Why do this step?

  1. Eliminate Negative Numbers: e raised to any power always results in a positive number. This is crucial because probabilities cannot be negative.
  2. Amplify Differences: The exponential function makes large values overwhelmingly larger than small values. The value 4.8 is only about 2 times larger than 2.385, but after exponentiation, 121.51 is more than 11 times larger than 10.86! This helps the network become more "confident" in the prediction with the highest score (see the quick check below).
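
A quick way to verify that amplification, using Python's standard math module (a minimal sketch; the variable names a and b are just for illustration):

Python
import math

a, b = 4.8, 2.385
print(a / b)                      # ~2.01: ratio of the raw values
print(math.exp(a) / math.exp(b))  # ~11.19: ratio after exponentiation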

Step 2: Normalization - The Division

After getting the exponentiated values (exp_values), we just need to do one thing:

  1. Calculate the sum of all those values (the denominator \(\sum_{l=1}^{L} e^{z_{i,l}}\)).
  2. Divide each value by the sum you just calculated.
Python
# Continuing the example above
exp_values = [121.51, 3.35, 10.86]

# 1. Calculate the sum
norm_base = sum(exp_values) # 121.51 + 3.35 + 10.86 = 135.72

# 2. Divide each value by the sum
norm_values = [
    121.51 / norm_base, # ~0.895
    3.35 / norm_base,   # ~0.025
    10.86 / norm_base   # ~0.080
]

# Result: [0.895, 0.025, 0.080]
# Check: 0.895 + 0.025 + 0.080 = 1.0

And that's it! We have transformed [4.8, 1.21, 2.385] into a probability distribution [0.895, 0.025, 0.080].
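
Putting both steps together, a single-sample Softmax can be written as a short plain-Python function. This is a minimal sketch (the name softmax and the use of math.exp are my own choices, not taken from the book's code):

Python
import math

def softmax(outputs):
    # Step 1: exponentiate every raw output value
    exp_values = [math.exp(z) for z in outputs]
    # Step 2: normalize by the sum of the exponentiated values
    total = sum(exp_values)
    return [value / total for value in exp_values]

print(softmax([4.8, 1.21, 2.385]))  # [0.895..., 0.024..., 0.080...]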

Section 3: Optimization with NumPy and Batch Processing

In practice, we don't process data samples one by one but in an entire batch to speed things up. A batch of data will be in the form of a matrix, where each row is the output for one sample.

Python
import numpy as np

# A batch of 3 samples, each with 3 outputs
layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

Now, we need to calculate the Softmax for each row individually. This is where NumPy's axis and keepdims parameters shine.

  • np.exp(layer_outputs): NumPy is smart and will automatically calculate the exponential for every element in the matrix.
  • np.sum(..., axis=1): We need to calculate the sum of values along each row.
    • axis=0: calculates the sum down the columns.
    • axis=1: calculates the sum across the rows. This is what we need.
  • keepdims=True: Without it, summing along axis=1 produces a 1-D array of shape (3,) (for example, summing layer_outputs itself this way gives [8.395, 7.29, 2.487]). If we then divide the (3, 3) matrix of exponentiated values by a (3,) array, NumPy broadcasts that array across each row, dividing the columns by those values rather than dividing each row by its own sum. keepdims=True preserves the dimensions, turning the result into a column vector such as [[8.395], [7.29], [2.487]] with a shape of (3, 1). Now NumPy correctly divides the (3, 3) matrix by the (3, 1) column vector: each row of the matrix is divided by the corresponding value in the column vector. A complete batched example follows this list.
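
Putting those pieces together, a batched Softmax in NumPy might look like this sketch (the variable name probabilities is my own choice):

Python
import numpy as np

layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

# Exponentiate every element, then normalize each row by its own sum
exp_values = np.exp(layer_outputs)
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

print(probabilities)                  # each row is a probability distribution
print(np.sum(probabilities, axis=1))  # [1. 1. 1.]
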
Section 4: The "Secret Trick" for Overflow Prevention

The exponential function e^x grows very rapidly. If the input z is a large number (e.g., 1000), np.exp(1000) will return inf (infinity), causing an overflow error and breaking the entire calculation.

The Solution: We can subtract any arbitrary number C from all input values z without changing the final result of the Softmax. Why? Because of the properties of exponents and division:

\[ \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{e^{z_1} \cdot e^{-C}}{e^{z_1} \cdot e^{-C} + e^{z_2} \cdot e^{-C}} = \frac{e^{z_1 - C}}{e^{z_1 - C} + e^{z_2 - C}} \]

So, what number should we subtract? The largest number (max) among the input values of that row.

Python
inputs = [1, 2, 3]
max_value = 3
shifted_inputs = [1-3, 2-3, 3-3] # -> [-2, -1, 0]

The benefits of this are:

  1. The largest value after subtraction will be 0, and e^0 = 1.
  2. All other values will be negative, and e raised to a negative power is always a number between 0 and 1.
  3. The input to the exp function is therefore never a large positive number, which completely prevents overflow errors.
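
A quick sanity check that subtracting the row maximum leaves the output unchanged (a small sketch; the values are chosen only for illustration):

Python
import numpy as np

inputs = np.array([[1.0, 2.0, 3.0]])

plain = np.exp(inputs) / np.sum(np.exp(inputs), axis=1, keepdims=True)
shifted_exp = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
shifted = shifted_exp / np.sum(shifted_exp, axis=1, keepdims=True)

print(plain)    # [[0.09003057 0.24472847 0.66524096]]
print(shifted)  # identical result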

This is exactly why in the final code snippet of the book, you will see this line:

Python
exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))

This is the complete, safe, and efficient version of Softmax.
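
To see why this matters, here is a sketch of the stable version applied to inputs that would otherwise overflow (the function name softmax_stable is my own, not from the book):

Python
import numpy as np

def softmax_stable(inputs):
    # Subtract each row's maximum so the largest exponent is e^0 = 1
    exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

big_inputs = np.array([[1000.0, 1001.0, 1002.0]])
# np.exp(1000) on its own overflows to inf, but the shifted version is safe
print(softmax_stable(big_inputs))  # [[0.09003057 0.24472847 0.66524096]]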


Part 2: Abstract & Easy-to-Understand Explanation

Forget the math formulas and code. Imagine Softmax is a "Confidence Distribution Machine" for a talent competition.

1. The Preliminary Round - Raw Scores

Suppose there are 3 contestants (Dog, Cat, Bird) in a competition. The judges (the preceding neural network layer) give them raw scores. These scores can be messy: Raw Scores = [4.8, 1.21, 2.385]

Looking at these scores, we know the Dog contestant has the highest score, but how much "higher"? Is the lead "overwhelming"? It's hard to say.

2. Step 1: The "Hype" Machine - Exponentiation

To make the results clearer, the MC puts these scores into a "Hype" Machine. This machine has two functions:

  • No Negative Scores: It turns all scores into "enthusiasm" points (always positive).
  • Hypes Up the Best: This machine is extremely "biased." Whoever already has a high score gets hyped up to the moon, while those with low scores only get a slight boost.

After passing through the "Hype" Machine (i.e., e^x): Hype Scores = [121.5, 3.4, 10.9]

Now the difference is crystal clear! The Dog contestant not only has a higher score but is completely dominating the rest.

3. Step 2: Slicing the "Confidence Pie" - Normalization

Now, to make it easy for the audience to understand, the MC decides to stop using hype scores and instead divide a 100% "confidence pie" among the 3 contestants, based on the ratio of their hype scores.

  • Total Hype Score = 121.5 + 3.4 + 10.9 = 135.8
  • Dog's Slice of the Pie: 121.5 / 135.8 ≈ 89.5%
  • Cat's Slice of the Pie: 3.4 / 135.8 ≈ 2.5%
  • Bird's Slice of the Pie: 10.9 / 135.8 ≈ 8.0%

The Final Result: Confidence Levels = [0.895, 0.025, 0.080]

This is the output of Softmax. It gives us a very clear conclusion: "Based on the performance, I am 89.5% confident that the winner is the Dog."

Regarding the "overflow prevention trick": Imagine a judge gets overly excited and gives a score of 1000. The "Hype" Machine would "burn out" (overflow). A clever MC realizes that what matters is the difference between scores, not the absolute scores themselves. So, before putting them in the machine, he finds the highest score (1000) and subtracts it from everyone's score. The final result after slicing the pie remains exactly the same, but the hype machine is saved!

In summary, Softmax does two things:

  1. It uses the exponential function e^x to amplify the highest score, making it a clear "front-runner."
  2. It normalizes those amplified scores into a percentage (or probability), so that they all add up to 1.