Math - the derivative of Softmax combined with Cross-Entropy Loss.
This time, we'll work through the mathematics itself, proving each step in detail and explaining the reasoning behind each transformation.
We will go through 3 parts:
- Proving the derivative of the Softmax function.
- Proving the derivative of the Cross-Entropy Loss function.
- Combining both using the Chain Rule to get the final result.
Part 1: Detailed proof of the Softmax Derivative¶
Objective: Calculate \(\frac{\partial \hat{y}_j}{\partial z_k}\), which is the change in an output probability \(\hat{y}_j\) when an input score \(z_k\) changes slightly.
Softmax function formula:
The output probability for class \(j\) is calculated from all the scores \(z\):
\[ \hat{y}_j = \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \]
To calculate this derivative, we must use the Quotient Rule:
If \(f(x) = \frac{u(x)}{v(x)}\), then
\(f'(x) = \frac{u'(x)v(x) - u(x)v'(x)}{[v(x)]^2}\).
Here, \(u = e^{z_j}\) and \(v = \sum_{i=1}^{N} e^{z_i}\). The differentiation will differ depending on whether \(j\) equals \(k\) or not.
Case 1: j = k (derivative of an output with respect to its corresponding input)¶
We are calculating \(\frac{\partial \hat{y}_j}{\partial z_j}\).
- Calculate the derivative of the numerator (\(u'\)): \(\frac{\partial u}{\partial z_j} = \frac{\partial}{\partial z_j} (e^{z_j}) = e^{z_j}\).
- Calculate the derivative of the denominator (\(v'\)): \(\frac{\partial v}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \sum_{i=1}^{N} e^{z_i} \right) = \frac{\partial}{\partial z_j} (e^{z_1} + e^{z_2} + ... + e^{z_j} + ... + e^{z_N})\). In this sum, only the term \(e^{z_j}\) depends on \(z_j\); the derivative of every other term is 0. Thus, \(\frac{\partial v}{\partial z_j} = e^{z_j}\).
- Applying the Quotient Rule:
\[ \frac{\partial \hat{y}_j}{\partial z_j} = \frac{e^{z_j} \cdot \sum_{i=1}^{N} e^{z_i} - e^{z_j} \cdot e^{z_j}}{\left( \sum_{i=1}^{N} e^{z_i} \right)^2} = \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \cdot \frac{\sum_{i=1}^{N} e^{z_i} - e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} = \hat{y}_j (1 - \hat{y}_j) \]
Result 1: \(\frac{\partial \hat{y}_j}{\partial z_j} = \hat{y}_j (1 - \hat{y}_j)\).
Case 2: j ≠ k (derivative of an output with respect to a different input)¶
We are calculating \(\frac{\partial \hat{y}_j}{\partial z_k}\).
- Calculate the derivative of the numerator (\(u'\)): \(\frac{\partial u}{\partial z_k} = \frac{\partial}{\partial z_k} (e^{z_j})\). Since \(j \neq k\), \(e^{z_j}\) is a constant with respect to \(z_k\); thus, the derivative is 0.
- Calculate the derivative of the denominator (\(v'\)): \(\frac{\partial v}{\partial z_k} = \frac{\partial}{\partial z_k} \left( \sum_{i=1}^{N} e^{z_i} \right)\). As above, only the term \(e^{z_k}\) depends on \(z_k\). Thus, \(\frac{\partial v}{\partial z_k} = e^{z_k}\).
- Applying the Quotient Rule:
\[ \frac{\partial \hat{y}_j}{\partial z_k} = \frac{0 \cdot \sum_{i=1}^{N} e^{z_i} - e^{z_j} \cdot e^{z_k}}{\left( \sum_{i=1}^{N} e^{z_i} \right)^2} = - \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \cdot \frac{e^{z_k}}{\sum_{i=1}^{N} e^{z_i}} = - \hat{y}_j \hat{y}_k \]
Result 2: \(\frac{\partial \hat{y}_j}{\partial z_k} = - \hat{y}_j \hat{y}_k\).
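Results 1 and 2 together form the full Jacobian of the Softmax. As a sanity check, here is a minimal NumPy sketch that compares this analytic Jacobian against central finite differences; the helper names (`softmax`, `softmax_jacobian`) and the test scores are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the output.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # Analytic Jacobian from Results 1 and 2:
    #   d y_hat_j / d z_k = y_hat_j * (1 - y_hat_j)  if j == k
    #   d y_hat_j / d z_k = -y_hat_j * y_hat_k       if j != k
    y_hat = softmax(z)
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

# Compare against central finite differences on arbitrary scores.
z = np.array([1.0, 2.0, 0.5, -1.0])
eps = 1e-6
numeric = np.zeros((len(z), len(z)))
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    numeric[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(softmax_jacobian(z), numeric, atol=1e-8))  # True
```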
Part 2: Detailed proof of the Cross-Entropy Loss Derivative¶
Objective: Calculate \(\frac{\partial L}{\partial \hat{y}_k}\), which is the change in Loss when a predicted probability \(\hat{y}_k\) changes.
Cross-Entropy Loss formula:
\[ L = - \sum_{j=1}^{N} y_j \log(\hat{y}_j) \]
Here, \(y_j\) is the ground-truth label, equal to 1 for the correct class and 0 for all others (one-hot encoding).
- Apply the differentiation: \(\frac{\partial L}{\partial \hat{y}_k} = \frac{\partial}{\partial \hat{y}_k} \left( - \sum_{j=1}^{N} y_j \log(\hat{y}_j) \right)\).
- Move the derivative inside the summation: \(\frac{\partial L}{\partial \hat{y}_k} = - \sum_{j=1}^{N} y_j \cdot \frac{\partial}{\partial \hat{y}_k} \left( \log(\hat{y}_j) \right)\).
- Analyze the inner derivative: \(\frac{\partial}{\partial \hat{y}_k} \left( \log(\hat{y}_j) \right)\) is non-zero only when \(j=k\).
  - If \(j=k\): \(\frac{\partial}{\partial \hat{y}_k} \log(\hat{y}_k) = \frac{1}{\hat{y}_k}\).
  - If \(j \neq k\): \(\frac{\partial}{\partial \hat{y}_k} \log(\hat{y}_j) = 0\).
- Simplify the summation: the entire sum \(\sum_{j=1}^{N}\) therefore collapses to the single term at \(j=k\): \(\frac{\partial L}{\partial \hat{y}_k} = - y_k \cdot \frac{1}{\hat{y}_k} = - \frac{y_k}{\hat{y}_k}\).
Result 3: \(\frac{\partial L}{\partial \hat{y}_k} = - \frac{y_k}{\hat{y}_k}\).
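Result 3 can be checked numerically in the same way. Below is a small NumPy sketch (the `cross_entropy` helper and the example vectors are illustrative assumptions) that treats each \(\hat{y}_k\) as a free variable, exactly as in the derivation above:

```python
import numpy as np

def cross_entropy(y, y_hat):
    # L = -sum_j y_j * log(y_hat_j), with y a one-hot ground-truth vector.
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])        # one-hot ground truth
y_hat = np.array([0.2, 0.7, 0.1])    # example predicted probabilities

analytic = -y / y_hat                # Result 3: dL / d y_hat_k = -y_k / y_hat_k

# Central finite differences, perturbing each y_hat_k independently.
eps = 1e-7
numeric = np.zeros_like(y_hat)
for k in range(len(y_hat)):
    d = np.zeros_like(y_hat)
    d[k] = eps
    numeric[k] = (cross_entropy(y, y_hat + d) - cross_entropy(y, y_hat - d)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```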
Part 3: Combining with the Chain Rule - The Convergence Point¶
Now we have all the pieces. We will substitute them into the original Chain Rule formula:
\[ \frac{\partial L}{\partial z_k} = \sum_{j=1}^{N} \frac{\partial L}{\partial \hat{y}_j} \cdot \frac{\partial \hat{y}_j}{\partial z_k} \]
To evaluate this summation, we again split it into two parts: the term at \(j=k\) and the remaining terms with \(j \neq k\). This is the key step of the proof.
- Substitute the values for the \(j=k\) term, using Result 3 and Result 1: \(\left( - \frac{y_k}{\hat{y}_k} \right) \cdot \left( \hat{y}_k (1 - \hat{y}_k) \right) = -y_k(1-\hat{y}_k) = -y_k + y_k \hat{y}_k\).
- Substitute the values for the sum of \(j \neq k\) terms, using Result 3 and Result 2: \(\sum_{j \neq k} \left( - \frac{y_j}{\hat{y}_j} \right) \cdot \left( - \hat{y}_j \hat{y}_k \right) = \sum_{j \neq k} y_j \hat{y}_k = \hat{y}_k \sum_{j \neq k} y_j\).
- Combine everything:
\[ \frac{\partial L}{\partial z_k} = -y_k + y_k \hat{y}_k + \hat{y}_k \sum_{j \neq k} y_j = \hat{y}_k \left( y_k + \sum_{j \neq k} y_j \right) - y_k \]
- Final analysis: look at the expression in the parentheses, \(\left( y_k + \sum_{j \neq k} y_j \right)\). This is precisely the sum of all elements of the ground-truth vector \(y\), i.e., \(\sum_j y_j\). Since \(y\) is a one-hot vector, it contains a single 1 and the rest are 0s, so the sum of all its elements is always 1.
\[ \sum_j y_j = 1 \]
- Final result: substituting 1 into the expression above, we get:
\[ \frac{\partial L}{\partial z_k} = \hat{y}_k \cdot 1 - y_k = \hat{y}_k - y_k \]
Thus, we have proven, in full detail, that the combined gradient is simply Prediction - Truth.
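As a final check, here is a minimal NumPy sketch (the helper names and the example scores/targets are illustrative assumptions) comparing the analytic gradient \(\hat{y} - y\) with a finite-difference gradient of the full Softmax-plus-Cross-Entropy pipeline:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # Cross-entropy of the softmax output against the one-hot target y.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.3, 1.5])   # arbitrary scores (logits)
y = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot ground truth

analytic = softmax(z) - y              # the final result: prediction minus truth

# Central finite differences on the composed loss.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    d = np.zeros_like(z)
    d[k] = eps
    numeric[k] = (loss(z + d, y) - loss(z - d, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))  # True
```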