Math - the derivative of Softmax combined with Cross-Entropy Loss.
This time, we'll work through the mathematics itself, proving each step in detail and explaining the reasoning behind each transformation.
We will go through 3 parts:
- Proving the derivative of the Softmax function.
- Proving the derivative of the Cross-Entropy Loss function.
- Combining both using the Chain Rule to get the final result.
Part 1: Detailed proof of the Softmax Derivative¶
Objective: Calculate \(\frac{\partial \hat{y}_j}{\partial z_k}\), which is the change in an output probability \(\hat{y}_j\) when an input score \(z_k\) changes slightly.
Softmax function formula:
The output probability for class \(j\) is calculated from all the scores \(z\):
\[ \hat{y}_j = \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \]
To calculate this derivative, we must use the Quotient Rule:
If \(f(x) = \frac{u(x)}{v(x)}\), then
\(f'(x) = \frac{u'(x)v(x) - u(x)v'(x)}{[v(x)]^2}\).
Here, \(u = e^{z_j}\) and \(v = \sum_{i=1}^{N} e^{z_i}\). The differentiation will differ depending on whether \(j\) equals \(k\) or not.
Case 1: j = k (derivative of an output with respect to its corresponding input)¶
We are calculating \(\frac{\partial \hat{y}_j}{\partial z_j}\).
- Calculate the derivative of the numerator (\(u'\)): \(\frac{\partial u}{\partial z_j} = \frac{\partial}{\partial z_j} (e^{z_j}) = e^{z_j}\).
- Calculate the derivative of the denominator (\(v'\)): \(\frac{\partial v}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \sum_{i=1}^{N} e^{z_i} \right) = \frac{\partial}{\partial z_j} (e^{z_1} + e^{z_2} + ... + e^{z_j} + ... + e^{z_N})\). In this sum, only the term \(e^{z_j}\) depends on \(z_j\); the derivative of every other term is 0. Thus, \(\frac{\partial v}{\partial z_j} = e^{z_j}\).
- Applying the Quotient Rule:
\[ \frac{\partial \hat{y}_j}{\partial z_j} = \frac{e^{z_j} \cdot \sum_{i=1}^{N} e^{z_i} - e^{z_j} \cdot e^{z_j}}{\left( \sum_{i=1}^{N} e^{z_i} \right)^2} = \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \cdot \frac{\sum_{i=1}^{N} e^{z_i} - e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} = \hat{y}_j (1 - \hat{y}_j) \]
Result 1: \(\frac{\partial \hat{y}_j}{\partial z_j} = \hat{y}_j (1 - \hat{y}_j)\).
Case 2: j ≠ k (derivative of an output with respect to a different input)¶
We are calculating \(\frac{\partial \hat{y}_j}{\partial z_k}\).
- Calculate the derivative of the numerator (\(u'\)): \(\frac{\partial u}{\partial z_k} = \frac{\partial}{\partial z_k} (e^{z_j})\). Since \(j \neq k\), \(e^{z_j}\) is a constant with respect to \(z_k\); thus, the derivative is 0.
- Calculate the derivative of the denominator (\(v'\)): \(\frac{\partial v}{\partial z_k} = \frac{\partial}{\partial z_k} \left( \sum_{i=1}^{N} e^{z_i} \right)\). As above, only the term \(e^{z_k}\) depends on \(z_k\). Thus, \(\frac{\partial v}{\partial z_k} = e^{z_k}\).
- Applying the Quotient Rule:
\[ \frac{\partial \hat{y}_j}{\partial z_k} = \frac{0 \cdot \sum_{i=1}^{N} e^{z_i} - e^{z_j} \cdot e^{z_k}}{\left( \sum_{i=1}^{N} e^{z_i} \right)^2} = - \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \cdot \frac{e^{z_k}}{\sum_{i=1}^{N} e^{z_i}} = - \hat{y}_j \hat{y}_k \]
Result 2: \(\frac{\partial \hat{y}_j}{\partial z_k} = - \hat{y}_j \hat{y}_k\).
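Results 1 and 2 together form the full Jacobian of the Softmax. As a sanity check, here is a minimal NumPy sketch that compares this analytic Jacobian against central finite differences; the helper names (`softmax`, `softmax_jacobian`) and the test scores are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the output.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # Analytic Jacobian from Results 1 and 2:
    #   d y_hat_j / d z_k = y_hat_j * (1 - y_hat_j)  if j == k
    #   d y_hat_j / d z_k = -y_hat_j * y_hat_k       if j != k
    y_hat = softmax(z)
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

# Compare against central finite differences on arbitrary scores.
z = np.array([1.0, 2.0, 0.5, -1.0])
eps = 1e-6
numeric = np.zeros((len(z), len(z)))
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    numeric[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(softmax_jacobian(z), numeric, atol=1e-8))  # True
```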
Part 2: Detailed proof of the Cross-Entropy Loss Derivative¶
Objective: Calculate \(\frac{\partial L}{\partial \hat{y}_k}\), which is the change in Loss when a predicted probability \(\hat{y}_k\) changes.
Cross-Entropy Loss formula:
\[ L = - \sum_{j=1}^{N} y_j \log(\hat{y}_j) \]
Here, \(y_j\) is the ground-truth label, equal to 1 for the correct class and 0 for all others (one-hot encoding).
- Apply the differentiation: \(\frac{\partial L}{\partial \hat{y}_k} = \frac{\partial}{\partial \hat{y}_k} \left( - \sum_{j=1}^{N} y_j \log(\hat{y}_j) \right)\).
- Move the derivative inside the summation: \(\frac{\partial L}{\partial \hat{y}_k} = - \sum_{j=1}^{N} y_j \cdot \frac{\partial}{\partial \hat{y}_k} \left( \log(\hat{y}_j) \right)\).
- Analyze the inner derivative: \(\frac{\partial}{\partial \hat{y}_k} \left( \log(\hat{y}_j) \right)\) is non-zero only when \(j=k\).
  - If \(j=k\): \(\frac{\partial}{\partial \hat{y}_k} \log(\hat{y}_k) = \frac{1}{\hat{y}_k}\).
  - If \(j \neq k\): \(\frac{\partial}{\partial \hat{y}_k} \log(\hat{y}_j) = 0\).
- Simplify the summation: the entire sum \(\sum_{j=1}^{N}\) therefore collapses to the single term at \(j=k\): \(\frac{\partial L}{\partial \hat{y}_k} = - y_k \cdot \frac{1}{\hat{y}_k} = - \frac{y_k}{\hat{y}_k}\).
Result 3: \(\frac{\partial L}{\partial \hat{y}_k} = - \frac{y_k}{\hat{y}_k}\).
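Result 3 can be checked numerically in the same way. Below is a small NumPy sketch (the `cross_entropy` helper and the example vectors are illustrative assumptions) that treats each \(\hat{y}_k\) as a free variable, exactly as in the derivation above:

```python
import numpy as np

def cross_entropy(y, y_hat):
    # L = -sum_j y_j * log(y_hat_j), with y a one-hot ground-truth vector.
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])        # one-hot ground truth
y_hat = np.array([0.2, 0.7, 0.1])    # example predicted probabilities

analytic = -y / y_hat                # Result 3: dL / d y_hat_k = -y_k / y_hat_k

# Central finite differences, perturbing each y_hat_k independently.
eps = 1e-7
numeric = np.zeros_like(y_hat)
for k in range(len(y_hat)):
    d = np.zeros_like(y_hat)
    d[k] = eps
    numeric[k] = (cross_entropy(y, y_hat + d) - cross_entropy(y, y_hat - d)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```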
Part 3: Combining with the Chain Rule - The Convergence Point¶
Now we have all the pieces. We will substitute them into the original Chain Rule formula:
\[ \frac{\partial L}{\partial z_k} = \sum_{j=1}^{N} \frac{\partial L}{\partial \hat{y}_j} \cdot \frac{\partial \hat{y}_j}{\partial z_k} \]
To evaluate this summation, we again split it into two parts: the term at \(j=k\) and the remaining terms with \(j \neq k\). This is the key step of the proof.
- Substitute the values for the \(j=k\) term, using Result 3 and Result 1: \(\left( - \frac{y_k}{\hat{y}_k} \right) \cdot \left( \hat{y}_k (1 - \hat{y}_k) \right) = -y_k(1-\hat{y}_k) = -y_k + y_k \hat{y}_k\).
- Substitute the values for the sum of \(j \neq k\) terms, using Result 3 and Result 2: \(\sum_{j \neq k} \left( - \frac{y_j}{\hat{y}_j} \right) \cdot \left( - \hat{y}_j \hat{y}_k \right) = \sum_{j \neq k} y_j \hat{y}_k = \hat{y}_k \sum_{j \neq k} y_j\).
- Combine everything:
\[ \frac{\partial L}{\partial z_k} = -y_k + y_k \hat{y}_k + \hat{y}_k \sum_{j \neq k} y_j = \hat{y}_k \left( y_k + \sum_{j \neq k} y_j \right) - y_k \]
- Final analysis: look at the expression in the parentheses, \(\left( y_k + \sum_{j \neq k} y_j \right)\). This is precisely the sum of all elements of the ground-truth vector \(y\), i.e., \(\sum_j y_j\). Since \(y\) is a one-hot vector, it contains a single 1 and the rest are 0s, so the sum of all its elements is always 1.
\[ \sum_j y_j = 1 \]
- Final result: substituting 1 into the expression above, we get:
\[ \frac{\partial L}{\partial z_k} = \hat{y}_k \cdot 1 - y_k = \hat{y}_k - y_k \]
Thus, we have proven, in full detail, that the combined gradient is simply Prediction - Truth.
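As a final check, here is a minimal NumPy sketch (the helper names and the example scores/targets are illustrative assumptions) comparing the analytic gradient \(\hat{y} - y\) with a finite-difference gradient of the full Softmax-plus-Cross-Entropy pipeline:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # Cross-entropy of the softmax output against the one-hot target y.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.3, 1.5])   # arbitrary scores (logits)
y = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot ground truth

analytic = softmax(z) - y              # the final result: prediction minus truth

# Central finite differences on the composed loss.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    d = np.zeros_like(z)
    d[k] = eps
    numeric[k] = (loss(z + d, y) - loss(z - d, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))  # True
```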