Diagram
1. Overview Diagram – Main Functional Flow
This diagram provides an overall view of how data moves through each component.
          +-------------+           +-------------+           +--------------------+
Input (x) |             |   Logits  |             |   Probs.  |                    |
--------> | Dense Layer | --------> |   Softmax   | --------> | Cross-Entropy Loss | ---> Loss (L)
          |             |    (z)    |             |    (ŷ)    |                    |
          +-------------+           +-------------+           +---------^----------+
                                                                        |
                                                                        |
                                                                 y (True Labels)
Explanation:
- x goes through the Dense Layer to produce the raw scores, the Logits (z).
- z passes through the Softmax function to become the probabilities, Probs. (ŷ).
- Both ŷ and the ground-truth labels y are fed into the Cross-Entropy Loss function to compute the final error, Loss (L).
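To make the flow concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes, random initialization, and variable names are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for this sketch: 4 input features, 3 classes.
n_in, n_classes = 4, 3

# Dense layer parameters (random initialization, purely illustrative).
W = rng.normal(size=(n_in, n_classes))
b = np.zeros(n_classes)

def dense(x):
    """Dense Layer: produces the raw scores (Logits) z."""
    return x @ W + b

def softmax(z):
    """Softmax: turns logits into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y):
    """Cross-Entropy Loss: L = -sum_j y_j * log(y_hat_j)."""
    return -np.sum(y * np.log(y_hat))

x = rng.normal(size=n_in)      # input
y = np.array([0.0, 1.0, 0.0])  # one-hot true label

z = dense(x)                   # Logits
y_hat = softmax(z)             # Probs.
L = cross_entropy(y_hat, y)    # final scalar Loss
print(y_hat, L)
```

Each arrow in the diagram corresponds to one function call: `dense` → `softmax` → `cross_entropy`.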
2. Detailed Dependency Diagram – "Opening the Black Boxes"
This is the most important part for understanding the Chain Rule. We’ll examine each dependency individually.
Part A: Dependencies from Logits (z) to Predictions (ŷ)
This is a Many-to-Many relationship. Because of the shared denominator in the Softmax formula, a change in any z affects all of the ŷ values.
+------+        +--------------------------+        +------+
|  z₁  | -----> |                          | -----> |  ŷ₁  |
+------+        |                          |        +------+
                |   SOFTMAX INTERACTION    |
+------+        |   (Shared Denominator)   |        +------+
|  z₂  | -----> |                          | -----> |  ŷ₂  |
+------+        |                          |        +------+
                |                          |
+------+        |                          |        +------+
|  z₃  | -----> |                          | -----> |  ŷ₃  |
+------+        +--------------------------+        +------+
 LOGITS           (Dense interdependency)         PREDICTIONS
Explanation:
- Think of the "SOFTMAX INTERACTION" box as the place where all the values \(e^{z_i}\) are summed to form the common denominator.
- Since all ŷ values share this denominator, they are interdependent. Changing z₁ will affect the denominator, and thus alter ŷ₁, ŷ₂, and ŷ₃.
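A quick numerical check of this interdependence (the logit values are arbitrary examples):

```python
import numpy as np

def softmax(z):
    # The shared denominator: every output divides by the same sum of exponentials.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])  # arbitrary example logits
y_hat = softmax(z)

# Nudge ONLY z1 and recompute: all three outputs move, not just y_hat[0],
# because the common denominator changes.
z_nudged = z.copy()
z_nudged[0] += 0.5
y_hat_nudged = softmax(z_nudged)

print(y_hat)
print(y_hat_nudged)
```

Comparing the two printed vectors shows that ŷ₂ and ŷ₃ shift even though only z₁ was touched.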
Part B: Dependencies from Predictions (ŷ) to Loss (L)
This is a Many-to-One relationship. All of the probabilities ŷ enter a single overall formula that produces one final Loss value.
+------+
|  ŷ₁  | ------------.
+------+             |
                     v
+------+          +------+
|  ŷ₂  | -------> |  L   | ---> Final Loss
+------+          +------+
                     ^
+------+             |
|  ŷ₃  | ------------'
+------+
PREDICTIONS      (Convergence)
Explanation:
- Each ŷ contributes to the formula \(L = - \sum_j y_j \log(\hat{y}_j)\), so all of them influence the final result L.
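A small sketch of this convergence (the numbers are made up for illustration):

```python
import numpy as np

def cross_entropy(y_hat, y):
    # Many-to-one: all the probabilities enter one sum, producing a single scalar.
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.2, 0.7, 0.1])  # example predictions
y = np.array([0.0, 1.0, 0.0])      # one-hot true label

L = cross_entropy(y_hat, y)
# With a one-hot y the sum collapses to -log(y_hat_true) = -log(0.7);
# the other probabilities still matter upstream, through the Softmax denominator.
print(L)
```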
3. Combined Diagram – The Path of the Chain Rule
Now, visualize the path through which a small change in \(z_1\) propagates to L. It must pass through ALL of the \(\hat{y}_j\).
     .------------> ŷ₁ ---.
    /                      \
   /                        \
z₁ ---[SOFTMAX]---> ŷ₂ ---> L
   \                        /
    \                      /
     '------------> ŷ₃ ---'
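This path-summing picture can be checked numerically: compute each path's contribution \(\frac{\partial L}{\partial \hat{y}_j} \cdot \frac{\partial \hat{y}_j}{\partial z_1}\), add the three contributions, and compare against the effect of an actual tiny nudge to \(z_1\). The example values are arbitrary; the Jacobian \(\partial \hat{y}_j / \partial z_k = \hat{y}_j(\delta_{jk} - \hat{y}_k)\) is the standard Softmax derivative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, 2.0, 3.0])  # arbitrary example logits
y = np.array([0.0, 1.0, 0.0])  # one-hot true label
y_hat = softmax(z)

# Contribution of each path j for a nudge to z1 (index k = 0):
#   (dL/dy_hat_j) * (dy_hat_j/dz_k)
# using dL/dy_hat_j = -y_j / y_hat_j and the standard Softmax Jacobian
# dy_hat_j/dz_k = y_hat_j * (delta_jk - y_hat_k).
k = 0
paths = [(-y[j] / y_hat[j]) * (y_hat[j] * ((j == k) - y_hat[k]))
         for j in range(3)]
total = sum(paths)

# Finite-difference check: actually nudge z1 and measure the change in L.
eps = 1e-6
step = eps * np.eye(3)[k]
numeric = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print(total, numeric)  # the two estimates should agree
```

The sum over `paths` is exactly the "Total Impact = (Effect via ŷ₁) + (Effect via ŷ₂) + (Effect via ŷ₃)" decomposition described below.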
Intuitive Explanation of the Chain Rule:
- Question: If I give a tiny "nudge" to \(z_1\), how much will L be affected?
- Path Analysis:
  - The nudge to \(z_1\) spreads and affects \(\hat{y}_1, \hat{y}_2, \hat{y}_3\).
  - Then, those changes in \(\hat{y}_j\) collectively influence L.
- Chain Rule Insight: The total impact on L is the sum of the contributions from all three paths:

  Total Impact = (Effect via ŷ₁) + (Effect via ŷ₂) + (Effect via ŷ₃)

This is why the derivative formula \(\frac{\partial L}{\partial z_k}\) takes the form of a summation: