LOSS & ACCURACY
Calculating Network Error with a Loss Function
Welcome to the lesson on how we "measure the error" of a neural network. In previous chapters, we built a network capable of taking an input and making a prediction. But how do we know if that prediction is "good" or "bad"? And how much "worse" is it compared to another prediction? This is where the Loss Function comes into play.
1. Why Isn't Accuracy Enough?
The first question you might ask is: "Why not just use accuracy for measurement? We can just see what percentage of predictions the network gets right."
Let's consider the classic example from the book: Assume the correct class is the middle one (index 1). We have two predictions from the model:
- Prediction A: `[0.22, 0.6, 0.18]`
- Prediction B: `[0.32, 0.36, 0.32]`
If we only use accuracy, we would do the following:
1. Use the `argmax` function to find the index of the largest value.
2. The `argmax` for both A and B is 1.
3. Compare this to the correct result, which is 1. Both are correct: the accuracy is 100% in both cases.
But let's look closer. The output of a neural network (after the Softmax layer) represents its confidence level.
* In Prediction A, the model is very confident (60%) that the sample is class 1.
* In Prediction B, the model is only slightly more confident in class 1 (36%) and is quite indecisive among the three classes.
Clearly, Prediction A is much "better" than Prediction B. We need a metric that reflects this difference in confidence: accuracy is a discrete metric (right or wrong), whereas we need something continuous and fine-grained.
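Here is a minimal sketch of this comparison in NumPy (the variable names `prediction_a` and `prediction_b` are just illustrative): both predictions produce the same argmax, and therefore the same accuracy, even though their confidence in the correct class is very different.

```python
import numpy as np

# Two predictions for the same sample; the correct class is index 1
prediction_a = np.array([0.22, 0.60, 0.18])
prediction_b = np.array([0.32, 0.36, 0.32])

# Accuracy only looks at the argmax: both pick class 1, so both count as "correct"
print(np.argmax(prediction_a), np.argmax(prediction_b))  # 1 1

# But the confidence in the correct class differs a lot
print(prediction_a[1], prediction_b[1])  # 0.6 0.36
```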
This is the purpose of Loss. Loss is a single number that indicates how much the model was wrong. The goal of training is to adjust the weights and biases to bring this Loss value as close to 0 as possible.
2. Categorical Cross-Entropy: The Loss Function for Classification
In our problem (classifying spiral data), the output layer uses the Softmax activation function, which turns raw scores (logits) into a probability distribution (the values sum to 1). To compare two probability distributions (one from the model's prediction, one from the ground truth), a mathematical tool called Cross-Entropy is commonly used.
Side Note: What is Cross-Entropy? In information theory, cross-entropy measures the difference between two probability distributions. If you encode data drawn from distribution Q using a code that is optimal for distribution P, the cross-entropy H(Q, P) tells you the average number of bits required. The more similar P and Q are, the smaller this value becomes. In Machine Learning, the ground truth plays the role of the data distribution and the model's predictions play the role of the coding distribution; the goal is to make the predictions match the ground truth as closely as possible, i.e., to minimize the cross-entropy. (Source: Deep Learning Book, Section 3.13)
When applied to a multi-class classification problem, it is called Categorical Cross-Entropy.
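To make the side note's claim concrete, here is a small sketch (the helper `cross_entropy` and the example distributions are illustrative, not code from the book): the closer the predicted distribution gets to the ground truth, the smaller the cross-entropy.

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum(p * log(q)), using the natural logarithm
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])        # ground truth (one-hot)
q_far = np.array([0.4, 0.3, 0.3])    # prediction far from p
q_near = np.array([0.9, 0.05, 0.05]) # prediction close to p

print(cross_entropy(p, q_far))   # ~0.916
print(cross_entropy(p, q_near))  # ~0.105
```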
Mathematical Formula
The full formula for calculating the loss for a single sample i is:
L_i = - Σ_j ( y_i,j * log(ŷ_i,j) )
Where:
- L_i: the loss value for the i-th sample.
- j: the index of each output class (e.g., dog, cat, human).
- Σ_j: the summation symbol, summing over all classes j.
- y_i,j: the ground truth. It is 1 if sample i truly belongs to class j, and 0 for all other classes.
- ŷ_i,j (read as "y-hat"): the model's prediction, i.e., the probability the model assigns to sample i belonging to class j (the value from Softmax).
- log: the natural logarithm (base e), which is `math.log()` or `np.log()` in Python.
The Magic of One-Hot Encoding
Let's see how this formula works in practice. Assume we have 3 classes and the ground truth is the first class (index 0).
* Model's prediction (ŷ): `[0.7, 0.1, 0.2]`
* Ground truth (y): `[1, 0, 0]`

The vector `[1, 0, 0]` is called one-hot encoding: "hot" is 1, "cold" is 0.
Applying the formula:
L = - ( y_0*log(ŷ_0) + y_1*log(ŷ_1) + y_2*log(ŷ_2) )
L = - ( 1*log(0.7) + 0*log(0.1) + 0*log(0.2) )
L = - ( 1*log(0.7) + 0 + 0 )
L = -log(0.7)
The initially imposing formula has been simplified to a very simple operation: just take the natural logarithm of the predicted probability for the correct class, and then negate it.
This is a critically important insight! It makes the calculation much more efficient.
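A quick sketch verifying the simplification on the example above (variable names are illustrative): the full sum and the shortcut give exactly the same number.

```python
import numpy as np

y_true = np.array([1, 0, 0])        # one-hot ground truth
y_pred = np.array([0.7, 0.1, 0.2])  # Softmax output

full_loss = -np.sum(y_true * np.log(y_pred))  # full formula over all classes
shortcut_loss = -np.log(y_pred[0])            # -log of the correct-class probability

print(full_loss, shortcut_loss)  # 0.35667494... 0.35667494...
```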
3. Implementation in Python (Step-by-Step)
3.1. Calculating Loss for a Batch
In practice, we don't process samples one by one but in batches to speed things up.
```python
import numpy as np

# Softmax output for a batch of 3 samples, each with 3 classes
softmax_outputs = np.array([[0.7, 0.1, 0.2],    # Sample 1
                            [0.1, 0.5, 0.4],    # Sample 2
                            [0.02, 0.9, 0.08]]) # Sample 3

# True labels (target labels) in "sparse" format, stored as a NumPy array
# so we can check its shape later.
# Sample 1 is class 0, sample 2 is class 1, sample 3 is class 1
class_targets = np.array([0, 1, 1])
```
How do we extract the correct-class probabilities from `softmax_outputs`? We need:
- from row 0, the element at column 0 (`0.7`)
- from row 1, the element at column 1 (`0.5`)
- from row 2, the element at column 1 (`0.9`)
NumPy provides a very elegant way to do this, called advanced indexing:
```python
# Extract the probabilities of the correct classes
correct_confidences = softmax_outputs[
    range(len(softmax_outputs)),  # Row indices: [0, 1, 2]
    class_targets                 # Column indices: [0, 1, 1]
]
print(correct_confidences)
# Result: [0.7 0.5 0.9]
```
Now, we just need to apply the `-log` formula to each of these values:
```python
# Calculate the loss for each sample
sample_losses = -np.log(correct_confidences)
print(sample_losses)
# Result: [0.35667494 0.69314718 0.10536052]
```
Finally, the loss for the entire batch is usually calculated as the arithmetic mean of the individual losses:
```python
# Calculate the average loss for the batch
average_loss = np.mean(sample_losses)
print(average_loss)
# Result: 0.38506088...
```
3.2. The Tricky Problem: log(0)
The function `log(x)` is only defined for `x > 0`. What happens if the model predicts a probability of 0 for the correct class? `log(0)` is negative infinity, so the loss becomes `inf` (infinity) in Python, and once `inf` appears, it "infects" subsequent calculations (e.g., the mean of a list containing `inf` is also `inf`). This would break the entire training process.
Even worse, if the model is overconfident and predicts a probability of 1.0 for the correct class, then `log(1.0) = 0` and the loss is 0. But due to floating-point rounding errors, the stored value might end up as something like `1.0000001`, and `log(1.0000001)` is a very small positive number, causing the loss to become a very small negative number. A negative loss is nonsensical.
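A tiny demonstration of the problem (NumPy also emits a divide-by-zero warning on the first `log` call):

```python
import numpy as np

# A predicted probability of 0 for the correct class makes the loss infinite
print(-np.log(0.0))  # inf

# A single inf then "infects" the batch average
print(np.mean([0.35667494, 0.69314718, -np.log(0.0)]))  # inf
```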
Solution: Clipping
To solve this definitively, we "clip" the predicted values so they never reach 0 or 1, forcing them into the slightly narrowed range `[1e-7, 1 - 1e-7]`.

- `1e-7` is scientific notation for `0.0000001`.
- Any value smaller than `1e-7` is set to `1e-7`.
- Any value larger than `1 - 1e-7` is set to `1 - 1e-7`.
In NumPy, we use the `np.clip()` function:
```python
# y_pred is the array of predicted probabilities
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
```
This way, we never take the `log` of 0 or of a number greater than 1, making the calculation stable.
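A short sketch of the effect (the array values are made up for illustration): after clipping, the worst-case loss is large but finite instead of infinite.

```python
import numpy as np

y_pred = np.array([0.0, 0.3, 0.7])  # the 0.0 would normally produce an infinite loss
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

print(-np.log(y_pred_clipped))
# [16.11809565  1.2039728   0.35667494]
```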
3.3. Handling Both Label Formats (One-Hot and Sparse)
The true labels (targets) can come in two formats:
- Sparse: a 1D array containing the indices of the correct classes. Example: `[0, 1, 1]`.
- One-Hot: a 2D array with one row per sample. Example: `[[1,0,0], [0,1,0], [0,1,0]]`.
We can check the number of dimensions of the label array (`len(y_true.shape)`) to determine its format and handle it accordingly:
- If sparse (the shape has 1 dimension): use the advanced-indexing method shown above.
- If one-hot (the shape has 2 dimensions): use the one-hot vector as a mask, as in the original formula Σ_j (y * log(ŷ)). Since y is one-hot, multiplying by it turns all values for the incorrect classes into 0, leaving only the value for the correct class; we then just sum along the rows (`axis=1`). In the code below, the mask is applied directly to the clipped predictions and the `-log` is taken afterwards, which gives the same result.
```python
# Assume y_pred_clipped has already been calculated
# and y_true is a one-hot label array
if len(y_true.shape) == 2:
    correct_confidences = np.sum(
        y_pred_clipped * y_true,
        axis=1
    )
```
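To check that this one-hot path recovers exactly the same values as the advanced-indexing path from section 3.1, here is a small standalone sketch (the one-hot array is just the expanded form of `[0, 1, 1]`):

```python
import numpy as np

y_pred_clipped = np.array([[0.7, 0.1, 0.2],
                           [0.1, 0.5, 0.4],
                           [0.02, 0.9, 0.08]])
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0]])

correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
print(correct_confidences)  # [0.7 0.5 0.9] -- same as with advanced indexing
```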
4. Building a Complete Loss Class
To keep our code organized and reusable, we will create classes for the Loss.
The Base `Loss` Class
This is an abstract parent class. It has a `calculate` method that is common to all loss types: compute the loss for each sample (via a `forward` method implemented by subclasses) and then take the average.
```python
# Common loss class
class Loss:

    # Calculates the data loss given model output and ground-truth values
    def calculate(self, output, y):
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        return data_loss
```
The `Loss_CategoricalCrossentropy` Class
This class inherits from `Loss` and implements the specific calculation logic for Categorical Cross-Entropy.
```python
# Cross-entropy loss
class Loss_CategoricalCrossentropy(Loss):

    # Forward pass
    def forward(self, y_pred, y_true):
        # Number of samples in the batch
        samples = len(y_pred)

        # 1. Clip predictions to avoid log(0)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        # 2. Check label format and extract the correct-class confidences
        # If labels are sparse, e.g. [0, 1, 1]
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[
                range(samples),
                y_true
            ]
        # If labels are one-hot, e.g. [[1,0,0], [0,1,0], ...]
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(
                y_pred_clipped * y_true,
                axis=1
            )

        # 3. Calculate negative log likelihoods (loss for each sample)
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
```
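A quick usage sketch, reusing `softmax_outputs` and `class_targets` from section 3.1 (this wiring is just one reasonable way to call the class; the variable names are illustrative):

```python
# Create the loss object and compute the average loss over the batch
loss_function = Loss_CategoricalCrossentropy()
loss = loss_function.calculate(softmax_outputs, class_targets)
print('loss:', loss)  # ~0.38506
```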
5. Additionally Calculating Accuracy
Although not used for optimization, accuracy is still a very intuitive metric for humans to evaluate a model's performance. We will calculate it in parallel with the loss.
The way to calculate accuracy is simple:
- Use `np.argmax(softmax_outputs, axis=1)` to find the predicted class with the highest probability for each sample. `axis=1` means taking the argmax along each row (each sample).
- Compare this array of predictions with the array of true labels. The comparison `predictions == class_targets` returns an array of `True` (correct) and `False` (incorrect) values.
- Take the mean of this boolean array. `np.mean()` automatically treats `True` as 1 and `False` as 0.
```python
# Get predictions from the softmax output
predictions = np.argmax(softmax_outputs, axis=1)

# If labels are one-hot, convert them to sparse
if len(class_targets.shape) == 2:
    class_targets = np.argmax(class_targets, axis=1)

# Compare and calculate the mean
accuracy = np.mean(predictions == class_targets)
print('acc:', accuracy)
```
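If the labels had been given in one-hot form instead, the same code would first collapse them back to sparse indices. A standalone sketch (the one-hot array is just the expanded form of `[0, 1, 1]`):

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0]])

predictions = np.argmax(softmax_outputs, axis=1)      # [0 1 1]
if len(class_targets.shape) == 2:
    class_targets = np.argmax(class_targets, axis=1)  # [0 1 1]

accuracy = np.mean(predictions == class_targets)
print('acc:', accuracy)  # acc: 1.0
```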
Summary
In this chapter, we have learned the core concepts:
- Why we need Loss: Loss is a detailed, continuous measure of the model's "degree of error," which is much better for training than accuracy.
- Categorical Cross-Entropy: It's the standard loss function for multi-class classification, measuring the difference between the predicted probability distribution and the true distribution.
- Simplified Formula: when labels are one-hot, the complex formula simplifies to `-log(probability of the correct class)`.
- Practical Implementation:
  - using NumPy indexing for efficient batch calculations;
  - solving the `log(0)` problem with the clipping technique;
  - building the code in an object-oriented (OOP) way with `Loss` classes for easy management and extension.
- Accuracy: it remains an important metric for human monitoring and is calculated in parallel with the loss.
By being able to calculate the Loss, we now have a "signal" to know where and how the model needs to improve. The next chapter will show us how to use this signal to actually "teach" the neural network through optimization and backpropagation.