Deep Learning Demystified
Demystified: Categorical Cross-Entropy
A quick primer on cross entropy as a loss function for deep learning models
Categorical cross entropy is the default loss function for almost every deep learning classification problem, yet it is rarely understood.
I’ve asked practitioners about this, as I was deeply curious why it is used so frequently, and rarely got an answer that fully explained why it’s such an effective loss for training. I even tend to take the high-level abstractions provided by Keras and other ML libraries for granted, so I thought it would be worthwhile to review.
I’ve covered the principles of information entropy in a previous article:
Demystified: Kullback–Leibler Divergence
A quick primer on Kullback-Leibler Divergence, an important concept to understand in machine learning and information…
Now that we understand entropy, let’s take a look at a typical classification network.
Classification using a Feedforward Network
[see the notebook at the end of this article for reference]
I’m not covering the basics of artificial neural networks here. What I do want to discuss, however, is the output of the last layer of a typical neural network and how categorical cross entropy is actually used.
For simplicity’s sake I’m going to use the standard MNIST classifier example.
Everything is pretty standard: we have 2 convolution layers (each followed by a max-pooling operation), then a flatten layer, which turns the output of the last convolution into a single vector, then a dropout layer, which helps avoid overfitting, and finally a dense layer that outputs a 1 × 10 vector scaled by our softmax function.
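To make that last step concrete, here is a minimal pure-Python sketch of the softmax scaling applied to the dense layer's raw scores. The `logits` values are made up for illustration and are not taken from the trained model.

```python
import math

def softmax(logits):
    """Scale raw scores so they are positive and sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the 10 digit classes (not from the real model).
logits = [1.2, -0.5, 0.3, 8.0, -2.1, 0.0, 0.7, -1.3, 2.2, 0.1]
probs = softmax(logits)
predicted_class = probs.index(max(probs))
```

The largest raw score ends up with most of the probability mass, which is why a confident network produces an output vector with one value very close to 1.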
Once the model is trained, I’m going to feed in a random item from our test set, then use the nice Keract library to pull in the specific layer activation from our output.
Here, we have a random sample, visualized with matplotlib
Now, we run this example through our model, using Keract's get_activations function as a wrapper:
import numpy as np
from keract import get_activations, display_activations

# wrap the single sample in a batch of size 1
this_x = np.array([random_choice_classification])
activations = get_activations(model, this_x, auto_compile=True)
Which gives us our output:
0.9999992847442627, # we see our model correctly predicts the class
Categorical Cross Entropy
So, now that we have an example vector, and understand the mechanics of our classification model, we can start to explore what the cross entropy actually means.
Remember from our discussion of entropy in the previous article: cross entropy measures the “distance” between two probability distributions, expressed as the average number of bits required to encode events from distribution 1 using a code optimized for distribution 2. The number of additional bits, beyond the entropy of distribution 1 itself, is the KL divergence.
I’ve written a function to do this manually, but by no means should you do this in practice. Use the TensorFlow API. (You will also notice this value may differ slightly from the framework's result, as I’m not following the exact specification used in the loss.)
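For reference, the standard definitions can be written as follows, with p the ground truth distribution and q the model's prediction (the same `p` and `q` used in the code further down). Base-2 logarithms give the result in bits.

```latex
H(p) = -\sum_i p_i \log_2 p_i
\qquad
H(p, q) = -\sum_i p_i \log_2 q_i
\qquad
D_{KL}(p \,\|\, q) = \sum_i p_i \log_2 \frac{p_i}{q_i} = H(p, q) - H(p)
```

When p is a one-hot vector, H(p) = 0, so the cross entropy and the KL divergence are the same number.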
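import math

def cross_entropy(p, q):  # function name assumed; the original definition line is missing
    assert len(p) == len(q)
    _sum = 0
    for i, _p in enumerate(p):
        # we add a small constant to avoid taking the log of zero
        _sum += _p * math.log2(q[i] + 1e-20)
    # cross entropy is the negative of the weighted log sum
    return -_sum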
This gives us a loss of 1.031896274211761e-06 for our test example.
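As a sanity check, the reported value can be reproduced from the winning probability alone. The sketch below assumes a one-hot ground truth. The target index `3` and the way the leftover probability is spread over the other nine classes are placeholders, since the full output vector is not shown. It also computes the natural-log variant, which, as far as I know, is what the Keras loss uses, so a Keras result would differ from the log2 figure by a factor of ln 2.

```python
import math

def cross_entropy(p, q, log=math.log2, eps=1e-20):
    """Cross entropy of prediction q against ground truth p."""
    assert len(p) == len(q)
    return -sum(p_i * log(q_i + eps) for p_i, q_i in zip(p, q))

top = 0.9999992847442627   # winning probability from the model output
target = 3                 # placeholder class index (assumption)
rest = (1.0 - top) / 9     # placeholder: leftover mass spread evenly

q = [rest] * 10
q[target] = top
p = [0.0] * 10
p[target] = 1.0            # one-hot ground truth

loss_bits = cross_entropy(p, q)                # base 2, as in the manual function
loss_nats = cross_entropy(p, q, log=math.log)  # natural log
```

Only the target entry of `q` affects the result, which is why the placeholder values for the other classes do not matter here.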
This means that, for each mini-batch, the network is learning to minimize the “distance” between two probability distributions: the first is the output vector, which has been scaled by a softmax (all values sum to 1), and the second is the ground truth vector, which is just a probability of 1 for our target class and a probability of 0 elsewhere. This pattern is the same for every classification problem that uses categorical cross entropy, no matter if the number of output classes is 10, 100, or 100,000. Voila!
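Because the ground truth is one-hot, every term of the sum except the one for the target class is multiplied by zero, so the loss collapses to the negative log of the probability assigned to the correct class. The sketch below shows this for 10 and 1,000 classes. The helper names and the 0.8 confidence are illustrative only.

```python
import math

def one_hot(index, num_classes):
    """Ground truth vector: probability 1 at `index`, 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy(p, q):
    """Full sum over all classes, in bits."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q))

def make_prediction(target, num_classes, confidence):
    """Illustrative prediction: `confidence` on the target, the rest spread evenly."""
    other = (1.0 - confidence) / (num_classes - 1)
    return [confidence if i == target else other for i in range(num_classes)]

results = {}
for num_classes in (10, 1000):
    p = one_hot(2, num_classes)
    q = make_prediction(2, num_classes, confidence=0.8)
    results[num_classes] = cross_entropy(p, q)

shortcut = -math.log2(0.8)  # only the target class contributes
```

Both entries in `results` match `shortcut`, regardless of how many classes the output layer has.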
It is also important to note that the Keras API is using the auto reduction setting to reduce the losses, which essentially averages the per-example cross entropy over each training batch. This is something you may need to consider if you’re dealing with custom evaluations that need to handle extreme misclassifications.
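A minimal sketch of why the averaging matters, using made-up probabilities for the correct class: one badly misclassified example is heavily diluted in the batch mean. This mimics the averaging in plain Python and does not call the Keras API.

```python
import math

# Hypothetical probabilities the model assigned to the correct class
# for each example in one mini-batch (illustrative values only).
correct_class_probs = [0.99, 0.98, 0.97, 0.99, 0.001, 0.98, 0.99, 0.96]

# Per-example loss with a one-hot target: -log(probability of the true class).
per_example = [-math.log(prob) for prob in correct_class_probs]

batch_mean = sum(per_example) / len(per_example)  # what an averaging reduction reports
worst = max(per_example)                          # the extreme misclassification
worst_index = per_example.index(worst)
```

If a custom evaluation needs to surface cases like the fifth example, inspecting the per-example losses (or their maximum) is more informative than the mean alone.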
All the code for this writeup can be found here:
Thanks for reading!