Deep Learning Demystified

Demystified: Kullback–Leibler Divergence

A quick primer on Kullback-Leibler Divergence, an important concept to understand in machine learning and information theory

Sam Black


So, first things first: we need to understand what entropy is, in terms of information theory and not thermodynamic entropy. Both are important and curiously related, but for ML, and possibly card counting, we’re going with information theory. This entropy (Shannon entropy) is named after its creator. We miss you, Claude.

Shannon Entropy

Let’s start off with a bit of pre-prerequisite understanding.

Discrete (countable) random variables (DRVs)

This one is pretty simple. A variable with a countable (here, finite) sample space. A single die roll is a good example: it has a sample space of 6 outcomes, {1, 2, 3, 4, 5, 6}, all with equal probability (1/6). A coin flip is another good example, with a sample space of 2: {heads, tails}, again with equal probability (1/2). The associated probabilities can be equal, like in a fair coin flip, but don’t have to be, like an unfair coin used by a rogue carny trying to trick you out of your hard-earned cash: {heads: 1/3, tails: 2/3}.

Surprisingly, if you were to dive into a textbook definition of DRVs, it would get quite complicated for no reason.

That covers DRVs. Let’s move into entropy.

Entropy

Shannon entropy was devised as a way to measure the amount of information in transmitted messages. Accordingly, the formula yields the average number of bits required to represent a message drawn from the distribution (when using base 2).

You will find a lot of strange intuitions about what this means. I posit this is related to a fundamental misunderstanding of entropy in general. By no means am I an expert on physical and informational entropy. However, this line of conversation could lead to a 6-month-long mathematical rabbit hole in which you lock yourself in your home and lose your mind.

Let’s save you from that and keep it simple.

The clearest and most succinct definition came from an unexpected place.

OK. Let’s jump straight into the formula:

H(X) = − Σ_i P(x_i) * log_b( P(x_i) )

X is our sample space, x_i is an element of our sample space, and b represents our base; in information theory, we’ll typically use base 2.

So, using our example above, let’s calculate a bit of entropy.

For a single fair coin flip:

-((1/2 * log₂(1/2)) + (1/2 * log₂(1/2))) = -(-0.5 - 0.5) = 1

One ‘bit’ of entropy. A fair coin is in a state of ‘maximal entropy’.

So, if we have an unfair coin with the probabilities listed above, we have an increased degree of predictability, right? Yes: we can predict tails more often (probability 2/3 vs. 1/2). Given that, we’d expect our entropy measure to be less than one for a flip of the unfair coin. Said otherwise, entropy decreases as predictability increases.

-((1/3 * log₂(1/3)) + (2/3 * log₂(2/3))) = -((-0.5283) + (-0.3900)) ≈ 0.92
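If you want to sanity-check those numbers, here’s a minimal hand-rolled sketch in Python (the function name and structure are mine, purely for illustration):

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy of a discrete distribution, given its probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([1/2, 1/2]))  # fair coin   -> 1.0 bit
print(shannon_entropy([1/3, 2/3]))  # unfair coin -> ~0.918 bits
```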

Yep. The universe is still in consistent order. For now…

By the way, if you’re a data scientist and are working with information entropy, don’t waste time writing your own implementation unless you’re looking to learn or challenge yourself. There’s a nice implementation in the SciPy package (scipy.stats.entropy). If that’s not fast enough and you’re working with strings of information, there’s a separate package implemented in C.
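For reference, here’s roughly what the SciPy route looks like; scipy.stats.entropy normalizes the probabilities for you and takes the base as a keyword argument:

```python
from scipy.stats import entropy

# base=2 reports the result in bits; the default base is e (nats)
print(entropy([1/2, 1/2], base=2))  # 1.0
print(entropy([1/3, 2/3], base=2))  # ~0.918
```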

If you still want to dive into information entropy, I’d start with this video:

If that hasn’t satiated your voracious appetite for information entropy, then MIT has open sourced an entire course on information and entropy — taught by Seth Lloyd. What a world we’re blessed with.

Seth, in the highly unlikely causal chain of events that would end up with you reading this article, I just wanna say, I dig your nerdy giggle.

Kullback–Leibler (KL) Divergence

If you’re working in the field, you’re probably already aware of KL divergence, sometimes called information gain in the context of decision trees and also called relative entropy. But unless you’ve written implementations of Bayesian inference algorithms or done graduate coursework in information theory or machine learning, you might not have gotten down into the nuts and bolts. Don’t worry, it’s actually pretty straightforward.

The simplest, highest-level, and most nonthreatening way of describing it is: if you have two probability distributions, P and Q, the KL divergence measures how much P diverges from Q; the smaller the value, the more similar the two distributions. If KL_Divergence(P||Q) = 0, the distributions are identical.

Alright, that should anchor the concept in your mind. Now, onto the details.

Formulae

For discrete probability distributions: KL_Divergence(P||Q) = Σ_x P(x) * log( P(x) / Q(x) )
For continuous probability distributions: KL_Divergence(P||Q) = ∫ p(x) * log( p(x) / q(x) ) dx
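As with entropy, don’t roll your own in production; this hand-rolled sketch of the discrete formula is purely for illustration (the function name is mine):

```python
import math

def kl_divergence(p, q, base=2):
    """Discrete KL divergence KL(P || Q), with p and q as aligned lists of probabilities."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# How far is the carny's unfair coin (P) from a fair coin (Q)?
print(kl_divergence([1/3, 2/3], [1/2, 1/2]))  # ~0.082 bits
```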

Properties

  1. It is non-symmetric: KL_Divergence(P||Q) != KL_Divergence(Q||P). Sometimes you’ll hear KL divergence called a distance metric. That’s a simpler way of understanding it, but to keep it technically correct, KL divergence is neither symmetric nor does it satisfy the triangle inequality, so it’s not a true distance. If we want a symmetric form, say for clustering or some other interesting use of distances in a topological space, use Jensen-Shannon divergence.
  2. It’s a summation for discrete probability distributions and an integral for continuous probability distributions. (formulas above)
  3. It’s always ≥ 0

As with the entropy calculation above, don’t worry about writing your own implementation unless you’re working on something custom or doing it for the sheer joy of learning. In fact, the same SciPy function, scipy.stats.entropy, will give you KL divergence if you pass a second distribution as an additional parameter.
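Here’s a minimal sketch of that, along with a quick check of the non-symmetry and the symmetric Jensen-Shannon alternative (note that scipy.spatial.distance.jensenshannon returns the square root of the JS divergence, i.e. the JS distance):

```python
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

p = [1/3, 2/3]  # the unfair coin
q = [1/2, 1/2]  # a fair coin

# Passing a second distribution turns entropy() into KL divergence
print(entropy(p, q, base=2))  # KL(P || Q) ~ 0.082 bits
print(entropy(q, p, base=2))  # KL(Q || P) ~ 0.085 bits -> not symmetric

# Jensen-Shannon gives the same value regardless of argument order
print(jensenshannon(p, q, base=2))  # ~0.144
print(jensenshannon(q, p, base=2))  # ~0.144
```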

Now, let’s try to mix all of these ingredients into a mind cake for you to enjoy for the rest of your nerdy days.

BTW — I love cake

Why is it called information gain?

I decided to snapshot this straight out of my notebook. If we have some sort of Bayesian update process, as is typical in Bayesian learning, we can follow this train of logic.

This demonstrates literal information, measured in additional bits, gained by moving from a prior distribution p(x) to a posterior p(x|y).

Sorry in advance if you can’t read my chicken scratch

The above process can be repeated indefinitely with every new piece of Y you observe.
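To make that concrete, here’s a minimal toy sketch of my own (not the notebook page): put a discrete prior over a coin’s unknown bias, observe one flip, apply Bayes’ rule, and measure the KL divergence of the posterior from the prior in bits; that’s the information gained from the observation.

```python
import math

def kl_bits(posterior, prior):
    """KL(posterior || prior) in bits: the information gained by the update."""
    return sum(p * math.log2(p / q) for p, q in zip(posterior, prior) if p > 0)

# Toy hypothesis space: the coin's P(heads) is one of three candidate values
thetas = [1/4, 1/2, 3/4]
prior = [1/3, 1/3, 1/3]   # uniform prior p(x)
likelihood = thetas        # P(y = heads | x) is just the candidate bias itself

# One observation: the coin came up heads. Bayes' rule gives the posterior p(x|y).
unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
posterior = [u / sum(unnormalized) for u in unnormalized]

print(posterior)                  # [~0.167, ~0.333, 0.5]
print(kl_bits(posterior, prior))  # ~0.126 bits gained from this single flip
```

Repeating this with more flips keeps nudging the posterior, and each update’s KL divergence tells you how many additional bits that observation bought you.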

With enough observation, you eventually have all the bits. What you do with them, is your choice….

Thanks for reading!

If you liked this article, check out my latest article:

Feel free to leave a comment to let me know what you’re interested in!
