Using Latent Dirichlet Allocation to categorize UFO sightings

A recently released photo of a UFO. Thanks Trump!

Background

The Latent Dirichlet Allocation (LDA) algorithm was “twice born”: first in 2000, for assigning individuals to K populations based on genetic information, and again in 2003 for topic modeling of text corpora. For the purposes of this discussion, I’m going to stick to topic modeling, but LDA is applicable to multiple domains, and the genetic applications are quite interesting. Links to both papers are below.

  1. Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. “Inference of population structure using multilocus genotype data.” Genetics 155.2 (2000): 945–959.
  2. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation.” Journal of Machine Learning Research 3 (Jan. 2003): 993–1022.

I’m going to be referring to paper 2 as I go through this explanation.

Super high level

To start off, let’s go with the black-box description. The LDA algorithm asks that you select a number of classes (topics) and input a corpus of documents. The result is a list of topics, where each topic is a probability distribution over words. The trained LDA model can also classify a document, assigning it a probability for each topic.

A visual representation of the description above. There are more hyperparameters to tune, but let’s keep it simple for now
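To make the black box concrete, here’s a minimal gensim sketch. The toy corpus and variable names are mine, purely for illustration:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each document already tokenized into a bag of words
docs = [["bright", "light", "sky", "moving"],
        ["triangle", "craft", "hovering", "sky"],
        ["orange", "fireball", "moving", "fast"]]

dictionary = Dictionary(docs)                       # the fixed vocabulary
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words counts

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)

# Output 1: each topic is a probability distribution over words
print(lda.print_topics(num_words=4))

# Output 2: any document can be classified as a mixture of topics
new_doc = dictionary.doc2bow(["green", "light", "hovering"])
print(lda.get_document_topics(new_doc))
```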

Sweet! Let’s wrap it up and hit happy hour… but wait…

It’s important to note that the algorithm makes no high-level determination of what each topic means, or whether the number of topics is adequate for your corpus. That’s where we apply human intuition and some good ol’ mathematical optimization to determine what the topics are and whether they are adequately represented by the model.

Your typical data scientist may wrap things up here. We’ve got some outputs and the outputs are seemingly acceptable.

But again, Typical != Optimal. Let’s break this down.

Theory

LDA is a generative probabilistic model. If you’re unsure what a generative model is, or how generative and discriminative models differ, let me refer you to a nice description here.

Given the context of topic modeling, we are assuming that all the documents are the result of (generated by) some hidden, or latent variables (topics, topic assignments and topic proportions).

But starting off, we have no idea what the latent variables are. All we have is a corpus. To estimate these latent variables, we use statistical inference. Specifically, approximate posterior inference methods such as Gibbs sampling and variational inference. I’m not going to dive deep into the implementation of these algorithms. It’s unlikely that you’ll need to implement them on your own, since the community is awesome, but it’s good to understand the methods working under the hood.

Some key points:

  1. A document can, and generally does, belong to multiple topics
  2. Each topic is represented as a probability distribution over terms in a FIXED vocabulary. What does that mean? For any topic X, we’ll have words (w_1..w_n) with associated probabilities that sum to 1. This also means that we’ll have to agree on a fixed vocabulary in advance.
  3. For topic modeling, we are also going to analyze our text data as a bag of words. Order doesn’t matter here.

At the end of building an LDA model, we’ll have a probabilistic model we can use to get a distribution over topics for any document in our problem space.

To follow along with the logical sequence in Blei et al., the model assumes that each document w in a corpus D is generated by:

A description of the steps taken for each document of a text corpus. — D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003

N is the number of words in the document, drawn from a Poisson distribution. Don’t worry too much about the Poisson here; the paper notes the choice isn’t critical. The fixed vocabulary that the words w_n are drawn from is a separate matter: we set it based on the occurrence of each word in our training corpus, generally after removing stop words, stemming, and the other preprocessing steps reviewed in the implementation section below. You may also want to remove words which appear in nearly every document, as they don’t add much discriminative power. Another common tactic is to remove very rare words.

θ represents the document’s topic proportions, a random draw from a Dirichlet distribution parameterized by α, a vector of length K, K being the number of topics we’ve selected.

Then, for each of the N words (n = 1..N):

  1. We select a topic z_n from a discrete (multinomial) distribution parameterized by θ. Remember, we’re randomly selecting a topic based on the proportions θ we’ve already drawn.
  2. We select a word w_n from a probability distribution conditioned on the topic selected above.

If we asked this theoretical model for 100 documents, what we’d get is 100 documents of randomly generated words, each word drawn from a topic that was itself drawn from a probability distribution. The words would come from some pre-existing matrix of topics × vocabulary words (k × V), as described in the paper.
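As a sanity check, the whole generative story fits in a few lines of NumPy. This is a sketch, not the paper’s code: I fix the document length instead of drawing it from a Poisson, and the topic-word matrix β is randomly initialized purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

k, V = 3, 8                # number of topics, vocabulary size
alpha = np.full(k, 0.1)    # Dirichlet parameter (length-K vector)

# Pre-existing topics-by-vocabulary matrix; each row is one topic's
# word distribution and sums to 1. Random here, for illustration only.
beta = rng.dirichlet(np.ones(V), size=k)

def generate_document(n_words=10):
    theta = rng.dirichlet(alpha)        # topic proportions for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta)      # 1. pick a topic z_n from Multinomial(theta)
        w = rng.choice(V, p=beta[z])    # 2. pick a word conditioned on z_n
        words.append(w)
    return words

corpus = [generate_document() for _ in range(100)]  # 100 synthetic documents
```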

Axioms

Just to clarify here, I’m using ‘axioms’ interchangeably with assumptions or foundational building blocks that we can build an understanding from.

Aside from understanding the basics of Multinomial (categorical) and Dirichlet distributions and their conjugacy, a key axiom to understand is the concept of exchangeability. That is, LDA is built as a mixture model that considers both topics and words (random variables) exchangeable.

This building block was based on De Finetti’s theorem which has much broader applicability to probabilistic inference. I recommend you try to understand De Finetti’s theorem. It’s definitely enhanced my understanding of probabilities in general.
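For reference, in the paper’s notation, De Finetti’s representation lets us write the joint distribution of a document’s topics and words as a mixture over the latent topic proportions θ:

```latex
p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta
```

It’s exactly this representation that justifies treating θ as a random variable with its own Dirichlet distribution.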

Formulas

The paper contains a number of formulas, but I find it’s a lot easier to digest when it’s in this plate notation.

Plate notation for the generative process described above. The shaded node represents the only observed variable in the model (the words)

Next, the intractable problem to solve in our modeling: finding out what the posterior distribution of our hidden variables is, given the observed documents.
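Reconstructed from the paper, that posterior is

```latex
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}
```

and the trouble is the denominator: the marginal likelihood p(w | α, β) couples θ and β under the integral, making exact computation intractable.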

We can estimate these hidden parameters using variational inference. I’ll be quite honest here — an explanation from me wouldn’t hold a candle to this excellent explanation I’ve come across by Jordan Boyd-Graber. It’s a 30-ish minute watch, but covers the basics for mean field variational inference: KL divergence, Jensen’s inequality and even their application in terms of LDA. Ideal. Check it out here:

Variational Inference — demystified

Variations and examples of real world use

  1. Spatial Latent Dirichlet Allocation: An interesting CV example of using LDA on image data to classify “visual words” into higher level topics
  2. Online Learning for Latent Dirichlet Allocation: An online variant of the variational Bayes algorithm that works on streaming data. Online methods are always interesting given the value they can generate when embedded in real-time applications
  3. Parallel inference for latent Dirichlet allocation on graphics processing units: Discusses two parallel inference methods used for LDA, Gibbs sampling and variational Bayes. Not a pure variation of the algorithm, but useful for those who want to process larger corpora (20x speedup!)

Implementation

For this example, I’ve decided to use this hilariously interesting data set containing >80k UFO sighting reports from the National UFO Reporting Center (NUFORC). Kudos to Sigmoid Axel for organizing/scrubbing this data and to Oliver Cameron for bringing it to my attention via this article.

They just came to Earth for love. Don’t be scared ✌️👽

Here are a few interesting excerpts. I’ve left these in their raw form to give you an example of data generated by free-form text fields.

“possible anamolous object on TV show”

Possibly an artifact from editing, possibly last night’s lasagna on the screen. Better report it.

“Monkeys where they shouldn&#39t be&#44 and others saw them too.”

Unexplained monkeys. Must be aliens.

“isawabluestarmovingslowwthadstrobelightarounditthenitstopinthemiddleofthesky2seconsmovewestthenlightsoffbutthestrobekeptflashngmanycolo”

NO TIME FOR SPACES — THE PEOPLE MUST KNOW!

And an example of an informative and clear description:

“3 white lights in a triangle formation pointing down with a green light below the white lights. No strobes just solid lights”

This should give you a good brief on what non-academic text corpora look like. We see some encoding cruft (&#..), misspellings, and some reports that don’t make any sense at all.

For LDA, it’s important to remember: quality in = quality out. Preprocessing to the rescue!

I’ve decided to use the excellent text mining library gensim. Not only does it offer out-of-the-box LDA, it also offers a nice API for streaming topic modeling, some handy tools for creating corpora in different formats, TF-IDF, and other useful features. If you use Python for data science and haven’t explored this library, it’s definitely worth keeping on your radar.

Feature preprocessing

These steps are applicable to any machine intelligence problem that uses a simple “bag-of-words” representation as an input.

Basic text preprocessing

  1. Tokenization: beginning with unigrams for a standard bag-of-words representation. This may be extended to n-grams or paragraphs as described in the paper cited above.
  2. Removing stop words
  3. Stemming or lemmatization: You may want to take it a step further and reduce words to higher-level representations. In our sample corpus, words such as light, lighter, and lights should all be reduced to light. (In this case, I ended up removing these words, as they were found in nearly every document.)
  4. Expanding/replacing acronyms: I’ve found that in a number of domain-specific, real-world corpora, acronyms are used quite often and may need some subject matter expertise to understand. A minimal sketch of all four steps follows this list.
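Here’s that sketch, using gensim’s simple tokenizer and stop-word list plus NLTK’s Porter stemmer; the ACRONYMS map is a hypothetical stand-in for real domain knowledge.

```python
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical acronym map; building a real one takes subject matter expertise
ACRONYMS = {"nuforc": "national ufo reporting center"}

def preprocess(text):
    tokens = simple_preprocess(text)                    # tokenize into lowercase unigrams
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stop words
    tokens = [w for t in tokens
              for w in ACRONYMS.get(t, t).split()]      # expand known acronyms
    return [stemmer.stem(t) for t in tokens]            # stem: lights/lighter -> light

print(preprocess("3 white lights in a triangle formation pointing down"))
# -> ['white', 'light', 'triangl', 'format', 'point']
```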

Library specific techniques

For gensim, we need to ensure our corpus fits the format required by the library. There are some great tutorials here.
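Concretely, gensim wants a Dictionary (our fixed vocabulary) and a bag-of-words corpus. A sketch, assuming processed_docs holds the token lists from the step above; the filter_extremes thresholds are illustrative, not tuned:

```python
from gensim.corpora import Dictionary, MmCorpus

# processed_docs: a list of token lists, one per sighting description
dictionary = Dictionary(processed_docs)

# Trim the vocabulary: drop rare words and words in over half the documents
dictionary.filter_extremes(no_below=5, no_above=0.5)

# gensim's corpus format: each document as (token_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Optional: serialize so the corpus can be streamed from disk later
MmCorpus.serialize('ufo_corpus.mm', corpus)
```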

Tuning Hyperparameters

There are a couple of big ones that we’ll need to figure out:

  1. K, the number of latent topics to represent our corpus
  2. α, the Dirichlet parameter

Luckily, the library we’ve chosen has an API for automatically estimating α based on the data we train with.

Optimizing K, however, we’ll need to figure out ourselves. There are some algorithmic methods, but they generally lean towards varying K and retraining, guided by the perplexity measure.

Intuitively, perplexity should decrease as we increase the number of topics. But we eventually reach the point of overfitting, where the model loses the ability to generalize.

I ended up training my model with varying values of K, eventually settling on 20 topics after inspecting the perplexity and the token arrays produced by each topic.
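For reference, here’s roughly what that sweep looks like with gensim, reusing the corpus and dictionary from earlier. Note the sketch scores perplexity on the training corpus for simplicity; a held-out split would be more principled.

```python
from gensim.models import LdaModel

# alpha='auto' lets gensim learn an asymmetric Dirichlet prior from the data
for k in (5, 10, 15, 20, 25, 30):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, alpha='auto', passes=10)
    # log_perplexity returns a per-word likelihood bound; watch where
    # the gains flatten out as k grows
    print(k, lda.log_perplexity(corpus))
```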

Results

I’ll start off by saying that the dataset isn’t the best quality.

But no excuses. We can turn lead into gold by being clever.

First, I had to eliminate a lot of words from the vocabulary. There were a large number of misspellings. Some could be solved using nltk, but it’s hit or miss. Sometimes a big miss. You’ll see my logic if you check out the attached GitHub gist at the end of this article.
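For flavor, here’s the kind of edit-distance correction I mean. This is a sketch using NLTK’s English word list, not the exact logic from the gist; it’s slow (a brute-force scan per unknown token) and, as noted, hit or miss:

```python
import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance

nltk.download('words')  # one-time download of the English word list
english = set(w.lower() for w in words.words())

def correct(token, max_dist=2):
    """Map a token to the nearest dictionary word within max_dist edits."""
    if token in english:
        return token
    candidates = [w for w in english
                  if abs(len(w) - len(token)) <= max_dist
                  and edit_distance(token, w) <= max_dist]
    return min(candidates, key=lambda w: edit_distance(token, w)) if candidates else token

print(correct("anamolous"))  # 'anomalous', with luck; sometimes a big miss
```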

Also, some of the descriptions were just a few words in length or cut off completely. I’m not sure if this was due to a database field char limit or a result of the export process.

But, when life hands you lemons, you make some lemonade and brute-force an LDA model. I’m pretty pleased with some of the insights gathered.

Here’s the listing of topics: the top tokens with associated probabilities, plus my take on the underlying meaning. To derive meaning, I ended up reading a handful of descriptions containing these tokens and found the topics to be pretty accurate!
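(The raw strings below are what gensim prints for each topic; a hypothetical loop like this produces them.)

```python
# Each topic: its top 10 tokens with their associated probabilities
for topic_id, tokens in lda.print_topics(num_topics=20, num_words=10):
    print(topic_id, tokens)
```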

1. Silver cigars

u’0.113*”shaped” + 0.058*”craft” + 0.033*”cigar” + 0.027*”silver” + 0.016*”saw” + 0.016*”two” + 0.015*”saucer” + 0.014*”flying” + 0.013*”object” + 0.012*”high”’

2. Fast triangulars

u’0.059*”speed” + 0.053*”high” + 0.051*”moving” + 0.032*”fast” + 0.027*”formation” + 0.026*”triangle” + 0.024*”white” + 0.017*”rate” + 0.014*”altitude” + 0.014*”across”’

3. UFOs seen from inside

u’0.041*”saw” + 0.016*”outside” + 0.015*”night” + 0.015*”friend” + 0.014*”looked” + 0.014*”house” + 0.013*”home” + 0.012*”just” + 0.011*”see” + 0.010*”window”’

4. UFOs at the beach

u’0.029*”red” + 0.028*”beach” + 0.027*”like” + 0.020*”ocean” + 0.017*”triangle” + 0.016*”looked” + 0.012*”orange” + 0.012*”san” + 0.011*”flying” + 0.010*”black”’

5. Patriotic UFOs!

u’0.062*”white” + 0.036*”blue” + 0.035*”red” + 0.030*”flash” + 0.022*”disk” + 0.022*”amp” + 0.020*”green” + 0.018*”shaped” + 0.015*”oval” + 0.013*”hovering”’

6. Canadian reports

u’0.048*”night” + 0.026*”report” + 0.023*”lake” + 0.022*”object” + 0.022*”time” + 0.019*”three” + 0.012*”strange” + 0.011*”canadian” + 0.010*”hbccufo” + 0.010*”orange”’

7. Triangular crafts interacting with jets

u’0.027*”jet” + 0.024*”white” + 0.022*”car” + 0.021*”red” + 0.017*”direction” + 0.017*”front” + 0.016*”craft” + 0.013*”bottom” + 0.011*”triangular” + 0.011*”shaped”’

8. Daytime UFOs

u’0.030*”craft” + 0.024*”near” + 0.023*”three” + 0.021*”daylight” + 0.019*”close” + 0.015*”object” + 0.013*”unknown” + 0.012*”hovering” + 0.011*”two” + 0.011*”observed”’

9. Cloudy/smoky

u’0.039*”cloud” + 0.019*”noticed” + 0.016*”space” + 0.014*”trail” + 0.013*”north” + 0.013*”saw” + 0.011*”seen” + 0.010*”two” + 0.009*”smoke” + 0.009*”driving”’

10. Flashy UFOs

u’0.059*”moving” + 0.053*”flashing” + 0.039*”shape” + 0.035*”color” + 0.031*”red” + 0.025*”changing” + 0.023*”white” + 0.016*”slow” + 0.014*”blue” + 0.014*”fast”’

11. Black triangular

u’0.020*”right” + 0.019*”moved” + 0.018*”horizon” + 0.017*”black” + 0.017*”slowly” + 0.017*”u” + 0.016*”tree” + 0.016*”left” + 0.015*”triangle” + 0.015*”line”’

12. Texan UFOs

u’0.049*”west” + 0.048*”sighting” + 0.044*”east” + 0.020*”south” + 0.019*”moving” + 0.018*”north” + 0.014*”colored” + 0.013*”traveling” + 0.011*”texas” + 0.011*”near”’

13. Orange and red paired circular

u’0.063*”minute” + 0.027*”orange” + 0.023*”red” + 0.023*”moving” + 0.017*”blinking” + 0.017*”two” + 0.014*”seen” + 0.014*”circular” + 0.013*”white” + 0.012*”large”’

14. Likely shooting stars or space debris

u’0.082*”star” + 0.081*”like” + 0.036*”moving” + 0.028*”plane” + 0.025*”looked” + 0.018*”saw” + 0.015*”shooting” + 0.012*”move” + 0.011*”thought” + 0.011*”slow”’

15. Fireballs

u’0.114*”orange” + 0.057*”orb” + 0.034*”glowing” + 0.024*”formation” + 0.024*”moving” + 0.023*”red” + 0.019*”across” + 0.018*”object” + 0.017*”flying” + 0.015*”two”’

16. Red/white Low flying objects

u’0.049*”flying” + 0.041*”low” + 0.036*”white” + 0.026*”noise” + 0.022*”red” + 0.019*”sound” + 0.017*”orange” + 0.016*”craft” + 0.014*”large” + 0.013*”ground”’

17. Flying formations between separate crafts

u’0.065*”one” + 0.028*”two” + 0.016*”moving” + 0.015*”another” + 0.015*”turn” + 0.012*”back” + 0.012*”zig” + 0.012*”object” + 0.011*”moved” + 0.011*”pattern”’

18. Moon fireballs

u’0.051*”ball” + 0.023*”saw” + 0.022*”orange” + 0.021*”moving” + 0.021*”fire” + 0.017*”disappeared” + 0.016*”across” + 0.016*”white” + 0.016*”sphere” + 0.015*”moon”’

19. Venus sightings

u’0.073*”note” + 0.073*”nuforc” + 0.072*”pd” + 0.029*”sighting” + 0.029*”possible” + 0.024*”strange” + 0.021*”star” + 0.014*”report” + 0.012*”venus” + 0.010*”seen”’

20. Lots of colored lights

u’0.087*”seen” + 0.070*”fireball” + 0.040*”green” + 0.025*”red” + 0.023*”orange” + 0.018*”craft” + 0.015*”large” + 0.013*”triangular” + 0.012*”blue” + 0.011*”across”’

Now, let’s apply some of that human reasoning. Note: aside from the occasional X-Files episode, I haven’t really researched the UFO phenomenon, so let’s see what we can come up with.

  1. Silver cigars: the UFO watcher community has definitely categorized these already. A simple Google search for cigar UFOs comes up with a ton of results. My guess is that these reports come from people who have been conditioned by UFO culture, given that they don’t use other descriptive terms like cylinder, tube, etc. When they see an object in the sky that can’t be classified as something standard like a plane, bird, or Superman, their belief system overlays “cigar UFO.” That’s not to say there aren’t cigar-shaped craft about. There’s even a Huffington Post article about one sighting in Ukraine — with video!
  2. Fast triangulars: another class of UFOs from the community. Military aircraft are my best guess; the B-2 Spirit and F-117 Nighthawk are likely candidates here.
  3. UFOs seen from inside: Nothing too interesting here, people reporting seeing crafts from inside their homes
  4. UFOs at the beach: Also, nothing super interesting here, people reporting similar crafts from a beach location
  5. Patriotic UFOs!: I’m cheekily calling these patriotic as they include reports of blue lights. Standard aviation lights are white, red and green. No blue lights. Let’s think deeper — a possible explanation is color vision deficiency: people affected by tritanopia see blue in place of green on the visible light spectrum. It’s a rare genetic trait, but possibly tritans are more common than we originally thought. Also possible: aliens.
  6. Canadian reports: These seem to be exports from another system as they have some standard formatting
  7. Triangular crafts interacting with jets: Military aircraft generally fly in formation
  8. Daytime UFOs: Nothing of spectacular interest here, aside from hovering crafts?
  9. Cloudy/smoky: These reports either deal with amorphous shaped crafts, or are describing condensation trails left by crafts
  10. Flashy UFOs: Aircraft lights usually flash. Zzz
  11. Black triangular: Here’s another big class of UFOs from the community. Possible conditioning by increased interest in UFO parascience. Also, take a look at the in-service stealth aircraft links under fast triangulars for likely candidates.
  12. Texan UFOs: UFOs in Texas — clear skies, a large share of the US population, and a large number of US Air Force bases are all likely contributors here
  13. Orange and red paired circular: These are another class of commonly observed UFOs: fireballs. Lots of candidates here — space debris, meteorites. Pretty much anything that enters the Earth’s atmosphere is going to be flaming.
  14. Likely shooting stars or space debris: The mere fact that they are described as ‘star like’ kind of gives it away
  15. Fireballs: Orange/Red orbs attributable to the fireball class
  16. Red/white Low flying objects: These deal either with flight trajectories, e.g. ‘flying towards the ground’ or altitude, e.g. ‘flying low to the ground’
  17. Flying formations between separate crafts: More formation flight
  18. Moon fireballs: Fireballs around the moon
  19. Venus sightings: These seem to be appended NUFORC staff notes (note the ‘nuforc’, ‘note’ and ‘pd’ tokens), attributing the sighting to the planet Venus
  20. Lots of colored lights: These seem to be a mixture of fireballs and colored lights

So, our LDA model was successful in capturing some common UFO classes. Did machine learning prove the existence of aliens? Not quite.

I have some other hypotheses and questions that I think can be answered from the entire dataset and not just the descriptions I’ve used for this LDA model. Would love to see someone poke at this data as well.

  1. Are there any spatial/temporal trends? E.g. do specific crafts show up clustered around the same time/region?
  2. Are there patterns emerging from the vicinities of US Air Force bases?

If you decide to play with this data, let me know!

I will say that, if I were an alien who had travelled to this planet from across the cosmos, I wouldn’t spend time mindlessly flying around remote areas of the United States abducting people… I’d probably be dancing at a nightclub in Ibiza. Interstellar vacations don’t come cheap.

Here’s a GitHub gist — feel free to modify!
