Using BERT Sentence Embeddings, T-SNE and K-Means to Visualize and Explore Statements
A visual method for exploring natural clusters in transcribed speeches
In this article, I demonstrate a method for understanding natural clusters of statements in transcribed speeches. This is a useful way to understand latent themes in public speeches or other long form transcribed audio data. Additionally, I demonstrate how to visualize this data easily with Streamlit.
There are a few different tools that I’m using to put this analysis together.
- Sentence embeddings to create homogeneous statement representations
- K-Means to cluster statements
- T-SNE for dimensionality reduction
The data that I’m analyzing is the transcription from the first presidential debate between Joe Biden and Donald Trump.
For simplicity, I am only considering the statements made by Joe Biden.
There are natural extensions to this analysis, such as applying classifiers (Biden vs Trump), clustering all statements together, determining the probability of an interruption, etc. I’m considering that out of scope for the time being. Feel free to extend this analysis if you find it interesting!
- Extract and preprocess data
- Create sentence embeddings for each statement
- Cluster the statements using KMeans
- Apply TSNE to the embeddings from step #2
- Create a small Streamlit app that visualizes the clustered embeddings in a 2-dimensional space
Extracting and preprocessing the data
The data are already in good shape, so all I need to do is scrape the linked page and extract the statements of interest. Simple enough.
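The original scraping gist isn't reproduced here, but the extraction step can be sketched roughly as follows. The HTML layout and the `extract_speaker_statements` helper are my assumptions; transcript sites commonly wrap each statement in a `<p>` tag prefixed by the speaker's name:

```python
from bs4 import BeautifulSoup

def extract_speaker_statements(html: str, speaker: str) -> list[str]:
    """Pull every statement attributed to `speaker` from transcript HTML.

    Assumes each statement is a <p> tag of the form "Speaker Name: text",
    a common transcript layout -- adjust the selectors for the real page.
    """
    soup = BeautifulSoup(html, "html.parser")
    statements = []
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text.startswith(speaker + ":"):
            # Keep only the spoken text, dropping the speaker prefix
            statements.append(text.split(":", 1)[1].strip())
    return statements

sample = "<p>Joe Biden: We have to act.</p><p>Donald Trump: Wrong.</p>"
print(extract_speaker_statements(sample, "Joe Biden"))  # ['We have to act.']
```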
Preprocessing the data was also simple. I need to extract each statement from a sequence of statements. There are also some remnants of decoding errors, such as `\xa0word`. In addition, this transcription uses `--` for pauses, and in cases where Trump has interrupted, the statement ends with `--`, which could be informative for an embedding layer. If `--` occurs mid-statement, I'll consider it a pause and replace it with `<pause>`. If it occurs at the end of a statement, I'll consider it an interruption, tokenized with `<interrupt>`. See the gist below for how I accomplish this in a functional way.
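The gist itself isn't shown here; a minimal sketch of the described tokenization (the function name `tag_pauses_and_interrupts` is my own) might look like:

```python
import re

def tag_pauses_and_interrupts(statement: str) -> str:
    """Replace '--' with <pause> mid-statement and <interrupt> at the end."""
    # Clean up decoding remnants like non-breaking spaces (\xa0)
    statement = statement.replace("\xa0", " ").strip()
    # A trailing '--' means the speaker was cut off
    if statement.endswith("--"):
        statement = statement[:-2].rstrip() + " <interrupt>"
    # Any remaining '--' is treated as a pause
    statement = re.sub(r"--", "<pause>", statement)
    return statement

print(tag_pauses_and_interrupts("Well, I -- I was going to say --"))
# Well, I <pause> I was going to say <interrupt>
```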
Creating embeddings for each sentence
Sentence-BERT embeddings have been shown to improve performance on a number of important benchmarks, and have thus superseded GloVe averaging as the de facto method for creating sentence-level embeddings.
For a brief summary of how these embeddings are generated, check out:
Richer Sentence Embeddings using Sentence-BERT — Part I
I’m using this library to generate the embeddings. The authors have provided a number of pretrained models that work for our task. If you have a highly specific domain, you may want to fine-tune your own model.
This gives us a dense vector of length 768 for each sentence in our corpus.
Using K-Means to cluster the statements
Because I’m planning to visualize this data, I want these statements clustered with varying values of K. If you were looking to find the optimal value of K, use the gap statistic.
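One way to sketch this with scikit-learn: fit one KMeans model per candidate K so the visualization can switch between them. Random vectors stand in for the real embeddings here, and the range of K values is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real 768-dimensional statement embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))

# One set of cluster assignments per candidate K
cluster_labels = {
    k: KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(embeddings)
    for k in range(2, 9)
}
print(cluster_labels[4][:10])  # cluster assignment of the first 10 statements
```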
T-SNE for dimensionality reduction
Here, I use T-SNE to reduce the dimensionality for visualizing and exploring our clustered statements.
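With scikit-learn this is a one-liner; the perplexity value below is an assumption (it just needs to be smaller than the number of statements), and random vectors again stand in for the real embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))  # stand-in for the SBERT vectors

# Project the 768-d embeddings down to 2-d for plotting;
# perplexity must be less than the number of samples
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=42)
points_2d = tsne.fit_transform(embeddings)
print(points_2d.shape)  # (50, 2)
```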
Visualizing and exploring the data with Streamlit
Now that we have all of our data organized, we can create a simple Streamlit app and deploy it to Heroku. The app is here.
Please see this repo for an example of how to accomplish this.
Thanks for reading!