Deep Learning Demystified

This AI will enlighten you

Using GPT-2 to create a digital Terence McKenna

It’s been a long, strange trip. Not-so-good Photoshop: me.

**Disclaimer: in no way, shape, or form am I attempting to besmirch the late Terence McKenna.**

In this article, I want to demonstrate some of the latest techniques in natural language generation, and how transfer learning can be used to narrow down the generative process.

The opportunities and applications for this, done in the right way, are endless. Some interesting examples are lyric generation, automatic news article generation, chatbots, question answering systems, and more.

I occasionally listen to McKenna’s lectures, and find his viewpoints interesting and provoking. I chose him specifically because he was one of the first “New-Age” philosophers to really embrace the idea of Artificial Superintelligence. That being said, I thought it would be fitting to immortalize his thoughts as a set of matrices.

Let’s get into the core requirements

The underlying model: GPT-2

Note that the following explanation assumes you have some understanding of deep learning. If you just want to read some entertaining blurbs, skip to the end.

GPT-2 represents the latest and greatest in sequence prediction models, from OpenAI.

The original paper that spawned this architecture is here:

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D. & Sutskever, I. (2019), ‘Language Models are Unsupervised Multitask Learners’

It was trained on approximately 40GB of text scraped from web pages linked on Reddit, but researchers don’t have access to the full model. Instead, we’re working with a watered-down version. OpenAI states early in its introduction of the model:

“Due to our concerns about malicious applications of the technology, we are not releasing the trained model”

which is tantalizing, because the watered-down version is still pretty amazing.

At the highest level, GPT-2 learns to predict the next word from the previous words, while maintaining an internal memory (the context) of the words it has previously read. This allows it to do very interesting things, like generate mystery novels and keep track of the names it has used.

How it works

The network architecture — per OpenAI

Take a giant corpus of sentences. Create byte-pair encodings of every sentence, and add tokens that indicate starts/stops: <|endoftext|>
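
To make the byte-pair encoding step concrete, here is a toy sketch of how BPE merges are learned. It is a simplified character-level illustration, not OpenAI’s tokenizer (which works on bytes and ships with a pre-learned merge table). The function names `most_frequent_pair`, `merge_pair`, and `learn_bpe` are made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with a single merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges, words


# Hypothetical toy corpus; a real tokenizer learns tens of thousands of merges.
merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "low" becomes a single symbol, while rarer words like "lowest" are still spelled out from smaller pieces. That is the appeal of BPE: common strings get their own token and unusual ones can still be represented.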

  1. Feed each BPE-encoded sentence to an embedding layer
  2. The embeddings (fixed-length dense vectors, one per byte-pair token) are fed through a stack of transformer blocks. GPT-2 keeps only the decoder side of the original transformer, so there is no separate encoder
  3. Each block has a masked self-attention layer and dense layers. The mask stops a position from looking at later positions, and the training targets are the inputs shifted by one token
  4. Feed all of that to a linear layer and a softmax for probabilities
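
The masked self-attention and softmax steps above can be sketched in a few lines. This is a toy single-head version in plain Python, and it assumes the query, key, and value vectors are the token embeddings themselves. Real GPT-2 applies learned projection matrices, multiple heads, layer normalization, and residual connections, all omitted here. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(embeddings):
    """Single-head causal self-attention over a list of equal-length vectors.

    Assumption for brevity: queries, keys, and values are the embeddings
    themselves, with no learned projections.
    """
    dim = len(embeddings[0])
    outputs = []
    for i, query in enumerate(embeddings):
        # The causal mask: position i may only look at positions 0..i.
        visible = embeddings[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Three made-up 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in masked_self_attention(tokens):
    print([round(x, 3) for x in row])
```

Because of the mask, changing the third token never changes the output for the first two. That property is what lets the model be trained to predict each next token without peeking at it.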

This allows GPT-2 to work like so

Gif borrowed from Jay Alammar, who writes fantastic visual summaries of machine learning algorithms

Finetuning the model for Terence McKenna

Now that we have a trained GPT-2 (note that the training time can take days to converge depending on the size of your input), we can “fine-tune” it, using a specific set of sentences, say, tweets, or a corpus of text from a specific author. Then we ask it to generate new sentences, given this fine-tuning process.

This was all made possible by the excellent GPT-2 abstraction created by

I’ve embedded the notebook here:


  1. Extract data from a Terence McKenna book
  2. Fine-tune a trained GPT-2 with our corpus
  3. Generate short texts from our fine-tuned model

Prior to 2018, this would have been a week-long project. Thanks to the community, which has created a number of convenience wrappers and APIs, I can get GPT-2 running with a lot less code, in a lot less time. Thanks!

Extracting the data

This was simple. I got a PDF version and extracted the text, so now our book is in a flat file. The API does a lot of heavy lifting, so I don’t even need to preprocess the text into byte-pair encodings. I just feed it a raw text file with <|endoftext|> tokens to break the text into segments.
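
Here is a minimal sketch of that preparation step, assuming the extracted book text separates passages with blank lines. The helper name `build_training_text` is a placeholder, and splitting on blank lines is my assumption about the file layout rather than a requirement of any particular GPT-2 wrapper.

```python
import re

END_OF_TEXT = "<|endoftext|>"


def build_training_text(raw_text, min_chars=1):
    """Split raw text on blank lines and join the segments with <|endoftext|>."""
    segments = []
    for block in re.split(r"\n\s*\n", raw_text):
        # Collapse line wraps left over from the PDF into single spaces.
        cleaned = " ".join(block.split())
        if len(cleaned) >= min_chars:
            segments.append(cleaned)
    return f"\n{END_OF_TEXT}\n".join(segments)


# Made-up sample standing in for the extracted book text.
sample = "First passage,\nwrapped over two lines.\n\n\nSecond passage."
print(build_training_text(sample))
```

On a real file you would read the extracted text, pass it through the helper, and write the result back out. Raising `min_chars` is a cheap way to drop stray page numbers and running headers that survive PDF extraction.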

Fine-tune a pre trained GPT-2

Here, I ran the experiment using Google Colab, which allocates a Tesla-V100 for a low monthly price. If you’re a data scientist working on multiple projects, I highly recommend the paid version of Colab, which is ~$10 a month.


And kick off our training. [refer to the embedded notebook]

Now, we are faced with the hardest problem machine learning practitioners face: what do you do while your models are training?

My favorite model training pastime: Learn to play ping pong with nunchucks (caveat, I am still an amateur with nunchaku)

Generate some short texts from our trained model

This is actually quite amazing. Given that our model has been trained heavily on the English language and then fine-tuned on a corpus of our author’s writing, we can generate some realistic and thought-provoking “McKenna-isms”.

I’ve given the model a context phrase of “The New Age is”.

Example 1:

The New Age is a religion of peace, peace at any cost. Peace at all costs has become the new world order

Maybe. McKennabot, just maybe..

Example 2:

The New Age is not so much the re-creation of the eschatological object as it is the transformation of matter and energy in a way that is new and exaggerated

Interesting, tell me more..

Example 3:

The New Age is nothing less than a movement of consciousness from the ego to the supra-linguistic topical organismal system, a transformation of the mammalian body and mind into a philosophically minded and ecological organismal system.

Here are some other thought-provoking statements generated by our model. I need to remind you that the only thing occurring here “under the hood” is prediction of the next most likely term, given a context, the model’s internal memory, and a randomness element.
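
That last step, picking the next term from a probability distribution with some randomness, can be sketched as temperature sampling. The vocabulary and scores below are invented for illustration; a real model produces one score per token in its vocabulary, and `sample_next` is a hypothetical helper name.

```python
import math
import random


def sample_next(scores, temperature=1.0, rng=random):
    """Sample a token from {token: raw_score} after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    tokens = list(scores)
    scaled = [scores[t] / temperature for t in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights, so this is a softmax sample.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up scores for the word following "The New Age is".
next_word_scores = {"a": 2.0, "not": 1.5, "nothing": 1.0, "banana": -3.0}

rng = random.Random(0)
for temperature in (0.2, 1.0, 2.0):
    picks = [sample_next(next_word_scores, temperature, rng) for _ in range(10)]
    print(temperature, picks)
```

A low temperature makes the model stick to its top-scoring term, while a higher one flattens the distribution and lets less likely terms through. That randomness element is a big part of why the same context phrase can produce different outputs.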

The Eigenpattern is an effect of the spacing of the life spans of the ebb and flow of time effects that existed in the early universe. This effect is immensely important for any attempt to chronicle the history of time — for any to appreciate, as well as for anyone to ignore, the fact that for hundreds of years after the generation of the tree of life, all life was covered with a silence.

I have no idea what an Eigenpattern is, but this sounds like something that would cause a slow-applause at a campfire

It is the collective will of the human species to produce an intelligence that will supersede the will of the individual. Will is not an element of the equation.

This sounds a little doomsday-ish.

All of us were born in the world into which we had come. All of us became crypto-geometers of community and spread belief through the world until we had isolated ourselves so that no one else could. Evolution has made us the only species on the planet from which evolutionary energy has been able to escape. Only Homo sapiens, the only species to have arisen on an entirely new planet, has achieved the ability to collect and communicate with the language of the new planet.

Again, I’m surprised at the model’s ability. Nowhere in McKenna’s writing does “crypto-geometers” appear, yet it kind of makes sense in a New Age sort of way.

Hopefully, this demonstrates the amazing potential of this new technique. While I’m not 100% in agreement with McKennabot, I will say that its statements do generate some interesting thoughts.

Thanks for reading!
