Over the weekend I took my first stabs at fine-tuning GPT-2.
My first project was based on Max Woolf's article on how to train GPT-2 on tweets. I followed his method and Colab to train the 355M model on @jon__reed's Twitter. It was a smooth process!
The Colab is built on top of Max's gpt-2-simple library, which does a ton of work for you, such as importing from and exporting to Google Drive, picking sensible parameters for the TensorFlow session, and generally providing a nice interface.
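For reference, a minimal sketch of what a gpt-2-simple fine-tuning run looks like outside the notebook. The step counts and the dataset file name are illustrative placeholders, not the values I actually used; the function calls are the library's standard entry points.

```python
# Illustrative settings only; tune these for your own dataset.
FINETUNE_SETTINGS = {
    "model_name": "355M",      # which pretrained checkpoint to start from
    "steps": 1000,             # hypothetical number of training steps
    "restore_from": "fresh",   # start from the pretrained weights
    "run_name": "run1",
    "print_every": 10,
    "sample_every": 200,
    "save_every": 500,
}


def finetune_on_text(dataset_path, settings=FINETUNE_SETTINGS):
    """Download a pretrained GPT-2 checkpoint and fine-tune it on a text file."""
    # Imported lazily so the settings can be inspected without TensorFlow installed.
    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name=settings["model_name"])
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, dataset=dataset_path, **settings)
    # Print one sample from the fine-tuned checkpoint.
    gpt2.generate(sess, run_name=settings["run_name"])
    return sess
```

Calling `finetune_on_text("tweets.txt")` (a hypothetical file name) would kick off the whole download, train, and sample cycle in one go.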
Next, I had a more ambitious project in mind: training the 1558M model on transcripts of the Accidental Tech Podcast. I used Otter.ai to transcribe the most recent hundred episodes. The transcripts were exported as raw monologues without speaker annotations.
I ended up with two Colabs running, one fine-tuning on 774M, the other on 1558M.
To get there, I needed a GPU with adequate memory. I used a trick common in AI Dungeon Colabs to guarantee myself a P100, the strongest GPU available.
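Whatever trick you use to get the GPU, it helps to confirm which one the runtime actually handed you. This is only a sketch of that check, not the trick itself; it assumes `nvidia-smi` is on the path, as it is on Colab GPU runtimes, and the helper names are my own.

```python
import subprocess


def parse_gpu_names(nvidia_smi_output):
    """Split `nvidia-smi --query-gpu=name --format=csv,noheader` output into GPU names."""
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def attached_gpus():
    """Ask nvidia-smi which GPUs are attached; returns [] if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_names(result.stdout)


def has_p100(gpu_names):
    """True if any attached GPU reports itself as a P100."""
    return any("P100" in name for name in gpu_names)
```

Running `has_p100(attached_gpus())` at the top of the notebook tells you right away whether it's worth continuing or whether to reset the runtime and try again.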
Podcast transcripts, 774M
First I trained the 774M model on a ~5 MB file of ATP transcripts. I used the same Colab notebook as the tweet fine-tuning, just changing the downloaded model and tweaking the parameters.
However, I stumbled hard out of the gate, making a couple dumb mistakes:
Once the 774M fine-tuning was chugging along happily, I noticed that the samples it produced were much cleaner syntactically than the mistake-littered transcripts. The samples were easier to read than the transcripts, even if they weren't logically coherent.
I wonder if there's an opportunity to incorporate GPT-2 into the automated transcription process as a way to smooth out transcription artifacts. I also wonder if those artifacts would start to reappear as the loss went down.
Podcast transcripts, 1558M
In the meantime, I tried the same approach with the 1558M model. Immediately I ran into what I assume were memory issues, as the fine-tuning process got stuck and never started. I needed to fit the 1558M model onto the P100 somehow.
On Gwern's GPT-2 page, he mentions that halving the context window worked for loading 1558M onto the P100. I modified the gpt-2-simple repo to allow me to pass a context-length parameter, but even at a length of 256 I was never able to get the model up and running.
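Some back-of-the-envelope arithmetic suggests why shrinking the context alone might not be enough. This is a rough sketch under loud assumptions: fp32 everywhere, an Adam-style optimizer that keeps two extra buffers per weight, no memory-saving gradients, and the published GPT-2 1558M shape (48 layers, 25 heads, 1024-token default context). It ignores every activation except the attention score matrices, so treat the numbers as a lower bound, not a diagnosis.

```python
BYTES_PER_FP32 = 4
GIB = 1024 ** 3


def training_state_gib(n_params, copies=4):
    """Weights + gradients + two Adam moment buffers, all assumed to be fp32."""
    return n_params * BYTES_PER_FP32 * copies / GIB


def attention_scores_gib(n_layer, n_head, n_ctx, batch_size=1):
    """Memory for the n_ctx x n_ctx attention score matrices across all layers (fp32)."""
    return batch_size * n_layer * n_head * n_ctx * n_ctx * BYTES_PER_FP32 / GIB


state = training_state_gib(1_558_000_000)
full = attention_scores_gib(n_layer=48, n_head=25, n_ctx=1024)
quarter = attention_scores_gib(n_layer=48, n_head=25, n_ctx=256)

print(f"weights + gradients + optimizer state: {state:.1f} GiB")
print(f"attention scores at n_ctx=1024: {full:.2f} GiB")
print(f"attention scores at n_ctx=256:  {quarter:.2f} GiB")
```

Under these assumptions the context-dependent part shrinks 16x when going from 1024 to 256 tokens, but the context-independent state alone comes to roughly 23 GiB, more than a 16 GB card holds. That would be consistent with the run never starting no matter how short the context was.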
Next I switched the Colab runtime to TPU (still with a context length of 256). And it worked! The fine-tuning process was able to start, unlike on the P100 GPU. However, it was slow going, averaging only 48 steps per hour. I also noticed that something was eating heavily into the VM's ~35 GB of RAM. Hmmm...
Only after messing around with setting up my own TPU rig in GCP did I realize something was fishy with the 1558M run.
I set up a VM and a TPUv3-8, hoping to train a bit faster than on Colab (which seems to use a TPUv2-8). I used the exact same steps to initialize the fine-tune (still using my fork of gpt-2-simple), and immediately the process ran out of memory. TensorFlow was running on CPU and allocating the VM's RAM.
Long story short(er), gpt-2-simple doesn't support running on TPUs. There's additional setup required, and a bunch more tweaking of the TensorFlow code to get reasonable performance (something about bypassing the Estimator API).
My original 1558M run was entirely on the Colab CPU. I had gotten lucky in getting a VM with ample RAM, enough that the process didn't crash on startup. Wow! Lesson learned.
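A quick sanity check would have caught this much earlier. The sketch below asks TensorFlow which local devices it can see and flags the case where only the CPU shows up. It assumes the `COLAB_TPU_ADDR` environment variable that Colab TPU runtimes set for the remote TPU; the helper names are my own.

```python
import os


def summarize_devices(device_types, tpu_addr=None):
    """Describe what TensorFlow can use, given its local device types."""
    accelerators = sorted({d for d in device_types if d not in ("CPU", "XLA_CPU")})
    if accelerators:
        return "local accelerators: " + ", ".join(accelerators)
    if tpu_addr:
        # The TPU is a remote device; TensorFlow only uses it if the code connects to it.
        return "CPU only locally; TPU at " + tpu_addr + " needs an explicit connection"
    return "CPU only"


def local_device_types():
    """List the device types TensorFlow sees on this machine."""
    # Imported lazily so summarize_devices can be used without TensorFlow installed.
    from tensorflow.python.client import device_lib

    return [d.device_type for d in device_lib.list_local_devices()]


def check_runtime():
    """Combine the local device list with the (assumed) Colab TPU address variable."""
    return summarize_devices(local_device_types(), os.environ.get("COLAB_TPU_ADDR"))
```

On a TPU runtime, a "CPU only locally" answer from `check_runtime()` is the hint that code which never connects to the TPU will quietly train on the CPU.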
Since then, I've gotten a 1558M run going properly on Colab TPU, after much studying of Shawn Presser's TPU fine-tuning code and Colab notebook. I also heavily cross-referenced the Colab posted by Svilen Todorov ("Tenoke") on his blog.
The fine-tuning is definitely faster than on CPU. More to come!