OpenAI's Jukebox

So I was all set to talk about GPT-2 today. And since you have all watched the last HTC Seminar, you are all familiar with it at a basic introductory level. But we're going to do a 180 degree turn and instead focus on the brand new Jukebox neural net model. Jukebox is a neural net that generates music as raw audio in a variety of genres and artist styles. So kind of like what we described in the Magenta post, but way more intense.

So yeah, Jukebox is related to GPT-2 on some level. They are both trying to predict the future based on what they were trained on. But because Jukebox wants to model real audio, the training sequences it is dealing with and the statistics it hopes to capture are all very long.

A typical 4-minute song at CD quality (44.1 kHz, 16-bit) has over 10 million time steps. For comparison, GPT-2 had 1,000 time steps and OpenAI Five took tens of thousands of time steps per game. Thus, to learn the high level semantics of music, a model would have to deal with extremely long-range dependencies.
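The sequence lengths above come straight from the sample-rate arithmetic:

```python
# Rough sample-count arithmetic for the sequence lengths mentioned above.
SAMPLE_RATE = 44_100          # CD-quality audio, samples per second
SONG_MINUTES = 4

samples = SAMPLE_RATE * 60 * SONG_MINUTES
print(f"{samples:,} time steps")   # 10,584,000 -- over 10 million
```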

So originally OpenAI focused on MuseNet, which synthesizes MIDI music based on large MIDI data training sets.  Jukebox boldly goes beyond this to model and hopefully reproduce actual raw audio files.

Jukebox is an auto-encoder neural model. So it's using compression to force a certain kind of learning structure. And this approach to building neural nets has been popular for a long time.
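To make the compress-then-reconstruct idea concrete, here is a toy linear auto-encoder in NumPy. This is only an illustration of the bottleneck structure, not Jukebox's actual architecture (its real encoders and decoders are deep convolutional nets), and all the sizes and weights here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear auto-encoder: squeeze an 8-dim input through a 2-dim
# bottleneck and reconstruct it. The narrow bottleneck is what forces
# the model to learn a compressed representation of the data.
W_enc = rng.normal(size=(8, 2))   # encoder weights (hypothetical)
W_dec = rng.normal(size=(2, 8))   # decoder weights (hypothetical)

x = rng.normal(size=(1, 8))       # a fake input "frame"
code = x @ W_enc                  # compressed latent representation
x_hat = code @ W_dec              # reconstruction from the bottleneck

print(code.shape, x_hat.shape)    # (1, 2) (1, 8)
```

In a trained model the weights would be fit to minimize the reconstruction error between `x` and `x_hat`.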

The particular approach Jukebox takes is based on VQ-VAE quantization. It works at 3 different levels, with the top level aiming to model and then reproduce the long-range structure in the data. The bottom 2 levels focus on learning and reproducing the local musical structure.
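The "VQ" in VQ-VAE is the quantization step: each continuous latent vector gets snapped to the nearest entry in a learned codebook, turning the audio into a sequence of discrete codes. Here is a minimal sketch of that lookup; the codebook and latent sizes are made up for illustration (Jukebox's codebooks at each level are much larger):

```python
import numpy as np

rng = np.random.default_rng(0)

# Core VQ-VAE step: snap each continuous latent vector to its nearest
# codebook entry, producing a discrete code per frame.
codebook = rng.normal(size=(16, 4))        # 16 code vectors, 4 dims each
latents = rng.normal(size=(5, 4))          # 5 latent frames to quantize

# Euclidean distance from every latent to every codebook entry
dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
indices = dists.argmin(axis=1)             # discrete codes, one per frame
quantized = codebook[indices]              # the quantized latents

print(indices.shape, quantized.shape)      # (5,) (5, 4)
```

The discrete code sequence is what the higher-level prior model then learns to predict, GPT-style.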

So you would expect that you might want a really large training set if you had any hope of this approach working well. They started with a dataset of 1.2 million songs and associated lyrics and metadata (artist, album, genre, year released, mood) that were captured by crawling the web.

And now we have a lovely tie in to yesterday's post on t-SNE. Because they used it to visualize how their model learns all of this audio data. Take a look.
So now you are getting a better sense of what t-SNE is doing. It's modeling the lower dimensional manifold that the neural net learned from the training data. Keep in mind that the actual learned manifold might be of higher dimensionality than the 2D map shown here, so there's another transformation going on at that point to get it down to the map dimensionality.
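If you want to play with this kind of map yourself, scikit-learn's t-SNE will do the squeeze-to-2D step. The embeddings below are just random stand-in clusters, not Jukebox's learned representations, which is what OpenAI's actual map was built from:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for learned audio embeddings: 100 points in 32 dimensions,
# drawn as two loose clusters so the 2D map has something to separate.
embeddings = np.concatenate([
    rng.normal(loc=0.0, size=(50, 32)),
    rng.normal(loc=5.0, size=(50, 32)),
])

# t-SNE squeezes the 32-D points down to 2-D coordinates for plotting.
coords = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(embeddings)
print(coords.shape)   # (100, 2)
```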

The whole lyric aspect to this system is fascinating as well. They had to align the text lyrics with the actual audio, so that each word matches up with its position in the raw audio file. So they are using something called Spleeter to extract vocals from the raw audio of a song, and then running NUS AutoLyricsAlign to get the precise word-level alignment.
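The output of that word-level alignment is essentially a list of (word, start time, end time) records. Here is a hypothetical sketch of what working with such records could look like; the field names and the helper function are made up for illustration, not the actual AutoLyricsAlign output format:

```python
# Hypothetical word-level alignment records: each lyric word paired
# with its start/end time (in seconds) within the raw audio.
alignment = [
    {"word": "hello", "start": 12.40, "end": 12.85},
    {"word": "world", "start": 12.90, "end": 13.30},
]

def words_in_window(alignment, t0, t1):
    """Return the lyric words whose timing overlaps [t0, t1] seconds."""
    return [a["word"] for a in alignment if a["start"] < t1 and a["end"] > t0]

print(words_in_window(alignment, 12.0, 13.0))   # ['hello', 'world']
```

This is the kind of lookup that lets the model condition the audio it generates at a given moment on the lyric words that belong there.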

There is a separate auto-encoder added to what we discussed above to deal with learning the representation for the lyrics.

So how well does it work? Check out the examples they provide. Or even better, try working with it (they provide source code). They are also hiring.

It appears that the current model does not really learn about aspects of song composition (things like verse-chorus-verse-chorus-middle-bit-verse-chorus). So maybe it's not modeling the really long term structure as well as it should.
