Perceptually mapping music

Perceptually modeling music with deep learning nets. How far can we really push it?

We just discussed systems that try to model the structure of songs. Historically, people first focused on melodic or harmonic structure, since MIDI files capture just that information and can be used directly for training.

A good example of early research in this area is the work of Dr. David Cope at UCSC.

Another more recent example is the Project Magenta work involving transformer auto-encoders, again working off of MIDI files, and once again using this notion of temporal compression to force an embedding.

Then you have systems like Jukebox that work with the actual raw audio files and try to learn that melodic and harmonic structure from them. And hopefully more, like timbre for instance. Jukebox extends that reach further to include things like song lyrics and band metadata.

But can we take the mapping even further? Like down to modeling the actual production values for the mastering? Or separating the instruments from the amps and effects they are played through?

We'll discuss both of those ideas below in more detail.

So let's continue the thread we started yesterday. How can we train neural networks to understand the structure of music, and the inter-relationships between different kinds of music? What goes well together, and what does not?

DJ playlists actually do a good job of modeling what goes well together. At least in terms of how human DJs think about that problem.

Imagine you were building Spotify's song radio in 100 lines of Python.

So if we continue the line of thought laid out in the last 2 blog posts (neural nets learn representations of data anchored in lower dimensional manifolds because that accurately models the real world's statistics), we can think of the music mapping project we have laid out as the same thing. We're trying to learn and then manipulate the lower dimensional manifold that encompasses music.  Or in this case what we mean by music is how different pieces of music transition (a much more limited universe to be sure).

Now you may have heard of Word2vec. Or not. It's a group of related models that can be used to produce word embeddings. The models are shallow 2-layer neural networks trained to reconstruct the linguistic contexts of words.

So it takes a large corpus of training text and produces a lower dimensional vector space (typically several hundred dimensions). Words that share common contexts (they are related) will be mapped closer to each other in that space than words that aren't related. Pandora, for example, has 450 different dimensions in its human curated musical interrelationship database.

If you train Word2vec on collections of playlists containing song ids (as opposed to sentences containing words), it will learn the relationships between songs. So can this idea be extrapolated to deal with the song structure learning issue in Jukebox?
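To make the playlist idea concrete, here is a minimal sketch. It is not Word2vec itself: it swaps in the simpler count-based cousin, where each song is represented by how often it appears near other songs in a playlist, and songs are compared by cosine similarity. The song ids, the playlists, and the helper names (`cooccurrence_vectors`, `cosine`, `most_similar`) are all made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical playlists: each one is an ordered list of made-up song ids.
playlists = [
    ["metal_a", "metal_b", "metal_c", "metal_a", "metal_c"],
    ["metal_b", "metal_c", "metal_a", "metal_b"],
    ["folk_a", "folk_b", "folk_c", "folk_a"],
    ["folk_c", "folk_b", "folk_a", "folk_c", "folk_b"],
    ["metal_c", "folk_a"],  # a single crossover transition
]

def cooccurrence_vectors(playlists, window=2):
    """Count how often each song appears within `window` slots of another."""
    counts = defaultdict(lambda: defaultdict(float))
    for playlist in playlists:
        for i, song in enumerate(playlist):
            lo = max(0, i - window)
            hi = min(len(playlist), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[song][playlist[j]] += 1.0
    # Return plain dicts so lookups of unknown songs fail loudly.
    return {song: dict(ctx) for song, ctx in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(song, vectors, top_n=2):
    """Rank the other songs by how similar their playlist contexts are."""
    scores = [(other, cosine(vectors[song], vectors[other]))
              for other in vectors if other != song]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

vectors = cooccurrence_vectors(playlists)
print(most_similar("folk_a", vectors))
```

Songs that show up in the same kinds of playlist neighborhoods end up with similar vectors, which is the same intuition Word2vec exploits. A real system would presumably hand the playlists to an actual Word2vec implementation to get dense, lower dimensional vectors instead of raw counts.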

But let's go further. I would argue that there is a whole other set of learnable parameter models associated with a large corpus of music. Let's just consider the guitar parts for a second, and then try and extrapolate that to cover a wider range.

So electric guitars are played through amps and associated effects. In the old days you would plug an actual guitar into some ridiculously too loud tube amp and off you would go. So the sound that you hear is composed of several different models interacting. The model for the guitar and what it is generating, the model for the effects, and the model for the amp and associated speaker (and maybe also the mike you used to record it).

Of course nowadays all of that can just be a simulation you play through. So the effects, the amp, the speaker, the mike, are all just software models.

If you aren't familiar with any of this, take a look at the Line 6 Helix system. And if you extend it with Variax then you can even get into the level of modeling the actual instrument resonances and non-linearities, so you can virtually swap from playing a vintage Telecaster to a XXX.

So if the neural network is really good at modeling the statistics of the system, it should be able to capture this level of musical modeling as well. You could then use the system to transform, let's say, a bluegrass acoustic group into something that was playing through Metallica's stage rig at Madison Square Garden.

There is a similar thing going on with the overall production of all of the various musical bits inside of any song. Everything associated with how it was mastered, what kinds of compression, EQ, etc. were used to make the recording.

So if the neural network is really good at modeling all of the statistics of the system, it should be able to learn and reproduce this level as well.  The sheen of production level.  And there is a real artistry to this part of how songs are actually constructed. Someone like Butch Vig or Steve Albini is going to make a song sound very different than if it was produced and mastered by Giorgio Moroder or Brian Eno.

Nothing in this last section seems to depend on any long range statistics (the way song structure does). Can something like Jukebox learn these representations?

So if you are viewing the problem as a supervised learning problem, then here's one possible approach to getting at this additional song structure. Someone could try to label the audio files with tags for who produced them. Or tags for what kind of guitar and amp setup is being used. Or tags for specific effect processing being used on an instrument. So more metadata.

I wonder whether this idea could also be pursued to improve Jukebox's representation of song structure, since they already work with lyric metadata. It seems like you could work off of the lyrics to mark verse and chorus as a separate tag track within the song. You'd include that tag track as additional metadata when training the system.
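As a rough sketch of what such a tag track might look like as data, assume you somehow have section boundaries in seconds (for instance from time-aligned lyrics). The function below just expands them into one label per frame so the labels line up with the audio. The function name, the frame rate, and the example timings are all hypothetical, and this says nothing about how Jukebox actually consumes its conditioning.

```python
def section_tag_track(sections, duration_s, frames_per_second=10, default="none"):
    """Expand (start_s, end_s, label) sections into one label per frame."""
    n_frames = int(round(duration_s * frames_per_second))
    track = [default] * n_frames
    for start_s, end_s, label in sections:
        # Clamp each section to the bounds of the track.
        first = max(0, int(round(start_s * frames_per_second)))
        last = min(n_frames, int(round(end_s * frames_per_second)))
        for frame in range(first, last):
            track[frame] = label
    return track

# Hypothetical section timings for a very short song.
sections = [(0.0, 1.0, "intro"), (1.0, 3.0, "verse"), (3.0, 4.5, "chorus")]
track = section_tag_track(sections, duration_s=5.0, frames_per_second=2)
print(track)
```

Any frames not covered by a section keep the default label, so gaps in the annotation stay visible rather than being silently filled in.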

A different approach might be to constrain the structure of the neural model to specific functional blocks that represent the various components of how we are modeling the system. So let's say something like musical note structure - instrument with associated resonances and non-linearities - effects - amp - mike - mastering processing.
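Here is a toy sketch of that block structure, using fixed hand-written signal processing functions as stand-ins. In a neural version each block would presumably be its own learnable module, which is not shown here. Every function and parameter below is an illustrative assumption, not a model of any real guitar, amp, or mastering chain.

```python
import math

def guitar(n_samples, sample_rate=8000, freq_hz=110.0):
    """Stand-in instrument block: a slowly decaying sine 'pluck'."""
    return [math.exp(-3.0 * t / sample_rate)
            * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]

def overdrive(signal, gain=8.0):
    """Stand-in effect/amp non-linearity: tanh soft clipping."""
    return [math.tanh(gain * x) for x in signal]

def speaker(signal, smoothing=0.2):
    """Stand-in speaker/mike block: a one-pole low-pass filter."""
    out, prev = [], 0.0
    for x in signal:
        prev = prev + smoothing * (x - prev)
        out.append(prev)
    return out

def mastering(signal, target_peak=0.9):
    """Stand-in mastering block: simple peak normalization."""
    peak = max(abs(x) for x in signal) or 1.0
    return [x * target_peak / peak for x in signal]

def chain(*blocks):
    """Compose blocks into one rig; the signal flows left to right."""
    def run(signal):
        for block in blocks:
            signal = block(signal)
        return signal
    return run

# Same performance, two different rigs.
clean_rig = chain(speaker, mastering)
metal_rig = chain(overdrive, speaker, mastering)

dry = guitar(800)
clean = clean_rig(dry)
heavy = metal_rig(dry)
```

The point of the structure is that swapping one block (here, adding `overdrive`) changes the rig without touching the performance that feeds it, which is the kind of separation the bluegrass-through-a-metal-rig idea needs.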

In both of these schemes, we're trying to force the neural network to constrain itself in some way. Either by the kind of tags we are providing for the supervised learning part of the system, or in how the system itself is built.

