Multimodal Neurons in Artificial Neural Networks
There is a fascinating new paper out in Distill by some folks at OpenAI titled 'Multimodal Neurons in Artificial Neural Networks'. Anyone familiar with research into visual perception has heard of 'grandmother neurons', or the more updated 'Halle Berry neuron'.
This Distill paper analyzes an equivalent kind of phenomenon taking place in the feature representations constructed internally in the recent CLIP model. We covered CLIP in recent HTC posts here and here.
You can read the Distill publication online here.
You can also read all about this work on the OpenAI blog here.
Our old friend Yannic Kilcher has put together a really great analysis of the paper you can watch below.
The CLIP model consists of two components: a ResNet vision model and a Transformer language model.
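As a rough sketch of how those two components fit together: CLIP encodes images and text into a shared embedding space and scores them by cosine similarity. The toy below uses random linear maps as stand-ins for the real ResNet and Transformer encoders (all dimensions and weights here are made-up placeholders, not the actual model), just to show the two-tower similarity structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP's two towers: in the real model these are a ResNet
# image encoder and a Transformer text encoder. Random linear maps are
# used here purely as hypothetical placeholders.
D_IMG, D_TXT, D_EMB = 64, 32, 16
W_img = rng.normal(size=(D_IMG, D_EMB))
W_txt = rng.normal(size=(D_TXT, D_EMB))

def encode_image(x):
    z = x @ W_img
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit-normalize

def encode_text(t):
    z = t @ W_txt
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# A batch of 4 (image, caption) pairs, represented as random feature vectors.
images = rng.normal(size=(4, D_IMG))
texts = rng.normal(size=(4, D_TXT))

# Cosine-similarity logits between every image and every caption.
logits = encode_image(images) @ encode_text(texts).T

# Softmax over captions: for each image, a distribution over captions.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```

At inference time this is also how CLIP does zero-shot classification: the "captions" are class-name prompts, and the softmax row for an image gives its predicted class distribution.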
Yannic points out that CLIP is probably relying on the text input and associated language model more than the visual input for how it is grouping things internally. A really good example of this is shown below.
Now if you dig into the appendix, they talk about their new 'faceted feature visualization'. We are going to dive into this more in another post, since I believe this is the key to getting the cool visualization images they are showing.
Here we propose a new feature visualization objective, faceted feature visualization, that allows us to steer the feature visualization towards a particular theme (e.g. text, logos, facial features, etc), defined by a collection of images. The procedure works as follows: first we collect examples of images in this theme, and train a linear probe on the lower layers of the model to discriminate between those images and generic natural images.
We then do feature visualization by maximizing the penalized objective,
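The gist of that penalized objective can be sketched in a toy form: maximize the target neuron's activation plus a facet term from the linear probe, minus a regularizer. Everything below is a deliberately simplified stand-in for the paper's actual setup (linear "neuron" and "probe" directions instead of real network layers, an L2 penalty in place of the image regularizers, and made-up hyperparameters), just to show the shape of the optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 32
w_neuron = rng.normal(size=D)  # hypothetical target-neuron direction
w_probe = rng.normal(size=D)   # hypothetical facet direction from a linear probe
lam = 0.5                      # facet-steering strength (assumed hyperparameter)
mu = 0.1                       # L2 penalty standing in for image regularizers

def objective(x):
    # Neuron activation, plus the probe's facet score scaled by lam,
    # penalized by an L2 term to keep the optimized "image" bounded.
    return w_neuron @ x + lam * (w_probe @ x) - mu * (x @ x)

x = rng.normal(size=D)         # random initialization of the "image"
start = objective(x)
for _ in range(200):
    grad = w_neuron + lam * w_probe - 2 * mu * x  # analytic gradient
    x += 0.05 * grad           # gradient ascent on the penalized objective
end = objective(x)
```

The probe term is what "steers": increasing `lam` pulls the optimized input toward the facet the probe was trained to recognize, while `lam = 0` recovers plain feature visualization of the neuron alone.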