Posts

Showing posts from March, 2021

HTC Updates: March Madness (deep learning updates)

 Time marches on very quickly when it comes to deep learning research these days. Quite a lot has been happening recently.  Henry AI Labs has these informative weekly updates to help us stay up to speed on the latest developments. This may seem like an information overload slogfest, but it is a great way to get an overview of what is happening (keep in mind this is still only a slice of current research activities in the field).  And then you can dive into specific papers you are particularly interested in.

HTC Seminar Series #32: How do we Inject Inductive Bias into a Deep Learning Model

We're mixing up the HTC seminar series formula a little bit in this post. Let's start off with a good podcast with Max Welling you can listen to here. It covers gauge equivariant CNNs, group equivariant CNNs, generative VAE Bayesian stuff, compressing a neural net for deployment, etc. Observations: The end discussion about modeling vs just interpolating to data is fascinating. The laws of physics have a few parameters. The stuff we care about lives in that world. So the real world lives on a manifold. We ideally want models that really model this, as opposed to just overfitting to a huge amount of data. Historically, we started by building models for our systems. Then we moved to just training our systems on data, because that did better than more limited models. But these systems are missing the underlying low dimensional model of the world, the manifold it lives in. They just try to interpolate the data. Here's a link to the Sutton rebuttal, titled 'Do we still need models or just

Perceiver: General Perception with Iterative Attention

Today's video is an analysis of a new transformer architecture paper put out by a group of people at DeepMind. And certainly Andrew Zisserman's name has been attached to really great papers for a very long time, and this one is interesting as well. They restructure the transformer architecture a little bit to reduce the computational complexity as your data size increases. They also define a uniform blank slate architecture that can be used for different tasks (vision, audio, 3D point clouds, text, etc). And with that intro we turn to our old pal Yannic Kilcher to give us his astute analysis of the paper. You can check out the paper titled 'Perceiver: General Perception with Iterative Attention' here. Observations: 1: Is the 'Fourier' style position encoding really just an elaborate way to build a scale space pyramid (encoding, whatever you want to call it) of the input data? 2: His comments about it being a recursive neural net if the weight encodings of th
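To make the computational trick concrete, here is a minimal sketch (PyTorch, not the DeepMind code) of the cross-attention bottleneck idea: a small learned latent array queries the big input array, so the expensive attention step scales with the number of latents rather than quadratically with the input size. The layer sizes and names below are made up for illustration.

import torch
import torch.nn as nn

# Minimal sketch of a Perceiver-style cross-attention bottleneck (not the
# authors' code): a small learned latent array attends to a large input array,
# so attention cost is O(num_latents * M) instead of O(M * M) in the input size M.
class CrossAttentionBottleneck(nn.Module):
    def __init__(self, num_latents=64, dim=128):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))   # learned queries
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

    def forward(self, inputs):                       # inputs: (batch, M, dim), M can be huge
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        latents, _ = self.attn(q, inputs, inputs)    # queries come from the latent array
        return latents                               # (batch, num_latents, dim)

x = torch.randn(2, 10000, 128)                       # e.g. 10k flattened pixels/points
print(CrossAttentionBottleneck()(x).shape)           # torch.Size([2, 64, 128])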

Full Stack Deep Learning - Deep Learning Fundamentals

Full Stack Deep Learning is an official UC Berkeley course on deep learning taking place this spring. It purports to cover the full-stack production work needed to get deep learning projects from theory or experiments to something actually shipping. You can experience it for free online. I've only watched the first Deep Learning Fundamentals lecture so far, but I was impressed by it, because it skips a lot of the bullshit and gets to the real meat of the material. So if the other lectures are like the first one, it is well worth your time to watch and absorb. And with that intro, let's dive into that first lecture on Deep Learning Fundamentals. Here's a link to some info on Full Stack Deep Learning. Here's a link to all of the first lecture material. Note that they have coding notebooks linked from there that you are going to want to work through to get the most out of it. We will probably be working through some more of this material in future HTC posts, since again it seems l

Curve Detectors

Let's take a look at a recent article in Distill that analyzes curve detectors in the InceptionV1 deep learning neural net. What do we even mean by curve detectors anyway? You could of course reread yesterday's HTC blog post. But in the discussion associated with today's highlighted Curve Detectors article, we are referring to 'curve neurons' in the InceptionV1 feature space. For example, here are the curve neurons in layer 3b. We are looking at a feature visualization of what an input image that maximally excites them looks like. Curve detectors are activated by input signals of a certain orientation. So if you have a bunch of them you can cover all of the different orientations. And you can get everything in between (reread the recent HTC material on steerable oriented filters). You can find the curve detector article here. You can find a guided overview of early vision in the InceptionV1 architecture here. Observations 1: The article claims it is surprising that curve detect

Banana Wavelets

Time to fire up the wayback machine once again, and take a look at a fascinating wavelet representation apparently lost to the annals of time. Let's take a look at the classic 'banana wavelet'. All of this is from the paper titled 'Learning Object Representations by Clustering Banana Wavelet Responses' by Peters and Kruger, which you can find here. A banana wavelet is composed of the product of a warped Gaussian and a curved wave function. We have discussed both scale space pyramids and oriented steerable filters recently, so you can now view the banana wavelet through that lens. You can build this beast with different orientations, as shown below. The Mobius topology thing above might seem confusing, but if you think about how you have to compute distance between points in an orientation space, it should start to make sense. If you are still confused, think about hue distance in a colorspace with a hue axis. Angle has the same modulo property. So now of
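If you want to play with the idea, here is a rough NumPy sketch of a curved Gabor-style kernel in the spirit of a banana wavelet: a Gaussian envelope times a wave whose phase is bent along a curved coordinate. The parameterization and names are illustrative, not the exact formulation from the Peters and Kruger paper.

import numpy as np

# Rough sketch of a 'banana' kernel: a Gaussian envelope multiplied by a wave
# bent along a curved coordinate. Parameters are illustrative only.
def banana_kernel(size=64, theta=0.0, freq=0.15, curvature=0.02,
                  sigma_x=10.0, sigma_y=6.0):
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2].astype(float)
    # rotate coordinates to the desired orientation
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    # bend the coordinate along the orientation axis -> curved wavefronts
    xc = xr + curvature * yr ** 2
    envelope = np.exp(-(xc ** 2 / (2 * sigma_x ** 2) + yr ** 2 / (2 * sigma_y ** 2)))
    wave = np.cos(2 * np.pi * freq * xc)
    return envelope * wave

# a small bank covering several orientations, as in the figure referenced above
bank = [banana_kernel(theta=t) for t in np.linspace(0, np.pi, 8, endpoint=False)]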

HTC Seminar Series #31: Geometric Deep Learning

Today's HTC seminar series is a talk titled 'Geometric Deep Learning, from Euclid to drug design', presented by Michael Bronstein virtually at Imperial College London. This is an awesome lecture. Grab a big cup of coffee, sit back, and take it all in. Observations: 1: Geodesic CNNs, super cool. Geodesic image processing is a little known thing, but very useful. It's all over Studio Artist, for example. 2: Note the discussion of priors in these systems. The built-in priors give you various kinds of symmetry transform invariance.

Multimodal Neurons in Artificial Neural Networks

There is a fascinating new paper out in Distill by some folks at OpenAI titled 'Multimodal Neurons in Artificial Neural Networks'. Anyone familiar with research into visual perception has heard of 'grandmother neurons', or the more updated 'Halle Berry neuron'. This Distill paper analyzes an equivalent kind of phenomenon taking place in the feature representation constructed internally in the recent CLIP model. We covered CLIP in recent HTC posts here and here. You can read the Distill publication online here. You can also read all about this work on the OpenAI blog here. Our old friend Yannic Kilcher has put together a really great analysis of the paper you can watch below. The CLIP model consists of two different components: a ResNet vision model and a Transformer language model. Yannic points out that CLIP is probably using the text input and associated language model more than the visual input for how it is grouping things internally. A really good example
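For reference, here is a tiny conceptual sketch of how a CLIP-style model scores an image against a set of text prompts via a shared embedding space. The real encoders are the ResNet vision model and Transformer language model mentioned above; the random feature vectors below are just stand-ins.

import torch
import torch.nn.functional as F

# Conceptual sketch only: cosine similarity between image and text embeddings,
# softmaxed over the candidate prompts.
def clip_style_scores(image_features, text_features):
    img = F.normalize(image_features, dim=-1)      # (1, d) embedding of one image
    txt = F.normalize(text_features, dim=-1)       # (n_prompts, d) embeddings of prompts
    return (img @ txt.t()).softmax(dim=-1)         # probability over the prompts

img_feat = torch.randn(1, 512)
txt_feat = torch.randn(5, 512)                     # e.g. 5 candidate captions
print(clip_style_scores(img_feat, txt_feat))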

Zero-Shot text-to-Image Generation

The OpenAI DALL-E paper and associated code are online now. They had posted the results before, but the actual technique was still under speculation, as discussed in this previous HTC post. The paper is out now. They use a discrete variational autoencoder (dVAE) to compress a 256x256 RGB image down to 32x32. That lower layer can be thought of as a set of image tokens, which are then fed into an autoregressive transformer along with a set of BPE-encoded text tokens. Here's a very quick overview video that discusses the work. Here's a link to the paper. Here's a link to a github repository with a PyTorch implementation of DALL-E. Here's a link to the official github repository (PyTorch) for the discrete VAE used for DALL-E.
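To make the token bookkeeping concrete, here is a schematic sketch (placeholder tensors, not OpenAI's code) of the sequence the autoregressive transformer is trained on: BPE text tokens followed by the 32x32 grid of dVAE image tokens. The vocabulary sizes are the ones reported in the paper.

import torch

# Sketch of the token stream DALL-E models autoregressively (shapes only).
text_tokens = torch.randint(0, 16384, (1, 256))           # BPE-encoded caption, up to 256 tokens
image_tokens = torch.randint(0, 8192, (1, 32 * 32))       # dVAE codes: 256x256 image -> 32x32 grid
sequence = torch.cat([text_tokens, image_tokens], dim=1)  # (1, 1280) combined sequence

# An autoregressive transformer is trained to predict each token from the ones
# before it; at sampling time the text tokens are given, the 1024 image tokens
# are generated one at a time, and the dVAE decodes them back to pixels.
print(sequence.shape)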

Pretrained Transformers as Universal Computation Engines

 Interesting analysis of a recent paper on using frozen transformers as a fixed prior architecture for function approximation.  So what basis function set did it learn?
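The basic recipe is sketched schematically below (this is not the authors' code; the paper itself uses a pretrained GPT-2 and also fine-tunes the layer norms): freeze the transformer body and train only small input and output projections for the new task.

import torch
import torch.nn as nn

# Sketch of the 'frozen pretrained transformer' idea: the self-attention and
# feed-forward weights stay fixed, only the input/output layers are trained.
class FrozenTransformerProbe(nn.Module):
    def __init__(self, pretrained_encoder, d_model, in_dim, n_classes):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                # the transformer body stays frozen
        self.embed = nn.Linear(in_dim, d_model)    # trained
        self.head = nn.Linear(d_model, n_classes)  # trained

    def forward(self, x):                          # x: (batch, seq_len, in_dim)
        h = self.encoder(self.embed(x))
        return self.head(h.mean(dim=1))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2)
probe = FrozenTransformerProbe(encoder, d_model=128, in_dim=16, n_classes=10)
print(probe(torch.randn(4, 32, 16)).shape)         # torch.Size([4, 10])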

Scale Space, Image Pyramids, and Filter Banks

This is a really great introductory lecture on linear scale space. Whether it is etched into your DNA, or a first time exposure, there are things to learn here. Here's a hint for the 'why is camouflage attire effective?' question he asks at the end. 1: Human perception of texture is based on a spatial channel model. He discusses spatial bandpass filters near the end of the lecture. 2: There is a texture metamerism effect that occurs when the signal response of the various texture channels is at the threshold of detection. It's how stochastic screening works. Think of signal masking in the various channels.
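If you want something to poke at while watching, here is a minimal sketch of a linear scale space and the difference-of-Gaussian bandpass channels that the spatial channel model is built from. The sigma values are arbitrary.

import numpy as np
from scipy.ndimage import gaussian_filter

# Minimal sketch of a linear scale space and its bandpass (difference-of-Gaussian) channels.
def scale_space(image, sigmas=(1, 2, 4, 8, 16)):
    return [gaussian_filter(image.astype(float), s) for s in sigmas]

def bandpass_channels(image, sigmas=(1, 2, 4, 8, 16)):
    levels = scale_space(image, sigmas)
    # each channel keeps the detail that lives between two neighboring scales
    return [fine - coarse for fine, coarse in zip(levels[:-1], levels[1:])]

img = np.random.rand(256, 256)
print([round(b.std(), 4) for b in bandpass_channels(img)])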

Deep Generative Modeling

The latest 2021 lectures from the MIT Introduction to Deep Learning class are trickling out into the known universe. And we have a great one to watch today. Ava Soleimany will school us on all things generative. Well, not all things, but it's a pretty good overview. She covers the VAE architecture, and I was struck by her description of how it works versus LeCun's energy model 'unified field theory' of generative models. It's fascinating to compare the two descriptions. And regularization is all about centering the mean while regularizing the variance. When you think about it that way, it seems a lot more straightforward than you might have thought at first. Then we get into the latent space, latent perturbation and disentanglement, GANs, a really great intuitive description of how GANs transform one distribution into another distribution, StyleGAN, CycleGAN, etc.
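That 'centering the mean while regularizing the variance' comment is just the standard VAE KL term against a unit Gaussian prior. A minimal sketch of that term and the reparameterization trick:

import torch

# KL( N(mu, sigma^2) || N(0, 1) ): pulls the posterior mean toward 0 and the
# log-variance toward 0, i.e. centers the mean while regularizing the variance.
def vae_kl_loss(mu, logvar):
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

def reparameterize(mu, logvar):
    # sample z = mu + sigma * eps so gradients flow through mu and logvar
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

mu, logvar = torch.zeros(8, 16), torch.zeros(8, 16)
z = reparameterize(mu, logvar)          # latent samples, shape (8, 16)
print(vae_kl_loss(mu, logvar))          # tensor(0.) when the posterior matches the prior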

Simplex Noise

So one issue with simple linear interpolation in the latent space of StyleGAN is that the timing of the transitions between the different points changes, because it's dependent on the distance between them in the latent space. So what people are using as an alternative is a noise loop interpolation instead of straight linear interpolation. This flashed by on the screen in a StyleGAN presentation I was watching, and I thought, oh, are they using Perlin noise to do that? Some googling re-activated my brain cells, because simplex noise is kind of Perlin noise 2.0. Better than the original if you want to generate it in high dimensions, for sure. Perlin discussed this in a paper in 2001, which I was very familiar with at the time. Of course I'm really, really familiar with the even older 80's paper that started the Perlin noise revolution. We are seriously cranking the wayback machine this week at HTC. Stefan Gustavson put together a really good writeup with some
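Here is a rough sketch of the noise loop idea: walk around a circle in a smooth 2D noise field (simplex noise in practice), using a different slice of the field for each latent dimension, so the animation loops perfectly and moves at a steady pace. To keep the sketch dependency-free, the smooth noise below is faked with a few fixed sinusoids; swap in a real simplex noise function for actual use.

import numpy as np

rng = np.random.default_rng(0)
freqs, phases = rng.uniform(0.5, 2.0, (3, 2)), rng.uniform(0, 2 * np.pi, 3)

def smooth_noise2(x, y):
    # stand-in for a 2D simplex noise function: smooth, bounded, deterministic
    return sum(np.sin(fx * x + fy * y + p) for (fx, fy), p in zip(freqs, phases)) / 3.0

def noise_loop_latents(latent_dim=512, n_frames=240, radius=1.5):
    offsets = rng.uniform(0, 100, (latent_dim, 2))   # a different noise slice per dimension
    angles = np.linspace(0, 2 * np.pi, n_frames, endpoint=False)
    frames = np.empty((n_frames, latent_dim))
    for d, (ox, oy) in enumerate(offsets):
        frames[:, d] = smooth_noise2(ox + radius * np.cos(angles),
                                     oy + radius * np.sin(angles))
    return frames                                    # feed each row to the generator

print(noise_loop_latents().shape)                    # (240, 512)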

SinGAN - Learning a Generative Model from a Single Natural Image

Ok, let's take a look at the fabulous SinGAN architecture (so many GAN architectures, so little time). This is some interesting work, but I think you want to put it in the right perspective. And with that cryptic intro, let's dive in first with a really great overview presentation by Tamar Shaham at Israel Computer Vision Day 2019. You can check out the actual paper, titled 'SinGAN: Learning a Generative Model from a Single Natural Image' here. You can check out the actual code (PyTorch, yeah) on Github here. For a different take on the actual architecture, we turn to our old pal Yannic Kilcher for his astute analysis. Ok, so we have a multi-scale architecture. Typical GAN structure where we input random noise and output an image from the generator, but we do that at every scale. Here's what the individual generator looks like at a single scale. Each of the 5 conv-blocks in the single scale generator above consists of the form Conv(3x3)-Batch Norm-LeakyReLU. Her
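Here's a rough PyTorch sketch of one single-scale generator along the lines described above (channel counts, padding, and the final Tanh are illustrative choices, not the official code): noise is added to the image coming up from the coarser scale, pushed through the conv blocks, and the result is added back as a residual refinement.

import torch
import torch.nn as nn

# Rough sketch of a SinGAN-style single-scale generator (illustrative, not the paper's code).
def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out),
                         nn.LeakyReLU(0.2))

class SingleScaleGenerator(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        blocks = [conv_block(3, channels)] + [conv_block(channels, channels) for _ in range(3)]
        self.body = nn.Sequential(*blocks, nn.Conv2d(channels, 3, 3, padding=1), nn.Tanh())

    def forward(self, prev_image, noise):
        # refine the upsampled result from the coarser scale as a residual
        return prev_image + self.body(prev_image + noise)

g = SingleScaleGenerator()
x = torch.randn(1, 3, 64, 64)
print(g(x, torch.randn_like(x)).shape)    # torch.Size([1, 3, 64, 64])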

Why Does Gatys et al Neural Style Transfer Work Best With Old VGG CNN Features?

 Does it really? Let's avoid a discussion of what 'works best' even means, let alone 'style'.  For now. I grabbed this archived discussion from reddit and copy/pasted it here below in case the one on reddit vanishes for some reason.  And it's a very interesting read, and it highlights some things we kept pointing out at HTC in many previous posts.  That there is something about the VGG architecture that seems to work well with a number of different neural net image transformation tasks. An acquaintance a year or two ago was messing around with neural style transfer ( Gatys et al 2016 ), experimenting with some different approaches, like a tile-based GPU implementation for making large poster-size transfers, or optimizing images to look different using a two-part loss: one to encourage being like the style of the style image, and a negative one to penalize having content like the source image; this is unstable and can diverge, but when it works, looks cool. (Exa
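For context, the 'style' half of that two-part loss is the usual Gatys et al Gram-matrix loss on CNN feature maps. A minimal sketch, with placeholder feature tensors standing in for real VGG activations:

import torch

# Gram-matrix style loss sketch; features would normally come from VGG layers.
def gram_matrix(features):                  # features: (batch, channels, h, w)
    b, c, h, w = features.shape
    f = features.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)    # channel-by-channel correlations

def style_loss(gen_features, style_features):
    return torch.mean((gram_matrix(gen_features) - gram_matrix(style_features)) ** 2)

gen = torch.randn(1, 64, 128, 128, requires_grad=True)
sty = torch.randn(1, 64, 128, 128)
print(style_loss(gen, sty))
# the 'two-part loss' described above adds a content term with a negative weight on top of this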

Classic Paper Day #2: Steerable Filter Double Header + a special bonus paper

The much neglected classic paper day tag resurfaces for a double feature presentation. These are papers anyone working with convolutional neural networks (especially visualizing their feature maps) should probably be very familiar with (but you get the sense that is probably not true for many of the younger people). We will start with the classic paper by Freeman and Adelson titled 'The Design and Use of Steerable Filters', from a PAMI Transactions issue way back in 1991. You can find the pdf here. And of course no read through is complete without David Heeger's notes on Steerable Filters from almost a decade later. And we immediately segue into the Steerable Pyramid, and the followup classic paper titled 'The Steerable Pyramid: A Flexible Architecture for Multi-Scale Derivative Computation' by Simoncelli and Freeman. You can find the pdf here. Eero Simoncelli has a great set of notes on steerable pyramids you can get here. It's probably worth also pointing out
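The punchline of the Freeman and Adelson paper fits in a few lines of NumPy: for a first derivative of a Gaussian, the filter at any orientation theta is just cos(theta) times the x-derivative basis filter plus sin(theta) times the y-derivative one. A minimal sketch (kernel size and sigma are arbitrary):

import numpy as np

# Steering a first-derivative-of-Gaussian filter from two basis filters:
# G_theta = cos(theta) * G_0 + sin(theta) * G_90
def gaussian_derivative_basis(size=21, sigma=3.0):
    y, x = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1].astype(float)
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return -x / sigma ** 2 * g, -y / sigma ** 2 * g     # d/dx and d/dy of the Gaussian

def steered_filter(theta, size=21, sigma=3.0):
    gx, gy = gaussian_derivative_basis(size, sigma)
    return np.cos(theta) * gx + np.sin(theta) * gy

k = steered_filter(np.pi / 4)
print(k.shape)     # (21, 21) kernel oriented at 45 degrees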

RepVGG Convolutional Neural Net Architecture

There's an interesting new paper out called 'RepVGG: Making VGG-style ConvNets Great Again'. Were they ever not great? I guess in today's new-architecture-of-the-month mad dash of deep learning research, they are old news. But oftentimes the mad dash is all about just using larger and larger models, or a new architecture so you can get your new paper published. There is value in rethinking old architectures, especially if by restructuring them you can get them to train and run better. Because then you might actually understand something about how they work internally. Which is the ultimate goal. The abstract lays out very clearly why it's worth understanding what is going on in RepVGG. We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time an
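The core re-parameterization trick is easy to sanity check numerically. Here is a simplified sketch (BatchNorm fusion and biases omitted) showing how the training-time 3x3, 1x1, and identity branches collapse into one 3x3 convolution for inference:

import torch
import torch.nn.functional as F

# Fold the 1x1 branch and the identity branch into a single 3x3 kernel:
# pad the 1x1 kernel to 3x3 and express identity as a 3x3 kernel with 1 at its center.
def fuse_branches(w3x3, w1x1, channels):
    w1x1_as_3x3 = F.pad(w1x1, [1, 1, 1, 1])                        # (C, C, 3, 3)
    identity = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        identity[c, c, 1, 1] = 1.0
    return w3x3 + w1x1_as_3x3 + identity

C = 8
x = torch.randn(1, C, 16, 16)
w3, w1 = torch.randn(C, C, 3, 3), torch.randn(C, C, 1, 1)
multi_branch = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1) + x    # training-time form
single_conv = F.conv2d(x, fuse_branches(w3, w1, C), padding=1)     # inference-time form
print(torch.allclose(multi_branch, single_conv, atol=1e-4))        # True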

Unifying VAEs and Flows into one Framework

Most people break down generative deep learning models into one of three categories: GANs, VAEs, and Flows. We have covered the first two quite a bit here at HTC. We have not really done so with the Flow architectures. I've been trying to grok Flows recently, and came across this very interesting presentation by our old friend Max Welling called 'Make VAEs Great Again: Unifying VAEs and Flows'. In it, he explains both, lays out the differences between them, and then tries to set up his own unified field theory of generative models where you can analyze them both in the same framework. Yann LeCun has his own 'unified field theory' of generative models as well (energy based models), which we covered in a previous post.

Attention is Not All You Need - the transformer backlash begins

With the end of my 45-day-straight, 16-hour-a-day campout in the debugger doing stack traces for a major software release, I finally have some time to get back to this blog. And in the fast paced world of deep learning, there are a ton of new things to cover. A paper was just released called 'Attention is not all you need: pure attention loses rank doubly exponentially with depth'. It tries to dive into answering the question 'why do transformers work'. And it turns out our old friend the skip connection plays the key role (surprise). Here's the abstract. Attention-based architectures have become ubiquitous in machine learning. Yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention poss
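Here is a toy numerical illustration of the point (a cartoon, not the paper's proof): repeatedly applying a pure self-attention mixing step drives all the token vectors toward each other, while adding a skip connection keeps them spread out.

import torch

# Toy rank-collapse illustration: softmax attention is a row-stochastic averaging
# of the token vectors, so stacking it without residuals blurs the tokens together.
def attention_mix(x):
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x0 = 0.1 * torch.randn(16, 64)                # 16 tokens, 64 dims
x_pure, x_skip = x0.clone(), x0.clone()
for _ in range(20):
    x_pure = attention_mix(x_pure)            # pure attention: tokens blur together
    x_skip = x_skip + attention_mix(x_skip)   # residual path keeps tokens distinct

spread = lambda t: (t - t.mean(dim=0)).norm().item()
print(spread(x_pure), spread(x_skip))         # first number collapses toward 0, second does not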

Meta-Gradients in Reinforcement Learning

Some person or entity named Wiki P Edia has the following thoughts on this whole 'meta' concept. Any subject can be said to have a metatheory, a theoretical consideration of its properties, such as its foundations, methods, form and utility, on a higher level of abstraction. In a rule-based system, a metarule is a rule governing the application of other rules, and "metaprogramming" is writing programs that manipulate programs. So extrapolate as you wish when thinking about meta-gradients or meta-learning in deep learning neural nets and in reinforcement learning systems (because they are somewhat different). I like the concept that machines that learn should learn how to train themselves more efficiently.