Very Deep VAE (VDVAE) Architecture
U-Net and VAE architectures are like peanut butter and jelly: you just want to smash them together and see what happens. The paper 'Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images' by Rewon Child does just that. You can find it here.
As you can see from the block diagram above, we have the classic U-Net with the horizontal skip connections mashed into a VAE encoder-decoder structure.
The pool and unpool layers use average pooling and nearest-neighbor up-sampling. So immediately you could restructure that part to do better (since the whole point is image synthesis). Put on your thinking caps.
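To make the pool / unpool pair concrete, here is a minimal NumPy sketch of average pooling followed by nearest-neighbor up-sampling on a single-channel 2D array. It is not the paper's code: the function names `avg_pool2d` and `nearest_upsample2d` and the factor `k=2` are my own assumptions for illustration.

```python
import numpy as np

def avg_pool2d(x, k=2):
    """Average-pool an (H, W) array with a k x k window and stride k."""
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    # Split each spatial axis into (blocks, k), then average over the k-sized axes.
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def nearest_upsample2d(x, k=2):
    """Nearest-neighbor up-sample an (H, W) array by repeating each pixel k times per axis."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

x = np.arange(16, dtype=np.float64).reshape(4, 4)
pooled = avg_pool2d(x)                  # shape (2, 2)
restored = nearest_upsample2d(pooled)   # shape (4, 4), constant within each 2 x 2 block
```

Note that `restored` is blocky: everything finer than the pooled grid is gone, and nothing in the unpool step tries to put it back. That is the part you might want to rethink for image synthesis.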
The GELU non-linearity threw me for a minute (instead of ReLU). You can read about it here. It's a transformer thing, apparently.
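For anyone else thrown by GELU, here is a tiny pure-Python sketch of the exact form, x times the standard normal CDF, next to ReLU for comparison. The function names are mine, and this is the textbook definition rather than anything lifted from the VDVAE code.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    """ReLU for comparison: hard cutoff at zero."""
    return max(0.0, x)

# Compare the two on a few sample inputs.
for v in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={v:+.1f}  relu={relu(v):.4f}  gelu={gelu(v):.4f}")
```

The difference to notice is on the negative side: ReLU clamps to exactly zero, while GELU lets small negative values through and bends smoothly around the origin.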
They claim that N-layer VAEs are universal approximators of N-dimensional latent densities. So is the scale-space prior that the depth imposes on the computation the reason why?
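It helps me to write the factorizations side by side. This is my own notation for a generic top-down hierarchical VAE with latent groups z_1 through z_N, so treat the symbols as assumptions rather than the paper's exact equations.

```latex
% Hierarchical VAE: prior and approximate posterior over latent groups z_1, ..., z_N
\[
p_\theta(z) = p_\theta(z_1) \prod_{i=2}^{N} p_\theta(z_i \mid z_{<i}),
\qquad
q_\phi(z \mid x) = q_\phi(z_1 \mid x) \prod_{i=2}^{N} q_\phi(z_i \mid z_{<i}, x)
\]

% Autoregressive model over an N-dimensional x: same chain-rule shape
\[
p(x) = \prod_{i=1}^{N} p(x_i \mid x_{<i})
\]
```

As I read the argument, the prior has the same chain-rule shape as an autoregressive model, so with N layers the VAE can in principle mimic an autoregressive model over N dimensions, and with fewer layers it is free to find a shorter ordering.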
You can check out the PyTorch implementation here.
1: Cool architecture.
2: I don't know about their initial claim that generative models are going to be a boon to increased robustness in generative learning by providing artificial training data. In some sense, you aren't bringing any new information into the system, which is what real world data does. And data augmentation is like munging that data to build priors associated with munge-demunge into the system. What do generative models bring into the system when you are talking about training?
I guess you can generate artificial data to use to train the other system, but why not just take your generative model and apply its architecture / feature space representation to the new target system?
3: They make a point that VAEs are distinguished by their usage of latent variables. Are they saying that's because it's built into the model, as opposed to being strapped on later? Doesn't StyleGAN have latent variables in its architecture too?
Maybe the key to understanding what they are talking about is in this diagram from the paper.
4: Ok, this is perhaps the most fascinating result (and associated screen grab) from the paper. Check it out. Really take a close look at it.
Now this should be telling us something very important about what is going on in the system, what it is modeling, and how we could take advantage of that. Because it is a very different kind of scale space representation than more conventional approaches.