Zero-Shot Text-to-Image Generation
The OpenAI DALL-E paper and associated code are online now. OpenAI had posted results earlier, but the actual technique was still a matter of speculation, as discussed in this previous HTC post.
They use a discrete variational autoencoder (dVAE) to compress a 256x256 RGB image down to a 32x32 grid. Each cell of that grid can be thought of as an image token, and the image tokens are then fed into an autoregressive transformer along with a set of BPE-encoded text tokens.
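To make the token bookkeeping concrete, here is a rough sketch in plain Python of how the two token streams might be joined into one sequence for the transformer. It is not the paper's code. The function names are hypothetical, the dVAE encoder is replaced by a random stand-in, and the vocabulary sizes and text length limit (8192 image codes, 16384 BPE tokens, up to 256 text tokens) are figures I am taking from the paper, so treat them as assumptions here. Offsetting the image ids so both streams share one vocabulary is just one common way to do it.

```python
import random

IMAGE_SIZE = 256        # input image is 256x256 RGB (from the post)
GRID_SIZE = 32          # dVAE output is a 32x32 grid (from the post)
IMAGE_VOCAB = 8192      # assumed codebook size
TEXT_VOCAB = 16384      # assumed BPE vocabulary size
MAX_TEXT_TOKENS = 256   # assumed limit on caption length


def fake_dvae_encode(rng):
    """Hypothetical stand-in for the dVAE encoder.

    A real encoder maps pixels to codebook indices; this one just
    returns a 32x32 grid of random indices with the right shape.
    """
    return [[rng.randrange(IMAGE_VOCAB) for _ in range(GRID_SIZE)]
            for _ in range(GRID_SIZE)]


def build_sequence(text_tokens, image_grid):
    """Join text tokens and raster-scanned image tokens into one stream."""
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("caption is longer than the assumed text limit")
    # Flatten the grid row by row, shifting image ids past the text ids
    # so the two vocabularies do not collide (an illustrative choice).
    image_tokens = [tok + TEXT_VOCAB for row in image_grid for tok in row]
    return list(text_tokens) + image_tokens


rng = random.Random(0)
grid = fake_dvae_encode(rng)
sequence = build_sequence([11, 42, 7], grid)  # made-up text token ids

print(IMAGE_SIZE // GRID_SIZE)                          # downsampling per side
print(IMAGE_SIZE * IMAGE_SIZE * 3 // (GRID_SIZE ** 2))  # raw values per image token
print(len(sequence))                                    # text tokens + 1024 image tokens
```

The point of the compression is the sequence length: the transformer models 1024 image tokens instead of 196,608 raw pixel values, which is what makes autoregressive modeling of the whole image practical.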
Here's a very quick overview video that discusses the work.
Here's a link to the paper.
Here's a link to a GitHub repository with a PyTorch implementation of DALL-E.
Here's a link to the official GitHub repository (PyTorch) for the discrete VAE used for DALL-E.