Pix2Pix: a GAN architecture for image to image transformation

I thought following up yesterday's TraVelGAN post with a Pix2Pix GAN post would be a useful way to compare what is going on in the two architectures.  Two different approaches to the same problem.

I stole this Pix2Pix Overview slide below from an excellent deeplearning.ai GAN course (note that they borrowed it from the original paper) because it gives you a good feel for what is going on inside of the Pix2Pix architecture.  

Note how the Generator part is very much like an auto-encoder architecture, but rebuilt using U-Net architecture features (based on skip connections) that fastai had been discussing in their courses for several years before the idea became more widely known to the deep learning community at large (and which originally came from an obscure medical image segmentation paper).

So the Generator in this Pix2Pix GAN is really pretty sophisticated, consisting of a whole image to image auto-encoder network with U-Net skip connections that help it generate better image quality at higher resolutions.  It's easy when first reading about Pix2Pix to think the paper is just describing a new auto-encoder architecture, and not realize that this is just the Generator part of a larger full GAN system with a Discriminator.
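To make the skip-connection idea concrete, here is a minimal NumPy-only sketch (a hypothetical toy, not the actual Pix2Pix generator): average pooling stands in for strided convolutions, nearest-neighbor repetition stands in for transposed convolutions, and each decoder level concatenates the matching encoder activations along the channel axis so fine spatial detail can bypass the bottleneck.

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution with 2x2 average pooling (stand-in for a strided conv)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """Double spatial resolution by nearest-neighbor repetition (stand-in for a transposed conv)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_like_pass(img):
    # Encoder: keep each level's activations for the skip connections.
    e1 = downsample(img)         # e.g. 256x256 -> 128x128
    e2 = downsample(e1)          # 128x128 -> 64x64
    bottleneck = downsample(e2)  # 64x64 -> 32x32

    # Decoder: at each level, concatenate the matching encoder features
    # along the channel axis before further processing -- the skip connection.
    d2 = np.concatenate([upsample(bottleneck), e2], axis=-1)          # 64x64, channels doubled
    d1 = np.concatenate([upsample(d2[..., : img.shape[-1]]), e1], axis=-1)
    return upsample(d1[..., : img.shape[-1]])                         # back to 256x256

out = unet_like_pass(np.random.rand(256, 256, 3))
print(out.shape)  # (256, 256, 3)
```

In the real generator each level also applies learned convolutions, but the structural point is the same: without the skip connections, everything would have to squeeze through the low-resolution bottleneck, as in a plain auto-encoder.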

The Discriminator in the Pix2Pix GAN is also interesting, consisting of a PatchGAN Discriminator network that outputs a classification matrix rather than a single real/fake score.  Each element of that matrix judges real vs fake for one overlapping patch of the image, and these patch based comparisons work to better reproduce interesting structure in the images (at the locations associated with the different patches).  Note that the patch comparison is pixel based.
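A quick way to see where the classification matrix comes from is to run the convolution output-size arithmetic for a typical PatchGAN layer stack.  The layer settings below (4x4 kernels, three stride-2 layers then two stride-1 layers, as in the common 70x70 PatchGAN configuration) are illustrative assumptions:

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of one conv layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 256  # input resolution assumed for illustration
# Three stride-2 convs, then two stride-1 convs (a typical 70x70 PatchGAN stack).
for stride in (2, 2, 2, 1, 1):
    size = conv_out(size, kernel=4, stride=stride, pad=1)

print(size)  # 30 -> a 30x30 matrix, each element classifying one input patch
```

So on a 256x256 input this Discriminator emits a 30x30 grid of real/fake decisions, one per overlapping receptive-field patch, instead of one global decision.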

An additional pixel distance loss term (a direct L1 pixel difference between the generated and real target image) is also added to the Generator's adversarial loss.  Again, this is there to better reproduce more realistic fine detail in the generated images.
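Putting the two pieces together, the generator objective is the adversarial term plus a lambda-weighted L1 pixel distance.  A minimal NumPy sketch (the arrays are toy stand-ins; lambda = 100 follows the original paper, and treating the patch scores as probabilities is a simplifying assumption):

```python
import numpy as np

def l1_pixel_loss(generated, real):
    """Mean absolute pixel difference between generated and real target images."""
    return np.mean(np.abs(generated - real))

def generator_loss(disc_patch_scores, generated, real, lam=100.0):
    # Adversarial term: the generator wants the Discriminator's patch scores
    # (probabilities here, for simplicity) to be close to 1, i.e. "real".
    adv = -np.mean(np.log(disc_patch_scores + 1e-8))
    # L1 term: direct pixel distance, weighted by lambda.
    return adv + lam * l1_pixel_loss(generated, real)

fake = np.full((8, 8, 3), 0.5)
real = np.full((8, 8, 3), 0.6)
patch_scores = np.full((3, 3), 0.9)  # the PatchGAN classification matrix
loss = generator_loss(patch_scores, fake, real)
```

With a large lambda the L1 term dominates, which is what pushes the generated image to stay close to the ground-truth target pixel-for-pixel while the adversarial term pushes for realistic texture.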

Here's the original Pix2Pix paper titled 'Image to Image Translation with Conditional Adversarial Networks'.  
I find the term 'translation' to be potentially misleading (since image processing and graphics folks think of image translation as being a kind of affine transformation of the image), so I prefer 'transformation' instead.

Since TraVelGAN is still fresh in your mind from yesterday's post, what is different about Pix2Pix vs TraVelGAN?  
Hint: it has to do with the latent space and how TraVelGAN constrains it in an organized way (organized with respect to the image data).  That, and working in latent space rather than direct image pixel space for computing image to image comparisons.

Another good question to think about is how Pix2Pix compares with CycleGAN.

There are a number of different papers that build off of (or tie into) the initial Pix2Pix architecture in interesting ways.  We point out a few of them below.

The 'High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs' paper presents the Pix2PixHD architecture, which synthesizes high resolution images from semantic label maps.

The 'Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network' paper details a super resolution GAN that enhances the GAN output resolution by 4X.

The 'Patch-Based Image Inpainting with Generative Adversarial Networks' paper shows off an Image Inpainting GAN that uses a patchGAN discriminator.

The 'Semantic Image Synthesis with Spatially-Adaptive Normalization' paper shows off the GauGAN architecture.

