HTC Seminar Series #16 - Style and Structure Disentanglement for Image Manipulation

This week's HTC Seminar Series talk focuses on how to make deep learning systems that transform images more user controllable. Such a system is a neural net that takes an image as its input, processes it using a trained deep learning model, and outputs the result as a new image.

We would like to add user adjustable slider controls to the deep learning model.  

We would also like these slider controls to correspond to some aspect of how a human perceives the difference between the input image and the generated output image, so that adjusting a slider results in a useful and understandable change to the transformation the deep learning model generates.

Richard Zhang is a research scientist at Adobe Research. His presentation is entitled 'Style and Structure Disentanglement for Image Manipulation'. It was presented at the AIM Workshop, ECCV 2020. Note that there are a lot of different co-authors associated with this work.

I've jotted down a few key topics covered in the presentation below.

3 different approaches for building systems that learn 'style' transfer:

paired translation - we have examples of input and desired transformation of input to train system

unpaired translation - we have examples of input and output categories, but no direct match ups

    - example: make horse look like zebra

unlabeled collection of images - system learns everything about data in unsupervised way

    - we can potentially apply labels afterwards
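To make the difference between the first two setups concrete, here is a minimal sketch of how a training sample might be drawn in each case. The function names and the string stand-ins for images are my own hypothetical placeholders, not anything from the talk.

```python
import random

def sample_paired(pairs, rng):
    """Paired translation: every input is stored with its own desired output."""
    return rng.choice(pairs)

def sample_unpaired(domain_a, domain_b, rng):
    """Unpaired translation: draw independently from each category.

    There is no guarantee that the two draws depict the same scene,
    and the two collections do not even need to be the same size.
    """
    return rng.choice(domain_a), rng.choice(domain_b)

# Hypothetical file names standing in for real images.
rng = random.Random(0)
pairs = [("photo_1", "target_1"), ("photo_2", "target_2")]
horses = ["horse_1", "horse_2"]
zebras = ["zebra_1", "zebra_2", "zebra_3"]

paired_sample = sample_paired(pairs, rng)
unpaired_sample = sample_unpaired(horses, zebras, rng)
```

In the paired case the training loss can compare the output directly against its known target. In the unpaired case there is no target for a given input, which is why approaches like CycleGAN and CUT need some other constraint to keep the output tied to the input.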

New work based on patch-based contrastive loss

softmax over cosine similarities, InfoNCE loss

input -> encoder -> decoder -> output -> encoder

match features from the 2 encoder passes, as opposed to comparing images

    - this is worth noting, because if you just quickly glance at the paper you come away thinking they are working with image patches directly, but he says that is not really the case. The encoder features correspond to different spatial neighborhoods of the image depending on where you dip into them, and this is the 'patch'.

    - At least this is my understanding. We should really look directly at the code to see what they are specifically doing.
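Here is a rough sketch of what a patch-wise contrastive loss of this kind could look like, written in plain Python so it runs without any deep learning library. It is my own reading of the idea, not the authors' code. The function names, the list-of-vectors feature format, and the temperature default are all assumptions. Each feature vector stands in for what the encoder produces at one spatial location, and the loss rewards the output's feature at location i for being most similar to the input's feature at the same location i, with every other input location acting as a negative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def patch_contrastive_loss(output_feats, input_feats, temperature=0.07):
    """Softmax cross-entropy over cosine similarities, averaged over locations.

    output_feats[i] and input_feats[i] are (hypothetical) encoder feature
    vectors for the same spatial location in the output and input images.
    The temperature default is an assumption made for illustration.
    """
    total = 0.0
    for i, query in enumerate(output_feats):
        logits = [cosine_similarity(query, key) / temperature
                  for key in input_feats]
        # Numerically stable log-sum-exp over all candidate locations.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        # Cross-entropy with the matching location as the correct class.
        total += log_sum_exp - logits[i]
    return total / len(output_feats)

# Toy features for three spatial locations.
feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = patch_contrastive_loss(feats, feats)
shuffled_loss = patch_contrastive_loss(feats, [feats[1], feats[2], feats[0]])
```

When the output features line up with the input features location by location, the loss is close to zero. When the locations are scrambled, it is much larger. Notice that nothing here compares pixels; everything happens on encoder features.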

CUT (Contrastive Unpaired Translation) - faster than CycleGAN

FastCUT - faster than CUT, results more like CycleGAN

Here's a link to the Contrastive Learning for Unpaired Image-to-Image Translation work.

This first paper has the CUT, FastCUT work discussed above in it.

Here's a link to the Swapping Autoencoder for Deep Image Manipulation work.

This second paper has some cool ideas in it. One is the notion of grabbing part of the embedded representation of one image and inserting it into the corresponding part of the embedded representation of a second image, with the hope that you can separate out things like 'structure' vs 'texture' by grabbing different parts of the embedded representation.

I was hacking something in fastai to try and do this when I unearthed this recently published research that experiments with it in an autoencoder architecture.
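The swapping idea can be sketched without any real network. Below, a latent code is just a pair of lists, and the `LatentCode`, `swap_texture` and `blend_texture` names are hypothetical ones I made up for illustration. A real system would get these codes from a trained encoder and feed the result to a decoder. The `blend_texture` function is my own addition to show how a slider control could plug in, and is not something I am attributing to the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class LatentCode:
    """Hypothetical two-part embedding of one image."""
    structure: List[float]  # assumed to capture spatial layout
    texture: List[float]    # assumed to capture overall appearance

def swap_texture(a: LatentCode, b: LatentCode) -> LatentCode:
    """Keep the structure part of `a`, take the texture part from `b`."""
    return LatentCode(structure=a.structure, texture=b.texture)

def blend_texture(a: LatentCode, b: LatentCode, amount: float) -> LatentCode:
    """Slider-style control: amount=0 keeps a's texture, amount=1 uses b's."""
    mixed = [(1.0 - amount) * x + amount * y
             for x, y in zip(a.texture, b.texture)]
    return LatentCode(structure=a.structure, texture=mixed)

# Toy codes standing in for encoder outputs of two images.
code_a = LatentCode(structure=[1.0, 2.0], texture=[0.0, 0.0])
code_b = LatentCode(structure=[9.0, 9.0], texture=[1.0, 2.0])

hybrid = swap_texture(code_a, code_b)
halfway = blend_texture(code_a, code_b, 0.5)
```

Decoding `hybrid` would, if the disentanglement works, give an image with the layout of the first image and the look of the second.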

Here's a link to the original CycleGAN paper.

    - Remember, the work presented in the talk supersedes CycleGAN. Note that they have PyTorch code here now in addition to the older Torch code.

