SinGAN - Learning a Generative Model from a Single Natural Image

Ok, let's take a look at the fabulous SinGAN architecture (so many GAN architectures, so little time). This is some interesting work, but I think you want to put it in the right perspective. And with that cryptic intro, let's dive in, starting with a really great overview presentation by Tamar Shaham at Israel Computer Vision Day 2019.


You can check out the actual paper, titled 'SinGAN: Learning a Generative Model from a Single Natural Image' here.

You can check out the actual code (PyTorch, yeah) on Github here.

For a different take on the actual architecture, we turn to our old pal Yannic Kilcher for his astute analysis.


Ok, so we have a multi-scale architecture. It's the typical GAN structure, where we feed in random noise and the generator outputs an image, but we do that at every scale of an image pyramid, coarsest to finest.
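
To make that flow concrete, here's a minimal sketch of the coarse-to-fine sampling loop, assuming we already have a list of trained per-scale generators and per-scale noise amplitudes. The function and argument names are mine for illustration, not the repo's actual API; each generator is assumed to take (noise, upsampled previous output) and refine the upsampled image, which is how the paper describes the pyramid.

import torch
import torch.nn.functional as F

def sample_singan(generators, noise_amps, coarsest_shape, scale_factor=4/3):
    # Coarsest scale: pure noise in, image out (there is no previous image yet).
    z = torch.randn(coarsest_shape)
    img = generators[0](z, torch.zeros_like(z))
    # Finer scales: upsample the previous output, add scaled noise, refine.
    for G, amp in zip(generators[1:], noise_amps[1:]):
        img = F.interpolate(img, scale_factor=scale_factor, mode='bilinear',
                            align_corners=False)
        z = amp * torch.randn_like(img)
        img = G(z, img)  # each G adds detail on top of the upsampled image
    return img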


Here's what the individual generator looks like at a single scale.
Each of the 5 conv-blocks in the single-scale generator above has the form Conv(3x3)-BatchNorm-LeakyReLU.

Here's the PyTorch code for that conv-block implementation to make that clear.

import torch.nn as nn

class ConvBlock(nn.Sequential):
    # A single Conv(3x3) -> BatchNorm -> LeakyReLU block, built as an nn.Sequential.
    def __init__(self, in_channel, out_channel, ker_size, padd, stride):
        super(ConvBlock, self).__init__()
        self.add_module('conv', nn.Conv2d(in_channel, out_channel, kernel_size=ker_size,
                                          stride=stride, padding=padd))
        self.add_module('norm', nn.BatchNorm2d(out_channel))
        self.add_module('LeakyRelu', nn.LeakyReLU(0.2, inplace=True))
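
And here's a hedged sketch of how those blocks might be assembled into the generator for one scale: a head block, a few body blocks, and a final conv that maps back to image channels, with the upsampled previous-scale image added back as a residual. The channel counts and the residual skip follow the paper's description; the class name and exact signature are mine, not the repo's.

import torch.nn as nn

class SingleScaleGenerator(nn.Module):
    def __init__(self, in_ch=3, base_ch=32, ker_size=3, padd=1, stride=1):
        super().__init__()
        # Reuses the ConvBlock defined above (Conv-BatchNorm-LeakyReLU).
        self.head = ConvBlock(in_ch, base_ch, ker_size, padd, stride)
        self.body = nn.Sequential(
            ConvBlock(base_ch, base_ch, ker_size, padd, stride),
            ConvBlock(base_ch, base_ch, ker_size, padd, stride),
            ConvBlock(base_ch, base_ch, ker_size, padd, stride),
        )
        # Final conv maps back to image channels; a Tanh keeps outputs bounded.
        self.tail = nn.Sequential(
            nn.Conv2d(base_ch, in_ch, kernel_size=ker_size, stride=stride, padding=padd),
            nn.Tanh(),
        )

    def forward(self, noise, prev):
        # The conv stack sees noise plus the upsampled previous-scale image,
        # and its output is added back onto that image as a residual.
        x = self.head(noise + prev)
        x = self.body(x)
        x = self.tail(x)
        return x + prev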


Observations:

1: What are the priors built into the system? You always want to understand this when looking at any architecture.

2: What are the similarities and differences compared to pre-existing texture-synthesis-from-an-example architectures? What's really going on here?

3: Why would you use this computationally intensive technique for the kinds of limited animation effects they show off? As opposed to keyframing some interactive warp effects in StudioArtist, for example? Seriously.

More importantly, what does their animation output tell you about the latent space you are manipulating?  

Remember, interpolating between random inputs to a GAN is equivalent to manipulating the latent space of the system. 

So if you take StyleGAN, for example, trained on a dataset of headshot photos, and feed it a linear interpolation between 2 different random inputs, you get a kind of morph between 2 artificial, generated headshot output images.

What do 2 different random input vectors give you as output from this system? What does the output of the linear interpolation between the 2 random input vectors look like? What does that tell you about what the system is modeling? (There's a minimal sketch of this interpolation experiment at the end of the post.)

4: I said looking at the code would make the conv-block implementation clearer. Looking at it now, I'm thinking: does it really? Or is it more confusing on some level? And what does that tell us about what a better way to build the implementation might be? (One possible alternative is sketched just below.)
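
For what it's worth, here's one way the same block could be spelled more plainly: a small factory function returning an nn.Sequential with keyword arguments, rather than a subclass peppered with add_module calls. Same behavior, just a different (and arguably more readable) way to build it; consider it a sketch, not a drop-in replacement for the repo's class.

import torch.nn as nn

def conv_block(in_channels, out_channels, kernel_size=3, stride=1, padding=1):
    # Same Conv(3x3) -> BatchNorm -> LeakyReLU block, built as a plain Sequential.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size,
                  stride=stride, padding=padding),
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(0.2, inplace=True),
    )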


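To make the interpolation question from observation 3 concrete, here's a minimal sketch of the experiment: draw two random input maps, blend them linearly, and push each blend through a sampling function. The sample_fn here is a placeholder for something like the coarse-to-fine loop sketched earlier with the other per-scale noise held fixed; none of these names come from the SinGAN repo.

import torch

def interpolation_walk(sample_fn, z_a, z_b, steps=8):
    # Linearly blend two random input maps and generate an output for each blend.
    outputs = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b
        outputs.append(sample_fn(z))
    return outputs

Whether the resulting sequence looks like a smooth morph (the way StyleGAN headshots do) or more like a reshuffling of patches from the one training image is exactly what observation 3 is asking you to look at.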
