Novel View Synthesis Tutorial

Most things about living through a pandemic suck, but there does appear to be one side benefit: every academic and technical conference you might be interested in is being conducted virtually. So you can sit at home and watch the lectures and tutorials without having to spend thousands of dollars or jack the planet's carbon footprint through the roof in the process.

So here we have a cutting-edge view into a fascinating subdomain of computer vision called View Synthesis: working off of a single photo, or a small number of photos, of a scene and then letting a user manipulate the view of that scene in 3D.

These Novel View Synthesis Tutorial lectures are from CVPR 2020 this summer.


Keep in mind that my purpose in pointing you at this tutorial is to encourage you to watch the 30-minute intro lecture at the beginning by Orazio Gallo, called 'Novel View Synthesis: A Gentle Introduction'.

After you do, you will have a good introduction to the history and current state of the art of the field. The other presentations in the tutorial are meant to give you exposure to the very latest and most exciting new developments in 2020. You may or may not want to continue on with the rest of the lectures (this is a full-day tutorial event).


Observations

1: It was great to see the reference to the fathers of this field, the Chen and Williams paper 'View Interpolation for Image Synthesis', excellent work done at Apple Computer's Advanced Technology Group (ATG) in the early 90s.


2: OK, so after watching the 'Gentle Introduction', we understand the problem domain and how people have tried to address it technologically.

Now if you are at all like me, you immediately want to know what is being done with deep learning in this Novel View Synthesis domain.

So right away, you could imagine a GAN architecture (let's call it ViewSynthesisGAN) that learns a latent space representation of a 3D scene and can construct artificial views throughout the modeled scene(s) encoded in its latent space.

How does ViewSynthesisGAN work?

Well, get to work, students. Figure it out. Use your fastai api smarts. Let's figure it out.
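To make that homework a little more concrete, here is a minimal PyTorch sketch of the kind of pose-conditioned GAN I am imagining. Nothing below comes from the tutorial: the module names, layer sizes, 6-DoF pose vector, and 64x64 output resolution are all my own assumptions, just a strawman to start poking at.

```python
import torch
import torch.nn as nn

class ViewSynthesisGenerator(nn.Module):
    """Hypothetical generator: scene latent z + camera pose -> rendered view."""
    def __init__(self, z_dim=128, pose_dim=6, img_channels=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + pose_dim, 256 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32x32
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1),     # 64x64
            nn.Tanh(),
        )

    def forward(self, z, pose):
        # The same z should describe the whole scene; the pose picks the viewpoint.
        x = self.fc(torch.cat([z, pose], dim=1)).view(-1, 256, 4, 4)
        return self.net(x)

class ViewDiscriminator(nn.Module):
    """Hypothetical discriminator: does this image look like a real view from this pose?"""
    def __init__(self, pose_dim=6, img_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.fc = nn.Linear(256 * 8 * 8 + pose_dim, 1)

    def forward(self, img, pose):
        feats = self.conv(img).flatten(1)
        return self.fc(torch.cat([feats, pose], dim=1))

# Smoke test with random tensors (64x64 images, 6-DoF pose vectors).
g, d = ViewSynthesisGenerator(), ViewDiscriminator()
z, pose = torch.randn(2, 128), torch.randn(2, 6)
fake = g(z, pose)
score = d(fake, pose)
print(fake.shape, score.shape)  # torch.Size([2, 3, 64, 64]) torch.Size([2, 1])
```

A real version would also need some way to tie z to an actual captured scene (an encoder network, or per-scene optimization), plus a consistency loss so the same z renders coherently across many poses.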

I also wonder if the old Chen and Williams ideas could somehow be utilized in the GAN implementation, twisted to help constrain the latent embedded space of the ViewSynthesisGAN we are proposing here.


3: If you are like me (too many things to do and not enough time to do them), then finding 6 hours to watch every presentation in this tutorial could be challenging.  Feel free to do some jumping around to figure out which presentations might be of most interest to you.

I will endeavor to watch them all over the next few days, both because I am very curious about the content and because I want to see whether every paper has deep learning in it somewhere, and if not, why not.

Perhaps our proposed ViewSynthesisGAN architecture already exists.  I will point you at the specific lecture if it is described for us.  If not, then maybe we need to keep thinking about how to build the ViewSynthesisGAN using the fastai api.

We will continue this discussion later in another post that will reference back here, after we do a little more research on what I just described.


4: Now how about those additional lectures in this Tutorial?

I highly recommend you watch the second one, by Rick Szeliski, called 'Reflections on Image-Based Rendering'. It totally complements the first 'Gentle Introduction' and really provides a great overview. This is an excellent presentation packed with a ton of useful information.


'SynSin: Single Image View Synthesis' by Olivia Wiles is an interesting presentation. They put together a deep learning generative model with some interesting 3D machinery in the middle, so that the view of the scene the model encodes can be adjusted in ways that are intuitive for a user manipulating the model's output.

Here's a web link to the abstract for the oral presentation with a few results.

Here's a link to the paper.
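Very roughly, and going from my reading of the paper rather than anything shown in the talk, the overall shape of that kind of single-image pipeline is: predict per-pixel features and depth, lift them into 3D, reproject them under the target camera, and let a refinement network fill in the rest. Here is a structural sketch with the 3D warp stubbed out; every class and layer here is a placeholder of mine, not the actual SynSin code.

```python
import torch
import torch.nn as nn

class SingleImageViewSynthesis(nn.Module):
    """Rough outline (my reading, not the real SynSin implementation) of a
    single-image view-synthesis pipeline: 2D features + depth -> 3D -> reproject -> refine."""
    def __init__(self, feat_dim=32):
        super().__init__()
        # Placeholder sub-networks; the real versions are deep CNNs.
        self.feature_net = nn.Conv2d(3, feat_dim, 3, padding=1)  # per-pixel features
        self.depth_net = nn.Conv2d(3, 1, 3, padding=1)           # per-pixel depth
        self.refine_net = nn.Conv2d(feat_dim, 3, 3, padding=1)   # features -> image

    def forward(self, img, relative_pose):
        feats = self.feature_net(img)   # (B, F, H, W)
        depth = self.depth_net(img)     # (B, 1, H, W)
        # The interesting 3D step: lift (feats, depth) to a point cloud,
        # transform it by relative_pose, and splat it onto the target image
        # plane.  Stubbed out here -- this is where a differentiable renderer
        # would go.
        warped = self.warp_to_target(feats, depth, relative_pose)
        return torch.tanh(self.refine_net(warped))

    def warp_to_target(self, feats, depth, relative_pose):
        # Stub: identity warp.  A real system reprojects each pixel using its
        # depth and the camera transform, leaving holes for the refiner to fill.
        return feats

# Smoke test.
model = SingleImageViewSynthesis()
img = torch.randn(1, 3, 64, 64)
pose = torch.eye(4)                 # placeholder relative camera transform
print(model(img, pose).shape)       # torch.Size([1, 3, 64, 64])
```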


'View Synthesis with Multiplane Images' by Richard Tucker had some fascinating information in it (fascinating for me, at least). I was not really familiar with the particular multiplane image data structure they use in their research, and it's a really cool representation that would appear to have a lot in common with other 'multi-resolution pyramid' image models, scale-space representations, etc.
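In case the multiplane image idea is new to you too, here is a tiny sketch (mine, not from the talk) of its core rendering step: a stack of fronto-parallel RGBA planes at fixed depths, composited back to front with the standard 'over' operator. The plane count and image size are made up for illustration.

```python
import numpy as np

def composite_mpi(planes):
    """Composite a multiplane image: planes is (D, H, W, 4) RGBA, ordered
    back (far) to front (near).  Returns an (H, W, 3) RGB image."""
    out = np.zeros(planes.shape[1:3] + (3,), dtype=np.float32)
    for plane in planes:                           # far -> near
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)    # standard 'over' compositing
    return out

# Toy example: 32 depth planes of a 64x64 image, random content.
mpi = np.random.rand(32, 64, 64, 4).astype(np.float32)
print(composite_mpi(mpi).shape)   # (64, 64, 3)
```

To render a novel view, each plane would first be warped into the target camera (one homography per plane, since each plane sits at a known depth) before this compositing step; I have left that part out to keep the sketch small.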

It seems like it could be useful for many different things, but you would immediately want to build interpolation into it to compensate for the 'moved the camera too far' artifacts he showed off.

Why can't it just be an RGBA image with an associated depth map image, as opposed to the full set of depth plane images they use in their representation?

Why haven't they done the work to better incorporate deep learning? They mentioned they tried a GAN and it didn't work well; why not?

And of course the artist in me wanted him to dig into the 'move the view' controls and take us way, way into the scene, where the system would have to essentially hallucinate what it was generating for large portions of the image.  From an artistic point of view, this might be the most interesting part of the system.  

What can you do with that part of the system?  

How can you make it more user controllable, more user adjustable, more interactively modulatable?

