Computational Video Editing for Dialog Driven Scenes

Non-linear editing is of particular interest to me (given my background as one of the original developers of the Pro Tools digital audio editing system).  So when I watched the video below and read the associated paper I got very excited.  Let's take a look at a demo and then we can comment further.

So we're looking at an experimental system to edit dialog driven scenes in a video.  Conventionally this kind of editing is done in a non-linear video editing system.  Avid, Final Cut, or Premiere are the common systems used for almost everything you watch today.  All of them have their humble beginnings back in the golden olden times when we were doing that original Pro Tools development effort.

And as they say in the video demo above, hand editing of this kind of scene by a human editor using one of these non-linear editing systems is kind of a pain.  Time consuming, tedious.  So developing semi-automated systems that can do a lot of the manual grunt work needed to create finished edited scenes is really exciting.  At least to me; conventional film or video editors might feel somewhat intimidated by them.

This work is an example of what people are calling 'smart media'.  It's a pretty wild topic area.  One that has all kinds of amazing implications as you think through how it's going to totally transform the media landscape over the next decade.

There is a general issue with any AI automated 'intelligent' system that you want creative professionals and artists to use in their work.  It's what I call the marriage of the artist and the intelligent automated system.  It's something we spent a lot of time thinking about and developing in Studio Artist, which is a program for digital artists that includes computational intelligence in it.

Now Studio Artist is 20 years old at this point, so we were restricted by the technologies available at the time (enhanced over the years of course).  But technology marches on, and the kinds of things you can think about doing today are very exciting.

The work described in the video above is trying to do a few different things. Bring automatic scene segmentation and labeling into the video editing pipeline.  Generate composable representations for film and video editing.  And move the user interface for non-linear editing out of the frame based basement to higher cognitive levels.  With the ultimate goal of developing 'smart' editing systems that let a digital artist work at a much higher conceptual level when editing video.
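To make the idea of 'composable' editing rules a little more concrete, here is a toy sketch in Python.  It is not the paper's implementation, just one way to picture it: each idiom is a small scoring function over labeled clips, idioms are combined by summing their scores, and a simple dynamic program picks one clip per line of dialogue.  The names `Clip`, `start_wide`, `avoid_jump_cuts`, `speaker_visible`, `compose`, and `pick_clips`, the label values, and the score weights are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Clip:
    take: int              # which take the clip was cut from (hypothetical label)
    line: int              # index of the dialogue line the clip covers
    shot: str              # shot label, e.g. "wide" or "close"
    speaker_visible: bool  # whether the speaking character is on screen


# An idiom scores a candidate clip given the previously chosen clip (None at the start).
Idiom = Callable[[Optional[Clip], Clip], float]


def start_wide(prev: Optional[Clip], cur: Clip) -> float:
    # Reward opening the scene on a wide shot.
    return 1.0 if prev is None and cur.shot == "wide" else 0.0


def avoid_jump_cuts(prev: Optional[Clip], cur: Clip) -> float:
    # Penalize cutting between same-framed clips that come from different takes.
    if prev is not None and prev.shot == cur.shot and prev.take != cur.take:
        return -2.0
    return 0.0


def speaker_visible(prev: Optional[Clip], cur: Clip) -> float:
    # Reward clips that show whoever is speaking.
    return 1.0 if cur.speaker_visible else 0.0


def compose(*idioms: Idiom) -> Idiom:
    # Composing idioms here simply means adding up their scores.
    return lambda prev, cur: sum(idiom(prev, cur) for idiom in idioms)


def pick_clips(candidates: List[List[Clip]], score: Idiom) -> List[Clip]:
    """Choose one clip per dialogue line, maximizing the total idiom score.

    candidates[k] holds the candidate clips for dialogue line k.
    """
    best = [score(None, c) for c in candidates[0]]
    back: List[List[int]] = []
    for k in range(1, len(candidates)):
        new_best, pointers = [], []
        for cur in candidates[k]:
            options = [best[j] + score(prev, cur)
                       for j, prev in enumerate(candidates[k - 1])]
            j_star = max(range(len(options)), key=options.__getitem__)
            new_best.append(options[j_star])
            pointers.append(j_star)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the last dialogue line.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[k][i] for k, i in enumerate(path)]
```

The point of the sketch is the level of abstraction: the editor picks and weights idioms, and the frame-level cut decisions fall out of them.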

Now because of my background, when I looked at this work I immediately thought about how it could be applied to digital audio editing.  Non-linear video editors and digital audio editors are essentially the same on a conceptual level.  In non-linear video editing you are working with segments of video frames at the basement frame level.  In non-linear audio editing you are also working with segments of digital audio files, at the sample level potentially, but usually at a higher bar/beat level of organization (since that is how music in general is structured).
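The frame level versus bar/beat level comparison comes down to simple arithmetic.  Here is a minimal sketch, assuming a constant tempo, 4 beats per bar, a 48 kHz sample rate, and 24 frames per second (all illustrative defaults, and the function names are made up for this example).

```python
def beat_to_sample(bar: int, beat: int, bpm: float,
                   sample_rate: int = 48_000, beats_per_bar: int = 4) -> int:
    """Sample offset of a 1-indexed (bar, beat) position at a constant tempo."""
    beats_elapsed = (bar - 1) * beats_per_bar + (beat - 1)
    seconds = beats_elapsed * 60.0 / bpm  # each beat lasts 60 / bpm seconds
    return round(seconds * sample_rate)


def seconds_to_frame(seconds: float, fps: float = 24.0) -> int:
    """Index of the video frame that contains the given time."""
    return int(seconds * fps)
```

At 120 bpm the downbeat of bar 2 lands 2 seconds in, which is sample 96000 at 48 kHz and frame 48 at 24 fps.  An editor thinking in bars and beats (or in lines of dialogue) never has to see those numbers.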

I don't think you should view these systems as replacing the video editor.  More as a tool to enhance the creative range of the video editor.  Note that the system described in the video above allows a film editor to work at a much higher conceptual level than conventional frame based video editing systems allow.

Here's a link to the paper associated with the work described above if you are interested in learning more about what is going on under the hood of this 'synthetic media' system.


  1. I asked a professional film editor (who has also written a book on film editing) to take a look at this research and comment. Their comments are as follows:
    ‘No jump cuts’, ‘Start wide’ and ‘faster pace’ idioms

    As a former editor: for dialogue scenes like the example video (shot-reverse shot), I have often used the multi-cam options in Premiere and Avid – if the production is shot with more than one camera in real time.

    Can these ‘idioms’ read the cinematographic areas to be aware of while editing? What about the cinematography (lighting, color, camera moves, and human errors)?
    Like any language, the language of film editing is always changing.
    The shot reverse shot not only requires finding the best shot/take, but requires reading the eyes (the gaze) the blink, emotions, and adding timing/pacing, and maybe a move in or pull out in post. Have you read Walter Murch's short book: In the Blink of an Eye? He has some interesting ideas.

    This Shot reverse shot is a classic construction that some filmmakers avoid. Many film theoreticians have looked to Psychoanalytic theory for answers. The space between each cut that is ‘sutured’ together, seamlessly, so that we don’t notice the cuts is something to avoid. The space between each cut is what creates a desire for closure – which can never come. This is a contested theory.

    Some filmmakers have also experimented with ways to avoid the shot reverse shot. Some have opted for two-person shots in dialogue scenes. Or some just avoid the editing altogether: Rope, 1917, Russian Ark, Timecode, …or reduce the number of edits…Tarkovsky films, Sokurov films, Bela Tarr films, Jeanne Dielman.

    To circle back to the shot-reverse-shot and this program.

    Perhaps this would best serve as a live switcher for events or talk shows (I’m thinking of the format from the classic Charlie Rose), or a singles tennis match or ….

    Anyway – it has certainly got me thinking. Thank you.

