Jay Rose Posted January 3, 2018

Fascinating article in today's NYT about neural networks generating still images of faces with no 'uncanny valley'. But buried in that article is a reference to work at the University of Washington last summer... that automatically edits lips to match a different track! Literally puts new words in someone's mouth. On a computer screen, sync looks absolutely realistic. Resolution might not be enough for a big screen... but these things tend to leap forward quickly.

Here's a link just to the UW demo. They took some real Obama speeches and put them into multiple other Obama faces: same speech, many different visual deliveries. The article doesn't mention what could happen when you edit the source speech to say something new. But heck, good dialog editors have always been able to change what someone says, on the track. Now a computer can make the target individual appear on-camera, saying the edited version!

NYTimes full article link.
Constantin Posted January 3, 2018

Oh, that lends a whole new meaning to fake news. Now they can even provide the picture to the fake soundbite. Scary.
Syoung Posted January 3, 2018

I'm sure you guys have seen this, too. Adobe's Project Voco: http://www.bbc.com/news/technology-37899902
Jim Feeley Posted January 3, 2018

Yikes! There's also this automated object/person removal demo from Adobe MAX... i.e., it's somewhere between a SIGGRAPH paper and a product feature.

Brief marketing article (i.e., not the tech paper): Cloak: Remove Unwanted Objects in Video
https://research.adobe.com/cloak-remove-unwanted-objects-in-video/

Six-minute demo that's pretty interesting and a bit disconcerting. #fakeviews
Jay Rose Posted January 3, 2018 (Author)

6 hours ago, Syoung said:
I'm sure you guys have seen this, too. Adobe's Project Voco: http://www.bbc.com/news/technology-37899902

From the article:

Quote: At a live demo in San Diego on Thursday, Adobe took a digitised recording of a man saying "and I kissed my dogs and my wife" and changed it to say "and I kissed Jordan three times".

I find that almost impossible to believe. There's no /zh/ in the source speech, so how could it create the /d zh/ affricate for "Jordan"? Where did it get the /th/ for "three"? Or the /ee/? I'll accept that it could fake the /t/ for "times" by putting a stop at the front of the word... but that's an ancient trick. (Forgive my attempts at phonetic transcription without IPA...)
Jim Feeley Posted January 4, 2018

Jay, that Adobe project isn't perfect but is kinda cool. Perhaps your question is answered in the paper, which came out several months after the Adobe demo that freaked out the BBC:

VoCo: Text-based Insertion and Replacement in Audio Narration

[snip] While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text to speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice.

The paper, demo video, etc. can be found here, though I gather from Adobe friends things have advanced a bit this year (still not productized, IIRC):
http://gfx.cs.princeton.edu/pubs/Jin_2017_VTI/
Jay Rose Posted January 4, 2018 (Author)

Jim, thanks. The idea of using a speech generator and then adding prosody with a neural network makes sense. I figured the BBC quote was just sloppy reporting.

Once that system is fully cooked, however... why bother with ADR at all? A dialog editor can sit with the director and rebuild any line they want.
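[Editor's note: the two-stage idea the paper describes (synthesize the new word in a generic voice, then convert it toward the narrator) can be sketched in a few lines. Everything below is made up for illustration, not Adobe's code: the function names, the 2-D "spectral frames," and the nearest-exemplar conversion are a crude stand-in for real voice conversion.]

```python
# Toy sketch of a VoCo-style two-stage pipeline. All names and numbers
# are hypothetical; real systems work on actual spectral features.
from math import dist  # Euclidean distance (Python 3.8+)

def generic_tts(word):
    """Stand-in for stage 1, a generic TTS voice: emit one fake
    2-D 'spectral frame' per character of the new word."""
    return [(float(ord(c) % 7), float(ord(c) % 5)) for c in word]

def convert_voice(frames, narrator_frames):
    """Stand-in for stage 2, exemplar-based voice conversion: replace
    each synthetic frame with the closest frame harvested from the
    narrator's own recording."""
    return [min(narrator_frames, key=lambda n: dist(f, n)) for f in frames]

# Frames "harvested" from the existing narration (made-up values).
narrator = [(0.0, 1.0), (2.0, 2.0), (5.0, 3.0), (6.0, 4.0)]

synthetic = generic_tts("Jordan")                       # stage 1
in_narrator_voice = convert_voice(synthetic, narrator)  # stage 2

# Every output frame comes from the narrator's own material, which is
# the intuition for why the insert can blend with the narration.
assert all(f in narrator for f in in_narrator_voice)
```

The point of the sketch is Jay's observation above: the new word's phonetic content comes from the synthesizer, so the narrator never had to utter a /zh/ or /th/; the conversion stage only has to make it sound like them.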
IronFilm Posted January 14, 2018

These are starting to remind me of the "Edit Button" software:

On 1/4/2018 at 8:42 AM, Jim Feeley said:
Six-minute demo that's pretty interesting and a bit disconcerting. #fakeviews

At least now we never need to worry about whether the boom is in the shot! ;-) We can go as close in as we like for the best sound.