realsnd

2.5D Visual Sound with deep neural nets


Pretty cool: after training with wide-angle images + sound from an artificial stereo head + mono sound, a deep neural network learned to

- generate binaural sound

- isolate sound sources

from video + mono audio alone.

 

See

https://arxiv.org/abs/1812.04204

http://vision.cs.utexas.edu/projects/2.5D_visual_sound/

 

This is only the beginning... In the future you can imagine, for example, an AI that focuses on or extracts the relevant sound when you zoom in on a high-resolution image in post, etc.

 


Definitely just the beginning, but a worthwhile launch that'll hopefully spark other developers.

 

 

This initial implementation is very limited: according to my reading of their paper, the experimenters had to train the NN using the same original binaural recordings that they later collapsed to mono and processed to 'artificial' binaural. There's absolutely no reason to believe -- yet -- that it will translate to other mono recordings for which no spatial information exists... or even to other segments from the same acoustic (or the same session) on which the network hadn't been trained. It would be awesome to show that a 2.5D segment for which there was no specific training matched a binaural version of the same material.
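
To make the collapse-and-rebuild step concrete, here is a rough sketch of just the arithmetic, with no neural net involved. It assumes the mono track is the sum of the two binaural channels and that the network's job is to predict the left-minus-right difference; that framing, and every function name below, is an assumption for illustration rather than something confirmed in this thread. The true difference stands in for the network's output, which is exactly the best case the experiment measures against.

```python
# Sketch only: the mixdown / rebuild arithmetic, not the authors' code.
# All function names are hypothetical.

def collapse_to_mono(left, right):
    """Mix a two-channel recording down to one channel (sum of channels)."""
    return [l + r for l, r in zip(left, right)]

def channel_difference(left, right):
    """The spatial cue lost in the mixdown; a network would have to predict this."""
    return [l - r for l, r in zip(left, right)]

def rebuild_binaural(mono, predicted_diff):
    """Recover left/right from the mono mix plus a (predicted) difference signal."""
    left = [(m + d) / 2 for m, d in zip(mono, predicted_diff)]
    right = [(m - d) / 2 for m, d in zip(mono, predicted_diff)]
    return left, right

# Toy three-sample "recording".
left = [0.5, 0.25, -0.5]
right = [0.0, 0.25, 0.5]

mono = collapse_to_mono(left, right)
diff = channel_difference(left, right)  # ground truth stands in for the network output
out_left, out_right = rebuild_binaural(mono, diff)
```

With a perfect difference signal the original channels come back exactly. The open question is how close a predicted difference gets on material the network has never seen.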

 

On the other hand, their demo does show it's possible to steer a signal based on seeing movement on one side of the screen and not the other. That's not quite "isolating sound sources"... but it's a start.

 

There's a lot more training and testing necessary just to see if 2.5D will be usable outside these controlled experiments. But stay hopeful: if there's a commercial market, it could happen.

 

--

 

My personal goal for NN in our business? Something I proposed about twenty years ago, when the necessary technology was where 2.5D is now. This would use speech recognition and prosody extraction from a noisy or echoey dialog recording, driving artificial speech trained from other recordings of the same actor. Truly automatic dialog replacement, no looping session required. We're getting there fairly quickly.

 

The downside of my dream? Producers who read breathless news stories about the technique, and decide they don't need a production mixer at all. 
