2.5D Visual Sound with deep neural nets

Pretty cool: after training with wide-angle images + sound from an artificial stereo head + mono sound, a deep neural network learned to

- generate binaural sound

- isolate sound sources

from video+mono audio alone.






This is only the beginning... In the future you can imagine, for example, an AI that focuses on or extracts the relevant sound when you zoom in on a high-resolution image in post, etc.


Definitely just the beginning, but a worthwhile launch that'll hopefully spark other developers.



This initial implementation is very limited: according to my reading of their paper, the experimenters had to train the NN using the same original binaural recordings that they later collapsed to mono and processed to 'artificial' binaural. There's absolutely no reason to believe -- yet -- that it will translate to other mono recordings for which no spatial information exists... or even to other segments from the same acoustic (or the same session) on which the network hadn't been trained. It would be awesome to show that a 2.5D segment for which there was no specific training matched a binaural version of the same material.
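To make the "collapsed to mono, then processed back to binaural" idea concrete, here is a rough sketch. It assumes the mono track is the sum of the two ear channels and that the network's job is to guess the left-minus-right difference; the paper's exact formulation may differ, and all function names here are made up for illustration.

```python
def downmix(left, right):
    """Collapse a two-channel recording to mono by summing the channels."""
    return [l + r for l, r in zip(left, right)]


def difference(left, right):
    """The spatial cue that the mono downmix throws away (assumed L - R)."""
    return [l - r for l, r in zip(left, right)]


def reconstruct(mono, diff):
    """Rebuild left/right from the mono sum and a (predicted) difference."""
    left = [(m + d) / 2 for m, d in zip(mono, diff)]
    right = [(m - d) / 2 for m, d in zip(mono, diff)]
    return left, right


def mean_abs_error(a, b):
    """Average absolute sample error between two equal-length signals."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy "binaural" samples, purely illustrative.
true_left = [0.5, 0.25, -0.125]
true_right = [0.25, 0.5, 0.125]

mono = downmix(true_left, true_right)

# With the true difference, the original channels come back exactly.
exact_left, exact_right = reconstruct(mono, difference(true_left, true_right))

# With no spatial information (difference of zero), both ears get the same signal.
flat_left, flat_right = reconstruct(mono, [0.0] * len(mono))

print(mean_abs_error(exact_left, true_left))  # error with perfect knowledge
print(mean_abs_error(flat_left, true_left))   # error with no spatial cue
```

The test I'd like to see is the same error measure run on material the network never saw: predict the difference from video plus mono, reconstruct, and compare against a real binaural recording of that segment.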


On the other hand, their demo does show it's possible to steer a signal based on seeing movement on one side of the screen and not the other.  That's not quite "isolating sound sources"... but it's a start.
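For comparison, here is what plain steering looks like with no neural net at all: an ordinary constant-power pan driven by a horizontal screen position. This is not what their network does; it's just a sketch of the distinction, and the 0-to-1 position value is a hypothetical stand-in for "where the movement is on screen."

```python
import math


def pan_gains(x):
    """Constant-power pan gains for a position x in [0, 1] (0 = left, 1 = right)."""
    x = min(1.0, max(0.0, x))      # clamp to the screen width
    angle = x * math.pi / 2
    return math.cos(angle), math.sin(angle)


def steer(mono, x):
    """Pan the entire mono signal toward screen position x."""
    gain_left, gain_right = pan_gains(x)
    left = [s * gain_left for s in mono]
    right = [s * gain_right for s in mono]
    return left, right


# Movement detected near the right edge of the frame.
left, right = steer([0.5, -0.25, 0.125], 0.9)
print(left, right)
```

Note that the pan moves everything in the mix together. Isolating a source would mean sending one instrument right while leaving the rest where it was, which is a much harder problem.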


There's a lot more training and testing necessary just to see if 2.5D will be usable outside these controlled experiments. But stay hopeful: if there's a commercial market, it could happen.




My personal goal for NN in our business? Something I proposed about twenty years ago, when the necessary technology was where 2.5D is now. This would use speech recognition and prosody extraction from a noisy or echoey dialog recording, driving artificial speech trained from other recordings of the same actor. Truly automatic dialog replacement, no looping session required. We're getting there fairly quickly.


The downside of my dream? Producers who read breathless news stories about the technique, and decide they don't need a production mixer at all. 

