Jay Rose

Neural Networks for Audio: how they work


Audionamix's TRAX Pro SP and the Dialog Isolate module in iZotope RX6 are kind of amazing: they use Neural Networks to clean up production tracks in ways we never could before, and they can even give you a stem of the stuff they took away from the dialog (like a clean bg track, or just the environmental music). Far better than any of the multiband expansion noise reducers or other algorithmic techniques we've been using for a couple of decades.

 

They can also seriously screw up your track. Just like any other processing.

 

Both manufacturers graciously gave me large chunks of Skype time with their senior developers, so I could write an article about the techniques for CAS Quarterly. The article appears online today, and will be mailed soon. We've also posted a web page where I've put downloadable samples of actual processing of production elements. (If you visit the web page, please download the AIFFs. The posted mp3 files are just for orientation, and distort some of what the processors are doing.)

 

Fellow CAS member Stephen Fitzmaurice added a sidebar with initial impressions of the Audionamix in his mix theater. Detailed reviews will be coming in a future issue.

 

Article is in the Quarterly at cinemaudiosociety.org, or as a downloadable pdf at jayrose.com

 

This stuff has been blowing my mind. Please comment. (On the technique, not on my mind; that's a lost cause.)

 


Neural Networks are truly incredible. Training can take a long time, but the hive-mind nature of software eclipses what any individual human being can learn in a lifetime. And on top of that, computers can train each other, like two individuals playing game after game of chess with the only objective of beating the other player.

 

I'm actually rather surprised Audionamix and iZotope didn't talk about Generative Adversarial Networks (GANs). Advanced neural networks usually chain many different techniques together to find the optimal outcome. LSTM is an important component in learning significant features over time, but when it comes to appealing to human perceptions of quality, more and more there's going to be a generative component.

Such processing isn't going to be so much a matter of removing noise, or of moving noise to where it can't be heard. Actually, it's a bit of a misnomer to describe noise as "being removed": zero signal is still a known quantity; noise is a description of what can't be detected. So whether it redefines the unknown as zero or as some other, more pleasant sound, generative processing will distort our recordings to make them sound better, and that should be OK. After all, every technique from companders on tape machines and radios to advanced lossy digital compression algorithms has been distortion in the name of quality.
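For contrast, the "subtractive" approach that generative processing would replace can be sketched in a few lines. This is a toy spectral-subtraction pass in plain numpy — my own illustration, nothing like what the shipping products do. It uses a synthetic tone plus hiss, no windowing or overlap-add, and it assumes you have a noise-only stretch to profile:

```python
import numpy as np

# Synthetic "production track": a 440 Hz tone buried in broadband hiss.
rng = np.random.default_rng(0)
sr, n = 8000, 8000
t = np.arange(n) / sr
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.3 * rng.standard_normal(n)
noisy = clean + noise

frame = 256
def frames(x):
    """Chop a signal into non-overlapping frames (no windowing here)."""
    return x[: len(x) // frame * frame].reshape(-1, frame)

# Noise profile: mean magnitude spectrum of noise-only material.
noise_mag = np.abs(np.fft.rfft(frames(noise), axis=1)).mean(axis=0)

# Subtract the noise floor from each frame's magnitude, keep the phase.
out = []
for f in frames(noisy):
    spec = np.fft.rfft(f)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # clamp at zero
    out.append(np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame))
denoised = np.concatenate(out)

def snr(sig, ref):
    """Signal-to-noise ratio in dB, measured against the clean reference."""
    err = sig - ref[: len(sig)]
    return 10 * np.log10(np.sum(ref[: len(sig)] ** 2) / np.sum(err ** 2))

print(f"SNR before: {snr(noisy[:len(denoised)], clean):.1f} dB, "
      f"after: {snr(denoised, clean):.1f} dB")
```

The clamp-at-zero step is exactly where the familiar artifacts come from: anything the profile overestimates gets gated to silence, which is where musical-noise "birdies" are born.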

What I think we'll find, and should start to accept, is that our notion of "quality" is not at all about purity of signal. At its most basic, GAN is a technique based on the premise of fooling an entity into believing a generated set of data (image, audio, or other) is actually an observed set of data. That entity is a separate, adversarial network (called the 'discriminator') that is trained on observed data and taught to recognize the difference, so that it can challenge the 'generator' network to fool it... and fool us at the same time.
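To make the generator/discriminator idea concrete, here's a deliberately tiny GAN in plain numpy — my own illustration, not anything from the article or either product. The "observed data" is a one-dimensional Gaussian centered at 3.0; the generator learns to shift unit noise until a logistic-regression discriminator can no longer tell real from fake:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

a, b = 1.0, 0.0    # generator: g(z) = a*z + b, z ~ N(0, 1)
w, c = 0.1, 0.0    # discriminator: D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 128

for _ in range(4000):
    real = rng.normal(3.0, 1.0, batch)
    z = rng.standard_normal(batch)
    fake = a * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    dr, df = sigmoid(w * real + c), sigmoid(w * fake + c)
    gw = (-(1 - dr) * real + df * fake).mean()
    gc = (-(1 - dr) + df).mean()
    w, c = w - lr * gw, c - lr * gc

    # Generator step (non-saturating loss): push D(fake) toward 1.
    fake = a * z + b
    df = sigmoid(w * fake + c)
    gx = -(1 - df) * w               # gradient w.r.t. each fake sample
    a, b = a - lr * (gx * z).mean(), b - lr * gx.mean()

samples = a * rng.standard_normal(5000) + b
print(f"generated mean {samples.mean():.2f} (target 3.0)")
```

The point of the sketch is the two alternating steps: the discriminator never sees labels beyond "observed vs. generated", and the generator never sees the data at all — it only sees how badly it failed to fool the discriminator.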

I have to say, I bought a Pixel 2 recently, and the AI used to make up for the limitations of the camera hardware is nothing short of astounding. And then there's this:
https://blogs.nvidia.com/blog/2017/12/03/nvidia-research-nips/

It's a scary notion, with unknown consequences, but I find it crazy, and fun, and quite humbling. 


NewEndian, thanks for the link. That's incredible stuff. 

 

Off the top of my head, I suspect iZotope and Audionamix didn't use GAN because   

1) it's bleeding edge and these products have been in the works for a year,   

2) the infrastructure for commercial development -- like easily purchased AWS training -- isn't there yet (I'm sure it'll be available soon),   

3) the challenges of time-variant audio are so different from the xy arrays of image processing,  and

4) the immediate market for image manipulation is so much bigger than that for audio manipulation. 

 

Visual bias strikes again!

 


Certainly nVidia's advanced use of GAN (among other techniques) is bleeding edge. The basic concept has been around for a lot longer, though. I'm just surprised it isn't a featured strategy, particularly when they mention LSTM, which is odd because the periodic nature of audio makes short-term memory important but long-term memory less so (a simple RNN technique seems more apropos). It makes me wonder if they have a philosophical aversion to explicitly generative techniques. But I'm sure they have their reasons, which is naturally why I feel particularly inquisitive about the article. :)
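For what it's worth, the "simple RNN" recurrence is only a few lines when written out. This sketch (untrained, random weights, purely illustrative) shows where the short-term memory lives: every hidden state is a function of the current audio sample and the previous state:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 1, 8
W_x = rng.standard_normal((n_hidden, n_in)) * 0.5    # input weights
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.3  # recurrent weights
bias = np.zeros(n_hidden)

def rnn_step(x, h):
    """One timestep of a vanilla RNN: h' = tanh(W_x x + W_h h + b)."""
    return np.tanh(W_x @ x + W_h @ h + bias)

# Run a short sine "audio" sequence through the cell, one sample at a time.
h = np.zeros(n_hidden)
states = []
for sample in np.sin(np.linspace(0, 4 * np.pi, 64)):
    h = rnn_step(np.array([sample]), h)
    states.append(h.copy())
states = np.asarray(states)    # (64, 8): one hidden vector per sample
```

An LSTM replaces that single tanh with gated read/write/forget machinery so information can survive hundreds of steps; the vanilla cell above lets old inputs wash out quickly, which is arguably all a periodic audio signal needs.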

I do find it amazing how much the AI community assumes that when you're talking about neural networks, you're talking about either image processing or voice recognition. But it's true, that's where the money lies. Thankfully, one of the really great things about neural networks is that they're actually quite accessible. I expect we'll be seeing a real explosion of innovative uses for them in the near future.

You might be surprised about the similarity between image and audio processing when neural networks are involved though. The bread and butter of AI image processing, the convolutional neural network, is basically a mass of filters. The AI community talks in terms of 'frequency content' (they mean resolution), which makes sense when you think about how important it is to think about pixels in the context of surrounding pixels. That said, most of the concepts aren't specific to image processing, even if that's how they've been implemented the most. Just look at how adversarial networks and LSTM have been used to defeat 9-Dan Go champion, Lee Sedol.
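The "mass of filters" point is easy to demonstrate. Here's a hand-rolled 2-D convolution in numpy — just the filtering arithmetic, no library and no learned weights (and technically cross-correlation, which is what CNN layers compute anyway). A zero-sum Laplacian kernel behaves exactly like a high-pass filter: it ignores flat regions and responds only at edges:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2-D convolution: slide the kernel and sum the products."""
    kh, kw = kernel.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A zero-sum "high-pass" kernel (the discrete Laplacian).
laplace = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float)

flat = np.ones((6, 6))                       # constant image: no detail
edge = np.zeros((6, 6)); edge[:, 3:] = 1.0   # hard vertical edge at column 3

flat_response = conv2d(flat, laplace)  # all zeros: DC into a high-pass
edge_response = conv2d(edge, laplace)  # nonzero only along the edge
```

A CNN layer is just a bank of these kernels whose coefficients are learned instead of hand-picked, which is why the image/audio gap is smaller than it looks: swap the pixel grid for a spectrogram and the arithmetic is identical.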

Audio bias strikes again!


Hi James,  

 

I worked on a documentary about the Lee Sedol / AlphaGo match (on Netflix/iTunes/Amazon, called "AlphaGo"), and it was my introduction to this incredibly fascinating world of AI and true learning machines. Although it is fun to think about the entertainment side of these applications, it is the medical field where these machines are going to change things at an amazing rate.

 

10 hours ago, NewEndian said:

That said, most of the concepts aren't specific to image processing, even if that's how they've been implemented the most. Just look at how adversarial networks and LSTM have been used to defeat 9-Dan Go champion, Lee Sedol.


As an AI nerd and Go fanatic, when that happened I saw it as the pinnacle of human achievement for the year! Exciting times. It was reached much sooner than people predicted; everyone still thought it was years, maybe even decades, away.

9 hours ago, chriskellett said:

Hi James,  

 

I worked on a documentary about the Lee Sedol / AlphaGo match (on Netflix/iTunes/Amazon, called "AlphaGo"), and it was my introduction to this incredibly fascinating world of AI and true learning machines. Although it is fun to think about the entertainment side of these applications, it is the medical field where these machines are going to change things at an amazing rate.

 


I saw that documentary and it was excellent!!! Well done!


Had a regular lunch meeting with a close friend who owns Imaginex Studios in Malaysia and NZ.

 

He told me about cloud software that sorts your dialogue tracks out, and then you download it. Done!

 

mike


Mike,

Do you mean it sorts character A's voice from character B's? Or that it sorts dialog from other noises like footsteps and bird calls (which are usually immune to conventional noise reduction)? 

Both are theoretically possible with NN, but I haven't heard of anyone doing the former. It would take a lot of training.  The latter is commercially available in a few products now.

