<div dir="ltr"><span style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px">Dear community, </span><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px"><br></div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px">I am happy to present our <u>new library <a href="https://github.com/KinWaiCheuk/nnAudio" style="text-decoration-line:none;color:rgb(41,98,255)">nnAudio</a></u>, which lets you feed waveforms directly into a PyTorch neural network. The nnAudio layer converts waveforms to spectrograms on the fly (linear, log, Mel, CQT) and even offers a trainable (Fourier) kernel. There is no more need to store large batches of spectrogram images or to preprocess them in advance: we obtain speeds 100x faster than traditional processing, and you can fine-tune the spectrogram to your task through training. </div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px"><br></div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px">More info on how to use nnAudio: <a href="https://github.com/KinWaiCheuk/nnAudio">https://github.com/KinWaiCheuk/nnAudio</a></div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px"><br></div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px">If you are interested in becoming a <u>contributor</u> to nnAudio and helping with the feature requests we have been receiving from our rapidly growing user base, let me know! 
</div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px"><br></div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px">More info in our publication: </div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px"><i>K. W. Cheuk, H. Anderson, K. Agres and D. Herremans, "nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks," in IEEE Access, doi: 10.1109/ACCESS.2020.3019084. </i><a href="https://ieeexplore.ieee.org/document/9174990">https://ieeexplore.ieee.org/document/9174990</a></div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px"><i><br></i></div><div style="color:rgba(0,0,0,0.87);font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:14px"><i>In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time domain to frequency domain conversion. It allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on the disk. Moreover, this approach also allows back-propagation on the waveforms-to-spectrograms transformation layer, and hence, the transformation process can be made trainable, further optimizing the waveform-to-spectrogram transformation for the specific task that the neural network is trained on. All spectrogram implementations scale as Big-O of linear time with respect to the input length. 
nnAudio, however, leverages the compute unified device architecture (CUDA) of 1D convolutional neural network from PyTorch, its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than other implementations using only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and our framework significantly reduces the spectrogram extraction time from the order of seconds (using a popular python library librosa) to the order of milliseconds, given that the audio recordings are of the same length. When applying nnAudio to variable input audio lengths, an average of 11.5 hours are required to extract 34 spectrogram types with different parameters from the MusicNet dataset using librosa. An average of 2.8 hours is required for nnAudio, which is still four times faster than librosa. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed.</i></div><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Dorien Herremans, PhD<br>Assistant Professor<br><a href="http://dorienherremans.com" target="_blank">http://dorienherremans.com</a><br><br>Singapore University of Technology and Design<br>Information Technology and Design Pillar<br>Office 1.502-18<br><br><br></div></div>