Connectionists: nnAudio - new Open Source library for GPU-based on the fly audio processing in PyTorch

Dorien Herremans dorien.herremans at gmail.com
Thu Oct 1 02:47:28 EDT 2020


Dear community,

I am happy to present our *new library nnAudio
<https://github.com/KinWaiCheuk/nnAudio>*, which allows you to feed
waveforms directly into a PyTorch neural network. Our nnAudio layer
converts the waveforms on the fly to spectrograms (linear, log, Mel, CQT)
and even offers a trainable (Fourier) kernel. So there is no more storing
large batches of spectrogram images and no preprocessing: we obtain speeds
100x faster than traditional processing, plus you can fine-tune the
spectrogram to your task through training.
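The core idea behind such a trainable transform is that an STFT can be written as a 1D convolution whose filters are initialized to the Fourier basis; those filters can then be left fixed or updated by back-propagation. Below is a minimal numpy sketch of that principle (not nnAudio's actual code; all function names here are illustrative):

```python
# Hypothetical sketch: one STFT frame computed as a dot product with
# Fourier-basis kernels -- the same operation a conv1d layer performs
# when its filters are initialized to cosines and sines.
import numpy as np

def fourier_kernels(n_fft):
    """Build cosine/sine kernels; each row acts as one convolution filter."""
    n_bins = n_fft // 2 + 1            # number of non-redundant bins
    t = np.arange(n_fft)
    k = np.arange(n_bins)[:, None]
    cos_k = np.cos(2 * np.pi * k * t / n_fft)    # real part of the basis
    sin_k = -np.sin(2 * np.pi * k * t / n_fft)   # imaginary part
    return cos_k, sin_k

def stft_frame_magnitude(x, cos_k, sin_k):
    """Magnitude spectrum of one frame via the kernel dot products."""
    real = cos_k @ x
    imag = sin_k @ x
    return np.sqrt(real**2 + imag**2)

n_fft = 64
rng = np.random.default_rng(0)
frame = rng.standard_normal(n_fft)
cos_k, sin_k = fourier_kernels(n_fft)
mag = stft_frame_magnitude(frame, cos_k, sin_k)

# Identical to the FFT magnitude of the same frame:
assert np.allclose(mag, np.abs(np.fft.rfft(frame)))
```

In a PyTorch setting, `cos_k` and `sin_k` become the weights of a `Conv1d` layer, so making the transform trainable is just a matter of not freezing those weights.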

More info on how to use nnAudio: https://github.com/KinWaiCheuk/nnAudio

If you are interested in becoming a *contributor* to nnAudio to help with
the feature requests we have been receiving from our rapidly growing user
base, let me know!

More info in our publication:
*K. W. Cheuk, H. Anderson, K. Agres and D. Herremans, "nnAudio: An
on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D
Convolutional Neural Networks," in IEEE Access, doi:
10.1109/ACCESS.2020.3019084. *https://ieeexplore.ieee.org/document/9174990

*In this paper, we present nnAudio, a new neural-network-based audio
processing framework with graphics processing unit (GPU) support that
leverages 1D convolutional neural networks to perform time-domain to
frequency-domain conversion. Its speed allows on-the-fly spectrogram
extraction, without the need to store any spectrograms on disk. Moreover,
this approach also allows back-propagation through the
waveform-to-spectrogram transformation layer; hence, the transformation
process can be made trainable, further optimizing the
waveform-to-spectrogram transformation for the specific task that the
neural network is trained on. All spectrogram implementations scale
linearly with respect to the input length. Because nnAudio leverages the
compute unified device architecture (CUDA) 1D convolutional neural network
implementation from PyTorch, its short-time Fourier transform (STFT), Mel
spectrogram, and constant-Q transform (CQT) implementations are an order of
magnitude faster than other implementations using only the central
processing unit (CPU). We tested our framework on three different machines
with NVIDIA GPUs, and our framework significantly reduces the spectrogram
extraction time from the order of seconds (using the popular Python library
librosa) to the order of milliseconds, given that the audio recordings are
of the same length. When applying nnAudio to variable input audio lengths,
an average of 11.5 hours is required to extract 34 spectrogram types with
different parameters from the MusicNet dataset using librosa, whereas an
average of 2.8 hours is required for nnAudio, four times faster than
librosa. Our proposed framework also outperforms existing GPU processing
libraries such as Kapre and torchaudio in terms of processing speed.*
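Of the transforms the abstract lists, the Mel spectrogram adds one standard step on top of the STFT: a bank of triangular filters that pools linear-frequency bins into Mel bands. A minimal numpy sketch of that standard construction (again illustrative, not nnAudio's actual implementation):

```python
# Hypothetical sketch of the standard triangular Mel filterbank:
# a (n_mels x n_bins) matrix that maps an STFT power spectrum to Mel bands.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with centers equally spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2, n_bins)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)     # up-slope
        falling = (right - fft_freqs) / (right - center)  # down-slope
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

fb = mel_filterbank(n_mels=8, n_fft=64, sr=8000)
power_spec = np.abs(np.fft.rfft(np.ones(64))) ** 2
mel_spec = fb @ power_spec        # shape: (8,)
```

Because this is a single matrix multiply applied after the convolutional STFT, the whole pipeline stays differentiable, which is what lets the full waveform-to-Mel transform be fine-tuned end to end.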

-- 
Dorien Herremans, PhD
Assistant Professor
http://dorienherremans.com

Singapore University of Technology and Design
Information Technology and Design Pillar
Office 1.502-18