SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias

Accepted by ICME-2023.


Abstract

Nowadays, GAN-based neural vocoders are preferred for their ability to synthesize high-fidelity audio at high speed with a small footprint. However, it remains challenging to train a robust vocoder that can synthesize high-fidelity speech in significantly out-of-domain scenarios, such as unseen speakers with different styles, nonverbal vocalization, etc. In this work, we propose SnakeGAN, a GAN-based universal vocoder that generalizes well across various scenarios. We introduce time-domain supervision by feeding the coarse-grained signal generated by the DDSP oscillator to the generator. We also introduce periodic nonlinearities into the generator through the snake activation function and anti-aliased representations, which bring the desired inductive bias for waveform synthesis and significantly improve audio quality. To validate the effectiveness of the proposed method, we conducted empirical experiments on various scenarios with subjective and objective metrics. Experimental results show that SnakeGAN significantly outperforms the compared approaches and can generate high-fidelity audio, including unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalization.
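The snake activation referenced above follows the form f(x) = x + sin²(αx)/α, which adds a periodic component to the identity while keeping the function monotonically trending. A minimal sketch (the function name and default α below are illustrative, not taken from the paper, where α is typically a learnable per-channel parameter):

```python
import math

def snake(x, alpha=1.0):
    """Snake activation: x + sin^2(alpha * x) / alpha.

    The sin^2 term injects a periodic inductive bias suited to
    waveform synthesis; alpha controls the frequency of that
    periodic component (learnable in practice, fixed here).
    """
    return x + (math.sin(alpha * x) ** 2) / alpha
```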


Fig.1: The proposed SnakeGAN model.

Evaluation

To demonstrate that our proposed model can generate audio in various scenarios, such as expressive speech, singing voice, instrumental pieces, and nonverbal vocalization, and that it achieves superior performance, some samples are provided for comparison. GT (Ref Audio) denotes the ground-truth audio. HiFi-GAN denotes the official implementation of HiFi-GAN with the improved discriminator. DDSP-HooliGAN denotes the DDSP-based source-filter vocoder HooliGAN, also with the improved discriminator. SnakeGANv1 denotes version 1 of the proposed model, which takes the DDSP coarse-grained signal as an audio template so that the GAN generator only needs to learn the residual. SnakeGANv2 denotes version 2 of the proposed model, which incorporates the DDSP coarse-grained signal into the generator as prior knowledge by DWT-downsampling it to each upsampling block. We use 278 hours of speech training data in total, mixing Mandarin and English voices.
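For intuition, the DDSP coarse-grained signal can be viewed as the output of an additive harmonic oscillator driven by the f0 track. The sketch below is a simplification under stated assumptions: a per-sample f0 contour, a fixed number of equal-amplitude harmonics, and no learned harmonic distribution or noise component (the actual DDSP oscillator conditions these on the network's predictions):

```python
import math

def harmonic_oscillator(f0, n_harmonics=8, sr=24000):
    """Render a coarse harmonic signal from a per-sample f0 track.

    f0: list of fundamental-frequency values in Hz, one per sample
        (i.e. already upsampled from frame rate to sample rate).
    Returns a list of samples normalized to [-1, 1].
    """
    phase = [0.0] * n_harmonics  # running phase of each harmonic
    out = []
    for f in f0:
        s = 0.0
        for k in range(n_harmonics):
            h = (k + 1) * f
            if h < sr / 2:  # drop harmonics above Nyquist to avoid aliasing
                phase[k] += 2.0 * math.pi * h / sr
                s += math.sin(phase[k])
        out.append(s / n_harmonics)
    return out
```

In SnakeGANv1 this signal would serve as the audio template the generator refines; in SnakeGANv2, downsampled copies of it would condition each upsampling block.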

Expressive-Speech

Audio samples: GT (Ref Audio) · HiFi-GAN · DDSP-HooliGAN · SnakeGANv1 · SnakeGANv2

Singing Voice

Audio samples: GT (Ref Audio) · HiFi-GAN · DDSP-HooliGAN · SnakeGANv1 · SnakeGANv2

Instrumental Pieces

Audio samples: GT (Ref Audio) · HiFi-GAN · DDSP-HooliGAN · SnakeGANv1 · SnakeGANv2

Nonverbal Vocalization

Audio samples: GT (Ref Audio) · HiFi-GAN · DDSP-HooliGAN · SnakeGANv1 · SnakeGANv2

Investigating the Effectiveness of the DDSP Structure

To validate the effectiveness of the DDSP structure, we implemented the DDSP-based vocoder HooliGAN and compared it with HiFi-GAN. Both models are trained on LJSpeech, while the out-of-domain scenarios cover unseen speakers, an unseen language, singing voices, and instrumental pieces.

Speech

Audio samples: GT (Ref Audio) · HiFi-GAN · DDSP-HooliGAN

Singing Voices & Instrumental Pieces

Audio samples: GT (Ref Audio) · HiFi-GAN · DDSP-HooliGAN