Nowadays, GAN-based neural vocoders are preferred for their ability to synthesize high-fidelity audio at high speed with a small footprint. However, it remains challenging to train a robust vocoder that synthesizes high-fidelity speech in significantly out-of-domain scenarios, such as unseen speakers with different styles, nonverbal vocalizations, etc.
In this work, we propose SnakeGAN, a GAN-based universal vocoder that generalizes well across various scenarios. We introduce time-domain supervision by feeding the coarse-grained signal generated by a DDSP oscillator to the generator. We also introduce periodic nonlinearities, through the snake activation function and anti-aliased representations, into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality.
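The snake activation mentioned above has a standard closed form, x + (1/α)·sin²(αx), which adds a learnable-frequency periodic component on top of an identity path. A minimal NumPy sketch (the function name and default α are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The identity term preserves the input, while the periodic sin^2
    term injects the inductive bias useful for waveform synthesis.
    alpha controls the frequency of the periodic component.
    """
    return x + np.sin(alpha * x) ** 2 / alpha
```

Note that snake reduces to the identity at multiples of π/α, and for small α it approaches a plain linear map, so the periodicity can be annealed per channel.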
To validate the effectiveness of the proposed method, we conducted empirical experiments across various scenarios with both subjective and objective metrics. Experimental results show that SnakeGAN significantly outperforms the compared approaches and can generate high-fidelity audio for unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalizations.
Fig.1: The proposed SnakeGAN model.
Evaluation
To demonstrate that our proposed model can generate audio for various scenarios, such as expressive speech, singing voice, instrumental pieces, and nonverbal vocalization, and that it achieves superior performance, we provide samples for comparison. GT (Ref Audio) denotes the ground-truth audio. HiFi-GAN denotes the official implementation of HiFi-GAN with the improved discriminator. DDSP-HooliGAN denotes the DDSP-based source-filter vocoder HooliGAN, also with the improved discriminator. SnakeGANv1 denotes the proposed model in version 1, which takes the DDSP coarse-grained signal as an audio template, so the GAN generator only needs to learn the residual. SnakeGANv2 denotes the proposed model in version 2, which incorporates the DDSP coarse-grained signal into the generator as prior knowledge, DWT-downsampled to match each upsampling block. We use 278 hours of speech training data in total, mixing Mandarin and English voices.
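The SnakeGANv2 conditioning described above can be illustrated with a Haar DWT: each level keeps only the low-pass (approximation) coefficients, halving the length so the coarse-grained signal matches the resolution of each upsampling block. A minimal sketch under that assumption (function names and the choice of Haar are hypothetical; the paper does not specify the wavelet here):

```python
import numpy as np

def haar_dwt_approx(x):
    """One level of the Haar DWT, keeping only the low-pass
    (approximation) coefficients; halves the sequence length."""
    x = x[: len(x) // 2 * 2]            # drop an odd trailing sample
    return (x[0::2] + x[1::2]) / np.sqrt(2.0)

def multiscale_templates(coarse, num_levels):
    """Hypothetical helper: progressively DWT-downsampled copies of
    the DDSP coarse-grained signal, one per upsampling block.
    Returned coarsest-first, matching the low-resolution early blocks."""
    scales = [coarse]
    for _ in range(num_levels - 1):
        scales.append(haar_dwt_approx(scales[-1]))
    return scales[::-1]
```

In a real generator each template would be added or concatenated to the corresponding block's feature map; this sketch only shows how the resolutions are made to line up.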
Expressive-Speech
GT (Ref Audio)
HiFi-GAN
DDSP-HooliGAN
SnakeGANv1
SnakeGANv2
Singing Voice
GT (Ref Audio)
HiFi-GAN
DDSP-HooliGAN
SnakeGANv1
SnakeGANv2
Instrumental Pieces
GT (Ref Audio)
HiFi-GAN
DDSP-HooliGAN
SnakeGANv1
SnakeGANv2
Nonverbal Vocalization
GT (Ref Audio)
HiFi-GAN
DDSP-HooliGAN
SnakeGANv1
SnakeGANv2
Investigating the effectiveness of the DDSP structure
To validate the effectiveness of the DDSP structure, we implement a DDSP-based vocoder, HooliGAN, and compare it with HiFi-GAN. Both are trained on LJSpeech, and the out-of-domain scenarios cover unseen speakers, an unseen language, singing voices, and instrumental pieces.