Single-speaker singing voice synthesis (SVS) usually underperforms at pitch values that lie outside the singer's vocal range or are covered by few training samples. Building on our previous work, this work proposes a melody-unsupervised multi-speaker pre-training method, conducted on a multi-singer dataset, to extend the single speaker's vocal range without degrading timbre similarity. The pre-training can be applied to a large-scale multi-singer dataset that contains only audio-and-lyrics pairs, with neither phonemic timing information nor pitch annotation. Specifically, in the pre-training step, a phoneme predictor produces frame-level phoneme probability vectors as a substitute for phonemic timing information, a speaker encoder models the timbre variations across singers, and frame-level f0 values are estimated directly from the audio to provide pitch information. The pre-trained parameters are then transferred to the fine-tuning step as prior knowledge to extend the single speaker's vocal range. Moreover, this work also improves the sound quality and rhythm naturalness of the synthesized singing voice: it is the first to introduce a differentiable duration regulator to improve rhythm naturalness and a bi-directional flow model to improve sound quality. Experimental results verify that the proposed SVS system outperforms the baseline in both sound quality and naturalness.
Fig. 1: The structure of the proposed model.
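The abstract describes three conditioning streams used during melody-unsupervised pre-training: frame-level phoneme probability vectors from a phoneme predictor (replacing annotated phonemic timing), a timbre embedding from a speaker encoder, and frame-level f0 estimated directly from the audio. As a rough illustration of how these streams could be combined before the acoustic decoder, a minimal PyTorch sketch is given below; the module designs, names, and dimensions are assumptions for illustration only, not the paper's implementation.

```python
# Minimal sketch (not the authors' implementation) of the melody-unsupervised
# pre-training conditioning: frame-level phoneme probabilities + speaker
# embedding + frame-level f0, concatenated as the decoder input.
import torch
import torch.nn as nn

class PhonemePredictor(nn.Module):
    """Predicts a phoneme probability vector for every acoustic frame (assumed design)."""
    def __init__(self, n_mels=80, hidden=256, n_phonemes=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, mels):                        # (B, T, n_mels)
        h, _ = self.rnn(mels)
        return self.proj(h).softmax(dim=-1)         # (B, T, n_phonemes)

class SpeakerEncoder(nn.Module):
    """Pools frame features into one timbre embedding per utterance (assumed design)."""
    def __init__(self, n_mels=80, d_spk=128):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_spk)

    def forward(self, mels):                        # (B, T, n_mels)
        return self.proj(mels).mean(dim=1)          # (B, d_spk)

def build_decoder_input(mels, f0):
    """Concatenate the three conditioning streams frame by frame."""
    phn_probs = PhonemePredictor()(mels)            # phonemic timing substitute
    spk = SpeakerEncoder()(mels)                    # timbre embedding
    spk = spk.unsqueeze(1).expand(-1, mels.size(1), -1)
    f0 = f0.unsqueeze(-1)                           # (B, T, 1), extracted from audio
    return torch.cat([phn_probs, spk, f0], dim=-1)  # (B, T, n_phonemes + d_spk + 1)

if __name__ == "__main__":
    mels = torch.randn(2, 200, 80)                  # dummy mel-spectrogram batch
    f0 = torch.rand(2, 200) * 400.0                 # dummy frame-level f0 in Hz
    print(build_decoder_input(mels, f0).shape)      # torch.Size([2, 200, 193])
```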
Subjective Evaluation
To demonstrate that the proposed model significantly improves the naturalness and quality of the synthesized singing voice, several samples are provided for comparison. GT denotes the ground truth, Baseline denotes the baseline model we compare against, and Proposed denotes the proposed model with the pre-training strategy, the learnable upsampling layer, and the bi-directional flow model, which are described in detail in the paper.
Target Chinese Text | GT | Baseline | Proposed
你说我不该不该不该在这时候
昂首到达每一个地方这世界的太阳
青春嫩绿得
很鲜明
想知道关于我的事情
我还在寻找一个依靠
不要再沉默徘徊
冲破这层层阻碍
我才明白外面世界如此精彩
时间飞这生命似钟摆
Ablation Study
We further conduct an ablation study to validate the contribution of each component in the proposed method: the pre-training strategy, the bi-directional flow model, and the learnable upsampling layer are removed in turn. The corresponding audio samples are presented below.
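Before the ablation samples, a brief note on the learnable upsampling layer: it realizes the differentiable duration regulator mentioned above, replacing hard length regulation (repeating each phoneme encoding an integer number of times) with a soft, fully differentiable mapping from phoneme-level to frame-level features. The sketch below shows one common realization, Gaussian upsampling, purely as an assumed illustration of the idea; it is not the formulation used in the paper.

```python
# Illustrative Gaussian upsampling: one possible form of a differentiable
# duration regulator (not necessarily the formulation used in the paper).
import torch

def gaussian_upsample(enc, durations, sigma=1.0):
    """
    enc:       (B, N, D)  phoneme-level encodings
    durations: (B, N)     predicted durations in frames (may stay continuous)
    returns:   (B, T, D)  frame-level features, T = rounded total duration
    """
    centers = torch.cumsum(durations, dim=1) - 0.5 * durations     # phoneme centers
    T = int(durations.sum(dim=1).max().round().item())
    t = torch.arange(T, dtype=enc.dtype, device=enc.device) + 0.5  # frame positions
    # Gaussian attention weight of every frame over every phoneme
    dist2 = (t.view(1, T, 1) - centers.unsqueeze(1)) ** 2          # (B, T, N)
    w = torch.softmax(-dist2 / (2.0 * sigma ** 2), dim=-1)         # (B, T, N)
    return torch.bmm(w, enc)                                       # (B, T, D)

if __name__ == "__main__":
    enc = torch.randn(1, 5, 16)                      # 5 phonemes, 16-dim encodings
    dur = torch.tensor([[3.0, 7.5, 4.2, 6.0, 5.3]])  # soft durations in frames
    print(gaussian_upsample(enc, dur).shape)         # torch.Size([1, 26, 16])
```

Because the frame-to-phoneme weights are a smooth function of the predicted durations, gradients flow back into the duration predictor, which is what makes the regulator differentiable.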
Target Chinese Text | GT | Proposed | without pretrain | without bi-flow
你说我不该不该不该在这时候
昂首到达每一个地方这世界的太阳
青春嫩绿得
很鲜明
想知道关于我的事情
我还在寻找一个依靠
不要再沉默徘徊
冲破这层层阻碍
我才明白外面世界如此精彩
时间飞这生命似钟摆
Target Chinese Text | GT | Proposed | without learnable upsampling layer
是不是说没有做完的梦最痛
把故事听到最后才说再见
右手左手慢动作重播
成长的烦恼算什么
昂首到达每一个地方这世界的太阳
青苔入镜檐下
小酒窝长睫毛迷人的无可救药
我放慢了步调感觉像是喝醉了
我永远爱你到老
就算曾经我们都轻如尘埃
Case Study
To demonstrate the impact of the aforementioned contributions, a case study is conducted on a test sample that contains pitch values covered by little training data. We compare the ground truth, the proposed method, and the baseline.
The pitch contour is marked with blue lines, and the pitch value at the red line is shown on the right. This sample ends with a relatively low pitch for which there are few training samples. The proposed method synthesizes this pitch accurately, whereas the baseline tends to replace it with a higher pitch, which demonstrates that the proposed pre-training strategy effectively extends the vocal range.
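The comparison in this case study can be approximated offline by extracting f0 contours from the ground-truth and synthesized waveforms and measuring their deviation on voiced frames. The sketch below uses librosa's pYIN tracker with placeholder file names; it is an assumed illustration, not the pitch-analysis tooling used in the paper.

```python
# Illustrative sketch: extract and compare f0 contours of ground-truth and
# synthesized audio (file names are placeholders; not the paper's tooling).
import librosa
import numpy as np

def extract_f0(path, sr=24000):
    """Return the frame-level f0 contour (Hz) of a waveform, NaN for unvoiced frames."""
    y, _ = librosa.load(path, sr=sr)
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    return f0

def voiced_f0_rmse(f0_ref, f0_syn):
    """RMSE over frames that are voiced in both contours."""
    n = min(len(f0_ref), len(f0_syn))
    a, b = f0_ref[:n], f0_syn[:n]
    mask = ~np.isnan(a) & ~np.isnan(b)
    return float(np.sqrt(np.mean((a[mask] - b[mask]) ** 2)))

if __name__ == "__main__":
    gt = extract_f0("ground_truth.wav")        # placeholder paths
    proposed = extract_f0("proposed.wav")
    baseline = extract_f0("baseline.wav")
    print("proposed f0 RMSE:", voiced_f0_rmse(gt, proposed))
    print("baseline f0 RMSE:", voiced_f0_rmse(gt, baseline))
```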