Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
submitted to INTERSPEECH 2023.
Subjective Evaluation
To demonstrate that our proposed model achieves significantly better expressive speech synthesis, with the ability to model spontaneous behavior in spontaneous-style speech and to predict reasonable spontaneous behavior from text, some samples are provided for comparison.
GT
means ground truth.
FastSpeech 2
means an open-source implementation of FastSpeech 2.
UCS*
means a unified controllable spontaneous conversational speech synthesis (UCS) model with some modifications according to the paper. In addition, a well-trained HiFi-GAN is used as the vocoder to generate waveforms.
In the MOS test, the real speech of the sentence preceding or following the current sentence is provided as
REF
to serve as a reference for conversational contexts.
Each experiment was divided into two groups.
w label
group is given spontaneous behavior labels during the inference phase, and the evaluation metrics are the naturalness of spontaneous behavior in the synthesized speech and the overall naturalness of the spontaneous style.
wo label
group uses spontaneous behavior labels automatically predicted by the model during the inference phase, and the evaluation metrics focus more on the reasonableness of spontaneous behavior in the synthesized speech.
In the text, 1 denotes a filled pause, 2 denotes a prolongation, and 3 denotes both occurring together.
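The label codes in the text can be read as bit flags, since 3 = 1 | 2. The following is a minimal sketch of that reading, not the authors' implementation: the class name `SpontaneousBehavior`, the helper `describe`, and the use of 0 for "no spontaneous behavior" are all assumptions made for illustration.

```python
from enum import IntFlag
from typing import List


class SpontaneousBehavior(IntFlag):
    """Hypothetical flag type for the label codes used in the sample text."""
    NONE = 0           # assumption: 0 is not mentioned on the page
    FILLED_PAUSE = 1   # code 1 in the text
    PROLONGATION = 2   # code 2 in the text


def describe(code: int) -> List[str]:
    """Return the behaviors encoded by a label code (3 yields both)."""
    flags = SpontaneousBehavior(code)
    names = []
    if flags & SpontaneousBehavior.FILLED_PAUSE:
        names.append("filled pause")
    if flags & SpontaneousBehavior.PROLONGATION:
        names.append("prolongation")
    return names
```

For example, `describe(3)` lists both behaviors, matching the "both" case for label 3.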