Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

Submitted to INTERSPEECH 2023.

Subjective Evaluation

To demonstrate that our proposed model achieves significantly better expressive speech synthesis performance, with the ability to model spontaneous behavior in spontaneous-style speech and to predict reasonable spontaneous behavior from text, some samples are provided for comparison. GT denotes the ground truth. FastSpeech 2 denotes an open-source implementation of FastSpeech 2. UCS* denotes a unified controllable spontaneous conversational speech synthesis (UCS) model with some modifications based on the original paper. In addition, a well-trained HiFi-GAN is used as the vocoder to generate waveforms.

In the MOS tests, the real speech of the sentence preceding or following the current sentence is provided as REF, serving as a reference for the conversational context.

Each experiment was divided into two groups. In the w label (with label) group, spontaneous behavior labels were provided during inference, and the evaluation metrics were the naturalness of the spontaneous behavior in the synthesized speech and the overall naturalness of the spontaneous style. In the wo label (without label) group, the spontaneous behavior labels were predicted automatically by the model during inference, and the evaluation focused more on the rationality of the spontaneous behavior in the synthesized speech. In the text, (1) marks a filled pause, (2) marks a prolongation, and (3) marks both occurring together.
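
For readers who want to process the sample texts programmatically, the inline label notation described above can be separated from the plain text with a small parser. This is only a minimal sketch, not the authors' preprocessing code: the function name `parse_labels` is hypothetical, and attaching each label to the single character immediately before it is an assumption, since the page does not state how labels align with words or syllables.

```python
import re

# Label meanings as stated in the text: 1 = filled pause, 2 = prolongation, 3 = both.
LABELS = {1: "filled pause", 2: "prolongation", 3: "filled pause + prolongation"}

# Matches "(1)", "(2)" or "(3)" written with ASCII parentheses.
# Context sentences use full-width parentheses and are therefore left untouched.
_LABEL_RE = re.compile(r"\(([123])\)")


def parse_labels(text):
    """Hypothetical helper: strip inline labels from a sample text.

    Returns (plain_text, labels), where labels is a list of
    (char_index, label_id) pairs and char_index points at the character
    in plain_text that the label immediately follows (an assumption).
    """
    pieces = []
    labels = []
    plain_len = 0
    pos = 0
    for match in _LABEL_RE.finditer(text):
        segment = text[pos:match.start()]
        pieces.append(segment)
        plain_len += len(segment)
        if plain_len > 0:  # ignore a label with no preceding character
            labels.append((plain_len - 1, int(match.group(1))))
        pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces), labels


if __name__ == "__main__":
    plain, labels = parse_labels("你(1)、你说什么？")
    print(plain)
    for index, label_id in labels:
        print(index, plain[index], LABELS[label_id])
```

Applied to a sample such as 你(1)、你说什么？, the helper returns the text without the marker together with one label entry for the first character.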

MOS1 (w label)

Target Chinese Text	REF	FastSpeech 2	UCS*	Proposed	GT
唉(1)那个(3)……你(2)……你有听过北京国际雕塑公园吗？（听过呀，那个是是一个雕塑文化艺术园区。）
（有的有的，是……零幺零六二八八幺幺四四，这个景点可以玩儿多久啊？）嗯(3)可以玩(3)……一个、一个小时差不多吧。
（嗯这个这个有的，电话是零幺零六四三三八八八七。对了这个……这个景点能玩儿多长时间啊？）嗯(3)两小时左右吧我觉得，嗯(1)你看可以吗？
（为、为什么啊？哎，算了，你走吧。）那(3)……那你把伞拿着吧。

MOS2 (wo label)

Target Chinese Text	REF	FastSpeech 2	UCS*	Proposed	GT
（哦好的，诶对了，那个它的周边……周边还有什么景点吗？）有那个恭王府还有南锣鼓巷。
你好，那个你知道……万……万寿山在什么地方吗？（当然了，就在海淀区新建宫门路颐和园里面，嗯你有这边的电话吧。）
（有啊有啊，要说起来的话你是不是还没有去过清华啊。你要不顺便就一道去了呗，反正你妹妹早晚也要了解的。）嗯我没去过。我……我去过的地方太少了，清华怎么样？有意思吗？
（是啊，嗯我今天早上刚刚才回来。）哦听起来不错。嗯你去那儿都干了什么啊？

Ablation Study

Investigation on linguistics-aware encoder (w label)

Target Chinese Text	Proposed	without linguistics-aware encoder
你(1)、你说什么？
那(2)它的地址呢？地址是在哪儿？
谢谢。嗯(3)……我的车好多啦，这个是新的。
哦(3)你学英语多久了？

Investigation on semi-supervised pre-training method (wo label)

Target Chinese Text	Proposed	without semi-supervised pre-training method
嗯对，那那你先打一下试一下吧。
可是现在不是不是还挺早的吗？
唉，算了吧，我我还是想等到天黑再过去。
哦……那……那她边上好像还有很多其他景点吧？