Abstract
Spontaneous-style speech synthesis, which aims to generate human-like speech, is often hampered by the scarcity of high-quality data and by limited model capability. Recent language-model-based TTS systems can be trained on large, diverse, low-quality speech datasets, resulting in highly natural synthesized speech. However, they struggle to simulate the variety of spontaneous behaviors and to capture the prosody variations found in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize diverse spontaneous behaviors and model them uniformly. In addition, we introduce fine-grained prosody modeling to improve the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that the proposed method significantly outperforms the baseline methods in both prosody naturalness and spontaneous behavior naturalness.
Definitions, lexical features, and acoustic characteristics of spontaneous behaviors
Audio samples for different models
NOTE: In the text, <laughter> indicates laughter. The type of each spontaneous behavior is given in parentheses in Chinese and English, and the text corresponding to the spontaneous behavior is bolded.
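The annotation convention described in the NOTE can be separated from the spoken text programmatically. The following is a minimal sketch, assuming every label has the form `(Chinese type,English type)` as in the sample tables; the function name `parse_annotated` and the regular expression are illustrative assumptions and are not part of the proposed system.

```python
import re

# Assumption: a label looks like "(结巴,Stutter)", with ASCII or fullwidth
# parentheses and comma. This pattern is illustrative only.
LABEL = re.compile(r"[((]([^,,()()]+)[,,]\s*([^()()]+)[))]")


def parse_annotated(text):
    """Split an annotated sample into plain text and behavior labels.

    Returns (plain_text, labels), where each label is a tuple
    (chinese_type, english_type, offset) and offset is the character
    position in plain_text at which the label appeared.
    """
    labels = []
    parts = []
    cursor = 0      # position in the annotated text
    plain_len = 0   # length of the plain text collected so far
    for match in LABEL.finditer(text):
        segment = text[cursor:match.start()]
        parts.append(segment)
        plain_len += len(segment)
        labels.append((match.group(1), match.group(2).strip(), plain_len))
        cursor = match.end()
    parts.append(text[cursor:])
    return "".join(parts), labels


if __name__ == "__main__":
    plain, labels = parse_annotated("那那那(结巴,Stutter)你做买卖啊。")
    print(plain)
    print(labels)
```

The `<laughter>` token is left in the plain text on purpose, since the NOTE treats it as part of the input text rather than as a label.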
MOS
Target Chinese Text | FastSpeech 2 | VALL-E | Base-L | Proposed |
---|---|---|---|---|
<laughter>(忍不住笑,Involuntary laughter)好啊我今天就过来找你。 | ||||
那那那(结巴,Stutter)你做买卖啊。 | ||||
今天是个好日子,那就做两组普拉提吧! | ||||
<laughter>(嘲笑,Scoff)你这个技术,还是先赢了他再说吧。 | ||||
<laughter>(大笑,Cachinnation)这个笑话太好笑了! | ||||
哦(醒悟,Realization)你原来以为这是在家里呀? | ||||
嗯(赞同,Positive feedback),风景真美丽呀。 | ||||
你只要想见,随时都可以见的呀(撒娇,Coquetry)! | ||||
喂(提醒,Reminder),是不是遇到什么麻烦了,我能帮你什么吗? | ||||
嗯?(疑惑,Doubt)卖塑料瓶可以吗? | ||||
Comparison of manually labeled spontaneous labels and model-predicted spontaneous labels (ABX)
To demonstrate that predicted labels also produce speech with reasonable spontaneous behavior, the labels are not shown in the sample text.
Target Chinese Text | Proposed-manual | Proposed-predicted |
---|---|---|
等一下,呃,这个好像不是这样的。 | ||
唉,你又觉得我不乖了是吗? | ||
好的好的,芋泥啵啵奶茶大杯不加冰,稍等五分钟,马上好! | ||
<laughter>你怎么在这里啊。 | ||
哼,很多人总是一边嫌弃一边还玩。 | ||
Ablation Study
Investigation on spontaneous prosody modeling
Target Chinese Text | Proposed | without spontaneous prosody modeling |
---|---|---|
你,你(结巴,Stutter)瞎说。 | ||
<laughter>(忍不住笑,Involuntary laughter)不是,这个不是这样放的啦。 | ||
嗯(撒娇,Coquetry),好困,让我再睡一会吧。 | ||
我最近开始学瑜伽,感觉对身体和心灵都很有好处。 | ||
Investigation on spontaneous behavior modeling
Target Chinese Text | Proposed | without spontaneous behavior modeling |
---|---|---|
<laughter>(微笑,Smile)先生您的酒店在这里,请跟我走。 | ||
哼(撒娇,Coquetry),你再这样我就不理你了啊。 | ||
那(填充停顿,Filled pause),你还爱他吗? | ||
你周末(填充停顿,Filled pause)有什么计划?我们可以一起,(填充停顿,Filled pause),去看场电影或者散步。 | ||
Controllability of spontaneous behaviors
NOTE: The text corresponding to the spontaneous behavior is bolded.
Target Chinese Text | Spontaneous behavior type | Audio |
---|---|---|
嗯,风景真美丽呀。 | 赞同,Positive feedback | |
嗯,风景真美丽呀。 | 撒娇,Coquetry | |
嗯,风景真美丽呀。 | 填充停顿,Filled pause | |
<laughter>好啊,我今天就过来找你 | 忍不住笑,Involuntary laughter | |
<laughter>好啊,我今天就过来找你 | 尬笑,Awkward laughter | |
<laughter>好啊,我今天就过来找你 | 微笑,Smile | |
你周末有什么计划?我们可以一起,去看场电影或者散步。 | 填充停顿,Filled pause | |
你周末有什么计划?我们可以一起,去看场电影或者散步。 | 结巴,Stutter | |