Abstract
Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the style embedding at a single scale from information within the current sentence. They neglect the context information in neighboring utterances and the multi-scale nature of style in human speech, which makes it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis that captures and predicts style at different levels from a wider range of context than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor explores hierarchical context information by considering the structural relationships in the context, and predicts style embeddings at the global level, sentence level, and subword level. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms the three baselines. In addition, we analyze the context information and the multi-scale style representations, which have never been discussed before.

Subjective Evaluation
To demonstrate that our proposed model significantly improves the naturalness and expressiveness of synthesized speech, we provide samples for comparison. GT denotes the ground truth. FastSpeech 2 denotes an open-source implementation of FastSpeech 2. WSV* denotes the word-level style variations (WSV) model with several changes, which are described in detail in the paper. HCE denotes the hierarchical context encoder (HCE) model, which predicts global-level style from the context. A well-trained HiFi-GAN is used as the vocoder to generate waveforms.
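For readers who want to run a comparable listening test on these samples, the following is a minimal sketch of how per-system MOS values are commonly aggregated: the mean listener rating with a normal-approximation 95% confidence interval. The function name `mos_with_ci`, the system labels, and the ratings are hypothetical placeholders, not results from the paper, and the paper's exact scoring protocol may differ.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    half_width is the half-width of a normal-approximation confidence
    interval (z=1.96 corresponds to roughly 95%). This is a generic
    aggregation, assumed here for illustration only.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a CI")
    mean = statistics.fmean(ratings)
    stdev = statistics.stdev(ratings)  # sample standard deviation
    half_width = z * stdev / math.sqrt(len(ratings))
    return mean, half_width


# Placeholder ratings on a 1-5 scale (made up, NOT data from the paper).
example_ratings = {
    "system_a": [3, 4, 3, 4, 3],
    "system_b": [4, 5, 4, 4, 5],
}

for system, ratings in example_ratings.items():
    mean, half_width = mos_with_ci(ratings)
    print(f"{system}: {mean:.2f} +/- {half_width:.2f}")
```

With only a handful of ratings per system the interval is wide, so an actual test would collect many ratings per sample before comparing systems.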
S-MOS (in-domain)
Target Chinese Text | GT | FastSpeech 2 | WSV* | HCE | MSStyleTTS |
---|---|---|---|---|---|
小公母儿俩一进屋儿,屋儿里又多了两个人。 | |||||
晚上,赏三两小醑酒,又把客人吃剩的汤菜做成杂烩,送到砦四海的窝棚。 | |||||
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。 | |||||
寥花儿打累了,也把胆怯的心给打没了,拥被坐着,喘着粗气。 | |||||
马嵬坡下草青青,今日犹存妃子陵。 | |||||
郭二坏一眼瞥见余为农,行色匆匆地顺着二道街往前奔。 |
S-MOS (out-of-domain)
Target Chinese Text | FastSpeech 2 | WSV* | HCE | MSStyleTTS |
---|---|---|---|---|
这把火终于烧起来了,而且是燎原之势。 | ||||
作为一个正常人,在做出一个可能会掉脑袋的决定的选择上,是绝对不会如此轻率的。 | ||||
在讨饭的时候,他仔细研究了淮西的地理、山脉、风土人情,他开阔了视野,丰富了见识,认识了很多豪杰。 | ||||
甚至很多同学动不动,啊还会讨论各种学者的观点,什么张说李说陈说周说王说等等等等等等。 | ||||
所以这就决定了,我们的复习方向并不需要面面俱到。 | ||||
那学着学着学着,是不是有一种望洋兴叹的感觉? |
M-MOS
Target Chinese Text | GT | FastSpeech 2 | WSV* | HCE | MSStyleTTS | MSStyleTTS(AR) |
---|---|---|---|---|---|---|
窜轰子,是黑话就是烧死。六格被绑在了通天神树的半截腰儿上。土匪们抱来了成捆的羊草,堆在了他脚下。六格豪迈地说:大当家哒,羊草是薰蚊子哒。 | ||||||
指着锅里的附子片说:这玩意儿自古就被列为‘回阳救逆第一品“。但是,你不炮制它,或者炮制的不得法,它就是断肠草啊。令人不能呼吸,心脏骤停。 | ||||||
双录笑着征求五爷:五爷,我看先开席吧。诶图协领和载佐领官身由不得自己呀,陪着钦差大老爷四处转悠呐”嗯行。咱们爷儿们儿先喝着,他们俩啥时候儿到啥时候儿补上”。 | ||||||
灶坑火退了,除了炕头儿再没暖和的地方。寥花儿冻得不行,悄悄儿地脱了棉祅棉裤,往暖呼儿呼儿的被窝儿里钻。六格一直在装睡“嘿嘿地笑了”一声。 | ||||||
穆隆阿吃惊地问桃儿:诶呦,啥事儿把你急成这样儿啊”。桃儿拉着穆隆阿进了里屋儿,把刚才的事儿根根梢梢儿地说了一遍,末了儿她问穆隆阿。 | ||||||
看着前翰林府塌了架,任木匠才回到了刘二华堂大车店。刘二华堂会来事儿,任木匠二进古城子,还住在他家的上房。刘二华堂殷勤地端茶递水,任木匠也不避讳刘二华堂,对四梁八柱说。 |
Ablation Study
The effect of using a knowledge distillation strategy to train the predictor
In-domain
Target Chinese Text | MSStyleTTS | without residual style embedding |
---|---|---|
必须用你的目光逼退鹰眼射出的寒光。 | ||
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。 | ||
顺便来看看你这个小兔崽子,讨你一口儿江鲜野味儿。 | ||
她自作主张,选了两个分量足的样式,西施红着脸点了点头。 | ||
勾秀云嘴上缺个把门儿的,她调笑四海。 |
Out-of-domain
Target Chinese Text | MSStyleTTS | without residual style embedding |
---|---|---|
这个人叫周德兴,我们后面还要经常提到他。 | ||
这个人当然就是我们的朱重八。 | ||
最典型的疏忽大意,就是所谓的忘却法,我忘了干嘛忘了干嘛。 | ||
方向盘也断了,喇叭也坏了,玻璃也摇不下来了,我嗓子也哑了。 | ||
我以为我踩错了,又把刹车踩到底,啪两个人被撞死了。 |
The effect of using residuals to represent style variations
Target Chinese Text | MSStyleTTS | without residual style embedding | GT |
---|---|---|---|
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。 | |||
勾秀云嘴上缺个把门儿的,她调笑四海。 | |||
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。 | |||
瓜尔佳氏哼了一声,呵斥道。 | |||
打开食盒,里面儿是血肠儿白肉、大馅儿包子,还有一葫芦酒。 |
Comparison of different ranges of context information in the predictor
Target Chinese Text | L=0 | L=1 | L=2 | L=3 | L=4 |
---|---|---|---|---|---|
每年腊月门子忙活一阵,賺到的银两都在正月里的赌场上还了人家。 | |||||
本以为六格会搂席,未承想却斯文起来,端端正正儿地坐在那儿,莞尔一笑,想了半天他说. | |||||
贴上余为农,既养了家也解了自己的饥渴。 | |||||
瓜尔佳氏哼了一声,呵斥道。 | |||||
再说了,汪半城也脚着,就算这西施有些说道儿。 |
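The L values in the table above control how much neighboring text the predictor sees. As a minimal sketch, assuming L counts the sentences taken on each side of the current sentence (this page does not define L, so that reading is an assumption) and that the window is simply truncated at document boundaries, the context for one sentence could be selected as follows. The function name `context_window` is hypothetical.

```python
from typing import List


def context_window(sentences: List[str], index: int, L: int) -> List[str]:
    """Return the current sentence plus up to L neighbors on each side.

    Assumption: L is the number of past and future sentences included.
    Near the start or end of the document the window is truncated rather
    than padded; the actual model may handle boundaries differently.
    """
    if L < 0:
        raise ValueError("L must be non-negative")
    if not 0 <= index < len(sentences):
        raise IndexError("index out of range")
    start = max(0, index - L)
    end = min(len(sentences), index + L + 1)
    return sentences[start:end]


# Usage with placeholder sentence IDs.
doc = ["s0", "s1", "s2", "s3", "s4"]
print(context_window(doc, 2, 0))  # current sentence only
print(context_window(doc, 2, 1))  # one neighbor on each side
```

Under this reading, L=0 reduces to sentence-only prediction, which is why it serves as the no-context reference point in the comparison.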
The effect of the multi-scale style predictor
Target Chinese Text | MSStyleTTS | without residual connections |
---|---|---|
无论啥人想给猩猩怪翻案,都不是那么好相与的。 | ||
要想去掉链子,再花三十吊. | ||
刘二华堂会来事儿,任木匠二进古城子,还住在他家的上房。 | ||
他连忙儿打开了盒子,假地契原封不动儿地还躺在里面儿。余为商抹了把汗,胆儿突突地问. | ||
怀瑾听了若有所悟,双手合十唱了一声佛号,躬身退了出去。 |
Comparison of global-level, sentence-level, and subword-level style representations
Investigation on global-level style
Target Chinese Text | Proposed | without global-level style | GT |
---|---|---|---|
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。 | |||
小公母儿俩一进屋儿,屋儿里又多了两个人。 | |||
西施留在了汪家,桃儿才体会到了什么叫汪大奶奶。 | |||
小施主,关老爷一生最重一个义字。 | |||
乌雅氏和勾秀云早已经捷足先登了。 |
Investigation on global-level and sentence-level style
Target Chinese Text | Proposed | without global-level and sentence-level style | GT |
---|---|---|---|
终于有人跳下了炕,明保脑瓜皮酥了一下。 | |||
他使劲儿拍了拍穆隆阿,又使劲儿拍了拍六格,骂了一句。 | |||
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。 | |||
瓜尔佳氏哼了一声,呵斥道。 | |||
必须用你的目光逼退鹰眼射出的寒光。 |
Case Study
To further explore the impact of the multi-scale style modeling framework on the expressiveness and prosody of synthesized speech, we conduct two case studies, each comparing MSStyleTTS with one of the two mono-scale baselines (HCE and WSV). The ground-truth speech is also provided as a reference.
Test case 1
Model | Target Chinese Text | Audio | Mel-Spectrogram |
---|---|---|---|
HCE | 明保听了大为高兴啊。 | ![]() |
GT | 明保听了大为高兴啊。 | ![]() |
Proposed | 明保听了大为高兴啊。 | ![]() |
Test case 2
Model | Target Chinese Text | Audio | Mel-Spectrogram |
---|---|---|---|
WSV | 裕瑚鲁氏摇了摇头。 | ![]() |
GT | 裕瑚鲁氏摇了摇头。 | ![]() |
Proposed | 裕瑚鲁氏摇了摇头。 | ![]() |