Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Submitted to Interspeech 2022.


Abstract

Previous work on expressive speech synthesis focuses on modelling a mono-scale style embedding from the current sentence or its context, and neglects the multi-scale nature of speaking style in human speech. In this paper, we propose a multi-scale speaking style modelling method that captures and predicts multi-scale speaking style to improve the naturalness and expressiveness of synthetic speech. A multi-scale extractor is proposed to extract speaking style embeddings at three different levels from the ground-truth speech and to explicitly guide the training of a multi-scale style predictor based on hierarchical context information. Both objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.
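The idea of an extractor that "explicitly guides" the predictor can be illustrated with a small sketch. This is not the authors' implementation: it assumes the guidance is a mean-squared-error term between predicted and extracted embeddings, summed over the three levels, and it represents each level as a single vector for simplicity. The level names (`global`, `sentence`, `subword`) and the function names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def style_guidance_loss(
    predicted: Dict[str, Vector], extracted: Dict[str, Vector]
) -> float:
    """Sum of per-level MSE terms (an assumed form of the guidance loss).

    `predicted` holds embeddings predicted from context; `extracted` holds
    embeddings taken from ground-truth speech. Keys are level names.
    """
    if predicted.keys() != extracted.keys():
        raise ValueError("both inputs must cover the same levels")
    return sum(mse(predicted[level], extracted[level]) for level in extracted)


# Hypothetical level names and toy 2-dimensional embeddings.
predicted = {"global": [0.0, 0.0], "sentence": [1.0, 1.0], "subword": [2.0, 2.0]}
extracted = {"global": [1.0, 1.0], "sentence": [1.0, 1.0], "subword": [2.0, 4.0]}
loss = style_guidance_loss(predicted, extracted)
```

In a real system the finest level would be a sequence of embeddings rather than one vector, and the per-level terms might be weighted; the paper gives the actual formulation.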


Fig.1: The architecture of our proposed model.

Subjective Evaluation

To demonstrate that our proposed model can significantly improve the naturalness and expressiveness of the synthesized speech, samples from each system are provided for comparison. GT denotes the ground-truth recordings. FastSpeech 2 is an open-source implementation of FastSpeech 2. WSV* is the word-level style variations (WSV) model with several changes, which are described in detail in the paper. HCE is the hierarchical context encoder model, which predicts a global-level style from the context. A well-trained HiFi-GAN is used as the vocoder to generate the waveforms.

Target Chinese Text | GT | FastSpeech 2 | WSV* | HCE | Proposed
小公母儿俩一进屋儿,屋儿里又多了两个人。
晚上,赏三两小醑酒,又把客人吃剩的汤菜做成杂烩,送到砦四海的窝棚。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
寥花儿打累了,也把胆怯的心给打没了,拥被坐着,喘着粗气。
马嵬坡下草青青,今日犹存妃子陵。
郭二坏一眼瞥见余为农,行色匆匆地顺着二道街往前奔。

Ablation Study

Investigation on global-level style

Target Chinese Text | Proposed | Without global-level style | GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
小公母儿俩一进屋儿,屋儿里又多了两个人。
西施留在了汪家,桃儿才体会到了什么叫汪大奶奶。
小施主,关老爷一生最重一个义字。
乌雅氏和勾秀云早已经捷足先登了。

Investigation on multi-scale framework

Target Chinese Text | Proposed | Without multi-scale framework | GT
终于有人跳下了炕,明保脑瓜皮酥了一下。
他使劲儿拍了拍穆隆阿,又使劲儿拍了拍六格,骂了一句。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
必须用你的目光逼退鹰眼射出的寒光。

Investigation on residual style embedding

Target Chinese Text | Proposed | Without residual style embedding | GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
勾秀云嘴上缺个把门儿的,她调笑四海。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
打开食盒,里面儿是血肠儿白肉、大馅儿包子,还有一葫芦酒。

Case Study

To explore the impact of multi-scale speaking style on the expressiveness and naturalness of synthesized speech, we conduct a case study: an example utterance from the test set is synthesized with HCE and with the proposed model, and the ground-truth speech is provided for reference.

Model | Target Chinese Text | Audio | Mel-Spectrogram
HCE | 明保听了大为高兴啊。
GT | 明保听了大为高兴啊。
Proposed | 明保听了大为高兴啊。