TOWARDS EXPRESSIVE SPEAKING STYLE MODELLING WITH HIERARCHICAL CONTEXT INFORMATION FOR MANDARIN SPEECH SYNTHESIS

Submitted to ICASSP 2022.


Abstract

Previous works on expressive speech synthesis mainly focus on the current sentence. The context in adjacent sentences is neglected, resulting in an inflexible speaking style for the same text that lacks speech variation. In this paper, we propose a hierarchical framework to model speaking style from context. A hierarchical context encoder is proposed to exploit a wider range of contextual information while considering the structural relationships in the context, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn a better style representation, we introduce a novel training strategy with knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

Subjective Evaluation

To demonstrate that the proposed model can significantly improve the naturalness and expressiveness of the synthesized speech, we provide some samples for comparison. GT denotes the ground truth. FastSpeech 2 denotes the original FastSpeech 2 with several changes made for consistency with the proposed model, and XLNET-FastSpeech 2 denotes FastSpeech 2 with a plain context encoder that does not use inter-sentence relations. These baselines are described in detail in the paper. In addition, a well-trained HiFi-GAN is used as the vocoder to generate waveforms.

Target Chinese Text GT FastSpeech 2 XLNET-FastSpeech 2 Proposed
那学着学着学着,是不是有一种望洋兴叹的感觉?
而恰恰是因为学的深,反而有可能过不了。
这两种观点在今后我们所有的刑法问题中大家都会看到它们的分析。
从实质上来看,冒充军警人员抢劫都要判十年以上,那真警察抢劫是不更应该处十年以上了,是不这意思。
在二审期间那么就应该撤销案件啊宣判无罪。
但是我希望大家采取阶层论也要尊重四要件。

Ablation Study

Investigation on knowledge distillation training strategy

Target Chinese Text Proposed without knowledge distillation
那学着学着学着,是不是有一种望洋兴叹的感觉?
当然可以,只是从人道主义考虑因为他在日本已经受过刑。
这两种观点在今后我们所有的刑法问题中大家都会看到它们的分析。
但是我希望大家采取阶层论也要尊重四要件。

Investigation on hierarchical context encoder

Target Chinese Text Proposed without hierarchical context encoder
啊,这个成绩是什么呢?客观题。
它其实属于第几款呢?
叫余平故意泄露国家秘密罪。
但现在同学们学了刑法。
我们不要对自己抱以太高的期望。

Case Study

To explore the impact of contextual information on the expressiveness of synthesized speech, we conduct a case study that synthesizes the same utterance with three different contexts: i) the ground-truth context (original context); ii) four randomly selected sentences together with the utterance itself (irrelevant context); iii) the current sentence only (no context).
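
The three context conditions can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `build_context`, the symmetric window of two sentences on each side for the original context, and the placement of the target in the middle of the irrelevant context are all assumptions made for illustration. The page itself only states that the irrelevant context consists of four random sentences plus the utterance.

```python
import random


def build_context(sentences, idx, mode, n_neighbors=4, seed=0):
    """Return the sentences used as context for sentences[idx].

    Hypothetical helper; window size and ordering are assumptions.
    """
    target = sentences[idx]
    half = n_neighbors // 2
    if mode == "original":
        # Assumption: ground-truth context is the surrounding window,
        # clipped at the start and end of the document.
        start = max(0, idx - half)
        return sentences[start:idx + half + 1]
    if mode == "irrelevant":
        # Four sentences drawn at random from the rest of the corpus,
        # with the target kept in the middle (ordering is an assumption).
        others = [s for i, s in enumerate(sentences) if i != idx]
        picked = random.Random(seed).sample(others, n_neighbors)
        return picked[:half] + [target] + picked[half:]
    if mode == "none":
        # Current sentence only.
        return [target]
    raise ValueError(f"unknown mode: {mode}")


sentences = [f"s{i}" for i in range(10)]
original = build_context(sentences, 5, "original")
irrelevant = build_context(sentences, 5, "irrelevant")
no_context = build_context(sentences, 5, "none")
```

In this sketch the random draw may occasionally include a true neighbour of the target; a stricter version would exclude the original window before sampling.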

Context Target Chinese Text Audio Mel-Spectrogram
original context 因为大家一定要注意。
irrelevant context 因为大家一定要注意。
no context 因为大家一定要注意。