TOWARDS MULTI-SCALE STYLE CONTROL FOR EXPRESSIVE SPEECH SYNTHESIS

submitted to INTERSPEECH 2021

Abstract

This paper introduces a multi-scale speech style modeling method for end-to-end expressive speech synthesis. The proposed method employs a multi-scale reference encoder to extract both global-scale utterance-level and local-scale quasi-phoneme-level style features from the target speech, which are then fed into the speech synthesis model as an extension of the input phoneme sequence. During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion. Experiments on a style transfer task show that the proposed method greatly improves both the controllability of the multi-scale speech style model and the expressiveness of the synthesized speech. Moreover, assigning a different reference speech to the style extraction at each scale further demonstrates the flexibility of the proposed method.
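As a rough illustration of the conditioning described above, the following PyTorch-style sketch shows how a single utterance-level style vector (GSE) and a per-phoneme prosody sequence (LPE) could extend the phoneme encoder outputs. All module and tensor names here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class MultiScaleConditioner(nn.Module):
    """Toy module: extends phoneme encoder outputs with both style codes."""

    def __init__(self, enc_dim: int, gse_dim: int, lpe_dim: int):
        super().__init__()
        # Project each style code into the text-encoder dimension.
        self.gse_proj = nn.Linear(gse_dim, enc_dim)
        self.lpe_proj = nn.Linear(lpe_dim, enc_dim)

    def forward(self, text_enc, gse, lpe):
        # text_enc: (B, T_phone, enc_dim)  phoneme encoder outputs
        # gse:      (B, gse_dim)           one utterance-level style vector
        # lpe:      (B, T_phone, lpe_dim)  one prosody vector per phoneme
        #                                  (aligned, e.g., by reference attention)
        g = self.gse_proj(gse).unsqueeze(1)   # broadcast over all phonemes
        l = self.lpe_proj(lpe)                # already time-aligned
        return text_enc + g + l               # style-extended encoder output
```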

Multi-scale Style Transfer

To demonstrate the influence of the global-scale style embedding (GSE) and the local-scale prosody embedding (LPE) in our multi-scale speech style modeling method, parallel style transfer results from Base-G (global scale only), Base-L (local scale only), and the proposed model are provided here for comparison; a toy illustration of the three conditioning configurations follows.
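The snippet below only illustrates what each system conditions on; the shapes and additive combination are assumptions, and the actual Base-G and Base-L baselines are trained without the other branch entirely.

```python
import torch

B, T, D = 2, 20, 256             # batch size, phoneme count, model dim
text_enc = torch.randn(B, T, D)  # phoneme encoder outputs
gse = torch.randn(B, D)          # utterance-level style code (global)
lpe = torch.randn(B, T, D)       # per-phoneme prosody codes (local)

base_g   = text_enc + gse.unsqueeze(1)        # Base-G: global scale only
base_l   = text_enc + lpe                     # Base-L: local scale only
proposed = text_enc + gse.unsqueeze(1) + lpe  # Proposed: both scales
```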

All seven rows below share the same Chinese text: 但是如今大部分的跨洋货运还是靠轮船 ("But nowadays, most transoceanic freight still relies on ships").

Reference Audio | Emotion | Synthesized (Base-L) | Synthesized (Base-G) | Synthesized (Proposed)
[audio] | Neutral | [audio] | [audio] | [audio]
[audio] | Angry | [audio] | [audio] | [audio]
[audio] | Fear | [audio] | [audio] | [audio]
[audio] | Disgust | [audio] | [audio] | [audio]
[audio] | Happy | [audio] | [audio] | [audio]
[audio] | Sad | [audio] | [audio] | [audio]
[audio] | Surprised | [audio] | [audio] | [audio]

Multi-reference Style Transfer

This part presents the multi-reference style transfer results of the proposed model, which is given two different reference audios each time: one for the global-scale and one for the local-scale style transfer. The two reference audios can differ in global emotion style, text content, and local prosody style. The synthesized speech, which shares its text with the local-scale reference audio, tends to reproduce the global emotion style of the global-scale reference while preserving the local prosody style of the local-scale reference. An illustrative inference call is sketched below.
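The following sketch shows what such a multi-reference inference call could look like. The names `model`, `global_encoder`, `local_encoder`, `conditioner`, and the function itself are all assumptions for illustration, not the paper's API.

```python
import torch

@torch.no_grad()
def multi_reference_synthesis(model, phonemes, mel_global, mel_local):
    # Utterance-level emotion comes from the global-scale reference mel...
    gse = model.global_encoder(mel_global)          # (B, gse_dim)
    # ...while per-phoneme prosody comes from the local-scale reference,
    # whose text matches `phonemes`, so reference attention can align it.
    lpe = model.local_encoder(mel_local, phonemes)  # (B, T_phone, lpe_dim)
    text_enc = model.text_encoder(phonemes)         # (B, T_phone, enc_dim)
    # Condition the decoder on the style-extended encoder outputs.
    return model.decoder(model.conditioner(text_enc, gse, lpe))
```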

Chinese texts used below: 但是如今大部分的跨洋货运还是靠轮船 ("But nowadays, most transoceanic freight still relies on ships"); 北京周三限行尾号为二和七 ("In Beijing on Wednesday, license plates ending in 2 and 7 are restricted"); 哈尔滨明天北风三到四级 ("Tomorrow Harbin will have a north wind of force 3 to 4").

Reference Audio (Global) | Emotion (Global) | Chinese Text (Global) | Reference Audio (Local) | Emotion (Local) | Chinese Text (Local) | Synthesized
[audio] | Neutral | 但是如今大部分的跨洋货运还是靠轮船 | [audio] | Sad | 北京周三限行尾号为二和七 | [audio]
[audio] | Neutral | 但是如今大部分的跨洋货运还是靠轮船 | [audio] | Angry | 北京周三限行尾号为二和七 | [audio]
[audio] | Sad | 北京周三限行尾号为二和七 | [audio] | Happy | 但是如今大部分的跨洋货运还是靠轮船 | [audio]
[audio] | Sad | 北京周三限行尾号为二和七 | [audio] | Surprised | 但是如今大部分的跨洋货运还是靠轮船 | [audio]
[audio] | Happy | 但是如今大部分的跨洋货运还是靠轮船 | [audio] | Fear | 哈尔滨明天北风三到四级 | [audio]
[audio] | Happy | 但是如今大部分的跨洋货运还是靠轮船 | [audio] | Disgust | 哈尔滨明天北风三到四级 | [audio]

Quasi-phoneme Scale vs. Frame Scale

To show that downsampling the fine-grained prosody sequence enhances the robustness of the reference attention, we present parallel style transfer results from multi-scale style models operating at the proposed quasi-phoneme scale (Proposed) and at the frame scale (Base-FS). From left to right, each diagram shows the spectrograms of the reference and synthesized audio, followed by the alignments of the decoder attention and the reference attention. A sketch of the downsampling idea is given below.
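A minimal sketch of frame-to-quasi-phoneme downsampling, assuming a stack of stride-2 1-D convolutions that reduces the frame sequence to roughly phoneme rate; the paper's exact layer settings may differ.

```python
import torch
import torch.nn as nn

class QuasiPhonemeDownsampler(nn.Module):
    """Toy stack of stride-2 convs that shortens a frame-level sequence."""

    def __init__(self, dim: int, num_layers: int = 3):
        super().__init__()
        # Each stride-2 conv roughly halves the time axis; three layers
        # give an ~8x reduction, so the reference attention has far fewer
        # positions to align over, which stabilizes the alignment.
        self.layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
            )
            for _ in range(num_layers)
        ])

    def forward(self, frames):          # frames: (B, T_frame, dim)
        x = frames.transpose(1, 2)      # Conv1d expects (B, dim, T)
        x = self.layers(x)
        return x.transpose(1, 2)        # (B, ~T_frame / 8, dim)
```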

Sample 1

Reference Audio

Audio | Chinese Text | Emotion
[audio] | 我顺道来玩一玩好吗?我会学小猫叫哦,喵 ("May I drop by to play for a while? I can imitate a kitten's meow. Meow!") | Sad

Result

System | Plot | Synthesized Audio
Base-FS | [image: qvsf_basefs_case1] | [audio]
Proposed | [image: qvsf_proposed_case1] | [audio]

Sample 2

Reference Audio

Audio | Chinese Text | Emotion
[audio] | 我顺道来玩一玩好吗?我会学小猫叫哦,喵 ("May I drop by to play for a while? I can imitate a kitten's meow. Meow!") | Surprised

Result

System | Plot | Synthesized Audio
Base-FS | [image: qvsf_basefs_case2] | [audio]
Proposed | [image: qvsf_proposed_case2] | [audio]

Sample 3

Reference Audio

Audio | Chinese Text | Emotion
[audio] | 你猜我猜你猜不猜,今天算了,去玩吧 ("Guess whether I guess you'll guess or not; never mind for today, let's go play.") | Angry

Result

System | Plot | Synthesized Audio
Base-FS | [image: qvsf_basefs_case3] | [audio]
Proposed | [image: qvsf_proposed_case3] | [audio]

Sample 4

Reference Audio

Audio | Chinese Text | Emotion
[audio] | 同学,要我帮你培养正确三观吗? ("Classmate, do you want me to help you develop the right outlook on life?") | Fear

Result

System | Plot | Synthesized Audio
Base-FS | [image: qvsf_basefs_case4] | [audio]
Proposed | [image: qvsf_proposed_case4] | [audio]

Sample 5

Reference Audio

Audio | Chinese Text | Emotion
[audio] | 同学,要我帮你培养正确三观吗? ("Classmate, do you want me to help you develop the right outlook on life?") | Surprised

Result

System | Plot | Synthesized Audio
Base-FS | [image: qvsf_basefs_case5] | [audio]
Proposed | [image: qvsf_proposed_case5] | [audio]