CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis

Accepted by Interspeech 2022

Subjective Evaluation

We use two datasets to demonstrate that the proposed model synthesizes speech whose style is more appropriate to the input text. The first is a Mandarin emotional corpus; the second is an English audiobook dataset. Some samples are provided for comparison. GT denotes the ground truth, MRTTS is the baseline model we compare against, and Proposed is the proposed TTS system with a Contrastive Acoustic-Linguistic Module (CALM), which is described in detail in the paper.

Mandarin Emotional Corpus

The Mandarin emotional corpus covers six emotion categories (angry, fear, disgust, happy, sad, surprised). Note that the emotion labels are not used as supervision in our experiments.

Text	GT	MRTTS	Proposed
已经感受到了自己的无比强大了，哈哈！（喜, happy）
他不顾别人怎么议论，竟然扬长而去。（惊, surprised）
天呐，怎么办，她的手里有刀！（恐, fear）
永和豆浆，你不光垃圾还很恶心。（厌, disgust）
我看到正义无法伸张，大家无不感到痛心疾首。（哀, sad）
你不让我好好活着，我也不会让你好过的！（怒, angry）

English Audiobook Dataset

The second dataset is open-source English audiobook data. The books are read by the 2013 Blizzard Challenge speaker, Catherine Byers.

Text	GT	MRTTS	Proposed
But when it came to breaking in, that was a bad time for me.
Ginger and I had become fast friends, and now I missed her company extremely.
I never saw a man so pleased.
My friend, Missus Fraser is mad for such a house, and it would not make me miserable.
I got up.