Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

Submitted to Interspeech 2022.

Hits

Yixuan Zhou*, Changhe Song*, Jingbei Li, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

* Equal contribution.

Abstract

Exploiting rich linguistic information in raw text is crucial for expressive text-to-speech (TTS). As large scale pre-trained text representation develops, bidirectional encoder representations from Transformers (BERT) has been proven to embody semantic information and employed to TTS recently. However, original or simply fine-tuned BERT embeddings still cannot provide sufficient semantic knowledge that expressive TTS models should take into account. In this paper, we propose a word-level semantic representation enhancing method based on dependency structure and pre-trained BERT embedding. The BERT embedding of each word is reprocessed considering its specific dependencies and related words in the sentence, to generate more effective semantic representation for TTS. To better utilize the dependency structure, relational gated graph network (RGGN) is introduced to make semantic information flow and aggregate through the dependency structure. The experimental results show that the proposed method can further improve the naturalness and expressiveness of synthesized speeches on both Mandarin and English datasets.


Fig.1: The structure of the proposed model.

Subjective Evaluation

To present the generality of the proposed method, we train and evaluate on both DataBaker (Mandarin) and LJSpeech (English) datasets. Our proposed method is denoted as BERT-Dep(RGGN) and all the models are implemented based on phoneme-input Tacotron 2, which are described in detail in the paper. The sentences are selected randomly from the test set and Internet, thus some samples do not have the corresponding ground-truth speech. Besides, translations of Chinese texts are given in parentheses following.

For DataBaker (Mandarin)

ID Text Vanilla BERT BERT-Dep(BLSTM) BERT-Dep(RGGN) GT
DB-000443 古语云,不患寡而患不均。(As the old saying goes, do not suffer from a few but suffer from unevenness.)
DB-000815 香米煮熟后软滑爽口,米饭凉后米粒不发硬。(The fragrant rice is soft, smooth and refreshing after being cooked, and the rice grains are not hard after the rice is cooled.)
DB-001097 网友留言时称,书记未免也太过于奔放了吧?(Netizens said in a message that the secretary was too unrestrained, right?)
DB-001706 夏季是筹款淡季,原因是不少捐款者往往在度假。(Summer is a low season for fundraising, as many donors tend to be on vacation.)
DB-008399 那我就给你亮点真本事吧。(Then let me show you some real strength.)
Internet-1 千山鸟飞绝,万径人踪灭,孤舟蓑笠翁,独钓寒江雪。(From hill to hill no bird in flight, from path to path no man in sight. Alonely fisherman afloat, is fishing snow in lonely boat.)
Internet-2 姑姑也想过过过过儿过过的生活。(Gugu also had wanted to live the life that Guoer had lived.)
Internet-3 人的一生应当这样度过,当他回首往事时,不会因为碌碌无为,虚度年华而悔恨,也不会因为为人卑劣,生活庸俗而愧疚。(He must live it so as to feel no torturing regrets for wasted years, never know the burning shame of a mean and petty past.)

For LJSpeech (English)

For LJSpeech dataset, the overall sound quality of the synthesized speeches is slightly worse than DataBaker. Please focus more on expressiveness and prosody rather than sound quality, much thanks.

ID Text Vanilla BERT BERT-Dep(BLSTM) BERT-Dep(RGGN) GT
LJ001-0183 Therefore, granted well-designed type, due spacing of the lines and words, and proper position of the page on the paper,
LJ003-0282 Many years were to elapse before these objections should be fairly met and universally overcome.
LJ010-0188 Oxford expressed little anxiety or concern.
LJ048-0017 We satisfied ourselves that we had met our requirement, namely to find out whether he had been recruited by soviet intelligence. The case was closed.
Internet-1 Death is just a part of life, something we are all destined to do.

Ablation Study

Three ablation studies are conducted by removing RGGN_fwd, RGGN_rev, edge labels, respectively in the proposed method.

For DataBaker (Mandarin)

ID Text Proposed - RGGN_fwd - RGGN_rev - Edge labels GT
DB-000622 副首相克莱格同样为王室游艇建造提议泼冷水。(Deputy Prime Minister Clegg also poured cold water on the proposal of building a royal yacht.)
DB-000904 张韶涵作势与比萨斜塔拥抱。(Zhang Shaohan embraced the Leaning Tower of Pisa.)
DB-001129 牛炯正好让学生试写一篇小作文,周琦向他借本古汉语字典。(Niu Jiong just asked the students to try to write a small composition, and Zhou Qi borrowed an ancient Chinese dictionary from him.)
Internet-2 姑姑也想过过过过儿过过的生活。(Gugu also had wanted to live the life that Guoer had lived.)
Internet-4 黑化肥发灰会挥发,灰化肥挥发会发黑。(The black fertilizer will volatilize when it turns gray, and the gray fertilizer will turn black when it volatilizes.)

For LJSpeech (English)

For LJSpeech dataset, there exist many bad cases in - RGGN_fwd, resulting in a very low CMOS in Table 2 of the paper.

ID Text Proposed - RGGN_fwd - RGGN_rev - Edge labels GT
LJ003-0282 Many years were to elapse before these objections should be fairly met and universally overcome.
LJ010-0188 Oxford expressed little anxiety or concern.
LJ048-0017 We satisfied ourselves that we had met our requirement, namely to find out whether he had been recruited by soviet intelligence. The case was closed.
Internet-2 To be, or not to be: that is the question.

Case Study

For the sentence “To be, or not to be: that is the question.”, the speeches & mel-spectrograms generated by different methods are provided, and the dependency tree of this sentence is as follows:

Method Synthesized Speech Mel-Spectrogram
BERT
BERT-Dep(BLSTM)
BERT-Dep(RGGN)