HILvoice: Human-in-the-Loop Style Selection for Elder-Facing Speech Synthesis

Published at ISCSLP 2022

Abstract

Controllable speech synthesis has made great progress over the last decades. State-of-the-art systems provide flexible interfaces for configuring the style of generated speech for a targeted audience. However, for a specific audience group, e.g., older adults, how to select the style favored by the group using the available configuration interfaces still needs to be investigated. Such style selection raises two main questions: (i) how to provide various options for the targeted audience to choose from; and (ii) how to effectively obtain opinions from the targeted audience. Because these two questions are highly correlated and therefore difficult to solve separately, we propose a holistic framework that addresses them together by involving the targeted audience in an iterative loop. Experimental results demonstrate that the proposed framework can successfully select a better speaker style for older adults than the neural default setting. Analysis shows that the selected style has a slower speaking rate, which is consistent with the results of previous auditory perception studies.

Proposed HILvoice Framework

Preference Test Samples

No. Cantonese text λ0 λ2(explicit) λ2(implicit)
1 冇大块横条石压镇,墓碑可以讲系一挖就倒。
2 例如宴會、音樂演奏會、受勳儀式咁。
3 希望为啲新书宣传速销。
4 包括蝶式、背泳、蛙泳式、自由式四種泳式。
5 劉千石響立法會政改方案投下棄權票。
6 呢的古本比較現在呢三個流通版本多咗或小咗一的經卷。
7 佢確立揸住朱印狀者先准做貿易嘅朱印船制度。
8 荣华东街、旧豆栏、拱日门、鸡栏之庙;
9 亦有啲卡巴係以薯條取代土耳其麵包當餸嚟食嘅;
10 透過唔同心法令敵人勁氣落空又或將之引乜,
11 小葉嚟到坐咗喺劃畫老師後面個位預備郁手。
12 家下拍烏蠅未必要拍死佢咁暴力。
13 严防极端分子威胁负责选务官员安全。
14 计划嘅路线分为三大部份。
15 水裏便嘅鹽就搞到鋼筋好快生鏽變弱。

Case Study

Cantonese text: 水裏便嘅鹽就搞到鋼筋好快生鏽變弱。

λ Audio Mel-spectrogram
λ2(explicit)
λ0
λ2(implicit)