Towards Cross-speaker Reading Style Transfer on Audiobook Dataset
Accepted to INTERSPEECH 2022
Abstract
Cross-speaker style transfer aims to extract the speech style of a given reference speech and reproduce it in the timbre of arbitrary target speakers. Existing methods on this topic have explored utilizing utterance-level style labels to perform style transfer via either global- or local-scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Thus, properly transferring the reading style across different speakers remains a challenging task. This paper introduces a chunk-wise multi-speaker multi-scale style model to capture both the global genre and the local prosody in audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experimental results confirm that the model manages to transfer a given reading style to new target speakers. With the support of the local prosody and genre type predictor, the potential of the proposed method in multi-speaker audiobook generation is further demonstrated.
Cross-speaker reading style transfer
We evaluate the proposed cross-speaker reading style transfer method on the following disjoint datasets:
MST-Originbeat: A neutral Mandarin corpus from the ICASSP 2021 M2VoC challenge [1], with one female and one male speaker.
DB: A private neutral Mandarin dataset with 10,000 utterances from another female Chinese speaker, named DB6.
Audiobook_FM: A private Mandarin audiobook dataset with 8 speakers and 2 topic types (fairy tale / Chinese martial arts fiction). One female and one male speaker cover both topics, while each of the other 6 speakers covers only one of them. One of these 6 speakers is DB6, who reads only the fairy tale documents.
The proposed cross-speaker reading style transfer method aims to generate speech in the timbre of the given target speaker, while preserving both the local prosody and the global genre of the reference speech from the audiobook dataset. The cross-speaker reading style transfer results of both the baseline model and the proposed model are provided for comparison.
(Baseline: an embedding-table-based method in which the 2 global branches of the multi-scale style model are replaced with a speaker embedding table and a global genre embedding table. It is similar to [2], except for the additional global genre embedding table added to accommodate the audiobook dataset.)
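For clarity, a minimal sketch of how such an embedding-table-based baseline can be wired is shown below. It is an illustration under assumed module names and dimensions (EmbeddingTableBaseline, style_dim, encoder_out), not the exact implementation from the paper: the two global style branches are simply replaced by learned lookup tables for speaker and genre.

```python
# Illustrative sketch of the embedding-table-based baseline (assumed names/dims).
import torch
import torch.nn as nn

class EmbeddingTableBaseline(nn.Module):
    def __init__(self, n_speakers: int, n_genres: int, style_dim: int = 256):
        super().__init__()
        # Lookup tables replacing the two global branches of the multi-scale style model.
        self.speaker_table = nn.Embedding(n_speakers, style_dim)  # speaker timbre
        self.genre_table = nn.Embedding(n_genres, style_dim)      # global genre

    def forward(self, encoder_out, speaker_id, genre_id):
        # encoder_out: [B, T, style_dim] phoneme-level encoder outputs.
        spk = self.speaker_table(speaker_id).unsqueeze(1)  # [B, 1, style_dim]
        gen = self.genre_table(genre_id).unsqueeze(1)      # [B, 1, style_dim]
        # Broadcast the two global embeddings over the time axis.
        return encoder_out + spk + gen
```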
Fairy tales
Reference Speech
Target Speaker
Baseline
Proposed
Martial arts fiction
Reference Speech
Target Speaker
Baseline
Proposed
Ablation study
In order to reveal the functionality of each component of the proposed method, ablation experiments with 3 different settings are conducted:
w/o GSE.style: remove the global genre branch in the style model, leaving the timbre branch as the only global module;
w/o Chunk: replace the chunk-wise GSE extraction method with an ordinary utterance-wise approach (see the sketch after this list);
w/o SAC: employ vanilla adversarial classifiers to disentangle GSE vectors of different branches, instead of the proposed switchable adversarial classifiers (SAC).
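As a rough illustration of the "Chunk" setting referenced above, the sketch below contrasts utterance-wise and chunk-wise global style embedding (GSE) extraction. The reference encoder, the chunk length, and the averaging of chunk-level GSEs are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative contrast between utterance-wise and chunk-wise GSE extraction
# (ref_encoder, chunk_len and the averaging strategy are assumptions).
import torch

def utterance_wise_gse(ref_encoder, mel):
    # mel: [B, T, n_mels]; one GSE from the whole reference utterance.
    return ref_encoder(mel)                               # [B, style_dim]

def chunk_wise_gse(ref_encoder, mel, chunk_len=200):
    # Split the reference mel-spectrogram into fixed-length chunks (the last
    # chunk may be shorter), extract a GSE per chunk, then average them.
    chunks = torch.split(mel, chunk_len, dim=1)
    gse_per_chunk = [ref_encoder(c) for c in chunks]      # each [B, style_dim]
    return torch.stack(gse_per_chunk, dim=0).mean(dim=0)  # [B, style_dim]
```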
Fairy tales
Reference Speech
Target Speaker
Proposed
w/o GSE.style
w/o Chunk
w/o SAC
Martial arts fiction
Reference Speech
Target Speaker
Proposed
w/o GSE.style
w/o Chunk
w/o SAC
Automatic audiobook generation
Based on the proposed cross-speaker reading style transfer model, an automatic audiobook generation system is constructed by incorporating an RNN-based text analysis model, which predicts the LPE and genre from the BERT token embeddings and the phoneme embedding sequence of the given book content (similar to [2,3]).
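A hedged sketch of such a text analysis predictor is given below. The use of a bidirectional GRU, the layer sizes, and the way the BERT and phoneme embeddings are fused are assumptions made for illustration only.

```python
# Illustrative RNN-based text analysis predictor: maps BERT token embeddings
# and phoneme embeddings to per-phoneme LPE and an utterance-level genre label
# (architecture details are assumptions).
import torch
import torch.nn as nn

class TextAnalysisPredictor(nn.Module):
    def __init__(self, bert_dim=768, phone_dim=256, hidden=256, lpe_dim=32, n_genres=2):
        super().__init__()
        self.rnn = nn.GRU(bert_dim + phone_dim, hidden,
                          batch_first=True, bidirectional=True)
        self.lpe_head = nn.Linear(2 * hidden, lpe_dim)     # local prosody embedding
        self.genre_head = nn.Linear(2 * hidden, n_genres)  # global genre logits

    def forward(self, bert_emb, phone_emb):
        # bert_emb, phone_emb: [B, T, dim], aligned to the phoneme sequence.
        h, _ = self.rnn(torch.cat([bert_emb, phone_emb], dim=-1))
        lpe = self.lpe_head(h)                         # [B, T, lpe_dim]
        genre_logits = self.genre_head(h.mean(dim=1))  # [B, n_genres]
        return lpe, genre_logits
```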
The predicted LPE and genre are generalizable to different speakers in the corpus, regardless of whether the genre of the given book content is included in the training data of the speaker.
According to the predicted genre label and the identity of the desired speaker, the GSE vectors on each branch can be obtained by choosing the averaged GSE vectors over the training data of the target genre/speaker. Together with the predicted LPE and the text sequence, speech of the target speaker reading the material in the predicted style is eventually generated.
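A minimal sketch of this averaged-GSE selection step is given below; the table structures and function names are hypothetical and only illustrate the idea of reusing GSE statistics from the training data at inference time.

```python
# Hypothetical sketch: select pre-computed averaged GSE vectors for the target
# speaker (timbre branch) and the predicted genre (genre branch).
import torch

def average_gse(gse_vectors):
    # gse_vectors: [N, style_dim] GSEs extracted from one speaker's or
    # one genre's training utterances.
    return gse_vectors.mean(dim=0)

def select_global_style(speaker_gse_table, genre_gse_table, speaker_id, genre_id):
    # Each table maps an identity to the averaged GSE of its training data.
    timbre_gse = speaker_gse_table[speaker_id]
    genre_gse = genre_gse_table[genre_id]
    return timbre_gse, genre_gse
```

Averaging over the training data in this way avoids requiring a reference utterance at inference time.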
The inference results of the audiobook generation system on out-of-set data are provided here, including both speakers whose training data contain the genre of the given script (Seen) and speakers whose training data do not (Unseen).
References
[1] Q. Xie, X. Tian, G. Liu, K. Song, L. Xie, Z. Wu, H. Li, S. Shi, H. Li, F. Hong, H. Bu, and X. Xu, "The multi-speaker multi-style voice cloning challenge 2021," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 8613–8617.
[2] Q. Xie, T. Li, X. Wang, Z. Wang, L. Xie, G. Yu, and G. Wan, "Multi-speaker multi-style text-to-speech synthesis with single-speaker single-style training data scenarios," arXiv preprint arXiv:2112.12743, 2021.
[3] Z. Hodari, A. Moinet, S. Karlapati, J. Lorenzo-Trueba, T. Merritt, A. Joly, A. Abbas, P. Karanasou, and T. Drugman, "CAMP: A two-stage approach to modelling prosody in context," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6578–6582.