Towards Cross-speaker Reading Style Transfer on Audiobook Dataset

Accepted to INTERSPEECH 2022



Cross-speaker style transfer aims to extract the speech style of a given reference speech so that it can be reproduced in the timbre of arbitrary target speakers. Existing methods have used utterance-level style labels to transfer style via either global-scale or local-scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Thus, properly transferring the reading style across different speakers remains a challenging task. This paper introduces a chunk-wise multi-speaker multi-scale style model to capture both the global genre and the local prosody in audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experimental results confirm that the model manages to transfer a given reading style to new target speakers. With the support of a local prosody and genre type predictor, the potential of the proposed method for multi-speaker audiobook generation is further demonstrated.

Cross-speaker reading style transfer

We evaluate the proposed cross-speaker reading style transfer method on the following disjoint datasets:

The proposed cross-speaker reading style transfer method aims to generate speech in the timbre of the given target speaker while preserving both the local prosody and the global genre of the reference speech from the audiobook dataset. The cross-speaker reading style transfer results of both the baseline model and the proposed model are provided for comparison. (Baseline: an embedding-table-based method in which the 2 global branches of the multi-scale style model are replaced with a speaker embedding table and a global genre embedding table. It is similar to [2], except that the global genre embedding table is added to accommodate the audiobook dataset.)

Fairy tales

Reference Speech Target Speaker Baseline Proposed

Martial arts fiction

Reference Speech Target Speaker Baseline Proposed

Ablation study

In order to reveal the functionality of each component of the proposed method, 3 different settings of ablation experiments are conducted:

Fairy tales

Reference Speech Target Speaker Proposed w/o w/o Chunk w/o SAC

Martial arts fiction

Reference Speech Target Speaker Proposed w/o w/o Chunk w/o SAC

Automatic audiobook generation

Based on the proposed cross-speaker reading style transfer model, an automatic audiobook generation system is constructed by incorporating an RNN-based text analysis model, which predicts the LPE and genre according to the BERT token embedding of phoneme embedding sequence of the given book content (similar to [2,3]). The predicted LPE and genre are generalizable to different speakers in the corpus, regardless of whether the genre of the given book content is included in the training data of the speaker. According to the predicted genre label and the identity of the desired speaker, the GSE vector on each branch can be obtained by choosing the GSE vector averaged over the training data of the target genre or speaker. Together with the predicted LPE and the text sequence, these are used to generate speech of the target speaker reading the material in the predicted style.
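The averaged-GSE selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tuple layout of the training items, the toy 2-dimensional vectors, and the choice to pool the genre average over every speaker who recorded that genre are all assumptions made for illustration.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_average_tables(training_items):
    """Average the GSE vectors of each global branch over the training data.

    training_items: iterable of (speaker_id, genre, speaker_gse, genre_gse),
    a hypothetical layout with one GSE vector per global branch.
    """
    by_speaker = defaultdict(list)
    by_genre = defaultdict(list)
    for speaker_id, genre, speaker_gse, genre_gse in training_items:
        by_speaker[speaker_id].append(speaker_gse)
        # Assumption: the genre average is pooled over all speakers.
        by_genre[genre].append(genre_gse)
    speaker_table = {k: mean_vector(v) for k, v in by_speaker.items()}
    genre_table = {k: mean_vector(v) for k, v in by_genre.items()}
    return speaker_table, genre_table


def select_global_embeddings(speaker_table, genre_table,
                             target_speaker, predicted_genre):
    """Pick the averaged GSE of each branch for the requested pair."""
    return speaker_table[target_speaker], genre_table[predicted_genre]


# Toy data: "spk22" has no fairy-tale recordings (the "Unseen" case).
training_items = [
    ("spk02", "fairy_tale", [1.0, 0.0], [0.0, 2.0]),
    ("spk02", "fairy_tale", [3.0, 0.0], [0.0, 4.0]),
    ("spk22", "martial_arts", [5.0, 1.0], [8.0, 0.0]),
]
speaker_table, genre_table = build_average_tables(training_items)
speaker_gse, genre_gse = select_global_embeddings(
    speaker_table, genre_table, "spk22", "fairy_tale"
)
```

Because the two tables are indexed independently in this sketch, a speaker can be paired with a genre that never appears in that speaker's own training data, which corresponds to the Unseen condition in the samples.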

The inference results of the audiobook generation system on out-of-set data are provided here, covering both speakers whose training data include the genre of the given script (Seen) and speakers whose training data do not (Unseen).

Fairy tales

Script Speaker 02 (Seen) Speaker 22 (Unseen)

Martial arts fiction

Script Speaker 04 (Seen) Speaker 01 (Unseen)


  1. Q. Xie, X. Tian, G. Liu, K. Song, L. Xie, Z. Wu, H. Li, S. Shi, H. Li, F. Hong, H. Bu, and X. Xu, “The multi-speaker multi-style voice cloning challenge 2021,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 8613–8617.
  2. Q. Xie, T. Li, X. Wang, Z. Wang, L. Xie, G. Yu, and G. Wan, “Multi-speaker multi-style text-to-speech synthesis with single-speaker single-style training data scenarios,” arXiv preprint arXiv:2112.12743, 2021.
  3. Z. Hodari, A. Moinet, S. Karlapati, J. Lorenzo-Trueba, T. Merritt, A. Joly, A. Abbas, P. Karanasou, and T. Drugman, “Camp: a two-stage approach to modelling prosody in context,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6578–6582.