Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Submitted to ICASSP 2024.


Abstract

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing the speech waveform into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS systems achieve zero-shot speaker adaptation with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone the speaker's personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts based on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences, and a VALL-E-based acoustic decoder then models the timbre at the frame level from a timbre prompt and generates speech. Experimental results show that our proposed method outperforms the baselines in terms of naturalness and speaker similarity, and achieves better performance when scaled out to a longer style prompt.


Fig.1: The architecture of the proposed model.
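To make the figure concrete, the sketch below shows one way the two prompt scales could be wired together: a speaker-aware text encoder attends from the phoneme sequence to the multi-sentence style prompt, and a VALL-E-style acoustic decoder is additionally conditioned on the frame-level timbre prompt. This is an illustrative PyTorch sketch only; all module names, dimensions and layer counts are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SpeakerAwareTextEncoder(nn.Module):
    """Learns phoneme-level speaking style from a multi-sentence style prompt (assumed layout)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, n_layers)
        # cross-attention from phoneme states (queries) to style-prompt tokens (keys/values)
        self.style_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, phoneme_emb, style_prompt_emb):
        # phoneme_emb:      [B, T_phone, d]  embedded phoneme sequence of the target text
        # style_prompt_emb: [B, T_style, d]  embedded tokens of several style-prompt sentences
        h = self.text_encoder(phoneme_emb)
        style, _ = self.style_attention(h, style_prompt_emb, style_prompt_emb)
        return h + style  # phoneme-level representation carrying the speaking style

class MultiScalePromptTTS(nn.Module):
    """Combines the phoneme-level style prompt with the frame-level timbre prompt."""
    def __init__(self, acoustic_decoder, d_model=512):
        super().__init__()
        self.encoder = SpeakerAwareTextEncoder(d_model)
        self.decoder = acoustic_decoder  # VALL-E-style AR + NAR codec language model

    def forward(self, phoneme_emb, style_prompt_emb, timbre_prompt_tokens):
        cond = self.encoder(phoneme_emb, style_prompt_emb)   # phoneme level
        # the decoder is prefixed with the 3-second timbre prompt (frame level)
        return self.decoder(cond, timbre_prompt_tokens)
```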

Subjective Evaluation

To demonstrate the performance of our proposed method, several samples are provided for comparison. GT (Reconstructed) denotes audio reconstructed from the ground-truth speech by the EnCodec model. VALL-E denotes an open-source implementation of VALL-E. Proposed denotes the proposed model, which uses both a 3-second timbre prompt and a style prompt consisting of ten sentences. Proposed-3s denotes the baseline model that shares the same structure and parameters as the proposed model but uses only a 3-second speech clip as both the timbre prompt and the style prompt. Base-S denotes the style-prompt-only baseline, which shares the same TTS backbone and style prompt as the proposed model but excludes the timbre prompt. Base-T denotes the timbre-prompt-only baseline, which shares the same TTS backbone and timbre prompt as the proposed model but excludes the style prompt. In all cases, the pre-trained neural audio codec model EnCodec is used to generate waveforms from the acoustic tokens.
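For reference, below is a minimal sketch of how the GT (Reconstructed) condition can be produced with the publicly released EnCodec model: the ground-truth waveform is encoded into discrete acoustic tokens and then decoded back. The 6 kbps bandwidth setting and the file paths are assumptions.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()  # pre-trained 24 kHz neural codec
model.set_target_bandwidth(6.0)             # assumed setting: 8 codebooks at 6 kbps

wav, sr = torchaudio.load("ground_truth.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale): the discrete acoustic tokens
    recon = model.decode(frames)             # waveform reconstructed from the tokens

torchaudio.save("gt_reconstructed.wav", recon.squeeze(0).cpu(), model.sample_rate)
```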

Target Text Timbre Prompt GT (Reconstructed) VALL-E Proposed Proposed-3s Base-S Base-T
Also, a draft on futurity, sometimes honored, but generally extended.
And yet, what cause was there for anger?
Perhaps the profession of doing good may be full, but every body should be kind at least to himself.
Tie them down with bladder, and in a few days they will be fit for use.
He isn’t fit to hear what’s said here.
He was young; no spear had touched him, no poison lurked in his wine.

Investigation

To investigate the impact of prompts of different lengths, we adjust the length of the acoustic prompt for VALL-E and the length of the style prompt for the proposed model. For VALL-E, which is constrained by the structure of its decoder-only language model, we randomly select two utterances of 3 s and 6 s as the prompts for each speaker. We also evaluate our proposed model with various numbers of sentences as the style prompt: 1, 5, 10 and 20 sentences, where the average sentence duration is about 6 seconds. The timbre prompt is fixed to a 3-second speech clip as mentioned above. In particular, when the style prompt consists of only one sentence, the proposed model uses only a 3-second speech clip as both the timbre prompt and the style prompt.
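The prompt construction described above can be summarized by the sketch below. How the 3-second timbre clip is chosen relative to the style sentences is an assumption here; the helper simply crops it from one randomly selected style sentence.

```python
import random

def crop_seconds(wav, sr, seconds):
    """Keep only the first `seconds` of a 1-D waveform."""
    return wav[: int(sr * seconds)]

def build_prompts(speaker_utts, sr, n_style_sentences, timbre_seconds=3.0, seed=0):
    """speaker_utts: list of 1-D waveforms from one target speaker (each roughly 6 s long)."""
    rng = random.Random(seed)
    style_prompt = rng.sample(speaker_utts, k=n_style_sentences)  # 1 / 5 / 10 / 20 sentences
    # assumption: the timbre prompt is cropped from one of the selected style sentences
    timbre_prompt = crop_seconds(rng.choice(style_prompt), sr, timbre_seconds)
    if n_style_sentences == 1:
        # with a single style sentence, the same 3 s clip serves as both prompts
        style_prompt = [timbre_prompt]
    return style_prompt, timbre_prompt
```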

Target Text Timbre Prompt GT (Reconstructed) VALL-E w/ 3s VALL-E w/ 6s Proposed w/ 1 sent (3s) Proposed w/ 5 sent (30s) Proposed w/ 10 sent (1min) Proposed w/ 20 sent (2min)
Had she not been assisted?
“Take it of course,” says gringo, “take anything that offers, why not?”
Changes many and great followed in bewildering succession in utah.

Comparison with Official VALL-E

In this section, we directly download samples for the LibriSpeech dataset from the official VALL-E demo page and compare them with the speech samples generated by our proposed model. It is noteworthy that the official VALL-E model is trained on the Libri-Light dataset, which contains 60k hours of recordings, while our proposed model is trained only on the LibriSpeech dataset, which contains approximately 580 hours of recordings.

Target Text GT (Reconstructed) Official VALL-E Proposed
The army found the people in poverty, and left them in comparative wealth.
Instead of shoes, the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid.
Number ten, fresh nelly is waiting on you. Good night husband.