Enhancing Monotonicity for Robust Autoregressive Transformer TTS

Xiangyu Liang, Zhiyong Wu, Runnan Li, Yanqing Liu, Sheng Zhao, Helen Meng

Audio demos

Abstract

With the development of sequence-to-sequence modeling algorithms, Text-to-Speech (TTS) techniques have achieved significant improvements in speech quality and naturalness. These deep learning algorithms, such as the recurrent neural network (RNN) and its memory-enhanced variants, have shown a strong ability to reconstruct acoustic features from input linguistic features. However, their efficiency is limited by sequential processing in both training and inference. Recently, the Transformer, with its superior parallelism, has been applied to TTS. It employs positional embeddings instead of a recurrent mechanism for position modeling and significantly boosts training speed. However, this approach lacks a monotonic constraint and suffers from issues such as skipped pronunciations. Therefore, in this paper, we propose a monotonicity-enhancing approach that combines Stepwise Monotonic Attention (SMA) with multi-head attention for Transformer-based TTS systems. Experiments show that the proposed approach reduces the number of bad cases from 53 to 1 in a 500-sentence test, and improves the naturalness MOS from 4.09 to 4.17.
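To make the monotonic constraint concrete, the following is a minimal NumPy sketch of one decoder step of stepwise monotonic attention in its soft (expected-alignment) form, as the technique is commonly described: at each step the alignment either stays on the current encoder position or moves forward by exactly one. The function name `sma_step` and the argument `p_stay` are illustrative assumptions, not names from this page, and the sketch does not show how the paper combines SMA with multi-head attention.

```python
import numpy as np

def sma_step(prev_alpha, p_stay):
    """One decoder step of stepwise monotonic attention (soft form).

    prev_alpha: shape (T,), alignment over encoder positions at the
                previous decoder step.
    p_stay:     shape (T,), probability in [0, 1] of staying on each
                encoder position (hypothetical name; in practice this
                would come from a sigmoid over attention energies).
    Returns the new alignment, shape (T,).
    """
    prev_alpha = np.asarray(prev_alpha, dtype=float)
    p_stay = np.asarray(p_stay, dtype=float)

    # Mass that stays on the same encoder position.
    stay = prev_alpha * p_stay

    # Mass that advances by exactly one position (no skipping).
    move = np.zeros_like(prev_alpha)
    move[1:] = prev_alpha[:-1] * (1.0 - p_stay[:-1])

    # Mass leaving the last position is dropped, so the result can sum
    # to less than 1 once the alignment reaches the end of the input.
    return stay + move

# Start fully aligned to the first encoder position.
alpha0 = np.array([1.0, 0.0, 0.0, 0.0])
alpha1 = sma_step(alpha0, np.full(4, 0.5))
alpha2 = sma_step(alpha1, np.full(4, 0.5))
```

Because mass can only stay or move one position forward, no encoder position can be skipped, which is the failure mode the abstract calls pronunciation skipping. Setting every entry of `p_stay` to 0 or 1 gives a one-hot alignment; this is presumably what distinguishes hard from soft inference in the samples, but that reading is an assumption.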

Robustness: LJSpeech (Griffin Lim)

0. LJ049-0027: Presidents have made it clear, however, that they did not favor this or any other arrangement which interferes with the privacy of the President and his guests.

Ground truth Baseline SMA tuned(soft) SMA tuned(hard)
Audios

1. LJ049-0177: and Robert I. Bouck, who was in charge of the Protective Research Section of the Secret Service, believed that the accumulation of the facts known to the FBI

Audios

2. LJ050-0061: (b) who have made threats of bodily harm against officials or employees of Federal, state or local government or officials of a foreign government,

Audios

3. LJ050-0263: it will continue to rely in many respects upon the greater resources of the Office of Science and Technology and other agencies.

Audios

Robustness: Out of domain (WaveNet)

and which thus helps to restrain the actions of individuals within due bounds. Justice in this sense of the term is of critical importance,

Baseline SMA tuned(soft) SMA tuned(hard)
Audios

Such cases, however, are not very frequent; and in every part of Europe twenty workmen serve under a master for one that is independent,

Audios

and argued that it could make companies less competitive—despite the disclosure obligations that had been enacted in Europe and Canada.

Audios

Naturalness: Out of domain (WaveNet)

However, melting ice is now thought to be the main reason for rising sea levels. Most glaciers in temperate regions of the world are retreating.

Baseline SMA tuned(soft) SMA tuned(hard)
Audios

And even before that, the Romans ruled, which in their turn came from all over the intercontinental empire.

Audios

Gine Wang-Reese, vice president of political and public affairs at Equinor, wrote to the SEC.

Audios

The chief executive of the company’s Washington, D.C.-based lobbying arm, Francois Badoual,

Audios