PerTTS: Personalized and Controllable Zero-shot Spontaneous Style Text-to-Speech Synthesis

Abstract

Personalized and controllable zero-shot spontaneous-style speech synthesis is important in spoken scenarios, particularly for generating natural and expressive speech for unseen speakers under data-limited conditions. Traditional methods typically approach this by fine-tuning pre-trained multi-speaker speech synthesis models or by adopting zero-shot adaptation techniques. However, these methods are limited in voice cloning and style modeling, and struggle to capture the fine-grained voice characteristics and complex speaking styles of target speakers. In this paper, we propose PerTTS, a personalized and controllable zero-shot spontaneous speech synthesis method. It introduces a personalized speaking style encoder that uses pre-trained models and a local prosody encoder to extract semantic, duration, timbre, and prosody information from multiple reference utterances of the target speaker, forming a comprehensive personalized representation of speaking style. Furthermore, we employ knowledge distillation to learn spontaneous behavior patterns and incorporate a multi-modal pseudo-label detector to extract labels from unlabeled data, enabling both modeling and control of spontaneous behaviors. This mechanism significantly enhances the naturalness and spontaneity of the synthesized speech. Experimental results demonstrate that PerTTS significantly outperforms existing models in speaking style similarity and speech naturalness. The personalized speaking style representation effectively improves style similarity, and spontaneous behavior modeling further improves the naturalness and spontaneity of the synthesized speech while enabling controllable generation of spontaneous behaviors.

The architecture of our proposed model.
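To make the idea of a personalized style representation built from multiple reference utterances more concrete, here is a minimal sketch. It is not the PerTTS implementation: the field names, the mean pooling across utterances, and the concatenation order are all assumptions chosen for illustration, since this page only states that semantic, duration, timbre, and prosody information is extracted from several references.

```python
from statistics import fmean

# Hypothetical field names; the page only names the four information types.
FIELDS = ("semantic", "duration", "timbre", "prosody")


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def build_style_embedding(reference_utterances):
    """Pool each information type across utterances, then concatenate.

    Assumption: simple averaging stands in for whatever fusion the
    actual personalized speaking style encoder performs.
    """
    if not reference_utterances:
        raise ValueError("at least one reference utterance is required")
    style = []
    for field in FIELDS:
        style.extend(mean_pool([utt[field] for utt in reference_utterances]))
    return style


# Toy per-utterance features (made-up numbers, not real model outputs).
refs = [
    {"semantic": [1.0, 3.0], "duration": [0.5], "timbre": [2.0, 2.0], "prosody": [0.0]},
    {"semantic": [3.0, 5.0], "duration": [1.5], "timbre": [4.0, 0.0], "prosody": [1.0]},
]
style_embedding = build_style_embedding(refs)
print(style_embedding)
```

Pooling over several utterances rather than a single prompt is the point of the sketch: one utterance may not exhibit all of a speaker's habits, so the representation is built from the set.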

Audio samples for different models

  • GT : Ground-truth audio.
  • VALL-E : An open-source implementation of VALL-E, pre-trained on large-scale Chinese datasets and then fine-tuned on HQ-conversations.
  • BASE-LPE : VALL-E with the LPE extracted by the local prosody encoder.
  • BASE-style : VALL-E with the style embedding extracted by the personalized speaking style encoder, which comprises semantic, duration, prosody, and timbre information.
  • PerTTS : Our proposed personalized and controllable zero-shot spontaneous-style speech synthesis model. It consists of the VALL-E backbone, a personalized speaking style encoder, and a label encoder. In this setting, the spontaneous labels are assumed to be given.
  • PerTTS(w/o label) / w/ style emb input : The same architecture as PerTTS, but the pseudo labels for spontaneous behaviors are taken from the output of the NAR predictor.
  • PerTTS(w/o label) / w/o style emb input : The pseudo labels for spontaneous behaviors are taken from the output of an NAR predictor that was trained without the style embedding.
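
The three PerTTS variants differ only in where the spontaneous labels come from. The sketch below illustrates that selection logic under stated assumptions: `SystemConfig`, `resolve_labels`, and `toy_predictor` are hypothetical names, labels are assumed to be one integer per character, and the toy predictor is a stand-in rather than the actual NAR predictor.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Predictor = Callable[[str, Optional[Sequence[float]]], List[int]]


@dataclass(frozen=True)
class SystemConfig:
    use_given_labels: bool          # True: labels are supplied with the text
    predictor_sees_style_emb: bool  # only relevant when labels are predicted


SYSTEMS = {
    "PerTTS": SystemConfig(True, False),
    "PerTTS(w/o label) / w/ style emb input": SystemConfig(False, True),
    "PerTTS(w/o label) / w/o style emb input": SystemConfig(False, False),
}


def resolve_labels(system: str, text: str, predictor: Predictor,
                   style_emb: Optional[Sequence[float]] = None,
                   given_labels: Optional[Sequence[int]] = None) -> List[int]:
    """Return the spontaneous labels that the chosen variant would use."""
    config = SYSTEMS[system]
    if config.use_given_labels:
        if given_labels is None or len(given_labels) != len(text):
            raise ValueError("expected one given label per character")
        return list(given_labels)
    # Pseudo labels: optionally withhold the style embedding from the predictor.
    predictor_input = style_emb if config.predictor_sees_style_emb else None
    return predictor(text, predictor_input)


def toy_predictor(text: str, style_emb: Optional[Sequence[float]]) -> List[int]:
    # Stand-in for the NAR predictor: flags the filler "呃" and nothing else.
    return [1 if ch == "呃" else 0 for ch in text]


print(resolve_labels("PerTTS", "呃好", toy_predictor, given_labels=[0, 1]))
print(resolve_labels("PerTTS(w/o label) / w/ style emb input", "呃好",
                     toy_predictor, style_emb=[0.1]))
```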

    Group1

    Target Chinese Text | Prompt Speech | GT | VALL-E | BASE-LPE | BASE-style (Proposed)
    地势还是啥反正就是,各种各样的环境都非常多样化,所以它的景色也非常的丰富。
    就是对一些基础的问题,但是真的可以回答的很好。
    它就是通过人工智能哎给咱们推荐,你喜欢哪个视频呀你不喜欢哪个视频是吧,通过这个推送。
    对,确实因为人工智能到现在还没有被这个这个普及我感觉没有被普及。
    像像我妈那种工工薪阶层基本都是。
    呃,喜欢泡的还是,直接就是喝水的。

    Group2

    Target Chinese Text | GT | BASE-style | PerTTS (Proposed) | PerTTS(w/o label) / w/ style emb input | PerTTS(w/o label) / w/o style emb input
    也也给别的地方拉一下旅游,拉拉动一下旅游产业的发展。
    工地好像并不是技术,工地是靠蛮力啊,靠力气。
    那么你一个人管得来台球馆吗?
    让他去带动另一批人,然后就让另一批人去调动下一批人,循环往复嘛对吧,形成一个良性循环。
    然后,他那个就说是预防,预防你的就是预防女生的宫颈癌呀还有这个胸胸什么。
    大多数都很少有冲劲了,我最近不是在看那个电视剧嘛,看那个觉醒年代你有看过吗?

    ABX

    Comparison of BASE-style and PerTTS in spontaneity and naturalness.

    Target Chinese Text | BASE-style | PerTTS (Proposed)
    也也给别的地方拉一下旅游,拉拉动一下旅游产业的发展。
    对我我这个人也是我现在这个男朋友他老是喜欢打游戏他一打游戏我就感觉。
    那么你一个人管得来台球馆吗?
    上几百万的粉丝他一次广告可能就要十几万几十万。

    Ablation Study

    Investigation on speaker embedding

    Compare timbre similarity.

    GT | BASE-style | without speaker embedding

    Investigation on BERT embedding and duration embedding

    Compare style similarity.

    Target Chinese Text | GT | BASE-style | without BERT embedding and duration embedding
    地势还是啥反正就是,各种各样的环境都非常多样化,所以它的景色也非常的丰富。
    你说这之后这个人工智能,是不是会越来越便利,就是说体会咱们这个生活的。
    像像我妈那种工工薪阶层基本都是。
    最近看你天天吃泡面啊,然后你那个。

    Controllability of spontaneous behaviors

    NOTE: Characters that carry a spontaneous label (in GT) are shown in bold.

    Target Chinese Text | GT | PerTTS (Proposed) | PerTTS(w/o label) | No Label
    呃,喜欢泡的还是,直接就是喝水的。
    就就只能干那体力活的,那他肯定就往那些放那那些岗位上面去了。
    怎么说呢我觉得这个服务员他这行业啊,反正竞争也不是特别大吧,但就很
    啧,我觉得这互联网上去混饭呢都看缘分看赏饭吃。
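
The table above contrasts synthesis with given labels, with predicted labels, and with no labels. As a rough illustration of what label-based control could look like at the input level, the following sketch turns a set of marked character positions into a per-character label sequence. The binary labels, the function names, and the bracket rendering are assumptions made for this example; the page does not specify the label inventory or format, and the marked position is illustrative rather than the page's own annotation.

```python
from typing import Iterable, List


def label_sequence(text: str, labeled_positions: Iterable[int]) -> List[int]:
    """Return one 0/1 label per character (1 = spontaneous behavior here)."""
    positions = set(labeled_positions)
    out_of_range = [p for p in positions if not 0 <= p < len(text)]
    if out_of_range:
        raise ValueError(f"positions outside the text: {sorted(out_of_range)}")
    return [1 if i in positions else 0 for i in range(len(text))]


def render(text: str, labels: List[int]) -> str:
    """Show labeled characters in brackets, a plain-text stand-in for bold."""
    return "".join(f"[{ch}]" if lab else ch for ch, lab in zip(text, labels))


text = "啧,好吧"                           # toy input, not a sample from this page
with_label = label_sequence(text, [0])     # mark the first character
no_label = label_sequence(text, [])        # the "No Label" condition: all zeros

print(render(text, with_label))
print(render(text, no_label))
```

Changing the marked positions while keeping the text fixed is the kind of control the comparison is meant to expose: the same sentence can be requested with or without a spontaneous behavior at a given character.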