Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation-based Voice Conversion - Demo

Xintao Zhao, Shuai Wang, Yang Chao, Zhiyong Wu, Helen Meng



ABSTRACT

Nowadays, recognition-synthesis-based methods have become quite popular for voice conversion (VC). By introducing linguistic features with good disentanglement properties extracted from an automatic speech recognition (ASR) model, VC performance has achieved considerable breakthroughs. Recently, self-supervised learning (SSL) methods trained on large-scale unannotated speech corpora have been applied to downstream tasks that focus on content information, which makes them suitable for VC. However, the large amount of speaker information in SSL representations significantly degrades the timbre similarity and the quality of the converted speech. To address this problem, we propose a high-similarity any-to-one voice conversion method that takes SSL representations as input. We incorporate adversarial training mechanisms into the synthesis module using external unannotated corpora. A feed-forward Transformer-based acoustic model and a HiFi-GAN vocoder are used to generate high-quality waveforms. Two auxiliary discriminators are trained to distinguish whether a sequence of mel-spectrograms has been converted by the acoustic model, and whether a sequence of content embeddings contains speaker information from the external corpora. Experimental results show that the proposed method achieves similarity comparable to, and naturalness higher than, a supervised method that requires a large amount of annotated corpora for training, and that it can also improve the similarity of VC methods that take other SSL representations as input.
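For readers who want a concrete picture of the adversarial setup, below is a minimal PyTorch sketch of the two auxiliary discriminators described above. It is not the authors' implementation: the module names, layer sizes, and hinge-style GAN losses are illustrative assumptions; only the roles of the two discriminators follow the abstract.

```python
# Minimal sketch (not the authors' code) of the two auxiliary discriminators.
# Names, shapes, and the hinge-style losses are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConversionDiscriminator(nn.Module):
    """Judges whether a mel-spectrogram sequence is natural or was converted
    by the acoustic model."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, mel):                      # mel: (B, n_mels, T)
        return self.net(mel).mean(dim=(1, 2))    # one realness score per utterance

class EmbeddingDiscriminator(nn.Module):
    """Judges whether a content-embedding sequence carries speaker information
    from the external (unannotated) corpora rather than the target speaker."""
    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb):                      # emb: (B, T, dim)
        return self.net(emb).mean(dim=(1, 2))    # one score per utterance

# Hinge-style adversarial objectives (an assumption; the paper may use a
# different GAN formulation).
def d_loss(real_score, fake_score):
    return F.relu(1.0 - real_score).mean() + F.relu(1.0 + fake_score).mean()

def g_adv_loss(fake_score):
    return -fake_score.mean()

# Example usage with dummy tensors:
# D_conv = ConversionDiscriminator()
# natural, converted = torch.randn(4, 80, 200), torch.randn(4, 80, 200)
# loss_d = d_loss(D_conv(natural), D_conv(converted))
```

In such a setup, the Conversion Discriminator would see natural versus converted mel-spectrograms, while the Embedding Discriminator would see content embeddings from the target-speaker corpus versus the external unannotated corpora, and the synthesis module would be optimized to fool both.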

Conversion Tasks

Our goal is to convert the timbre to that of the target speaker while preserving the fine-grained prosody information contained in the source waveform, which leads to high naturalness.

No annotated data were used at any stage of training.

  • Proposed: the proposed method with the adversarial discriminators.
  • Baseline: the proposed method without the Embedding Discriminator and the Conversion Discriminator.

Below are a few audio samples from the target CSMSC speaker.

Below are a few demo audio samples.

HuBERT Soft Unit experiment

The Proposed and Baseline methods using HuBERT Soft Units as the SSL representation.

Audio table columns: No. | Baseline | Proposed | CTC

  1. Source 1 transcript: "Hello,所有女生,你们的魔鬼又来咯。Oh my god,这个颜色也太好看了吧。买它!买它!买它!"
  2. Source 2 transcript: "欢迎雁南飞"
  3. Source 3 transcript: "对保障困难群众生活发挥了重要作用"
  4. Source 4 transcript: "平均每笔贷款约一万元"
  5. Source 5 transcript: "房产市场多宗交易停滞"
  6. Source 6 transcript: "在彰显立法机关坚决落实税收法定原则决定的同时"

Convect Features experiment

The Proposed and Baseline methods using Convect Features as the SSL representation.

Audio table columns: No. | Baseline | Proposed | CTC

  1. Source 1 transcript: "于是我决定就来谈谈种族问题"
  2. Source 2 transcript: "其最近三年营收入年均复合增长超过百分之一"
  3. Source 3 transcript: "近两年同比增长百分之一点六和百分之七点二"
  4. Source 4 transcript: "对保障困难群众生活发挥了重要作用"
  5. Source 5 transcript: "房产市场多宗交易停滞"
  6. Source 6 transcript: "切实维护消费者合法权益"

HuBERT Raw experiment

The Proposed and Baseline methods using HuBERT Raw features as the SSL representation.

Audio table columns: No. | Baseline | Proposed | CTC

  1. Source 1 transcript: "Hello!所有女生,你们的魔鬼又来咯。Oh my god,这个颜色也太好看了吧。买它!买它!买它!"
  2. Source 2 transcript: "黄先生,你们——之前根本不认识吧"
  3. Source 3 transcript: "其他风险还包括运营风险渗透市场风险等"
  4. Source 4 transcript: "实际上 从整体一线市场看"
  5. Source 5 transcript: "这一代人属于二战老人"
  6. Source 6 transcript: "但是大多数老人还是倾向高服务质量的品牌社区"