DISENTANGLING CONTENT AND FINE-GRAINED PROSODY INFORMATION VIA HYBRID ASR BOTTLENECK FEATURES FOR VOICE CONVERSION - Demo

Xintao Zhao, Feng Liu, Changhe Song, Zhiyong Wu, Shiyin Kang, Deyi Tuo, Helen Meng

ABSTRACT

Voice conversion (VC) with non-parallel data has achieved considerable breakthroughs recently by introducing bottleneck features (BNFs) extracted by an automatic speech recognition (ASR) model. However, the selection of BNFs has a significant impact on the VC result. For example, when BNFs are extracted from an ASR model trained with cross-entropy loss (CE-BNFs) and fed into a neural network to train a VC system, the timbre similarity of the converted speech is significantly degraded. If BNFs are extracted from an ASR model trained with Connectionist Temporal Classification loss (CTC-BNFs), the naturalness of the converted speech may decrease. This phenomenon is caused by the difference in the information contained in the BNFs. In this paper, we propose an any-to-one VC method using hybrid bottleneck features extracted from CTC-BNFs and CE-BNFs so that they complement each other's advantages. A gradient reversal layer and instance normalization are used to extract prosody information from CE-BNFs and content information from CTC-BNFs. An auto-regressive decoder and a HiFi-GAN vocoder are used to generate high-quality waveforms. Experimental results show that our proposed method achieves higher similarity, naturalness, and quality than the baseline method, and reveal the differences between the information contained in CE-BNFs and CTC-BNFs as well as the influence they have on the converted speech.
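As a rough illustration of the two generic building blocks named in the abstract, the sketch below shows instance normalization over time and a gradient reversal layer in plain Python. It is not the authors' implementation: the function and class names, the toy 2-dimensional "BNF" frames, and the reversal weight `lam` are all assumptions made for this example, and the page does not specify where in the model each operation is applied.

```python
import math


def instance_norm(frames, eps=1e-5):
    """Normalize each feature channel over time for one utterance.

    frames: list of T frames, each a list of C floats (hypothetical BNFs).
    Returns a new T x C list whose channels have ~zero mean and ~unit variance.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    normalized = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        column = [frame[c] for frame in frames]
        mean = sum(column) / num_frames
        var = sum((x - mean) ** 2 for x in column) / num_frames
        scale = 1.0 / math.sqrt(var + eps)
        for t in range(num_frames):
            normalized[t][c] = (column[t] - mean) * scale
    return normalized


class GradientReversal:
    """Identity in the forward pass; flips and scales gradients going back."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (assumed hyperparameter)

    def forward(self, x):
        # The features pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # The upstream encoder receives the negated gradient, so it is
        # pushed to remove what the downstream classifier can predict.
        return [-self.lam * g for g in grad_output]


# Toy utterance: 3 frames, 2 feature channels (made-up numbers).
toy_bnf = [[1.0, 10.0], [2.0, 10.5], [3.0, 11.0]]
normed = instance_norm(toy_bnf)

grl = GradientReversal(lam=0.5)
passed_through = grl.forward([1.0, 2.0])
reversed_grad = grl.backward([0.2, -0.4])
```

Instance normalization removes per-utterance channel statistics, which is one common way to suppress global, speaker-dependent characteristics in a feature sequence. Gradient reversal is typically paired with an auxiliary classifier so that the encoder learns features the classifier cannot exploit.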

Conversion Tasks

Our goal is to convert the timbre to that of the target speaker while preserving the fine-grained prosody information contained in the source waveform, which leads to high naturalness.

We aim to generate waveforms with high timbre similarity, naturalness, and quality.

Proposed - the proposed hybrid BNF VC system.
Baseline - the original recognition-synthesis VC system using BNFs or phonetic posteriorgrams (PPGs).

Below are a few audio samples from a private speaker.

Below are a few audio samples from a public speaker.

Below are a few demo audio samples.

CE-BNFs And CE-PPGs Baseline model

BNFs or PPGs extracted from an ASR model trained with cross-entropy (CE) loss are fed into the Baseline VC model.

NO	Baseline-PPGs	Baseline-BNFs	Source
1
Source 1 transcript : "娜可露露这个数据有点吓人，0-2-5，恭喜娜可露露成为MVP."
2
Source 2 transcript : "这事儿全古城的尽人皆知啊"
3
Source 3 transcript : "除夕夜，关帝庙的唐道长，有观看星象的习惯"
4
Source 4 transcript : "放在屯丁手里，每赏不过交一仓淡谷子。出兑后的银子，全部归将军衙门"
5
Source 5 transcript : "和莫家的婚约，不就自然落到了我的头上"
6
Source 6 transcript : "人类必须解除武装，进行裸移民。即，在移民过程中不能携带任何重型装备和设施"

CTC-BNFs And CTC-PPGs Baseline model

BNFs or PPGs extracted from an ASR model trained with CTC loss are fed into the Baseline VC model.

NO	Baseline-PPGs	Baseline-BNFs	Source
1
Source 1 transcript : "娜可露露这个数据有点吓人，0-2-5，恭喜娜可露露成为MVP."
2
Source 2 transcript : "请勿谦让，暗示地劝导谦让。"
3
Source 3 transcript : "这事儿全古城的尽人皆知啊"
4
Source 4 transcript : "除夕夜，关帝庙的唐道长，有观看星象的习惯"
5
Source 5 transcript : "上级没有钱拨下来，适才正穷，口袋瘪瘪的，集资建房也行不通"
6
Source 6 transcript : "在此之前，必须张罗，给屯丁搭上窝棚，安顿下来"

Proposed Method

The proposed hybrid ASR BNF VC model is compared with the Baseline model using CE-BNFs or CTC-BNFs as input.

NO	CTC-BNFs(baseline)	CE-BNFs(baseline)	Proposed
1
Source 1 transcript : "我都让人收拾成啥样了，你才来？"
2
Source 2 transcript : "几个像样的炖菜儿，吃喝起来，滋味儿就不是一个滋味儿。"
3
Source 3 transcript : "黄先生，你们之前根本不认识吧？"
4
Source 4 transcript : "接下来咱们去教室，还是library？"
5
Source 5 transcript : "干什么玩意儿，上学一点道德素质都没有"
6
Source 6 transcript : "粤语(食饭没呀？食佐啦)"
7
Source 7 transcript : "悄悄告诉大家一个秘密，小右我可是精通所有类型的游戏"
8
Source 8 transcript : "中国人民同法国人民一样，对此次火灾深感痛恻"
9
Source 9 transcript : "啊，真香啊。他迅速跑上前"
10
Source 10 transcript : "没看刚才他也做了吗"
11
Source 11 transcript : "哎要进攻咱们就冲啊！"