DISENTANGLING CONTENT AND FINE-GRAINED PROSODY INFORMATION VIA HYBRID ASR BOTTLENECK FEATURES FOR VOICE CONVERSION - Demo
Xintao Zhao, Feng Liu, Changhe Song, Zhiyong Wu, Shiyin Kang, Deyi Tuo, Helen Meng
ABSTRACT
Non-parallel voice conversion (VC) has recently achieved considerable breakthroughs by introducing bottleneck features (BNFs) extracted by an automatic speech recognition (ASR) model. However, the selection of BNFs has a significant impact on the VC result. For example, when BNFs are extracted from an ASR model trained with cross-entropy loss (CE-BNFs) and fed into a neural network to train a VC system, the timbre similarity of the converted speech is significantly degraded. If BNFs are extracted from an ASR model trained with connectionist temporal classification loss (CTC-BNFs), the naturalness of the converted speech may decrease. This phenomenon is caused by the difference in the information contained in the BNFs. In this paper, we propose an any-to-one VC method using hybrid bottleneck features extracted from CTC-BNFs and CE-BNFs so that the two complement each other's advantages. A gradient reversal layer and instance normalization are used to extract prosody information from CE-BNFs and content information from CTC-BNFs. An auto-regressive decoder and a HiFi-GAN vocoder are used to generate high-quality waveforms. Experimental results show that our proposed method achieves higher similarity, naturalness, and quality than the baseline method, and reveal the differences between the information contained in CE-BNFs and CTC-BNFs as well as the influence they have on the converted speech.
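The abstract names two generic building blocks, instance normalization and a gradient reversal layer. The following is a minimal pure-Python sketch of what each operation does in isolation. It is not the authors' implementation: the function and class names, the `eps` and `weight` parameters, and the list-of-lists feature layout (frames by channels) are all assumptions made for illustration.

```python
import math


def instance_norm(features, eps=1e-5):
    """Normalize each channel of a (frames x channels) sequence over time.

    Each channel ends up with roughly zero mean and unit variance within
    the utterance, which removes utterance-level channel statistics.
    """
    n_frames = len(features)
    n_channels = len(features[0])
    normalized = [[0.0] * n_channels for _ in range(n_frames)]
    for c in range(n_channels):
        column = [frame[c] for frame in features]
        mean = sum(column) / n_frames
        var = sum((x - mean) ** 2 for x in column) / n_frames
        scale = 1.0 / math.sqrt(var + eps)
        for t in range(n_frames):
            normalized[t][c] = (column[t] - mean) * scale
    return normalized


class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, weight=1.0):
        # Hypothetical scaling factor for the reversed gradient.
        self.weight = weight

    def forward(self, x):
        # Features pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # The sign flip makes the upstream encoder work against
        # whatever classifier sits after this layer.
        return [-self.weight * g for g in grad_output]
```

In a real system both operations would normally come from a deep learning framework with automatic differentiation; the manual `backward` method here only makes the sign flip explicit.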
Conversion Tasks
Our goal is to convert the timbre to that of the target speaker while preserving the fine-grained prosody information contained in the source waveform, which leads to high naturalness.
We will generate waveforms with high timbre similarity, naturalness and quality.
- Proposed - the proposed hybrid BNFs VC system.
- Baseline - the original recognition-synthesis VC system using BNFs or phonetic posteriorgrams (PPGs).
Below are a few audio samples from a private speaker.
Below are a few audio samples from a public speaker.
Below are a few demo audio samples.
CE-BNFs and CE-PPGs Baseline Model
BNFs or PPGs extracted from an ASR model trained with cross-entropy loss are fed into the baseline VC model.
NO | Baseline-PPGs | Baseline-BNFs | Source |
---|---|---|---|
1 | |||
Source 1 transcript : "娜可露露这个数据有点吓人,0-2-5,恭喜娜可露露成为MVP." (English: "Nakoruru's stats are a bit scary: 0-2-5. Congratulations to Nakoruru on becoming MVP.") | |||
2 | |||
Source 2 transcript : "这事儿全古城的尽人皆知啊" (English: "Everyone in the whole old town knows about this.") | |||
3 | |||
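Source 3 transcript : "除夕夜,关帝庙的唐道长,有观看星象的习惯" (English: "On New Year's Eve, Taoist Priest Tang of the Guandi Temple has a habit of observing the stars.") | |||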
4 | |||
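Source 4 transcript : "放在屯丁手里,每赏不过交一仓淡谷子。出兑后的银子,全部归将军衙门" (English: "Left in the garrison farmers' hands, each payment is no more than a granary of plain millet. The silver from selling it all goes to the general's yamen.") | |||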
5 | |||
Source 5 transcript : "和莫家的婚约,不就自然落到了我的头上" (English: "The betrothal to the Mo family would then naturally fall to me, wouldn't it?") | |||
6 | |||
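Source 6 transcript : "人类必须解除武装,进行裸移民。即,在移民过程中不能携带任何重型装备和设施" (English: "Humanity must disarm and carry out a 'naked' migration; that is, no heavy equipment or facilities may be carried during the migration.") | |||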
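
The baseline columns above compare two kinds of ASR-derived inputs. As a rough illustration of how they differ: BNFs are hidden-layer activations of the ASR model, while PPGs are frame-level posterior probabilities over phonetic classes, typically obtained by applying a softmax to the output layer. The sketch below assumes a toy linear output layer; `extract_features`, `output_weights`, and the data layout are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract_features(hidden_frames, output_weights):
    """Return (bnfs, ppgs) for a toy ASR output layer.

    hidden_frames:  frames x hidden_dim activations, used directly as BNFs.
    output_weights: classes x hidden_dim weights of an assumed linear
                    output layer (no bias, for brevity).
    """
    bnfs = [list(frame) for frame in hidden_frames]
    ppgs = []
    for frame in hidden_frames:
        logits = [sum(w * h for w, h in zip(row, frame)) for row in output_weights]
        ppgs.append(softmax(logits))  # each frame becomes a distribution over classes
    return bnfs, ppgs
```

Each PPG frame sums to one and has as many entries as there are phonetic classes, whereas a BNF frame keeps the hidden dimension and is not constrained to be a probability distribution.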
CTC-BNFs and CTC-PPGs Baseline Model
BNFs or PPGs extracted from an ASR model trained with CTC loss are fed into the baseline VC model.
NO | Baseline-PPGs | Baseline-BNFs | Source |
---|---|---|---|
1 | |||
Source 1 transcript : "娜可露露这个数据有点吓人,0-2-5,恭喜娜可露露成为MVP." (English: "Nakoruru's stats are a bit scary: 0-2-5. Congratulations to Nakoruru on becoming MVP.") | |||
2 | |||
Source 2 transcript : "请勿谦让,暗示地劝导谦让。" (English: "Please do not defer; advise deference by implication.") | |||
3 | |||
Source 3 transcript : "这事儿全古城的尽人皆知啊" (English: "Everyone in the whole old town knows about this.") | |||
4 | |||
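Source 4 transcript : "除夕夜,关帝庙的唐道长,有观看星象的习惯" (English: "On New Year's Eve, Taoist Priest Tang of the Guandi Temple has a habit of observing the stars.") | |||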
5 | |||
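Source 5 transcript : "上级没有钱拨下来,适才正穷,口袋瘪瘪的,集资建房也行不通" (English: "The higher-ups have not allocated any money, funds are tight right now, our pockets are empty, and pooling money to build housing will not work either.") | |||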
6 | |||
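Source 6 transcript : "在此之前,必须张罗,给屯丁搭上窝棚,安顿下来" (English: "Before that, arrangements must be made to put up shacks for the garrison farmers and get them settled.") | |||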
Proposed Method
The proposed hybrid ASR BNFs VC model, compared with the baseline model using CE-BNFs or CTC-BNFs as input.
NO | CTC-BNFs(baseline) | CE-BNFs(baseline) | Proposed |
---|---|---|---|
1 | |||
Source 1 transcript : "我都让人收拾成啥样了,你才来?" (English: "Look at the state they have left me in, and you only show up now?") | |||
2 | |||
Source 2 transcript : "几个像样的炖菜儿,吃喝起来,滋味儿就不是一个滋味儿。" (English: "With a few decent stews to eat and drink, the taste is just not the same.") | |||
3 | |||
Source 3 transcript : "黄先生,你们之前根本不认识吧?" (English: "Mr. Huang, you did not know each other at all before, did you?") | |||
4 | |||
Source 4 transcript : "接下来咱们去教室,还是library?" (English: "Next, shall we go to the classroom, or the library?") | |||
5 | |||
Source 5 transcript : "干什么玩意儿,上学一点道德素质都没有" (English: "What on earth is this? Going to school without any sense of morals at all.") | |||
6 | |||
Source 6 transcript : "粤语(食饭没呀?食佐啦)" (English: Cantonese, "Have you eaten yet? Yes, I have.") | |||
7 | |||
Source 7 transcript : "悄悄告诉大家一个秘密,小右我可是精通所有类型的游戏" (English: "Let me quietly tell everyone a secret: I, Xiaoyou, am a master of every type of game.") | |||
8 | |||
Source 8 transcript : "中国人民同法国人民一样,对此次火灾深感痛恻" (English: "The Chinese people, like the French people, are deeply saddened by this fire.") | |||
9 | |||
Source 9 transcript : "啊,真香啊。他迅速跑上前" (English: "Ah, it smells so good. He quickly ran forward.") | |||
10 | |||
Source 10 transcript : "没看刚才他也做了吗" (English: "Didn't you see that he just did it too?") | |||
11 | |||
Source 11 transcript : "哎要进攻咱们就冲啊!" (English: "Hey, if we are going to attack, let's charge!") | |||