DISENTANGLING CONTENT AND FINE-GRAINED PROSODY INFORMATION VIA HYBRID ASR BOTTLENECK FEATURES FOR VOICE CONVERSION - Demo

Xintao Zhao, Feng Liu, Changhe Song, Zhiyong Wu, Shiyin Kang, Deyi Tuo, Helen Meng



ABSTRACT

Non-parallel voice conversion (VC) has achieved considerable breakthroughs recently by introducing bottleneck features (BNFs) extracted by an automatic speech recognition (ASR) model. However, the selection of BNFs has a significant impact on the VC result. For example, when BNFs are extracted from an ASR model trained with cross-entropy loss (CE-BNFs) and fed into a neural network to train a VC system, the timbre similarity of the converted speech is significantly degraded. If BNFs are extracted from an ASR model trained with Connectionist Temporal Classification loss (CTC-BNFs), the naturalness of the converted speech may decrease. This phenomenon is caused by the difference in the information contained in the BNFs. In this paper, we propose an any-to-one VC method using hybrid bottleneck features extracted from CTC-BNFs and CE-BNFs so that the two complement each other's advantages. A gradient reversal layer and instance normalization are used to extract prosody information from CE-BNFs and content information from CTC-BNFs. An auto-regressive decoder and a HiFi-GAN vocoder are used to generate high-quality waveforms. Experimental results show that our proposed method achieves higher similarity, naturalness, and quality than the baseline method, and reveal the differences between the information contained in CE-BNFs and CTC-BNFs as well as the influence they have on the converted speech.
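
The two disentanglement tools named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function and class names, the list-of-frames layout for BNFs, and the value of lam are assumptions made for illustration only, and the sketch does not show which branch of the proposed model each tool is attached to. Instance normalization removes the per-utterance mean and variance of every feature channel, and a gradient reversal layer leaves activations unchanged in the forward pass while flipping the sign of the gradient in the backward pass.

```python
import math


def instance_norm(frames, eps=1e-5):
    """Normalize each feature channel over time within one utterance.

    frames: list of T frames, each a list of D floats (hypothetical BNF layout).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    normed = [[0.0] * dim for _ in range(num_frames)]
    for d in range(dim):
        channel = [frame[d] for frame in frames]
        mean = sum(channel) / num_frames
        var = sum((x - mean) ** 2 for x in channel) / num_frames
        scale = 1.0 / math.sqrt(var + eps)
        for t in range(num_frames):
            normed[t][d] = (channel[t] - mean) * scale
    return normed


class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (assumed hyperparameter)

    def forward(self, x):
        # Activations pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # The upstream gradient is negated and scaled before flowing further back.
        return [-self.lam * g for g in grad_output]


# Toy utterance: 3 frames, 2 channels whose statistics differ by a factor of 10.
bnf = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normed = instance_norm(bnf)

grl = GradientReversal(lam=0.5)
reversed_grad = grl.backward([1.0, -2.0])
```

After normalization, both toy channels have approximately zero mean and unit variance, so utterance-level offsets and scales no longer distinguish them. In an adversarial setup, an auxiliary classifier placed behind the reversal layer is trained normally, while the encoder in front of it receives the negated gradient and is pushed to discard the information that classifier relies on.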

Conversion Tasks

Our goal is to convert the timbre to that of the target speaker while preserving the fine-grained prosody information contained in the source waveform, which leads to high naturalness.

We will generate waveforms with high timbre similarity, naturalness and quality.

  • Proposed - the proposed hybrid-BNF VC system.
  • Baseline - the original recognition-synthesis VC system using BNFs or phonetic posteriorgrams (PPGs).

Below are a few audio samples from a private speaker.

Below are a few audio samples from a public speaker.

Below are a few demo audio samples.

CE-BNFs And CE-PPGs Baseline model

BNFs or PPGs extracted from an ASR model trained with cross-entropy loss are fed into the baseline VC model.

NO Baseline-PPGs Baseline-BNFs Source
1
Source 1 transcript : "娜可露露这个数据有点吓人,0-2-5,恭喜娜可露露成为MVP."
2
Source 2 transcript : "这事儿全古城的尽人皆知啊"
3
Source 3 transcript : "除夕夜,关帝庙的唐道长,有观看星象的习惯"
4
Source 4 transcript : "放在屯丁手里,每赏不过交一仓淡谷子。出兑后的银子,全部归将军衙门"
5
Source 5 transcript : "和莫家的婚约,不就自然落到了我的头上"
6
Source 6 transcript : "人类必须解除武装,进行裸移民。即,在移民过程中不能携带任何重型装备和设施"

CTC-BNFs And CTC-PPGs Baseline model

BNFs or PPGs extracted from an ASR model trained with CTC loss are fed into the baseline VC model.

NO Baseline-PPGs Baseline-BNFs Source
1
Source 1 transcript : "娜可露露这个数据有点吓人,0-2-5,恭喜娜可露露成为MVP."
2
Source 2 transcript : "请勿谦让,暗示地劝导谦让。"
3
Source 3 transcript : "这事儿全古城的尽人皆知啊"
4
Source 4 transcript : "除夕夜,关帝庙的唐道长,有观看星象的习惯"
5
Source 5 transcript : "上级没有钱拨下来,适才正穷,口袋瘪瘪的,集资建房也行不通"
6
Source 6 transcript : "在此之前,必须张罗,给屯丁搭上窝棚,安顿下来"

Proposed Method

The proposed hybrid ASR BNFs VC model is compared with the baseline model using CE-BNFs or CTC-BNFs as input.

NO CTC-BNFs(baseline) CE-BNFs(baseline) Proposed
1
Source 1 transcript : "我都让人收拾成啥样了,你才来?"
2
Source 2 transcript : "几个像样的炖菜儿,吃喝起来,滋味儿就不是一个滋味儿。"
3
Source 3 transcript : "黄先生,你们之前根本不认识吧?"
4
Source 4 transcript : "接下来咱们去教室,还是library?"
5
Source 5 transcript : "干什么玩意儿,上学一点道德素质都没有"
6
Source 6 transcript : "粤语(食饭没呀?食佐啦)"
7
Source 7 transcript : "悄悄告诉大家一个秘密,小右我可是精通所有类型的游戏"
8
Source 8 transcript : "中国人民同法国人民一样,对此次火灾深感痛恻"
9
Source 9 transcript : "啊,真香啊。他迅速跑上前"
10
Source 10 transcript : "没看刚才他也做了吗"
11
Source 11 transcript : "哎要进攻咱们就冲啊!"