Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion - Demo
Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng
ABSTRACT
Factorizing speech into disentangled speech representations is vital for achieving highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, and therefore lack controllability over other prosody-related factors. State-of-the-art speech representation learning methods that cover more speech factors rely on rudimentary disentanglement techniques such as random resampling and ad-hoc bottleneck layer size adjustment, which make it hard to ensure robust disentanglement of the speech representations. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled by an adversarial Mask-And-Predict (MAP) network inspired by BERT. The adversarial network minimizes the correlations between the speech representations by randomly masking one of the representations and predicting it from the others. Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors, increasing the speech quality MOS from 2.79 to 3.30 and decreasing the MCD from 3.89 to 3.58.
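The mask-and-predict idea described in the abstract can be illustrated with a toy sketch. This is not the paper's BERT-inspired network: the linear predictor, the scalar `weights`, the function name `mask_and_predict_loss`, and the sign-flipped encoder loss are all assumptions chosen only to make the adversarial objective concrete.

```python
import random

FACTORS = ("content", "rhythm", "pitch", "timbre")


def mask_and_predict_loss(reps, weights, rng):
    """Mask one randomly chosen representation and predict it from the rest.

    reps:    dict mapping factor name -> list of floats (all the same length)
    weights: dict mapping factor name -> scalar weight of a toy linear predictor
             (hypothetical stand-in for a learned prediction network)
    Returns (name of the masked factor, mean squared prediction error).
    """
    masked = rng.choice(FACTORS)
    dim = len(reps[masked])
    # Toy predictor: weighted sum of the representations that stay visible.
    prediction = [
        sum(weights[name] * reps[name][i] for name in FACTORS if name != masked)
        for i in range(dim)
    ]
    error = sum((p - t) ** 2 for p, t in zip(prediction, reps[masked])) / dim
    return masked, error


rng = random.Random(0)
reps = {name: [rng.gauss(0.0, 1.0) for _ in range(8)] for name in FACTORS}
weights = {name: 0.5 for name in FACTORS}

masked, error = mask_and_predict_loss(reps, weights, rng)

# Assumed adversarial setup: the predictor is trained to minimize the error,
# while the encoders producing `reps` are trained to maximize it, so that no
# representation can be recovered from the other three.
predictor_loss = error
encoder_adversarial_loss = -error
```

In a trainable version, the two losses would update different parameter sets: a low prediction error means the masked factor still leaks into the others, which is what the encoders are penalized for.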
Conversion Tasks
This page demonstrates seven types of multi-factor voice conversion tasks, comparing the following two systems:
- Proposed - the disentangled speech representation learning framework based on adversarial learning
- Baseline - the original voice conversion system SpeechFlow
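The count of seven follows from the tables on this page: each task converts a non-empty subset of the three factors rhythm, pitch and timbre, giving 2^3 - 1 = 7 combinations. A short sketch that enumerates them (the function name `conversion_tasks` is illustrative, and the ordering of factors within a label may differ from the table rows, e.g. `Rhythm+Pitch` versus `Pitch+Rhythm`):

```python
from itertools import combinations

FACTORS = ("Rhythm", "Pitch", "Timbre")


def conversion_tasks(factors=FACTORS):
    """Return every non-empty combination of factors as a '+'-joined label."""
    tasks = []
    for size in range(1, len(factors) + 1):
        for subset in combinations(factors, size):
            tasks.append("+".join(subset))
    return tasks


tasks = conversion_tasks()
for task in tasks:
    print(task)
```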
Below are a few demo audio samples.
Non-parallel Conversion
Source: "It will be crucial." (speaker p374, Male)
Target: "We are in the end game." (speaker p225, Female)
Task | Proposed | GroundTruth |
---|---|---|
Rhythm | ||
Pitch | ||
Timbre | ||
Pitch+Timbre | ||
Pitch+Rhythm | ||
Rhythm+Timbre | ||
Rhythm+Pitch+Timbre | ||
Source: "That time has passed." (speaker p374, Male)
Target: "Ms Anderson yesterday put a brave face on the departure." (speaker p225, Female)
Task | Proposed | GroundTruth |
---|---|---|
Rhythm | ||
Pitch | ||
Timbre | ||
Pitch+Timbre | ||
Pitch+Rhythm | ||
Rhythm+Timbre | ||
Rhythm+Pitch+Timbre | ||
Source: "It will be crucial." (speaker p374, Male)
Target: "We are in the end game." (speaker p225, Female)
Task | Baseline | Proposed | GroundTruth |
---|---|---|---|
Rhythm | |||
Pitch | |||
Timbre | |||
Pitch+Timbre | |||
Pitch+Rhythm | |||
Rhythm+Timbre | |||
Rhythm+Pitch+Timbre | |||
Parallel Conversion
"And we might go back."
Source speaker: p231 (Female); target speaker: p250 (Female)
Task | Proposed | Baseline | GroundTruth |
---|---|---|---|
Rhythm | |||
Pitch | |||
Timbre | |||
Pitch+Timbre | |||
Pitch+Rhythm | |||
Rhythm+Timbre | |||
Rhythm+Pitch+Timbre | |||
"Yet the data is compelling."
Source speaker: p272 (Male); target speaker: p285 (Male)
Task | Proposed | GroundTruth |
---|---|---|
Rhythm | ||
Pitch | ||
Timbre | ||
Pitch+Timbre | ||
Pitch+Rhythm | ||
Rhythm+Timbre | ||
Rhythm+Pitch+Timbre | ||
"It will take place in July."
Source speaker: p276 (Female); target speaker: p285 (Male)
Task | Proposed | GroundTruth |
---|---|---|
Rhythm | ||
Pitch | ||
Timbre | ||
Pitch+Timbre | ||
Pitch+Rhythm | ||
Rhythm+Timbre | ||
Rhythm+Pitch+Timbre | ||
Rhythm
Proposed
Source | Target | Converted |
---|---|---|
"I think, therefore I am?" | "I think, therefore I am?" | |
"By then, however, both men were already in the US." | "Three years ago he would have been." | |
"It's the same old story." | "It was fit for royalty." | |
Baseline
Source | Target | Converted |
---|---|---|
"And they were being paid?" | "And they were being paid?" | |
"He was asked to quit." | "He was asked to quit." | |
"And they were being paid?" | "And they were being paid?" | |
Pitch
Proposed
Source | Target | Converted |
---|---|---|
"And they were being paid?" | "And they were being paid?" | |
"And they were being paid?" | "And they were being paid?" | |
"And they were being paid?" | "And they were being paid?" | |
Baseline
Source | Target | Converted |
---|---|---|
"And they were being paid?" | "And they were being paid?" | |
"Neither side can win this war." | "Neither side can win this war." | |
"The weather forecast isn't good." | "The weather forecast isn't good." | |
Timbre
Proposed
Source | Target | Converted |
---|---|---|
"Inside, the atmosphere was quiet." | "Inside, the atmosphere was quiet." | |
"She is in their hands." | "You can feel at home in China." | |
"They should stop the bombing." | "I've no idea how it works." |
Baseline
Source | Target | Converted |
---|---|---|
"Inside, the atmosphere was quiet." | "Inside, the atmosphere was quiet." | |
"They know no other way." | "They know no other way." | |
"I think, therefore I am?" | "Is it in the right place?" | |
Component removed
(Section 4.4 in the paper)
Below are a few examples.
"I must do something about it."
Task | Proposed | Baseline |
---|---|---|
Remove Content | ||
Remove Rhythm | ||
Remove Pitch | ||
Remove Timbre |