Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion - Demo

Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng



ABSTRACT

Factorizing speech into disentangled representations is vital for highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, and therefore lack control over other prosody-related factors. State-of-the-art methods that cover more speech factors rely on basic disentanglement techniques such as random resampling and ad-hoc adjustment of bottleneck layer sizes, which make robust disentanglement hard to ensure. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted and further disentangled by an adversarial Mask-And-Predict (MAP) network inspired by BERT. The adversarial network minimizes the correlations between the speech representations by randomly masking one of the representations and predicting it from the others. Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors, increasing the speech quality MOS from 2.79 to 3.30 and decreasing the MCD from 3.89 to 3.58.
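
The mask-and-predict idea in the abstract can be illustrated with a toy sketch. It is not the paper's network: the names `FACTORS` and `mask_and_predict_loss`, the scalar-weight linear predictor, the equal vector lengths and the sign-flipped encoder loss are all assumptions made for illustration. It only shows the data flow: one of the four representations is masked at random, predicted from the other three, and the prediction error is used with opposite signs by the predictor and by the encoders.

```python
import random

# The four representations named in the abstract.
FACTORS = ("content", "timbre", "rhythm", "pitch")


def mask_and_predict_loss(reps, weights, rng):
    """One toy mask-and-predict step on a single frame.

    reps:    dict mapping factor name -> list of floats
             (all lists have the same length; an assumption of this sketch).
    weights: dict mapping factor name -> scalar weight of a hypothetical
             linear predictor (a stand-in for the real prediction network).
    rng:     random.Random instance used to pick the masked factor.

    Returns (masked_factor, predictor_loss, encoder_loss).
    """
    # Randomly choose which representation to hide.
    masked = rng.choice(FACTORS)
    visible = [name for name in FACTORS if name != masked]
    dim = len(reps[masked])

    # Toy predictor: weighted sum of the three visible representations.
    predicted = [
        sum(weights[name] * reps[name][i] for name in visible)
        for i in range(dim)
    ]

    # Mean squared error between the prediction and the masked representation.
    mse = sum((p - t) ** 2 for p, t in zip(predicted, reps[masked])) / dim

    # The predictor is trained to reduce the error, while the encoders are
    # trained against it (modelled here as a sign flip; an assumption).
    return masked, mse, -mse
```

If the masked representation can be predicted well from the others, the representations are still correlated; training the encoders against the predictor pushes them towards carrying non-overlapping information.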

Conversion Tasks

This page demonstrates seven types of multi-factor voice conversion tasks, comparing two systems:

  • Proposed - the disentangled speech representation learning framework based on adversarial learning
  • Baseline - the original voice conversion system SpeechFlow
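
The seven tasks listed in the tables on this page are the non-empty combinations of rhythm, pitch and timbre. The sketch below enumerates them and shows, for each task, which utterance each representation would come from. The function names are hypothetical, and taking the converted factors from the target utterance and everything else (including content) from the source is an assumption about how the conversion is set up.

```python
from itertools import combinations

# Factors that can be converted; content is kept from the source (assumption).
CONVERTIBLE = ("rhythm", "pitch", "timbre")


def conversion_tasks():
    """Return every non-empty combination of the convertible factors."""
    tasks = []
    for size in range(1, len(CONVERTIBLE) + 1):
        tasks.extend(combinations(CONVERTIBLE, size))
    return tasks


def representation_sources(task):
    """Map each representation to the utterance it is taken from for a task."""
    plan = {"content": "source"}
    for factor in CONVERTIBLE:
        plan[factor] = "target" if factor in task else "source"
    return plan


for task in conversion_tasks():
    print("+".join(task), representation_sources(task))
```

Three single-factor tasks, three two-factor tasks and one three-factor task give the seven rows of each table.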

Below are a few demo audios.

Non-parallel Conversion

Source : "It will be crucial."

Target : "We are in the end game."

Source speaker: p374 (Male), Target speaker: p225 (Female)

Task Proposed GroundTruth
Rhythm
Pitch
Timbre
Pitch+Timbre
Pitch+Rhythm
Rhythm+Timbre
Rhythm+Pitch+Timbre

Source : "That time has passed."

Target : "Ms Anderson yesterday put a brave face on the departure."

Source speaker: p374 (Male), Target speaker: p225 (Female)

Task Proposed GroundTruth
Rhythm
Pitch
Timbre
Pitch+Timbre
Pitch+Rhythm
Rhythm+Timbre
Rhythm+Pitch+Timbre

Source : "It will be crucial."

Target : "We are in the end game."

Source speaker: p374 (Male), Target speaker: p225 (Female)

Task Baseline Proposed GroundTruth
Rhythm
Pitch
Timbre
Pitch+Timbre
Pitch+Rhythm
Rhythm+Timbre
Rhythm+Pitch+Timbre

Parallel Conversion

"And we might go back."

Source speaker: p231 (Female), Target speaker: p250 (Female)

Task Proposed Baseline GroundTruth
Rhythm
Pitch
Timbre
Pitch+Timbre
Pitch+Rhythm
Rhythm+Timbre
Rhythm+Pitch+Timbre

"Yet the data is compelling."

Source speaker: p272 (Male), Target speaker: p285 (Male)

Task Proposed GroundTruth
Rhythm
Pitch
Timbre
Pitch+Timbre
Pitch+Rhythm
Rhythm+Timbre
Rhythm+Pitch+Timbre

"It will take place in July."

Source speaker: p276 (Female), Target speaker: p285 (Male)

Task Proposed GroundTruth
Rhythm
Pitch
Timbre
Pitch+Timbre
Pitch+Rhythm
Rhythm+Timbre
Rhythm+Pitch+Timbre

Rhythm

Proposed

Source Target Converted
"I think, therefore I am?" "I think, therefore I am?"
"By then, however, both men were already in the US." "Three years ago he would have been."
"It's the same old story." "It was fit for royalty."

Baseline

Source Target Converted
"And they were being paid?" "And they were being paid?"
"He was asked to quit." "He was asked to quit."
"And they were being paid?" "And they were being paid?"

Pitch

Proposed

Source Target Converted
"And they were being paid?" "And they were being paid?"
"And they were being paid?" "And they were being paid?"
"And they were being paid?" "And they were being paid?"

Baseline

Source Target Converted
"And they were being paid?" "And they were being paid?"
"Neither side can win this war." "Neither side can win this war."
"The weather forecast isn't good." "The weather forecast isn't good."

Timbre

Proposed

Source Target Converted
"Inside, the atmosphere was quiet." "Inside, the atmosphere was quiet."
"She is in their hands." "You can feel at home in China."
"They should stop the bombing." "I've no idea how it works."

Baseline

Source Target Converted
"Inside, the atmosphere was quiet." "Inside, the atmosphere was quiet."
"They know no other way." "They know no other way."
"I think, therefore I am?" "Is it in the right place?"



Component removed

(Section 4.4 in the paper)
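
As a rough illustration of what removing a component could look like, the sketch below replaces one representation with zeros before it would be passed to a decoder. This page does not say how the removal is done, so the zeroing, the function name `remove_component` and the dictionary layout are all assumptions rather than the procedure used in the paper.

```python
def remove_component(reps, factor):
    """Return a copy of the representations with one factor zeroed out.

    reps:   dict mapping factor name -> list of floats (hypothetical layout).
    factor: name of the representation to remove, e.g. "pitch".
    """
    if factor not in reps:
        raise KeyError(f"unknown factor: {factor}")
    # Copy every vector so the caller's representations stay untouched.
    ablated = {name: list(vec) for name, vec in reps.items()}
    # Zeroing is an assumed stand-in for "removing" the component.
    ablated[factor] = [0.0] * len(reps[factor])
    return ablated
```

If the representations are well disentangled, removing one of them should affect only the corresponding property of the generated speech, which is what the samples in this section let a listener judge.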

Below are a few examples.

"I must do something about it."

Task Proposed Baseline
Remove Content
Remove Rhythm
Remove Pitch
Remove Timbre