Paper: Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Huirong Huang, Shiyin Kang, Zhiyong Wu, Dongyang Dai, Jia Jia
Tianxiao Fu, Deyi Tuo, Guangzhi Lei, Peng Liu, Dan Su, Dong Yu, Helen Meng

Normal case (Mandarin)

Normal case means that the speaker and language of the input speech are the same as those of the training set. The emotion is neutral.

Corresponding model from left to right: Proposed model, MFCC-BLSTM, Ground truth.


Normal case - 1
我还没开发出这个功能耶。 (English: "I haven't developed this feature yet.")

Normal case - 2
高翔继续笑着说:附近有很多岔路口。 (English: "Gao Xiang went on with a smile: there are many forks in the road nearby.")

Unseen language (English) or speaker

In this case, the speaker or the language of the input speech differs from that of the training set.

Corresponding model from left to right: Proposed model, MFCC-BLSTM, Ground truth.


Same speaker & Unseen language (Multilingual) - 1
video phone.

Same speaker & Unseen language (Multilingual) - 2
JQP,SWQ,ANC.

Same speaker & Unseen language (Mixlingual) - 1
但是在写字楼里的Vivian和George却有无限可能。 (English: "But Vivian and George, in the office building, have unlimited possibilities.")

Same speaker & Unseen language (Mixlingual) - 2
奔涌迸发活力,dream照亮现实。 (English: "Vitality surges and bursts forth; dream lights up reality.")

Unseen speaker & Same language - Male
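再开启大招往人堆里面一转,我们回血,敌人掉血。 (English: "Then trigger the ultimate and spin into the crowd: we regain health and the enemies lose health.")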

Unseen speaker & Same language - Female
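敢冒充,敢冒充二弟是吧,我大哥只有一个二弟对吧?怎么可能有两个二弟。 (English: "You dare to impersonate, dare to impersonate the second brother, do you? My eldest brother has only one second brother, right? How could there be two?")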

Unseen speaker & Unseen language - Male, Multilingual
The latter serve as a worm aphrodisiac, getting the hermaphroditic worms to breed more often.

Unseen speaker & Unseen language - Female, Multilingual
As a final step in your daily procedure, dab the wound with iodine or mercurochrome.

Unseen speaker & Unseen language - Male, Mixlingual
我去看了阿汤哥未删减版R级的《American made》。 (English: "I went to see the uncut, R-rated version of Tom Cruise's 'American made'.")

Unseen speaker & Unseen language - Female, Mixlingual
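如果您想像这位博主一样,用少量的基本款搭配出N种不逊于大牌的look。 (English: "If you would like to do as this blogger does and combine a few basic pieces into N looks that rival the big brands.")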

Proposed model with/without energy

The speaker and language of the input speech are the same as those of the training set. The emotion is neutral.

Corresponding model from left to right: Proposed model without energy, Proposed model with energy.


With/Without energy - 1
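你们真讨厌,吃饱了,喝足了,就坐那儿看我说话。Somewhere over the rainbow. (English for the Mandarin part: "You are so annoying: having eaten and drunk your fill, you just sit there watching me talk.")

With/Without energy - 2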
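白日依山尽,黄河入海流。欲穷千里目,更上一层楼。 (English: "The white sun sets behind the mountains; the Yellow River flows into the sea. To see a thousand miles farther, climb one more storey.")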

Normal case with different emotions

Normal case means that the speaker and language of the input speech are the same as those of the training set. Neutral examples can be found above.

Corresponding model from left to right: Proposed model, MFCC-BLSTM, Ground truth.


Angry - 1
差之毫厘,谬以千里。错失良机了。 (English: "A tiny error leads to a huge deviation. The opportunity has been missed.")

Angry - 2
死有余辜,禽兽不如的东西。 (English: "Death is too good for you, you creature worse than a beast.")

Happy - 1
恭喜!是个男孩,六斤二两。 (English: "Congratulations! It's a boy, six jin and two liang.")

Happy - 2
这个钢琴曲,太好听了! (English: "This piano piece is so beautiful!")

Sad - 1
为什么在人群中,看到个相似的背影就难过。 (English: "Why do I feel sad whenever I see a similar figure from behind in a crowd?")

Sad - 2
即使远远地离开了你,我也永远不会和你分离。 (English: "Even though I have gone far away from you, I will never be parted from you.")