Transformer-S2A: Robust and Efficient Speech-to-Animation

Submitted to ICASSP 2022

Digital Domain created the digital avatar and provided all rendering demos.

Speaking Mandarin

The proposed model and the baseline are trained on a Mandarin dataset.

Upper: baseline (frame-level). Lower: proposed.

Frame-level baseline

Transcription: 随着年轻人聚集,县城就可以不断滚动提升,最终实现真正的品质提升. (As young people gather, the county town can keep improving step by step and ultimately achieve a genuine upgrade in quality.)


Transcription: 随着年轻人聚集,县城就可以不断滚动提升,最终实现真正的品质提升. (As young people gather, the county town can keep improving step by step and ultimately achieve a genuine upgrade in quality.)

Transfer to Unseen Speaker and Language (English)

The proposed S2A model is trained only on a Mandarin dataset.

The baseline for comparison (LipSync3D) is trained on an English dataset.

Left: proposed. Right: LipSync3D.


Transcription: I can't promise that I'll be an expert at it and be able to help you get better with it, but at least we can have some fun.

Ability to Sing

The model is trained only on a Mandarin speech (talking) dataset.


Lyrics: 终于做了这个决定,别人怎么说我不理,只要你也一样的肯定. —— 《勇气》 (I have finally made this decision; I won't listen to what others say, as long as you are just as certain. — from "Courage")