To demonstrate that the proposed model transfers cross-lingual speaking style from the source speech to the synthesized speech at both the global and local scales, we provide samples for comparison. Source Speech is the source speech in the original language, reconstructed by a vocoder. FastSpeech 2 is an open-source implementation of FastSpeech 2 with no speaking style transfer. Duration Transfer is the duration transfer model, which predicts the duration of every word in the target speech. Joint Style Transfer is the proposed model, which predicts the joint multi-scale cross-lingual speaking style of the target speech. A well-trained HiFi-GAN is used as the vocoder to generate the waveforms.
This was the machine that locked away the Destroyer the first time. And it can do it again, too! But we need four Vault Keys to get in. You already got three of them, right?
瀑布密道!太聪明了!我喜欢这招。
Secret waterfall tunnel! Brilliant! Love this gig.
有了这艘飞船,我们就能满世界追杀强盗啦!
With this ship, we can kill bandits all over the worlds!