Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

Submitted to ICASSP 2022

Sample 1

Transcripts

A : let me pour the tea for you.
A : I can’t do this again for three months.
B : are you sad, mom?
A : yes.
A : my little girl is leaving!
B(Synthesized) : mom, I’m not little, I’m seventeen years old!

Audio

Conversational context

Baseline approach

Proposed approach

Sample 2

Transcripts

A : and thriller books?
B : yeah.
A : and with the red coffee tables and flowers on the windows?
B : yeah.
A : where you can sit down and drink a delicious hot chocolate?
B(Synthesized) : yes, Anne, that’s the one!

Audio

Conversational context

Baseline approach

Proposed approach

Sample 3

Transcripts

A : hey, do I know you?
B : hello, Patti!
B : I’m Derrick.
B : you saw me at the teahouse.
A : oh, right!
A(Synthesized) : you look different with sunglasses on.

Audio

Conversational context

Baseline approach

Proposed approach

Sample 4

Transcripts

A : but it’s true, Anne.
A : what’s the matter?
A : why are you so nervous today?
B : I know everything, Jack.
B : I know all about you and Sharon.
A(Synthesized) : and what is it exactly that you know?

Audio

Conversational context

Baseline approach

Proposed approach