In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion

Abstract

We propose TES-VC (Text-driven Environment and Speaker controllable Voice Conversion), a text-driven voice conversion framework with independent control of speaker timbre and environmental acoustics. TES-VC takes separate text descriptions of the target voice and the target environment as simultaneous inputs and generates speech that matches the described timbre and environment while preserving the source content. The model is trained on synthetic data with decoupled vocal and environment features via latent diffusion modeling, eliminating interference between the two attributes. The Retrieval-Based Timbre Control (RBTC) module enables precise timbre manipulation from abstract descriptions without paired data. Experiments confirm that TES-VC generates speech appropriate to the described timbre and environment, with high content retention and superior controllability, demonstrating its potential for widespread applications.

Model Overview


Voice conversion with environmental and speaker control

Source speech is drawn from the LibriTTS-R test-clean set. For a fair comparison, we concatenate the environment and timbre descriptions to form VoiceLDM's description prompt and use the ground-truth transcript as its content prompt.
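The prompt construction for the VoiceLDM baseline can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_voiceldm_prompts`, the dictionary keys, the order of concatenation (environment first, then speaker), and the period separator are all assumptions made for the example.

```python
def build_voiceldm_prompts(environment: str, speaker: str, transcript: str) -> dict:
    """Build baseline prompts from one sample's text fields.

    Hypothetical helper: the key names, the concatenation order, and the
    ". " separator are assumptions, not details taken from the paper.
    """
    # Normalize each description so the joined prompt has exactly one period per part.
    env = environment.strip().rstrip(".")
    spk = speaker.strip().rstrip(".")
    description_prompt = f"{env}. {spk}."
    # The content prompt is the ground-truth transcript, passed through unchanged
    # apart from surrounding whitespace.
    return {
        "description_prompt": description_prompt,
        "content_prompt": transcript.strip(),
    }


if __name__ == "__main__":
    prompts = build_voiceldm_prompts(
        environment="Church bells ringing",
        speaker="A woman is speaking in a gentle voice",
        transcript="The salient features of this development of domestic "
                   "service have already been indicated.",
    )
    print(prompts["description_prompt"])
    print(prompts["content_prompt"])
```

TES-VC itself receives the two descriptions as separate inputs, so this joining step applies only to the baseline, which accepts a single description prompt.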

Original | Environment and speaker description | VoiceLDM | w/o CLAP-timbre adapter | TES-VC (ours)

Content text: He might have had that forged check for the face of it, if he'd been sharp. You wouldn't catch 'Rast Hopkins doing such a fool stunt.

Environment: Birds are squawking, and ducks are quacking
Speaker: An old man is speaking in a deep voice

Content text: He might have had that forged check for the face of it, if he'd been sharp.

Environment: Repeated gunfire and screaming in the background
Speaker: An old man is speaking in a gentle voice

Content text: "We say, of course," somebody exclaimed, "that they give two turns!

Environment: Motorcycle starting then driving away
Speaker: A young man is speaking in a clear voice

Content text: There was good reason to stop and think, even for the world's most emotionless man. What would Conseil say?

Environment: Food is frying, and a woman talks
Speaker: An elderly woman is speaking

Content text: The salient features of this development of domestic service have already been indicated.

Environment: Church bells ringing
Speaker: A woman is speaking in a gentle voice

Content text: A suffocating smell of nitrogen fills the air, it enters the throat, it fills the lungs.

Environment: Water is falling, splashing and gurgling
Speaker: A young girl is speaking in a sweet and light voice

Content text: 'I'm glad you like it,' says Wylder, chuckling benignantly on it, over his shoulder.

Environment: A crowd applauds for a while
Speaker: A woman is speaking in a rough and deep voice

Content text: He turned seaward from the road at Dollymount and as he passed on to the thin wooden bridge he felt the planks shaking with the tramp of heavily shod feet.

Environment: A piano playing as a clock ticks
Speaker: A man is speaking in a hoarse voice

Content text: The pleasant graveyard of my soul with sentimental cypress trees and flowers is filled, that I may stroll in meditation, at my ease.

Environment: Nature sounds with birds chirping and singing
Speaker: A young woman is speaking in soft voice

Evaluation Results

TES-VC outperforms the other systems on all five metrics: age recognition accuracy, gender recognition accuracy, speaker consistency (S-MOS), environmental faithfulness (E-MOS), and realness (R-MOS).
