FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

Shuoyi Zhou (1), Yixuan Zhou (1), Peiji Yang (3), Yifan Hu (2), Yicheng Zhong (3), Zhisheng Wang (3), Zhiyong Wu (1)
(1) Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
(2) Inner Mongolia University, Hohhot, China
(3) Tencent, Shenzhen, China

Abstract

Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.

Method

FineCombo-TTS is a novel framework that fully integrates reference speech and text descriptions to achieve flexible and precise control in text-to-speech synthesis.
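
The abstract names a Conditional Flow Matching (CFM)-based Speech Variance Predictor. As a rough illustration of what a generic CFM objective looks like, the sketch below builds one training example on the commonly used straight-line (optimal-transport) probability path and integrates a velocity field with Euler steps at inference. It is not the authors' implementation: the linear `toy_velocity` model, the `SIGMA_MIN` value, the vector sizes, and the idea of feeding the reference and text information as a single `cond` vector are all assumptions made for the example.

```python
import random

SIGMA_MIN = 1e-4  # assumed small noise floor; not a value taken from the paper


def cfm_training_example(x1, rng, sigma_min=SIGMA_MIN):
    """Sample (t, x_t, u_t) for one target vector x1 on a straight-line path."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                          # time in [0, 1)
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]            # point on the path
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]    # target velocity
    return t, x_t, u_t


def toy_velocity(x_t, t, cond, weights):
    """Hypothetical stand-in for the predictor: one linear layer over [x_t, t, cond]."""
    features = list(x_t) + [t] + list(cond)
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def cfm_loss(x1, cond, weights, rng):
    """Mean squared error between predicted and target velocity for one example."""
    t, x_t, u_t = cfm_training_example(x1, rng)
    v = toy_velocity(x_t, t, cond, weights)
    return sum((a - b) ** 2 for a, b in zip(v, u_t)) / len(u_t)


def euler_sample(cond, weights, dim, rng, steps=10):
    """Integrate the learned velocity field from noise (t=0) towards t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = toy_velocity(x, i * dt, cond, weights)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

In a real system the linear layer would be replaced by a neural network, and `cond` would carry whatever encoding of the reference speech and text description the model uses; the loss and sampling loop keep the same shape.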

Method Diagram

Joint Control Demos-Pitch Control

Reference Speech | Content | Description | Generated Speech
(audio) | "I guess I'll get wet," said Toby, ruefully, as he looked up at the lofty seat which he was to occupy. | Change the prosody, slow down the speech rate. | (audio)
(audio) | The ideal speaker makes his big words stand out like mountain peaks; his unimportant words are submerged like stream beds. | Change the prosody, slow down the speech rate, lower the pitch. | (audio)
(audio) | I feel shooken up dreadful, he's so awful strong; but I'm not very hurt, only I'm sorry, and I've been telling my Captain about it, and asking Him to forgive me. | Change the prosody, raise the volume. | (audio)
(audio) | "Horrid, father" whispered Ned, as if he felt that Indians might be listening. | Change the prosody, speed up the speech rate, raise the volume. | (audio)
(audio) | They were at breakfast, and everybody in the vicinity turned and stared at their table. | Change the prosody, lower the volume, raise the pitch. | (audio)
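
The descriptions in these demos are relative instructions ("slow down the speech rate", "lower the pitch"), which is the kind of source-to-target variation the abstract says FineEdit encodes. The sketch below shows one way such an instruction could be composed from a pair of attribute measurements. The `ProsodyAttributes` fields, their units, and the 5% threshold are hypothetical and are not the actual FineEdit schema; only the instruction phrases are taken from the demo descriptions.

```python
from dataclasses import dataclass


@dataclass
class ProsodyAttributes:
    speech_rate: float  # hypothetical unit: syllables per second
    pitch: float        # hypothetical unit: mean F0 in Hz
    volume: float       # hypothetical unit: mean energy in dB


# Phrases as they appear in the demo descriptions: (decrease, increase).
PHRASES = {
    "speech_rate": ("slow down the speech rate", "speed up the speech rate"),
    "pitch": ("lower the pitch", "raise the pitch"),
    "volume": ("lower the volume", "raise the volume"),
}


def describe_edit(source, target, rel_threshold=0.05):
    """Turn source-to-target attribute changes into a relative instruction.

    Returns None when no attribute changes by more than rel_threshold
    (an assumed cutoff, expressed as a fraction of the source value).
    """
    clauses = []
    for name, (down, up) in PHRASES.items():
        src = getattr(source, name)
        tgt = getattr(target, name)
        change = (tgt - src) / max(abs(src), 1e-8)  # guard against division by zero
        if change > rel_threshold:
            clauses.append(up)
        elif change < -rel_threshold:
            clauses.append(down)
    if not clauses:
        return None
    return "Change the prosody, " + ", ".join(clauses) + "."
```

For example, a source at 4.0 syllables per second and 200 Hz paired with a target at 3.0 syllables per second and 170 Hz, with volume unchanged, yields an instruction asking to slow down the speech rate and lower the pitch.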

Joint Control Demos-Emotion Control

Reference Speech | Content | Description | Generated Speech
(audio) | No admittance except on party business. | Change the style, speak with a sad emotion. | (audio)
(audio) | I'm sure your friends can wait! | Change the style, speak with a happy emotion. | (audio)
(audio) | No, I burst the balloon! | Change the style, speak with an angry emotion. | (audio)
(audio) | But what about this thing, sticky! | Change the style, speak with a surprised emotion. | (audio)
(audio) | I've just shot a stag. | Change the style, speak with an angry emotion. | (audio)

Joint Control Demos-Timbre Control

Reference Speech | Content | Description | Generated Speech
(audio) | Chapter ten a warm welcome. | Change the timbre to a feminine, clear, slightly muffled, middle-aged, slightly thin, slightly elegant voice. | (audio)
(audio) | Her shoes were like fishes. | Change the timbre to a very masculine, deep, very thick, very mature, slightly old, cool voice. | (audio)
(audio) | Mister share man, I move for a division. | Change the timbre to a feminine, very young, light, clear, slightly cute voice. | (audio)
(audio) | I do not think it is wise to take her into our confidence. | Change the timbre, speak as a feminine, adult-like, tensed, powerful, slightly bright, clear, fluent, intellectual, elegant, slightly sexy, lively, slightly strict, sharp. | (audio)
(audio) | "My dear uncle," I said, whilst hot tears trickled down my face. | Change the timbre, speak as an adult man, slightly weak, bright, soft. | (audio)

Description-Only Control Demos

Content | Description | Generated Speech
The horse itself, in the foreground of the design, stood motionless and statue-like, while farther back, its discomfited rider perished by the dagger of a Metzengerstein. | Generate a voice, Speaking slowly, she infused her words with abundant energy. | (audio)
At the unexpected outcry in the rear the Raturans halted, and held a hasty council of war. | Generate a voice, With diminished energy, the man speaks rapidly and in a deep pitch. | (audio)
"Washington ought to send troops and provisions for the forts at once!" replied Mr. Fulton. | Generate a voice, Speaking with fervor, her voice maintains a high pitch. | (audio)
When it was fixed on, the boat was launched and the voyage began. | Generate a voice, A man's low-pitched voice carries normal energy as he speaks rapidly. | (audio)
It is a lovely story that follows, full of marvel, as how should it not be? | Generate a voice, The man maintains a regular pitch while speaking rapidly with an energetic delivery. | (audio)