StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion
1. Abstract
Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting explicit modeling of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could improve conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens before synthesizing acoustic features. Experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (i.e., WER and CER) and speaker characteristics (i.e., SECS and MOS).
2. Evaluation Results
Table 1: Objective Evaluation of StarVC and Baselines (including Ablations)
SECS-Res and SECS-WavLM are speaker embedding cosine similarity (SECS) scores computed with Resemblyzer and a fine-tuned WavLM, respectively.
WER-Text and CER-Text evaluate the text generated by StarVC at the word and character level, respectively.
Bold values indicate the best results, and underlined values indicate the second-best results.
| Model | SECS-Res ↑ | SECS-WavLM ↑ | WER ↓ | WER-Text ↓ | CER ↓ | CER-Text ↓ |
|---|---|---|---|---|---|---|
| CosyVoice | **0.839** | **0.478** | 8.24% | / | 4.27% | / |
| OpenVoice V2 | 0.771 | 0.284 | 8.17% | / | <u>4.15%</u> | / |
| TriAAN-VC | 0.756 | 0.241 | 19.67% | / | 12.18% | / |
| StarVC | <u>0.835</u> | <u>0.472</u> | **6.27%** | **4.95%** | **4.09%** | **1.51%** |
| -- w/o multi-stage | 0.812 | 0.429 | <u>7.24%</u> | <u>5.09%</u> | 4.60% | 1.61% |
| -- w/o text token | 0.771 | 0.382 | 7.30% | / | 4.31% | / |
| -- smaller model | 0.750 | 0.383 | 8.04% | 5.33% | 5.67% | <u>1.56%</u> |
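As a rough illustration of the objective metrics in Table 1 (this is a minimal sketch, not the paper's evaluation code), SECS is the cosine similarity between two speaker embeddings, and WER is the word-level Levenshtein distance normalized by the reference length:

```python
import numpy as np

def secs(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Speaker Embedding Cosine Similarity between two embedding vectors."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

CER follows the same edit-distance computation over characters instead of words. In practice the speaker embeddings would come from Resemblyzer or a fine-tuned WavLM, as noted above.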
Table 2: MOS Evaluation of StarVC and Baselines with 95% confidence interval
Bold values indicate the best results, and underlined values indicate the second-best results.
| Model | SMOS ↑ | NMOS ↑ |
|---|---|---|
| CosyVoice | 3.94 ± 0.09 | <u>4.15 ± 0.08</u> |
| OpenVoice V2 | <u>3.97 ± 0.08</u> | 4.09 ± 0.08 |
| TriAAN-VC | 3.25 ± 0.09 | 3.06 ± 0.10 |
| StarVC | **3.98 ± 0.08** | **4.17 ± 0.08** |
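The ± values in Table 2 are 95% confidence intervals around the mean opinion score. A standard normal-approximation interval over individual ratings can be sketched as follows (an assumption for illustration; the paper does not specify its exact CI computation):

```python
import math
from statistics import mean, stdev

def mos_with_ci95(ratings):
    """Mean opinion score and half-width of a 95% confidence interval,
    using the normal approximation: 1.96 * s / sqrt(n)."""
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return round(m, 2), round(half_width, 2)
```

For example, six ratings of [3, 5, 3, 5, 4, 4] yield a mean of 4.0 with a half-width of about 0.72, reported as 4.00 ± 0.72.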
3. Demo
- TriAAN-VC: A deep-learning-based framework for any-to-any VC, focusing on disentangling linguistic content and target speaker characteristics.[1]
- OpenVoice V2: A large-scale, zero-shot VC model built upon YourTTS.[2]
- CosyVoice: A diffusion-based speech generation approach from Alibaba, featuring a VC variant.[3]
[1] H. J. Park, S. W. Yang, J. S. Kim, W. Shin, and S. W. Han, “TriAAN-VC: Triple adaptive attention normalization for any-to-any voice conversion,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[2] Z. Qin, W. Zhao, X. Yu, and X. Sun, “OpenVoice: Versatile instant voice cloning,” arXiv preprint arXiv:2312.01479, 2023.
[3] Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” arXiv preprint arXiv:2407.05407, 2024.