XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Published in INTERSPEECH 2024

Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox have explored multilingual ZS-TTS, they are limited to just a few high- or medium-resource languages. We propose and make publicly available the XTTS system, which builds upon the Tortoise model with several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained on 16 languages and achieved state-of-the-art results in most of them.

Recommended citation: Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber. (2024). "XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model." INTERSPEECH 2024.

Kelly J. Davis
