Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3.
As the first open-source TTS model to combine flow matching and DiT, StableTTS is a fast and lightweight TTS model for Chinese and English speech generation. It has only 10M parameters.
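For readers new to flow matching, here is a minimal NumPy sketch of the general conditional flow-matching training recipe (an illustration of the idea, not this repo's exact implementation): sample a point on a straight-line path between noise and data, and train the model to predict the path's velocity.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Linear interpolation path and its target velocity.

    x0: noise sample; x1: data sample (e.g. one mel frame); t in [0, 1].
    The network is trained to predict v from (x_t, t); at inference an ODE
    solver integrates the predicted velocity from noise toward data.
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    v = x1 - x0                     # constant target velocity along the path
    return x_t, v

rng = np.random.default_rng(0)
x0 = rng.standard_normal(80)   # e.g. an 80-bin mel frame of noise
x1 = rng.standard_normal(80)   # the paired "data" frame
x_t, v = cfm_pair(x0, x1, 0.5)
```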
✨ Hugging Face demo: chinese_version english_version
We provide pretrained models ready for inference, finetuning, and the web UI. Simply download and place the models in the `./checkpoints` directory to get started.
Model Name | Task Details | Dataset | Download Link |
---|---|---|---|
StableTTS | text to mel | 400h English | 🤗 |
StableTTS | text to mel | 100h Chinese | 🤗 |
Vocos | mel to wav | 2k hours English + Chinese + Japanese | 🤗 |
Better pretrained models and multilingual models are coming soon...
- Set up PyTorch: Follow the official PyTorch guide to install PyTorch and torchaudio. We recommend using the latest version for optimal performance.
- Install dependencies: Run the following command to install the required Python packages:

      pip install -r requirements.txt
For detailed inference instructions, please refer to `inference.ipynb`.
We also provide a web UI based on Gradio; please refer to `webui.py`.
Training your models with StableTTS is designed to be straightforward and efficient. Here’s how to get started:
Note: Since we use a reference encoder to capture speaker identity during training, there is no need for speaker IDs in multi-speaker synthesis and training.
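To illustrate why no speaker ID is needed, here is a toy NumPy sketch of the reference-encoder idea (not the repo's actual mel-style encoder; the projection weights are hypothetical stand-ins for learned layers): a variable-length reference mel-spectrogram is mapped to a fixed-size utterance embedding, so speaker identity is read from audio rather than looked up in an ID table.

```python
import numpy as np

def toy_reference_encoder(mel, w):
    """Toy reference encoder: per-frame projection + mean pooling.

    mel: (n_frames, n_mels) reference spectrogram of any length.
    w:   (n_mels, d) hypothetical projection weights.
    Returns a fixed (d,) embedding regardless of n_frames, which is why
    no explicit speaker ID is required during training or synthesis.
    """
    h = np.tanh(mel @ w)     # per-frame features
    return h.mean(axis=0)    # temporal pooling -> fixed-size vector

rng = np.random.default_rng(0)
w = rng.standard_normal((80, 64)) * 0.1
emb_short = toy_reference_encoder(rng.standard_normal((50, 80)), w)
emb_long = toy_reference_encoder(rng.standard_normal((500, 80)), w)
```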
- Generate text and audio pairs: Generate the text and audio pair filelist as `./filelists/example.txt`. Recipes for some open-source datasets can be found in `./recipes`.
- Run preprocessing: Adjust the `DataConfig` in `preprocess.py` to set your input and output paths, then run the script. This will process the audio and text according to your filelist, outputting a JSON file with paths to mel features and phonemes. Note: Make sure to set `chinese=False` in `DataConfig` for English text processing.
- Adjust the training configuration: In `config.py`, modify `TrainConfig` to set your filelist path and adjust training parameters as needed.
- Start the training process: Launch `train.py` to start training your model.

Note: For finetuning, download the pretrained model and place it in the `model_save_path` directory specified in `TrainConfig`. The training script will automatically detect and load the pretrained checkpoint.

Feel free to explore and modify the hyperparameters in `config.py`!
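The "drop a checkpoint in `model_save_path` and it gets picked up" behaviour can be pictured with a sketch like the one below (assumptions: this is not the repo's exact code, and the `.pt` extension and `find_latest_checkpoint` helper are placeholders for whatever naming the real script uses):

```python
import glob
import os

def find_latest_checkpoint(model_save_path):
    """Return the most recently modified .pt file in model_save_path, or None.

    If a pretrained checkpoint was placed in the directory, it is returned
    and can be loaded for finetuning; otherwise training starts fresh.
    """
    ckpts = glob.glob(os.path.join(model_save_path, "*.pt"))
    if not ckpts:
        return None  # no checkpoint found: fresh training run
    return max(ckpts, key=os.path.getmtime)  # finetune / resume from this
```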
- We use the Diffusion Convolution Transformer block from HierSpeech++, a combination of the original DiT and the FFT (Feed-Forward Transformer from FastSpeech) for better prosody.
- In the flow-matching decoder, we add a FiLM layer before the DiT block to condition the timestep embedding into the model. We also add three ConvNeXt blocks before the DiT, which we found helps with model convergence and sound quality.
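FiLM (feature-wise linear modulation) conditioning can be sketched as follows; the projection weights here are hypothetical stand-ins for learned layers, and the real decoder's shapes may differ:

```python
import numpy as np

def film(x, t_emb, w_scale, w_shift):
    """Feature-wise Linear Modulation: per-channel scale and shift.

    x:     (n_frames, d) hidden features entering the DiT block.
    t_emb: (k,) timestep embedding.
    The timestep embedding is projected to a per-channel scale (gamma)
    and shift (beta), applied to every frame: y = gamma * x + beta.
    """
    gamma = t_emb @ w_scale   # (d,) per-channel scale
    beta = t_emb @ w_shift    # (d,) per-channel shift
    return gamma * x + beta   # broadcasts over the frame axis

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 256))      # 100 frames, 256 channels
t_emb = rng.standard_normal(32)          # timestep embedding
w_scale = rng.standard_normal((32, 256)) * 0.02
w_shift = rng.standard_normal((32, 256)) * 0.02
y = film(x, t_emb, w_scale, w_shift)
```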
The development of our models heavily relies on insights and code from various projects. We express our heartfelt thanks to the creators of the following:
Matcha TTS: essential flow-matching code.
Grad TTS: diffusion model structure.
Stable Diffusion 3: the idea of combining flow matching and DiT.
VITS: code style, MAS insights, and DistributedBucketSampler.
plowtts-pytorch: MAS code used in training.
Bert-VITS2: numba version of MAS and modern PyTorch VITS code.
fish-speech: dataclass usage and mel-spectrogram transforms using torchaudio.
gpt-sovits: mel-style encoder for voice cloning.
diffsinger: Chinese three-section phoneme scheme for Chinese g2p.
coqui xtts: Gradio web UI.
- Release pretrained models.
- Provide detailed finetuning instructions.
- Support Japanese language.
- User-friendly preprocessing and inference scripts.
- Enhance documentation and citations.
- Add a Chinese version of the README.
- Release multilingual checkpoint.
Any organization or individual is prohibited from using any technology in this repo to generate or edit someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this provision, you may be in violation of copyright laws.