Fine tuning guide #128
Replies: 3 comments 6 replies
-
@IIEleven11 This is awesome, thank you! A few questions, since you seem knowledgeable:
-
Well, I can only speak from my experience here, but from what I could tell you might not be able to use a T4 for this. I haven't used a T4 though, so this could definitely be wrong. My first attempt was with a 30-minute dataset on the A6000, and that went at a reasonable speed, around 1.5 days. But my last attempt with the 1-hour dataset took a long time, about 3.5 to 4 days. You would probably have to drop the batch size to 2 and the max_len to 100-200, which is not exactly the best set of conditions to fine tune under. If we assume quality is the number one priority, it may not be worth it.

As for the ability to show expressive speech, yes, it does do that, and it is extremely fast at doing it. Compared to Tortoise there's absolutely no comparison; in my opinion it renders Tortoise obsolete at this point. I just got done with my last fine-tuning run a few hours ago and I was playing with some of the parameters (diffusion, alpha, beta, etc.). I need to keep testing things out, but it sounds very good so far.

I am in the process of using it as a base model for an RVC model to be mapped on top of. I have done this with Coqui XTTSv2 and the results were the best I've ever heard. It cleans up a lot of the issues you get when two models of the same speaker are used for inference.

I can't speak to the accents or the differences between multi and single speaker. From the information I gathered, fine tuning with the multi-speaker option was the more viable option, but there is some unclear instruction within the original repo:

"Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration config_ft.yml finetunes on LJSpeech with 1 hour of speech data"

The demo asks for the LibriTTS checkpoint and the config points to that as well, so I am unsure where LJSpeech fits in. My guess is that by default it's set up to fine tune on the LJSpeech dataset starting from the LibriTTS checkpoint and its speaker/s. Hopefully the repo owners can shed some light on that.
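For reference, the low-VRAM settings mentioned above (batch size 2, max_len 100-200) would be set in config_ft.yml. A minimal sketch, assuming the `batch_size` and `max_len` key names from the repo's default config (verify against your local copy; the values here are illustrative, not recommendations):

```yaml
# Fragment of config_ft.yml -- illustrative values only.
# Key names assumed from the repo's default fine-tuning config.
batch_size: 2   # dropped from the default to fit a smaller GPU
max_len: 100    # shorter max segment length, per the 100-200 range above
```

Note the trade-off discussed above: these settings make training fit in memory, but they are not ideal conditions if output quality is the top priority.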
-
Thanks for your guide. I have added it to the README.
-
I wrote up the process I used to fine tune, including scripts to segment, transcribe, and create the required datasets. You can find it here: https://github.com/IIEleven11/StyleTTS2FineTune.
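Once the audio is segmented and transcribed, the last step is emitting the train/val list files the trainer reads. A minimal sketch of that step; the pipe-separated `audio|text|speaker` layout is an assumption based on the example lists shipped with the repo, so check its Data/train_list.txt for the exact expected text form (raw vs. phonemized) before using this:

```python
import random

def build_list_lines(entries, val_fraction=0.1, seed=42):
    """Split (wav_path, text, speaker_id) tuples into train/val list lines.

    ASSUMPTION: each line is "audio|text|speaker", matching the example
    lists in the StyleTTS2 repo; verify against Data/train_list.txt.
    """
    lines = [f"{wav}|{text}|{speaker}" for wav, text, speaker in entries]
    rng = random.Random(seed)          # seeded so the split is reproducible
    rng.shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]  # (train_lines, val_lines)
```

Usage would be writing `"\n".join(train_lines)` and `"\n".join(val_lines)` to the train/val list paths your config points at.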