Fine tuning guide #128
Replies: 3 comments 6 replies
-
@IIEleven11 This is awesome, thank you! A few questions, since you seem knowledgeable:
-
Well, I can only speak from my experience here, but from what I could tell you might not be able to use a T4 for this. I haven't used a T4 though, so this could definitely be wrong. My first attempt was with a 30-minute dataset on the A6000, and that went at a reasonable speed, around 1.5 days. But my last attempt with the 1-hour dataset took a long time, about 3.5 to 4 days. You would probably have to drop the batch size to 2 and the max_len to 100-200, which is not exactly the best set of conditions to fine tune under. If we assume quality is the number one priority, it may not be worth it.

As for the ability to show expressive speech, yes, it does do that, and it is extremely fast at doing it. Compared to Tortoise there's absolutely no comparison; in my opinion it renders Tortoise obsolete at this point. I just got done with my last fine-tuning run a few hours ago and I was playing with some of the parameters (diffusion, alpha, beta, etc.). I need to keep testing things out, but it sounds very good so far.

I am in the process of using it as a base model for an RVC model to be mapped on top of. I have done this with Coqui XTTSv2 and the results were the best I've ever heard. It cleans up a lot of the issues you get when two models of the same speaker are used for inference.

I can't speak to the accents or the differences between multi and single speaker. From the information I gathered, fine tuning with the multi-speaker option was the more viable option, but there is some unclear instruction within the original repo:

"Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration config_ft.yml finetunes on LJSpeech with 1 hour of speech data"

The demo asks for the LibriTTS checkpoint and the config points to that as well, so I am unsure where LJSpeech fits in. My guess is that by default it's set up to fine tune on the LJSpeech dataset starting from the LibriTTS checkpoint and its speaker/s. Hopefully the repo owners can shed some light on that.
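For reference, the low-VRAM settings mentioned above (batch size 2, max_len 100-200) would be set in config_ft.yml. A minimal sketch, assuming the `batch_size` and `max_len` key names from the repo's default config (verify against your local copy; the values here are illustrative, not recommendations):

```yaml
# Fragment of config_ft.yml -- illustrative values only.
# Key names assumed from the repo's default fine-tuning config.
batch_size: 2   # dropped from the default to fit a smaller GPU
max_len: 100    # shorter max segment length, per the 100-200 range above
```

Note the trade-off discussed above: these settings make training fit in memory, but they are not ideal conditions if output quality is the top priority.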
-
Thanks for your guide. I have added it to the README.
-
I wrote up the process I used to fine tune, including scripts to segment, transcribe, and create the required datasets. You can find it here: https://github.com/IIEleven11/StyleTTS2FineTune.
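Once the audio is segmented and transcribed, the last step is emitting the train/val list files the trainer reads. A minimal sketch of that step; the pipe-separated `audio|text|speaker` layout is an assumption based on the example lists shipped with the repo, so check its Data/train_list.txt for the exact expected text form (raw vs. phonemized) before using this:

```python
import random

def build_list_lines(entries, val_fraction=0.1, seed=42):
    """Split (wav_path, text, speaker_id) tuples into train/val list lines.

    ASSUMPTION: each line is "audio|text|speaker", matching the example
    lists in the StyleTTS2 repo; verify against Data/train_list.txt.
    """
    lines = [f"{wav}|{text}|{speaker}" for wav, text, speaker in entries]
    rng = random.Random(seed)          # seeded so the split is reproducible
    rng.shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]  # (train_lines, val_lines)
```

Usage would be writing `"\n".join(train_lines)` and `"\n".join(val_lines)` to the train/val list paths your config points at.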