Doesn't work well in speech generation task. #35

Thank you for your open-source work. I would like to ask you some questions. I tried to use DiffLoss for a speech generation task, adopting the next-token prediction approach; this corresponds to order=raster, direction=causal, #preds=1 in your paper. However, it did not converge well. Could you help me analyze what might be causing this issue? Thanks a lot!
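For concreteness, here is a minimal, hypothetical sketch of the setup described above: a causal Transformer produces a per-position conditioning vector, and a DiffLoss-style diffusion head predicts the next continuous token from it. All module names, dimensions, and the DiffLoss interface are assumptions for illustration, not the repo's exact API.

```python
import torch
import torch.nn as nn

class CausalDiffLossSketch(nn.Module):
    """Next-token prediction (order=raster, direction=causal, #preds=1)."""
    def __init__(self, token_dim=16, width=1024, depth=12, heads=16):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, width)
        layer = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        # self.diff_head = DiffLoss(target_channels=token_dim, z_channels=width)
        # (hypothetical: a SimpleMLP denoiser conditioned on z)

    def forward(self, tokens):                 # tokens: (B, T, token_dim)
        x = self.in_proj(tokens[:, :-1])       # condition on tokens <= t
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1]).to(x.device)
        z = self.backbone(x, mask=mask)        # causal per-position conditioning
        target = tokens[:, 1:]                 # next-token targets (#preds=1)
        return z, target                       # loss would be diff_head(z, target)
```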
Comments
Thanks for your interest. Our model trained on ImageNet typically converges after 50k iterations without EMA, and 100k iterations with EMA. Have you trained your model for a sufficiently long time?
Yes, I have trained for that long. May I ask whether that iteration count also applies to AR mode (order=raster, direction=causal, #preds=1)?
AR converges in a similar number of epochs as MAR.
OK, thank you so much.
I'm also trying to train MAR on an audio generation task. May I ask what your setting is, how long you trained, and how the model behaves?
I have trained for 300k steps, with each step containing 2k-5k speech frames. Currently, the model's output is mostly noise. How about you?
I trained a 1B model (bidirectional, random order) for almost 200k steps with 2k frames per sample. I'm getting a mumbling voice, not noise, but definitely not speech. Which vocal feature/latent are you using (e.g., mel spectrograms)?
One thing I found is that DiffLoss with SimpleMLP has difficulty generating tokens with too high a dimensionality (e.g., 256), so you might want to check this in your implementation.
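If it helps, here is a minimal sketch of that check, assuming your latents are in the ~128-256 dimensional range: add a learned linear bottleneck so the diffusion head is trained on low-dimensional tokens, and map its samples back afterwards. The class and dimension names are illustrative, not from the repo.

```python
import torch.nn as nn

class TokenBottleneck(nn.Module):
    """Learned down/up projection around a DiffLoss-style head (hypothetical)."""
    def __init__(self, latent_dim=128, token_dim=16):
        super().__init__()
        self.down = nn.Linear(latent_dim, token_dim)  # DiffLoss targets live here
        self.up = nn.Linear(token_dim, latent_dim)    # map sampled tokens back

    def encode(self, z):    # z: (B, T, latent_dim) continuous latents
        return self.down(z)

    def decode(self, tok):  # tok: (B, T, token_dim) diffusion-head samples
        return self.up(tok)
```

If quality improves with the low-dimensional tokens, the per-token dimensionality, rather than the backbone, was likely the bottleneck.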
I have tried using the embeddings from EnCodec as in NaturalSpeech 2, which have a latent dimension of 128. I also only got mumbling voice after trying multiple variants (predicting x_0 instead of epsilon, clamping in the diffusion sampler), even though the training curves seem to have converged. Do you have any findings?
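For reference, a hedged sketch of those two variants, assuming a DiT/guided-diffusion-style `create_diffusion` helper (argument names may differ in your copy); `denoiser`, `shape`, and `cond` are placeholders:

```python
from diffusion import create_diffusion  # DiT/MAR-style helper (assumed)

# x_0-prediction: the denoiser outputs x_0 directly instead of epsilon.
diffusion = create_diffusion(
    timestep_respacing="",   # empty string = full 1000-step schedule
    predict_xstart=True,
)

# clip_denoised clamps the predicted x_0 to [-1, 1] at each sampling step;
# this only helps if your EnCodec latents are actually normalized to that range.
samples = diffusion.p_sample_loop(
    denoiser, shape, clip_denoised=True, model_kwargs={"c": cond}
)
```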
You could consider using 1000 diffusion steps to see whether the bad performance comes from timestep respacing.
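A sketch of that check, under the same assumed `create_diffusion` helper as above: sample once with a respaced schedule and once with all 1000 steps, then compare the decoded audio.

```python
from diffusion import create_diffusion  # assumed helper, as above

respaced = create_diffusion(timestep_respacing="100")   # e.g. 100 sampling steps
full = create_diffusion(timestep_respacing="1000")      # the full schedule

for diff, tag in [(respaced, "100 steps"), (full, "1000 steps")]:
    x = diff.p_sample_loop(denoiser, shape, model_kwargs={"c": cond})
    # decode x with your vocoder / EnCodec decoder and listen
    print(tag, x.shape)
```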