Doesn't work well in speech generation task. #35

Open
FacePoluke opened this issue Sep 11, 2024 · 10 comments

Comments

@FacePoluke

Thank you for your open-source work. I would like to ask you some questions. I tried to use DiffLoss for a speech generation task, adopting the next-token prediction approach, which corresponds to order=raster, direction=causal, #preds=1 in your paper. However, it did not converge well. Could you help me analyze what might be causing this issue? Thanks a lot!

@LTH14
Owner

LTH14 commented Sep 11, 2024

Thanks for your interest. Our model trained on ImageNet typically converges after 50k iterations without EMA, and 100k iterations with EMA. Have you trained your model for a sufficiently long time?

@FacePoluke
Author

> Thanks for your interest. Our model trained on ImageNet typically converge after 50k iterations without ema, and 100k iterations with ema. Have you trained your model for sufficiently long time?

Yes, I have trained for this long. May I ask whether this number also applies to convergence in AR mode (order=raster, direction=causal, #preds=1)?

@LTH14
Owner

LTH14 commented Sep 12, 2024

AR converges in a similar number of epochs as MAR.

@FacePoluke
Author

OK, thank you so much.

@zythenoob

I'm also trying to train MAR on an audio generation task. May I ask what your setting is, how long you trained, and how the model behaves?

@FacePoluke
Author

FacePoluke commented Sep 13, 2024

> I'm also trying to train MAR on audio generation task. May I ask what your setting is, how long did you train and how does the model behave?

I have trained for 300k steps, with each step containing 2k–5k speech frames. Currently, the model's output is mostly noise. How about you?

@zythenoob

> I'm also trying to train MAR on audio generation task. May I ask what your setting is, how long did you train and how does the model behave?
>
> I have trained for 300k steps, with each step containing 2k-5k speech frames. Currently, the model's output is mostly noise. How about you?

I trained a 1B model (bidirectional, random order) for almost 200k steps with 2k frames per sample. I'm getting mumbling voice, not noise, but definitely not speech. Which vocal feature/latent are you using (e.g., mel)?

@LTH14
Owner

LTH14 commented Sep 13, 2024

One experience I had is that DiffLoss with SimpleMLP has difficulty generating tokens with too high a dimensionality (e.g., 256), so you might want to check this in your implementation.
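A hypothetical illustration of one workaround for the dimensionality issue above: split each high-dimensional latent frame into several lower-dimensional tokens before feeding them to the model, at the cost of a longer sequence. The function names, the 128-dim frames, and the split factor are illustrative assumptions, not part of the MAR codebase.

```python
import numpy as np

def split_tokens(latents, n_splits):
    """Reshape (seq_len, dim) latents into (seq_len * n_splits, dim // n_splits) tokens."""
    seq_len, dim = latents.shape
    assert dim % n_splits == 0, "latent dim must be divisible by n_splits"
    return latents.reshape(seq_len * n_splits, dim // n_splits)

def merge_tokens(tokens, n_splits):
    """Inverse of split_tokens: recombine sub-tokens into full frames."""
    total, sub_dim = tokens.shape
    return tokens.reshape(total // n_splits, sub_dim * n_splits)

# 250 speech frames with 128-dim latents (EnCodec-like) become
# 1000 tokens of 32 dims, which a per-token MLP may model more easily.
frames = np.random.randn(250, 128)
tokens = split_tokens(frames, n_splits=4)
restored = merge_tokens(tokens, n_splits=4)
assert np.array_equal(restored, frames)  # lossless round trip
```

The trade-off is a 4x longer token sequence, so attention cost grows accordingly; whether that beats modeling the full 128-dim token depends on the setup.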

@Paulmzr

Paulmzr commented Oct 28, 2024

> One experience I had is that DiffLoss with SimpleMLP has difficulty generating a too high-dimensional token (e.g. 256), so you might want to check this in your implementation.

I have tried using the embeddings from EnCodec as in Naturalspeech2, which have a latent dimension of 128. I also only got mumbling voice after trying multiple variants (predicting x_0 instead of epsilon, using a clamp function in diffusion), even though the training curve seems to have converged. Do you have any findings?
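For reference, a minimal NumPy sketch of the two variants mentioned above: converting an epsilon prediction into an x_0 prediction, and clamping it. The function name and clamp range are illustrative; note that pixel diffusion clamps to [-1, 1], but EnCodec latents are not bounded that way, so a fixed clamp range is itself a guess and may hurt.

```python
import numpy as np

def predict_x0_from_eps(x_t, eps_hat, alpha_bar_t, clamp=None):
    """Invert x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps to recover x0.

    Optionally clamp the estimate to [-clamp, clamp], as is common
    for pixel-space diffusion models.
    """
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    if clamp is not None:
        x0_hat = np.clip(x0_hat, -clamp, clamp)
    return x0_hat

# Sanity check: with the true epsilon, the true x_0 is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(128)     # one 128-dim latent frame
eps = rng.standard_normal(128)
abar = 0.7                        # cumulative alpha at some timestep t
x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
recovered = predict_x0_from_eps(x_t, eps, abar)
assert np.allclose(recovered, x0)
```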

@LTH14
Owner

LTH14 commented Oct 28, 2024

You could consider sampling with the full 1000 diffusion steps to see whether the poor performance comes from timestep respacing.
