Doesn't work well in speech generation task. #35

Open
FacePoluke opened this issue Sep 11, 2024 · 10 comments

Comments

@FacePoluke

Thank you for your open-source work. I would like to ask you some questions. I tried to use DiffLoss for a speech generation task, adopting the next-token prediction approach, which corresponds to order=raster, direction=causal, #preds=1 in your paper. However, it did not converge well. Could you help me analyze what might be causing this issue? Thanks a lot!

@LTH14
Owner

LTH14 commented Sep 11, 2024

Thanks for your interest. Our model trained on ImageNet typically converges after 50k iterations without EMA, and 100k iterations with EMA. Have you trained your model for a sufficiently long time?

@FacePoluke
Author

> Thanks for your interest. Our model trained on ImageNet typically converge after 50k iterations without ema, and 100k iterations with ema. Have you trained your model for sufficiently long time?

Yes, I have trained for this long. May I ask whether this number also applies to convergence in AR mode (order=raster, direction=causal, #preds=1)?

@LTH14
Owner

LTH14 commented Sep 12, 2024

AR converges in a similar number of epochs as MAR.

@FacePoluke
Author

OK, thank you so much.

@zythenoob

I'm also trying to train MAR on an audio generation task. May I ask what your setting is, how long you trained, and how the model behaves?

@FacePoluke
Author

FacePoluke commented Sep 13, 2024

> I'm also trying to train MAR on audio generation task. May I ask what your setting is, how long did you train and how does the model behave?

I have trained for 300k steps, with each step containing 2k–5k speech frames. Currently, the model's output is mostly noise. How about you?

@zythenoob

> I'm also trying to train MAR on audio generation task. May I ask what your setting is, how long did you train and how does the model behave?
>
> I have trained for 300k steps, with each step containing 2k-5k speech frames. Currently, the model's output is mostly noise. How about you?

I trained a 1B model (bidirectional, random order) for almost 200k steps with 2k frames per sample. I'm getting mumbling voice, not noise, but definitely not speech. Which vocal feature/latent are you using (e.g., mel)?

@LTH14
Owner

LTH14 commented Sep 13, 2024

One experience I had is that DiffLoss with SimpleMLP has difficulty generating tokens with too high a dimensionality (e.g., 256), so you might want to check this in your implementation.
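A hypothetical illustration of one workaround for the dimensionality issue above: split each high-dimensional latent frame into several lower-dimensional tokens before feeding them to the model, at the cost of a longer sequence. The function names, the 128-dim frames, and the split factor are illustrative assumptions, not part of the MAR codebase.

```python
import numpy as np

def split_tokens(latents, n_splits):
    """Reshape (seq_len, dim) latents into (seq_len * n_splits, dim // n_splits) tokens."""
    seq_len, dim = latents.shape
    assert dim % n_splits == 0, "latent dim must be divisible by n_splits"
    return latents.reshape(seq_len * n_splits, dim // n_splits)

def merge_tokens(tokens, n_splits):
    """Inverse of split_tokens: recombine sub-tokens into full frames."""
    total, sub_dim = tokens.shape
    return tokens.reshape(total // n_splits, sub_dim * n_splits)

# 250 speech frames with 128-dim latents (EnCodec-like) become
# 1000 tokens of 32 dims, which a per-token MLP may model more easily.
frames = np.random.randn(250, 128)
tokens = split_tokens(frames, n_splits=4)
restored = merge_tokens(tokens, n_splits=4)
assert np.array_equal(restored, frames)  # lossless round trip
```

The trade-off is a 4x longer token sequence, so attention cost grows accordingly; whether that beats modeling the full 128-dim token depends on the setup.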

@Paulmzr

Paulmzr commented Oct 28, 2024

> One experience I had is that DiffLoss with SimpleMLP has difficulty generating a too high-dimensional token (e.g. 256), so you might want to check this in your implementation.

I have tried using the embeddings from EnCodec as in Naturalspeech2, which have a latent dimension of 128. I also only got mumbling voice after trying multiple variants (predicting x_0 instead of epsilon, using a clamp function in diffusion), even though the training curve seems to have converged. Do you have any findings?
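For reference, a minimal NumPy sketch of the two variants mentioned above: converting an epsilon prediction into an x_0 prediction, and clamping it. The function name and clamp range are illustrative; note that pixel diffusion clamps to [-1, 1], but EnCodec latents are not bounded that way, so a fixed clamp range is itself a guess and may hurt.

```python
import numpy as np

def predict_x0_from_eps(x_t, eps_hat, alpha_bar_t, clamp=None):
    """Invert x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps to recover x0.

    Optionally clamp the estimate to [-clamp, clamp], as is common
    for pixel-space diffusion models.
    """
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    if clamp is not None:
        x0_hat = np.clip(x0_hat, -clamp, clamp)
    return x0_hat

# Sanity check: with the true epsilon, the true x_0 is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(128)     # one 128-dim latent frame
eps = rng.standard_normal(128)
abar = 0.7                        # cumulative alpha at some timestep t
x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
recovered = predict_x0_from_eps(x_t, eps, abar)
assert np.allclose(recovered, x0)
```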

@LTH14
Owner

LTH14 commented Oct 28, 2024

You could consider sampling with the full 1000 diffusion steps to see whether the poor performance comes from timestep respacing.
