
Scaling diffusion MLP for learning higher dimensional VAE feature #55

Open · zythenoob opened this issue Sep 27, 2024 · 7 comments

@zythenoob
In the paper, the dimension of the VAE latent is 8 or 16, and the experiments cover MLPs of 6-12 blocks. I experimented with an 8-block MLP learning 64-dim and 1024-dim audio data: while the model struggled to learn the 1024-dim data (correct sound but noisy), it performed OK on 64-dim.

Although you mention that the size of the MLP contributes only marginally to the final performance, I wonder whether scaling the MLP can lead to a drastic change in performance when the model is learning a higher-dimensional target? Alternatively, what design choices for the diffusion part, other than the MLP, may benefit high-dimensional learning? Thank you!

@LTH14
Owner

LTH14 commented Sep 27, 2024

Yes, your observation is correct -- if the target data is too high-dimensional, it is hard for a simple MLP to model it. In our paper, the token dimension is always 16 (4x4 for KL-8 and 16 for KL-16). If the dimension of the target is too large (as a reference, a 16x16x3 pixel patch is 768 dimensions), you might need to design a more powerful head for the DiffLoss to model it (for example, model a 16x16x3 patch using a convnet).
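
For illustration only, a rough sketch of such a convnet head (class names, shapes, and the conditioning scheme here are hypothetical, not the DiffLoss code in this repo) could look like:

```python
# Hypothetical sketch of a convolutional denoising head for a patch-shaped target
# (e.g. a 16x16x3 pixel patch, i.e. 768 dims when flattened). Names and sizes are
# illustrative; the real DiffLoss in this repo uses a SimpleMLP instead.
import torch
import torch.nn as nn

class ConvDenoisingHead(nn.Module):
    def __init__(self, patch_size=16, channels=3, cond_dim=768, hidden=128, num_timesteps=1000):
        super().__init__()
        self.patch_size, self.channels = patch_size, channels
        self.t_embed = nn.Embedding(num_timesteps, hidden)   # diffusion timestep embedding
        self.c_embed = nn.Linear(cond_dim, hidden)            # AR condition vector z
        self.in_conv = nn.Conv2d(channels, hidden, 3, padding=1)
        self.mid = nn.Sequential(
            nn.SiLU(), nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.SiLU(), nn.Conv2d(hidden, hidden, 3, padding=1),
        )
        self.out_conv = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x_t, t, z):
        # x_t: (B, patch_dim) flattened noisy patch; t: (B,) timesteps; z: (B, cond_dim)
        B = x_t.shape[0]
        h = self.in_conv(x_t.view(B, self.channels, self.patch_size, self.patch_size))
        # broadcast timestep + condition embeddings over the spatial map
        cond = (self.t_embed(t) + self.c_embed(z)).view(B, -1, 1, 1)
        eps = self.out_conv(self.mid(h + cond))                # predicted noise
        return eps.flatten(1)                                  # back to (B, patch_dim)
```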

@MikeWangWZHL

Hi! I observe a potentially similar issue:
I am trying to use this diffusion loss to model a 128-dim latent space (from a custom continuous tokenizer) with simple AR modeling. I find that during inference sampling, the estimation of "pred_xstart" results in very large values. Do you have any insights on why?

@LTH14
Owner

LTH14 commented Sep 27, 2024

@MikeWangWZHL I once tried 64-dim and it worked fine -- I never tried 128-dim, however. If pred_xstart is very large, I would suggest either training for longer or using 1000 steps during inference to see whether the problem persists.
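
For reference, switching to full-step sampling is just a matter of the respacing string used when building the generation diffusion. A minimal sketch, assuming the DiT-style create_diffusion helper that this repo adopts (import path and variable names are illustrative):

```python
# Sketch assuming the DiT-style diffusion helper; import path and names are
# illustrative, not necessarily the exact ones in this repo.
from diffusion import create_diffusion

# shortened schedule, e.g. 100 sampling steps
gen_diffusion_fast = create_diffusion(timestep_respacing="100")

# full 1000-step DDPM sampling, as suggested above when pred_xstart blows up
gen_diffusion_full = create_diffusion(timestep_respacing="1000")
```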

@MikeWangWZHL

Thanks for the quick reply! One thing I find is that the beta for gen_diffusion at the last timestep (e.g., t=99) is 9.99989999e-01, which is not clamped to, say, 0.999; this contributes to a very large coefficient in _predict_xstart_from_eps. Is this expected?
Thanks in advance!

@LTH14
Owner

LTH14 commented Sep 27, 2024

We directly adopted the diffusion code from DiT, so I didn't look into this carefully. You could try clamping it to see whether it improves stability.
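
To make the effect concrete, here is a self-contained NumPy sketch (not the repo's code; the cosine alpha-bar schedule and the spacing of kept timesteps are used purely for illustration) of how respacing a clipped schedule to 100 steps yields a last beta above 0.999, and how clamping the respaced betas shrinks the sqrt(1 / alpha_bar_t) coefficient that _predict_xstart_from_eps multiplies x_t by:

```python
import numpy as np

T = 1000

def alpha_bar(t):
    # cosine schedule from Nichol & Dhariwal, used here only for illustration
    return np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2

# per-step betas, each clipped at 0.999 as in the original diffusion code
betas = np.array([min(1 - alpha_bar((i + 1) / T) / alpha_bar(i / T), 0.999) for i in range(T)])
alphas_cumprod = np.cumprod(1 - betas)

# respace to 100 steps the way SpacedDiffusion does: rebuild betas from the
# cumulative alphas at the kept timesteps (the spacing here is illustrative)
kept = np.arange(9, T, 10)
new_betas, last = [], 1.0
for ac in alphas_cumprod[kept]:
    new_betas.append(1 - ac / last)
    last = ac
new_betas = np.array(new_betas)

print(new_betas[-1])                     # ~0.99999, i.e. above the 0.999 clip
coef = np.sqrt(1.0 / np.cumprod(1 - new_betas))
print(coef[-1])                          # large multiplier on x_t at the last step

# what "clamp it" would look like: cap the respaced betas at 0.999 and rebuild
clamped = np.clip(new_betas, 0.0, 0.999)
print(np.sqrt(1.0 / np.cumprod(1 - clamped))[-1])   # noticeably smaller
```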

@Paulmzr

Paulmzr commented Oct 28, 2024

I also ran into the very large "pred_xstart" values that @MikeWangWZHL described. I tried applying the "clip" option to avoid the large x_0 prediction, but generation still failed.
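
(For context, the "clip" option here is presumably the clip_denoised flag of the guided-diffusion/DiT sampler this repo adopts, which clamps the predicted x_0 to [-1, 1] at every step. That range is meant for pixel data normalized to [-1, 1]; if the tokenizer latents live on a different scale, clamping can distort them rather than stabilize sampling. A rough sketch of how it is passed, with illustrative variable names:)

```python
# Sketch only: clip_denoised clamps pred_xstart to [-1, 1] each sampling step,
# which is appropriate for [-1, 1]-normalized targets but not arbitrary latents.
samples = gen_diffusion.p_sample_loop(
    denoise_net.forward,            # the DiffLoss denoising network (name illustrative)
    noise.shape,
    noise,
    clip_denoised=True,
    model_kwargs=dict(c=condition), # AR condition (name illustrative)
)
```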

@darkliang

In my case, if x_start is very large, using 1000 DDPM inference steps helps.
