Scaling the diffusion MLP for learning higher-dimensional VAE features #55
Yes, your observation is correct -- if the target data is too high-dimensional, it is hard for a simple MLP to model it. In our paper, the token dimension is always 16 (4x4 for KL-8 and 16 for KL-16). If the dimension of the target is too large (as a reference, a 16x16x3 pixel patch is 768 dimensions), you might need to design a more powerful head for the DiffLoss to model it (for example, model a 16x16x3 patch using a convnet).
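For concreteness, here is a rough sketch of what such a conv-based denoising head could look like. This is not code from this repo; the class name, argument names, and the conditioning scheme (adding a projected timestep + token embedding to the feature maps) are all assumptions, kept minimal just to illustrate replacing the MLP with a spatial head for a 16x16x3 patch target.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    # Standard sinusoidal timestep embedding (as in DDPM/DiT-style code).
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half).to(t.device)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class ConvDenoiseHead(nn.Module):
    """Hypothetical conv head: denoises a 16x16x3 patch (768-dim target)
    conditioned on the AR token z and the diffusion timestep t."""
    def __init__(self, patch_hw=16, in_ch=3, cond_dim=1024, width=128):
        super().__init__()
        self.patch_hw, self.in_ch, self.t_dim = patch_hw, in_ch, width
        self.cond = nn.Sequential(
            nn.Linear(cond_dim + width, width), nn.SiLU(), nn.Linear(width, width))
        self.inp = nn.Conv2d(in_ch, width, 3, padding=1)
        self.mid = nn.Sequential(
            nn.SiLU(), nn.Conv2d(width, width, 3, padding=1),
            nn.SiLU(), nn.Conv2d(width, width, 3, padding=1))
        self.out = nn.Conv2d(width, in_ch, 3, padding=1)

    def forward(self, x_t, t, z):
        # x_t: (B, 768) noisy target, t: (B,) timesteps, z: (B, cond_dim) AR condition.
        B = x_t.shape[0]
        x = x_t.view(B, self.in_ch, self.patch_hw, self.patch_hw)
        emb = self.cond(torch.cat([z, timestep_embedding(t, self.t_dim)], dim=-1))
        h = self.inp(x) + emb[:, :, None, None]   # broadcast conditioning over space
        h = self.mid(h)
        return self.out(h).reshape(B, -1)         # prediction with the same shape as x_t
```

Such a head would plug into the same DiffLoss training loop (predict the noise for x_t given t and the AR condition z); only the network architecture changes.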
Hi! I observe some potentially similar issues:
@MikeWangWZHL I once tried 64-dim and it worked fine -- I never tried 128-dim, however. If pred_xstart is very large, I would suggest either training for longer or using 1000 steps during inference to see whether the problem persists.
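For reference, a minimal sketch of using the full 1000-step DDPM schedule at inference, assuming the DiT-style diffusion helpers this repo adopts; the exact constructor arguments and noise schedule here are assumptions, so check them against the diffusion module in your checkout.

```python
# Sketch only -- assumes the DiT/ADM-style create_diffusion helper.
# timestep_respacing controls how many steps are used at sampling time;
# "1000" (or an empty string) keeps the full DDPM schedule instead of a
# respaced shorter one such as "100".
from diffusion import create_diffusion  # module path as in the DiT codebase (assumption)

gen_diffusion = create_diffusion(timestep_respacing="1000", noise_schedule="cosine")
```

If your sampling script exposes a flag for the number of sampling steps, setting it to 1000 should have the same effect.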
Thanks for the quick reply; one thing I find is that the beta for the
We directly adopted the diffusion code from DiT, so I didn't look into this carefully. You could try to clamp it to see whether it improves stability.
I also found this. I also tried applying the "clip" option to avoid large x_0 predictions, but the generation still failed.
In my case, if x_start is very large, using 1000 DDPM inference steps helps.
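Related to the "clip" option: in the DiT/ADM-style sampling code, clip_denoised clamps the predicted x_0 to [-1, 1], which is usually too tight for VAE or audio latents. A possible alternative, sketched under the assumption that the sampling loop accepts a denoised_fn hook (as in the ADM code DiT inherits from), is to clamp x_0 to a range matched to your latent statistics instead:

```python
import torch

def clamp_xstart(x0_pred: torch.Tensor) -> torch.Tensor:
    # Hypothetical bounds -- pick them from the empirical range of your latents.
    return x0_pred.clamp(-5.0, 5.0)

# Assumed call signature (ADM/DiT-style GaussianDiffusion); verify against your copy:
# samples = gen_diffusion.p_sample_loop(
#     model, shape, clip_denoised=False, denoised_fn=clamp_xstart, model_kwargs=model_kwargs)
```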
In the paper, the dim of the VAE latent is 8 or 16, and the experiments cover MLPs of 6-12 blocks. I experimented with an 8-block MLP learning 64-dim and 1024-dim audio data: while the model struggled to learn the 1024-dim target (correct sound but noisy), it performed OK on 64-dim.
Although you mentioned that the size of the MLP contributes only marginally to the final performance, I wonder whether, when the model is learning a higher-dimensional target, scaling the MLP can lead to a drastic change in performance? Or alternatively, what design choices for the diffusion part other than the MLP may benefit high-dimensional learning? Thank you!