
Training Problems #44

Open
drx-code opened this issue Sep 14, 2024 · 1 comment

drx-code commented Sep 14, 2024
Thank you for your great work! I tried to train the model with the following command:

```
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=0 --master_addr=29.79.8.45 --master_port 9999 \
main_mar.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 \
--model mar_large --diffloss_d 3 --diffloss_w 1024 \
--epochs 400 --warmup_epochs 100 --batch_size 64 --lr 8.0e-4 --diffusion_batch_mul 4
```

on 4 nodes, each with 8 NVIDIA H20 GPUs. However, training is much slower than expected: 200 epochs on the ImageNet 256x256 dataset took more than 2 days. Are there any ways to accelerate training? Thank you.
By the way, the model cannot generate good samples after training for 175 epochs. Here are some results without CFG:
[attached sample images: 00108, 00164]
I would like to know whether these results are reasonable. Thank you!

LTH14 (Owner) commented Sep 14, 2024

I would recommend using cached VAE latents as described in the README. This significantly speeds up training in our experiments.
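For reference, a minimal sketch of that workflow. The `main_cache.py` script and the `--use_cached`/`--cached_path` flags are assumed from the repository README, and `${IMAGENET_PATH}`/`${CACHED_PATH}` are placeholders:

```
# 1) Cache the VAE latents once per dataset (script and flags assumed from the README).
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
main_cache.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 \
--batch_size 128 \
--data_path ${IMAGENET_PATH} --cached_path ${CACHED_PATH}

# 2) Train from the cached latents so the VAE encoder is not re-run on every image each epoch.
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=0 --master_addr=29.79.8.45 --master_port 9999 \
main_mar.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 \
--model mar_large --diffloss_d 3 --diffloss_w 1024 \
--epochs 400 --warmup_epochs 100 --batch_size 64 --lr 8.0e-4 --diffusion_batch_mul 4 \
--use_cached --cached_path ${CACHED_PATH}
```

The caching step only needs to run once; subsequent training runs read the precomputed latents from `${CACHED_PATH}` instead of encoding every image through the VAE at every step.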
