
Training Problems #44

Open
drx-code opened this issue Sep 14, 2024 · 1 comment

drx-code commented Sep 14, 2024
Thank you for your great work! I tried to train the model with the following command:

```
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=0 --master_addr=29.79.8.45 --master_port 9999 \
main_mar.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 \
--model mar_large --diffloss_d 3 --diffloss_w 1024 \
--epochs 400 --warmup_epochs 100 --batch_size 64 --lr 8.0e-4 --diffusion_batch_mul 4
```

on 4 nodes, each with 8 NVIDIA H20 GPUs. However, training is much slower than expected: 200 epochs on the ImageNet 256x256 dataset took more than 2 days. Are there any ways to accelerate training? Thank you.
By the way, the model cannot generate good samples after training for 175 epochs. Here are some results without CFG:
[attached sample images: 00108, 00164]
I would like to know whether these results are reasonable. Thank you!

LTH14 (Owner) commented Sep 14, 2024

I would recommend using cached VAE latents as described in the README. This significantly speeds up training in our experiments.
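For reference, a minimal sketch of that workflow. The `main_cache.py` script and the `--use_cached`/`--cached_path` flags are assumed from the repository README, and `${IMAGENET_PATH}`/`${CACHED_PATH}` are placeholders:

```
# 1) Cache the VAE latents once per dataset (script and flags assumed from the README).
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
main_cache.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 \
--batch_size 128 \
--data_path ${IMAGENET_PATH} --cached_path ${CACHED_PATH}

# 2) Train from the cached latents so the VAE encoder is not re-run on every image each epoch.
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=0 --master_addr=29.79.8.45 --master_port 9999 \
main_mar.py \
--img_size 256 --vae_path pretrained_models/vae/kl16.ckpt --vae_embed_dim 16 --vae_stride 16 --patch_size 1 \
--model mar_large --diffloss_d 3 --diffloss_w 1024 \
--epochs 400 --warmup_epochs 100 --batch_size 64 --lr 8.0e-4 --diffusion_batch_mul 4 \
--use_cached --cached_path ${CACHED_PATH}
```

The caching step only needs to run once; subsequent training runs read the precomputed latents from `${CACHED_PATH}` instead of encoding every image through the VAE at every step.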
