Hi,
Thank you very much for your brilliant work on Adan!
According to Figure 1 in your paper, Adan should reach a lower loss (both train and test) than AdamW. However, I got a higher training loss with Adan than with AdamW on ViT-H:
| Steps | AdamW train loss | Adan train loss |
| ----- | ---------------- | --------------- |
| 200   | 6.9077           | 6.9077          |
| 400   | 6.9074           | 6.9075          |
| 600   | 6.9068           | 6.9073          |
| 800   | 6.9061           | 6.9070          |
| 1000  | 6.9050           | 6.9064          |
| 1200  | 6.9036           | 6.9056          |
| 1400  | 6.9014           | 6.9044          |
| 1600  | 6.8990           | 6.9028          |
| 1800  | 6.8953           | 6.9003          |
| 2000  | 6.8911           | 6.8971          |
| 2200  | 6.8848           | 6.8929          |
| 2400  | 6.8789           | 6.8893          |
| 2600  | 6.8699           | 6.8843          |
| 2800  | 6.8626           | 6.8805          |
| 3000  | 6.8528           | 6.8744          |
| 3200  | 6.8402           | 6.8680          |
| 3400  | 6.8293           | 6.8620          |
| 3600  | 6.8172           | 6.8547          |
| 3800  | 6.7989           | 6.8465          |
| 4000  | 6.7913           | 6.8405          |
I used the same HPs as for AdamW and only changed the betas from (0.9, 0.999) to (0.9, 0.92, 0.999).
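Concretely, the swap looks like this (a minimal sketch; the `Linear` model, `lr`, and `wd` are placeholders for my actual fine-tuning setup, and I assume the `Adan` class shipped in this repo):

```python
import torch
from adan import Adan  # optimizer class from this repo

model = torch.nn.Linear(8, 8)  # stand-in for my ViT-H model
lr, wd = 1e-3, 0.05            # placeholders; same values I used for AdamW

# Before: torch.optim.AdamW(model.parameters(), lr=lr,
#                           betas=(0.9, 0.999), weight_decay=wd)
# After: identical HPs, with betas extended to Adan's three-moment form
optimizer = Adan(model.parameters(), lr=lr,
                 betas=(0.9, 0.92, 0.999), weight_decay=wd)
```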
I only trained for a few steps to see the trend, but the loss gap to AdamW already seems quite large. Should I change other HPs to make better use of Adan? How can I get a lower loss than with AdamW?
I noticed that Adan prefers a large batch size on vision tasks; should we use a larger batch size?
Or should I just train for more steps to see the trend?
Thank you!
@haihai-00 Hi,
I suggest referring to the HPs we use for ViT-B and ViT-S. At the least, you may try the default betas (0.98, 0.92, 0.99) and set the weight decay to 0.02.
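For instance (a minimal sketch; the `Linear` model and `lr` stand in for your own model and learning rate):

```python
import torch
from adan import Adan

model = torch.nn.Linear(8, 8)  # stand-in for your ViT-H model
optimizer = Adan(model.parameters(), lr=1e-3,  # keep your own lr
                 betas=(0.98, 0.92, 0.99),     # default betas
                 weight_decay=0.02)            # suggested wd
```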
To help you make more progress on your task, I will try to train ViT-H myself now. So, please share more details about your HP setting.
Thank you!
We are using the HPs provided in this paper: https://arxiv.org/pdf/2208.06366.pdf, Appendix F, "Hyperparameters for Image Classification Fine-tuning", for ViT-L/16. These HPs are given for ViT-L, but we use them since we did not find officially released HPs for ViT-H.
It seems that you are fine-tuning the model rather than training from scratch, right? Actually, we have provided the results for fine-tuning MAE-ViT-Large here.
Moreover, I also fine-tuned MAE-ViT-H from its official pre-trained model (obtaining 86.9% after 50 epochs). You may add my WeChat: xyxie_joy, and I can send you the log file.