
About the convergence trend comparison with AdamW in ViT-H #16

Open
haihai-00 opened this issue Oct 27, 2022 · 3 comments

Comments


haihai-00 commented Oct 27, 2022

Hi,
Thank you very much for your brilliant work on Adan!
From your paper, Figure 1 suggests that Adan should reach a lower loss (both train and test) than AdamW. However, I got a higher training loss with Adan than with AdamW on ViT-H:

| Steps | AdamW train loss | Adan train loss |
|-------|------------------|-----------------|
| 200   | 6.9077 | 6.9077 |
| 400   | 6.9074 | 6.9075 |
| 600   | 6.9068 | 6.9073 |
| 800   | 6.9061 | 6.907  |
| 1000  | 6.905  | 6.9064 |
| 1200  | 6.9036 | 6.9056 |
| 1400  | 6.9014 | 6.9044 |
| 1600  | 6.899  | 6.9028 |
| 1800  | 6.8953 | 6.9003 |
| 2000  | 6.8911 | 6.8971 |
| 2200  | 6.8848 | 6.8929 |
| 2400  | 6.8789 | 6.8893 |
| 2600  | 6.8699 | 6.8843 |
| 2800  | 6.8626 | 6.8805 |
| 3000  | 6.8528 | 6.8744 |
| 3200  | 6.8402 | 6.868  |
| 3400  | 6.8293 | 6.862  |
| 3600  | 6.8172 | 6.8547 |
| 3800  | 6.7989 | 6.8465 |
| 4000  | 6.7913 | 6.8405 |

I used the same HPs as for AdamW and only changed the betas from (0.9, 0.999) to (0.9, 0.92, 0.999).
I only trained for a few steps to see the trend, but the loss gap to AdamW already seems quite large. Should I change other HPs to make better use of Adan? How can I get a lower loss than AdamW?
I noticed that Adan prefers a large batch size in vision tasks; should we use a larger batch size?
Or should I train for more steps to see the trend?
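
For concreteness, here is a minimal sketch of the swap I made (assuming the Adan class in this repository exposes a PyTorch-style constructor with `betas` and `weight_decay` arguments; the import path, learning rate, and weight decay below are placeholders, not my actual values):

```python
import torch
from adan import Adan  # assumed import path for the Adan optimizer in this repo

model = torch.nn.Linear(768, 1000)  # stand-in module; the real model is ViT-H

# Baseline: AdamW with the betas from the ViT-L fine-tuning recipe
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3,
                          betas=(0.9, 0.999), weight_decay=0.05)

# Swap: identical HPs, only the betas extended to a 3-tuple for Adan
adan = Adan(model.parameters(), lr=1e-3,
            betas=(0.9, 0.92, 0.999), weight_decay=0.05)
```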
Thank you!

@XingyuXie
Collaborator

@haihai-00 Hi,
I suggest referring to the HPs we use for ViT-B and ViT-S. At a minimum, you could try the default betas (0.98, 0.92, 0.99) and set wd to 0.02.
To help make progress on your task, I will try training ViT-H myself now, so please provide more details about the HPs in your setting.
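
As a minimal sketch of this suggestion (same assumptions as above about the import path and the PyTorch-style constructor; the lr is a placeholder):

```python
import torch
from adan import Adan  # assumed import path, as above

model = torch.nn.Linear(768, 1000)  # placeholder module

# Suggested starting point: default betas (0.98, 0.92, 0.99) and wd = 0.02
optimizer = Adan(model.parameters(), lr=1e-3,
                 betas=(0.98, 0.92, 0.99), weight_decay=0.02)
```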

@haihai-00
Author

Thank you!
We are using the HPs provided in this paper: https://arxiv.org/pdf/2208.06366.pdf, Section F (Hyperparameters for Image Classification Fine-tuning) for ViT-L/16. These HPs are given for ViT-L, but we used them because we could not find officially released HPs for ViT-H.

@XingyuXie
Collaborator

It seems that you are fine-tuning the model rather than training from scratch, right? Actually, we have provided the results for fine-tuning MAE-ViT-Large here.

Moreover, I have also fine-tuned MAE-ViT-H from its official pre-trained model (obtaining 86.9% after 50 epochs). You may add me on WeChat (xyxie_joy), and I can send you the log file.
